Remove internal-to-Word bookmarks
Internal Word bookmarks are currently extracted as links (<a> elements), but they serve no useful purpose and they break the import into Wax: specifically, they prevent Wax from loading at all. Instead of passing them through, we should eliminate them altogether. My vote would be to not extract them in the first place, rather than extracting them and then scrubbing them out in TypeScript. I can't see that they do anything useful in the HTML, so it would be good not to clutter the clean HTML with them.
Bookmarks begin with a w:bookmarkStart tag and end with a w:bookmarkEnd tag. Everything between these two tags should be deleted.
This is an example of the XML from the docx:
<w:p w14:paraId="39355185" w14:textId="77777777" w:rsidR="001B3A68" w:rsidRDefault="00674026">
<w:pPr><w:pStyle w:val="ChapterTitles"/><w:suppressAutoHyphens/>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/><w:b/><w:bCs/></w:rPr>
</w:pPr><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica"/><w:b/><w:bCs/></w:rPr>
<w:t xml:space="preserve">CHAPTER 1
</w:t>
</w:r>
</w:p>
It is initially extracted like this, and remains like this all the way through the editoria basic step:
<h3 class="ChapterTitles" style="font-family: Arial Unicode MS; font-size: 12pt; margin-bottom: 6pt; text-align: center; text-decoration: underline">
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
<b>CHAPTER 1 </b>
</h3>
Then, at the editoria reduce step, it's transformed to this:
<h3><a id="docx-bookmark_0"/><a href="#docx-bookmark_0"/>CHAPTER 1</h3>
I recommend that the initial extraction completely drop these bookmarks, so that the result is the same as if the source XML had been:
<w:p w14:paraId="39355185" w14:textId="77777777" w:rsidR="001B3A68" w:rsidRDefault="00674026">
<w:pPr><w:pStyle w:val="ChapterTitles"/><w:suppressAutoHyphens/>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica"/><w:b/><w:bCs/></w:rPr>
<w:t xml:space="preserve">CHAPTER 1
</w:t>
</w:r>
</w:p>
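To make the proposal concrete, here is a minimal sketch in Python of dropping the markers before any HTML is produced. It is not the project's extraction code: the function name strip_bookmark_markers and the trimmed-down SAMPLE paragraph are made up for illustration, and it assumes the standard WordprocessingML namespace for the w: prefix. It removes only the w:bookmarkStart and w:bookmarkEnd elements themselves, which is all there is to remove in the _GoBack case above; it leaves alone any runs that a wider bookmark might enclose.

```python
import xml.etree.ElementTree as ET

# Assumption: the usual WordprocessingML namespace bound to the w: prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
BOOKMARK_TAGS = {f"{{{W_NS}}}bookmarkStart", f"{{{W_NS}}}bookmarkEnd"}

# Hypothetical, shortened version of the paragraph shown in this issue.
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
<w:pPr><w:pStyle w:val="ChapterTitles"/></w:pPr><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
<w:r><w:t xml:space="preserve">CHAPTER 1</w:t></w:r>
</w:p>"""


def strip_bookmark_markers(root: ET.Element) -> int:
    """Remove every w:bookmarkStart / w:bookmarkEnd under root, in place.

    Returns the number of marker elements removed. Only the markers are
    dropped; runs and other content are left untouched.
    """
    removed = 0
    # Copy both sequences to lists so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in BOOKMARK_TAGS:
                # Note: remove() also discards the child's tail text, which
                # is only inter-element whitespace in this sample.
                parent.remove(child)
                removed += 1
    return removed


if __name__ == "__main__":
    ET.register_namespace("w", W_NS)  # keep the w: prefix when serialising
    paragraph = ET.fromstring(SAMPLE)
    count = strip_bookmark_markers(paragraph)
    print(count, "bookmark markers removed")
    print(ET.tostring(paragraph, encoding="unicode"))
```

Run against the sample paragraph, this removes the two marker elements and leaves the CHAPTER 1 run as it was, so there would be nothing left for later steps to turn into <a> elements.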