Reconcile font size from Word style and inline font style
It seems there needs to be an additional check in the scrub step to ensure that any inline font size specifications (perhaps other formatting too?) being removed as redundant match the formatting on the paragraph. Currently, it looks like there's only a check that all the inline specifications are consistent among themselves, and named Word styles can mess that up.
Here's an illustrative example. In Bakker Ch 1, there is a heading that reads "The heroic migrant and the end of migration". In the docx, this is 12pt font, just like the rest of the file, but this ends up as 11pt font in the HTML.
This text is Word style "Default". In the styles.xml sheet, "Default" text is 11pt font:
<w:style w:type="paragraph" w:customStyle="1" w:styleId="Default">
<w:name w:val="Default"/>
<w:rPr>
<w:rFonts w:ascii="Helvetica" w:hAnsi="Arial Unicode MS" w:cs="Arial Unicode MS"/>
<w:color w:val="000000"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
</w:rPr>
</w:style>
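A note on reading these values: WordprocessingML stores `w:sz` in half-points, so `22` here means 11pt and the `24` in the paragraph XML further down means 12pt. A tiny sketch of the conversion (the helper name `half_points_to_pt` is just for illustration, not something from the pipeline):

```python
def half_points_to_pt(sz_val: str) -> float:
    """Convert a w:sz / w:szCs value (half-points) to points."""
    return int(sz_val) / 2


# The two values that appear in this ticket
style_size = half_points_to_pt("22")   # size from the "Default" style
inline_size = half_points_to_pt("24")  # size set on the paragraph and runs
```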
In the docx, this text was styled "Default" but then changed to font size 12. In the docx xml, font size 12 is specified at the paragraph level, and on the runs themselves (I've included the Word xml at the end of this ticket).
In the initial HTML extraction, the paragraph specifies font size 11 (which comes from the style), but everything in the p specifies font size 12 inline:
<p class="Default" style="font-family: Helvetica; font-size: 11pt; margin-bottom: 6pt">
<span style="font-size: 12pt">
<i><iCs>The heroic migrant and the end of migration</iCs></i>
</span>
<span style="font-size: 12pt">
<tab/>
</span>
</p>
The inline 12pt specification gets stripped from the whitespace-only element after the join step:
<p class="Default" style="font-family: Helvetica; font-size: 11pt; margin-bottom: 6pt">
<span style="font-size: 12pt">
<i>The heroic migrant and the end of migration</i>
</span>
<tab/>
</p>
With the collapsed step, the 12pt inline specification disappears, leaving only the 11pt font specified on the p. That's where the error is introduced:
<p class="Default" style="font-family: Helvetica; font-size: 11pt; margin-bottom: 6pt">
<i>The heroic migrant and the end of migration</i>
<tab/>
</p>
I think the inline font size is removed because all the inline styling in the p is consistent. But what's missing is a step to check that the inline formatting matches the p-level formatting.
Is that something that you could add to the scrub step?
- if inline formatting is about to be removed because it can be handled at the p level, confirm that the inline formatting matches the p-level formatting
- if there are conflicts, the inline formatting should take precedence, and the p formatting should be updated to reflect the inline formatting.
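To make the request concrete, here is a rough sketch of the check in Python, using only the standard-library `xml.etree.ElementTree`. All the function names (`parse_style`, `format_style`, `has_loose_text`, `reconcile_font_size`) are made up for illustration and are not taken from the actual scrub step. It assumes the p is well-formed XML like the snippets above, looks only at `font-size`, and considers only spans that are direct children of the p:

```python
import xml.etree.ElementTree as ET


def parse_style(style):
    """Parse an inline style string like 'a: b; c: d' into a dict."""
    props = {}
    for decl in (style or "").split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip()] = value.strip()
    return props


def format_style(props):
    """Serialize a dict of properties back into an inline style string."""
    return "; ".join(f"{name}: {value}" for name, value in props.items())


def has_loose_text(p):
    """True if the p holds visible text that is not wrapped in a child span."""
    if (p.text or "").strip():
        return True
    for child in p:
        if (child.tail or "").strip():
            return True
        if child.tag != "span" and "".join(child.itertext()).strip():
            return True
    return False


def reconcile_font_size(p):
    """If every text-bearing span agrees on one font-size, move it to the p.

    The inline value wins over whatever the p currently says, so the
    p-level style is overwritten before the inline one is dropped.
    """
    if has_loose_text(p):
        return p  # some text relies on the p-level size; leave everything alone
    spans = [s for s in p.findall("span") if "".join(s.itertext()).strip()]
    sizes = {parse_style(s.get("style")).get("font-size") for s in spans}
    if len(sizes) != 1 or None in sizes:
        return p  # no inline size, or the inline sizes are inconsistent
    p_props = parse_style(p.get("style"))
    p_props["font-size"] = sizes.pop()  # inline formatting takes precedence
    p.set("style", format_style(p_props))
    for s in spans:
        props = parse_style(s.get("style"))
        props.pop("font-size", None)
        if props:
            s.set("style", format_style(props))
        else:
            s.attrib.pop("style", None)  # nothing left to say inline
    return p
```

Run against the post-join p from this ticket, this would leave the p with `font-size: 12pt` and the span with no style attribute, so a later collapse could drop the span without changing the rendered size. The same pattern could probably be extended to other properties, but I have only thought it through for font size.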
For reference, here's the original Word xml:
<w:p w14:paraId="5617B83D" w14:textId="77777777" w:rsidR="001B3A68" w:rsidRDefault="00674026">
<w:pPr>
<w:pStyle w:val="Default"/>
<w:tabs>
<w:tab w:val="left" w:pos="720"/>
</w:tabs>
<w:spacing w:after="120" w:line="480" w:lineRule="auto"/>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:u w:color="000000"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:i/>
<w:iCs/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:u w:color="000000"/>
</w:rPr>
<w:t>The heroic migrant and the end of migration</w:t>
</w:r>
<w:r>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:u w:color="000000"/>
</w:rPr>
<w:tab/>
</w:r>
</w:p>