Word styles with "+ Not [bold/italics/etc.]" extract incorrectly
I'm hopeful that this will fix a lot of the remaining formatting problems.
In Word, named styles can be modified by adding or removing formatting, e.g.:
- "Normal + Bold"
- "Emphasis + Not Italic"
These modifications are applied by specifying a w:val attribute on the formatting tag. A value of 1 turns the formatting on; a value of 0 turns it off. For example, <w:i w:val="0"/> means "No Italics".
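As a rough illustration of the on/off rule above, here is a minimal Python sketch. The function name `toggle_is_on` is hypothetical and is not part of XSweet. It assumes the on/off values that WordprocessingML allows besides 0 and 1 (true/false and on/off), and that a missing w:val means "on".

```python
from typing import Optional

# On/off spellings accepted by WordprocessingML toggle attributes.
_OFF_VALUES = {"0", "false", "off"}
_ON_VALUES = {"1", "true", "on"}


def toggle_is_on(val: Optional[str]) -> bool:
    """Return True when a toggle tag such as <w:i/> or <w:b/> turns formatting on.

    `val` is the tag's w:val attribute, or None when the attribute is absent.
    """
    if val is None:
        return True  # a bare <w:i/> means "on"
    if val in _OFF_VALUES:
        return False
    if val in _ON_VALUES:
        return True
    raise ValueError(f"unexpected w:val: {val!r}")


print(toggle_is_on(None))  # bare <w:i/>
print(toggle_is_on("0"))   # <w:i w:val="0"/>
```

Underline is a different case: <w:u> takes an underline type rather than an on/off value, so "none" would need its own check.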
Adding formatting on top of a style works correctly, but whenever "+ Not [format]" is specified, the result is just the opposite: the formatting is turned on when it should be turned off. The extraction doesn't take the w:val attribute into account, but it should.
Whenever w:val="0" (or w:val="none" for underline) appears in any of the following tags...
- <w:i w:val="0"/>
- <w:b w:val="0"/>
- <w:u w:val="none"/>
...then XSweet should turn that formatting off. The formatting could be specified at the paragraph level or in a span inside a paragraph, so finding it and turning it off might be the trickiest part. Perhaps we could preserve these until all the styling has been promoted to the <p style=>, and then check for the formatting to turn off. Or is there some tag or way to turn off inline formatting regardless of whether it's already applied? If so, that might be simplest.
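One possible answer to the last question is to emit an explicit CSS reset (font-style: normal, font-weight: normal) instead of omitting the property, since that applies whether or not the formatting is already in effect. The Python sketch below only illustrates that mapping. It is not how XSweet is implemented, and the names `CSS_FOR_TOGGLE` and `css_for_run_properties` are hypothetical. Underline is left out of the sketch.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Toggle tag -> (CSS property, value when on, value when off)
CSS_FOR_TOGGLE = {
    "i": ("font-style", "italic", "normal"),
    "b": ("font-weight", "bold", "normal"),
}


def css_for_run_properties(rpr):
    """Build an inline style string from the w:i and w:b children of a w:rPr."""
    declarations = []
    for name, (prop, on_value, off_value) in CSS_FOR_TOGGLE.items():
        el = rpr.find(f"{{{W}}}{name}")
        if el is None:
            continue  # not specified here, so the named style decides
        val = el.get(f"{{{W}}}val")
        is_off = val in ("0", "false", "off")
        # Emit an explicit reset when the toggle is off.
        declarations.append(f"{prop}: {off_value if is_off else on_value}")
    return "; ".join(declarations)


sample = ET.fromstring(
    f'<w:rPr xmlns:w="{W}"><w:rStyle w:val="Emphasis"/><w:i w:val="0"/></w:rPr>'
)
print(css_for_run_properties(sample))  # font-style: normal
```

Where the reset lands probably matters. A rule that targets class="Emphasis" directly beats a value inherited from a wrapping span, so the reset would likely need to sit on the element that carries the class, or on one inside it.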
Here are 2 examples from Braggs Ch 1:
Example 1
This paragraph is erroneously extracted as italic. In the Word doc, there's no visible change of formatting between this paragraph and the previous one, but in fact the named Word style changes from "Normal (Web) + Auto" to "Emphasis + Not Italics":
<w:p w14:paraId="3E1B900F" w14:textId="6A38EA12" w:rsidR="00CB4328" w:rsidRPr="004F60F8" w:rsidRDefault="00CB4328" w:rsidP="0036427E">
<w:pPr>
<w:spacing w:line="480" w:lineRule="auto"/>
<w:jc w:val="both"/>
</w:pPr>
<w:r w:rsidRPr="004F60F8">
<w:rPr>
<w:rStyle w:val="Emphasis"/>
<w:i w:val="0"/>
</w:rPr>
<w:t>Throughout his works, Césaire</w:t>
</w:r>
<w:r w:rsidR="004A368E" w:rsidRPr="004F60F8">
<w:rPr>
<w:rStyle w:val="Emphasis"/>
<w:i w:val="0"/>
</w:rPr>
<w:t xml:space="preserve">claimed</w:t>
</w:r>
In the above, <w:rStyle w:val="Emphasis"/> specifies italics (in the CSS extracted to the top of the HTML). Then the <w:i w:val="0"/> (not italics) should turn the italics back off, but it doesn't:
<p style="text-align: justify">
<span style="font-style: italic">
<span class="Emphasis">Throughout his works, Césaire</span>
</span>
<span style="font-style: italic">
<span class="Emphasis"> claimed</span>
</span>
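To help locate the affected runs when checking a fix, a small script along these lines could list every run in a paragraph whose run properties switch a given toggle off. This is a sketch only: `runs_with_disabled` is a hypothetical helper, and the sample is a trimmed, closed-off copy of the Example 1 Word XML with the rsid attributes and paragraph properties removed.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
OFF = {"0", "false", "off"}


def runs_with_disabled(paragraph, toggle):
    """Return the text of each w:r whose w:rPr switches `toggle` (e.g. "i") off."""
    texts = []
    for run in paragraph.findall("w:r", NS):
        el = run.find(f"w:rPr/w:{toggle}", NS)
        if el is not None and el.get(f"{{{W}}}val") in OFF:
            texts.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return texts


paragraph = ET.fromstring(f"""
<w:p xmlns:w="{W}">
  <w:r>
    <w:rPr><w:rStyle w:val="Emphasis"/><w:i w:val="0"/></w:rPr>
    <w:t>Throughout his works, Césaire</w:t>
  </w:r>
  <w:r>
    <w:rPr><w:rStyle w:val="Emphasis"/><w:i w:val="0"/></w:rPr>
    <w:t xml:space="preserve">claimed</w:t>
  </w:r>
</w:p>
""")

# Both runs carry <w:i w:val="0"/>, so neither should come out italic.
print(runs_with_disabled(paragraph, "i"))
```

Calling it with "b" on the same paragraph returns an empty list, because neither run mentions bold.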
Example 2
Footnote 15 comes through in bold in the HTML but not in Word. Its Word style is "Heading1 + Times New Roman, 10pt, Not Bold". So the paragraph inherits bolding from its "Heading1" style. <w:b w:val="0"/> should then remove the bold formatting, but it doesn't.
Word XML:
<w:endnote w:id="15">
<w:p w14:paraId="7FAB95D6" w14:textId="2E2795A2" w:rsidR="0005708E" w:rsidRPr="008B0C6C" w:rsidRDefault="0005708E" w:rsidP="00EF6AD7">
<w:pPr>
<w:pStyle w:val="Heading1"/>
<w:spacing w:before="0" w:beforeAutospacing="0" w:after="0" w:afterAutospacing="0"/>
<w:jc w:val="both"/>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="008B0C6C">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
<w:t xml:space="preserve">Bryan Wagner introduces an insightful analysis</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
<w:t xml:space="preserve">of </w:t>
</w:r>
And the HTML, incorrectly bolded:
<div class="docx-endnote" id="en15">
<p class="Heading1" style="margin-top: 0pt; margin-bottom: 0pt; xsweet-outline-level: 0; font-family: Times; font-weight: bold; font-size: 24pt; text-align: justify">
<span style="font-size: 10pt; font-family: Times New Roman">
...
<span style="font-family: Times New Roman; font-size: 10pt">
<lang>Bryan Wagner introduces an insightful analysis </lang>
</span>
<span style="font-family: Times New Roman; font-size: 10pt">
<lang>of </lang>
</span>