XSweet issues
https://gitlab.coko.foundation/XSweet/XSweet/-/issues
2018-04-24T05:21:50Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/105
Incorrect fonts in html - coming from w:rFonts attributes?
2018-04-24T05:21:50Z
Alex Theg

Brinton Ch 8 has some incorrect fonts coming through into the html. The following text is all Times in Word:
>Even though ‘ulama’ like Qaradawi assume that images...
However, it comes through in the rinsed html in 3 different fonts: Times, Menlo Regular, and Helvetica. It looks like it has to do with the `w:rFonts` attributes: `w:cs`, `w:eastAsia`, `w:ascii` and `w:hAnsi`. These specify the font to use for certain character types.
The word "Qaradawi" is extracted as Helvetica:
```xml
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica"/>
</w:rPr>
<w:t>Qaradawi</w:t>
</w:r>
```
And " assume that " is extracted as Menlo Regular:
```xml
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Menlo Regular"/>
</w:rPr>
<w:t xml:space="preserve"> assume that</w:t>
</w:r>
```
The html doesn't specify different fonts for different character types in the same way. How does XSweet handle these `w:rFonts` attributes? Since this displays in the original Word as all Times, I am guessing that Word doesn't consider any of the characters in these runs to be of the type `w:eastAsia` or `w:cs`, but I'm not sure how it decides what kind of character it's looking at. Do you have a better idea what's going on here?
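As a rough illustration of the guess above: the OOXML spec describes `w:rFonts` as four font slots, with the slot for each character chosen by its Unicode range. The sketch below is a simplified model of that idea, not XSweet's or Word's actual logic. The range boundaries, the `font_slot` and `fonts_used` helpers, and the inherited font of `Times` are all assumptions for illustration; the real rules also involve `w:hint` and run-level `w:cs`/`w:rtl` flags, which are ignored here.

```python
def font_slot(ch: str) -> str:
    """Guess which w:rFonts slot applies to one character (simplified)."""
    cp = ord(ch)
    if cp <= 0x7F:
        return "ascii"  # Basic Latin
    # Assumption: treat the Hebrew-through-Thaana blocks as complex script.
    if 0x0590 <= cp <= 0x07BF:
        return "cs"
    # Assumption: a few CJK-related blocks stand in for "East Asian".
    if 0x2E80 <= cp <= 0x9FFF or 0xAC00 <= cp <= 0xD7AF or 0xF900 <= cp <= 0xFAFF:
        return "eastAsia"
    return "hAnsi"  # everything else


def fonts_used(text: str, r_fonts: dict, inherited: str) -> set:
    """Return the fonts a run would use, falling back to the inherited
    font when the slot a character needs is not set on the run."""
    return {r_fonts.get(font_slot(ch), inherited) for ch in text}


# Runs from this issue; "Times" as the inherited font is an assumption.
print(fonts_used("Qaradawi", {"eastAsia": "Helvetica"}, "Times"))
print(fonts_used(" assume that", {"eastAsia": "Helvetica", "cs": "Menlo Regular"}, "Times"))
```

Under this model, neither run contains a character that would select the `eastAsia` or `cs` slot, so both fall back to the inherited font, which would match what Word displays.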
Here's the full XML:
```xml
<w:p w14:paraId="3E8B35BD" w14:textId="77777777" w:rsidR="00DE7EE7" w:rsidRPr="009337E2" w:rsidRDefault="00DE7EE7" w:rsidP="00DE7EE7">
<w:pPr><w:widowControl w:val="0"/>
<w:tabs><w:tab w:val="left" w:pos="560"/><w:tab w:val="left" w:pos="1120"/><w:tab w:val="left" w:pos="1680"/><w:tab w:val="left" w:pos="2240"/><w:tab w:val="left" w:pos="2800"/><w:tab w:val="left" w:pos="3360"/><w:tab w:val="left" w:pos="3920"/><w:tab w:val="left" w:pos="4480"/><w:tab w:val="left" w:pos="5040"/><w:tab w:val="left" w:pos="5600"/><w:tab w:val="left" w:pos="6160"/><w:tab w:val="left" w:pos="6720"/></w:tabs><w:autoSpaceDE w:val="0"/><w:autoSpaceDN w:val="0"/><w:adjustRightInd w:val="0"/><w:spacing w:line="480" w:lineRule="auto"/>
<w:rPr><w:rFonts w:cs="Times"/></w:rPr>
</w:pPr>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Times New Roman"/>
<w:color w:val="000000"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:tab/>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve">Even though</w:t>
</w:r>
<w:r w:rsidR="00BA3E1D">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t>‘ulama’</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve"> like </w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica"/>
</w:rPr>
<w:t>Qaradawi</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Menlo Regular"/>
</w:rPr>
<w:t xml:space="preserve"> assume that</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve"> images of certain objects </w:t>
</w:r>
```
Here's how it's extracted:
```html
<p>
<span style="font-family: Times New Roman"><tab/></span>
<span style="font-family: Times">Even though </span>
<span style="font-family: Times">‘ulama’</span>
<span style="font-family: Times"> like </span>
<span style="font-family: Helvetica">Qaradawi</span>
<span style="font-family: Menlo Regular"> assume that</span>
<span style="font-family: Times"> images of certain objects </span>
```
And here's the final html:
```html
<p><span class="tab"><!-- tab --></span>
<span style="font-family: Times">Even though ‘ulama’ like </span>
<span style="font-family: Helvetica">Qaradawi</span>
<span style="font-family: Menlo Regular"> assume that</span>
<span style="font-family: Times"> images of certain objects
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/78
Buchbinder Ch 2 headings don't get promoted
2018-03-14T20:41:17Z
Alex Theg

Compare Chapters 1 and 2 of Buchbinder
[c_Buchbinder_Chap_1.docx](/uploads/bca82798d963a32c026fecfcd868aa76/c_Buchbinder_Chap_1.docx)
[output_Buchbinder_Chap_1.zip](/uploads/c4d1c813323a451c901734e12f6649af/output_Buchbinder_Chap_1.zip)
[c_Buchbinder_Chap_2.docx](/uploads/623a4ca2c779cd0571cac9bd4b0507fa/c_Buchbinder_Chap_2.docx)
[output_Buchbinder_Chap_2.zip](/uploads/e0e37f742260d239b6e89a7129733a90/output_Buchbinder_Chap_2.zip)
Header promotion works well in Ch 1, but in Ch 2 it misses every heading below the very top level. In Ch 1, everything that gets promoted is a genuine header.
In Ch 2, the Chapter title and subtitle get caught as h1s, but none of the other headings get promoted. There are 25 h2 promotions, but they are all either entirely empty, filled only with tabs, or a visual divider made of asterisks.
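To count these false positives in other chapters, a small script along the following lines might help. It is only a sketch for triage, not part of XSweet: the `HeadingAudit` class and `blank_headings` function are hypothetical names, and it assumes tabs appear in the output as markup or whitespace rather than as visible text.

```python
from html.parser import HTMLParser


class HeadingAudit(HTMLParser):
    """Collect the text content of every heading with a given tag name."""

    def __init__(self, tag="h2"):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.inside = False
            self.headings.append("".join(self.buffer))

    def handle_data(self, data):
        # Comments are not passed to handle_data, so they are skipped.
        if self.inside:
            self.buffer.append(data)


def blank_headings(html, tag="h2"):
    """Return one flag per heading: True if it holds only whitespace,
    non-breaking spaces, or asterisks."""
    parser = HeadingAudit(tag)
    parser.feed(html)
    parser.close()
    return [text.strip(" \t\r\n\u00a0*") == "" for text in parser.headings]


sample = (
    "<h2></h2>"
    '<h2><span class="tab"><!-- tab --></span></h2>'
    "<h2>* * *</h2>"
    "<h2>Methods</h2>"
)
print(blank_headings(sample))
```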
Any ideas as to the differences between chapters 1 and 2 that cause 1 to work well but 2 to be less accurate? It doesn't appear to be Word styles.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/154
Extract math from Word
2021-01-21T11:56:02Z
Alex Theg

It looks like there are 2 main ways of embedding math into .docx files (other than plain text):
1. Using the built-in equation editor. This uses a tagged XML structure with no binaries; it's all inline:
```xml
<m:oMathPara>
<m:oMath>
```
2. MathType, the most common math add-on for Word, which uses math binaries that need to be extracted.
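To give a sense of what a mapping for the first option could involve, here is a minimal sketch that converts a tiny subset of the equation editor's markup (runs, fractions, superscripts) to MathML, assuming MathML is the target. It is an illustration only, not a proposed XSweet implementation: the function names are hypothetical, the token classification is naive, and most of the markup (matrices, radicals, n-ary operators, properties) is not handled.

```python
import xml.etree.ElementTree as ET

OMML = "http://schemas.openxmlformats.org/officeDocument/2006/math"
MATHML = "http://www.w3.org/1998/Math/MathML"


def q(local):
    """Qualified tag name in the Office math namespace."""
    return f"{{{OMML}}}{local}"


def token(text):
    """Naively wrap text as a number, identifier, or operator."""
    if text.replace(".", "", 1).isdigit():
        tag = "mn"
    elif text.isalpha():
        tag = "mi"
    else:
        tag = "mo"
    el = ET.Element(tag)
    el.text = text
    return el


def row(parent):
    """Convert the children of an argument element into one <mrow>."""
    mrow = ET.Element("mrow")
    for child in (parent if parent is not None else []):
        mrow.extend(convert(child))
    return mrow


def convert(node):
    """Return a list of MathML elements for one math node."""
    if node.tag == q("r"):  # text run
        text = "".join(t.text or "" for t in node.findall(q("t")))
        return [token(text)] if text else []
    if node.tag == q("f"):  # fraction: numerator, denominator
        frac = ET.Element("mfrac")
        frac.append(row(node.find(q("num"))))
        frac.append(row(node.find(q("den"))))
        return [frac]
    if node.tag == q("sSup"):  # superscript: base, exponent
        sup = ET.Element("msup")
        sup.append(row(node.find(q("e"))))
        sup.append(row(node.find(q("sup"))))
        return [sup]
    # Unhandled element: recurse so nested runs are not silently lost.
    out = []
    for child in node:
        out.extend(convert(child))
    return out


def omml_to_mathml(omml_xml):
    root = ET.fromstring(omml_xml)
    math = ET.Element("math", {"xmlns": MATHML})
    for child in root:
        math.extend(convert(child))
    return ET.tostring(math, encoding="unicode")


example = (
    f'<m:oMath xmlns:m="{OMML}">'
    "<m:f><m:num><m:r><m:t>1</m:t></m:r></m:num>"
    "<m:den><m:r><m:t>x</m:t></m:r></m:den></m:f>"
    "</m:oMath>"
)
print(omml_to_mathml(example))
```

Even this small subset shows why a full mapping could be time consuming: each structure needs its own rule, and the arguments of each structure have to be converted recursively.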
In both cases, we should represent the math in MathML (the standard for HTML5). It looks like we will have to define the mapping for the first option ourselves, which could be pretty time consuming. For MathType, we'll need to convert the binaries. @jure has made a Ruby gem that converts from MathType to MathML. We may need to rewrite it before we can use it, but it could be a helpful resource.
Alex Theg