XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2018-07-26T22:13:56Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/106
Handle lists
2018-07-26T22:13:56Z
Alex Theg

Let’s see what we can do with lists!
In Word, there’s no tag that wraps an entire list. Lists are implemented as a sequence of consecutive `<w:p>` tags of style “ListParagraph”:
```xml
<w:p w14:paraId="5D2E8ABB" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>seven</w:t>
</w:r>
</w:p>
```
Here's what makes sense to me for handling lists:
1. look for paragraphs with “ListParagraph” style in the `<w:pPr>` (so this would have to happen before this info gets thrown away)
2. once you find one, keep checking subsequent paragraphs until you find a `<w:p>` without the “ListParagraph” style
3. wrap all the consecutive ListParagraph items in a list tag
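The three steps above can be sketched outside the pipeline. This is a minimal Python illustration, not XSweet code (which would be XSLT); the names `is_list_paragraph`, `group_list_runs`, and `SAMPLE` are hypothetical, and the sample is a trimmed version of the test document's markup.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, trimmed-down body: one plain paragraph, two list items, one empty paragraph.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Bulleted list</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="ListParagraph"/></w:pPr><w:r><w:t>seven</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="ListParagraph"/></w:pPr><w:r><w:t>fourteen</w:t></w:r></w:p>
  <w:p/>
</w:body>"""


def is_list_paragraph(p):
    """True if the paragraph's w:pPr names the ListParagraph style."""
    style = p.find("w:pPr/w:pStyle", NS)
    return style is not None and style.get(f"{{{W}}}val") == "ListParagraph"


def group_list_runs(body):
    """Return the body's children, with each run of consecutive list paragraphs gathered into a Python list."""
    groups = []
    current = None
    for child in body:
        if child.tag == f"{{{W}}}p" and is_list_paragraph(child):
            if current is None:
                current = []          # step 1: first list paragraph opens a group
                groups.append(current)
            current.append(child)     # step 2: keep collecting while the style matches
        else:
            current = None            # a non-list element ends the group
            groups.append(child)
    return groups


body = ET.fromstring(SAMPLE)
for item in group_list_runs(body):
    if isinstance(item, list):
        # step 3: this is where a wrapping list element would be emitted
        print("list:", ["".join(p.itertext()).strip() for p in item])
    else:
        print("other:", "".join(item.itertext()).strip())
```

Each gathered group corresponds to one wrapper element in the output; nesting by `ilvl` would be a further pass over each group.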
In the `<w:numPr>` tag, there is definitely one, and maybe two, child elements that we’d care about: ilvl and numId.
ilvl indicates how deeply the list item is nested. Top-level items have an ilvl of 0, a sub-list item has an ilvl of 1, a list item nested under that would have an ilvl of 2, etc:
* Top level list item // ilvl=0
  * Nested list item // ilvl=1
These define the structure of the list, and should ideally be reflected in the html.
The numId element refers to the numbering definition for any given list, contained in the numbering.xml file. Each time a list is made, its style info is added to the “numbering.xml” file under the same numId reference. That means there’s no indication of how a list is styled in the document.xml itself. To get the list style info (numbered, bulleted, undecorated, etc.), XSweet would have to follow the reference into the numbering.xml file. Is that possible, or too hard to do?
* If it is possible, the html would more faithfully reflect the Word doc, and Typescript could port lists into Editoria as the correct type (numbered, unnumbered, or bulleted).
* If it’s too difficult to start with this, though, then numId could probably just be dropped.
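For a sense of what following the reference involves, here is a hedged Python sketch. In numbering.xml, a `w:num` with a given `w:numId` points to a `w:abstractNum` through `w:abstractNumId`, and each `w:lvl` of that definition carries a `w:numFmt`. The `NUMBERING` sample and the `list_format` helper are hypothetical, and the sketch ignores `w:lvlOverride` and style-linked numbering.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"

# Hypothetical, trimmed-down numbering.xml; real files carry many more properties.
NUMBERING = f"""<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="bullet"/></w:lvl>
  </w:abstractNum>
  <w:abstractNum w:abstractNumId="1">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl>
    <w:lvl w:ilvl="1"><w:numFmt w:val="lowerLetter"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
  <w:num w:numId="3"><w:abstractNumId w:val="1"/></w:num>
</w:numbering>"""


def list_format(numbering, num_id, ilvl):
    """Follow numId -> abstractNumId -> lvl and return the w:numFmt value, or None."""
    for num in numbering.findall("w:num", NS):
        if num.get(f"{{{W}}}numId") != str(num_id):
            continue
        abstract_ref = num.find("w:abstractNumId", NS)
        if abstract_ref is None:
            return None
        for abstract in numbering.findall("w:abstractNum", NS):
            if abstract.get(f"{{{W}}}abstractNumId") != abstract_ref.get(VAL):
                continue
            for lvl in abstract.findall("w:lvl", NS):
                if lvl.get(f"{{{W}}}ilvl") == str(ilvl):
                    fmt = lvl.find("w:numFmt", NS)
                    return fmt.get(VAL) if fmt is not None else None
    return None


numbering = ET.fromstring(NUMBERING)
print(list_format(numbering, 1, 0))
print(list_format(numbering, 3, 1))
```

A value of `bullet` would suggest an unordered list; most other values (`decimal`, `lowerLetter`, and so on) suggest an ordered one.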
It seems like this should be its own xsl sheet, early in the pipeline. Really curious to hear what you think!
Here’s a test Word doc, along with the xml.
[list_test.docx](/uploads/fd5b331e9d5f75e71d5630edb2cb3cd8/list_test.docx)
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:mv="urn:schemas-microsoft-com:mac:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w14:paraId="3BB77D17" w14:textId="77777777" w:rsidR="00B82E58" w:rsidRDefault="00D02C23">
<w:r w:rsidRPr="00D02C23">
<w:t>Although states and other interested parties sponsor and partially control the content of religious messages, individuals do react variously to those messages. Viewers interpret and appropriate words and images in particular ways, and according
to their own situations, since shared television reception is also an individualized experience.</w:t>
</w:r>
</w:p><w:p w14:paraId="7E7F6913" w14:textId="77777777" w:rsidR="00260CBA" w:rsidRDefault="00260CBA" w:rsidP="007D2792"/>
<w:p w14:paraId="77D8D9E3" w14:textId="0BC2F129" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Here are some example lists</w:t>
</w:r>
<w:r w:rsidR="00DB59FD">
<w:t>:</w:t>
</w:r>
</w:p><w:p w14:paraId="552BC6C9" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23"/>
<w:p w14:paraId="0B75E776" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Bulleted list</w:t>
</w:r>
</w:p>
<!-- bullet -->
<w:p w14:paraId="5D2E8ABB" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>seven</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5D8D1F85" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>fourteen</w:t>
</w:r>
</w:p>
<w:p w14:paraId="09FF0743" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>three</w:t>
</w:r>
</w:p><w:p w14:paraId="17C655EE" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23"/>
<!-- multilevel bullet -->
<w:p w14:paraId="2EDB54F8" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Multilevel</w:t>
</w:r>
</w:p>
<w:p w14:paraId="2F5768D7" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point A</w:t>
</w:r>
</w:p>
<w:p w14:paraId="475DC32B" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point B</w:t>
</w:r>
</w:p>
<w:p w14:paraId="731511E6" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point B.1</w:t>
</w:r>
</w:p>
<w:p w14:paraId="0A5C9720" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point B.2</w:t>
</w:r>
</w:p>
<w:p w14:paraId="1D1E4FC3" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point C</w:t>
</w:r>
</w:p><w:p w14:paraId="2EE6BAF7" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23"/>
<!-- numbered -->
<w:p w14:paraId="649F6860" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Numbered list</w:t>
</w:r>
</w:p>
<w:p w14:paraId="49E83B01" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Stop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="6D2C65FC" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Drop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5B515F78" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Roll</w:t>
</w:r>
</w:p><w:p w14:paraId="3AD4CE3C" w14:textId="77777777" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3"/>
<!-- multilevel numbered -->
<w:p w14:paraId="581E2232" w14:textId="4791D278" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:r>
<w:t xml:space="preserve">Multilevel numbered
</w:t>
</w:r>
<w:r>
<w:t>list</w:t>
</w:r>
</w:p>
<w:p w14:paraId="4450BE7A" w14:textId="77777777" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Stop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="68A823BD" w14:textId="77777777" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Drop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="482394C4" w14:textId="2CE4A683" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>To the ground</w:t>
</w:r>
</w:p>
<w:p w14:paraId="3367B702" w14:textId="250D58D1" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Or the ceiling</w:t>
</w:r>
</w:p>
<w:p w14:paraId="0685DA8E" w14:textId="7F5507ED" w:rsidR="003808A3" w:rsidRPr="00D02C23" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Roll</w:t>
</w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p>
<w:sectPr w:rsidR="003808A3" w:rsidRPr="00D02C23" w:rsidSect="00227C1D"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/><w:docGrid w:linePitch="360"/></w:sectPr>
</w:body>
</w:document>
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/105
Incorrect fonts in html - coming from w:rFonts attributes?
2018-04-24T05:21:50Z
Alex Theg

Brinton Ch 8 has some incorrect fonts coming through into the html. The following text is all Times in Word:
>Even though ‘ulama’ like Qaradawi assume that images...
However, it comes through in the rinsed html in 3 different fonts: Times, Menlo Regular, and Helvetica. It looks like it has to do with the `w:rFonts` attributes: `w:cs`, `w:eastAsia`, `w:ascii` and `w:hAnsi`. These specify the font to use for certain character types.
The word "Qaradawi" is extracted as Helvetica:
```xml
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica"/>
</w:rPr>
<w:t>Qaradawi</w:t>
</w:r>
```
And " assume that " is extracted as Menlo Regular:
```xml
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Menlo Regular"/>
</w:rPr>
<w:t xml:space="preserve"> assume that</w:t>
</w:r>
```
The html doesn't specify different fonts for different character types in the same way. How does XSweet handle these `w:rFonts` attributes? Since this displays in the original Word as all Times, I am guessing that Word doesn't consider any of the characters in these runs to be of the type `w:eastAsia` or `w:cs`, but I'm not sure how it decides what kind of character it's looking at. Do you have a better idea what's going on here?
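To make the question concrete, here is a rough Python sketch of a per-character choice between the four `w:rFonts` slots. It rests on loudly stated assumptions: that `w:ascii` covers U+0000 to U+007F, that the other slots can be approximated by a few Unicode blocks, and that the font inherited for unset slots is Times (as the Word display suggests). The helpers `font_slot` and `fonts_used` are hypothetical, and Word's real algorithm considers more than this.

```python
def font_slot(ch):
    """Very rough guess at which w:rFonts slot applies to one character.

    This is an assumption for illustration, not Word's full algorithm.
    """
    cp = ord(ch)
    if cp <= 0x7F:
        return "ascii"
    if 0x0590 <= cp <= 0x06FF:  # Hebrew and Arabic blocks
        return "cs"
    if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:  # kana and common CJK ideographs
        return "eastAsia"
    return "hAnsi"


def fonts_used(text, r_fonts, inherited):
    """Return the set of fonts a run would use, falling back to the inherited font per slot."""
    return {r_fonts.get(font_slot(ch), inherited) for ch in text}


# The two runs from this issue; "Times" as the inherited font is an assumption.
print(fonts_used("Qaradawi", {"eastAsia": "Helvetica"}, "Times"))
print(fonts_used(" assume that", {"eastAsia": "Helvetica", "cs": "Menlo Regular"}, "Times"))
```

Under these assumptions neither run would use Helvetica or Menlo Regular, since every character falls in the ASCII range and neither run sets `w:ascii`.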
Here's the full XML:
```xml
<w:p w14:paraId="3E8B35BD" w14:textId="77777777" w:rsidR="00DE7EE7" w:rsidRPr="009337E2" w:rsidRDefault="00DE7EE7" w:rsidP="00DE7EE7">
<w:pPr><w:widowControl w:val="0"/>
<w:tabs><w:tab w:val="left" w:pos="560"/><w:tab w:val="left" w:pos="1120"/><w:tab w:val="left" w:pos="1680"/><w:tab w:val="left" w:pos="2240"/><w:tab w:val="left" w:pos="2800"/><w:tab w:val="left" w:pos="3360"/><w:tab w:val="left" w:pos="3920"/><w:tab w:val="left" w:pos="4480"/><w:tab w:val="left" w:pos="5040"/><w:tab w:val="left" w:pos="5600"/><w:tab w:val="left" w:pos="6160"/><w:tab w:val="left" w:pos="6720"/></w:tabs><w:autoSpaceDE w:val="0"/><w:autoSpaceDN w:val="0"/><w:adjustRightInd w:val="0"/><w:spacing w:line="480" w:lineRule="auto"/>
<w:rPr><w:rFonts w:cs="Times"/></w:rPr>
</w:pPr>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Times New Roman"/>
<w:color w:val="000000"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:tab/>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve">Even though</w:t>
</w:r>
<w:r w:rsidR="00BA3E1D">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t>‘ulama’</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve"> like </w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica"/>
</w:rPr>
<w:t>Qaradawi</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Menlo Regular"/>
</w:rPr>
<w:t xml:space="preserve"> assume that</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve"> images of certain objects </w:t>
</w:r>
```
Here's how it's extracted:
```html
<p>
<span style="font-family: Times New Roman"><tab/></span>
<span style="font-family: Times">Even though </span>
<span style="font-family: Times">‘ulama’</span>
<span style="font-family: Times"> like </span>
<span style="font-family: Helvetica">Qaradawi</span>
<span style="font-family: Menlo Regular"> assume that</span>
<span style="font-family: Times"> images of certain objects </span>
```
And here's the final html
```html
<p><span class="tab"><!-- tab --></span>
<span style="font-family: Times">Even though ‘ulama’ like </span>
<span style="font-family: Helvetica">Qaradawi</span>
<span style="font-family: Menlo Regular"> assume that</span>
<span style="font-family: Times"> images of certain objects
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/104
Incorrect font family applied to a paragraph
2018-03-28T17:08:25Z
Alex Theg

See Best, References:
Almost all the references are in `<p>`s or spans that designate `font-family: Arial`. But there is one entry that displays as Times New Roman in the html.
From the final html (rinsed): this snippet shows 2 bibliography entries. The first entry displays (incorrectly) as Times New Roman in the browser, while the second entry is a typical entry that displays correctly:
```html
<p style="margin-left: 36pt; padding-left: 36pt; text-indent: -36pt">Aud, Susan, William, Hussar, Frank Johnson, Grace Kena, Erin Roth, Eileen Manning, Xiaolel Wang, and Jijun Zhang. 2012. <i>The Condition of Education 2012</i>. Washington: National Center for Education Statistics. (http://nces.ed.gov/pubs2012/2012045.pdf--retrieved May 15, 2013)<span style="font-family: Times New Roman; font-size: 10.5pt"> </span></p>
<p style="font-family: Arial; font-size: 12pt; margin-left: 36pt; padding-left: 36pt; text-indent: -36pt">Aud, Susan, William, Hussar, Grace Kena, Kevin Bianco, Lauren Frohlich, Jana Kemp, and Kim Tahan. 2011. <i>The Condition of Education 2011</i>. Washington: National Center for Education Statistics. (http://nces.ed.gov/pubs2011/2011033.pdf--retrieved May 16, 2013). </p>
```
The correct entry above specifies `<p style="font-family: Arial;` but the first one doesn't.
Looking at the initial extraction shows the reason. While the paragraph consists mostly of spans with `style="font-family: Arial"`, there's one empty span at the very end with a `style="font-family: Times New Roman"`.
```html
<p style="margin-left: 36pt; text-indent: -36pt; padding-left: 36pt">
<span style="font-family: Arial; font-size: 12pt">Aud, S</span>
<span style="font-family: Arial; font-size: 12pt">usan, William, Hussar, Frank</span>
<span style="font-family: Arial; font-size: 12pt"></span>
<span style="font-family: Arial; font-size: 12pt">Johnson, Grace Kena, Erin Roth, Eileen Manning, Xiaolel Wang,
</span>
<span style="font-family: Arial; font-size: 12pt">and
</span>
<span style="font-family: Arial; font-size: 12pt">Jijun Zhang.
</span>
<span style="font-family: Arial; font-size: 12pt">2012.</span>
<span style="font-family: Arial; font-size: 12pt"></span>
<span style="font-family: Arial; font-size: 12pt"></span>
<span style="font-family: Arial; font-size: 12pt">
<i>The Condition of Education 2012</i>
</span>
<span style="font-family: Arial; font-size: 12pt">. Washington:
</span>
<span style="font-family: Arial; font-size: 12pt">National Center for Education Statistics.
</span>
<span style="font-family: Arial; font-size: 12pt">
(</span>
<span style="font-family: Arial; font-size: 12pt">http://nces.ed.gov/pubs2012/2012045.pdf--retrieved May 15</span>
<span style="font-family: Arial; font-size: 12pt">, 2013)</span>
<span style="font-family: Times New Roman; font-size: 10.5pt"></span>
</p>
```
It seems the fact that there are multiple font families used is enough to keep the font-family from being added into paragraph styles.
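One possible refinement, sketched in Python rather than in the pipeline's own XSLT: when deciding whether a paragraph's spans share a font, skip spans that contain no text. The `shared_font` helper and the trimmed `SAMPLE` are hypothetical, and treating whitespace-only spans as empty is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed version of the extracted paragraph: two Arial spans and one empty Times span.
SAMPLE = """<p>
<span style="font-family: Arial; font-size: 12pt">Aud, S</span>
<span style="font-family: Arial; font-size: 12pt">2012.</span>
<span style="font-family: Times New Roman; font-size: 10.5pt"></span>
</p>"""


def shared_font(paragraph):
    """Return the font-family shared by all non-empty spans, or None if they differ."""
    fonts = set()
    for span in paragraph.iter("span"):
        if not "".join(span.itertext()).strip():
            continue  # assumption: spans with no visible text should not vote
        match = re.search(r"font-family:\s*([^;]+)", span.get("style", ""))
        fonts.add(match.group(1).strip() if match else None)
    return fonts.pop() if len(fonts) == 1 else None


print(shared_font(ET.fromstring(SAMPLE)))
```

With the empty Times New Roman span skipped, the remaining spans agree on Arial, so the font could be hoisted to the paragraph.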
Can you update this to ignore fonts that are applied to empty tags? That would keep invisible tags sprinkled into a Word doc from causing incorrect fonts in the html.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/94
Inconsistent indentation
2018-03-15T16:27:37Z
Alex Theg

See Gilbert ch 2: [b06_ch02.docx](/uploads/6bd821eeabee201b9cd021c00aff5382/b06_ch02.docx)
The indented spot starting with "Gramma?" has the same indentation level as the dialogue right below it:
![Screen_Shot_2017-04-19_at_11.09.32_AM](/uploads/c383012829026bab0f8fe39033c0547c/Screen_Shot_2017-04-19_at_11.09.32_AM.png)
But in the HTML, there are 3 lines that have a "padding-left: 18pt," which makes them appear more indented than they do in the Word doc:
“How come some of them have their shawls over their heads?”
“Nisht shawl,” Gramma corrects, “tallis.”
“But why are they bowing up and down?”
Low priority, as indentation doesn't port into Editoria.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/92
Header promotion example for Alex to check
2019-07-07T22:40:39Z
Alex Theg

In Gilbert, fwd: [a03_fwd.docx](/uploads/69b43e22a61426ec09e83037cf875c35/a03_fwd.docx)
Why is "Holly Near" not labeled a header, but "12/21/2014" is?
For Alex to check after next iteration of header promotion
1.0.0
Alex Theg

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/90
Some indentations not reflected in HTML
2018-03-22T19:18:06Z
Alex Theg

Garcia ch 4: [d_Chapter4GGv5.docx](/uploads/394c8e688977abb7158be3801f068a21/d_Chapter4GGv5.docx)
The first 3 paragraphs are indented in the HTML as `<p style="text-indent: 36pt">`, but starting with the 4th paragraph ("The urban structures..."), the indentation stops. This appears to come from the styles: the first 3 paragraphs are "Normal + First Line: 0.5" but those from the 4th on are just "Normal." However, these paragraphs are indeed indented when displayed in Word.
Is there a way to respect these 1st-line indentations that aren't coming through? This is related to #81 in that we want to recreate what's actually displayed in Word.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/89
Some bib entries formatting coming through inconsistently
2018-03-15T16:28:13Z
Alex Theg

Garcia bib (file on #87)
The bibliography entry for "Haring, C. H." is not indented like the other entries. The other entries all have margin-left, text-indent, and padding-left properties:
```html
<p style="margin-left: 36pt; text-indent: -36pt; padding-left: 36pt; font-family: Times New Roman; font-size: 12pt">
```
But this one only has:
```html
<p style="font-family: Times New Roman; font-size: 12pt">
```
Can you tell why this is extracted differently than the others? There are a few more similar cases in the bib. However it's implemented under the hood, we'd want the conversion to reflect the actual display as much as possible.
Incidentally, it also has a spurious line break in it, but this was introduced by the author, so there is nothing to do about it.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/87
Don't distinguish between otherwise equivalent style signatures on "text-align: left" alone
2019-07-07T22:39:46Z
Alex Theg

Garcia bibliography: [e_BibliographyGGv6.docx](/uploads/c994d731e262919ab6c4a84d8df7ec66/e_BibliographyGGv6.docx)
Alex, check this for entries and lines that are promoted to headers. Hopefully considering the display text differently (#81) will stop some of the spurious promotions.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/83
Improve header level guesses by considering order of appearance
2018-03-14T20:27:18Z
Alex Theg

As mentioned on the 4/15/17 comment on ticket #81:
How can header promotion consider the order of appearance of the headers to improve the heading level it guesses? How is it currently deciding which level to promote things to? Does it consider the resulting structure of the document?
We'll probably want to implement some checks for this. My first thought is that once header promotion has identified all the different heading levels (say there are 3 different formats, so it knows for sure there should be 3 different heading levels), it could then order them according to whatever looks like it produces the most credible heading structure.
As an example, let's say it finds 3 levels of headers and initially promotes them as follows:
h2
h1
h1
h3
h1
h3
h3
h1
A final step would realize that changing the order around to get this structure makes better sense:
h1
h2
h2
h3
h2
h3
h3
h2
and make the change.
We'd need to give it a few rules it can use to score the structures and choose the best one. A good start could be to say that *generally* (not rigidly) 1) lower level headers should be nested under higher level ones, and 2) sequential heading levels should either stay the same or increment or decrement by 1 level.
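As a rough illustration of such scoring, the Python sketch below tries every relabeling of the detected levels and keeps the one with the fewest violations of the two soft rules. The `penalty` and `best_relabeling` names, and the equal weighting of the two rules, are assumptions for illustration only.

```python
from itertools import permutations


def penalty(levels):
    """Count violations of the two soft rules in a sequence of heading levels."""
    score = 0
    seen = set()
    previous = None
    for level in levels:
        # Rule 1: a heading below the top level should have a parent level before it.
        if level > 1 and (level - 1) not in seen:
            score += 1
        # Rule 2: consecutive headings should differ by at most one level.
        if previous is not None and abs(level - previous) > 1:
            score += 1
        seen.add(level)
        previous = level
    return score


def best_relabeling(levels):
    """Try every permutation of the distinct levels and return the lowest-penalty sequence."""
    distinct = sorted(set(levels))
    candidates = []
    for perm in permutations(distinct):
        mapping = dict(zip(distinct, perm))
        relabeled = [mapping[level] for level in levels]
        candidates.append((penalty(relabeled), relabeled))
    return min(candidates, key=lambda candidate: candidate[0])[1]


initial = [2, 1, 1, 3, 1, 3, 3, 1]  # the initial promotion from the example above
print(best_relabeling(initial))     # prints [1, 2, 2, 3, 2, 3, 3, 2]
```

Because the rules are meant to hold only generally, a real implementation would likely weight them rather than count them equally, and would need a tie-breaking policy.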
For improving the accuracy of heading levels, I don't think the formatting itself will ever tell us much about what level a group of headings should be. That's because authors use formatting in so many different and entirely inconsistent ways. Considered in a vacuum, there's no reasonable way to say something like "bolding denotes a higher level heading than underlining, which denotes a higher level than italics". I think the best way to improve header level inferring will be by looking at their order of appearance and what that might say about the structure.
What do you think?
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/78
Buchbinder Ch 2 headings don't get promoted
2018-03-14T20:41:17Z
Alex Theg

Compare Chapters 1 and 2 of Buchbinder
[c_Buchbinder_Chap_1.docx](/uploads/bca82798d963a32c026fecfcd868aa76/c_Buchbinder_Chap_1.docx)
[output_Buchbinder_Chap_1.zip](/uploads/c4d1c813323a451c901734e12f6649af/output_Buchbinder_Chap_1.zip)
[c_Buchbinder_Chap_2.docx](/uploads/623a4ca2c779cd0571cac9bd4b0507fa/c_Buchbinder_Chap_2.docx)
[output_Buchbinder_Chap_2.zip](/uploads/e0e37f742260d239b6e89a7129733a90/output_Buchbinder_Chap_2.zip)
Header promotion works well in Ch 1, but misses all the headings beyond the very top level in Ch 2. In Ch 1, everything that gets promoted is actually a header.
In Ch 2, the Chapter title and subtitle get caught as h1s, but none of the other headings get promoted. There are 25 h2 promotions, but they are all either entirely empty, filled only with tabs, or a visual divider made of asterisks.
Any ideas as to the differences between chapters 1 and 2 that cause 1 to work well but 2 to be less accurate? It doesn't appear to be Word styles.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/76
Brinton Ch 2: an incorrect h4 promotion
2019-07-07T22:22:05Z
Alex Theg

Brinton Ch 2: [b02_Brinton_Chapter_2.docx](/uploads/6fbc8fffc4a354e2f90e8c6cbdfae434/b02_Brinton_Chapter_2.docx)
[output_brinton_ch_2.zip](/uploads/fe7a849192953b6e043e6dc1681709b8/output_brinton_ch_2.zip)
This bit gets promoted to an h4 but it's not:
`<h4 class="FootnoteText">According to an interview done with Sha‘rawi:</h4>`
Can you tell why? It's listed as FootnoteText, not sure if that contributes to it or not.
There is also another empty tag that gets promoted to an h4, but since it's empty it gets removed in typescript (there's a request in to move the empty header removal into header promotion itself - #55).
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/73
Bragg Ch 3 headings
2019-07-07T22:24:41Z
Alex Theg

Bragg, chs 1-3: [braggs_ch_1_2_3.zip](/uploads/92cd9991310a13f4c4b4546d5ac324e0/braggs_ch_1_2_3.zip)
[output_braggs_ch1_2_3.zip](/uploads/ab8be891d47acd05da2bdf2c6b81624f/output_braggs_ch1_2_3.zip)
In Bragg chs. 1 and 2, most of the headings get caught and promoted (not perfect but fine), but the headings don't get caught for ch. 3. It's hard for me to tell exactly why, can you?
Across some chapters, there are a few erroneously applied class="Heading1", which don't affect the display of the text in Word, but may affect the header promotion:
* Ch 1, a note "Bryan Wagner introduces..."; not promoted to a header. Header promotion worked well in this chapter
* Ch 2, no erroneous "Heading1"s; header promotion worked pretty well
* Ch 3, 6 paragraphs labeled as "Heading1". There's only one header promotion in this doc, and it's an empty header labeled class="Heading1". None of the actual headings are caught.
Anyway, I don't know if the class="Heading1" labels contribute to what we see in Chapter 3 - do they? What might fix the Ch 3 header promotion issue?
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/70
Contradictory styles make a quote display as bold in the browser
2018-03-28T21:54:45Z
Alex Theg

Bowen, ch 6: [b_Bowen_Chapter_6.docx](/uploads/cf42db0827d247ab201bae29d39cf463/b_Bowen_Chapter_6.docx)
[output_bowen_ch_6.zip](/uploads/b2dbbd8c07fb2a8b29f2e2f2d88b4bd4/output_bowen_ch_6.zip)
Low priority issue; let's come back to it later:
There's one small inline quotation that displays as bold in the browser ("a blatant example..."). This is surely because it was copy/pasted into Word from somewhere else; while the surrounding text is style "Normal," this quote is style "Strong + Not Bold" (lol). As a result, the quote is bolded when viewed in the browser.
This is low priority, because
1. once this goes through Typescript, the aberrant styling gets removed as one that's not important, so it would only help improve the html
2. the solution seems complicated for very small payoff
So let's keep this on the backburner for now.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/68
Note on end of header stops promotion
2019-07-07T22:32:26Z
Alex Theg

See #67 for files, Bowen Ch 1
The heading "The War on Terroir" isn't getting promoted to an h1, like all the other headings in the chapter (which, apart from the very 1st chapter title, are all the same level). This is because there's a note at the end of the heading:
![Screen_Shot_2017-04-09_at_4.58.12_AM](/uploads/a421fc609d8d6dc2361af6eabb229839/Screen_Shot_2017-04-09_at_4.58.12_AM.png)
Removing the note and re-running the conversion fixes the issue. Is there an easy fix for this, so a note at the end of the header doesn't stop promotion?
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/67
Check that headers promoted to the same level are internally consistent
2018-03-15T17:44:14Z
Alex Theg

b Bowen Chapter 1: [b_Bowen_Chapter_1.docx](/uploads/1d451dda0fc0a1b5a2c5afc8f93eeb60/b_Bowen_Chapter_1.docx)
Output: [output_b_Bowen_Chapter_1.zip](/uploads/fd1d441a66ac8f328a7cf3922661a6a0/output_b_Bowen_Chapter_1.zip)
Overall this chapter comes through quite nicely, and it offers an interesting case for header promotion.
The only headers this has are 1) a chapter title, and 2) 11 same-level headings in the body of the text.
1. The chapter title is bold and centered
2. The other headings are bold but not centered
Beyond that, the text is the same as the rest of the content. Word styles don't seem to factor into this one. So, here we have two levels of headers that are different from each other only in that one is centered and one is not. I think this is an important clue that these are different heading levels.
I've just proposed a step to check whether different heading levels are formatted the same, and combine them if they are (#64). The other side of that coin would be a step that ensures that all the headings promoted to the same level are *internally* consistent on some key criteria, and I think text alignment is one of those criteria.
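The internal-consistency check could be sketched roughly as follows. This is a minimal Python illustration of the logic only, not XSweet code (XSweet's own pipeline is XSLT). The function name `inconsistent_levels`, the representation of each promoted heading as a `(level, style)` pair, and the assumption that a heading with no `text-align` property is left-aligned are all hypothetical.

```python
from collections import defaultdict


def inconsistent_levels(headings, key="text-align", default="left"):
    """Return {level: set of values} for levels whose headings disagree on `key`.

    `headings` is an iterable of (level, style) pairs, where `style` is a
    dict of CSS-like properties. Headings without `key` are assumed to
    carry `default` (an assumption made for this sketch).
    """
    seen = defaultdict(set)
    for level, style in headings:
        seen[level].add(style.get(key, default))
    # Keep only the levels where more than one distinct value was found.
    return {level: values for level, values in seen.items() if len(values) > 1}


# Hypothetical input: a centered chapter title and two uncentered body
# headings that were all promoted to the same level.
promoted = [
    (1, {"font-weight": "bold", "text-align": "center"}),
    (1, {"font-weight": "bold"}),
    (1, {"font-weight": "bold"}),
]
flagged = inconsistent_levels(promoted)
```

A level that shows up in the result is a candidate for being split into two levels, for example with the centered heading ranked above the others.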
What do you think?
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/66
Table conversion
2018-03-28T22:02:35Z
Alex Theg
Bowen, "Captions": [a_Bowen_Captions.docx](/uploads/0fb3bc9288a45e7b52fc925845714a89/a_Bowen_Captions.docx)
Outputs: [output_Bowen_captions.zip](/uploads/b6210ad20f25425704bdc86ada9fc57f/output_Bowen_captions.zip)
This table comes through very nicely - w00t! I don't see how this particular example could be improved.
Nothing to do about this issue right now. I will use it as a place to put examples of tables as I find them.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/64
Promotion step to combine similarly formatted headings to same level
2018-03-15T17:46:13Z
Alex Theg
Bakker Ch 1:
[b_02_ch_1_Bakker.docx](/uploads/eb8fb4e69594a4462a16f781d74d2379/b_02_ch_1_Bakker.docx)
Bakker Ch 1 conversion:[output_b02_Ch_1_Bakker.zip](/uploads/c96603d3d4db9be2f485c75d80715269/output_b02_Ch_1_Bakker.zip)
See issue #56 for a full description. In this chapter, one of the heading levels, consistently formatted as italic but using different Word styles, comes through marked as 2 different header levels. This suggests there could be a step at the end of header promotion that compares the formatting of each heading level it has identified with the formatting of the other heading levels, and combines similarly formatted headers into the same level.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/62
Handling whitespace-only formatting
2019-07-07T21:34:20Z
Alex Theg
Bakker ch1, see #56 for files
There are 5 headers of the same level, but one of them doesn't get promoted like the others. This seems to be caused by a `<tab>` at the end of the heading "The heroic migrant and the end of migration".
These all get promoted to h1:
* Keeping the monies flowing the times of crises
* The limits of migrant inclusion
* Migration, state-led transnationalism, and development
* The Washington Consensus and beyond: the continuing significance of market fundamentalism in development policy and practice
This one doesn't:
* The heroic migrant and the end of migration
Here's one that gets promoted, just after join-elements and before the header promotion steps:
````html
<p class="Default" style="font-size: 12pt; font-style: italic; margin-bottom: 6pt"><i>The limits of migrant inclusion</i></p>
````
This is the offending tab (at least, I think it's the tab keeping this from being recognized as a header):
````html
<p class="Default" style="font-size: 12pt; margin-bottom: 6pt"><i>The heroic migrant and the end of migration</i>
<tab/>
</p>
````
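A trailing tab like this one could be dropped in a cleaning pass that runs before header promotion. The following is a minimal Python sketch of that logic, using the standard library's `xml.etree.ElementTree`; it is not XSweet code (XSweet's own pipeline is XSLT), and the function name `strip_trailing_tabs` is hypothetical. It assumes the tab is a direct child of the `<p>` and is followed only by whitespace, as in the snippet above; a tab nested inside an inline element such as `<i>` is not handled.

```python
import xml.etree.ElementTree as ET


def strip_trailing_tabs(p):
    """Remove <tab/> children that end a paragraph, plus surrounding whitespace.

    A tab counts as trailing only if nothing but whitespace follows it.
    Tabs in the middle of the paragraph are left alone.
    """
    children = list(p)
    while (
        children
        and children[-1].tag == "tab"
        and not (children[-1].tail or "").strip()
    ):
        # Removing the element also drops the whitespace that followed it.
        p.remove(children.pop())
        # Trim the whitespace that preceded the removed tab.
        if children:
            children[-1].tail = (children[-1].tail or "").rstrip() or None
        else:
            p.text = (p.text or "").rstrip() or None
    return p


# The paragraph from this issue, with the tab at the end of the heading.
heading = ET.fromstring(
    '<p class="Default" style="font-size: 12pt; margin-bottom: 6pt">'
    "<i>The heroic migrant and the end of migration</i>\n<tab/>\n</p>"
)
strip_trailing_tabs(heading)
```

After this pass the paragraph contains only the `<i>` element, so it has the same shape as the headings that do get promoted.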
Perhaps a cleaning step that strips out trailing tabs before promotion? I can't think where a trailing tab would ever be meaningful.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/60
Invisible-to-the-eye changes in Word style cause paragraphs to be promoted to headers
2019-07-07T22:10:58Z
Alex Theg
See #59 for the word files and outputs for Berry
In Berry, invisible (not-format changing) changes in Word style are causing paragraphs to be promoted to headers.
Most of the main content is styled in Word as: "Normal + (Latin) Garamond". These examples from Berry chapter 2 show the style changes that cause the erroneous promotion:
* Comment Text: The 3 paragraphs following "It is certainly true"
* Body Text: "Many new threads"
* Body Text 2: "In 1906, amid"
* Block Text: "The Mountaineers is an association"
And in ch 3, 2 regular paragraphs are styled as headings behind the scenes and thus get promoted.
Although a length filter would solve this (#58), this also relates to the policy about what Word styles do and don't matter, and how to handle them (soon to be documented on the wiki). These sound like styles we'd want to ignore for header promotion. Are these specific Word styles important to header promotion, or is it rather simply that the style has changed from the rest of the content?
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/59
Some epigraphs, not others, being promoted to headers
2019-07-07T22:35:47Z
Alex Theg
[berry.zip](/uploads/33c200a740753230a1f3c0f2f8c3b08c/berry.zip)
[output_berry.zip](/uploads/e2f33cf0126f7681c096caef3b2ae834/output_berry.zip)
In Berry, each chapter begins with an epigraph. Sometimes the epigraph is promoted to a heading, but not always. The epigraph is promoted in the intro, ch1, ch2, and the conclusion. It is not promoted in ch3 or ch4.
Can you tell what's causing the difference, and how to keep these from being labeled headings? I can't see that it's a Word styles issue. A filter to not tag a header if it's too long (#58) could stop most of these, but fixing whatever else is going on could stop future errors too.
1.0.0
Wendell Piez