XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2018-07-26T22:13:56Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/106
Handle lists
2018-07-26T22:13:56Z
Alex Theg

Let’s see what we can do with lists!
In Word, there’s no tag that wraps an entire list. Lists are implemented as a sequence of consecutive `<p>` tags of style “ListParagraph”:
```xml
<w:p w14:paraId="5D2E8ABB" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>seven</w:t>
</w:r>
</w:p>
```
Here's what makes sense to me for handling lists:
1. look for paragraphs with “ListParagraph” style in the `<w:pPr>` (so this would have to happen before this info gets thrown away)
2. once you find one, keep checking subsequent paragraphs until you find a `<w:p>` without the “ListParagraph” style
3. wrap all the consecutive ListParagraph items in a list tag
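The three steps above can be sketched roughly as follows. This is Python rather than XSLT, purely to illustrate the grouping logic; `is_list_paragraph`, `paragraph_text` and `group_lists` are made-up names, and the sample is a trimmed version of the test document, not XSweet's actual code.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def is_list_paragraph(p):
    """True if the paragraph's w:pStyle is "ListParagraph"."""
    style = p.find("w:pPr/w:pStyle", NS)
    return style is not None and style.get(f"{{{W}}}val") == "ListParagraph"


def paragraph_text(p):
    """Concatenate the text of every w:t inside the paragraph."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def group_lists(body):
    """Yield ("list", [item texts]) for each run of consecutive list
    paragraphs and ("p", text) for every other paragraph, in order."""
    paragraphs = body.findall("w:p", NS)
    for is_list, run in groupby(paragraphs, key=is_list_paragraph):
        if is_list:
            yield ("list", [paragraph_text(p) for p in run])
        else:
            for p in run:
                yield ("p", paragraph_text(p))


# Trimmed sample based on the test document (attributes omitted).
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Bulleted list</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="ListParagraph"/></w:pPr><w:r><w:t>seven</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="ListParagraph"/></w:pPr><w:r><w:t>fourteen</w:t></w:r></w:p>
  <w:p><w:r><w:t>Multilevel</w:t></w:r></w:p>
</w:body>
""".strip()

result = list(group_lists(ET.fromstring(SAMPLE)))
```

Note that this only looks at the paragraph style. Two different lists that sit directly next to each other with no paragraph in between would be merged into one, so a real implementation would probably also want to check the list id.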
In the `<w:numPr>` tag, there is definitely one and maybe two child elements that we’d care about: `w:ilvl` and `w:numId`.
`w:ilvl` indicates how deeply the list item is nested. Top-level items have an ilvl of 0, a sub-list item has an ilvl of 1, a list item nested under that would have an ilvl of 2, etc:
* Top level list item // ilvl=0
* Nested list item // ilvl=1
These define the structure of the list, and should ideally be reflected in the html.
The numId value refers to the numbering definition for a given list, which lives in the numbering.xml file. Each time a list is made, its style info is added to numbering.xml under that numId. That means there’s no indication of how a list is styled in the document.xml itself. To get the list style info (numbered, bulleted, undecorated, etc.), XSweet would have to follow the reference into the numbering.xml file. Is that possible, or too hard to do?
* If it is possible, the html would more faithfully reflect the Word doc, and Typescript could port lists into Editoria as the correct type (numbered, unnumbered, or bulleted).
* If it’s too difficult to start with this, though, then numId could probably just be dropped.
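For what it's worth, following the reference looks fairly mechanical. As far as I understand the format, a `w:num` entry in numbering.xml points to a `w:abstractNum`, which holds one `w:lvl` per nesting level with a `w:numFmt` such as `bullet` or `decimal`. Here is a rough Python sketch of that lookup. The `NUMBERING` sample is hypothetical and heavily trimmed (the real numbering.xml for list_test.docx isn't shown here), `list_format` is a made-up name, and the sketch ignores level overrides and style links.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"
NUM_ID = f"{{{W}}}numId"
ABSTRACT_ID = f"{{{W}}}abstractNumId"
ILVL = f"{{{W}}}ilvl"

# Hypothetical, trimmed numbering.xml used only for illustration.
NUMBERING = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="bullet"/></w:lvl>
  </w:abstractNum>
  <w:abstractNum w:abstractNumId="1">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl>
    <w:lvl w:ilvl="1"><w:numFmt w:val="lowerLetter"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
  <w:num w:numId="3"><w:abstractNumId w:val="1"/></w:num>
</w:numbering>
""".strip()


def list_format(numbering, num_id, ilvl):
    """Follow numId -> abstractNumId -> lvl -> numFmt.

    Returns the numFmt value (e.g. "bullet"), or None if any link is missing.
    """
    num = next(
        (n for n in numbering.findall("w:num", NS) if n.get(NUM_ID) == num_id),
        None,
    )
    if num is None:
        return None
    ref = num.find("w:abstractNumId", NS)
    if ref is None:
        return None
    for abstract in numbering.findall("w:abstractNum", NS):
        if abstract.get(ABSTRACT_ID) == ref.get(VAL):
            for lvl in abstract.findall("w:lvl", NS):
                if lvl.get(ILVL) == ilvl:
                    fmt = lvl.find("w:numFmt", NS)
                    return None if fmt is None else fmt.get(VAL)
    return None


numbering_root = ET.fromstring(NUMBERING)
```

With something like this, a `bullet` format could map to an unordered list and anything else to an ordered one, and a `None` result could fall back to dropping numId as suggested above.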
It seems like this should be its own xsl sheet, early in the pipeline. Really curious to hear what you think!
Here’s a test Word doc, along with the xml.
[list_test.docx](/uploads/fd5b331e9d5f75e71d5630edb2cb3cd8/list_test.docx)
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:mv="urn:schemas-microsoft-com:mac:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w14:paraId="3BB77D17" w14:textId="77777777" w:rsidR="00B82E58" w:rsidRDefault="00D02C23">
<w:r w:rsidRPr="00D02C23">
<w:t>Although states and other interested parties sponsor and partially control the content of religious messages, individuals do react variously to those messages. Viewers interpret and appropriate words and images in particular ways, and according
to their own situations, since shared television reception is also an individualized experience.</w:t>
</w:r>
</w:p><w:p w14:paraId="7E7F6913" w14:textId="77777777" w:rsidR="00260CBA" w:rsidRDefault="00260CBA" w:rsidP="007D2792"/>
<w:p w14:paraId="77D8D9E3" w14:textId="0BC2F129" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Here are some example lists</w:t>
</w:r>
<w:r w:rsidR="00DB59FD">
<w:t>:</w:t>
</w:r>
</w:p><w:p w14:paraId="552BC6C9" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23"/>
<w:p w14:paraId="0B75E776" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Bulleted list</w:t>
</w:r>
</w:p>
<!-- bullet -->
<w:p w14:paraId="5D2E8ABB" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>seven</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5D8D1F85" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>fourteen</w:t>
</w:r>
</w:p>
<w:p w14:paraId="09FF0743" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
</w:pPr>
<w:r>
<w:t>three</w:t>
</w:r>
</w:p><w:p w14:paraId="17C655EE" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23"/>
<!-- multilevel bullet -->
<w:p w14:paraId="2EDB54F8" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Multilevel</w:t>
</w:r>
</w:p>
<w:p w14:paraId="2F5768D7" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point A</w:t>
</w:r>
</w:p>
<w:p w14:paraId="475DC32B" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point B</w:t>
</w:r>
</w:p>
<w:p w14:paraId="731511E6" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point B.1</w:t>
</w:r>
</w:p>
<w:p w14:paraId="0A5C9720" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point B.2</w:t>
</w:r>
</w:p>
<w:p w14:paraId="1D1E4FC3" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Point C</w:t>
</w:r>
</w:p><w:p w14:paraId="2EE6BAF7" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23"/>
<!-- numbered -->
<w:p w14:paraId="649F6860" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="007D2792">
<w:r>
<w:t>Numbered list</w:t>
</w:r>
</w:p>
<w:p w14:paraId="49E83B01" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Stop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="6D2C65FC" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Drop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5B515F78" w14:textId="77777777" w:rsidR="00D02C23" w:rsidRDefault="00D02C23" w:rsidP="00D02C23">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Roll</w:t>
</w:r>
</w:p><w:p w14:paraId="3AD4CE3C" w14:textId="77777777" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3"/>
<!-- multilevel numbered -->
<w:p w14:paraId="581E2232" w14:textId="4791D278" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:r>
<w:t xml:space="preserve">Multilevel numbered
</w:t>
</w:r>
<w:r>
<w:t>list</w:t>
</w:r>
</w:p>
<w:p w14:paraId="4450BE7A" w14:textId="77777777" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Stop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="68A823BD" w14:textId="77777777" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Drop</w:t>
</w:r>
</w:p>
<w:p w14:paraId="482394C4" w14:textId="2CE4A683" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>To the ground</w:t>
</w:r>
</w:p>
<w:p w14:paraId="3367B702" w14:textId="250D58D1" w:rsidR="003808A3" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Or the ceiling</w:t>
</w:r>
</w:p>
<w:p w14:paraId="0685DA8E" w14:textId="7F5507ED" w:rsidR="003808A3" w:rsidRPr="00D02C23" w:rsidRDefault="003808A3" w:rsidP="003808A3">
<w:pPr><w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="4"/></w:numPr>
</w:pPr>
<w:r>
<w:t>Roll</w:t>
</w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p>
<w:sectPr w:rsidR="003808A3" w:rsidRPr="00D02C23" w:rsidSect="00227C1D"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/><w:docGrid w:linePitch="360"/></w:sectPr>
</w:body>
</w:document>
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/105
Incorrect fonts in html - coming from w:rFonts attributes?
2018-04-24T05:21:50Z
Alex Theg

Brinton Ch 8 has some incorrect fonts coming through into the html. The following text is all Times in Word:
>Even though ‘ulama’ like Qaradawi assume that images...
However, it comes through in the rinsed html in 3 different fonts: Times, Menlo Regular, and Helvetica. It looks like it has to do with the `w:rFonts` attributes: `w:cs`, `w:eastAsia`, `w:ascii` and `w:hAnsi`. These specify the font to use for certain character types.
The word "Qaradawi" is extracted as Helvetica:
```xml
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica"/>
</w:rPr>
<w:t>Qaradawi</w:t>
</w:r>
```
And " assume that " is extracted as Menlo Regular:
```xml
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Menlo Regular"/>
</w:rPr>
<w:t xml:space="preserve"> assume that</w:t>
</w:r>
```
The html doesn't specify different fonts for different character types in the same way. How does XSweet handle these `w:rFonts` attributes? Since this displays in the original Word as all Times, I am guessing that Word doesn't consider any of the characters in these runs to be of the type `w:eastAsia` or `w:cs`, but I'm not sure how it decides what kind of character it's looking at. Do you have a better idea what's going on here?
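As I understand the WordprocessingML spec, `w:ascii` covers basic Latin characters, `w:hAnsi` covers other non-East-Asian, non-complex-script characters, and `w:eastAsia` and `w:cs` only apply to East Asian and complex-script text. If that's right, a run of plain Latin text should take its font from `w:ascii`/`w:hAnsi` and otherwise inherit. A rough Python sketch of that rule follows; `latin_font` is a made-up name, and this deliberately ignores `w:hint`, theme fonts and the run-level complex-script flag.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def latin_font(run):
    """Return the font a run declares for Latin text, or None to inherit.

    Only w:ascii and w:hAnsi are consulted; w:eastAsia and w:cs are ignored
    because they are assumed not to apply to Latin characters.
    """
    rfonts = run.find("w:rPr/w:rFonts", NS)
    if rfonts is None:
        return None
    return rfonts.get(f"{{{W}}}ascii") or rfonts.get(f"{{{W}}}hAnsi")


# The " assume that" run from the paragraph in this issue.
MENLO_RUN = ET.fromstring(
    f'<w:r xmlns:w="{W}"><w:rPr>'
    '<w:rFonts w:eastAsia="Helvetica" w:cs="Menlo Regular"/>'
    '</w:rPr><w:t xml:space="preserve"> assume that</w:t></w:r>'
)

# A run that does declare Latin fonts.
TIMES_RUN = ET.fromstring(
    f'<w:r xmlns:w="{W}"><w:rPr>'
    '<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>'
    '</w:rPr><w:t>no flood will carry you away.</w:t></w:r>'
)
```

Under this rule the Menlo run declares no Latin font at all, so it would inherit whatever the paragraph or style provides, which would match what Word displays.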
Here's the full XML:
```xml
<w:p w14:paraId="3E8B35BD" w14:textId="77777777" w:rsidR="00DE7EE7" w:rsidRPr="009337E2" w:rsidRDefault="00DE7EE7" w:rsidP="00DE7EE7">
<w:pPr><w:widowControl w:val="0"/>
<w:tabs><w:tab w:val="left" w:pos="560"/><w:tab w:val="left" w:pos="1120"/><w:tab w:val="left" w:pos="1680"/><w:tab w:val="left" w:pos="2240"/><w:tab w:val="left" w:pos="2800"/><w:tab w:val="left" w:pos="3360"/><w:tab w:val="left" w:pos="3920"/><w:tab w:val="left" w:pos="4480"/><w:tab w:val="left" w:pos="5040"/><w:tab w:val="left" w:pos="5600"/><w:tab w:val="left" w:pos="6160"/><w:tab w:val="left" w:pos="6720"/></w:tabs><w:autoSpaceDE w:val="0"/><w:autoSpaceDN w:val="0"/><w:adjustRightInd w:val="0"/><w:spacing w:line="480" w:lineRule="auto"/>
<w:rPr><w:rFonts w:cs="Times"/></w:rPr>
</w:pPr>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Times New Roman"/>
<w:color w:val="000000"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:tab/>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve">Even though</w:t>
</w:r>
<w:r w:rsidR="00BA3E1D">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t>‘ulama’</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve"> like </w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica"/>
</w:rPr>
<w:t>Qaradawi</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:eastAsia="Helvetica" w:cs="Menlo Regular"/>
</w:rPr>
<w:t xml:space="preserve"> assume that</w:t>
</w:r>
<w:r w:rsidRPr="009337E2">
<w:rPr>
<w:rFonts w:cs="Times"/>
</w:rPr>
<w:t xml:space="preserve"> images of certain objects </w:t>
</w:r>
```
Here's how it's extracted:
```html
<p>
<span style="font-family: Times New Roman"><tab/></span>
<span style="font-family: Times">Even though </span>
<span style="font-family: Times">‘ulama’</span>
<span style="font-family: Times"> like </span>
<span style="font-family: Helvetica">Qaradawi</span>
<span style="font-family: Menlo Regular"> assume that</span>
<span style="font-family: Times"> images of certain objects </span>
```
And here's the final html:
```html
<p><span class="tab"><!-- tab --></span>
<span style="font-family: Times">Even though ‘ulama’ like </span>
<span style="font-family: Helvetica">Qaradawi</span>
<span style="font-family: Menlo Regular"> assume that</span>
<span style="font-family: Times"> images of certain objects
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/104
Incorrect font family applied to a paragraph
2018-03-28T17:08:25Z
Alex Theg

See Best, References:
Almost all the references are in `<p>`s or spans that designate `font-family: Arial`. But there is one entry that displays as Times New Roman in the html.
From the final html (rinsed): this snippet shows 2 bibliography entries. The first entry displays (incorrectly) as Times New Roman in the browser, while the second entry is a typical entry that displays correctly:
```html
<p style="margin-left: 36pt; padding-left: 36pt; text-indent: -36pt">Aud, Susan, William, Hussar, Frank Johnson, Grace Kena, Erin Roth, Eileen Manning, Xiaolel Wang, and Jijun Zhang. 2012. <i>The Condition of Education 2012</i>. Washington: National Center for Education Statistics. (http://nces.ed.gov/pubs2012/2012045.pdf--retrieved May 15, 2013)<span style="font-family: Times New Roman; font-size: 10.5pt"> </span></p>
<p style="font-family: Arial; font-size: 12pt; margin-left: 36pt; padding-left: 36pt; text-indent: -36pt">Aud, Susan, William, Hussar, Grace Kena, Kevin Bianco, Lauren Frohlich, Jana Kemp, and Kim Tahan. 2011. <i>The Condition of Education 2011</i>. Washington: National Center for Education Statistics. (http://nces.ed.gov/pubs2011/2011033.pdf--retrieved May 16, 2013). </p>
```
The correct entry above specifies `<p style="font-family: Arial;` but the first one doesn't.
Looking at the initial extraction shows the reason. While the paragraph consists mostly of spans with `style="font-family: Arial"`, there's one empty span at the very end with a `style="font-family: Times New Roman"`.
```html
<p style="margin-left: 36pt; text-indent: -36pt; padding-left: 36pt">
<span style="font-family: Arial; font-size: 12pt">Aud, S</span>
<span style="font-family: Arial; font-size: 12pt">usan, William, Hussar, Frank</span>
<span style="font-family: Arial; font-size: 12pt"></span>
<span style="font-family: Arial; font-size: 12pt">Johnson, Grace Kena, Erin Roth, Eileen Manning, Xiaolel Wang,
</span>
<span style="font-family: Arial; font-size: 12pt">and
</span>
<span style="font-family: Arial; font-size: 12pt">Jijun Zhang.
</span>
<span style="font-family: Arial; font-size: 12pt">2012.</span>
<span style="font-family: Arial; font-size: 12pt"></span>
<span style="font-family: Arial; font-size: 12pt"></span>
<span style="font-family: Arial; font-size: 12pt">
<i>The Condition of Education 2012</i>
</span>
<span style="font-family: Arial; font-size: 12pt">. Washington:
</span>
<span style="font-family: Arial; font-size: 12pt">National Center for Education Statistics.
</span>
<span style="font-family: Arial; font-size: 12pt">
(</span>
<span style="font-family: Arial; font-size: 12pt">http://nces.ed.gov/pubs2012/2012045.pdf--retrieved May 15</span>
<span style="font-family: Arial; font-size: 12pt">, 2013)</span>
<span style="font-family: Times New Roman; font-size: 10.5pt"></span>
</p>
```
It seems the fact that there are multiple font families used is enough to keep the font-family from being added into paragraph styles.
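One way around this would be to look only at spans that actually contain visible text when deciding whether a font can be hoisted to the paragraph. A rough Python sketch of that check is below. `style_dict` and `common_font` are made-up names, the sample is a shortened version of the extraction above, and this is not how XSweet currently does it.

```python
import xml.etree.ElementTree as ET


def style_dict(style):
    """Parse an inline style attribute into a {property: value} dict."""
    props = {}
    for decl in (style or "").split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip()] = value.strip()
    return props


def common_font(p):
    """Return the font-family shared by every span with visible text.

    Spans that are empty or contain only whitespace are skipped.
    Returns None when the non-empty spans disagree or declare no font.
    """
    fonts = {
        style_dict(span.get("style")).get("font-family")
        for span in p.iter("span")
        if "".join(span.itertext()).strip()
    }
    return fonts.pop() if len(fonts) == 1 else None


# Shortened version of the extracted paragraph.
PARAGRAPH = ET.fromstring(
    '<p style="margin-left: 36pt; text-indent: -36pt; padding-left: 36pt">'
    '<span style="font-family: Arial; font-size: 12pt">Aud, S</span>'
    '<span style="font-family: Arial; font-size: 12pt"></span>'
    '<span style="font-family: Arial; font-size: 12pt"><i>The Condition of Education 2012</i></span>'
    '<span style="font-family: Times New Roman; font-size: 10.5pt"> </span>'
    '</p>'
)
```

Skipping whitespace-only spans as well as truly empty ones matters here, because by the rinsed stage the trailing Times New Roman span contains a single space.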
Can you update this to ignore fonts that are applied to empty tags? That would keep invisible tags sprinkled into a Word doc from causing incorrect fonts in the html.
1.0.0

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/16
Don't pass bold into the editor
2017-05-25T16:36:22Z
Alex Theg

University of California Press doesn't use bold text to style their books, instead using italics for emphasis. For now, can we simply ignore and strip out the bolding when we pass the HTML through Typescript into Editoria?
There is a lot of semantic information in what is bolded. I wonder if in the future, the editors will want bolding to come into the editor as something else, like italics, or somehow flagged for attention. But for now, can we simply remove the `<b>` to `<strong>` mapping, and strip out the b tags?
Thanks!
Wendell Piez

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/15
Remove the `<!-- tab -->` comment inside `<span class="tab">`s when these spans are removed
2017-05-25T09:13:32Z
Alex Theg

Currently, tabs from Word come through into the HTML like so:
```html
<p><span class="tab"><!-- tab --></span>and sail on it with a fair wind, </p>
```
Then, in the last step of Typescript (reduce), the `<span class="tab">`s are removed. However, if I recall correctly, since this was originally implemented, we decided to add the `<!-- tab -->` comment into the `<span class="tab">`s, to keep the tab spans from being scrubbed out as empty elements.
The result is that the `<span class="tab">`s are removed by the final Typescript reduce step, but the `<!-- tab -->` comment remains.
```html
<p>
<!-- tab -->and sail on it with a fair wind, </p>
```
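One way to avoid the leftover comment is to remove the span and its placeholder comment in a single step, then sweep up any stray comments. A minimal string-level sketch in Python follows. The real reduce step is presumably XSLT, `strip_tab_spans` is a made-up name, and the patterns assume the markup looks exactly like the examples in this issue.

```python
import re

# A tab span, with or without its placeholder comment inside.
TAB_SPAN = re.compile(r'<span class="tab">\s*(?:<!--\s*tab\s*-->)?\s*</span>')
# A placeholder comment left behind after its span was already removed.
STRAY_TAB_COMMENT = re.compile(r'<!--\s*tab\s*-->')


def strip_tab_spans(html):
    """Remove tab spans together with their `<!-- tab -->` comments."""
    html = TAB_SPAN.sub("", html)
    return STRAY_TAB_COMMENT.sub("", html)


cleaned = strip_tab_spans(
    '<p><span class="tab"><!-- tab --></span>and sail on it with a fair wind, </p>'
)
```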
So, can we update the final reduce step of Typescript so that it removes these comments too? The eventual goal is to implement proper tabs inside the Editoria editor with CSS, so for now, I think we want to remove them.
Wendell Piez

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/14
Split paragraphs on breaks
2017-05-19T19:57:20Z
Alex Theg

XSweet captures breaks and carriage returns as either `<br class="cr"/>` or `<br class="br"/>`.
Typescript currently drops these altogether. Instead, it should split the paragraphs on the breaks. One thing to note is that the new paragraph should reuse the same styling from its parent paragraph.
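To make the intended behaviour concrete, here is a rough Python sketch of the split. `split_on_breaks` is a made-up name, the sample paragraph and its style are invented for illustration, and the sketch only handles `<br/>` elements that are direct children of the `<p>` (a break nested inside a span would need extra handling).

```python
import copy
import xml.etree.ElementTree as ET


def split_on_breaks(p):
    """Split one <p> into several at each top-level <br/>.

    Every resulting paragraph gets a copy of the original's attributes,
    so the styling of the parent paragraph is reused.
    """
    def new_paragraph():
        return ET.Element("p", dict(p.attrib))

    pieces = [new_paragraph()]
    pieces[-1].text = p.text
    for child in p:
        if child.tag == "br":
            # Text after the break starts the next paragraph.
            pieces.append(new_paragraph())
            pieces[-1].text = child.tail
        else:
            # deepcopy keeps the child's own tail text attached to it.
            pieces[-1].append(copy.deepcopy(child))
    return pieces


SOURCE = ET.fromstring(
    '<p style="margin-left: 36pt">no flood will carry you away.'
    '<br class="cr"/>nor will your <i>boat</i> be idle.</p>'
)
split = [ET.tostring(piece, encoding="unicode") for piece in split_on_breaks(SOURCE)]
```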
See https://gitlab.coko.foundation/wendell/XSweet/issues/103 for the back story.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/103
Capture carriage returns from Word
2017-06-07T20:17:26Z
Alex Theg

See Powell01.
There is a poem at the very end of this chapter, and many of the different lines are being combined into one paragraph, losing the returns that separate the lines. This is because the xml carriage return tag `<w:cr/>` comes through as an empty span in the extraction step. Can you please add the xml carriage return tag to the list of elements XSweet pays attention to?
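In other words, `<w:cr/>` would be treated like the other non-text run children. A rough Python sketch of that mapping is below. `run_to_html` is a made-up name; the output markup for tabs and breaks follows the `<span class="tab">` and `<br class="cr"/>` / `<br class="br"/>` conventions used elsewhere in these issues, and everything else about the sketch is an assumption rather than XSweet's actual extraction code.

```python
import html
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Run children that carry no text but still matter for layout.
EMPTY_ELEMENTS = {
    f"{{{W}}}tab": '<span class="tab"><!-- tab --></span>',
    f"{{{W}}}cr": '<br class="cr"/>',
    f"{{{W}}}br": '<br class="br"/>',
}


def run_to_html(run):
    """Convert the children of one w:r into an HTML fragment."""
    out = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            out.append(html.escape(child.text or "", quote=False))
        elif child.tag in EMPTY_ELEMENTS:
            out.append(EMPTY_ELEMENTS[child.tag])
        # Anything else (e.g. w:rPr) is ignored in this sketch.
    return "".join(out)


# Two poem lines separated by a carriage return, with formatting omitted.
POEM = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:tab/><w:t>no flood will carry you away.</w:t></w:r>'
    '<w:r><w:cr/></w:r>'
    '<w:r><w:tab/><w:t>nor will your boat be idle.</w:t></w:r>'
    '</w:p>'
)
fragment = "".join(run_to_html(r) for r in POEM.findall("w:r", NS))
```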
Compare the two examples below.
1. Here are two lines in Word that get combined into one `<p>` in the html. In the xml, they are separated by a carriage return tag `<w:cr/>`, and many lines like this are enclosed in one single `<w:p>`. This gets extracted as a series of spans inside one `<p>`.
```xml
<w:r w:rsidRPr="008B5A47">
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr><w:tab/>
<w:t>no flood will carry you away.</w:t>
</w:r>
<w:r w:rsidRPr="008B5A47">
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr><w:cr/></w:r>
<w:r w:rsidRPr="008B5A47">
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr><w:tab/>
<w:t>You will not taste the river’s evils,</w:t>
</w:r>
<w:r w:rsidRPr="008B5A47">
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr><w:cr/></w:r>
```
2. This is the xml for two lines that DO come through properly, as separate `<p>`s in the html. These use a close paragraph tag `</w:p>` for each line, instead of a `<w:cr/>`:
```xml
<w:p w14:paraId="0E41693A" w14:textId="77777777" w:rsidR="00CE3D10" w:rsidRPr="008B5A47" w:rsidRDefault="00CE3D10" w:rsidP="00CE3D10">
<w:pPr><w:spacing w:line="480" w:lineRule="auto"/>
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr>
</w:pPr>
<w:r w:rsidRPr="008B5A47">
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr><w:tab/>
<w:t>nor will your boat be idle.</w:t>
</w:r>
</w:p>
<w:p w14:paraId="7E6E92D5" w14:textId="77777777" w:rsidR="00CE3D10" w:rsidRPr="008B5A47" w:rsidRDefault="00CE3D10" w:rsidP="00CE3D10">
<w:pPr><w:spacing w:line="480" w:lineRule="auto"/>
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr>
</w:pPr>
<w:r w:rsidRPr="008B5A47">
<w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/></w:rPr><w:tab/>
<w:t>No accident will affect your mast,</w:t>
</w:r>
</w:p>
```
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/102
Inline italics lost because "font-style: italic" applied at paragraph level
2018-03-22T22:50:50Z
Alex Theg

In Powell00, the heading “[b]The Hittite-Hurrian Kingship in Heaven and The Song of Ullikummi” has some inline italics that don’t come through, because the paragraph has a `font-style: italic` applied to it. This means that the whole line shows in the browser in italics, and the differentiation between the normal text and the italics is hidden. Removing the paragraph-level italic styling means you can see the inline italics again.
Everything in the paragraph is contained within a series of spans - there is no text in this `<p>` that's not enclosed by another tag. In Word, the text in the spans that don't specify italics displays as normal, unitalicized text. But in the extracted version, the entire line is italicized. Put a different way, in Word, the spans in the paragraph that specify style seem to override *all* the paragraph-level style information, but in the extracted html, they don't.
Is there a good way to correct for this?
This is how it's extracted:
```html
<p style="font-family: Times New Roman; font-style: italic; color: #19191B">
<span style="font-family: Times New Roman; font-weight: bold; color: #19191B">
<b>[b]</b>
</span>
<span style="font-family: Times New Roman; color: #19191B">The Hittite-Hurrian
</span>
<span style="font-family: Times New Roman; font-style: italic; color: #19191B">
<i>Kingship in Heaven </i>
</span>
<span style="font-family: Times New Roman; color: #19191B">and </span>
<span style="font-family: Times New Roman; font-style: italic; color: #19191B">
<i>The Song of Ullikummi</i>
</span>
</p>
```
Interesting to note that this also gets promoted to a header, and the bolding, while preserved, is thus not apparent in a browser (it is if you change this back to a `<p>` again though).
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/101
Line break inside a paragraph
2017-06-07T20:17:25Z
Alex Theg

In the original Word file for Powell00 (5-10-17-conversions.zip), there’s a paragraph that contains a line break in it: `<w:br/>`
This comes between the very last line of the paragraph and the rest of it (between “...scholarly speculation about” and “‘interpolated’ non-genuine...”). It’s invisible in the original Word file, unless you change the font size of the paragraph, in which case you can see the paragraph reflow but preserve the break.
When it goes through XSweet, the break is preserved where it originally is:
```html
...of scholarly speculation about <br class="br" />“interpolated” non-genuine portions of the Homeric and Hesiodic poems. </p>
```
Right now, breaks don't go through Typescript, so it's simply dropped. But what might cause a line break to show up inside a paragraph in a Word file in the first place? Any ideas?
Here’s the paragraph from the original Word file xml:
```xml
<w:p w14:paraId="749A5FF6" w14:textId="77777777" w:rsidR="00687B92" w:rsidRPr="008B5A47" w:rsidRDefault="00687B92" w:rsidP="00687B92">
<w:pPr>
<w:spacing w:line="480" w:lineRule="auto"/>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="008B5A47">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:tab/>
<w:t xml:space="preserve">It is common to speak of “rhapsodic interpolations” in examining ancient texts, but they were probably uncommon, at least in archaic literature. A rhapsode may make up verses to suit his pleasure, but unless they are written down in the tradition that becomes canonical, that is copied and recopied, they do not survive. Therefore the texts of Homer and Hesiod that we possess </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:t>must be</w:t>
</w:r>
<w:r w:rsidRPr="008B5A47">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:t xml:space="preserve"> substantially the texts that these poets composed, recorded by dictation in the </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:t>eighth c</w:t>
</w:r>
<w:r w:rsidRPr="008B5A47">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:t>entury BC at the dawn of alphabetic literacy. Of course such texts are liable to the usual distortions that come from copying and recopying, but these distortions are always minor and do not affect the main narrative</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:t xml:space="preserve">, in spite of an inordinate amount of scholarly speculation about
</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:br/>
<w:t>“interpolated” non-genuine portions of the Homeric and Hesiodic poems
</w:t>
</w:r>
<w:r w:rsidRPr="008B5A47">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
</w:rPr>
<w:t xml:space="preserve">. </w:t>
</w:r>
</w:p>
```
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/100
Ignore all character style information from the paragraph level
2017-08-16T18:17:08Z
Alex Theg

Some paragraphs that look like normal text in Word are being extracted as bold, which has the added effect of sabotaging header promotion.
Here are 2 examples of where the extraction goes wrong:
[bBowenChapter1.docx](/uploads/ad1cea812824d5353237f94b735fe809/bBowenChapter1.docx)
[outputs_bowen_ch1.zip](/uploads/fb532748f7160c323193a5e4049bcb5d/outputs_bowen_ch1.zip)
Bad bolding:
* "In the second half of the 19th century..."
* "In a modern global food system characterized..."
* "When people want to show how protecting terroir..."
[bBowenChapter3.docx](/uploads/6de2759e35daa9478f546cda6904507b/bBowenChapter3.docx)
[outputs_bowen_ch3.zip](/uploads/cbe334e1702d943eb76caebbef932f09/outputs_bowen_ch3.zip)
Bad bolding:
* "The standard that regulates tequila production..."
* "As I discuss later in this chapter, several..."
* "Tequila’s regulatory infrastructure sounds like something..."
* "The main premise behind any DO is..."
In both of these cases, it stops almost all of the headings that should be promoted from being caught. Headers are denoted with bold. If you look at the digest-paragraph output, you'll see that p style correctly included in the assimilated list of discrete styles, but it doesn't make it through into the "filtered" group because the data-average-length gets skewed too high (120+ character average length) by the improperly bolded paragraphs, causing the style to be dropped from consideration.
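To show how a few mis-bolded paragraphs can knock a style out of consideration, here is a small Python sketch of an average-length filter. This is my reconstruction of the behaviour described above, not the actual digest-paragraph code: `heading_candidates` is a made-up name, the threshold of 120 comes from the observed cutoff, and the sample texts are invented.

```python
from statistics import mean

# Cutoff as observed in the outputs; the exact comparison is an assumption.
MAX_AVERAGE_LENGTH = 120


def heading_candidates(paragraphs):
    """Return the styles whose average paragraph length is under the cutoff.

    `paragraphs` is an iterable of (style, text) pairs.
    """
    lengths = {}
    for style, text in paragraphs:
        lengths.setdefault(style, []).append(len(text))
    return {
        style
        for style, values in lengths.items()
        if mean(values) < MAX_AVERAGE_LENGTH
    }


# Invented sample data: two real headings in bold...
HEADINGS = [("bold", "Introduction"), ("bold", "The Tequila Industry")]
# ...plus one long body paragraph that was wrongly extracted as bold.
MISBOLDED = HEADINGS + [("bold", "x" * 400)]

clean_result = heading_candidates(HEADINGS)
skewed_result = heading_candidates(MISBOLDED)
```

With only the two headings the bold style averages 16 characters and survives. Adding the single 400-character paragraph pushes the average to 144, and the style is dropped.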
So, fixing the underlying incorrect bolding issue will kill two birds. I will also keep an eye out for other places a "<120char" rule stops correct promotions.Wendell PiezWendell Piezhttps://gitlab.coko.foundation/XSweet/XSweet/-/issues/99Split header promotion logic by display and by Style into 2 distinct pathways2017-04-23T22:06:11ZAlex ThegSplit header promotion logic by display and by Style into 2 distinct pathwaysContinuing after implementing #81, @wendell is separating these two different approaches out from each other:
1. all the logic to consider header promotion based just on display properties
2. capturing information from Word styles, which will happen eventually but has been turned off for now.Wendell PiezWendell Piezhttps://gitlab.coko.foundation/XSweet/XSweet/-/issues/98"Chapter One - " text dropped from extraction2018-03-15T20:05:00ZAlex Theg"Chapter One - " text dropped from extractionBoyles Ch 1 [e._Boyles_Chapter_1.docx](/uploads/86a58da52d6b60a5568a284f3225f592/e._Boyles_Chapter_1.docx)
[e.BoylesChapter1-1EXTRACTED.html](/uploads/dee5c6517f5f46d846e33d0922fa20cb/e.BoylesChapter1-1EXTRACTED.html)
[e.BoylesChapter1-9RINSED.html](/uploads/da01d16780e6f8cc2dd7e8b02877db1c/e.BoylesChapter1-9RINSED.html)
The top-level header reads "Chapter One - Introduction" in Word. But the text "Chapter One - " is not coming through from Word into HTML. It seems to have been implemented as some kind of list in Word, but I can't find any evidence of it in the initial extraction.
![Screen_Shot_2017-04-21_at_1.42.36_AM](/uploads/55402036b54e43e7429bb576ce93278c/Screen_Shot_2017-04-21_at_1.42.36_AM.png)
Is there an easy fix?Wendell PiezWendell Piezhttps://gitlab.coko.foundation/XSweet/XSweet/-/issues/97"Chapter One -2017-04-21T13:39:08ZAlex Theg"Chapter One -https://gitlab.coko.foundation/XSweet/XSweet/-/issues/96Only use header promotion inside `<div class="docx-body">`2017-04-21T22:46:50ZAlex ThegOnly use header promotion inside `<div class="docx-body">`[e.BoylesChapter3-9RINSED.html](/uploads/b7dcd4064381d3b1cfe6cf81c7a14700/e.BoylesChapter3-9RINSED.html)
In Boyles Ch 3, most of the footnotes get promoted to headers. Can header promotion be limited to what's inside the `<div class="docx-body">`?Wendell PiezWendell Piezhttps://gitlab.coko.foundation/XSweet/XSweet/-/issues/95Handle lists2017-08-09T19:00:55ZAlex ThegHandle listsKohl-Arenas, Ch 4:
[b_kohl-arenas_ch4.docx](/uploads/c6e9bbf254f76892d7c88f2b3ebc7b98/b_kohl-arenas_ch4.docx)
This is an example of list items that should eventually be extracted and ported into Editoria.
Search for "One in five male subjects" for the first list item.Alex ThegAlex Theghttps://gitlab.coko.foundation/XSweet/XSweet/-/issues/94Inconsistent indentation2018-03-15T16:27:37ZAlex ThegInconsistent indentationSee Gilbert ch 2: [b06_ch02.docx](/uploads/6bd821eeabee201b9cd021c00aff5382/b06_ch02.docx)
The indented spot starting with "Gramma?", has the same indentation level as the dialogue right below it:
![Screen_Shot_2017-04-19_at_11.09.32_AM](/uploads/c383012829026bab0f8fe39033c0547c/Screen_Shot_2017-04-19_at_11.09.32_AM.png)
But in the HTML, there are 3 lines that have `padding-left: 18pt`, which makes them appear more indented than they do in the Word doc:
“How come some of them have their shawls over their heads?”
“Nisht shawl,” Gramma corrects, “tallis.”
“But why are they bowing up and down?”
Low priority as indentation doesn't port into Editoria.1.0.0https://gitlab.coko.foundation/XSweet/XSweet/-/issues/93Normalize invisible formatting on single spaces2017-05-26T12:46:07ZAlex ThegNormalize invisible formatting on single spacesSee Weaver, Ch 1: [b05_ch01.docx](/uploads/e4065a9648f08d3aa103cc2e1b96b74e/b05_ch01.docx)
Low priority improvement for later.
Sometimes, there will be single spaces that are formatted differently than the surrounding text. In this example, there are 2 bold spaces. While this doesn't affect how it looks to the eye, it leads to cluttered formatting under the hood.
```html
To which he might have replied<b> </b>as he had to a showbiz promoter 25 years earlier:<b> </b>“Act? That was no act; that
```
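As a rough illustration of scrubbing tags like the ones above, here is a Python sketch. It is only a sketch under stated assumptions: the real cleanup would presumably be an XSLT step in the XSweet pipeline, the function name is hypothetical, and the regex assumes plain `<b>`, `<i>`, and `<u>` tags with no attributes.

```python
import re

# Matches <b>, <i>, or <u> elements (no attributes) whose only content
# is one or more spaces or tabs.
WHITESPACE_ONLY_TAG = re.compile(r"<(b|i|u)>([ \t]+)</\1>")


def scrub_whitespace_formatting(html):
    """Replace whitespace-only bold/italic/underline tags with the bare
    whitespace they enclose (hypothetical helper)."""
    previous = None
    while previous != html:  # repeat so nested cases like <b><i> </i></b> collapse too
        previous = html
        html = WHITESPACE_ONLY_TAG.sub(r"\2", html)
    return html


sample = (
    "To which he might have replied<b> </b>as he had to a showbiz "
    "promoter 25 years earlier:<b> </b>“Act?"
)
cleaned = scrub_whitespace_formatting(sample)
```

The whitespace itself is kept and only the wrapper is dropped, so the visible text does not change.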
Can you implement a rule that looks for tags that
1. enclose only a space or tab, and
2. denote bold, underline, or italics
And then scrub these out in a cleanup step?https://gitlab.coko.foundation/XSweet/XSweet/-/issues/92Header promotion example for Alex to check2019-07-07T22:40:39ZAlex ThegHeader promotion example for Alex to checkIn Gilbert, fwd: [a03_fwd.docx](/uploads/69b43e22a61426ec09e83037cf875c35/a03_fwd.docx)
Why is "Holly Near" not labeled a header, but "12/21/2014" is?
For Alex to check after next iteration of header promotion1.0.0Alex ThegAlex Theghttps://gitlab.coko.foundation/XSweet/XSweet/-/issues/91Centering not captured in HTML2018-03-14T19:51:20ZAlex ThegCentering not captured in HTMLGilbert: [a01_ti.docx](/uploads/3700d924a92610925a5f80dab0c0031e/a01_ti.docx)
The last line is centered in the Word doc but not in the HTML. How come?
On hold, as text alignment doesn't port into Editoria for now.https://gitlab.coko.foundation/XSweet/XSweet/-/issues/90Some indentations not reflected in HTML2018-03-22T19:18:06ZAlex ThegSome indentations not reflected in HTMLGarcia ch 4: [d_Chapter4GGv5.docx](/uploads/394c8e688977abb7158be3801f068a21/d_Chapter4GGv5.docx)
The first 3 paragraphs are indented in the HTML as `<p style="text-indent: 36pt">`, but starting with the 4th paragraph ("The urban structures..."), the indentation stops. This appears to come from the styles: the first 3 paragraphs are "Normal + First Line: 0.5", but from the 4th on they are just "Normal". However, these paragraphs are indeed indented when displayed in Word.
Is there a way to respect these 1st-line indentations that aren't coming through? This is related to #81 in that we want to recreate what's actually displayed in Word.1.0.0Wendell PiezWendell Piez