XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2018-08-07T14:24:43Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/142
CSS for hanging paragraphs
2018-08-07T14:24:43Z
Alex Theg

XSweet extracts regular paragraph indentation from Word into CSS correctly, but it needs a tweak to how it handles hanging paragraphs.
Indentation without hanging works great:
One indent no hanging: `<w:ind w:left="720"/>` -> `<p style="margin-left: 36pt">`
Two indent no hanging: `<w:ind w:left="1440"/>` -> `<p style="margin-left: 72pt">`
But the indentation with hanging needs another CSS property to be correct:
One indent hanging:
`<w:ind w:left="1440" w:hanging="720"/>` -> `<p style="padding-left: 36pt; text-indent: -36pt">`
It needs a `margin-left: 36pt;` added in addition to what's already there to be correct.
Two indent hanging:
`<w:ind w:left="2160" w:hanging="720"/>` -> `<p style="padding-left: 36pt; text-indent: -36pt">`
It needs a `margin-left: 72pt;` added in addition to what's already there and then it's correct.
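To make the expected mapping concrete, here is a minimal sketch in Python (not XSweet's actual XSLT). It assumes `w:left` and `w:hanging` are given in twentieths of a point, as the 720 -> 36pt examples above imply; the function name `indent_to_css` is hypothetical.

```python
def indent_to_css(left_twips: int = 0, hanging_twips: int = 0) -> str:
    """Map w:ind left/hanging values (twentieths of a point) to inline CSS.

    Hypothetical helper for illustration only, not part of XSweet.
    """
    def pt(twips: int) -> str:
        # 20 twips per point; ":g" drops a trailing ".0" (36.0 -> "36").
        return f"{twips / 20:g}pt"

    if hanging_twips == 0:
        # Plain indentation: the whole paragraph moves right.
        return f"margin-left: {pt(left_twips)}" if left_twips else ""

    parts = []
    # The first line starts at (left - hanging); that offset is the margin.
    margin = left_twips - hanging_twips
    if margin:
        parts.append(f"margin-left: {pt(margin)}")
    # Padding plus a negative text-indent produces the hang itself.
    parts.append(f"padding-left: {pt(hanging_twips)}")
    parts.append(f"text-indent: -{pt(hanging_twips)}")
    return "; ".join(parts)
```

With this rule, `left="1440" hanging="720"` yields a 36pt margin and `left="2160" hanging="720"` yields a 72pt margin, matching the two corrections requested above. When `left` equals `hanging`, no `margin-left` is emitted at all.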
Here's a test docx: [hanging.docx](/uploads/459ecfb10d4e6c42caf16f4983c52142/hanging.docx)
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/140
Formatting issues with nested spans and Word styles
2018-05-01T13:52:44Z
Alex Theg

This is somewhat related to #131
[small_caps_example.docx](/uploads/c5e9961e5f3248e8d6ace6c892d97048/small_caps_example.docx)
In the attached example, "Acknowledgements" comes through in bold and small caps but it should not - it uses the Word style "BookTitle + Not Bold, Not Small caps". It looks like this is a question of nested spans, the priority in which the formatting is resolved, and how Word style modifiers are extracted into the html.
Here's the html after the join step:
```html
<p style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">
<span style="font-weight: normal">Acknowledgements</span>
</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
</p>
```
Here it is after the collapse step. At this point, I believe the innermost span's `font-weight: normal` should have been passed to the outer `class="BookTitle"` span, but it is not:
```html
<p style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">Acknowledgements</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
</p>
```
And here is the final rinsed html:
```html
<h2 style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">Acknowledgements</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0"><!-- bookmark ='_GoBack'--></a>
<a href="#docx-bookmark_0"><!-- bookmark end --></a>
</h2>
```
And, the `font-variant: normal` needs to be passed down to the innermost span, or else it's clobbered by the `BookTitle` styling on the innermost span.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/138
Spaces between superscripts dropped in join step
2019-07-09T19:02:41Z
Alex Theg

This may or may not be related to https://gitlab.coko.foundation/XSweet/XSweet/issues/44
When superscripts are separated by regular spaces, in the join step, they are collapsed into one superscript and the separating spaces are dropped entirely. Example:
Join input:
```html
<p>...impact-related injury.<sup>6</sup> <sup>7</sup> <sup>8</sup> Then, there...</p>
```
Join output:
```html
<p>...linear impact-related injury.<sup>678</sup> Then, there...</p>
```
Instead, the non-superscript spaces should stay where they are.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/136
First greedy hyperlinker error
2018-04-05T17:25:48Z
Alex Theg

Related to #3
I have found the first false hit for the new hyperlink recognizer.
A sentence-starting ellipsis implemented with three periods causes a false hyperlink:
In this:
> "...the lion sleeps tonight."
This gets hyperlinked:
> "...the
Note that this does _not_ happen if an ellipsis character is used. This is fine:
> "…the lion sleeps tonight."

1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/135
"Ambiguous rule match" warning in the terminal
2018-03-28T16:42:03Z
Alex Theg

Hi @wendell, you've fixed that other terminal warning I'd pinged you about in mattermost, but there's another one. It often doesn't show up in short documents like frontmatter and so forth, but it does appear on lots of larger chapter files.
The issue comes from lines 602 and 611 of the `docx-html-extract.xsl` sheet - something to do with the bolding extraction. Could you check this out and make sure it's not causing issues?
Here's an example from Gilbert:
```
converting: b13_ch09
Warning
XTDE0540: Ambiguous rule match for /w:styles/w:style[3]/w:rPr[1]/w:b[1]
Matches both
"element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)[not((data(attribute::attribute(Q{}val))) = ("0", "none"))]" on line 611 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
and "element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)" on line 602 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
Warning
XTDE0540: Ambiguous rule match for /w:styles/w:style[3]/w:rPr[1]/w:b[1]
Matches both
"element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)[not((data(attribute::attribute(Q{}val))) = ("0", "none"))]" on line 611 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
and "element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)" on line 602 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
Warning
XTDE0540: Ambiguous rule match for /w:styles/w:style[3]/w:rPr[1]/w:b[1]
Matches both
"element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)[not((data(attribute::attribute(Q{}val))) = ("0", "none"))]" on line 611 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
and "element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)" on line 602 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/132
Invisible bib entry in Horton visible in HTML
2019-07-07T23:05:58Z
Alex Theg

How come? For Alex to investigate.
"U.S. Dept. of Labor. 2006. Census of Fatal Occupational Injuries."
XML:
```xml
<w:p w14:paraId="59927B05" w14:textId="77777777" w:rsidR="00DA5911" w:rsidRPr="00DA5911" w:rsidRDefault="00DA5911" w:rsidP="00DA5911">
<w:pPr>
<w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
<w:spacing w:line="0" w:lineRule="auto"/>
<w:rPr>
<w:rFonts w:ascii="ff6" w:eastAsia="Times New Roman" w:hAnsi="ff6" w:cs="Times New Roman"/>
<w:color w:val="231F20"/>
<w:sz w:val="102"/>
<w:szCs w:val="102"/>
</w:rPr>
</w:pPr>
<w:proofErr w:type="gramStart"/>
<w:r w:rsidRPr="00DA5911">
<w:rPr>
<w:rFonts w:ascii="ff6" w:eastAsia="Times New Roman" w:hAnsi="ff6" w:cs="Times New Roman"/>
<w:color w:val="231F20"/>
<w:sz w:val="102"/>
<w:szCs w:val="102"/>
</w:rPr>
<w:t>U.S. Dept. of Labor.</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
<w:r w:rsidRPr="00DA5911">
<w:rPr>
<w:rFonts w:ascii="ff6" w:eastAsia="Times New Roman" w:hAnsi="ff6" w:cs="Times New Roman"/>
<w:color w:val="231F20"/>
<w:sz w:val="102"/>
<w:szCs w:val="102"/>
</w:rPr>
<w:t xml:space="preserve">2006. Census of Fatal Occupational Injuries.</w:t>
</w:r>
</w:p>
<w:p w14:paraId="63865577" w14:textId="77777777" w:rsidR="00DA5911" w:rsidRPr="00DA5911" w:rsidRDefault="00DA5911" w:rsidP="00DA5911">
<w:pPr><w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/><w:spacing w:line="0" w:lineRule="auto"/>
<w:rPr><w:rFonts w:ascii="ff6" w:eastAsia="Times New Roman" w:hAnsi="ff6" w:cs="Times New Roman"/><w:color w:val="231F20"/><w:sz w:val="102"/><w:szCs w:val="102"/></w:rPr>
</w:pPr>
<w:r w:rsidRPr="00DA5911">
<w:rPr><w:rFonts w:ascii="ff6" w:eastAsia="Times New Roman" w:hAnsi="ff6" w:cs="Times New Roman"/><w:color w:val="231F20"/><w:sz w:val="102"/><w:szCs w:val="102"/></w:rPr>
<w:t xml:space="preserve">Washington, D.C., Bureau of Labor Statistics.
</w:t>
</w:r>
</w:p>
```
HTML:
```html
<p>
<span style="font-family: ff6; color: #231F20; font-size: 51pt">U.S. Dept. of Labor.</span>
<span style="font-family: ff6; color: #231F20; font-size: 51pt"> 2006. Census of Fatal Occupational Injuries.</span>
</p>
<p>
<span style="font-family: ff6; color: #231F20; font-size: 51pt">Washington, D.C., Bureau of Labor Statistics. </span>
</p>
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/131
Word styles with ' + Not [bold/italics/etc.]" extract incorrectly
2019-07-07T22:07:59Z
Alex Theg

I'm hopeful that this will fix a lot of the remaining formatting problems.
In Word, named styles can be modified by adding or removing formatting, e.g.:
* "Normal + Bold"
* "Emphasis + Not Italic"
These modifications are applied by specifying a `w:val` on the formatting tag. A value of 1 turns the formatting on; a value of 0 turns it off. For example, this means "No Italics": `<w:i w:val="0"/>`.
Adding formatting on top of styling works correctly, but whenever "+ Not [format]" is specified, it's extracted as just the opposite: that formatting turns on when it should turn off. The extraction doesn't take the `val` attribute into account, but should.
Whenever `w:val="0"` (or `w:val="none"` for underlining) appears in any of the following tags...
* `<w:i w:val="0"/>`
* `<w:b w:val="0"/>`
* `<w:u w:val="none"/>`

...then XSweet should turn that formatting off. The formatting could be specified on the paragraph level, or in a span inside a paragraph, so finding it and turning it off might be the trickiest part. Perhaps we could preserve these until all the styling has been promoted to the `<p style=>`, and then check for the formatting to turn off. Or, is there some tag or way to turn off inline formatting regardless of whether it's already applied or not? If so, that might be simplest.
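As a rough illustration of the rule being asked for, here is a small Python sketch (not the extraction XSLT itself). The helper names are hypothetical. Treating `"false"` and `"off"` as off values is an assumption based on the OOXML on/off type; the issue itself only mentions `"0"` and `"none"`.

```python
# Values of w:val that switch a toggle off. "0" and "none" come from the
# issue text; "false" and "off" are assumed from the OOXML on/off type.
OFF_VALUES = {"0", "false", "off", "none"}

# Hypothetical mapping from Word toggle tags to (property, on value, off value).
CSS_FOR_TAG = {
    "w:b": ("font-weight", "bold", "normal"),
    "w:i": ("font-style", "italic", "normal"),
    "w:u": ("text-decoration", "underline", "none"),
}


def toggle_is_on(val=None):
    """Return True when a toggle tag switches formatting on.

    `val` is the w:val attribute, or None when the attribute is absent
    (a bare <w:i/> means "on").
    """
    if val is None:
        return True
    return val.strip().lower() not in OFF_VALUES


def css_for_toggle(tag, val=None):
    """Emit an explicit CSS declaration for a toggle tag, on or off."""
    prop, on_value, off_value = CSS_FOR_TAG[tag]
    return f"{prop}: {on_value if toggle_is_on(val) else off_value}"
```

Emitting an explicit "off" declaration such as `font-style: normal` is one possible answer to the last question above: it overrides inherited formatting whether or not that formatting was applied further up.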
Here are 2 examples from Braggs Ch 1:
## Example 1
This paragraph is erroneously extracted as italic. In the Word doc, there's no visible change of formatting between this paragraph and the previous one, but in fact the named Word style changes from "Normal (Web) + Auto" to "Emphasis + Not Italics":
```xml
<w:p w14:paraId="3E1B900F" w14:textId="6A38EA12" w:rsidR="00CB4328" w:rsidRPr="004F60F8" w:rsidRDefault="00CB4328" w:rsidP="0036427E">
<w:pPr>
<w:spacing w:line="480" w:lineRule="auto"/>
<w:jc w:val="both"/>
</w:pPr>
<w:r w:rsidRPr="004F60F8">
<w:rPr>
<w:rStyle w:val="Emphasis"/>
<w:i w:val="0"/>
</w:rPr>
<w:t>Throughout his works, Césaire</w:t>
</w:r>
<w:r w:rsidR="004A368E" w:rsidRPr="004F60F8">
<w:rPr>
<w:rStyle w:val="Emphasis"/>
<w:i w:val="0"/>
</w:rPr>
<w:t xml:space="preserve">claimed</w:t>
</w:r>
```
In the above, `<w:rStyle w:val="Emphasis"/>` specifies italics (in the CSS extracted to the top of the HTML). Then, the `<w:i w:val="0"/>` (not italics) should turn the italics back off, but it doesn't:
```html
<p style="text-align: justify">
<span style="font-style: italic">
<span class="Emphasis">Throughout his works, Césaire</span>
</span>
<span style="font-style: italic">
<span class="Emphasis"> claimed</span>
</span>
```
## Example 2
Footnote 15 comes through in bold in the HTML but not in Word. Its Word style is "Heading1 + Times New Roman, 10pt, Not Bold". So, the paragraph inherits bolding from its "Heading1" style. Then, whereas `<w:b w:val="0"/>` should remove the bold formatting, it doesn't.
Word XML:
```xml
<w:endnote w:id="15">
<w:p w14:paraId="7FAB95D6" w14:textId="2E2795A2" w:rsidR="0005708E" w:rsidRPr="008B0C6C" w:rsidRDefault="0005708E" w:rsidP="00EF6AD7">
<w:pPr>
<w:pStyle w:val="Heading1"/>
<w:spacing w:before="0" w:beforeAutospacing="0" w:after="0" w:afterAutospacing="0"/>
<w:jc w:val="both"/>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="008B0C6C">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
<w:t xml:space="preserve">Bryan Wagner introduces an insightful analysis</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
<w:t xml:space="preserve">of
</w:t>
</w:r>
```
And the HTML, incorrectly bolded:
```html
<div class="docx-endnote" id="en15">
<p class="Heading1" style="margin-top: 0pt; margin-bottom: 0pt; xsweet-outline-level: 0; font-family: Times; font-weight: bold; font-size: 24pt; text-align: justify">
<span style="font-size: 10pt; font-family: Times New Roman">
...
<span style="font-family: Times New Roman; font-size: 10pt">
<lang>Bryan Wagner introduces an insightful analysis </lang>
</span>
<span style="font-family: Times New Roman; font-size: 10pt">
<lang>of </lang>
</span>
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/124
Ignore indentation specified at the paragraph level
2018-03-15T16:27:37Z
Alex Theg

See b_Schuster_Captns
In this example, some of the entries have hanging indentation specified on them. That hanging indentation is extracted into the HTML as `margin-left: 67.5pt; text-indent: -67.5pt; padding-left: 67.5pt`. In the Word doc, these entries are flush left for the first line, but hanging on subsequent lines:
![word_doc](/uploads/db9596faf0018aa494fe7eb9caf3c5a3/word_doc.png)
But in the HTML, the entire entry is indented.
![browser_display](/uploads/4ed14d6b36bde033eb16d5483aabe3d8/browser_display.png)
The paragraph is initially extracted with 3 indent-related styles associated with it:
`<p style="margin-left: 67.5pt; padding-left: 67.5pt; text-indent: -67.5pt">`
This properly achieves the hanging indentation on lines after the first one, but the `margin-left: 67.5pt` shouldn't be there, as it improperly indents the whole entry. Removing that attribute moves the whole paragraph back to its proper flush left position.
The `margin-left` comes from the p-level style. I believe this is something that should be ignored entirely.
Word XML:
```xml
<w:p w:rsidR="00F069B3" w:rsidRDefault="00F069B3" w:rsidP="00F069B3">
<w:pPr>
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">Fig. 4 (Ch. 1):
</w:t>
</w:r>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">A credit counselor at
</w:t>
</w:r><w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>Fundación</w:t>
</w:r><w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve"></w:t>
</w:r><w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>Paraguaya</w:t>
</w:r><w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">
working on the
</w:t>
</w:r><w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t>Ikatú</w:t>
</w:r><w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve"></w:t>
</w:r>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>project</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00F069B3" w:rsidRDefault="00F069B3" w:rsidP="00F069B3">
<w:pPr>
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
</w:pPr>
</w:p>
<w:p w:rsidR="00F069B3" w:rsidRDefault="00F069B3" w:rsidP="00F069B3">
<w:pPr><w:ind w:left="1350" w:hanging="1350"/>
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">Fig. 5 (Ch. 2):
</w:t>
</w:r>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>Paraguay Land Company, Limited 1889; To be issued for Land Warrants of the Government of Paraguay, at the rate of 2 fully-paid Shares for each £100 Warrant</w:t>
</w:r>
</w:p>
```
HTML:
```html
<p>Fig. 4 (Ch. 1): <i>A credit counselor at </i><i>Fundación</i><i> </i><i>Paraguaya</i><i> working on the </i>Ikatú <i>project</i></p>
<p/>
<p style="margin-left: 67.5pt; text-indent: -67.5pt; padding-left: 67.5pt">Fig. 5 (Ch. 2): <i>Paraguay Land Company, Limited 1889; To be issued for Land Warrants of the Government of Paraguay, at the rate of 2 fully-paid Shares for each £100 Warrant</i></p>
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/123
Logical switch for type of header promotion to use
2018-07-27T13:59:00Z
Alex Theg

We now have implementations of at least two different header promotion strategies that can be useful for importing book chapters:
1. The original, classic header promotion based on analysis of font size, formatting, and an analysis of the entire document (pretty good right now)
2. Header promotion based upon the outline levels of a document (some tweaks and further testing required)
Classic header promotion works best with docxs that aren't structured with Word styles; in these instances, outline-level analysis doesn't work at all because there aren't any outline levels specified. But if the author has used Word styles, outline level header promotion seems to work better than the classic approach more often than not. Combining the two promotion strategies together doesn't work well - it should be one approach or the other.
At this point, what we need is a logical switch that analyzes the document and decides whether to use the classic approach or the outline level approach. I don't know exactly what the criteria to make the decision should be - probably something along the lines of counting the number of times outline levels are specified in Word as a proxy for whether it's structured or not - but I can propose something for that. In the meantime, @wendell do you have ideas about how such a one-track-or-the-other switch could be implemented in the pipeline? Laying the groundwork for this seems like the next thing to do for header promotion.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/122
Reconcile font size from Word style and inline font style
2017-12-02T01:08:41Z
Alex Theg

It seems there needs to be an additional check in the scrub step to ensure that any inline font size specifications (perhaps other formatting too?) being removed as redundant match the formatting on the paragraph. Currently, it looks like there's only a check that all the inline specifications are consistent among themselves, and named Word styles can mess that up.
Here's an illustrative example. In Bakker Ch 1, there is a heading that reads "The heroic migrant and the end of migration". In the docx, this is 12pt font, just like the rest of the file, but this ends up as 11pt font in the HTML.
This text is Word style "Default". In the `styles.xml` sheet, "Default" text is 11pt font:
```xml
<w:style w:type="paragraph" w:customStyle="1" w:styleId="Default">
<w:name w:val="Default"/>
<w:rPr>
<w:rFonts w:ascii="Helvetica" w:hAnsi="Arial Unicode MS" w:cs="Arial Unicode MS"/>
<w:color w:val="000000"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
</w:rPr>
</w:style>
```
In the docx, this text was styled "Default" but then changed to font size 12. In the docx xml, font size 12 is specified at the paragraph level, and on the runs themselves (I've included the Word xml at the end of this ticket).
In the initial html extraction, the paragraph specifies font size 11 (which comes from the style), but everything in the `p` specifies font size 12 inline:
```html
<p class="Default" style="font-family: Helvetica; font-size: 11pt; margin-bottom: 6pt">
<span style="font-size: 12pt">
<i><iCs>The heroic migrant and the end of migration</iCs></i>
</span>
<span style="font-size: 12pt">
<tab/>
</span>
</p>
```
The inline 12pt specification gets stripped from the whitespace-only element after the join step:
```html
<p class="Default" style="font-family: Helvetica; font-size: 11pt; margin-bottom: 6pt">
<span style="font-size: 12pt">
<i>The heroic migrant and the end of migration</i>
</span>
<tab/>
</p>
```
In the collapse step, the 12pt inline specification disappears, leaving only the 11pt font specified on the `p`. That's where the error is introduced:
```html
<p class="Default" style="font-family: Helvetica; font-size: 11pt; margin-bottom: 6pt">
<i>The heroic migrant and the end of migration</i>
<tab/>
</p>
```
I think the inline font size is removed because all the inline styling in the `p` is consistent. But what's missing is a step to check that the inline formatting matches the `p`-level formatting.
Is that something that you could add to the scrub step?
* if inline formatting is about to be removed because it can be handled at the `p` level, confirm that the inline formatting matches the `p`-level formatting
* if there are conflicts, the inline formatting should take precedence, and the `p` formatting should be updated to reflect the inline formatting.
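
The two rules above can be sketched in Python on plain style strings. This is only an illustration of the proposed check, not the scrub XSLT. The function names are hypothetical, and it assumes every piece of text in the `p` sits inside one of the spans being compared.

```python
def parse_style(style):
    """Parse 'a: b; c: d' into an ordered dict of declarations."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip()] = value.strip()
    return props


def format_style(props):
    """Serialize a dict of declarations back into a style string."""
    return "; ".join(f"{name}: {value}" for name, value in props.items())


def reconcile(p_style, span_styles):
    """Fold declarations shared by every span into the paragraph style.

    On conflict the inline value wins and overwrites the p-level value.
    Returns the new paragraph style and the leftover style of each span.
    """
    spans = [parse_style(style) for style in span_styles]
    if not spans:
        return p_style, []
    # A declaration is "shared" only if every span carries the same value.
    shared = {
        name: value
        for name, value in spans[0].items()
        if all(span.get(name) == value for span in spans)
    }
    p_props = parse_style(p_style)
    p_props.update(shared)  # inline formatting takes precedence
    leftovers = [
        format_style({n: v for n, v in span.items() if n not in shared})
        for span in spans
    ]
    return format_style(p_props), leftovers
```

Applied to the Bakker example, the `p` style would end up with `font-size: 12pt` instead of `11pt` before the redundant inline span is dropped.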
For reference, here's the original Word xml:
```xml
<w:p w14:paraId="5617B83D" w14:textId="77777777" w:rsidR="001B3A68" w:rsidRDefault="00674026">
<w:pPr>
<w:pStyle w:val="Default"/>
<w:tabs>
<w:tab w:val="left" w:pos="720"/>
</w:tabs>
<w:spacing w:after="120" w:line="480" w:lineRule="auto"/>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:u w:color="000000"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:i/>
<w:iCs/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:u w:color="000000"/>
</w:rPr>
<w:t>The heroic migrant and the end of migration</w:t>
</w:r>
<w:r>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:u w:color="000000"/>
</w:rPr>
<w:tab/>
</w:r>
</w:p>
```
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/121
Exclude all data in tables from header promotion analysis
2017-10-30T18:22:07Z
Alex Theg

Very similar to the list item header promotion issue (#120), we should stop table data from being promoted to headers.
As with lists, my first preference would be that tables are ignored entirely for header analysis and actual header promotion consideration. If that's hard though, just preventing table data from being actually promoted to headers would probably be sufficient.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/120
Don't consider lists for header promotion
2017-10-30T20:34:37Z
Alex Theg

We can be sure that lists won't contain headers, so we should implement a rule to keep list items from being promoted. We could either:
* not consider lists and list items in the entire-document header analysis, or
* allow list items to be considered in the overall document analysis (to calculate things like average run size, etc.) but never allow them to be promoted themselves.
My strong preference is to:
* exclude lists/list items from the overall whole-document analysis
* exclude list items from consideration for header promotion
If, for some reason, it's very difficult to exclude lists from the whole-document analysis, then it seems like it would be fine to only exclude list items from consideration for promotion to headers - that is the important part.
What do you think?
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/119
Preserve white space in tables
2017-10-03T19:23:31Z
Alex Theg

In tables in particular, more so than in a general content context, whitespace can be important. I think it would be wise to try to faithfully preserve it as best as possible.
See zMarks_Table1 for an example: the author has achieved indentation to indicate a hierarchy within cells using plain old spaces.
Curious what you think the best approach would be @wendell? `<pre>`s? Convert runs of multiple spaces in a row in tables into CSS? Indentation specified in CSS currently comes through fine.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/118
Tables: capture vertical cell alignment
2017-10-05T18:54:20Z
Alex Theg

Hi @wendell, let's try to capture these vertical alignments in cells:
The vertical alignment property for Word table cells is specified in the `<w:tcPr>`; this generally appears only on cells that aren't vertically aligned to "top", which is the default and generally not specified:
* `<w:vAlign w:val="top"/>`
* `<w:vAlign w:val="center"/>`
* `<w:vAlign w:val="bottom"/>`
http://officeopenxml.com/WPtableCellProperties-verticalAlignment.php
These map to the following CSS attributes:
* `vertical-align: top`
* `vertical-align: middle`
* `vertical-align: bottom`
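
For reference, the mapping above is small enough to write as a lookup table. This is a hedged Python sketch rather than the XSLT that would actually go into the extraction sheet; `cell_vertical_align` is a hypothetical name, and falling back to `top` for a missing `w:vAlign` follows the default described above.

```python
# w:vAlign w:val -> CSS vertical-align value, as listed in this issue.
VALIGN_TO_CSS = {
    "top": "top",
    "center": "middle",
    "bottom": "bottom",
}


def cell_vertical_align(val=None):
    """Return the CSS declaration for a cell's w:vAlign value.

    A missing w:vAlign (val is None) is treated as "top", the Word default.
    Unrecognized values return None so the caller can skip the declaration.
    """
    css_value = VALIGN_TO_CSS.get("top" if val is None else val)
    if css_value is None:
        return None
    return f"vertical-align: {css_value}"
```

Whether to emit `vertical-align: top` for cells with no `w:vAlign` at all, or to leave the declaration off, is a design choice left open here.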
Here's a test document with a row dedicated to vertical alignment: [table_sweet.docx](/uploads/e136f49cdd1fc3fffd336a8997273302/table_sweet.docx)
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/117
Self-closing <a> tags cause problem in Editoria's editor (Wax)
2018-03-09T01:07:43Z
Alex Theg

Now that the XSweet INK step has been updated, I tested the import into Editoria and the Wax (Substance) editor. The INK step includes the link handling sheets, and some of the links are problematic when they're piped into the editor.
Normal links from Word to external urls come through fine when they're wrapped between an opening `<a>` tag and a closing `</a>` tag.
But, self-closing `<a>` tags from internal bookmark links and references seem to behave like opening `<a>` tags that don't close. For example, one book chapter starts with an `<a id="_GoBack"/>` tag:
```html
<h2><a id="_GoBack"/>CHAPTER 2</h2>
<h3>Facts, Figures and the Politics of Measurement: </h3>
```
There aren't any more `<a>`s in the document, so Wax thinks the entire chapter is a hyperlink, and all the text becomes one clickable blue link. The solution, I believe, is to add a closing `</a>` to these.
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/116
Update sheets to XSLT 3.0
2017-10-03T17:25:23Z
Alex Theg

With the switch to Saxon 9.8, each XSL sheet that is run causes this warning to log:
```
Warning at xsl:stylesheet on line 7 column 48 of List_test_2-7BESPOKEHEADERXSLT.xsl:
Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor
```
So, the solution I think is to update the `xsl:stylesheet version="2.0"`s in all the xsl sheets to `xsl:stylesheet version="3.0"`. Is it as easy as that?1.0.0https://gitlab.coko.foundation/XSweet/XSweet/-/issues/115Heading 1 and 2 styles coming through as the same style2019-07-07T22:54:34ZChris JenningsHeading 1 and 2 styles coming through as the same styleI have a Word document with styles thus: Heading 1, Heading 2, Heading 3
All come into Editoria as Heading 1
![Screenshot_2017-08-16_12.59.20](/uploads/effb89e725cbd140b01c647a0f3bc71a/Screenshot_2017-08-16_12.59.20.png)
![Screen_Shot_2017-08-16_at_13.00.03](/uploads/769ebd996140ec2609478b8381eadfcd/Screen_Shot_2017-08-16_at_13.00.03.png)1.0.0https://gitlab.coko.foundation/XSweet/XSweet/-/issues/112Scrub step should drop formatting for single spaces, just like it does for tabs2018-03-14T23:40:17ZAlex ThegScrub step should drop formatting for single spaces, just like it does for tabsThe scrub step should recognize when whitespace-only elements are wrapped in useless tags, remove the wrappers and throw away the formatting on them. This is exactly what happens to tabs wrapped in spans, but not to spaces wrapped in spans. We should do the same thing for both.
In this example, you can see that tabs and spaces are handled differently:
Extracted:
```xml
<p class="Body">
<span style="font-size: 12pt">Some paragraph text</span>
<span style="font-size: 12pt"><tab/></span>
<span style="font-size: 12pt"> </span>
</p>
```
Scrubbed:
```xml
<p class="Body">
<span style="font-size: 12pt">Some paragraph text</span>
<tab/>
<span style="font-size: 12pt"> </span>
</p>
```
We should give spaces the same treatment as tabs, to end up with this instead:
```xml
<p class="Body"><span style="font-size: 12pt">Some paragraph text</span><tab/> </p>
```
Or, if there's a reason it's problematic to remove the spans that wrap the space, then we should at least throw away the formatting on those spans.
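To make the requested behavior concrete, here is a rough sketch in Python of unwrapping whitespace-only spans. It is not how the scrub step is implemented (that is XSLT); the function name `unwrap_whitespace_spans` is hypothetical, and the sketch assumes un-namespaced elements for brevity:

```python
import xml.etree.ElementTree as ET

def unwrap_whitespace_spans(parent: ET.Element) -> None:
    """Replace each childless <span> holding only whitespace (or nothing) with its bare text."""
    for child in list(parent):
        unwrap_whitespace_spans(child)  # handle nested elements first
        is_blank_span = (
            child.tag == "span"
            and len(child) == 0
            and (child.text or "").strip() == ""
        )
        if not is_blank_span:
            continue
        # Text that must survive: the span's own whitespace plus whatever followed it.
        kept = (child.text or "") + (child.tail or "")
        index = list(parent).index(child)
        if index == 0:
            parent.text = (parent.text or "") + kept
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + kept
        parent.remove(child)  # the span and its style attribute are discarded

scrubbed = ('<p class="Body"><span style="font-size: 12pt">Some paragraph text</span>'
            '<tab/><span style="font-size: 12pt"> </span></p>')
root = ET.fromstring(scrubbed)
unwrap_whitespace_spans(root)
print(ET.tostring(root, encoding="unicode"))
```

The space itself is kept as plain text after `<tab/>`; only the wrapper and its formatting go away, so nothing is left for a later step to promote to the paragraph.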
Here's another example from the same doc. Here, a paragraph with one space in spans with formatting looks like this all the way through the joined step:
```xml
<p class="Body"><span style="font-size: 12pt"> </span></p>
```
Then, in "collapse" the style info is moved to the paragraph level, and persists all the way into the final rinsed html:
```xml
<p class="Body" style="font-size: 12pt"> </p>
```
So it really should be dropped in the scrub step.1.0.0https://gitlab.coko.foundation/XSweet/XSweet/-/issues/111Namespace warnings2017-09-12T17:08:52ZAlex ThegNamespace warningsAs I run the XSweet pipeline, a warning message prints out for every file:
```bash
Warning
SXXP0005: The source document is in namespace http://www.w3.org/1999/xhtml, but none of
the template rules match elements in this namespace (Use --suppressXsltNamespaceCheck:on
to avoid this warning)
```
It doesn't seem to be causing any issues, besides logging this message into the terminal. I am not sure which specific stylesheet triggers this warning, but this seems to have started in [this commit](https://gitlab.coko.foundation/XSweet/XSweet/commit/b8aed614f200280597276274eeb1df2178999917)1.0.0https://gitlab.coko.foundation/XSweet/XSweet/-/issues/107Display endnote reference numbers in the html2018-03-15T00:21:42ZAlex ThegDisplay endnote reference numbers in the htmlThis commit removes the numbers that displayed from the endnotes in the html:
https://gitlab.coko.foundation/wendell/XSweet/commit/733fe21c2cac83b71143ff06f32be9f7acbfea87
Currently, the notes at the end of the html are unnumbered, although they're linked properly to the inline note callouts. The notes at the end of the html should show with note numbers that correspond to the numbered note callouts. Since the note callouts are automatically renumbered, note numbers displayed with notes should be generated too, to ensure that they correlate properly to their callouts, work with footnotes, don't display as roman numerals, etc.
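As an illustration of generating the numbers rather than trusting what is in the source, here is a small Python sketch that numbers notes by the order in which their callouts first appear. The ids, note texts, output markup and the `number_notes` name are all hypothetical, not XSweet's actual output:

```python
def number_notes(callout_ids, notes):
    """Number notes 1, 2, 3... in the order their callouts first appear."""
    numbers = {}
    for note_id in callout_ids:
        if note_id not in numbers:
            numbers[note_id] = len(numbers) + 1
    return [(number, note_id, notes[note_id]) for note_id, number in numbers.items()]

# Hypothetical ids and texts, in document order of the inline callouts.
callouts = ["en1", "en2", "en3"]
notes = {"en1": "First note.", "en2": "Second note.", "en3": "Third note."}
for number, note_id, text in number_notes(callouts, notes):
    print(f'<p id="{note_id}">{number}. {text}</p>')
```

Because the numbers come from callout order, they always use plain arabic numerals and stay in step with the renumbered callouts, regardless of the numbering format set in Word.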
I've seen authors delete the automatic numbering and manually number the notes - I think that's for a person to clean up manually.
Would implementing this also require that it be scrubbed out again in Typescript? If so, I'll make an issue there as well.1.0.0