XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2020-06-04T10:07:50Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/149
Capture ordered list start value
2020-06-04T10:07:50Z
Alex Theg
Extract ordered list start value with list.
@atheg to examine .docx files and post the OOXML format here.
Before this we need to actually extract list types (#148). We need numbered lists in the first place for this to matter.
After this is complete, evaluate whether we also need to have a `continue` list property as @GitBruno suggests in #106

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/147
How do we test?
2018-07-30T08:12:02Z
Bruno Herfst
hello@brunoherfst.com
I want to write some tests to validate the behavior of XSweet. Is there a preference for how that is done? Add a test folder with some test source documents and their expected output? Or use an existing testing framework like [xspec](https://github.com/expath/xspec)?

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/146
Extract Break Types
2018-07-31T06:50:38Z
Bruno Herfst
hello@brunoherfst.com
Break types are lost in conversion to HTML.
1 Page Break
2 Column Break
3 Next Page
4 Section Break
5 Even Page Break
5 Odd Page Break
6 Section Break
Could they be converted to classes: `<br class='page-break'>`

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/145
Collapse adjacent and repeated inline formatting tags
2018-07-27T04:46:12Z
Bruno Herfst
hello@brunoherfst.com
Extracted Source:
<span style="font-style: italic"><span class="My Italic Style">italicised</span></span>
Result:
<em><em>italicised</em></em>

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/15
Ornament detector
2018-08-07T13:01:25Z
Alex Theg
Authors often include ornaments to create divisions within chapters in their Word files. These are things like:
> chapter content chapter content chapter content.
> `* * *`
> Back to more chapter content
Authors can and do implement these in a few different ways:
1. Any number of text dividers: `***`, `*****`, `- - -`, etc.
2. Using a horizontal rule in Word
It would be good as an enhancement step (not extraction) to be able to port these into Wax, so I propose we implement the following:
1. Add an optional enhancement step to HTMLevator that can recognize a range of ornaments and convert them into `<hr>`s
2. Then, add a step into Editoria Typescript to convert `<hr>`s into ornaments for Wax. There's a ticket in for implementing this in Wax (https://gitlab.coko.foundation/wax/wax/issues/178), so we'll need to wait for this to be implemented to have a target format for ornaments. But it should be a straightforward mapping.
But we can start on the first part: ornament recognition.
I think there are 2 parts to this that would get us most of the way there:
## 1. Recognizing text ornaments
I think the rule for this is pretty simple: any paragraph that contains ONLY any combination of
* asterisks
* spaces
* hyphens
* en dashes
* em dashes
* tabs
is an ornament. The paragraph and its content should be clobbered and replaced with an `<hr>`
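The rule above can be sketched in a few lines. This is only an illustration in Python, not HTMLevator's actual (XSLT) implementation; the function name `is_text_ornament` is hypothetical, and it assumes the paragraph's text content has already been extracted as a plain string. It also adds one assumption beyond the rule as written: a paragraph made up solely of spaces and tabs is not treated as an ornament.

```python
import re

# Characters listed in the issue: asterisks, spaces, hyphens,
# en dashes (U+2013), em dashes (U+2014), and tabs.
ORNAMENT_RE = re.compile(r"[* \t\-\u2013\u2014]+")


def is_text_ornament(paragraph_text: str) -> bool:
    """Return True if the paragraph contains only ornament characters."""
    # Assumption (not in the issue): blank paragraphs are left alone,
    # so at least one visible mark is required.
    if not paragraph_text.strip(" \t"):
        return False
    return ORNAMENT_RE.fullmatch(paragraph_text) is not None
```

A caller would replace any paragraph for which this returns True with an `<hr>`.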
## 2. Convert horizontal rules to `<hr>`s
In Word, on a new line, typing 3 or more hyphens in a row then hitting enter creates a horizontal rule. Under the hood, it's achieved by applying a bottom border to the previous paragraph, like so:
How it looks in Word:
> Content
> ***
> content
The OOXML:
```xml
<w:p w14:paraId="4F67C0DD" w14:textId="77777777" w:rsidR="00B82E58" w:rsidRDefault="00C8440C">
<w:pPr>
<w:pBdr><w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/></w:pBdr>
</w:pPr>
<w:r>
<w:t>Content</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5A5EC17D" w14:textId="77777777" w:rsidR="00C8440C" w:rsidRDefault="00C8440C">
<w:r>
<w:t>content</w:t>
</w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
</w:p>
```
So, HTMLevator would need to recognize this bottom border, and add an `<hr>` after the end of that paragraph.
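As a rough sketch of the detection step, the following Python reads the OOXML directly and reports which paragraphs carry a bottom border. It is an illustration only: HTMLevator works on extracted HTML rather than raw OOXML, the function names are hypothetical, and treating `nil` and `none` as "no border" is an assumption about which `w:val` values should be ignored.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def has_bottom_border(paragraph: ET.Element) -> bool:
    """True if the w:p element has a visible w:pBdr/w:bottom border."""
    bottom = paragraph.find("w:pPr/w:pBdr/w:bottom", NS)
    if bottom is None:
        return False
    # Assumption: 'nil' and 'none' mean the border is switched off.
    return bottom.get(f"{{{W_NS}}}val") not in (None, "nil", "none")


def paragraphs_needing_hr(body: ET.Element) -> list:
    """Indexes of direct w:p children that should be followed by an <hr>."""
    return [
        index
        for index, paragraph in enumerate(body.findall("w:p", NS))
        if has_bottom_border(paragraph)
    ]
```

For the two paragraphs shown above, only the first ("Content") would be flagged, so the `<hr>` would land between them.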
What do you think?

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/14
How header promotion uses styles, and the list of styles we pay attention to
2018-05-02T15:44:13Z
Alex Theg
Bakker ch 2: [b_02_ch_1_Bakker.docx](/uploads/fbc7cf504b35339d8505347cadeeb240/b_02_ch_1_Bakker.docx)
Conversion outputs: [output_b02_ch_1_Bakker.zip](/uploads/467e11494ac6f4af558b39b442cc4b1b/output_b02_ch_1_Bakker.zip)
See issue XSweet#56, this is the second improvement this group of headers suggests:
When we find and promote Word styles we care about, like "section heading", we could also look for other similarly formatted text that's NOT labeled with a Word style but should have been. In this instance, the header promotion script could:
* see the "section heading" styles and promote the marked headings
* note how the section headings were formatted (here it's underlined 12pt Helvetica)
* look for other potential headings formatted the same way, and promote them as appropriate (this would catch the other 4)
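To make the three steps concrete, here is a minimal Python sketch. It is not the header promotion script itself: it assumes each paragraph has been reduced to a dict with a `class` and an inline `style` string, and the class name `sectionheading` and both function names are hypothetical.

```python
def parse_style(style: str) -> frozenset:
    """Turn 'a: b; c: d' into an order-independent set of (property, value) pairs."""
    pairs = []
    for declaration in style.split(";"):
        if ":" in declaration:
            prop, value = declaration.split(":", 1)
            pairs.append((prop.strip().lower(), value.strip().lower()))
    return frozenset(pairs)


def find_similar_unlabeled(paragraphs, heading_class="sectionheading"):
    """Indexes of paragraphs formatted like a labeled heading but lacking the label."""
    # Steps 1 and 2: collect the formatting of every labeled heading.
    signatures = {
        parse_style(p["style"])
        for p in paragraphs
        if p.get("class") == heading_class and p.get("style")
    }
    # Step 3: report unlabeled paragraphs with identical formatting.
    return [
        index
        for index, p in enumerate(paragraphs)
        if p.get("class") != heading_class
        and p.get("style")
        and parse_style(p["style"]) in signatures
    ]
```

Exact matching of the whole style is the simplest choice; a real implementation would probably compare only a few properties (font, size, underline) and also check paragraph length.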
Would this be easy to do?

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/37
Some hyperlinks truncated in Wax
2018-04-26T16:55:28Z
Alex Theg
Some of the properly identified links in the HTML are truncated when ported into Wax. See examples from Horton references.

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/36
Capture paragraph-level italicization in Typescript reduce step
2018-04-27T17:06:36Z
Alex Theg
This is closely related to https://gitlab.coko.foundation/XSweet/HTMLevator/issues/2.
Italicization specified as a paragraph `style` is being dropped in the editoria typescript reduce step. This is from b_kohl-arenas_ch2:
Editoria basic:
```html
<p style="-xsweet-outline-level: 0; font-family: Times New Roman; font-style: italic">The Community Union: Organizing Farmworkers for Mutual Aid</p>
```
Editoria reduced:
```html
<p>The Community Union: Organizing Farmworkers for Mutual Aid</p>
```
The result of the reduce step should be wrapped in inline italic tags (`<i>` here), like so:
```html
<p><i>The Community Union: Organizing Farmworkers for Mutual Aid</i></p>
```
~~If bolding and underlining are not similarly caught and pushed to inline elements in the reduce step, for porting into Wax (and I thought they were), they should be.~~ Just kidding... by this point, bold and underline have been converted to italics.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/10
Directional quotation marks broken by inline formatting tags
2018-06-05T04:37:31Z
Alex Theg
Here's a text macro cleanup that highlights several issues:
Rinsed html output:
```html
<h4 style="font-family: Times New Roman; text-align: center">“<i>Desabilitado</i>:”</h4>
```
Text macro cleanup output:
```html
<h3 style="font-family: Times New Roman; text-align: center">"<i>Desabilitado:”</i>”</h3>
```
But it should really be:
```html
<h3 style="font-family: Times New Roman; text-align: center">“<i>Desabilitado:</i>”</h3>
```
Issues:
1. Because the open quote is by itself, the macro cleans it up to a straight quote, since it doesn't know what direction it should go. The quotation mark should recognize that it's next to a word, even across inline formatting tags, and be assigned a direction accordingly.
2. This shouldn't apply after the first issue is fixed, but if the text cleanup _does_ encounter a directional single or double quotation mark all by itself (e.g. `<p>”</p>`), with no clue as to which way it should face, it replaces it with a straight single or double quotation mark. I'd prefer that in these instances, it just sticks with whatever direction the original quotation mark was. If this is tricky to do though, let's leave it alone.
3. We end up with an extra closing quotation mark. I am guessing it's because the colon is brought into the italics tag (coercing punctuation to match the prior word's formatting), but it looks like the quotation mark comes along for the ride, too. Instead, it should be left where it is outside the italic tags, as it's not one of the punctuation marks that rule should apply to.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/143
Remove <spacing> element
2018-07-27T04:13:07Z
Alex Theg
In Prado, Ch 1, some paragraphs are composed of very small snippets of text enclosed by `<spacing>` tags.
The `<spacing>` tags should be removed, joining the text inside and outside of them into one string.
```html
<p style="margin-left: 5pt; margin-right: 2.15pt; margin-top: 0.5pt; text-indent: 36pt">
<spacing>Man</spacing>u
<spacing>e</spacing>l
<spacing> B</spacing>o
<spacing>telh</spacing>o
<spacing> </spacing>de
<spacing> Lacer</spacing>da
<spacing> </spacing>w
<spacing>a</spacing>s
...
```

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/9
Don't promote ornamental breaks to headers
2018-08-07T13:01:26Z
Alex Theg
The most common ornamental divider authors use is a series of asterisks, which may or may not be separated by spaces or tabs.
* "***"
* "*****"
* "* * *"
* `*<span class="tab"></span>*`
Paragraphs containing only such a pattern (with any number of spaces or tabs between asterisks) should be excluded from consideration for promotion to headings.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/141
Extract a default font from Word docs
2018-04-24T06:41:30Z
Alex Theg
I believe Word applies the "Normal" style to text by default, when no other Style is specified. We could extract and apply the default font specified for text upon which no other font is specified. Currently, this text displays in the browser's default font.
Putting this on hold as a future development.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/8
Hyperlinker should catch trailing slash if present.
2018-05-29T01:32:03Z
Alex Theg
Although the link isn't broken without it, it looks nicer if the hyperlink includes the trailing slash. Example:
This: `http://www.arsdisputandi.org/`
Is tagged as: [http://www.arsdisputandi.org](http://www.arsdisputandi.org)/
But it really should be: [http://www.arsdisputandi.org/](http://www.arsdisputandi.org/)

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/7
Hyperlink inferrer tags some file names as links that shouldn't be
2018-05-29T01:38:51Z
Alex Theg
In an author docx that lists captions for images to be included in the book, these strings get linked as hyperlinks:
* 04_IntroTodosSantos.jpg
* 13_ClinicScale_20R06.jpg
They shouldn't be, since they don't point to anything. Could we add a small adjustment to be sure things like this don't get linked? Perhaps a validation like "must have at least one slash if it ends in a file extension" or something similar?

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/114
Ensure endnotes appear with their numbers at the end
2017-08-16T23:29:59Z
Alex Theg
See b_04_ch_3_Bakker, and #2, where this issue was first reported:
The inline callout is fine, but the third endnote doesn’t show its number next to the text of the note at the end of the document. This is an error in the Word doc, which comes through into the html. Clicking the inline note callout ("3") in the html takes you to the correct corresponding note, but it's not labeled "3". This is because the note is missing its `<w:endnoteRef/>` in the OOXML:
```xml
<w:r>
<w:rPr><w:sz w:val="24"/><w:szCs w:val="24"/><w:vertAlign w:val="superscript"/></w:rPr><w:endnoteRef/></w:r>
<w:r>
```
All the other note references have this `<w:endnoteRef/>` tag. This element is extracted into the html as `<span class="endnoteRef">`, and that’s where the corresponding note number gets inserted.
To fix this, we could insert this `<span class="endnoteRef">` into the html in the proper place (inside `<div class="docx-endnote">` for every note) even if it doesn't exist. Since XSweet renumbers the notes, it should have a list of the notes anyway.
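A minimal sketch of that repair, written in Python for illustration rather than as XSweet's XSLT. The function name `ensure_endnote_refs` is hypothetical, and the placement is an assumption: the span is added as the first child of the note's first `<p>` (or of the `<div>` itself if it has no `<p>`), which may not match where XSweet puts it for well-formed notes.

```python
import xml.etree.ElementTree as ET


def ensure_endnote_refs(root: ET.Element) -> int:
    """Add a missing <span class="endnoteRef"> to each endnote div; return how many were added."""
    added = 0
    for div in root.iter("div"):
        if div.get("class") != "docx-endnote":
            continue
        if any(span.get("class") == "endnoteRef" for span in div.iter("span")):
            continue  # this note already has its marker
        target = div.find("p")
        if target is None:
            target = div
        span = ET.Element("span", {"class": "endnoteRef"})
        # Keep the existing leading text, but after the new span.
        span.tail = target.text
        target.text = None
        target.insert(0, span)
        added += 1
    return added
```

The renumbering step could then fill every `endnoteRef` span with its number, including the ones added here.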
I'll put this on hold as a validation step that can come after 1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/98
"Chapter One - " text dropped from extraction
2018-03-15T20:05:00Z
Alex Theg
Boyles Ch 1 [e._Boyles_Chapter_1.docx](/uploads/86a58da52d6b60a5568a284f3225f592/e._Boyles_Chapter_1.docx)
[e.BoylesChapter1-1EXTRACTED.html](/uploads/dee5c6517f5f46d846e33d0922fa20cb/e.BoylesChapter1-1EXTRACTED.html)
[e.BoylesChapter1-9RINSED.html](/uploads/da01d16780e6f8cc2dd7e8b02877db1c/e.BoylesChapter1-9RINSED.html)
The top level header reads "Chapter One - Introduction" in Word. But the text "Chapter One - " is not coming through from Word into html. It seems to have been implemented as some kind of list in Word, but I can't find any evidence of it in the initial extraction.
![Screen_Shot_2017-04-21_at_1.42.36_AM](/uploads/55402036b54e43e7429bb576ce93278c/Screen_Shot_2017-04-21_at_1.42.36_AM.png)
Is there an easy fix?
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/77
Chapter headings come through in blue
2017-08-16T19:56:39Z
Alex Theg
In Buchbinder's book, all of the chapters come through into the HTML with blue coloring. It seems to be caused by a `style="color: auto"` attribute. The headings display as black in Word, but light blue in browsers (Chrome and Firefox).
Here are examples of what's causing it:
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Introduction</h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter One </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">The Bottom of the Funnel </h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter Three </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Sticky Brains </h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter Four </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Treating the Family</h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter Five </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Locating Pain in Societal Stress</h1>
```
I'll mark this as low priority, since this is an html-only improvement. The color gets scrubbed out by typescript.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/72
Catch "Subtitle" styles
2018-03-15T17:27:31Z
Alex Theg
See #71 for Boyles files
The chapter's subtitle ("A Startling Turn of Events") doesn't get promoted to a heading. There are 2 issues here:
1. It is labeled as a Word style "Subtitle." This should be one of the styles that XSweet pays attention to for consideration as a header. In this entire book, the author's used styles h1-4 to mark the different heading levels correctly. I think the conversion pipeline is noting this, as all the properly labeled headers are coming through accurately, so it would be good to stir class="Subtitle" into the mix too.
2. (more of a note to self) Even if the author hadn't styled this, it's short and centered, so should be caught by the "if it's centered and short it's a header" rule, once that is implemented.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/52
Making @class values HTML-safe
2017-08-16T19:19:22Z
Wendell Piez
XSweet produces HTML 'class' attributes with labels reflecting Word Styles ("Paragraph Styles" and "Character Styles") assigned in the source document.
We are already normalizing these names into HTML-safe versions by removing spaces and other unwanted characters, but at risk of confusion when there are name collisions (in the rare case where a document has both "Header 1" and "Header1" styles, and they are different). We are also not making provision for some peculiarities of HTML @class, such as the fact that they are not case-sensitive.
A separate filter to rewrite style names to safe values would account for all of this properly. It should:
* Remove spaces and unwanted characters
* Cast to lower case
* Relabel any resulting collisions with distinguishing suffixes.
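The three steps might look like this in Python. This is a sketch of the idea, not the XSweet filter: the function name `safe_class_names` is hypothetical, and both the set of allowed characters (ASCII letters, digits, `_`, `-`) and the `-2`, `-3` suffix format are assumptions.

```python
import re


def safe_class_names(style_names):
    """Map each distinct Word style name to a unique, lower-case, HTML-safe class."""
    mapping = {}
    used = set()
    for name in style_names:
        if name in mapping:
            continue  # same style seen again: reuse its class
        # Remove spaces and unwanted characters, then cast to lower case.
        base = re.sub(r"[^A-Za-z0-9_-]", "", name).lower() or "style"
        # Relabel collisions with a distinguishing numeric suffix.
        candidate = base
        suffix = 2
        while candidate in used:
            candidate = f"{base}-{suffix}"
            suffix += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping
```

The same mapping would be applied both to the `class` attributes and to the selectors in the enclosed CSS so the two stay in step.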
This needs to happen both on @class, and inside enclosed CSS (wherein classes are referenced), for comprehensiveness.
Wendell Piez

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/4
Mapping `<code>` and code blocks
2018-03-16T17:50:57Z
Wendell Piez
Editoria accepts both `<code>` (inline) and code blocks in the form:
```<pre><code> { content } </code></pre>```
The problem will be detecting these in HTMLTypescript, where we are more likely to have things like `<span class="someCode">` (where the `someCode` style may or may not have a monospace font), or `<span style="font-family: Courier New">` or (for a code block) `<p style="font-family: Courier New; font-size: 14pt">`.
Seeing actual examples will help. Assembling a list of known monospace fonts (on platforms used by authors) could also help - a filter could detect these and act accordingly.
Wendell Piez
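As a starting point for such a filter, here is a small Python sketch that checks an inline `style` value against a list of monospace fonts. It is an illustration only: the function name is hypothetical, and the font list is a short, incomplete assumption that would need to be built out from real author documents.

```python
import re

# Assumption: a small, incomplete list of monospace fonts, in lower case.
MONOSPACE_FONTS = {
    "courier",
    "courier new",
    "consolas",
    "menlo",
    "monaco",
    "lucida console",
    "andale mono",
}

FONT_FAMILY_RE = re.compile(r"font-family\s*:\s*([^;]+)", re.IGNORECASE)


def uses_monospace_font(style: str) -> bool:
    """True if the style's font-family names a known monospace font."""
    match = FONT_FAMILY_RE.search(style)
    if match is None:
        return False
    families = [
        family.strip().strip("'\"").lower()
        for family in match.group(1).split(",")
    ]
    return any(f in MONOSPACE_FONTS or f == "monospace" for f in families)
```

A `<span>` that passes this check could be mapped to `<code>`, and a `<p>` to a `<pre><code>` block; class-based cases such as `class="someCode"` would still need the style definition looked up first.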