XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2018-08-07T12:51:33Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/153
Update XSweet to work with outline and list HTML attributes
2018-08-07T12:51:33Z, Alex Theg

After #152 is finished, and these are added as HTML attributes:
* `-xsweet-outline-level`
* `-xsweet-list-level`
We will need to:
1. Update how list handling works to use the new attribute
2. Update where heading promotion looks for this data
3. Remove the above information from the CSS `style`

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/152
Semantic data mixed with style data?
2018-08-30T07:25:11Z, Bruno Herfst (hello@brunoherfst.com)

I noticed that XSweet saves semantic data as style info:
<p style="font-family: Tahoma; font-size: 18pt; -xsweet-outline-level: 1">
Should `-xsweet-outline-level` become a `data-*` attribute?
<p style="font-family: Tahoma; font-size: 18pt;" data-xsweet-outline-level="1">

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/151
Update binary references to use extracted copies, rather than originals
2018-08-08T09:07:37Z, Alex Theg

Things like embedded images, media, and math are all stored in the .docx directory. For the HTML extraction, these files should be copied over to the same directory as the HTML files. That way, they're easily accessible, and the HTML doesn't require the input .docx file to stay where it originally was. However, XSLT doesn't allow for file system manipulation by itself. That task will fall to INK, which is slated to be rebuilt in JavaScript (rather than RoR). Once that is complete, XSweet should be updated to reference copies of the binaries in the output directory, rather than directly referencing the binaries of the original .docx file.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/16
Test heading promotion method chooser
2018-07-26T22:00:41Z, Alex Theg

The criteria used for determining the heading promotion method are as follows:
* If the extracted HTML contains 2 or more `xsweet-outline-level` properties, then use the outline-level heading promotion
* Else, use the property-based classic method
These are OK for now, but could benefit from further refinement.
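For discussion, the two criteria above can be sketched in a few lines of Python. This is only an illustration: the actual chooser lives in the XSLT pipeline, and the function name, the regular expression, and the returned labels are all hypothetical.

```python
import re

# Hypothetical pattern: matches the outline-level property with or without
# the leading hyphen, wherever it appears in the extracted HTML.
OUTLINE_LEVEL = re.compile(r"-?xsweet-outline-level\s*:")


def choose_promotion_method(html, threshold=2):
    """Pick a heading promotion method from the extracted HTML.

    Uses the outline-level method when at least `threshold` outline-level
    properties are present, and the property-based classic method otherwise.
    """
    count = len(OUTLINE_LEVEL.findall(html))
    return "outline-level" if count >= threshold else "property-based"
```

Keeping the threshold as a parameter makes it easy to experiment when refining the criteria later.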
This is carried forward from the previous issue: https://gitlab.coko.foundation/XSweet/XSweet/issues/123

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/150
Hyperlinks in footnotes broken
2018-07-27T16:58:45Z, Bruno Herfst (hello@brunoherfst.com)

Hyperlinks in footnotes become internal DOC references:
<a href="../customXml/item1.xml">
Expected it to be:
<a href="http://www.example.com">
[footnote-hyperlink.docx](/uploads/c03e49e543009d239dd03b2fb3606dca/footnote-hyperlink.docx)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/149
Capture ordered list start value
2020-06-04T10:07:50Z, Alex Theg

Extract ordered list start value with list.
@atheg to examine .docx files and post the OOXML format here.
Before this we need to actually extract list types (#148). We need numbered lists in the first place for this to matter.
After this is complete, evaluate whether we also need to have a `continue` list property, as @GitBruno suggests in #106.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/147
How do we test?
2018-07-30T08:12:02Z, Bruno Herfst (hello@brunoherfst.com)

I want to write some tests to validate the behavior of XSweet. Is there a preference for how that is done? Add a test folder with some test source documents and their expected output? Or use an existing testing framework like [xspec](https://github.com/expath/xspec)?

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/146
Extract Break Types
2018-07-31T06:50:38Z, Bruno Herfst (hello@brunoherfst.com)

Break types are lost in conversion to HTML.
1 Page Break
2 Column Break
3 Next Page
4 Section Break
5 Even Page Break
5 Odd Page Break
6 Section Break
Could they be converted to classes: `<br class='page-break'>`

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/145
Collapse adjacent and repeated inline formatting tags
2018-07-27T04:46:12Z, Bruno Herfst (hello@brunoherfst.com)

Extracted Source:
<span style="font-style: italic"><span class="My Italic Style">italicised</span></span>
Result:
<em><em>italicised</em></em>

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/15
Ornament detector
2018-08-07T13:01:25Z, Alex Theg

Authors often include ornaments to create divisions within chapters in their Word files. These are things like:
> chapter content chapter content chapter content.
> `* * *`
> Back to more chapter content
Authors can and do implement these in a few different ways:
1. Any number of text dividers: `***`, `*****`, `- - -`, etc.
2. Using a horizontal rule in Word
It would be good as an enhancement step (not extraction) to be able to port these into Wax, so I propose we implement the following:
1. Add an optional enhancement step to HTMLevator that can recognize a range of ornaments and convert them into `<hr>`s
2. Then, add a step into Editoria Typescript to convert `<hr>`s into ornaments for Wax. There's a ticket in for implementing this in Wax (https://gitlab.coko.foundation/wax/wax/issues/178), so we'll need to wait for this to be implemented to have a target format for ornaments. But it should be a straightforward mapping.
But we can start on the first part: ornament recognition.
I think there are 2 parts to this that would get us most of the way there:
## 1. Recognizing text ornaments
I think the rule for this is pretty simple: any paragraph that contains ONLY any combination of
* asterisks
* spaces
* hyphens
* en dashes
* em dashes
* tabs
is an ornament. The paragraph and its content should be clobbered and replaced with an `<hr>`
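A minimal sketch of that rule in Python, for illustration only (the real step would presumably be an XSLT transformation in HTMLevator). The function names are hypothetical, and one assumption goes beyond the rule as written: a paragraph made up of whitespace alone is not treated as an ornament, so empty paragraphs are not turned into rules.

```python
import re

# The characters listed above: asterisk, space, hyphen, en dash, em dash, tab.
ORNAMENT = re.compile(r"[*\- \t\u2013\u2014]+")


def is_ornament(paragraph_text):
    """True if the paragraph contains only ornament characters.

    Assumption: whitespace-only paragraphs do not count as ornaments.
    """
    if paragraph_text.strip() == "":
        return False
    return ORNAMENT.fullmatch(paragraph_text) is not None


def replace_ornaments(paragraphs):
    """Replace ornament paragraphs with <hr>, wrapping the rest in <p>."""
    return ["<hr>" if is_ornament(text) else f"<p>{text}</p>" for text in paragraphs]
```

For example, `replace_ornaments(["chapter content", "* * *", "more content"])` yields a `<p>`, an `<hr>`, and another `<p>`.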
## 2. Convert horizontal rules to `<hr>`s
In Word, on a new line, typing 3 or more hyphens in a row then hitting enter creates a horizontal rule. Under the hood, it's achieved by applying a bottom border to the previous paragraph, like so:
How it looks in Word:
>
Content
***
content
The OOXML:
```xml
<w:p w14:paraId="4F67C0DD" w14:textId="77777777" w:rsidR="00B82E58" w:rsidRDefault="00C8440C">
<w:pPr>
<w:pBdr><w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/></w:pBdr>
</w:pPr>
<w:r>
<w:t>Content</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5A5EC17D" w14:textId="77777777" w:rsidR="00C8440C" w:rsidRDefault="00C8440C">
<w:r>
<w:t>content</w:t>
</w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
</w:p>
```
So, HTMLevator would need to recognize this bottom border, and add an `<hr>` after the end of that paragraph.
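To make the idea concrete, here is a rough Python sketch that reads the OOXML directly. That is an assumption made for illustration: HTMLevator works on the extracted HTML, so the real check would look at whatever the extraction emits for the border. The function names are hypothetical; the namespace URI is the standard WordprocessingML one.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def has_bottom_border(paragraph):
    """True if the w:p element carries a visible bottom paragraph border."""
    bottom = paragraph.find("w:pPr/w:pBdr/w:bottom", NS)
    if bottom is None:
        return False
    # "nil" and "none" mean no border is drawn.
    return bottom.get(f"{{{W}}}val") not in (None, "nil", "none")


def paragraphs_to_html(body):
    """Emit <p> for each direct w:p child, adding <hr> after bordered ones."""
    out = []
    for p in body.findall("w:p", NS):
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        out.append(f"<p>{text}</p>")
        if has_bottom_border(p):
            out.append("<hr>")
    return out
```

Run against the two paragraphs above, this would give `<p>Content</p>`, `<hr>`, `<p>content</p>`.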
What do you think?

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/14
How header promotion uses styles, and the list of styles we pay attention to
2018-05-02T15:44:13Z, Alex Theg

Bakker ch 2: [b_02_ch_1_Bakker.docx](/uploads/fbc7cf504b35339d8505347cadeeb240/b_02_ch_1_Bakker.docx)
Conversion outputs: [output_b02_ch_1_Bakker.zip](/uploads/467e11494ac6f4af558b39b442cc4b1b/output_b02_ch_1_Bakker.zip)
See issue XSweet#56, this is the second improvement this group of headers suggests:
When we find and promote Word styles we care about, like "section heading", we could also look for other similarly formatted text that's NOT labeled with a Word style but should have been. In this instance, the header promotion script could:
* see the "section heading" styles and promote the marked headings
* note how the section headings were formatted (here it's underlined 12pt Helvetica font)
* look for other potential headings formatted the same way, and promote them as appropriate (this would catch the other 4)
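One way to picture the three steps, as a rough Python sketch rather than a proposal for the actual implementation: paragraphs are represented as plain dicts, and the class name `sectionheading`, the dict keys, and the function names are all made up for illustration.

```python
def parse_style(style):
    """Turn a CSS style string into an order-independent set of declarations."""
    pairs = []
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            pairs.append((name.strip().lower(), value.strip().lower()))
    return frozenset(pairs)


def find_unlabeled_headings(paragraphs, heading_class="sectionheading"):
    """Find paragraphs formatted like the labeled headings but not labeled."""
    # Steps 1 and 2: collect the formatting used by the labeled headings.
    signatures = {
        parse_style(p["style"]) for p in paragraphs if p.get("class") == heading_class
    }
    signatures.discard(frozenset())  # ignore headings with no inline formatting
    # Step 3: look for unlabeled paragraphs with exactly the same formatting.
    return [
        p
        for p in paragraphs
        if p.get("class") != heading_class and parse_style(p["style"]) in signatures
    ]
```

Exact matching on the full set of declarations is the strictest possible comparison; a real version would probably need to ignore properties that don't matter for this purpose, such as margins.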
Would this be easy to do?

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/37
Some hyperlinks truncated in Wax
2018-04-26T16:55:28Z, Alex Theg

Some of the properly identified links in the HTML are truncated when ported into Wax. See examples from Horton references.

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/36
Capture paragraph-level italicization in Typescript reduce step
2018-04-27T17:06:36Z, Alex Theg

This is closely related to https://gitlab.coko.foundation/XSweet/HTMLevator/issues/2.
Italicization specified as a paragraph `style` is being dropped in the editoria typescript reduce step. This is from b_kohl-arenas_ch2:
Editoria basic:
```html
<p style="-xsweet-outline-level: 0; font-family: Times New Roman; font-style: italic">The Community Union: Organizing Farmworkers for Mutual Aid</p>
```
Editoria reduced:
```html
<p>The Community Union: Organizing Farmworkers for Mutual Aid</p>
```
After the reduce step, the paragraph's content should be wrapped in inline italic tags, like so:
```html
<p><i>The Community Union: Organizing Farmworkers for Mutual Aid</i></p>
```
~~If bolding and underlining are not similarly caught and pushed to inline elements in the reduce step, for porting into Wax (and I thought they were), they should be.~~ Just kidding... by this point, bold and underline have been converted to italics.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/10
Directional quotation marks broken by inline formatting tags
2018-06-05T04:37:31Z, Alex Theg

Here's a text macro cleanup example that highlights several issues:
Rinsed html output:
```html
<h4 style="font-family: Times New Roman; text-align: center">“<i>Desabilitado</i>:”</h4>
```
Text macro cleanup output:
```html
<h3 style="font-family: Times New Roman; text-align: center">"<i>Desabilitado:”</i>”</h3>
```
But it should really be:
```html
<h3 style="font-family: Times New Roman; text-align: center">“<i>Desabilitado:</i>”</h3>
```
Issues:
1. Because the open quote is by itself, the macro cleans it up to a straight quote, since it doesn't know what direction it should go. The quotation mark should recognize that it's next to a word, even across inline formatting tags, and be assigned a direction accordingly.
2. This shouldn't apply after the first issue is fixed, but if the text cleanup _does_ encounter a directional single or double quotation mark all by itself (e.g. `<p>”</p>`), with no clue as to which way it should face, it replaces it with a straight single or double quotation mark. I'd prefer that in these instances, it just sticks with whatever direction the original quotation mark was. If this is tricky to do though, let's leave it alone.
3. We end up with an extra closing quotation mark. I am guessing it's because the colon is brought into the italics tag (coercing punctuation to match the prior word's formatting), but it looks like the quotation mark comes along for the ride, too. Instead, it should be left where it is outside the italic tags, as it's not one of the punctuation marks that rule should apply to.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/143
Remove <spacing> element
2018-07-27T04:13:07Z, Alex Theg

In Prado, Ch 1, some paragraphs are composed of very small snippets of text enclosed by `<spacing>` tags.
The `<spacing>` tags should be removed, joining the text inside and outside of them into one string.
```html
<p style="margin-left: 5pt; margin-right: 2.15pt; margin-top: 0.5pt; text-indent: 36pt">
<spacing>Man</spacing>u
<spacing>e</spacing>l
<spacing> B</spacing>o
<spacing>telh</spacing>o
<spacing> </spacing>de
<spacing> Lacer</spacing>da
<spacing> </spacing>w
<spacing>a</spacing>s
...
```

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/9
Don't promote ornamental breaks to headers
2018-08-07T13:01:26Z, Alex Theg

The most common ornamental divider authors use is a series of asterisks, which may or may not be separated by spaces or tabs.
* "***"
* "*****"
* "* * *"
* `*<span class="tab"></span>*`
Paragraphs containing only such a pattern (with any number of spaces or tabs between asterisks) should be excluded from consideration for promotion to headings.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/142
CSS for hanging paragraphs
2018-08-07T14:24:43Z, Alex Theg

XSweet extracts regular paragraph indentation from Word into CSS correctly, but it needs a tweak to how it handles hanging paragraphs.
Indentation without hanging works great:
One indent no hanging: `<w:ind w:left="720"/>` -> `<p style="margin-left: 36pt">`
Two indent no hanging: `<w:ind w:left="1440"/>` -> `<p style="margin-left: 72pt">`
But the indentation with hanging needs another CSS property to be correct:
One indent hanging:
`<w:ind w:left="1440" w:hanging="720"/>` -> `<p style="padding-left: 36pt; text-indent: -36pt">`
It needs a `margin-left: 36pt;` added in addition to what's already there to be correct.
Two indent hanging:
`<w:ind w:left="2160" w:hanging="720"/>` -> `<p style="padding-left: 36pt; text-indent: -36pt">`
It needs a `margin-left: 72pt;` added in addition to what's already there and then it's correct.
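The arithmetic behind those examples, as a small Python sketch (the function name and the order of the CSS declarations are assumptions; the extraction itself is XSLT). `w:ind` values are in twentieths of a point, so 720 is 36pt. With a hanging indent, the missing margin is the left indent minus the hanging amount.

```python
def indent_to_css(left_twips, hanging_twips=0):
    """Convert w:ind left/hanging values (twentieths of a point) to CSS."""
    left_pt = left_twips / 20
    hanging_pt = hanging_twips / 20
    if hanging_pt == 0:
        return f"margin-left: {left_pt:g}pt"
    # The padding and negative text-indent are what XSweet already emits;
    # the margin is whatever remains of the left indent after the padding.
    margin_pt = left_pt - hanging_pt
    return (
        f"margin-left: {margin_pt:g}pt; "
        f"padding-left: {hanging_pt:g}pt; "
        f"text-indent: -{hanging_pt:g}pt"
    )
```

So `w:left="1440" w:hanging="720"` gives a 36pt margin, and `w:left="2160" w:hanging="720"` gives a 72pt margin, matching the two cases above.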
Here's a test docx: [hanging.docx](/uploads/459ecfb10d4e6c42caf16f4983c52142/hanging.docx)
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/141
Extract a default font from Word docs
2018-04-24T06:41:30Z, Alex Theg

I believe Word applies the "Normal" style to text by default, when no other Style is specified. We could extract and apply the default font specified for text upon which no other font is specified. Currently, this text displays in the browser's default font.
Putting this on hold as a future development.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/140
Formatting issues with nested spans and Word styles
2018-05-01T13:52:44Z, Alex Theg

This is somewhat related to #131
[small_caps_example.docx](/uploads/c5e9961e5f3248e8d6ace6c892d97048/small_caps_example.docx)
In the attached example, "Acknowledgements" comes through in bold and small caps but it should not - it uses the Word style "BookTitle + Not Bold, Not Small caps". It looks like this is a question of nested spans, the priority in which the formatting is resolved, and how Word style modifiers are extracted into the html.
Here's the html after the join step:
```html
<p style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">
<span style="font-weight: normal">Acknowledgements</span>
</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
</p>
```
Here it is after the collapse step. At this point, I believe the innermost span's `font-weight: normal` should have been passed to the outer `class="BookTitle"` span, but it is not:
```html
<p style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">Acknowledgements</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
</p>
```
And here is the final rinsed html:
```html
<h2 style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">Acknowledgements</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0"><!-- bookmark ='_GoBack'--></a>
<a href="#docx-bookmark_0"><!-- bookmark end --></a>
</h2>
```
And, the `font-variant: normal` needs to be passed down to the innermost span, or else it's clobbered by the `BookTitle` styling on the innermost span.
1.0.0

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/8
Hyperlinker should catch trailing slash if present.
2018-05-29T01:32:03Z, Alex Theg

Although the link isn't broken without it, it looks nicer if the hyperlink includes the trailing slash. Example:
This: `http://www.arsdisputandi.org/`
Is tagged as: [http://www.arsdisputandi.org](http://www.arsdisputandi.org)/
But it really should be: [http://www.arsdisputandi.org/](http://www.arsdisputandi.org/)
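
A possible shape for the fix, sketched in Python with a deliberately simplified URL pattern. The pattern, the set of trailing punctuation, and the function name are assumptions for illustration, not HTMLevator's actual rule. The idea is to strip only sentence punctuation from the end of a match, so a trailing slash stays inside the link.

```python
import re

# Simplified, hypothetical URL pattern: scheme plus any run of characters
# that cannot appear unescaped in a URL within running text.
URL = re.compile(r"https?://[^\s<>\"]+")

# Sentence punctuation that should stay outside the link; "/" is not included.
TRAILING_PUNCTUATION = ".,;:!?)"


def link_urls(text):
    """Wrap URLs in <a> tags, keeping a trailing slash inside the link."""

    def wrap(match):
        url = match.group(0)
        stripped = url.rstrip(TRAILING_PUNCTUATION)
        tail = url[len(stripped):]
        return f'<a href="{stripped}">{stripped}</a>{tail}'

    return URL.sub(wrap, text)
```

With this, `http://www.arsdisputandi.org/` is linked with its slash, while a sentence-ending period after a URL is left outside the tag.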