XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2018-08-07T12:51:33Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/153
Update XSweet to work with outline and list HTML attributes
2018-08-07T12:51:33Z
Alex Theg

After #152 is finished, and these are added as HTML attributes:
* `-xsweet-outline-level`
* `-xsweet-list-level`
We will need to:
1. Update how list handling works to use the new attribute
2. Update where heading promotion looks for this data
3. Remove the above information from the CSS `style`

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/152
Semantic data mixed with style data?
2018-08-30T07:25:11Z
Bruno Herfst (hello@brunoherfst.com)

I noticed that XSweet saves semantic data as style info:
<p style="font-family: Tahoma; font-size: 18pt; -xsweet-outline-level: 1">
Should `-xsweet-outline-level` become a `data-*` attribute?
<p style="font-family: Tahoma; font-size: 18pt;" data-xsweet-outline-level="1">

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/151
Update binary references to use extracted copies, rather than originals
2018-08-08T09:07:37Z
Alex Theg

Things like embedded images, media, and math are all stored in the .docx directory. For the HTML extraction, these files should be copied over to the same directory as the HTML files. That way, they're easily accessible, and the HTML doesn't require the input .docx file to stay where it originally was. However, XSLT doesn't allow for file system manipulation by itself. That task will fall to INK, which is slated to be rebuilt in JavaScript (rather than RoR). Once that is complete, XSweet should be updated to reference copies of the binaries in the output directory, rather than directly referencing the binaries of the original .docx file.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/16
Test heading promotion method chooser
2018-07-26T22:00:41Z
Alex Theg

The criteria used for determining the heading promotion method are as follows:
* If the extracted HTML contains 2 or more `xsweet-outline-level` properties, then use the outline-level heading promotion
* Else, use the property-based classic method
These are OK for now, but could benefit from further refinement.
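
For discussion, the chooser criteria above could be sketched roughly as follows. This is only an illustration in Python, not the actual HTMLevator implementation: the function name `choose_promotion_method`, the returned labels, and the `threshold` parameter are all hypothetical, and it assumes the outline level still appears as a CSS-style `xsweet-outline-level:` property in the extracted HTML.

```python
import re

# Matches each occurrence of the property, with or without the leading hyphen.
# Assumption: the level is serialized as "xsweet-outline-level: N" inside `style`.
OUTLINE_LEVEL = re.compile(r"xsweet-outline-level\s*:")


def choose_promotion_method(html: str, threshold: int = 2) -> str:
    """Return a (hypothetical) label for the heading promotion method to run."""
    count = len(OUTLINE_LEVEL.findall(html))
    # 2 or more outline-level properties: use outline-level promotion.
    # Otherwise: fall back to the property-based classic method.
    return "outline-level" if count >= threshold else "property-based"
```

Keeping the threshold as a parameter would make it easy to experiment when refining the criteria.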
This is carried forward from the previous issue: https://gitlab.coko.foundation/XSweet/XSweet/issues/123

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/150
Hyperlinks in footnotes broken
2018-07-27T16:58:45Z
Bruno Herfst (hello@brunoherfst.com)

Hyperlinks in footnotes become an internal DOC reference:
<a href="../customXml/item1.xml">
Expected it to be:
<a href="http://www.example.com">
[footnote-hyperlink.docx](/uploads/c03e49e543009d239dd03b2fb3606dca/footnote-hyperlink.docx)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/149
Capture ordered list start value
2020-06-04T10:07:50Z
Alex Theg

Extract ordered list start value with list.
@atheg to examine .docx files and post the OOXML format here.
Before this we need to actually extract list types (#148). We need numbered lists in the first place for this to matter.
After this is complete, evaluate whether we also need to have a `continue` list property, as @GitBruno suggests in #106.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/148
Capture list type
2022-03-09T11:23:00Z
Alex Theg

From #106 @GitBruno
XSweet should extract the list type from lists, in addition to what it does now, and hold onto it as a property, `xsweet-list-type`.
@atheg todo: investigate most commonly occurring list types in Word and specify mappings from the Word OOXML to `xsweet-list-type` property values.
Dione Mentis (dione@coko.foundation)
Bharathydasan
Dione Mentis (dione@coko.foundation)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/147
How do we test?
2018-07-30T08:12:02Z
Bruno Herfst (hello@brunoherfst.com)

I want to write some tests to validate the behavior of XSweet. Is there a preference for how that is done? Add a test folder with some test source documents and their expected output? Or use an existing testing framework like [xspec](https://github.com/expath/xspec)?

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/145
Collapse adjacent and repeated inline formatting tags
2018-07-27T04:46:12Z
Bruno Herfst (hello@brunoherfst.com)

Extracted Source:
<span style="font-style: italic"><span class="My Italic Style">italicised</span></span>
Result:
<em><em>italicised</em></em>

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/15
Ornament detector
2018-08-07T13:01:25Z
Alex Theg

Authors often include ornaments to create divisions within chapters in their Word files. These are things like:
> chapter content chapter content chapter content.
> `* * *`
> Back to more chapter content
Authors can and do implement these in a few different ways:
1. Any number of text dividers: `***`, `*****`, `- - -`, etc.
2. Using a horizontal rule in Word
It would be good as an enhancement step (not extraction) to be able to port these into Wax, so I propose we implement the following:
1. Add an optional enhancement step to HTMLevator that can recognize a range of ornaments and convert them into `<hr>`s
2. Then, add a step into Editoria Typescript to convert `<hr>`s into ornaments for Wax. There's a ticket in for implementing this in Wax (https://gitlab.coko.foundation/wax/wax/issues/178), so we'll need to wait for this to be implemented to have a target format for ornaments. But it should be a straightforward mapping.
But we can start on the first part: ornament recognition.
I think there are 2 parts to this that would get us most of the way there:
## 1. Recognizing text ornaments
I think the rule for this is pretty simple: any paragraph that contains ONLY any combination of
* asterisks
* spaces
* hyphens
* en dashes
* em dashes
* tabs
is an ornament. The paragraph and its content should be clobbered and replaced with an `<hr>`
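
To make the rule concrete, here is a minimal sketch of the test in Python (the real enhancement step would presumably be XSLT, so treat this purely as an illustration). The name `is_text_ornament` is hypothetical, and the extra requirement that the paragraph contain at least one non-whitespace mark is an assumption added here so that empty or whitespace-only paragraphs are not turned into `<hr>`s.

```python
import re

# Only asterisks, spaces, hyphens, en dashes (U+2013), em dashes (U+2014), and tabs.
ORNAMENT_CHARS = re.compile(r"[*\- \t\u2013\u2014]+")


def is_text_ornament(paragraph_text: str) -> bool:
    """True if the paragraph text consists solely of ornament characters."""
    if not ORNAMENT_CHARS.fullmatch(paragraph_text):
        return False
    # Assumption: a paragraph of nothing but spaces/tabs is not an ornament.
    return paragraph_text.strip(" \t") != ""
```

With this, `* * *`, `*****`, and `- - -` all qualify, while a paragraph such as `chapter content *` does not.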
## 2. Convert horizontal rules to `<hr>`s
In Word, on a new line, typing 3 or more hyphens in a row then hitting enter creates a horizontal rule. Under the hood, it's achieved by applying a bottom border to the previous paragraph, like so:
How it looks in Word:
>
Content
***
content
The OOXML:
```xml
<w:p w14:paraId="4F67C0DD" w14:textId="77777777" w:rsidR="00B82E58" w:rsidRDefault="00C8440C">
<w:pPr>
<w:pBdr><w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/></w:pBdr>
</w:pPr>
<w:r>
<w:t>Content</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5A5EC17D" w14:textId="77777777" w:rsidR="00C8440C" w:rsidRDefault="00C8440C">
<w:r>
<w:t>content</w:t>
</w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
</w:p>
```
So, HTMLevator would need to recognize this bottom border, and add an `<hr>` after the end of that paragraph.
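
As a rough illustration of the detection, the sketch below checks each `w:p` in an OOXML fragment for a `w:pBdr/w:bottom` border. It is written in Python against the raw OOXML only to show the pattern; where this check would actually live (extraction or HTMLevator) is an open question, and the function names are hypothetical. Treating `nil` and `none` border values as "no rule" is an assumption.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (bound to the w: prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def has_bottom_border(paragraph: ET.Element) -> bool:
    """True if the paragraph carries a visible bottom border in its pPr."""
    bottom = paragraph.find("w:pPr/w:pBdr/w:bottom", NS)
    if bottom is None:
        return False
    # Assumption: "nil"/"none" (or a missing w:val) means no visible rule.
    return bottom.get(f"{{{W}}}val") not in (None, "nil", "none")


def paragraphs_needing_hr(ooxml: str) -> list:
    """For each w:p in document order, report whether an <hr> should follow it."""
    root = ET.fromstring(ooxml)
    return [has_bottom_border(p) for p in root.iter(f"{{{W}}}p")]
```

For the two paragraphs shown above, this would flag the first ("Content") and not the second.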
What do you think?

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/144
README instructions do not match repo contents
2018-05-16T22:40:45Z
Ghost User

The [applications/README.md](https://gitlab.coko.foundation/XSweet/XSweet/blob/master/applications/readme.md) file mentions the file `local-fixup/hyperlink-inferencer.xsl`, but this file does not exist in most branches of this repo. It only exists in the `ink-api-publish` branch [here](https://gitlab.coko.foundation/XSweet/XSweet/blob/ink-api-publish/applications/local-fixup/hyperlink-inferencer.xsl).

https://gitlab.coko.foundation/XSweet/XSweet_runner_scripts/-/issues/1
Workflows call script removed from xsweet
2018-05-24T22:29:52Z
Ghost User

Hi there,
The `execute_chain.sh` script runs a multi-step process, but one of the steps requires `hyperlink-inferencer.xsl`, which appears to have been removed from all but one branch of the xsweet repository.
This file _only_ exists in the `ink-api-publish` branch ([link](https://gitlab.coko.foundation/XSweet/XSweet/blob/ink-api-publish/applications/local-fixup/hyperlink-inferencer.xsl)).
Is this step no longer necessary?

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/14
How header promotion uses styles, and the list of styles we pay attention to
2018-05-02T15:44:13Z
Alex Theg

Bakker ch 2: [b_02_ch_1_Bakker.docx](/uploads/fbc7cf504b35339d8505347cadeeb240/b_02_ch_1_Bakker.docx)
Conversion outputs: [output_b02_ch_1_Bakker.zip](/uploads/467e11494ac6f4af558b39b442cc4b1b/output_b02_ch_1_Bakker.zip)
See issue XSweet#56, this is the second improvement this group of headers suggests:
When we find and promote Word styles we care about, like "section heading", we could also look for other similarly formatted text that's NOT labeled with a Word style but should have been. In this instance, the header promotion script could:
* see the "section heading" styles and promote the marked headings
* note how the section headings were formatted (here it's underlined 12pt Helvetica font)
* look for other potential headings formatted the same way, and promote them as appropriate (this would catch the other 4)
Would this be easy to do?

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/13
Preliminary "Header inferencing" XSLT
2018-05-01T20:44:41Z
Wendell Piez

The develop branch @aaad1151a46ab75771c5b6f92912680b610d4bd1 now has a demo pipeline showing dynamic attribution of header levels h1-hx to paragraphs in a document based on their formatting. It is fairly crude but seemingly effective.
It needs demo and discussion, and in particular the rules for what makes a header need to be shaken out.
However, it also needs an invocation script. (So far there is only an XProc pipeline, which does a little more work than the regular extraction pipeline, then generates and applies an XSLT to achieve this mapping.) So that comes next...

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/12
Add terminal stylesheet to emit plain text
2018-05-01T20:41:10Z
Wendell Piez

We want the option to be able to output plain text, for example for diffing purposes. This can be addressed with an identity XSLT with a serialization `method='text'`, to be run as the terminal XSLT in a pipeline.

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/38
Investigate inconsistent italics in Waters Intro
2018-05-03T22:42:18Z
Alex Theg

Italics in the first note of the Waters introduction continue past where they should - investigate.

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/37
Some hyperlinks truncated in Wax
2018-04-26T16:55:28Z
Alex Theg

Some of the properly identified links in the HTML are truncated when ported into Wax. See examples from Horton references.

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/36
Capture paragraph-level italicization in Typescript reduce step
2018-04-27T17:06:36Z
Alex Theg

This is closely related to https://gitlab.coko.foundation/XSweet/HTMLevator/issues/2.
Italicization specified as a paragraph `style` is being dropped in the editoria typescript reduce step. This is from b_kohl-arenas_ch2:
Editoria basic:
```html
<p style="-xsweet-outline-level: 0; font-family: Times New Roman; font-style: italic">The Community Union: Organizing Farmworkers for Mutual Aid</p>
```
Editoria reduced:
```html
<p>The Community Union: Organizing Farmworkers for Mutual Aid</p>
```
The content in the result of the reduce step should be wrapped in inline italic tags, like so:
```html
<p><i>The Community Union: Organizing Farmworkers for Mutual Aid</i></p>
```
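
To pin down the expected behavior, here is a small Python sketch of the transformation (not the actual reduce step, which is XSLT). The function name `reduce_paragraph` is hypothetical, and it assumes the paragraph is well-formed XHTML and that the reduce step discards the whole `style` attribute, as in the "Editoria reduced" output above.

```python
import re
import xml.etree.ElementTree as ET

ITALIC = re.compile(r"font-style\s*:\s*italic")


def reduce_paragraph(p_xhtml: str) -> str:
    """Drop the style attribute, but keep paragraph-level italics as an inline <i>."""
    p = ET.fromstring(p_xhtml)
    style = p.attrib.pop("style", "")  # assumption: reduce removes `style` entirely
    if ITALIC.search(style):
        wrapper = ET.Element("i")
        # Move the paragraph's text and child elements inside the <i> wrapper.
        wrapper.text = p.text
        for child in list(p):
            p.remove(child)
            wrapper.append(child)
        p.text = None
        p.append(wrapper)
    return ET.tostring(p, encoding="unicode")
```

Run on the "Editoria basic" paragraph above, this yields the `<p><i>…</i></p>` form shown as the desired result.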
~~If bolding and underlining are not similarly caught and pushed to inline elements in the reduce step, for porting into Wax (and I thought they were), they should be.~~ Just kidding... by this point, bold and underline have been converted to italics.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/11
Premature end of file error
2018-04-26T16:07:34Z
Alex Theg

This file fails in the UCP macro text cleanup step, with a "Premature end of file" error. This stops the conversion in its tracks. I attached a snippet of the docx, as well as the outputs from each step. Here's the error I get in the terminal - any idea what's going on?
[horton.zip](/uploads/334b27456367c76a63571752aabc97c3/horton.zip)
```bash
converting: 1
Warning at xsl:stylesheet on line 9 column 34 of mark-lists.xsl:
Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor
Warning at xsl:stylesheet on line 7 column 34 of itemize-lists.xsl:
Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor
Warning at /xsl:stylesheet in outline-headers.xsl:
Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor
Type error at char 52 in xsl:sequence/@select on line 314 column 11 of ucp-text-macros.xsl:
XPTY0004: A sequence of more than one item is not allowed as the third argument of
fn:replace() ("$1 ", "s")
at xsl:apply-templates (file:/Users/atheg/Desktop/crawler/header_promotion_strategies/_cleaner/sh_branches/master/XSweet-master-5d1b023c7acf4eb861193b39e85fcc1fc32a455f/scripts/../applications/htmlevator/applications/local-fixup/ucp-text-macros.xsl#188)
processing sequence/splice[3]
at xsl:apply-templates (file:/Users/atheg/Desktop/crawler/header_promotion_strategies/_cleaner/sh_branches/master/XSweet-master-5d1b023c7acf4eb861193b39e85fcc1fc32a455f/scripts/../applications/htmlevator/applications/local-fixup/ucp-text-macros.xsl#156)
processing sequence
in built-in template rule for /html/body[1]/div[1]/p[5]/text()[1] in the unnamed mode
in built-in template rule for /html/body[1]/div[1]/p[5] in the unnamed mode
in built-in template rule for /html in the unnamed mode
A sequence of more than one item is not allowed as the third argument of fn:replace() ("$1 ", "s")
Error on line 1 column 1 of 1-10UCPTEXTED.xhtml:
SXXP0003: Error reported by XML parser: Premature end of file.
org.xml.sax.SAXParseException; systemId: file:/Users/atheg/Desktop/crawler/header_promotion_strategies/_cleaner/sh_branches/master/XSweet-master-5d1b023c7acf4eb861193b39e85fcc1fc32a455f/scripts/../outputs/horton_convs/1-10UCPTEXTED.xhtml; lineNumber: 1; columnNumber: 1; Premature end of file.
```

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/10
Directional quotation marks broken by inline formatting tags
2018-06-05T04:37:31Z
Alex Theg

Here's a text macro cleanup that highlights several issues:
Rinsed html output:
```html
<h4 style="font-family: Times New Roman; text-align: center">“<i>Desabilitado</i>:”</h4>
```
Text macro cleanup output:
```html
<h3 style="font-family: Times New Roman; text-align: center">"<i>Desabilitado:”</i>”</h3>
```
But it should really be:
```html
<h3 style="font-family: Times New Roman; text-align: center">“<i>Desabilitado:</i>”</h3>
```
Issues:
1. Because the open quote is by itself, the macro cleans it up to a straight quote, since it doesn't know what direction it should go. The quotation mark should recognize that it's next to a word, even across inline formatting tags, and be assigned a direction accordingly.
2. This shouldn't apply after the first issue is fixed, but if the text cleanup _does_ encounter a directional single or double quotation mark all by itself (e.g. `<p>”</p>`), with no clue as to which way it should face, it replaces it with a straight single or double quotation mark. I'd prefer that in these instances, it just sticks with whatever direction the original quotation mark was. If this is tricky to do though, let's leave it alone.
3. We end up with an extra closing quotation mark. I am guessing it's because the colon is brought into the italics tag (coercing punctuation to match the prior word's formatting), but it looks like the quotation mark comes along for the ride, too. Instead, it should be left where it is outside the italic tags, as it's not one of the punctuation marks that rule should apply to.