HTMLevator issues
https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues
2018-05-01T20:44:41Z

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/13
Preliminary "Header inferencing" XSLT
Wendell Piez, 2018-05-01T20:44:41Z

The develop branch @aaad1151a46ab75771c5b6f92912680b610d4bd1 now has a demo pipeline showing dynamic attribution of header levels h1-hx to paragraphs in a document based on their formatting. It is fairly crude but seemingly effective.
It needs demo and discussion, and in particular the rules for what makes a header need to be shaken out.
However, it also needs an invocation script. (So far there is only an XProc pipeline, which does a little more work than the regular extraction pipeline, then generates and applies an XSLT to achieve this mapping.) So that comes next...

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/12
Add terminal stylesheet to emit plain text
Wendell Piez, 2018-05-01T20:41:10Z

We want the option to output plain text, for example for diffing purposes. This can be addressed with an identity XSLT with a serialization `method='text'`, to be run as the terminal XSLT in a pipeline.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/11
Premature end of file error
Alex Theg, 2018-04-26T16:07:34Z

This file fails in the UCP macro text cleanup step with a "Premature end of file" error. This stops the conversion in its tracks. I attached a snippet of the docx, as well as the outputs from each step. Here's the error I get in the terminal; any idea what's going on?
[horton.zip](/uploads/334b27456367c76a63571752aabc97c3/horton.zip)
```bash
converting: 1
Warning at xsl:stylesheet on line 9 column 34 of mark-lists.xsl:
Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor
Warning at xsl:stylesheet on line 7 column 34 of itemize-lists.xsl:
Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor
Warning at /xsl:stylesheet in outline-headers.xsl:
Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor
Type error at char 52 in xsl:sequence/@select on line 314 column 11 of ucp-text-macros.xsl:
XPTY0004: A sequence of more than one item is not allowed as the third argument of
fn:replace() ("$1 ", "s")
at xsl:apply-templates (file:/Users/atheg/Desktop/crawler/header_promotion_strategies/_cleaner/sh_branches/master/XSweet-master-5d1b023c7acf4eb861193b39e85fcc1fc32a455f/scripts/../applications/htmlevator/applications/local-fixup/ucp-text-macros.xsl#188)
processing sequence/splice[3]
at xsl:apply-templates (file:/Users/atheg/Desktop/crawler/header_promotion_strategies/_cleaner/sh_branches/master/XSweet-master-5d1b023c7acf4eb861193b39e85fcc1fc32a455f/scripts/../applications/htmlevator/applications/local-fixup/ucp-text-macros.xsl#156)
processing sequence
in built-in template rule for /html/body[1]/div[1]/p[5]/text()[1] in the unnamed mode
in built-in template rule for /html/body[1]/div[1]/p[5] in the unnamed mode
in built-in template rule for /html in the unnamed mode
A sequence of more than one item is not allowed as the third argument of fn:replace() ("$1 ", "s")
Error on line 1 column 1 of 1-10UCPTEXTED.xhtml:
SXXP0003: Error reported by XML parser: Premature end of file.
org.xml.sax.SAXParseException; systemId: file:/Users/atheg/Desktop/crawler/header_promotion_strategies/_cleaner/sh_branches/master/XSweet-master-5d1b023c7acf4eb861193b39e85fcc1fc32a455f/scripts/../outputs/horton_convs/1-10UCPTEXTED.xhtml; lineNumber: 1; columnNumber: 1; Premature end of file.
```

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/6
Some closing tags dropped and blocks of text repeated in the macro text cleanup step
Alex Theg, 2018-05-03T22:42:18Z

Picking up an issue from an existing ticket (the seastar part of https://gitlab.coko.foundation/XSweet/HTMLevator/issues/3). Upon further review, it does not appear to have anything to do with the "force punctuation formatting to match preceding word formatting" rule.
This is a widespread issue, and I will post examples below as I encounter them.
In some instances (maybe all, will investigate), the close format tag is dropped, then a certain amount of text gets duplicated until another open format tag, so some text then appears twice. In one of the repeated chunks, sentences are separated by 2 spaces, while in the other chunk, the double spaces have been replaced by single spaces.
I can also confirm that this issue happens in exactly the same way for inline bold, underline, and italic tags.
# Example 1
Rinsed html:
```html
<p class="Default" style="font-family: Helvetica; font-size: 12pt; margin-bottom: 6pt">
One of the central promises of change that former Mexican president Vicente Fox made in the run-up to his victorious election in 2000 was that he would govern on behalf of 118 million Mexicans – a number that included both the 100 million people residing within the territorial confines of the Mexican nation-state as well as the 18 million
<i>mexicanos en el exterior</i>
, the imagined community of Mexican migrants and their descendents living abroad. In recognition of their economic contributions to Mexico, and their continued commitment to the nation, Fox often referred to those
<i>mexicanos en el exterior</i>
as heroes. In this, president Fox was part of an expanding chorus of leaders from major migrant-sending states, from Ireland to the Philippines, who have celebrated the heroic contributions of migrants to their homelands over recent decades. For Fox, this heroic imagery took perhaps its grandest form on December 3, 2000, just three days into the presidency. That day Fox held his first public event and opened the official presidential residence, Los Pinos, for a meeting with migrant leaders. In his official address, the newly inaugurated president waxed eloquently about the spirit and tenacity of the migrant, about the set of characteristics that migrants shared with a curious amalgam of historical figures:
</p>
```
Macro text cleanups applied:
```html
<p class="Default" style="font-family: Helvetica; font-size: 12pt; margin-bottom: 6pt">
One of the central promises of change that former Mexican president Vicente Fox made in the run-up to his victorious election in 2000 was that he would govern on behalf of 118 million Mexicans—a number that included both the 100 million people residing within the territorial confines of the Mexican nation-state as well as the 18 million
<i>mexicanos en el exterior,</i>
the imagined community of Mexican migrants and their descendents living abroad. In recognition of their economic contributions to Mexico, and their continued commitment to the nation, Fox often referred to those
<i>mexicanos en el exterior as heroes. In this, president Fox was part of an expanding chorus of leaders from major migrant-sending states, from Ireland to the Philippines, who have celebrated the heroic contributions of migrants to their homelands over recent decades. For Fox, this heroic imagery took perhaps its grandest form on December 3, 2000, just three days into the presidency. That day Fox held his first public event and opened the official presidential residence, Los Pinos, for a meeting with migrant leaders. In his official address, the newly inaugurated president waxed eloquently about the spirit and tenacity of the migrant, about the set of characteristics that migrants shared with a curious amalgam of historical figures:
</i> as heroes. In this, president Fox was part of an expanding chorus of leaders from major migrant-sending states, from Ireland to the Philippines, who have celebrated the heroic contributions of migrants to their homelands over recent decades. For Fox, this heroic imagery took perhaps its grandest form on December 3, 2000, just three days into the presidency. That day Fox held his first public event and opened the official presidential residence, Los Pinos, for a meeting with migrant leaders. In his official address, the newly inaugurated president waxed eloquently about the spirit and tenacity of the migrant, about the set of characteristics that migrants shared with a curious amalgam of historical figures:
</p>
```
Milestone: 1.0.0

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/5
After cleanup, some words duplicated
Alex Theg, 2018-04-09T15:27:38Z

As of a commit today (I think one of the HTMLevator ones), something weird is happening with this chunk of text:
This:
```html
<p style="font-family: Helvetica">
<b>Outstanding </b>
<u>Underline</u>
<b> issues:</b>
</p>
```
Becomes
```html
<p style="font-family: Helvetica">
<b>Outstanding Underline</b>
<u>Underline</u>
<b> issues:</b>
</p>
```
Any ideas what's causing this?
Milestone: 1.0.0

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/4
Insert hair space (u200a) btwn pairs of single/double quotes
Alex Theg, 2018-04-06T18:42:23Z

As part of the macro cleanups, we should insert a hair space (u200a) between pairs of single/double quotes. Note that order of operations matters; this assumes that straight quotes and apostrophes have already been replaced with their directional counterparts.
* left single quote+left double quote (u2018+u201c)
* left double quote+left single quote (u201c+u2018)
* right single quote+right double quote (u2019+u201d)
* right double quote+right single quote (u201d+u2019)
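
To make the four pairs above concrete, here is a minimal Python sketch of the rule. It is not the XSLT used in the macro cleanups; the function name `insert_hair_spaces` and the `QUOTE_PAIRS` table are made up for illustration, and it assumes straight quotes have already been converted to directional ones.

```python
HAIR_SPACE = "\u200a"

# The four adjacent quote pairs listed in this issue.
QUOTE_PAIRS = [
    ("\u2018", "\u201c"),  # left single + left double
    ("\u201c", "\u2018"),  # left double + left single
    ("\u2019", "\u201d"),  # right single + right double
    ("\u201d", "\u2019"),  # right double + right single
]


def insert_hair_spaces(text: str) -> str:
    """Insert a hair space between each listed pair of adjacent quotes.

    Hypothetical helper: assumes the input already uses directional quotes.
    """
    for first, second in QUOTE_PAIRS:
        text = text.replace(first + second, first + HAIR_SPACE + second)
    return text


sample = "\u201c\u2018quote\u2019\u201d"
print(repr(insert_hair_spaces(sample)))
```

Pairs of the same quote character (for example two left double quotes in a row) are not in the list above, so this sketch leaves them untouched.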
This currently partially works. See the following example inputs and outputs: the characters in Word on the left and the final Typescript output on the right.
* `"'quote'"` -> `<p style="font-family: Helvetica">“ ‘quote’ ”</p>`
* works properly; hs between both pairs of quotes
* `'"quote"'` -> `<p style="font-family: Helvetica">‘“quote” ’</p>`
* hs between the 2nd quotes but not the 1st
* `'”quote"‘` -> `<p style="font-family: Helvetica">‘“quote” ’</p>`
* hs between the 2nd quotes but not the 1st
* `‘"quote"’` -> `<p style="font-family: Helvetica">‘“quote” ’</p>`
* hs between the 2nd quotes but not the 1st
* `""quote""` -> `<p style="font-family: Helvetica">“ “quote””</p>`
* hs between the 1st quotes but not the 2nd

Milestone: 1.0.0

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/3
Force punctuation to match formatting of preceding word
Alex Theg, 2018-04-21T02:43:19Z

As part of the macro cleanups, we should force punctuation to match formatting of preceding word. Let's do that for the following:
* ,
* .
* :
* ;
* ?
* !
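
As a rough illustration of the intended behavior (a Python sketch, not the XSLT in the pipeline), one of the listed punctuation marks that directly follows a closing inline tag can be pulled inside that tag. The function name `pull_punctuation_inside` is hypothetical, and the sketch assumes only `<b>`, `<i>`, and `<u>` matter and that no whitespace sits between the closing tag and the punctuation.

```python
import re

# Punctuation marks listed in this issue.
PUNCTUATION = ",.:;?!"

# A closing inline tag immediately followed by one listed punctuation mark.
# The tag set (b, i, u) is an assumption for this sketch.
TRAILING_PUNCT = re.compile(r"</(b|i|u)>([" + re.escape(PUNCTUATION) + r"])")


def pull_punctuation_inside(html: str) -> str:
    """Move a punctuation mark that follows </b>, </i> or </u> inside the tag."""
    return TRAILING_PUNCT.sub(r"\2</\1>", html)


before = '<p style="font-family: Helvetica"><b>this is all bold except for the period</b>.</p>'
print(pull_punctuation_inside(before))
# prints: <p style="font-family: Helvetica"><b>this is all bold except for the period.</b></p>
```

A string-level regex like this is only safe on simple, non-nested markup; the real step would work on the parsed tree.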
Current example:
```xml
<w:p w14:paraId="369F0F7E" w14:textId="2A1BB479" w:rsidR="00733D7F" w:rsidRDefault="00733D7F">
<w:pPr>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/><w:b/></w:rPr>
<w:t>this is all bold except for the period</w:t>
</w:r>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/></w:rPr>
<w:t>.</w:t>
</w:r>
</w:p>
```
...results in...
```html
<p style="font-family: Helvetica"><b>this is all bold except for the period</b>.</p>
```
Milestone: 1.0.0

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/2
Handling paragraph-level formatting?
Alex Theg, 2018-04-26T20:51:18Z

With #1 finished, the mapping of inline formatting tags is working properly (`<b>` and `<u>` become `<i>`).
However, when an entire paragraph is formatted with bold, underlining, or italics, that property is promoted to the paragraph level. Consequently, bolding and underlining aren't mapped to italics, and even if they were, paragraph-level bolding, italics, and underlining are all dropped by Typescript.
Here's an example.
Initial extraction:
```html
<p><span style="font-family: Helvetica"><b>Bold</b></span></p>
<p><span style="font-family: Helvetica"><i>italics</i></span></p>
<p><span style="font-family: Helvetica"><u>Underline</u></span></p>
```
After the `rinse` step, properties are on the paragraph:
```html
<p style="font-family: Helvetica; font-weight: bold">Bold</p>
<p style="font-family: Helvetica; font-style: italic">Italics</p>
<p style="font-family: Helvetica; text-decoration: underline">Underlined</p>
```
Finally, in the last Editoria reduce step, the `style` properties are dropped and the above becomes the following:
```html
<p>Bold</p>
<p>Italics</p>
<p>Underlined</p>
```
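
To make the three stages above concrete, the following Python sketch shows one way paragraph-level style properties could be turned back into inline tags before the reduce step discards them. It is only an illustration using `xml.etree.ElementTree`, not an XSLT step from the pipeline; the names `STYLE_TO_TAG`, `parse_style`, and `wrap_paragraph_formatting` are invented here.

```python
import xml.etree.ElementTree as ET

# Paragraph-level CSS declarations and the inline tag each one stands for.
STYLE_TO_TAG = {
    ("font-weight", "bold"): "b",
    ("font-style", "italic"): "i",
    ("text-decoration", "underline"): "u",
}


def parse_style(style: str) -> dict:
    """Split a style attribute value into a {property: value} dict."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip()] = value.strip()
    return props


def wrap_paragraph_formatting(p: ET.Element) -> ET.Element:
    """Wrap the whole content of <p> in <b>, <i> or <u> per its style."""
    props = parse_style(p.get("style", ""))
    for (name, value), tag in STYLE_TO_TAG.items():
        if props.get(name) == value:
            wrapper = ET.Element(tag)
            # Move the paragraph's text and children into the wrapper.
            wrapper.text = p.text
            children = list(p)
            wrapper.extend(children)
            for child in children:
                p.remove(child)
            p.text = None
            p.append(wrapper)
    return p


p = ET.fromstring('<p style="font-family: Helvetica; font-weight: bold">Bold</p>')
print(ET.tostring(wrap_paragraph_formatting(p), encoding="unicode"))
# prints: <p style="font-family: Helvetica; font-weight: bold"><b>Bold</b></p>
```

The `style` attribute is left in place here, on the assumption that the later reduce step removes it anyway.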
One solution might be to add a step in Typescript before the UCP cleanups that looks for one of these:
* `font-weight: bold`
* `font-style: italic`
* `text-decoration: underline`
And handles them by adding the matching opening `<b>`, `<i>`, or `<u>` right after the opening `<p>`, and the corresponding closing tag just before the `</p>`. What do you think?
Milestone: 1.0.0

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/1
`ucp-mappings.xsl` sheet throws "Content is not allowed in prolog" error
Alex Theg, 2018-03-30T23:17:27Z

The new `ucp-mappings.xsl` throws an error each time it is run. It thus doesn't produce an output and subsequent steps in the pipeline cannot execute.
```bash
Error on line 2 column 1 of bGreen_pt3-11UCPMAPPED.xhtml:
SXXP0003: Error reported by XML parser: Content is not allowed in prolog.
org.xml.sax.SAXParseException; systemId: file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-5d67cce275b7f65a0345e2ad500c29a8da146b6a/scripts/../outputs/Green/bGreen_pt3-11UCPMAPPED.xhtml; lineNumber: 2; columnNumber: 1; Content is not allowed in prolog.
```
Sometimes this occurs on line 2 column 1; other times it occurs on line 4 column 1:
```bash
Error on line 4 column 1 of bGreen05-11UCPMAPPED.xhtml:
```
Attached is the bash output for a whole book. [prolog_content_error.log](/uploads/789d4e71d23df72d56b19b88088da27d/prolog_content_error.log)
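
For context, XML parsers generally raise "Content is not allowed in prolog" when non-whitespace text appears before the root element. The Python sketch below is one possible way to check an intermediate file for such text. It is a diagnostic idea only, not part of the pipeline; `stray_prolog_content` is a hypothetical helper, and it does not handle a DOCTYPE with an internal subset.

```python
import re

# Markup that may legally appear before the root element:
# XML declaration / processing instructions, comments, a simple DOCTYPE.
PROLOG_MARKUP = re.compile(r"<\?.*?\?>|<!--.*?-->|<!DOCTYPE[^>]*>", re.DOTALL)

# Start of the first element tag.
ROOT_START = re.compile(r"<[A-Za-z_]")


def stray_prolog_content(xml_text: str) -> str:
    """Return any non-whitespace text found before the root element."""
    cleaned = PROLOG_MARKUP.sub("", xml_text)
    match = ROOT_START.search(cleaned)
    prolog = cleaned if match is None else cleaned[: match.start()]
    return prolog.strip()


# Hypothetical file content with text between the declaration and the root.
broken = '<?xml version="1.0" encoding="UTF-8"?>\nstray text\n<html><body/></html>'
print(repr(stray_prolog_content(broken)))
```

An empty result means nothing but whitespace and prolog markup precedes the root element, so the cause would have to be looked for elsewhere.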