XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2018-04-02T18:30:14Z

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/28
Insert hair space (u200a) btwn pairs of single/double quotes
2018-04-02T18:30:14Z
Alex Theg

As part of the macro cleanups, we should insert a hair space (u200a) between pairs of single/double quotes. Note that order of operations matters; this assumes that straight quotes and apostrophes have already been replaced with their directional counterparts.
* left single quote+left double quote (u2018+u201c)
* left double quote+left single quote (u201c+u2018)
* right single quote+right double quote (u2019+u201d)
* right double quote+right single quote (u201d+u2019)
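The four pairs above can be sketched as a single regular expression. This is a minimal Python illustration, not the XSLT used in Typescript; the function name `insert_hair_spaces` is hypothetical, and it assumes the text already contains only directional quotes.

```python
import re

HAIR_SPACE = "\u200a"

# Zero-width positions between the four adjacent pairs listed above.
_PAIRS = re.compile(
    "(?<=\u2018)(?=\u201c)"   # left single + left double
    "|(?<=\u201c)(?=\u2018)"  # left double + left single
    "|(?<=\u2019)(?=\u201d)"  # right single + right double
    "|(?<=\u201d)(?=\u2019)"  # right double + right single
)


def insert_hair_spaces(text: str) -> str:
    """Insert U+200A between each listed pair of directional quotes."""
    return _PAIRS.sub(HAIR_SPACE, text)
```

Because the pattern only looks at directional quotes, running it before the straight-quote replacement would find nothing, which is the order-of-operations point made above.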
This currently partially works. See the following example inputs and outputs: the characters in Word on the left and the final Typescript output on the right.
* `"'quote'"` -> `<p style="font-family: Helvetica">“ ‘quote’ ”</p>`
* works properly; hs between both pairs of quotes
* `'"quote"'` -> `<p style="font-family: Helvetica">‘“quote” ’</p>`
* hs between the 2nd quotes but not the 1st
* `'”quote"‘` -> `<p style="font-family: Helvetica">‘“quote” ’</p>`
* hs between the 2nd quotes but not the 1st
* `‘"quote"’` -> `<p style="font-family: Helvetica">‘“quote” ’</p>`
* hs between the 2nd quotes but not the 1st
* `""quote""` -> `<p style="font-family: Helvetica">“ “quote””</p>`
* hs between the 1st quotes but not the 2nd

1.0.0

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/27
Force punctuation to match formatting of preceding word
2018-04-02T18:29:55Z
Alex Theg

As part of the macro cleanups, we should force punctuation to match the formatting of the preceding word. Let's do that for the following:
* ,
* .
* :
* ;
* ?
* !
Current example:
```xml
<w:p w14:paraId="369F0F7E" w14:textId="2A1BB479" w:rsidR="00733D7F" w:rsidRDefault="00733D7F">
<w:pPr>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/><w:b/></w:rPr>
<w:t>this is all bold except for the period</w:t>
</w:r>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/></w:rPr>
<w:t>.</w:t>
</w:r>
</w:p>
```
...results in...
```html
<p style="font-family: Helvetica"><b>this is all bold except for the period</b>.</p>
```

1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/136
First greedy hyperlinker error
2018-04-05T17:25:48Z
Alex Theg

Related to #3
I have found the first false hit for the new hyperlink recognizer.
A sentence-starting ellipsis implemented with three periods causes a false hyperlink:
In this:
> "...the lion sleeps tonight."
This gets hyperlinked:
> "...the
Note that this does _not_ happen if an ellipsis character is used. This is fine:
> "…the lion sleeps tonight."

1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/135
"Ambiguous rule match" warning in the terminal
2018-03-28T16:42:03Z
Alex Theg

Hi @wendell, you've fixed that other terminal warning I'd pinged you about in mattermost, but there's another one. It often doesn't show up in short documents like frontmatter and so forth, but it does appear on lots of larger chapter files.
The issue comes from lines 602 and 611 of the `docx-html-extract.xsl` sheet - something to do with the bolding extraction. Could you check this out and make sure it's not causing issues?
Here's an example from Gilbert:
```
converting: b13_ch09
Warning
XTDE0540: Ambiguous rule match for /w:styles/w:style[3]/w:rPr[1]/w:b[1]
Matches both
"element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)[not((data(attribute::attribute(Q{}val))) = ("0", "none"))]" on line 611 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
and "element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)" on line 602 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
Warning
XTDE0540: Ambiguous rule match for /w:styles/w:style[3]/w:rPr[1]/w:b[1]
Matches both
"element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)[not((data(attribute::attribute(Q{}val))) = ("0", "none"))]" on line 611 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
and "element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)" on line 602 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
Warning
XTDE0540: Ambiguous rule match for /w:styles/w:style[3]/w:rPr[1]/w:b[1]
Matches both
"element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)[not((data(attribute::attribute(Q{}val))) = ("0", "none"))]" on line 611 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
and "element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}style)//element(Q{http://schemas.openxmlformats.org/wordprocessingml/2006/main}b)" on line 602 of file:/Users/atheg/Desktop/crawler/header_promotion_strategies/XSweet-staging-85088850446b93f01a458b5995f36b3041096384/scripts/../applications/docx-extract/docx-html-extract.xsl
```

1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/134
Book files loading but not opening in Wax
2018-04-27T22:37:58Z
Alison McGonagle-O’Connell

On editoria-testing.coko.foundation, logged in as admin, I find that all files aren't pulling in via bulk upload.
Expected functionality:
1. Select bulk upload
2. Select all files to load
3. All files promulgate into the expected areas based on file name prefixes
4. Files become accessible for review/editing via wax
Encountered functionality:
1. Select bulk upload
2. Select all files to load
3. Most files promulgate into 'body' despite file name prefixes
4. Files 7, 8, and 9 appear to have loaded onto the book builder
5. Files are not accessible for review/editing via wax
Books:
* Better git it in your soul
* Polemics and patronage - files not loaded were 2, 3, 4 in "body" (note that the files did appropriately promulgate based on prefixes for this title)

Alex Theg

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/133
Numbering from Horton bib
2019-07-07T23:09:30Z
Alex Theg

Sources nested under "Castañeda, Heide" in Horton bib are really an automatically numbered list starting at 2007.
Follow numbering references through to `numbering.xml`; generate plan for extracting.
Only the first bullet gets properly extracted; revisit this once this bug has been fixed (ticket number #106)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/131
Word styles with ' + Not [bold/italics/etc.]" extract incorrectly
2019-07-07T22:07:59Z
Alex Theg

I'm hopeful that this will fix a lot of the remaining formatting problems.
In Word, named styles can be modified by adding or removing formatting, e.g.:
* "Normal + Bold"
* "Emphasis + Not Italic"
These modifications are applied by specifying a `w:val` on the formatting tag. A value of 1 turns the formatting on; a value of 0 turns it off. For example, this means "No Italics": `<w:i w:val="0"/>`.
Adding formatting on top of styling works correctly, but whenever "+ Not[format]" is specified, it's extracted as just the opposite: that formatting turns on when it should turn off. The extraction doesn't take the `val` property into account, but should.
Whenever `w:val="0"` (or `w:val="none"` for underline) appears in any of the following tags...
* `<w:i w:val="0"/>`
* `<w:b w:val="0"/>`
* `<w:u w:val="none"/>`
...then XSweet should turn that formatting off. The formatting could be specified on the paragraph level, or in a span inside a paragraph, so finding it and turning it off might be the trickiest part. Perhaps we could preserve these until all the styling should have been promoted to the `<p style=>`, and then check for the formatting to turn off. Or, is there some tag or way to turn off inline formatting regardless of whether it's already applied or not? If so, that might be simplest.
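To make the on/off rule concrete, here is a small Python sketch (XSweet itself is XSLT, so this is only an illustration). The helper name `is_on` is hypothetical. Besides "0" and "none", it also treats "false" and "off" as off, since those are further off-values that WordprocessingML allows for on/off properties; treating a bare `<w:u/>` as on is an assumption.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OFF_VALUES = {"0", "false", "off", "none"}


def is_on(rpr, tag):
    """Report whether formatting `tag` ("b", "i", "u") is on in a w:rPr.

    Returns None when the element is absent, meaning the run simply
    inherits whatever its named style says.
    """
    el = rpr.find(f"{{{W}}}{tag}")
    if el is None:
        return None
    val = el.get(f"{{{W}}}val")
    # For w:b and w:i a missing w:val means "on".
    return val is None or val not in OFF_VALUES
```

A three-valued result (on, off, inherit) is used because "+ Not Italic" has to override the style explicitly, while an absent tag must leave the style's formatting alone.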
Here are 2 examples from Braggs Ch 1:
## Example 1
This paragraph is erroneously extracted as italic. In the Word doc, there's no visible change of formatting between this paragraph and the previous one, but in fact the named Word style changes from "Normal (Web) + Auto" to "Emphasis + Not Italics":
```xml
<w:p w14:paraId="3E1B900F" w14:textId="6A38EA12" w:rsidR="00CB4328" w:rsidRPr="004F60F8" w:rsidRDefault="00CB4328" w:rsidP="0036427E">
<w:pPr>
<w:spacing w:line="480" w:lineRule="auto"/>
<w:jc w:val="both"/>
</w:pPr>
<w:r w:rsidRPr="004F60F8">
<w:rPr>
<w:rStyle w:val="Emphasis"/>
<w:i w:val="0"/>
</w:rPr>
<w:t>Throughout his works, Césaire</w:t>
</w:r>
<w:r w:rsidR="004A368E" w:rsidRPr="004F60F8">
<w:rPr>
<w:rStyle w:val="Emphasis"/>
<w:i w:val="0"/>
</w:rPr>
<w:t xml:space="preserve">claimed</w:t>
</w:r>
```
In the above, `<w:rStyle w:val="Emphasis"/>` specifies italics (in the CSS extracted to the top of the HTML). Then, the `<w:i w:val="0"/>` (not italics) should turn the italics back off, but it doesn't:
```html
<p style="text-align: justify">
<span style="font-style: italic">
<span class="Emphasis">Throughout his works, Césaire</span>
</span>
<span style="font-style: italic">
<span class="Emphasis"> claimed</span>
</span>
```
## Example 2
Footnote 15 comes through in bold in the HTML but not in Word. Its Word style is "Heading1 + Times New Roman, 10pt, Not Bold". So, the paragraph inherits bolding from its "Heading1" style. Then, whereas `<w:b w:val="0"/>` should remove the bold formatting, it doesn't.
Word XML:
```xml
<w:endnote w:id="15">
<w:p w14:paraId="7FAB95D6" w14:textId="2E2795A2" w:rsidR="0005708E" w:rsidRPr="008B0C6C" w:rsidRDefault="0005708E" w:rsidP="00EF6AD7">
<w:pPr>
<w:pStyle w:val="Heading1"/>
<w:spacing w:before="0" w:beforeAutospacing="0" w:after="0" w:afterAutospacing="0"/>
<w:jc w:val="both"/>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="008B0C6C">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
<w:t xml:space="preserve">Bryan Wagner introduces an insightful analysis</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:b w:val="0"/>
<w:bCs w:val="0"/>
<w:kern w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
<w:lang w:val="fr-FR"/>
</w:rPr>
<w:t xml:space="preserve">of
</w:t>
</w:r>
```
And the HTML, incorrectly bolded:
```html
<div class="docx-endnote" id="en15">
<p class="Heading1" style="margin-top: 0pt; margin-bottom: 0pt; xsweet-outline-level: 0; font-family: Times; font-weight: bold; font-size: 24pt; text-align: justify">
<span style="font-size: 10pt; font-family: Times New Roman">
...
<span style="font-family: Times New Roman; font-size: 10pt">
<lang>Bryan Wagner introduces an insightful analysis </lang>
</span>
<span style="font-family: Times New Roman; font-size: 10pt">
<lang>of </lang>
</span>
```

1.0.0

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/26
Double quotation marks followed by punctuation face the wrong way
2018-03-28T22:10:22Z
Alex Theg

In some instances where punctuation is placed outside of the closing double quotation mark, the UCP cleanup step inserts the wrong directional double quotation mark.
Here are a few examples:
* `“Send $300 to Mexico for $15“,`
* `“Send $300 to Mexico for $15“.`
* `“fraud“?`
It is a bit hard to see, but both quotation marks are left double quotation marks, when in fact the last one should be a right double quotation mark. This occurs regardless of whether the original double quotation marks are directional and/or facing the right way. Punctuation outside the quotes is not technically correct but would still be good to catch.
This does _not_ happen with single quotes, which come through correctly.
I don't know what the best search pattern for this is, but it might be something like:
`double quotation mark + punctuation mark (-,.:;?!)` -> `right double quotation mark + punctuation mark`
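As a rough Python sketch of that search pattern (the function name `close_quote_before_punct` is hypothetical, and the real cleanup lives in XSLT):

```python
import re

# Any double quote (straight, left or right) directly followed by one of
# the punctuation marks named in the pattern above.
_QUOTE_BEFORE_PUNCT = re.compile("[\"\u201c\u201d](?=[-,.:;?!])")


def close_quote_before_punct(text: str) -> str:
    """Turn a double quote that precedes punctuation into U+201D."""
    return _QUOTE_BEFORE_PUNCT.sub("\u201d", text)
```

A quotation that opens with three literal periods would be misread as a closing quote, so this sketch assumes that series of periods have already been converted to the ellipsis character.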
The only punctuation that I can think of that would ever start a quotation is an ellipsis, which wouldn't be covered by the above rule.

1.0.0

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/25
Removing italicized tab at start of paragraph drops one <em> tag
2018-03-28T17:00:13Z
Alex Theg

For Bakker Chapter 1:
There is an italicized tab that starts one of the paragraphs in this chapter. The tab gets removed in the very last Editoria reduce step, but the opening `<em>` tag also gets dropped, leaving only one self-closing `<em/>`. This tag causes an error in Wax and thus the chapter cannot be opened for reading and editing.
Here's what the HTML looks like after the Editoria notes step:
```html
<p class="Default" style="font-family: Helvetica; font-size: 12pt">
<i>
<span class="tab"><!-- tab --></span>
</i>
This historical documentary material...
```
Then, during the Editoria basic step, the `<i>` tags get converted into `<em>`s:
```html
<p class="Default" style="font-family: Helvetica; font-size: 12pt">
<em>
<span class="tab"><!-- tab --></span>
</em>
This historical documentary material...
```
Finally, in the Editoria reduce step, the tab is removed, leaving only an `<em/>` in its place:
```html
<p><em/>This historical documentary material...
```
So, when the Editoria reduce step removes tabs at the beginning of paragraphs, it also needs to remove both formatting tags, if there are any.

1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/130
Remove internal-to-word bookmarks
2018-03-09T00:58:20Z
Alex Theg

Internal Word bookmarks are currently extracted as links (`<a>`s), but they cause problems when they're imported into Wax, and don't serve any useful purpose. Specifically, they prevent Wax from loading at all. Instead of passing them through, we should eliminate them altogether. My vote would be to not even extract them in the first place, rather than extracting them, then scrubbing them out in Typescript. I can't see that they do anything useful in the HTML, so it would be good not to clutter the clean HTML with them.
Bookmarks begin with a `w:bookmarkStart` tag and end with a `w:bookmarkEnd`. Everything between these two tags should be deleted.
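For illustration only, a Python sketch of dropping the bookmark markers from the Word XML before extraction. The function name `drop_bookmarks` is hypothetical. Note the assumption: this removes only the two marker elements and keeps any runs that sit between them, which matches the recommended result in this issue, where the bookmark is empty.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
BOOKMARK_TAGS = {f"{{{W}}}bookmarkStart", f"{{{W}}}bookmarkEnd"}


def drop_bookmarks(root):
    """Remove w:bookmarkStart / w:bookmarkEnd elements in place."""
    # Snapshot the tree first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in BOOKMARK_TAGS:
                parent.remove(child)
    return root
```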
This is an example of the XML from the docx:
```xml
<w:p w14:paraId="39355185" w14:textId="77777777" w:rsidR="001B3A68" w:rsidRDefault="00674026">
<w:pPr><w:pStyle w:val="ChapterTitles"/><w:suppressAutoHyphens/>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/><w:b/><w:bCs/></w:rPr>
</w:pPr><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica"/><w:b/><w:bCs/></w:rPr>
<w:t xml:space="preserve">CHAPTER 1
</w:t>
</w:r>
</w:p>
```
It is initially extracted like this, and remains like this all the way through the editoria basic step:
```html
<h3 class="ChapterTitles" style="font-family: Arial Unicode MS; font-size: 12pt; margin-bottom: 6pt; text-align: center; text-decoration: underline">
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
<b>CHAPTER 1 </b>
</h3>
```
Then, at the editoria reduce step, it's transformed to this:
```html
<h3><a id="docx-bookmark_0"/><a href="#docx-bookmark_0"/>CHAPTER 1</h3>
```
I recommend that the initial extraction completely drop these bookmarks, so the extraction looks like this:
```xml
<w:p w14:paraId="39355185" w14:textId="77777777" w:rsidR="001B3A68" w:rsidRDefault="00674026">
<w:pPr><w:pStyle w:val="ChapterTitles"/><w:suppressAutoHyphens/>
<w:rPr><w:rFonts w:ascii="Helvetica" w:eastAsia="Helvetica" w:hAnsi="Helvetica" w:cs="Helvetica"/><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:rFonts w:ascii="Helvetica"/><w:b/><w:bCs/></w:rPr>
<w:t xml:space="preserve">CHAPTER 1
</w:t>
</w:r>
</w:p>
```

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/24
Eliminate spaces before punctuation
2018-04-06T22:22:06Z
Alex Theg

This issue relates to the UCP cleanup macro (#21).
Any number of spaces before any of the following punctuation marks should be removed:
* ,
* ;
* :
* !
* ?
* )
* ]
* }
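The rule could be sketched in Python like this (the function name `strip_space_before_punct` is hypothetical). One assumption is labeled in the code: the period is not in the list above, but it is included here because this issue's own example removes a space before a period.

```python
import re

# Marks from the list above, plus "." (an assumption based on the
# issue's example, which removes the space in "sentence .").
_SPACE_BEFORE_PUNCT = re.compile(r" +(?=[,;:!?)\]}.])")


def strip_space_before_punct(text: str) -> str:
    """Delete any run of spaces that directly precedes punctuation."""
    return _SPACE_BEFORE_PUNCT.sub("", text)
```

Including the period means a spaced-out run of periods would also lose its leading space, so converting periods to ellipses first may be the safer order.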
E.g. "Here's my sentence ." -> "Here's my sentence."

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/23
Add nonbreaking spaces between initials
2018-02-27T23:10:10Z
Alex Theg

This issue relates to the UCP cleanup macro (#21).
Spaces between initials should be replaced with non-breaking spaces.
The search should look for:
* uppercase letter (A-Z) + period + space + uppercase letter (A-Z) + period
and replace it with:
* uppercase letter (A-Z) + period + nbsp (00A0) + uppercase letter (A-Z) + period
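A minimal Python sketch of this search and replace, with a hypothetical function name. The leading `\b` is an addition not stated in the issue: it keeps the last letter of an all-caps word such as "NASA." from being read as an initial.

```python
import re

NBSP = "\u00a0"

# Initial + period + space, when another initial + period follows.
# The lookahead lets chains such as "J. R. R." match pair by pair.
_INITIALS = re.compile(r"\b([A-Z]\.) (?=[A-Z]\.)")


def nbsp_between_initials(text: str) -> str:
    """Replace the space between two initials with U+00A0."""
    return _INITIALS.sub(r"\1" + NBSP, text)
```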
As a bonus, it would be nice if we could detect initials without spaces between the letters. So, before the above replacement would take place, we should look for the following pattern:
* space +
* any number of repetitions of a capital letter (A-Z) + a period + capital letter + period, etc.
* + a space
and add nbsps between the letters.
THEN, catch the following common abbreviations that would have been erroneously sucked up by the above rule. All of these should _not_ have spaces between the letters:
* U.S.
* D.C.
* A.M.
* P.M.
* A.D.
* B.C.
* B.C.E.
* A.C.E.

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/22
Insert directional single quote before year abbreviations
2018-02-27T23:09:53Z
Alex Theg

This issue relates to the UCP cleanup macro (#21).
A straight apostrophe (u0027) or a left-side single quote (u2018) followed by a numeral (0-9) should be replaced with a right single quote (u2019)
This will ensure that abbreviated years get the right quotation mark.
'97 -> ’97
‘97 -> ’97

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/129
Wax use of screen space in xpub
2018-03-15T17:10:32Z
Dan Morgan

A weird, persistent display problem in xpub wax is still the amount of white space below the text, which leaves only a very small amount of text viewable. (See attached screenshot with my annotations.) Can this be easily tweaked?
![Screen_Shot_2018-02-07_at_9.51.07_AM](/uploads/df3bcfc72f9b97766aa36acc217b663c/Screen_Shot_2018-02-07_at_9.51.07_AM.png)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/128
Ingest issues in xpub testing illustrated here 2
2018-03-15T17:11:11Z
Dan Morgan

Uploaded the attached manuscript [TheBlameGame_Eerland.docx](/uploads/eadeee62a968cc0cc6e26f44450c294f/TheBlameGame_Eerland.docx)
Fatal issues are:
1. Tables (known issue)
2. Numbered lists, including lettered indented sublists e.g. as appears early in attachment above.
The submission is left "as is" in current xpub, to compare against uploaded original.
Some side comments:
Ignoring the issues above, and the unknown with figures/images, this is an imperfect but probably usable (I'll leave that call to real users) ingested document. However, since authors often go to a lot of trouble to format their manuscripts according to guidelines, I am obviously not saying that these imperfections would not annoy the heck out of some people. (Some headings are centered, some are not. Bolding and other emphases have been eliminated.)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/127
Ingest issues in xpub testing illustrated here
2018-03-15T17:10:04Z
Dan Morgan

2 manuscripts ingested into xpub left as is, with source documents attached here, so you can see ingestion issues.
Quality Uncertainty Erodes Trust in Science 2 (https://xpub.coko.foundation/projects/bcc6f388-b35e-46f5-8653-b374002a4e33/versions/67bad662-51e6-425c-ae4b-12018e6ed278/manuscript) Source file: [Vazire-QualityUncertainty2.docx](/uploads/3662e0cdd906b52a1545735edbee4cc9/Vazire-QualityUncertainty2.docx)
Selective attention in inattentional blindness: Selection is specific but suppression is not (https://xpub.coko.foundation/projects/7bd06780-049c-4ab7-8f3d-61f4bee2bc93/versions/550430ee-8b98-40de-9020-94c993b9a078/manuscript) Source file: [WoodSimons_xpub.docx](/uploads/3d2917fbcdf0f80562a35a9cccf0ea1a/WoodSimons_xpub.docx)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/126
New Editoria Typescript With notes
2018-03-09T00:59:43Z
Christos

In this example in the main text there are two notes (the callouts).
`<note data-id="note-d8d5c4dfdfb0a1d448d2e52f4fac6299"></note>`
`<note data-id="note-d24ccd7aa83507285dbdf8c3b2467b8f"></note>`
When a note is created, Wax creates a new editing surface for it, which is an extension of the `container` that we named `note-container`. The important thing in the implementation of notes is that Wax must know which editing surface belongs to each note. In the previous implementation, the content of the note was extracted from a note-content property. Now this is done by matching the callout and the editing surface in the following way.
Each note callout has a data-id property. During the creation of the editing surface, Wax gets this property and adds the prefix `container-`. The resulting value is then added as the `id` of the `<note-container>` tag.
So when you detect a note, you create a callout as before, with a data-id property. The id could be incremental (note-1, note-2) or randomly generated, depending on your implementation, for example `<note data-id='note-1'></note>`. Inside the HTML body tag, right after where id="main" ends, you should create an associated `note-container` for each note, whose id is the data-id of the callout with `container-` as a prefix, as explained above. Inside it, add the HTML for the note in `<p>` tags. So if we have two callouts
`<note data-id='note-1'></note>` and `<note data-id='note-2'></note>`, we create 2 note containers:
`<note-container id="container-note-1"></note-container> `
`<note-container id="container-note-2"></note-container>`
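The matching rule can be sketched in a few lines of Python. Both function names are hypothetical, and this is not Wax or XSweet code; it only shows how a container id is derived from a callout's data-id.

```python
from html import escape


def container_id(data_id: str) -> str:
    """Id of the <note-container> that belongs to a callout's data-id."""
    return "container-" + data_id


def note_container(data_id: str, paragraphs) -> str:
    """Serialize one note container with the note text in <p> tags."""
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return f'<note-container id="{container_id(data_id)}">{body}</note-container>'
```

For example, `container_id("note-1")` gives `container-note-1`, which is the id used in the first container above.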
Also, in the example below the note-containers are grouped in a `<div id="notes">`. This `div` tag is not necessary, although it is nice to have in order to provide organised HTML output.
```
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8"/>
</head>
<body>
<container id="main">
<h1> test </h1>
<p> some <note data-id="note-d8d5c4dfdfb0a1d448d2e52f4fac6299"></note>
text <note data-id="note-d24ccd7aa83507285dbdf8c3b2467b8f"></note> </p>
</container>
<div id="notes">
<note-container id="container-note-d8d5c4dfdfb0a1d448d2e52f4fac6299">
<p>note content par. 1</p>
<p>note content par 2</p>
<p>note content par 3</p>
</note-container>
<note-container id="container-note-d24ccd7aa83507285dbdf8c3b2467b8f">
<p>note content par. 1</p>
<p>note content par 2</p>
<p>note content par 3</p>
</note-container>
</div>
</body>
</html>
```

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/125
New Editoria Typescript
2018-03-09T01:01:13Z
Christos

As it stands now, the HTML that XSweet sends is wrapped in the `body` tag, like below:
```
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8"/>
</head>
<body>
<h1>test</h1>
<p> some test </p>
</body>
</html>
```
From now on, Wax supports multiple editing surfaces, so we have to declare which HTML block goes to which editing surface so that Wax can match them. We decided to call our primary editing surface `main`. So all HTML that would go in the `body` should be wrapped in the following tag. We will follow this up with how different types of containers should be set up (specifically for notes) in a separate issue.
```
<container id="main">
<h1> test </h1>
<p> some test </p>
</container>
```
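As a sketch of that wrapping step in Python (XSweet would do this in XSLT; the function name `wrap_body_in_main` is hypothetical), assuming the document is namespaced XHTML as in the examples here:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def wrap_body_in_main(html_root):
    """Move everything inside <body> into <container id="main">."""
    body = html_root.find(f"{{{XHTML}}}body")
    container = ET.Element(f"{{{XHTML}}}container", {"id": "main"})
    # Carry over any leading text node along with the child elements.
    container.text = body.text
    body.text = None
    for child in list(body):
        body.remove(child)
        container.append(child)
    body.append(container)
    return html_root
```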
Final output would look as follows:
```
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8"/>
</head>
<body>
<container id="main">
<h1> test </h1>
<p> some test </p>
</container>
</body>
</html>
```

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/21
Add UCP cleanup ingest/output macros to Typescript
2021-03-18T16:16:43Z
Alex Theg

UCP runs a series of cleanup macros on book chapters, both when they are ingested before editing, and again at output, to prevent any of the same problems from being accidentally reintroduced. Since these cleanups are to be used at the beginning and the end of the editing process, and since different presses will have slightly different cleanups, these changes should all be housed in a single XSLT sheet.
Here are several cleanups to implement - there may be more to follow. I am checking this with Erich at UCP, so we will probably add to this list going forward:
- [x] Hyphens between numerals should be converted to en dashes: "2-3" -> "2–3"
- [x] Double spaces should be converted to single spaces, anywhere they're found: <pre>"...touches.  However, the..." -> "...touches. However, the..."</pre>
- [x] Spaces around em dashes should be removed (any number of consecutive spaces before or after an em dash)
* "that sentence —as I’ve done" -> "that sentence—as I’ve done"
* "that sentence — as I’ve done" -> "that sentence—as I’ve done"
- [x] Series of periods converted to ellipses
* "..." -> "…"
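The four cleanups above could be sketched as a chain of regular expressions in Python. This is illustrative only, since the actual cleanups are meant to live in one XSLT sheet. The function name `ucp_basic_cleanup` is hypothetical, and treating "series of periods" as three or more consecutive periods is an assumption.

```python
import re


def ucp_basic_cleanup(text: str) -> str:
    """Apply the four ingest cleanups listed above, in order."""
    # Hyphen between numerals -> en dash ("2-3" -> "2–3").
    text = re.sub(r"(?<=\d)-(?=\d)", "\u2013", text)
    # Runs of spaces -> a single space.
    text = re.sub(r" {2,}", " ", text)
    # Any number of spaces before or after an em dash -> none.
    text = re.sub(" *\u2014 *", "\u2014", text)
    # Three or more periods -> ellipsis character (assumed threshold).
    text = re.sub(r"\.{3,}", "\u2026", text)
    return text
```

The numeral rule is deliberately literal, so it would also rewrite hyphens in things like phone numbers; a production rule may need exceptions.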
# Update
I have gotten through a lot of the macro, but it is long. Here are some of the remaining cleanups; I will post the rest tomorrow morning. These are in no special order, and @wendell, you can draw from these and start checking them off as you can. Let me know if any of these need clarification. There are plenty more rules left. The most complicated is probably smart quotes. The macro actually deletes all the "s and 's and re-inserts them, letting Word's auto-formatting determine their direction, which I don't think is an option for us. In any event, more to come!
- [x] Two adjacent hyphens become an em dash: "--" -> "—"
- [x] An en dash surrounded on both sides by spaces should be converted to an em dash: " – " -> " — "
- [x] Equal signs should be surrounded on either side by one and only one space: " = "
- [x] Replace runs of multiple consecutive spaces with just one space
- [ ] ~~Replace runs of multiple consecutive tabs with just one tab~~ Update: we scrub these out anyway in preparation for Wax, so this is not necessary
- [x] Spaces touching tabs should be removed
- [x] Remove spaces at the very beginning and end of `p`s
- [x] Remove tabs that end a paragraph (not ones that start)
- [x] Delete empty paragraphs (I believe we are already doing this)
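As a rough illustration of the checklist above, here is a Python sketch that applies the checked rules to a list of paragraph strings. It is not the XSLT implementation; `more_cleanups` is a hypothetical name, and treating each paragraph as a plain string (with no inline markup) is a simplifying assumption.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"


def more_cleanups(paragraphs):
    """Return cleaned paragraph strings, dropping any that end up empty."""
    cleaned = []
    for text in paragraphs:
        # Two adjacent hyphens become an em dash.
        text = text.replace("--", EM_DASH)
        # An en dash with a space on both sides becomes a spaced em dash.
        text = text.replace(" " + EN_DASH + " ", " " + EM_DASH + " ")
        # Exactly one space on either side of an equal sign.
        text = re.sub(" *= *", " = ", text)
        # Runs of consecutive spaces collapse to one space.
        text = re.sub(" {2,}", " ", text)
        # Spaces touching a tab are removed.
        text = re.sub(" *\t *", "\t", text)
        # Spaces at the very beginning and end of the paragraph are removed.
        text = text.strip(" ")
        # Tabs that end the paragraph are removed; leading tabs are kept.
        text = text.rstrip("\t")
        if text:
            cleaned.append(text)
    return cleaned
```

For example, `more_cleanups(["a--b", "  ", "x=y"])` keeps two paragraphs and drops the one that contained only spaces.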
## Quotation marks
All straight, non-directional single and double quotes should be converted into "smart" directional quotes, depending on context. Since the original macro uses Word's auto-formatting, we'll have to make the rules for determining which direction they should point.
Straight quotation marks:
* u0022: quotation mark
* u0027: apostrophe
Should all be replaced by one of the following:
* u2018: left single quotation mark
* u2019: right single quotation mark
* u201c: left double quotation mark
* u201d: right double quotation mark
### Replacement rules from macro:
- [x] ' -> right or left single quotation mark (u2018 or u2019)
- [x] '' -> right or left double quotation mark (u201c or u201d)
- [x] ` -> right or left single quotation mark (u2018 or u2019)
- [x] `` -> right or left double quotation mark (u201c or u201d)
- [x] em dash+right double quote (u2014+u201d) -> em dash+left double quote (u2014+u201c)
- [x] left double quote+em dash (u201c+u2014) -> right double quote+em dash (u201d+u2014)
The following 3 search patterns should look for a straight single quote or a left single quote and replace it with a right single quote:
- [x] " 'em" or " ‘em" (space+u0027+"em" or space+u2018+"em") -> " ’em" (space+u2019+"em")
- [x] "'n'" or "‘n‘" (u0027+"n"+u0027 or u2018+"n"+u2018) -> "’n’" (u2019+"n"+u2019)
- [x] " 'tis" or " ‘tis" (space+u0027+"tis" or space+u2018+"tis") -> " ’tis" (space+u2019+"tis")
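The em dash and contraction rules above could look roughly like this in Python. This is a sketch rather than the macro's or XSweet's actual logic: `macro_quote_fixes` is a hypothetical name, and the `\b` word boundary after "em" and "tis" is an added assumption (the macro's patterns do not mention one) so that words such as "embrace" are left alone.

```python
import re

LSQ, RSQ = "\u2018", "\u2019"   # left / right single quotation marks
LDQ, RDQ = "\u201c", "\u201d"   # left / right double quotation marks
EM_DASH = "\u2014"


def macro_quote_fixes(text: str) -> str:
    # em dash + right double quote -> em dash + left double quote
    text = text.replace(EM_DASH + RDQ, EM_DASH + LDQ)
    # left double quote + em dash -> right double quote + em dash
    text = text.replace(LDQ + EM_DASH, RDQ + EM_DASH)
    # A straight or left single quote; both are replaced by a right one below.
    straight_or_left = "['" + LSQ + "]"
    text = re.sub(" " + straight_or_left + r"em\b", " " + RSQ + "em", text)
    text = re.sub(straight_or_left + "n" + straight_or_left, RSQ + "n" + RSQ, text)
    text = re.sub(" " + straight_or_left + r"tis\b", " " + RSQ + "tis", text)
    return text
```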
Then:
- [ ] ~~Insert hair space (u200a) btwn pairs of single/double quotes. Note that order of operations matters; this assumes that straight quotes and apostrophes have been replaced with their directional counterparts.~~ update: tracking in #28
* left single quote+left double quote (u2018+u201c)
* left double quote+left single quote (u201c+u2018)
* right single quote+right double quote (u2019+u201d)
* right double quote+right single quote (u201d+u2019)
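For reference, the four pairs above translate into a very small substitution loop. The Python below is a sketch with hypothetical names (`insert_hair_spaces`, `PAIRS`), and it assumes, as the note on order of operations says, that all quotes are already directional when it runs.

```python
HAIR_SPACE = "\u200a"

# The four adjacent quote pairs listed above.
PAIRS = [
    ("\u2018", "\u201c"),  # left single + left double
    ("\u201c", "\u2018"),  # left double + left single
    ("\u2019", "\u201d"),  # right single + right double
    ("\u201d", "\u2019"),  # right double + right single
]


def insert_hair_spaces(text: str) -> str:
    """Insert a hair space between each adjacent quote pair in PAIRS."""
    for first, second in PAIRS:
        text = text.replace(first + second, first + HAIR_SPACE + second)
    return text
```

Running the function twice gives the same result as running it once, because a pair that already has a hair space between its quotes no longer matches.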
### Directional rules
Here are my proposed rules for direction. They would have to be executed before all of the above rules from the macro:
- [x] First, replace all 4 directional quotation marks with their non-directional counterparts:
* u2018 and u2019 -> u0027
* u201c and u201d -> u0022
* also \` and `` to their respective u0027 and u0022
Then:
- [x] apostrophe+alphabetical character (u0027+letter) -> left single quotation mark+alphabetical character (u2018+letter)
- [x] alphabetical character+apostrophe (letter+u0027) -> alphabetical character+right single quotation mark (letter+u2019)
- [x] quotation mark+alphabetical character (u0022+letter) -> left double quotation mark+alphabetical character (u201c+letter)
- [x] alphabetical character+quotation mark (letter+u0022) -> alphabetical character+right double quotation mark (letter+u201d)
In any case, these will probably need some refinement but double check me and let me know what you think!
### Formatting
- [x] Convert underlining to italics
- [ ] ~~Convert bold to italics~~ update: tracking in #29
We currently convert literal `<u>` tags into `<i>`s in the “Editoria basic” step, but that can sometimes get scrubbed out in the “Editoria reduce” step. We should also catch underlining, italics, and bold when they are specified in the CSS style, which we’re not currently doing. Wax looks for an `<em>` tag for italics, so the following should all be converted into text wrapped in `<em>`:
* `<i>`
* `<b>`
* `<u>`
* `<p style="font-weight: bold">`
* `<p style="font-style: italic">`
* `<p style="text-decoration: underline">`
Once this is implemented, we should also update the “Push mappings” to reflect this.
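A minimal sketch of the flattening described above, written in Python with the standard library's `xml.etree.ElementTree` rather than XSLT. The function name `flatten_to_em` is hypothetical, the input is assumed to be well-formed XHTML without namespaces, and the style check is a plain substring match on the three declarations listed above. The sketch leaves the original `style` attribute in place.

```python
import xml.etree.ElementTree as ET

INLINE_TAGS = {"i", "b", "u"}
STYLE_HINTS = (
    "font-weight: bold",
    "font-style: italic",
    "text-decoration: underline",
)


def flatten_to_em(xhtml: str) -> str:
    root = ET.fromstring(xhtml)
    # Literal <i>, <b>, <u> elements are simply renamed to <em>.
    for el in root.iter():
        if el.tag in INLINE_TAGS:
            el.tag = "em"
    # Paragraphs styled bold/italic/underline get their content wrapped in <em>.
    for p in list(root.iter("p")):
        if any(hint in p.get("style", "") for hint in STYLE_HINTS):
            em = ET.Element("em")
            em.text, p.text = p.text, None
            for child in list(p):
                p.remove(child)
                em.append(child)
            p.append(em)
    return ET.tostring(root, encoding="unicode")
```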
- [ ] ~~Force punctuation to match formatting of preceding word~~ tracking in #27
Since we're porting into Wax, we don't need to worry about fonts/font size. The only thing I can think to catch is formatting (italics, bold, underline). And, since all of these should get flattened to `<em>`s, I think this could be as simple as ensuring that if the preceding word is `<em>`, the trailing punctuation is as well. These are the punctuation marks that this rule should apply to:
* ,
* .
* :
* ;
* ?
* !
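If everything is already flattened to `<em>`, the rule above reduces to moving trailing punctuation inside the closing tag. The Python below is a sketch of that idea only; the function name is hypothetical, and it assumes the punctuation immediately follows `</em>` with nothing in between.

```python
import re

# The six punctuation marks listed above.
PUNCTUATION = ",.:;?!"


def pull_punctuation_into_em(html: str) -> str:
    """Move punctuation that directly follows </em> inside the element."""
    pattern = "</em>([" + re.escape(PUNCTUATION) + "]+)"
    return re.sub(pattern, r"\1</em>", html)
```

For example, `<em>word</em>, next` becomes `<em>word,</em> next`.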
### Rules already implemented
The following cleanups don't require any additional coding, since XSweet is handling these as it should already:
* Remove page breaks and section breaks
* Page breaks are extracted as `<br class="br">`, and the pipeline replaces these by breaking paragraphs on `br`s
* Section breaks are dropped, since we’re not explicitly catching them
* Remove any comments: already happens, since we don’t handle them
* Delete headers and footers: we’re already dropping these
* Remove soft hyphens: these do not come through into the HTML.

1.0.0
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/124
Ignore indentation specified at the paragraph level
2018-03-15T16:27:37Z
Alex Theg

See b_Schuster_Captns
In this example, some of the entries have hanging indentation specified on them. That hanging indentation is extracted into the HTML as `margin-left: 67.5pt; text-indent: -67.5pt; padding-left: 67.5pt`. In the Word doc, these entries are flush left for the first line, but hanging on subsequent lines:
![word_doc](/uploads/db9596faf0018aa494fe7eb9caf3c5a3/word_doc.png)
But in the HTML, the entire entry is indented.
![browser_display](/uploads/4ed14d6b36bde033eb16d5483aabe3d8/browser_display.png)
The paragraph is initially extracted with 3 indent-related styles associated with it:
`<p style="margin-left: 67.5pt; padding-left: 67.5pt; text-indent: -67.5pt">`
This properly achieves the hanging indentation on lines after the first one, but the `margin-left: 67.5pt` shouldn't be there, as it improperly indents the whole entry. Removing that declaration moves the whole paragraph back to its proper flush left position.
The `margin-left` comes from the p-level style. I believe this is something that should be ignored entirely.
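The arithmetic behind this can be checked with a short Python sketch. It assumes the usual box-model reading, in which the text block starts at `margin-left + padding-left` and `text-indent` shifts only the first line. It also relies on the `w:ind` values in the Word XML being in twentieths of a point, so 1350 corresponds to 67.5pt. The function names are hypothetical.

```python
def twips_to_pt(twips: float) -> float:
    """Word stores w:ind values in twentieths of a point."""
    return twips / 20


def line_offsets(margin_left=0.0, padding_left=0.0, text_indent=0.0):
    """Return (first_line_offset, other_lines_offset) in points."""
    block_start = margin_left + padding_left
    return block_start + text_indent, block_start


indent = twips_to_pt(1350)  # 67.5

# As extracted: all three declarations present.
as_extracted = line_offsets(indent, indent, -indent)

# With margin-left removed, as proposed above.
without_margin = line_offsets(0.0, indent, -indent)
```

With all three declarations the first line sits at 67.5pt and later lines at 135pt, so the whole entry is shifted right. Without `margin-left` the first line is at 0pt and later lines hang at 67.5pt, which matches the Word layout.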
Word XML:
```xml
<w:p w:rsidR="00F069B3" w:rsidRDefault="00F069B3" w:rsidP="00F069B3">
<w:pPr>
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">Fig. 4 (Ch. 1):
</w:t>
</w:r>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">A credit counselor at
</w:t>
</w:r><w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>Fundación</w:t>
</w:r><w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve"></w:t>
</w:r><w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>Paraguaya</w:t>
</w:r><w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">
working on the
</w:t>
</w:r><w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t>Ikatú</w:t>
</w:r><w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve"></w:t>
</w:r>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>project</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00F069B3" w:rsidRDefault="00F069B3" w:rsidP="00F069B3">
<w:pPr>
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
</w:pPr>
</w:p>
<w:p w:rsidR="00F069B3" w:rsidRDefault="00F069B3" w:rsidP="00F069B3">
<w:pPr><w:ind w:left="1350" w:hanging="1350"/>
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:color w:val="000000"/></w:rPr>
<w:t xml:space="preserve">Fig. 5 (Ch. 2):
</w:t>
</w:r>
<w:r w:rsidRPr="005501F5">
<w:rPr><w:i/><w:color w:val="000000"/></w:rPr>
<w:t>Paraguay Land Company, Limited 1889; To be issued for Land Warrants of the Government of Paraguay, at the rate of 2 fully-paid Shares for each £100 Warrant</w:t>
</w:r>
</w:p>
```
HTML:
```html
<p>Fig. 4 (Ch. 1): <i>A credit counselor at </i><i>Fundación</i><i> </i><i>Paraguaya</i><i> working on the </i>Ikatú <i>project</i></p>
<p/>
<p style="margin-left: 67.5pt; text-indent: -67.5pt; padding-left: 67.5pt">Fig. 5 (Ch. 2): <i>Paraguay Land Company, Limited 1889; To be issued for Land Warrants of the Government of Paraguay, at the rate of 2 fully-paid Shares for each £100 Warrant</i></p>
```
1.0.0