XSweet issues
https://gitlab.coko.foundation/XSweet/XSweet/-/issues
2018-03-15T20:05:00Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/98
"Chapter One - " text dropped from extraction
2018-03-15T20:05:00Z
Alex Theg
Boyles Ch 1 [e._Boyles_Chapter_1.docx](/uploads/86a58da52d6b60a5568a284f3225f592/e._Boyles_Chapter_1.docx)
[e.BoylesChapter1-1EXTRACTED.html](/uploads/dee5c6517f5f46d846e33d0922fa20cb/e.BoylesChapter1-1EXTRACTED.html)
[e.BoylesChapter1-9RINSED.html](/uploads/da01d16780e6f8cc2dd7e8b02877db1c/e.BoylesChapter1-9RINSED.html)
The top level header reads "Chapter One - Introduction" in Word. But the text "Chapter One - " is not coming through from Word into HTML. It seems to have been implemented as some kind of list in Word, but I can't find any evidence of it in the initial extraction.
![Screen_Shot_2017-04-21_at_1.42.36_AM](/uploads/55402036b54e43e7429bb576ce93278c/Screen_Shot_2017-04-21_at_1.42.36_AM.png)
Is there an easy fix?
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/161
Add warning for images not supported by browsers
2019-05-03T18:45:49Z
Alex Theg
Word allows some embedded image formats that are not supported by web browsers (e.g. `.emf`).
When XSweet encounters exotic image formats like this, it may be worthwhile to add some kind of warning, be it:
* a warning to the STDOUT for the processor
* an inline comment in the HTML
* perhaps even a literal DOM element warning
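The first option in the list above (a warning from the processor) could be sketched roughly as follows. This is only an illustration, not XSweet code: the function names and the `BROWSER_SAFE` set are assumptions, and the sketch relies on Word's usual convention of storing embedded images under `word/media/` inside the .docx package. It writes to STDERR rather than STDOUT so the warning cannot mix with converted output; that is a choice made here, not something the issue specifies.

```python
import sys
import zipfile
from pathlib import PurePosixPath

# Assumption: extensions treated as displayable by common browsers.
BROWSER_SAFE = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}


def unsupported_media(names):
    """Return package entries under word/media/ whose extension is not browser-safe."""
    flagged = []
    for name in names:
        path = PurePosixPath(name)
        in_media = path.parts[:2] == ("word", "media")
        if in_media and path.suffix.lower() not in BROWSER_SAFE:
            flagged.append(name)
    return flagged


def warn_unsupported(docx_path, stream=sys.stderr):
    """Print one warning per exotic image found in the .docx and return the list."""
    with zipfile.ZipFile(docx_path) as package:
        flagged = unsupported_media(package.namelist())
    for name in flagged:
        print(f"WARNING: {name} may not display in web browsers", file=stream)
    return flagged
```

The same list could feed the other two options, for instance by emitting an HTML comment next to each affected `<img>`.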
See https://gitlab.coko.foundation/xpub/xpub-epmc/issues/63 for an example.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/180
all paragraphs come in as H2s
2023-02-21T05:30:35Z
Dan Visel
The attached .DOCX is a Kotahi test file; it comes in with all of its paragraphs as H2s (see screenshot of source inside of Kotahi). The same thing happens if I run it through http://pdf2html.cloud68.co/, which makes me think that this is XSweet – there's something about the file (the source of which I don't know) that's encoded incorrectly.
I don't have MS Word on my computer, but opening it up in Mac TextEdit shows some weirdness – all paragraphs are right-aligned, which is clearly incorrect. If I open it in Apple Pages, it looks more or less how I would expect it to.
This particular file isn't very important, but because Kotahi is processing a lot of Word docs coming from strange sources, we sometimes run into bugs that feel similar. (Most recently: display math is incorrectly coming in as H4s.) I don't know what they did to the DOCX to make it behave this way, though it would be nice if we could handle it?
[BodyMass.docx](/uploads/bcb5a0e6d8b6d92cf7c54875ac904f05/BodyMass.docx)
![Screen_Shot_2022-09-26_at_12.27.32_PM](/uploads/6983dda40058a13cc67808fc64f180d0/Screen_Shot_2022-09-26_at_12.27.32_PM.png)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/184
Backslashes in math equations are not being converted correctly
2023-10-02T12:15:31Z
Ryan Dix-Peek
What seems to be happening is that backslashes are getting escaped. For example, `\sin x` turns into `\\sin x`. There were some fixes recently integrated into Kotahi via [this MR](https://gitlab.coko.foundation/kotahi/kotahi/-/merge_requests/987) that were addressing this, but this should be handled during conversion.
Example: https://kotahi.kotahidev.cloud68.co/kotahi/versions/7fb2d295-046a-40fc-8faf-13de3d3a5b10/decision
![Screenshot_2023-10-02_at_08.56.22](/uploads/9c99571b3f070476513dfc24040af287/Screenshot_2023-10-02_at_08.56.22.png)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/49
Boxes, borders and rules
2022-04-22T04:57:40Z
Wendell Piez
We haven't seen any cases of boxes, borders or rules, but that doesn't mean they won't show up.
Since they can be expressed in CSS, there's an argument we should be capturing them.
We may have to create an artificial test sample with some made-up stuff just so we can get a look.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/76
Brinton Ch 2: an incorrect h4 promotion
2019-07-07T22:22:05Z
Alex Theg
Brinton Ch 2: [b02_Brinton_Chapter_2.docx](/uploads/6fbc8fffc4a354e2f90e8c6cbdfae434/b02_Brinton_Chapter_2.docx)
[output_brinton_ch_2.zip](/uploads/fe7a849192953b6e043e6dc1681709b8/output_brinton_ch_2.zip)
This bit gets promoted to an h4, but it's not a heading:
`<h4 class="FootnoteText">According to an interview done with Sha‘rawi:</h4>`
Can you tell why? It's listed as FootnoteText, not sure if that contributes to it or not.
There is also another empty tag that gets promoted to an h4, but since it's empty it gets removed in typescript (there's a request in to move the empty header removal into header promotion itself - #55).
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/78
Buchbinder Ch 2 headings don't get promoted
2018-03-14T20:41:17Z
Alex Theg
Compare Chapters 1 and 2 of Buchbinder
[c_Buchbinder_Chap_1.docx](/uploads/bca82798d963a32c026fecfcd868aa76/c_Buchbinder_Chap_1.docx)
[output_Buchbinder_Chap_1.zip](/uploads/c4d1c813323a451c901734e12f6649af/output_Buchbinder_Chap_1.zip)
[c_Buchbinder_Chap_2.docx](/uploads/623a4ca2c779cd0571cac9bd4b0507fa/c_Buchbinder_Chap_2.docx)
[output_Buchbinder_Chap_2.zip](/uploads/e0e37f742260d239b6e89a7129733a90/output_Buchbinder_Chap_2.zip)
Header promotion works well in Ch 1, but misses all the headings beyond the very top level in Ch 2. In Ch 1, everything that gets promoted is genuinely a header.
In Ch 2, the Chapter title and subtitle get caught as h1s, but none of the other headings get promoted. There are 25 h2 promotions, but they are all either entirely empty, filled only with tabs, or a visual divider made of asterisks.
Any ideas as to the differences between chapters 1 and 2 that cause 1 to work well but 2 to be less accurate? It doesn't appear to be Word styles.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/149
Capture ordered list start value
2020-06-04T10:07:50Z
Alex Theg
Extract ordered list start value with list.
@atheg to examine .docx files and post the OOXML format here.
Before this we need to actually extract list types (#148). We need numbered lists in the first place for this to matter.
After this is complete, evaluate whether we also need to have a `continue` list property, as @GitBruno suggests in #106.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/72
Catch "Subtitle" styles
2018-03-15T17:27:31Z
Alex Theg
See #71 for Boyles files
The chapter's subtitle ("A Startling Turn of Events") doesn't get promoted to a heading. There are 2 issues here:
1. It is labeled as a Word style "Subtitle." This should be one of the styles that XSweet pays attention to for consideration as a header. In this entire book, the author's used styles h1-4 to mark the different heading levels correctly. I think the conversion pipeline is noting this, as all the properly labeled headers are coming through accurately, so it would be good to stir class="Subtitle" into the mix too.
2. (more of a note to self) Even if the author hadn't styled this, it's short and centered, so should be caught by the "if it's centered and short it's a header" rule, once that is implemented.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/77
Chapter headings come through in blue
2017-08-16T19:56:39Z
Alex Theg
In Buchbinder's book, all of the chapter headings come through into the HTML with blue coloring. It seems to be caused by a `style="color: auto"` attribute. The headings display as black in Word, but light blue in browsers (Chrome and Firefox).
Here are examples of what's causing it:
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Introduction</h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter One </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">The Bottom of the Funnel </h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter Three </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Sticky Brains </h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter Four </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Treating the Family</h1>
```
```html
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Chapter Five </h1>
<h1 class="Subtitle" style="color: auto; font-family: Garamond; font-size: 14pt; font-weight: bold; margin-bottom: 0pt">Locating Pain in Societal Stress</h1>
```
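One possible cleanup, sketched here purely as an illustration (the function name is hypothetical and this is not how XSweet currently works), is to drop the `color: auto` declaration from the inline style while leaving every other declaration untouched. `auto` is not a valid value for the CSS `color` property, so removing it should not discard any real styling information.

```python
import re


def drop_auto_color(style: str) -> str:
    """Remove a 'color: auto' declaration from an inline CSS style string."""
    declarations = [d.strip() for d in style.split(";") if d.strip()]
    kept = [
        d for d in declarations
        # fullmatch ensures 'background-color: auto' and similar are left alone
        if not re.fullmatch(r"color\s*:\s*auto", d, flags=re.IGNORECASE)
    ]
    return "; ".join(kept)


heading_style = (
    "color: auto; font-family: Garamond; font-size: 14pt; "
    "font-weight: bold; margin-bottom: 0pt"
)
print(drop_auto_color(heading_style))
```

Applied to the style attribute from the examples above, this leaves only the font and margin declarations.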
I'll mark this as low priority, since this is an html-only improvement. The color gets scrubbed out by typescript.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/67
Check that headers promoted to the same level are internally consistent
2018-03-15T17:44:14Z
Alex Theg
b Bowen Chapter 1: [b_Bowen_Chapter_1.docx](/uploads/1d451dda0fc0a1b5a2c5afc8f93eeb60/b_Bowen_Chapter_1.docx)
Output: [output_b_Bowen_Chapter_1.zip](/uploads/fd1d441a66ac8f328a7cf3922661a6a0/output_b_Bowen_Chapter_1.zip)
Overall this chapter comes through quite nicely, and it offers an interesting case for header promotion.
The only headers this has are 1) a chapter title, and 2) 11 same-level headings in the body of the text.
1. The chapter title is bold and centered
2. The other headings are bold but not centered
Beyond that, the text is the same as the rest of the content. Word styles don't seem to factor into this one. So, here we have two levels of headers that are different from each other only in that one is centered and one is not. I think this is an important clue that these are different heading levels.
I've just proposed a step to check whether different heading levels are formatted the same, and combine them if they are (#64). The other side of that coin would be a step that ensures that all the headings promoted to the same level are *internally* consistent on some key criteria, and I think text alignment is one of those points.
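As a rough illustration of the proposed consistency check, assuming the promotion step can hand over each promoted heading as a `(level, alignment)` pair (a hypothetical data shape, not XSweet's actual representation), the check could look like this:

```python
from collections import defaultdict


def inconsistent_levels(headings):
    """Map each heading level to its alignments when it has more than one.

    `headings` is an iterable of (level, alignment) pairs, e.g. (2, "center").
    """
    seen = defaultdict(set)
    for level, alignment in headings:
        seen[level].add(alignment)
    return {level: sorted(found) for level, found in seen.items() if len(found) > 1}


# Bowen Ch 1 as described above, if all twelve headings landed on one level:
bowen = [(2, "center")] + [(2, "left")] * 11
print(inconsistent_levels(bowen))
```

A level that shows up in the result would be a candidate for splitting into two heading levels.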
What do you think?
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/145
Collapse adjacent and repeated inline formatting tags
2018-07-27T04:46:12Z
Bruno Herfst (hello@brunoherfst.com)
Extracted Source:
`<span style="font-style: italic"><span class="My Italic Style">italicised</span></span>`
Result:
`<em><em>italicised</em></em>`

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/173
Copyediting cleanups are not suitable for Spanish language
2022-05-16T09:15:37Z
Sofia Olguin
## Context
The [HTMLevator copyediting cleanups](https://xsweet.org/documentation/htmlevator/) do the following:
>Any number of spaces before or after em dashes are removed
This is suitable for English language texts, but in Spanish this generates errors in all the dialogues. In Spanish, dialogue is written like this:
—Hola —dijo el joven.
To reproduce:
- Upload the attached Word file in Editoria.
- Check the chapter in Editoria and see how the space characters disappear.
[dashSpace.docx](/uploads/7af2653f4e107dcf0ba33789e6b1ceb7/dashSpace.docx)
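A minimal sketch of the problem and of one way a language-aware switch might look. The function name, the `lang` parameter, and the `TIGHT_DASH_LANGS` set are all hypothetical; HTMLevator itself is XSLT, so this only illustrates the behaviour, not the implementation.

```python
import re

# Hypothetical configuration: languages whose style removes spaces around em dashes.
TIGHT_DASH_LANGS = {"en"}


def clean_em_dashes(text: str, lang: str = "en") -> str:
    """Strip spaces around em dashes only for languages configured to want that."""
    primary = lang.split("-")[0].lower()
    if primary not in TIGHT_DASH_LANGS:
        return text  # e.g. Spanish dialogue keeps the space before the dash
    return re.sub(r"[ \t]*—[ \t]*", "—", text)


line = "—Hola —dijo el joven."
print(clean_em_dashes(line, "en"))  # space before the second dash is removed
print(clean_em_dashes(line, "es"))  # line is returned unchanged
```

A house style guide could be handled the same way, by keying the rule on a publisher setting instead of (or in addition to) the language.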
## Suggested solution
The HTMLevator copyediting cleanups should be configurable to support:
* different use cases between languages
* different use cases between editorial house style guides
Dione Mentis (dione@coko.foundation)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/142
CSS for hanging paragraphs
2018-08-07T14:24:43Z
Alex Theg
XSweet extracts regular paragraph indentation from Word into CSS correctly, but it needs a tweak to how it handles hanging paragraphs.
Indentation without hanging works great:
One indent no hanging: `<w:ind w:left="720"/>` -> `<p style="margin-left: 36pt">`
Two indent no hanging: `<w:ind w:left="1440"/>` -> `<p style="margin-left: 72pt">`
But the indentation with hanging needs another CSS property to be correct:
One indent hanging:
`<w:ind w:left="1440" w:hanging="720"/>` -> `<p style="padding-left: 36pt; text-indent: -36pt">`
It needs a `margin-left: 36pt;` added in addition to what's already there to be correct.
Two indent hanging:
`<w:ind w:left="2160" w:hanging="720"/>` -> `<p style="padding-left: 36pt; text-indent: -36pt">`
It needs a `margin-left: 72pt;` added in addition to what's already there and then it's correct.
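The mapping described above can be summarised as: `margin-left` is `w:left` minus `w:hanging`, while `padding-left` and the negative `text-indent` both equal `w:hanging`. The sketch below encodes that rule; it is an illustration with a hypothetical function name, not XSweet's XSLT. OOXML `w:ind` values are in twentieths of a point, which matches the 720 to 36pt conversion in the examples.

```python
TWIPS_PER_POINT = 20  # w:ind values are expressed in twentieths of a point


def _pt(twips: int) -> str:
    """Format a twips value as a CSS length in points."""
    return f"{twips / TWIPS_PER_POINT:g}pt"


def indent_css(left: int = 0, hanging: int = 0) -> str:
    """Build the CSS for a <w:ind w:left=... w:hanging=.../> element."""
    if hanging == 0:
        return f"margin-left: {_pt(left)}"
    return (
        f"margin-left: {_pt(left - hanging)}; "
        f"padding-left: {_pt(hanging)}; "
        f"text-indent: -{_pt(hanging)}"
    )


print(indent_css(left=720))                # one indent, no hanging
print(indent_css(left=1440, hanging=720))  # one indent, hanging
print(indent_css(left=2160, hanging=720))  # two indents, hanging
```

With this rule the two hanging cases differ only in `margin-left` (36pt versus 72pt), which is the correction requested above.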
Here's a test docx: [hanging.docx](/uploads/459ecfb10d4e6c42caf16f4983c52142/hanging.docx)
1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/174
Deploy an example service
2022-04-28T05:54:42Z
Adam Hyde (adam@coko.foundation)
We have long needed a web-based service to run from (linked from) xsweet.org where folks can test uploading a docx file and see the results.
Ryan Dix-Peek

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/187
document with watermark isn't parsed correctly
2024-03-18T04:47:28Z
Dan Visel
A Kotahi client submitted a document with a watermark on the first page (not attached because it's a client document). This doesn't go through XSweet (testing was done on XSweet without Kotahi); if the watermark is removed, the document does go through.
Not sure if this is a particular type of watermark that's causing problems – if I make a test document in LibreOffice with a watermark and send it through XSweet, it works.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/178
Doesn't handle images correctly when converting to html
2022-07-05T04:09:56Z
Anna Khapsasova
**Problem:**
When we convert docx to HTML, the XSL builds the wrong path to the media folder with images. As a result, the HTML doesn't contain images.
**Version:**
We are using the `pubsweet/job-xsweet:1.5.4` image, which contains this issue.
**Fix:**
To fix it, you need to change `docx-html-extract.xsl`.
Please replace the line
`<xsl:variable name="docx-base" select="resolve-uri('.', document-uri(/))"/>`
with
`<xsl:variable name="docx-base" select="substring-before(resolve-uri('.', document-uri(/)), '/word')"/>`
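To make the effect of the proposed expression concrete, here is a small Python emulation of XPath's `substring-before`, which returns the part of a string before the first occurrence of a marker, or the empty string when the marker is absent. The example URIs are hypothetical; the sketch assumes the document being transformed is `word/document.xml` inside an unpacked .docx, so that `resolve-uri('.', document-uri(/))` ends in `/word/`.

```python
def substring_before(value: str, marker: str) -> str:
    """Emulate XPath substring-before(): text before the first match, else ''."""
    if not marker:
        return ""
    head, found, _ = value.partition(marker)
    return head if found else ""


# Hypothetical base URI of an unpacked docx's word/ folder:
base = "file:/tmp/job/unzipped/word/"
print(substring_before(base, "/word"))

# Caveat: the FIRST '/word' wins, so an earlier path segment that starts
# with 'word' would cut the base short:
print(substring_before("file:/home/wordsmith/job/word/", "/word"))
```

The second call shows a possible edge case worth checking before adopting the fix as-is: if any parent folder name begins with `word`, the computed base would be truncated too early.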
**Request**
This solution has already been tested and builds the correct path for images. Could you kindly fix the file and update the image with this fix?
Unfortunately, I don't have permissions to create a branch and pull request inside the project.
Kind regards,
Anna
Suki Venkat

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/87
Don't distinguish between otherwise equivalent style signatures on "text-align: left" alone
2019-07-07T22:39:46Z
Alex Theg
Garcia bibliography: [e_BibliographyGGv6.docx](/uploads/c994d731e262919ab6c4a84d8df7ec66/e_BibliographyGGv6.docx)
Alex, check this for entries and lines that are promoted to headers. Hopefully considering the display text differently (#81) will stop some of the spurious promotions.
1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/114
Ensure endnotes appear with their numbers at the end
2017-08-16T23:29:59Z
Alex Theg
See b_04_ch_3_Bakker, and #2, where this issue was first reported:
The inline callout is fine, but the third endnote doesn’t show its number next to the text of the note at the end of the document. This is an error in the Word doc, which comes through into the html. Clicking the inline note callout ("3") in the html takes you to the correct corresponding note, but it's not labeled "3". This is because the note is missing its `<w:endnoteRef/>` in the OOXML:
```xml
<w:r>
<w:rPr><w:sz w:val="24"/><w:szCs w:val="24"/><w:vertAlign w:val="superscript"/></w:rPr><w:endnoteRef/></w:r>
<w:r>
```
All the other note references have this `<w:endnoteRef/>` tag. This element is extracted into the html as `<span class="endnoteRef">`, and that’s where the corresponding note number gets inserted.
To fix this, we could insert this `<span class="endnoteRef">` into the HTML in the proper place (inside `<div class="docx-endnote">`) for every note, even if it doesn't exist. Since XSweet renumbers the notes, it should have a list of the notes anyway.
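A rough sketch of that validation step, assuming the extracted notes are well-formed XHTML, that each note is a `<div class="docx-endnote">`, and that the reference belongs at the start of the note's first `<p>`. The wrapper class `docx-endnotes`, the insertion position, and the function name are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def ensure_endnote_refs(xhtml: str) -> str:
    """Add a numbered <span class="endnoteRef"> to any endnote that lacks one."""
    root = ET.fromstring(xhtml)
    notes = [d for d in root.iter("div") if d.get("class") == "docx-endnote"]
    for number, note in enumerate(notes, start=1):
        if any(s.get("class") == "endnoteRef" for s in note.iter("span")):
            continue  # this note already carries its reference
        target = note.find("p")
        if target is None:
            target = note  # no paragraph: fall back to the note div itself
        ref = ET.Element("span", {"class": "endnoteRef"})
        ref.text = str(number)
        ref.tail = target.text  # keep the existing leading text after the span
        target.text = None
        target.insert(0, ref)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<div class="docx-endnotes">'
    '<div class="docx-endnote"><p><span class="endnoteRef">1</span> First note.</p></div>'
    '<div class="docx-endnote"><p> Second note.</p></div>'
    '</div>'
)
print(ensure_endnote_refs(sample))
```

Numbering by document order mirrors the observation above that XSweet already renumbers the notes.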
I'll put this on hold as a validation step that can come after 1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/141
Extract a default font from Word docs
2018-04-24T06:41:30Z
Alex Theg
I believe Word applies the "Normal" style to text by default, when no other Style is specified. We could extract and apply the default font specified for text upon which no other font is specified. Currently, this text displays in the browser's default font.
Putting this on hold as a future development.
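For when this is picked up, a minimal sketch of where the default font might be read from in `word/styles.xml`: first the default paragraph style (normally "Normal"), then the document defaults. The function name is hypothetical, and the sketch only looks at the `w:ascii` attribute of `w:rFonts`; theme fonts (`w:asciiTheme`) would need an extra lookup in the theme part, which is not attempted here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def default_font(styles_xml: str):
    """Return the default font name from the text of word/styles.xml, or None."""
    root = ET.fromstring(styles_xml)
    # 1. The paragraph style flagged as default (usually styleId="Normal").
    for style in root.findall("w:style", NS):
        is_paragraph = style.get(f"{{{W}}}type") == "paragraph"
        is_default = style.get(f"{{{W}}}default") in ("1", "true")
        if is_paragraph and is_default:
            fonts = style.find("w:rPr/w:rFonts", NS)
            if fonts is not None and fonts.get(f"{{{W}}}ascii"):
                return fonts.get(f"{{{W}}}ascii")
    # 2. Fall back to the document-wide run property defaults.
    fonts = root.find("w:docDefaults/w:rPrDefault/w:rPr/w:rFonts", NS)
    if fonts is not None:
        return fonts.get(f"{{{W}}}ascii")
    return None
```

The returned name could then be emitted once, for example as a `font-family` on the document body, instead of being repeated on every paragraph.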