XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2024-03-18T04:47:28Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/187
document with watermark isn't parsed correctly
2024-03-18T04:47:28Z
Dan Visel

A Kotahi client submitted a document with a watermark on the first page (not attached because it's a client document). This doesn't go through XSweet (testing was done on XSweet without Kotahi); if the watermark is removed, the document does go through.
Not sure if this is a particular type of watermark that's causing problems – if I make a test document in LibreOffice with a watermark and send it through XSweet, it works.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/178
Doesn't handle images correctly when converting to html
2022-07-05T04:09:56Z
Anna Khapsasova

**Problem:**
When we convert docx to html, the XSL builds the wrong path to the media folder with the images. As a result, the html doesn't contain the images.
**Version:**
We are using the pubsweet/job-xsweet:1.5.4 image, which contains this issue.
**Fix:**
To fix it, you need to change `docx-html-extract.xsl`.
Please replace the line
<xsl:variable name="docx-base" select="resolve-uri('.', document-uri(/))"/>
to
<xsl:variable name="docx-base" select="substring-before(resolve-uri('.', document-uri(/)), '/word')"/>
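To see what the change does to the path, here is a rough Python emulation of the two XPath expressions. It is only a sketch: the helper names and the sample URI are made up for illustration, and it assumes the source document is `word/document.xml` inside the unzipped docx, so that `resolve-uri('.', ...)` yields the `word/` folder.

```python
def resolve_dot(document_uri):
    """Approximate resolve-uri('.', document-uri(/)): the folder holding the document."""
    return document_uri[: document_uri.rfind("/") + 1]


def substring_before(text, marker):
    """Mimic XPath substring-before(): text before the first marker, or '' if absent."""
    head, found, _ = text.partition(marker)
    return head if found else ""


# Hypothetical location of an unzipped docx (an assumption, not a path from the report)
uri = "file:///tmp/job/unzipped/word/document.xml"

old_base = resolve_dot(uri)                            # original expression
new_base = substring_before(resolve_dot(uri), "/word")  # proposed expression

print(old_base)  # file:///tmp/job/unzipped/word/
print(new_base)  # file:///tmp/job/unzipped
```

Note that `substring-before` returns an empty string when the marker is not found, so the proposed expression relies on the document sitting under a `word` folder.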
**Request**
This solution has already been tested and builds the correct path for images. Could you kindly fix the file and update the image with this fix?
Unfortunately I don't have permissions to create a branch and pull request inside the project.
Kind regards,
Anna

Suki Venkat

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/87
Don't distinguish between otherwise equivalent style signatures on "text-align: left" alone
2019-07-07T22:39:46Z
Alex Theg

Garcia bibliography: [e_BibliographyGGv6.docx](/uploads/c994d731e262919ab6c4a84d8df7ec66/e_BibliographyGGv6.docx)
Alex, check this for entries and lines that are promoted to headers. Hopefully considering the display text differently (#81) will stop some of the spurious promotions.

1.0.0
Wendell Piez

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/9
Don't promote ornamental breaks to headers
2018-08-07T13:01:26Z
Alex Theg

The most common ornamental divider authors use is a series of asterisks, which may or may not be separated by spaces or tabs.
* "***"
* "*****"
* "* * *"
* `*<span class="tab"></span>*`
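The variants listed above could be covered by one regular expression. This is a minimal sketch, not HTMLevator's implementation: it assumes the paragraph has already been flattened to plain text with each `<span class="tab">` represented as a tab character, and that a divider needs at least two asterisks. The name `is_ornamental_break` is hypothetical.

```python
import re

# Two or more asterisks, optionally separated by spaces or tabs, and nothing else.
ORNAMENT = re.compile(r"^\s*\*(?:[ \t]*\*)+\s*$")


def is_ornamental_break(text):
    """Return True when the paragraph text is only an asterisk divider."""
    return bool(ORNAMENT.match(text))


for sample in ["***", "*****", "* * *", "*\t*", "* a footnote marker", "**bold**"]:
    print(repr(sample), is_ornamental_break(sample))
```

The first four samples match and the last two do not, since they contain characters other than asterisks, spaces and tabs.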
Paragraphs containing only such a pattern (with any number of spaces or tabs between asterisks) should be excluded from consideration for promotion to headings.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/114
Ensure endnotes appear with their numbers at the end
2017-08-16T23:29:59Z
Alex Theg

See b_04_ch_3_Bakker, and #2, where this issue was first reported:
The inline callout is fine, but the third endnote doesn’t show its number next to the text of the note at the end of the document. This is an error in the Word doc, which comes through into the html. Clicking the inline note callout ("3") in the html takes you to the correct corresponding note, but it's not labeled "3". This is because the note is missing its `<w:endnoteRef/>` in the OOXML:
```xml
<w:r>
<w:rPr><w:sz w:val="24"/><w:szCs w:val="24"/><w:vertAlign w:val="superscript"/></w:rPr><w:endnoteRef/></w:r>
<w:r>
```
All the other note references have this `<w:endnoteRef/>` tag. This element is extracted into the html as `<span class="endnoteRef">`, and that’s where the corresponding note number gets inserted.
To fix this, we could insert this `<span class="endnoteRef">` into the html in the proper place (inside `<div class="docx-endnote">` for every note) even if it doesn’t exist. Since XSweet renumbers the notes, it should have a list of the notes anyway.
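The proposed repair could look roughly like this. It is a sketch in Python rather than XSLT, and it rests on assumptions not confirmed above: the html is well-formed and un-namespaced, each note is a `<div class="docx-endnote">` whose text sits in a first `<p>`, and the placeholder belongs at the start of that paragraph. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def ensure_endnote_refs(html_text):
    """Add an empty span.endnoteRef to every endnote div that lacks one."""
    root = ET.fromstring(html_text)
    for div in root.iter("div"):
        if div.get("class") != "docx-endnote":
            continue
        if any(s.get("class") == "endnoteRef" for s in div.iter("span")):
            continue  # this note already carries its placeholder
        target = div.find("p")
        if target is None:
            target = div
        span = ET.Element("span", {"class": "endnoteRef"})
        # Keep the existing leading text, now following the new span.
        span.tail = target.text
        target.text = None
        target.insert(0, span)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<body>'
    '<div class="docx-endnote"><p><span class="endnoteRef"/>First note</p></div>'
    '<div class="docx-endnote"><p>Third note, missing its reference</p></div>'
    '</body>'
)
print(ensure_endnote_refs(sample))
```

Running the function a second time leaves the output unchanged, so it could sit in the pipeline as a validation step without harming documents that are already correct.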
I'll put this on hold as a validation step that can come after 1.0.0.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/18
Error from 2-in-1 detect list sheet / Saxon version issue
2018-12-19T20:11:42Z
Alex Theg

@wendell I've got a question I'm stuck on for you:
I modified the `itemize-lists` sheet to add these detected itemized lists and wrap them in `ol`s. I also chained the two other sheets into one `DETECT-ITEMIZE-lists.xsl` sheet. That works in the IDE but not with my scripts. It seems to be a Saxon version issue: the IDE uses Saxon-HE 9-8-0-12, but the scripts use SaxonHE 9-8-0-1.
Using the original SaxonHE 9-8-0-1 processor with the scripts, I get this error:
```
Error at char 12 in xsl:variable/@select on line 42 column 48 of DETECT-ITEMIZE-lists.xsl:
FOXT0002: The transform option xslt-version is higher than the XSLT version supported by
this processor
at xsl:apply-templates (file:/Users/atheg/Desktop/lists_develop/recent_commit/test/XSweet_runner_scripts/XSweet-master-36ec4971e6213e2891146e67b2be5efe570a4484/scripts/../applications/htmlevator/applications/list-detect/DETECT-ITEMIZE-lists.xsl#28)
processing /xsw:transform
The transform option xslt-version is higher than the XSLT version supported by this processor
```
Next, by swapping in 9-8-0-12 or higher in the scripts, the above complaint goes away, but then I get a new one about the UCP cleanups, which don't process:
```
Error at char 8 in xsl:sequence/@select on line 283 column 79 of ucp-text-macros-new.xsl:
FORX0002: Syntax error at char 8 in regular expression: Expected '{' after \112
at xsl:apply-templates (file:/Users/atheg/Desktop/lists_develop/recent_commit/test/XSweet_runner_scripts/XSweet-master-36ec4971e6213e2891146e67b2be5efe570a4484/scripts/../applications/htmlevator/applications/ucp-cleanup/ucp-text-macros-new.xsl#269)
processing xsw:sequence/xsw:match[5]
at xsl:apply-templates (file:/Users/atheg/Desktop/lists_develop/recent_commit/test/XSweet_runner_scripts/XSweet-master-36ec4971e6213e2891146e67b2be5efe570a4484/scripts/../applications/htmlevator/applications/ucp-cleanup/ucp-text-macros-new.xsl#503)
processing xsw:sequence
at xsl:apply-templates (file:/Users/atheg/Desktop/lists_develop/recent_commit/test/XSweet_runner_scripts/XSweet-master-36ec4971e6213e2891146e67b2be5efe570a4484/scripts/../applications/htmlevator/applications/ucp-cleanup/ucp-text-macros-new.xsl#269)
processing sequence/munge-quotes[1]
at xsl:apply-templates (file:/Users/atheg/Desktop/lists_develop/recent_commit/test/XSweet_runner_scripts/XSweet-master-36ec4971e6213e2891146e67b2be5efe570a4484/scripts/../applications/htmlevator/applications/ucp-cleanup/ucp-text-macros-new.xsl#186)
processing sequence
in built-in template rule for /html/body[1]/div[1]/p[1] in the unnamed mode
in built-in template rule for /html in the unnamed mode
Syntax error at char 8 in regular expression: Expected '{' after \112
Error on line 1 column 1 of List_test_2-11UCPTEXTED.xhtml:
SXXP0003: Error reported by XML parser: Premature end of file.
org.xml.sax.SAXParseException; systemId: file:/Users/atheg/Desktop/lists_develop/recent_commit/test/XSweet_runner_scripts/XSweet-master-36ec4971e6213e2891146e67b2be5efe570a4484/scripts/../outputs/lists/List_test_2-11UCPTEXTED.xhtml; lineNumber: 1; columnNumber: 1; Premature end of file.
```
What I've done for the moment, to get it working (and it seems to), is remove the offending line (40) from the `DETECT-ITEMIZE-lists.xsl` sheet, and stick with 9-8-0-1:
```xml
'xslt-version' : xs:decimal($xslt-spec/@version),
```
2 questions:
1. What have I broken by doing this? :)
2. Any thoughts on the best way to fix the errors so I can add this back in? Probably by updating Saxon and the UCP macro sheet?
Thanks Wendell.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/141
Extract a default font from Word docs
2018-04-24T06:41:30Z
Alex Theg

I believe Word applies the "Normal" style to text by default, when no other style is specified. We could extract the default font and apply it to text for which no other font is specified. Currently, this text displays in the browser's default font.
Putting this on hold as a future development.

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/146
Extract Break Types
2018-07-31T06:50:38Z
Bruno Herfst, hello@brunoherfst.com

Break types are lost in conversion to HTML.
1 Page Break
2 Column Break
3 Next Page
4 Section Break
5 Even Page Break
5 Odd Page Break
6 Section Break
Could they be converted to classes: `<br class='page-break'>`?

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/154
Extract math from Word
2021-01-21T11:56:02Z
Alex Theg

It looks like there are 2 main ways of embedding math into .docx files (other than plain text):
1. Using the built-in equation editor. This uses an XML tag structure - no binaries, it's all inline:
```xml
<m:oMathPara>
<m:oMath>
```
2. MathType, the most common math add-on for Word, which uses math binaries that need to be extracted.
For both of these, we should be representing the math in MathML (as the standard for HTML5). It looks like we will have to define the mapping for the first option, which could be pretty time consuming. For MathType, we'll need to convert the binaries. @jure's made a Ruby gem that converts from MathType to MathML. It may be that we'll need to rewrite it before we can use it, but it could be a helpful resource.

Alex Theg

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/140
Formatting issues with nested spans and Word styles
2018-05-01T13:52:44Z
Alex Theg

This is somewhat related to #131
[small_caps_example.docx](/uploads/c5e9961e5f3248e8d6ace6c892d97048/small_caps_example.docx)
In the attached example, "Acknowledgements" comes through in bold and small caps but it should not - it uses the Word style "BookTitle + Not Bold, Not Small caps". It looks like this is a question of nested spans, the priority in which the formatting is resolved, and how Word style modifiers are extracted into the html.
Here's the html after the join step:
```html
<p style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">
<span style="font-weight: normal">Acknowledgements</span>
</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
</p>
```
Here it is after the collapse step. At this point, I believe the innermost span's `font-weight: normal` should have been passed to the outer `class="BookTitle"` span, but it is not:
```html
<p style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">Acknowledgements</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
</p>
```
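The hand-off expected from the collapse step can be sketched as follows. This is an illustration in Python, not the XSweet stylesheet: it assumes a span whose only content is a single class-less child span can absorb that child, with the child's style declarations overriding the parent's. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_style(value):
    """Turn 'a: b; c: d' into an ordered dict of declarations."""
    decls = {}
    for part in (value or "").split(";"):
        if ":" in part:
            name, val = part.split(":", 1)
            decls[name.strip()] = val.strip()
    return decls


def collapse_span(span):
    """Fold a lone, class-less child span into `span`; the child's declarations win."""
    children = list(span)
    lone_child = (
        len(children) == 1
        and children[0].tag == "span"
        and children[0].get("class") is None
        and not (span.text or "").strip()
        and not (children[0].tail or "").strip()
    )
    if not lone_child:
        return span
    child = children[0]
    merged = parse_style(span.get("style"))
    merged.update(parse_style(child.get("style")))
    if merged:
        span.set("style", "; ".join(f"{k}: {v}" for k, v in merged.items()))
    span.text = child.text
    span.remove(child)
    span.extend(list(child))  # keep any grandchildren
    return span


book_title = ET.fromstring(
    '<span class="BookTitle"><span style="font-weight: normal">Acknowledgements</span></span>'
)
print(ET.tostring(collapse_span(book_title), encoding="unicode"))
```

With this rule the `BookTitle` span keeps `font-weight: normal` as an inline style instead of losing it when the inner span is removed.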
And here is the final rinsed html:
```html
<h2 style="margin-bottom: 0pt">
<span style="font-variant: normal; font-weight: bold">
<span class="BookTitle">Acknowledgements</span>
</span>
<a class="bookmarkStart" id="docx-bookmark_0"><!-- bookmark ='_GoBack'--></a>
<a href="#docx-bookmark_0"><!-- bookmark end --></a>
</h2>
```
And, the `font-variant: normal` needs to be passed down to the innermost span, or else it's clobbered by the `BookTitle` styling on the innermost span.

1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/42
Handle highlighting
2020-06-03T15:08:18Z
Alex Theg

Opening this issue because I'm looking at an example, but I'm going to put it on hold for now as it's a low priority.
From Green, Ch 1, "Fig. 6 about here" is highlighted green in Word, and comes through as a highlight tag in the HTML, but does not actually appear in the HTML as highlighted:
```html
<p style="font-weight: bold; font-size: 18pt">
<highlight>[Fig. 6 about here.]</highlight>
</p>
```
1. Should we try to catch highlighting?
2. If so, do we care about preserving the original color?

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/62
Handling whitespace-only formatting
2019-07-07T21:34:20Z
Alex Theg

Bakker ch1, see #56 for files
There are 5 headers of the same level, but one of them doesn't get promoted like the others. Seems to be caused by a `<tab>` at the end of the heading "The heroic migrant and the end of migration".
These all get promoted to h1:
* Keeping the monies flowing the times of crises
* The limits of migrant inclusion
* Migration, state-led transnationalism, and development
* The Washington Consensus and beyond: the continuing significance of market fundamentalism in development policy and practice
This one doesn't:
* The heroic migrant and the end of migration
Here's one that gets promoted, just after join-elements and before the header promotion steps:
````html
<p class="Default" style="font-size: 12pt; font-style: italic; margin-bottom: 6pt"><i>The limits of migrant inclusion</i></p>
````
This is the offending tab (at least, I think it's the tab keeping this from being recognized as a header):
````html
<p class="Default" style="font-size: 12pt; margin-bottom: 6pt"><i>The heroic migrant and the end of migration</i>
<tab/>
</p>
````
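One way to test the tab hypothesis is a small pre-promotion pass that drops `<tab/>` elements sitting at the very end of a paragraph. The sketch below is in Python rather than XSLT and is only an illustration: it assumes un-namespaced markup and treats a tab as trailing when nothing but whitespace follows it. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_trailing_tabs(paragraph):
    """Remove <tab/> children at the end of a paragraph, leaving inner tabs alone."""
    while len(paragraph) and paragraph[-1].tag == "tab":
        last = paragraph[-1]
        if (last.tail or "").strip():
            break  # real text follows this tab, so it is not trailing
        paragraph.remove(last)  # its whitespace-only tail goes with it
    return paragraph


p = ET.fromstring(
    '<p class="Default" style="font-size: 12pt; margin-bottom: 6pt">'
    '<i>The heroic migrant and the end of migration</i>\n<tab/>\n</p>'
)
strip_trailing_tabs(p)
print([child.tag for child in p])  # ['i']
```

A tab in the middle of a line, such as `<p>a<tab/>b</p>`, is kept because text follows it.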
Perhaps a cleaning step that strips out trailing tabs before promotion? I can't think where a trailing tab would ever be meaningful.

1.0.0

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/92
Header promotion example for Alex to check
2019-07-07T22:40:39Z
Alex Theg

In Gilbert, fwd: [a03_fwd.docx](/uploads/69b43e22a61426ec09e83037cf875c35/a03_fwd.docx)
Why is "Holly Near" not labeled a header, but "12/21/2014" is?
For Alex to check after next iteration of header promotion

1.0.0
Alex Theg

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/147
How do we test?
2018-07-30T08:12:02Z
Bruno Herfst, hello@brunoherfst.com

I want to write some tests to validate the behavior of XSweet. Is there a preference for how that is done? Add a test folder with some test source documents and their expected output? Or use an existing testing framework like [xspec](https://github.com/expath/xspec)?

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/14
How header promotion uses styles, and the list of styles we pay attention to
2018-05-02T15:44:13Z
Alex Theg

Bakker ch 2: [b_02_ch_1_Bakker.docx](/uploads/fbc7cf504b35339d8505347cadeeb240/b_02_ch_1_Bakker.docx)
Conversion outputs: [output_b02_ch_1_Bakker.zip](/uploads/467e11494ac6f4af558b39b442cc4b1b/output_b02_ch_1_Bakker.zip)
See issue XSweet#56, this is the second improvement this group of headers suggests:
When we find and promote Word styles we care about, like "section heading", we could also look for other similarly formatted text that's NOT labeled with a Word style but should have been. In this instance, the header promotion script could:
* see the "section heading" styles and promote the marked headings
* note how the section headings were formatted (here it's underlined 12pt Helvetica font)
* look for other potential headings formatted the same way, and promote them as appropriate (this would catch the other 4)
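The three steps above can be sketched like this. It is a rough illustration under stated assumptions, not the header promotion script: paragraphs are modeled as plain dicts with `class`, `style` and `text` keys, the class name `sectionheading` is a stand-in for however the Word style is exposed, and "formatted the same way" is taken to mean an identical set of inline style declarations.

```python
def style_signature(style):
    """Normalize an inline style string into an order-independent signature."""
    decls = []
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls.append((name.strip().lower(), value.strip().lower()))
    return frozenset(decls)


def find_heading_candidates(paragraphs, heading_class="sectionheading"):
    """Return unlabeled paragraphs formatted like the labeled headings."""
    # Steps 1 and 2: collect the formatting of the paragraphs that carry the style.
    known = {
        style_signature(p["style"]) for p in paragraphs if p["class"] == heading_class
    }
    known.discard(frozenset())  # an empty signature would match every plain paragraph
    # Step 3: look for other paragraphs with the same formatting.
    return [
        p for p in paragraphs
        if p["class"] != heading_class and style_signature(p["style"]) in known
    ]


# Made-up sample data for illustration
paragraphs = [
    {"class": "sectionheading", "text": "Labeled heading",
     "style": "font-family: Helvetica; font-size: 12pt; text-decoration: underline"},
    {"class": "", "text": "Unlabeled heading",
     "style": "text-decoration: underline; font-size: 12pt; font-family: Helvetica"},
    {"class": "", "text": "Body text", "style": "font-size: 12pt"},
]
print([p["text"] for p in find_heading_candidates(paragraphs)])  # ['Unlabeled heading']
```

Exact matching on the full signature is the strictest choice; a real pass would probably need to ignore properties that do not affect appearance, in the spirit of the `text-align: left` case in XSweet#87.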
Would this be easy to do?

https://gitlab.coko.foundation/XSweet/editoria_typescript/-/issues/42
How to run Wax 2
2020-11-12T02:02:44Z
Alex Theg

This worked for me on macOS Mojave:
1. `git clone https://gitlab.coko.foundation/wax/wax-prosemirror`
2. `cd wax-prosemirror`
3. `yarn install` with node >= 12
4. `yarn editoria`
The last step starts a dev server, with Wax 2 running at `localhost:3000`.
The demo file for the editor's content comes from `wax-prosemirror/editors/editoria/src/demo.js`. Changes to its content hot reload into the editor.

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/7
Hyperlink inferrer tags some file names as links that shouldn't be
2018-05-29T01:38:51Z
Alex Theg

In an author docx that lists captions for images to be included in the book, these strings get linked as hyperlinks:
* 04_IntroTodosSantos.jpg
* 13_ClinicScale_20R06.jpg
They shouldn't be, since they don't point to anything. Could we add a small adjustment to be sure things like this don't get linked? Perhaps a validation like "must have at least one slash if it ends in a file extension" or something similar?

https://gitlab.coko.foundation/XSweet/HTMLevator/-/issues/8
Hyperlinker should catch trailing slash if present.
2018-05-29T01:32:03Z
Alex Theg

Although the link isn't broken without it, it looks nicer if the hyperlink includes the trailing slash. Example:
This: `http://www.arsdisputandi.org/`
Is tagged as: [http://www.arsdisputandi.org](http://www.arsdisputandi.org)/
But it really should be: [http://www.arsdisputandi.org/](http://www.arsdisputandi.org/)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/150
Hyperlinks in footnotes broken
2018-07-27T16:58:45Z
Bruno Herfst, hello@brunoherfst.com

Hyperlinks in footnotes become internal DOC reference:
<a href="../customXml/item1.xml">
Expected it to be:
<a href="http://www.example.com">
[footnote-hyperlink.docx](/uploads/c03e49e543009d239dd03b2fb3606dca/footnote-hyperlink.docx)

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/176
Import PDF & convert to Kotahi's HTML profile
2022-06-08T17:04:19Z
Ryan Dix-Peek

**Description;** the purpose of this task is to support the import of PDFs into Kotahi through integration with Sciencebeam. Sciencebeam supports the conversion of PDFs to XML. We require conversion of PDF to HTML (Kotahi HTML profile specifically).
Suggested solution; XSweet accepts docx, but the remaining pipelines support HTML. Convert PDF to HTML and then feed the output through XSweet for the doc clean-up; PDF -> TEI-XML -> Docx -> XSweet -> Wax
**Acceptance criteria;**
- Ensure HTML is accessible in Wax.
- Extract manuscript metadata and populate the submission form i.e. title, abstract and/or author name data.

Suki Venkat