Commit d616de87 authored by Agathe's avatar Agathe

First commit

Pipeline #21575 failed with stages
in 5 minutes and 7 seconds
Showing
with 1077 additions and 0 deletions
node_modules
.vscode
public
resources
# Local Netlify folder
.netlify
.DS_Store
/public
\ No newline at end of file
LICENSE 0 → 100755
MIT License
Copyright (c) 2020 Adam Hyde
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
---
title: "{{ replace .Name "-" " " | title }}"
date: {{ .Date }}
draft: true
---
baseURL = "https://agathe.pages.gitlab.coko.foundation/xsweet-site/"
languageCode = "en-us"
title = "xSweet, The open .docx to HTML conversion tool"
theme = ["hugo-video", "xsweet-theme", "my-doks"]
# relativeURLs = true
[outputs]
home = ["HTML", "RSS", "JSON"]
[markup]
[markup.goldmark]
[markup.goldmark.renderer]
unsafe = true
[markup.highlight]
codeFences = true
guessSyntax = false
lineNoStart = 1
# lineNos = true
lineNumbersInTable = false
noClasses = true
style = "perldoc"
tabWidth = 4
pygmentsUseClasses=true
[taxonomies]
# category = "categories"
tag = "tags"
# [permalinks]
# posts = "/:year/:month/:title/"
[menu]
[[menu.main]]
identifier = "overview"
name = "Overview"
url = "/overview/0-overview/"
weight= 1
[[menu.main]]
identifier = "documentation"
name = "Documentation"
url = "/documentation/overview/"
weight= 2
[[menu.main]]
identifier = "using-xSweet"
name = "Using xSweet"
url = "/using/using-xsweet/"
weight= 3
[[menu.main]]
identifier = "involved"
name = "Get Involved"
url = "/involved/how-to/"
weight= 4
---
title: "Homepage"
class: "homepage"
---
---
title: "HTMLevator"
draft: false
weight: 220
part: 1
Intro : "HTMLevator is a series of enhancement utilities for improving HTML documents"
class: documentation
---
HTMLevator is a series of enhancement utilities for improving HTML documents. These tools can be used on HTML files made from `.docx`s with XSweet, or on other, arbitrary HTML files. HTMLevator features include:
* Semantic structure inferring, adding headings and sections to flat HTML files
* Copyediting cleanups to normalize text
* Tools to add customized transformations in simplified syntax
## Contents
* [Heading promotion](#heading)
* [Formatting-based analysis](#format-based)
* [Outline-level heading promotion](#outline-based)
* [Custom configuration](#custom-hp)
* [Plain text URL linking](#url)
* [Plain text list tagging](#list-detection)
* [Copyediting cleanups](#ucp)
* [ucp-text-macros.xsl](#text-cleanup)
* [ucp-mappings.xsl](#ucp-mappings)
* [Custom transformations (experimental)](#custom-trans)
* [Section inferrer (experimental)](#sections)
## Heading promotion
HTMLevator includes a feature that attempts to infer which elements are headings, transforming them from `<p>`s into headings: `<h1>` through `<h6>`. This is more art than science, as the input is generally not semantically tagged and structured. It is sometimes trivial to infer headers but it is also frequently quite difficult or impossible to do so unassisted or programmatically. As such, heading promotion will not catch all headings all the time, and it will work better on some documents than on others.
There are 3 heading promotion strategies built into XSweet:
1. Format-based analysis (default)
2. Outline-level heading promotion
3. Custom configuration for named Word styles or specific text
The `header-promote/header-promotion-CHOOSE.xsl` sheet will try to pick the best approach to use for a given document:
* If no custom configurations are supplied, `header-promotion-CHOOSE.xsl` checks to see whether outline levels appear to have been used. If outline level data exists, it is used as the basis for heading promotion
* If outline levels have not been used, format-based analysis is used to infer and promote headings
Alternatively, you can specify the header promotion method to use by passing it as a runtime parameter with `header-promotion-CHOOSE.xsl`:
* `method=ranked-format`
* `method=outline-level`
* `method=my-styles.xml`
### Format-based analysis
As a rule, authors indicate headings with visual formatting far more commonly than by applying named MS Word styles. It's not possible to have a discrete list of what kind of formatting indicates a heading, as it changes from file to file and is highly contextual. Instead, each individual document and its formatting must be analyzed as a whole before making guesses about headings. Format-based heading promotion does just this.
This approach works well for some documents and poorly for others. One size does not fit all, and the approach is simply to optimize for what works well with the greatest number of documents. Table of contents and reference files often contain many short paragraphs, leading to erroneous heading promotion.
The `header-promote/digest-paragraphs.xsl` sheet performs this file analysis. It makes a representation of every `<p>` in the document with relevant formatting properties:
* `font-size`
* `font-style`
* `font-weight`
* `text-decoration`
* `color`
* `text-align`
Next, it sorts paragraphs into groups that share identical formatting, one group for each distinct combination of properties. These groups are candidates for promotion from `<p>` to `<h1-6>`. HTMLevator considers:
* How many paragraphs are formatted the same way
* The average paragraph length in each format group
* How often paragraphs in one format group appear in continuous runs (and thus probably aren't headings)
* Whether paragraphs are all caps
Decisions about what to consider headings are made as follows:
* Anything that is right-aligned is not considered for heading promotion
* The most common type of paragraph in the document (i.e. the combination of paragraph properties that occurs the most) is not considered for heading promotion
* Promote a paragraph group to headings if:
  * The average run of consecutive paragraphs styled the same way is 4 or fewer (long runs of `<p>`s with the same styling suggest the paragraphs aren't headings), AND
  * The font size specified is not the smallest font size found in the document, AND
  * The average length of paragraphs with the given set of properties is not more than 120 characters
* Promote a paragraph group if:
  * It is centered, AND
  * Its average length is less than 200 characters, AND
  * Its average consecutive paragraph run is less than 2
* Promote a paragraph group if it never ends in a period
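The decision rules above can be sketched in Python. This is only an illustration, not the logic of `digest-paragraphs.xsl` itself: the `FormatGroup` shape and the `heading_candidates` function are hypothetical, and it assumes the three "promote" rules are alternatives, any one of which is enough.

```python
from dataclasses import dataclass


@dataclass
class FormatGroup:
    """Statistics for paragraphs sharing identical formatting (hypothetical shape)."""
    font_size: float            # font size in points
    text_align: str             # "left", "center" or "right"
    count: int                  # number of paragraphs in the group
    avg_length: float           # average paragraph length in characters
    avg_run: float              # average length of consecutive runs
    never_ends_in_period: bool  # True if no paragraph ends with "."


def heading_candidates(groups):
    """Return the groups that the rules above would promote to headings."""
    if not groups:
        return []
    most_common = max(groups, key=lambda g: g.count)
    smallest_size = min(g.font_size for g in groups)
    promoted = []
    for g in groups:
        # Exclusions: right-aligned text and the dominant (body text) format.
        if g.text_align == "right" or g is most_common:
            continue
        short_runs = (g.avg_run <= 4
                      and g.font_size > smallest_size
                      and g.avg_length <= 120)
        centered = (g.text_align == "center"
                    and g.avg_length < 200
                    and g.avg_run < 2)
        # Assumption: the three rules are alternatives (logical OR).
        if short_runs or centered or g.never_ends_in_period:
            promoted.append(g)
    return promoted
```

Under these assumptions, a group of short, isolated 16pt paragraphs is promoted, while the body text group and any right-aligned group are skipped.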
After HTMLevator has identified paragraph groups to mark as headings, it must guess the outline level. It does so based on the following attributes, in this order:
1. Font size (bigger = higher heading level)
2. Italics
3. Bold
4. Underline
5. Always caps
Generally speaking, HTMLevator's heading detection does a better job detecting headings than it does at guessing the heading's level.
#### XSLT sequence
This is the default heading promotion method, run if outline level data is not present. You can also select it explicitly by passing `method=ranked-format`.
1. First, `header-promote/digest-paragraphs.xsl` makes the paragraph groupings, and guesses what formats should be headings (and what level those headings should be).
2. The `header-promote/make-header-escalator-xslt.xsl` sheet uses the `digest-paragraphs.xsl` output as its input, which it uses to produce a bespoke `XSL` sheet.
3. Running this sheet on the original HTML file implements the heading promotion, replacing the `<p>`s thought to be headings with `<h1-6>`.
### Outline-level heading promotion
An outline level can be specified on a paragraph in Word (which often comes from a named Word style). Some writers use this outlining functionality in Word, either deliberately, or implicitly through careful use of named styles. In these instances, outline levels are often a reliable indicator of headings and heading levels.
When outline levels are specified in Word's XML (e.g. `<w:outlineLvl w:val="0"/>`), they are extracted by XSweet as an `-xsweet-outline-level` property on the `<p>`.
When this property is present at least twice in the HTML document, the `header-promote/header-promotion-CHOOSE.xsl` sheet will elect to use outline levels to promote headings.
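The selection behaviour described for `header-promotion-CHOOSE.xsl` can be sketched as follows. This is an illustrative approximation only: the `choose_method` function is hypothetical, and it assumes the property is serialized in the HTML as `-xsweet-outline-level: N` inside a `style` attribute.

```python
import re


def choose_method(html, custom_config=None):
    """Pick a heading promotion method the way the text above describes."""
    # A custom configuration, when supplied, takes precedence.
    if custom_config:
        return custom_config
    # Count occurrences of the outline-level property in the document.
    outline_hits = len(re.findall(r"-xsweet-outline-level\s*:", html))
    # Outline levels are used only if the property appears at least twice.
    return "outline-level" if outline_hits >= 2 else "ranked-format"
```

The returned strings mirror the `method=` parameter values listed earlier in this section.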
### Custom configuration
To create a custom configuration:
1. Create a custom mapping (`my-styles.xml` or what have you). See the example provided in `config-mockup.xml` for syntax.
2. Run the `header-promotion-CHOOSE.xsl` sheet, passing the custom mapping `.xml` sheet as a runtime parameter (`method=my-styles.xml`)
3. The `make-header-mapper-xslt.xsl` sheet will generate and apply a custom XSL sheet based on your XML file
## Plain text URL linking
`hyperlink-inferencer/hyperlink-inferencer.xsl`
This sheet searches for plain-text URLs and automatically links them. It can recognize links with the following TLDs:
* .com
* .org
* .net
* .gov
* .mil
* .edu
* .io
* .foundation
* country TLDs
XSweet looks for a top-level domain preceded by one or more strings that contain only letters, numbers, underscores and dashes (no spaces or other punctuation). These strings can be separated by periods ("."). Note that this rule will capture a `www.` if it is present.
XSweet will recognize the protocol and include it in the link if it has been specified (`http://`, `https://`, `ftp:`). If the protocol has not been specified, `http://` is prepended to the link's `href`.
This sheet will also capture query strings on links.
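A rough Python approximation of this matching rule is shown below. It is a sketch, not the regular expression used in `hyperlink-inferencer.xsl`: the names `URL_RE` and `link_urls` are hypothetical, country TLDs are assumed to be any two lowercase letters, and the protocol is assumed to be written with `://`.

```python
import re

GENERIC_TLDS = ("com", "org", "net", "gov", "mil", "edu", "io", "foundation")

URL_RE = re.compile(
    r"(?P<protocol>(?:https?|ftp)://)?"            # optional protocol
    r"(?P<host>(?:[A-Za-z0-9_-]+\.)+"              # period-separated strings
    r"(?:" + "|".join(GENERIC_TLDS) + r"|[a-z]{2})\b)"  # known or country TLD
    r"(?P<rest>(?:/[^\s<>?]*)?(?:\?[^\s<>]*)?)"    # optional path and query string
)


def link_urls(text):
    """Wrap plain-text URLs in <a> tags, defaulting the href to http://."""
    def repl(match):
        url = match.group(0)
        href = url if match.group("protocol") else "http://" + url
        return '<a href="{}">{}</a>'.format(href, url)
    return URL_RE.sub(repl, text)
```

Because any two-letter suffix counts as a country TLD here, this sketch will also link strings such as file names that merely look like domains; a production rule would need a stricter list.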
## Plain text list tagging
`DETECT-ITEMIZE-LISTS.xsl`
This module will recognize plain text that looks like a numbered list and mark up the corresponding list (as an `<ol>`) and list items (as `<li>`s).
`DETECT-ITEMIZE-LISTS.xsl` runs three separate sheets in sequence:
* `detect-numbered-lists.xsl`, which detects lists and bookends them with `<xsw:list xmlns:xsw="http://coko.foundation/xsweet" level="0">`
* `itemize-detected-lists.xsl`, which converts the `<xsw:list>` tags to `<ol>`, and wraps each paragraph in `<li>`s
* `scrub-literal-numbering-lists.xsl`, which removes from each list item the leading whitespace, literal text numbering, the period, and the whitespace after it
Lists must match the following pattern to be detected and marked as a numbered list:
* Each list item paragraph may start with any amount of white space (including none), followed by
* a string of one or more numerals, followed by
* a period, followed by
* one or more white space characters.
* Further, at least two consecutive paragraphs must meet these criteria to be marked as a list
List items that meet these criteria are scrubbed of their literal numbering (and following white space) in favor of automatically generated `<ol>` numbering.
Note that this feature creates a flat list (one level), rather than nested lists based on indentation.
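The detection pattern and the three steps (detect, itemize, scrub) can be illustrated with a short Python sketch. It works on a list of plain-text paragraphs rather than on HTML, and the `tag_numbered_lists` function is hypothetical; the XSLT sheets are the actual implementation.

```python
import re

# Optional leading white space, one or more numerals, a period, then white space.
NUMBERED = re.compile(r"^\s*\d+\.\s+")


def tag_numbered_lists(paragraphs):
    """Turn runs of two or more numbered paragraphs into a flat <ol>."""
    out, run = [], []

    def flush():
        if len(run) >= 2:
            out.append("<ol>")
            # Scrub the literal numbering; <ol> supplies the numbers instead.
            out.extend("<li><p>{}</p></li>".format(NUMBERED.sub("", p)) for p in run)
            out.append("</ol>")
        else:
            # A single numbered paragraph is not treated as a list.
            out.extend("<p>{}</p>".format(p) for p in run)
        run.clear()

    for p in paragraphs:
        if NUMBERED.match(p):
            run.append(p)
        else:
            flush()
            out.append("<p>{}</p>".format(p))
    flush()
    return out
```

For example, `["Intro", "1. First", " 2. Second", "End"]` yields one `<ol>` with two items between the two ordinary paragraphs.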
This module can be run before or after the `PROMOTE-lists.xsl` feature in XSweet Core. To use it, you can modify the `execute_chain.sh` file of the XSweet_runner_scripts to include this step before the `final-rinse.xsl` step.
See also the documentation for [marked list handling](/xsweet-core/#lists).
## Copyediting cleanups
`ucp-cleanup/ucp-text-macros.xsl`
This sheet contains a suite of text cleanups, built specifically for use by the [University of California Press](https://www.ucpress.edu/ "www.ucpress.edu"). It automates many copyediting improvements:
* Hyphens between numerals are converted to en dashes
* Two or more consecutive spaces are converted to a single space
* Any number of spaces before or after em dashes are removed
* Series of periods are converted to ellipses
* Two adjacent hyphens become an em dash
* En dashes surrounded on both sides by spaces are converted to an em dash
* Equal signs are normalized to be surrounded by one space on either side
* Spaces adjacent to tabs are removed
* Spaces at the beginning and end of paragraphs are removed
* Tabs at the end of paragraphs are removed
* Empty paragraphs are removed
* Single and double quotation marks (including backticks) are converted to directional quotation marks
* Hair spaces are inserted between single and double quotation marks
* Punctuation marks are coerced to match the formatting of the previous word; e.g. `<i>extraordinary</i>!` becomes `<i>extraordinary!</i>`. This rule applies to the following punctuation marks:
  * "
  * '
  * :
  * ;
  * ?
  * !
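A few of the text rules above can be expressed as ordered regular-expression substitutions. The sketch below is an assumption-laden illustration, not the contents of `ucp-text-macros.xsl`: the rule order, the `clean_text` name, and the reading of "series of periods" as three or more are all guesses.

```python
import re

EN_DASH, EM_DASH, ELLIPSIS = "\u2013", "\u2014", "\u2026"

# Applied in order; the order used by the real sheet may differ.
RULES = [
    (re.compile(r"--"), EM_DASH),                            # two hyphens -> em dash
    (re.compile(r"(?<= )" + EN_DASH + r"(?= )"), EM_DASH),   # spaced en dash -> em dash
    (re.compile(r" *" + EM_DASH + r" *"), EM_DASH),          # no spaces around em dashes
    (re.compile(r"(?<=\d)-(?=\d)"), EN_DASH),                # hyphen between numerals
    (re.compile(r"\.{3,}"), ELLIPSIS),                       # assumed: three or more periods
    (re.compile(r" {2,}"), " "),                             # collapse repeated spaces
]


def clean_text(text):
    """Apply the substitutions in order, then trim spaces at both ends."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip(" ")
```

Ordering matters in a chain like this: converting double hyphens before stripping spaces around em dashes lets a single later rule tidy both cases.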
`ucp-cleanup/ucp-mappings.xsl`
In this step, underlining and bolding are converted to italics, either as inline tags or `style` CSS:
* `<b>`s and `<u>`s are replaced with `<i>`s
* `style="font-weight: bold"` and `style="text-decoration: underline"` become `style="font-style: italic"`
Short and sweet.
## Custom transformations (experimental)
The files in the `html-tweak` folder can be used to extend XSweet, by defining custom transformations to apply to the text. This can be done on a per-document basis, or to implement generic rules according to your use case.
Use is as follows:
1. Define the custom transformations to be applied in an `.xml` file
2. Run the `APPLY-html-tweaks.xsl` sheet, referencing the above transformations defined in your `xml` file. This:
(A) reads the user-defined transformations from your `.xml` file
(B) creates a new XSL sheet based on the `.xml` file that will implement the specified transformation (done with the `make-html-tweak-xslt.xsl` sheet)
(C) applies the created XSL sheet to the input file
Example use (the exact script will depend upon how you are running your XSLT):
`XSLT my-source.html APPLY-html-tweaks.xsl config=my-html-tweaks.xml`
### Tweak definition syntax
The user-specified tweaks work by establishing matches between categories of HTML elements (most commonly but certainly not limited to `<p>`s or `<span>`s), as indicated by:
* CSS property or CSS property-value (on a `style` attribute), or
* Named classes (the `class` attribute)
The syntax to define HTML tweaks uses the following components:
* `where`: a wrapper for a rule
* `match`: conditions on an element for it to match
* `style`: a `style` property name or `property-name: value` combination
* `class`: a class value (name token)
### Example 1
Remove `Default` classes from HTML elements where they appear:
```html
<p class="Default">Here is default class paragraph</p>
```
becomes:
```html
<p>Here is default class paragraph</p>
```
HTML tweak rule:
```html
<where>
<match><class>Default</class></match>
<remove><class>Default</class></remove>
</where>
```
### Example 2
Remove a specific styling property wherever it's present:
```html
<p style="text-indent:1em; margin-bottom: 1em">Styling includes a property</p>
```
becomes:
```html
<p style="text-indent:1em">Styling includes a property</p>
```
HTML tweak rule:
```html
<where>
<match><style>margin-bottom</style></match>
<remove><style>margin-bottom</style></remove>
</where>
```
### Example 3
Remove a `style` property if it has a given value:
```html
<p style="font-family: Helvetica; font-size: 12pt">Remove a property if it has a specific value</p>
```
becomes:
```html
<p style="font-size: 12pt">Remove a property if it has a specific value</p>
```
HTML tweak rule:
```html
<where>
<match><style>font-family: Helvetica</style></match>
<remove><style>font-family</style></remove>
</where>
```
### Example 4
The following tweak rule will map a specific `class` and `style` to another `class` and `style`:
```html
<where>
<match>
<style>font-size: 18pt</style>
<class>FreeForm</class>
</match>
<remove>
<style>font-size</style>
<class>FreeForm</class>
</remove>
<add>
<class>FreeFormNew</class>
<style>color: red</style>
</add>
</where>
```
For further examples, see the demo files included in the repository:
* `html-tweak-map.xml` defines example transformation definitions
* `html-tweak-demo.xsl` is the resulting XSL sheet made by the `make-html-tweak-xslt.xsl`, which will effect the specified transformation. (This relies on the `html-tweak-lib.xsl` file as a dependency)
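To make the match/remove/add semantics concrete, here is a small Python sketch that applies one rule to an element's classes and `style` string. The dictionary representation of a `<where>` rule and the function names are hypothetical; the real implementation generates XSLT from the XML configuration.

```python
def parse_style(style):
    """Parse 'a: b; c: d' into a dict of property -> value (order preserved)."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip()] = value.strip()
    return props


def matches(condition, props):
    """A style condition is either 'property' or 'property: value'."""
    if ":" in condition:
        name, value = (part.strip() for part in condition.split(":", 1))
        return props.get(name) == value
    return condition.strip() in props


def apply_tweak(classes, style, rule):
    """Apply one rule; return (classes, style) unchanged if it does not match."""
    props = parse_style(style)
    match = rule.get("match", {})
    if not all(c in classes for c in match.get("class", [])):
        return classes, style
    if not all(matches(s, props) for s in match.get("style", [])):
        return classes, style
    remove, add = rule.get("remove", {}), rule.get("add", {})
    classes = [c for c in classes if c not in remove.get("class", [])]
    for name in remove.get("style", []):
        props.pop(name, None)
    classes += add.get("class", [])
    props.update(parse_style("; ".join(add.get("style", []))))
    return classes, "; ".join("{}: {}".format(k, v) for k, v in props.items())
```

Expressed in this form, Example 4 becomes a rule with `match`, `remove` and `add` keys, each holding lists of `class` and `style` entries.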
## Section inferrer (experimental)
This utility uses headings (`<h1-6>`) as markers and attempts to add `<section>`s to an HTML file. It is run as a single XSL sheet, `induce-sections/induce-sections.xsl`, which returns the document HTML file unchanged except for the addition of `<section>` tags.
* Sections are only added when higher-level headings wrap lower-level ones. Lower-level headings wrapping higher-level ones are not captured as `<section>`s
* Paragraphs and blocks preceding the first header appear without a section wrapper (before the first section)
* Paragraphs and all other elements travel with the immediately preceding header
* Files with no headings are unchanged
* Your document must be wrapped in a `<div class="docx-body">` for this sheet to work. It will be wrapped if it has been extracted by XSweet; otherwise you will have to add this element yourself
* If headings skip levels, a note will be added: `<!-- Headers out of regular order: h1, h2, h3, h1, h3-->`
Example:
```html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"></meta>
<title>sections</title>
</head>
<body>
<div class="docx-body">
<h1>h1</h1>
<p>h1 para</p>
<h2>h2</h2>
<p>h2 para</p>
<h3>h3</h3>
<p>h3 para</p>
<p>h3 para</p>
<h1>h1</h1>
<p>h1 para</p>
<h3>h3</h3>
<p>h3 para</p>
</div>
</body>
</html>
```
becomes
```html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>sections</title>
</head>
<body>
<div class="docx-body">
<!-- Headers out of regular order: h1, h2, h3, h1, h3-->
<section>
<h1>h1</h1>
<p>h1 para</p>
<section>
<h2>h2</h2>
<p>h2 para</p>
<section>
<h3>h3</h3>
<p>h3 para</p>
<p>h3 para</p>
</section>
</section>
</section>
<section>
<h1>h1</h1>
<p>h1 para</p>
<section>
<h3>h3</h3>
<p>h3 para</p>
</section>
</section>
</div>
</body>
</html>
```
`mark-sections.xsl` and `nest-sections.xsl` are deprecated; the `induce-sections.xsl` sheet encapsulates the functionality from both.
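The nesting behaviour described in this section can be sketched with a stack of open heading levels. This Python illustration is not how `induce-sections.xsl` is written: `induce_sections` is a hypothetical function, it takes `(tag, text)` pairs instead of HTML, and it does not emit the "Headers out of regular order" comment.

```python
def induce_sections(blocks):
    """Wrap heading-led runs of (tag, text) blocks in nested <section> lines."""
    out = []
    open_levels = []  # heading levels that currently have an open <section>
    for tag, text in blocks:
        is_heading = len(tag) == 2 and tag[0] == "h" and tag[1] in "123456"
        if is_heading:
            level = int(tag[1])
            # Close every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                open_levels.pop()
                out.append("  " * len(open_levels) + "</section>")
            out.append("  " * len(open_levels) + "<section>")
            open_levels.append(level)
        # Non-heading blocks travel with the immediately preceding heading.
        out.append("  " * len(open_levels) + "<{0}>{1}</{0}>".format(tag, text))
    while open_levels:
        open_levels.pop()
        out.append("  " * len(open_levels) + "</section>")
    return out
```

Blocks that precede the first heading are emitted without a wrapper, and an `h3` that directly follows an `h1` is still nested inside the `h1` section, as in the example above.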
\ No newline at end of file
---
title: "Documentation"
---
\ No newline at end of file
---
title: "Editoria Typescript"
draft: false
weight: 230
part: 1
Intro : "Editoria Typescript transforms HTML into a format required for the Coko Foundation's"
class: documentation
---
Editoria Typescript transforms HTML into a format required for the Coko Foundation's [Wax](https://gitlab.coko.foundation/wax/wax "gitlab.coko.foundation/wax") WYSIWYG word processor for [Editoria](https://editoria.pub/ "editoria.pub"). While Wax has been built specifically for book editing and publication, that is by no means its only application, and it could be repurposed. Similar chains could be implemented to target other formats.
Editoria Typescript translates the document structure, inline and class formatting, endnotes and footnotes into a subset of near-HTML, while eliminating HTML attributes not used by Wax.
## Contents
* [Pipeline](#pipe)
* [p-split-around-br.xsl](#br)
* [editoria-basic.xsl](#basic)
* [editoria-reduce.xsl](#reduce)
## Pipeline
Editoria Typescript should be run in the following order:
1. `p-split-around-br.xsl`
2. `editoria-basic.xsl`
3. `editoria-reduce.xsl`
##### `p-split-around-br.xsl`
It is possible to specify line breaks within paragraphs in Word (`<w:br/>`, which are extracted as XHTML `<br class="br" />` tags).
As Wax does not support `<br>`s, this step simply divides paragraphs on breaks, removing the break and creating two separate `<p>` elements instead.
```html
<p style="font-family: Times New Roman; text-indent: 36pt">
Kṛṣṇadevarāya discusses this practice in the following verse:
<br class="br"/>
Make trustworthy Brahmins
<br class="br"/>
The commanders of your forts
</p>
```
becomes
```html
<p>Kṛṣṇadevarāya discusses this practice in the following verse:</p>
<p>Make trustworthy Brahmins</p>
<p>The commanders of your forts</p>
```
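The splitting step can be approximated in Python with the standard library's ElementTree. This is a sketch under stated assumptions, not the XSLT itself: `split_on_br` is a hypothetical name, the input is a single well-formed `<p>` without an XML namespace, and attributes on the original `<p>` are simply dropped.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def split_on_br(p_xml):
    """Split one <p> into several <p> strings, breaking at each <br>."""
    p = ET.fromstring(p_xml)
    pieces = [escape(p.text or "")]
    for child in p:
        if child.tag == "br":
            # Text following the break starts a new paragraph.
            pieces.append(escape(child.tail or ""))
        else:
            # Other inline elements stay in the current piece;
            # ET.tostring includes the element's trailing text.
            pieces[-1] += ET.tostring(child, encoding="unicode")
    return ["<p>{}</p>".format(piece.strip()) for piece in pieces if piece.strip()]
```

Empty pieces (for instance from two consecutive breaks) are discarded rather than turned into empty paragraphs, which is a choice made for this sketch.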
##### `editoria-basic.xsl`
XSweet's initial extraction divides the contents of the HTML document into sections: `<div class="docx-content">`, `<div class="docx-endnotes">`, and `<div class="docx-footnotes">`. This step rearranges the content:
* `<div class="docx-content">` becomes `<container id="main">`
* Notes are reformatted and moved into a `<div id="notes">`
Notes and their `id`s are also rewritten, from:
```html
<div class="docx-endnotes">
<div class="docx-endnote" id="en1">
<p class="EndnoteText">
<span class="EndnoteReference">
<span class="endnoteRef">1</span>
</span> endnote</p>
</div>
</div>
<div class="docx-footnotes">
<div class="docx-footnote" id="fn1">
<p class="FootnoteText">
<span class="FootnoteReference">
<span class="footnoteRef">a</span>
</span> footnote</p>
</div>
</div>
```
to
```html
<div id="notes">
<note-container id="container-en1">
<p class="EndnoteText"> endnote</p>
</note-container>
<note-container id="container-fn1">
<p class="FootnoteText"> footnote</p>
</note-container>
</div>
```
These are then properly linked and nicely displayed in Wax. Endnotes and footnotes are combined into one sequential list:
{{< figure src="../images/wax_notes-768x472.png" >}}
`editoria-basic.xsl` writes some properties from CSS `style` attributes inline:
* `font-style: italic` is written to inline elements wrapped in an `<em>` tag
* `font-weight: bold` is written inline as `<strong>` tags
* `text-decoration: underline` is written inline as `<i>` tags
The following inline formatting tag mappings then occur:
* `<b>`s are converted to `<strong>`
* `<u>` is converted to `<i>`
* `<i>` is then converted to `<em>`
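Because the mappings are applied in sequence, an underlined span ends up as `<em>`: `<u>` first becomes `<i>`, and `<i>` then becomes `<em>`. The Python sketch below illustrates that ordering with simple tag renaming; `remap_inline_tags` is a hypothetical helper, not part of `editoria-basic.xsl`.

```python
import re

# Applied in this order, so <u> -> <i> -> <em>.
MAPPINGS = [("b", "strong"), ("u", "i"), ("i", "em")]


def remap_inline_tags(html):
    """Rename opening and closing inline tags according to MAPPINGS."""
    for old, new in MAPPINGS:
        # The lookahead keeps <br>, <ul>, <img> and similar tags untouched.
        pattern = r"<(/?){}(?=[\s>/])".format(old)
        html = re.sub(pattern, r"<\1{}".format(new), html)
    return html
```

A regex rename like this is only adequate for well-formed, simple markup; it is used here to show the ordering, not as a robust HTML transformation.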
Note that we have decided to convert underlining to italics, as Wax does not currently support underlining.
##### `editoria-reduce.xsl`
* All `class` and `style` information is dropped. Bye bye `class`, bye-bye `style`!
* `<p class="EndnoteText"> endnote</p>` becomes `<p> endnote</p>`
* Other tag attributes (e.g. `id`) are passed through
* `<sub>` and `<sup>` tags are passed through
* Inline markup on whitespace only (spaces, tabs) is removed, e.g. `<b> </b>`
* Tabs are removed: `<span class="tab">`
* Paragraphs or headings with only whitespace or no content at all are removed, e.g. `<p></p>`, `<p> </p>`, `<h1></h1>`
* Internal-to-Word bookmarks (see [this example](/xsweet-core/#links)) are removed
* `<head><style>` tag is removed
content/documentation/images/html-768x251.png

185 KiB

content/documentation/images/math_docx.png

35.6 KiB

content/documentation/images/math_ff.png

37.5 KiB

<!DOCTYPE html>
<html lang="en" dir="ltr"><head>
<meta charset="UTF-8">
<style type="text/css"></style>
</head>
<body>
<table>
<tbody><tr>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top">
<p>right-align</p>
</td>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top">
<p style="text-align: center">center</p>
</td>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top">
<p style="text-align: right">left-align</p>
</td>
</tr>
<tr>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top" rowspan="2">
<p>vertical merges</p>
</td>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top" colspan="2">
<p>horizontal merges</p>
</td>
</tr>
<tr>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top">
<p>A</p>
</td>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top">
<p>B</p>
</td>
</tr>
<tr>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: top">
<p>Vertical alignment – top</p>
<p>
<!-- empty -->
</p>
<p>
<!-- empty -->
</p>
<p>
X
</p>
</td>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: middle">
<p>center</p>
</td>
<td style="border-bottom-style: solid; border-bottom-width: 0.5pt; border-left-style: solid; border-left-width: 0.5pt; border-right-style: solid; border-right-width: 0.5pt; border-top-style: solid; border-top-width: 0.5pt; vertical-align: bottom">
<p>bottom</p>
</td>
</tr>
</tbody></table>
</body></html>
\ No newline at end of file
content/documentation/images/table_html.png

39.6 KiB

content/documentation/images/table_word.png

37.6 KiB

content/documentation/images/wax_notes-768x472.png

142 KiB

content/documentation/images/word_xml-768x1504.png

459 KiB

---
title: "Overview"
draft: false
weight: 200
part: 1
Intro : "XSweet is divided into three separate repositories"
class: documentation
---
XSweet is divided into three separate repositories, which are grouped by their primary concerns:
## XSweet Core
XSweet Core is designed to extract data from MS Word, clean it up, and produce a good representation of the contents as HTML.
Word XML:
{{< figure src="../images/word_xml-768x1504.png" >}}
Extracted HTML:
{{< figure src="../images/html-768x251.png" >}}
[Documentation](/xsweet-core)
[Repository](https://gitlab.coko.foundation/XSweet/XSweet "gitlab.coko.foundation/XSweet/XSweet")
## HTMLevator
HTMLevator contains optional enhancements for the HTML, above and beyond simple extraction. This includes features such as plain text URL recognition and linking, heading inferring, copyediting cleanups, and more.
[Documentation](/editoria-typescript)
[Repository](https://gitlab.coko.foundation/XSweet/HTMLevator "gitlab.coko.foundation/XSweet/HTMLevator")
## Editoria Typescript
Editoria Typescript transforms HTML to be loaded into the [Wax](https://gitlab.coko.foundation/wax/wax "gitlab.coko.foundation/wax") WYSIWYG word processor for [Editoria](https://editoria.pub/ "editoria.pub"), where it can be styled, revised, and collaborated upon. This is a use-case-specific transformation chain, and a demonstration of how the HTML produced by XSweet Core and HTMLevator can be used as a pass-through format for conversions. Similar conversion chains can target other specific use cases in the same way.
[Documentation](http://xsweet.coko.foundation/editoria-typescript/ "xsweet.coko.foundation/editoria-typescript")
[Repository](https://gitlab.coko.foundation/XSweet/editoria_typescript "gitlab.coko.foundation/XSweet/editoria_typescript")
## XSLT tools
### Saxon
XSweet ships with [Saxon HE 9.8](https://www.saxonica.com/documentation/documentation.xml "www.saxonica.com/documentation"), which can be used to run the XSweet pipeline. Most of the testing has been done with this version of Saxon. For example command-line syntax, see its invocation in [this script](https://gitlab.coko.foundation/XSweet/XSweet_runner_scripts/blob/master/execute_chain.sh#L56 "gitlab.coko.foundation/XSweet").
### XSLT versions
XSweet is built using XSLT v2.0 and XSLT v3.0 stylesheets. Saxon HE 9.8 is an XSLT 3.0 processor. You may see warning messages that you are `Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor`. This has not caused any issues in testing and development, but be aware of this if you add your own XSLT sheets and use features specific to one version or the other.
### A Note on .xpl files
An `.xpl` file is an XML document instance using the XProc pipelining language. XProc is a W3C Recommendation that describes the definition and arrangement of XML transformation and modification operations as **pipelines**, typically sequences or series of operations with defined inputs and outputs ("ports"). Inasmuch as XSweet's architectural model is exactly such a pipeline of transformations, this becomes a utilitarian way for us to stand up processes for development and testing. An XProc file describes a chain of processes along with the resources required along the way (such as stylesheets or configurations). Run an XProc pipeline using an XProc processor such as XML Calabash, or using an XML IDE with XProc support. In addition to development and testing, we believe XProc (albeit not exactly these pipelines) could potentially be useful in some deployments. However, XProc is only one of multitudinous ways of orchestrating pipelines -- and we have INK, so we don't need, or use, any of these files to run XSweet in production.
Feel free to use the `.xpl` files included in XSweet's repositories but be aware they are for testing/development purposes and are not supported or externally documented.
\ No newline at end of file
---
title: "XSweet Core"
draft: false
weight: 210
part: 1
Intro : "The purpose of XSweet Core is to extract the underlying XML from Word documents "
class: documentation
---
The purpose of XSweet Core is to extract the underlying XML from Word documents (.docx) into valid, clean, and reusable HTML. XSweet Core extracts and saves “important” information from the .docx, such as:
* Font type and size
* Formatting, including bold, italic, underlining, text justification (left, right, center), subscript/superscript, etc.
* Paragraph indentation
* Endnotes and footnotes
* Outline and list levels
After initially extracting selected information, XSweet Core removes the (voluminous) extra noise in the XML, combines similar tags, abstracts away inline or repetitive formatting where possible, and rewrites Word-specific feature implementations for HTML (e.g. notes, lists).
From an inscrutable mess comes good, clean, and human readable HTML.
## Contents
* [Pipeline](#pipeline)
* [Initial extraction](#extraction)
* [File structure](#structure)
* [Formatting](#formatting)
* [Word styles](#styles)
* [Links](#links)
* [Tables](#tables)
* [Notes](#notes)
* [Further cleanups](#cleanups)
* [scrub.xsl](#scrub)
* [join-elements.xsl](#join)
* [collapse-paragraphs.xsl](#collapse)
* [Lists](#lists)
* [Math](#math)
* [Final Rinse](#rinse)
* [Additional XSL sheets](#additional)
* [css-abstract.xsl](#css)
* [html-analysis.xsl](#analysis)
* [Serialization](#serialize)
* [html5-serialize.xsl](#html5)
* [xhtml-serialize.xsl](#xhtml)
* [plaintext.xsl](#plaintext)
## Pipeline
Initial extraction is achieved by running the `docx-extract/EXTRACT-docx.xsl` sheet, which in turn runs the following sequence of steps:
1. `docx-extract/docx-html-extract.xsl`
* which also runs `docx-extract/docx-table-extract.xsl`
2. `docx-extract/handle-notes.xsl`
3. `docx-extract/scrub.xsl`
4. `docx-extract/join-elements.xsl`
5. `docx-extract/collapse-paragraphs.xsl`
Running the `EXTRACT-docx.xsl` sheet has exactly the same effect as running these five sheets in sequence, using each step's output as the input for the next.
Next, `list-promote/PROMOTE-lists.xsl` creates HTML representations of lists.
The math sheet, `xsweet_tei_omml2mml.xsl`, captures math and equations added in Word.
Finally, run `html-polish/final-rinse.xsl` for additional cleanup to the HTML file.
Note that all of these steps technically produce XHTML, not true HTML. This is expedient for chaining transformation steps into a pipeline. You may want to serialize to HTML5 as the final step (see [html5-serialize.xsl](#html5)), although it is not recommended to do so until all other desired transformations are finished.
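The chaining described in this section, where each step's output becomes the next step's input, can be sketched in a few lines of Python. This is only an illustration of the control flow: `make_step` and `run_pipeline` are hypothetical names, and the steps here merely record the sheet name instead of applying the real XSLT (a real run would apply each sheet with an XSLT processor).

```python
from functools import reduce
from typing import Callable, List

# A step takes a document (here just a string) and returns the transformed document.
Step = Callable[[str], str]


def make_step(sheet_name: str) -> Step:
    """Hypothetical stand-in for applying one XSLT sheet: it only records the sheet name."""
    def apply(document: str) -> str:
        return f"{document}\n<!-- after {sheet_name} -->"
    return apply


def run_pipeline(document: str, steps: List[Step]) -> str:
    """Feed each step's output to the next step, in order."""
    return reduce(lambda doc, step: step(doc), steps, document)


# Sheet names come from the Pipeline section; the transformations themselves are faked.
EXTRACT_STEPS = [make_step(name) for name in (
    "docx-html-extract.xsl",
    "handle-notes.xsl",
    "scrub.xsl",
    "join-elements.xsl",
    "collapse-paragraphs.xsl",
)]

if __name__ == "__main__":
    print(run_pipeline("<html/>", EXTRACT_STEPS))
```

Because every step has the same shape (document in, document out), a wrapper such as `EXTRACT-docx.xsl` is equivalent to running its steps one after another.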
## Initial extraction
##### `docx-html-extract.xsl`
This sheet does the heavy lifting of extracting information from the MS Word XML. Contents are extracted from the body of the `.docx`. Information from headers and footers is _not_ currently extracted.
### File structure
The contents of the HTML `body` section are wrapped in the following section containers:
* `<div class="docx-body">`: the main content of the file
* `<div class="docx-endnotes">`: wraps all endnotes
* `<div class="docx-footnotes">`: wraps all footnotes
### Formatting
There is a surprising amount of redundant and conflicting information in Word's XML, so it's important to choose the right level from which to extract properties. Many properties can be specified either on the paragraph level or the text run level.
XSweet extracts the following properties from the paragraph level (from inside `<w:pPr>` paragraph properties tags):
* Named Word styles
* MS Word outline level
* MS Word list level
Other formatting information specified on the paragraph level is ignored, as it is unreliable and often overridden by properties specified on the text run level (`<w:rPr>`). The following are extracted from the run level:
* Font family
* Font size
* Text alignment
* Font color
* Italicization, bolding, underline, sub- and superscript
Unless formatting in Word is achieved by applied styles (see below), formatting is extracted inline or as element-level CSS (`<p style="[formatting]">`).
Inline elements preserved in HTML:
* Bold `<b>`
* Italic `<i>`
* Underline `<u>`
* Subscript `<sub>`
* Superscript `<sup>`
* Line breaks `<br>`
CSS properties captured include:
* font `font-family`
* font size `font-size`
* `font-weight`: `bold`, `normal`
* `font-style`: `italic`, `normal`
* `text-decoration`: `underline`, `none`
* `font-variant`: `normal`, `small-caps`
* text alignment `text-align`: `left`, `right`, `center`
* indentation `text-indent`, `padding-left`, `padding-right`
* margins `margin-top`, `margin-bottom`, `margin-left`, `margin-right`
* color `color`
The following pseudo-properties are also extracted:
* list level `-xsweet-list-level`: used for list extraction
* outline level `-xsweet-outline-level`: used to aid in heading detection
Be aware that XSweet extracts only what information is present in the Word XML. Some information, such as font type and size, etc., is not always explicitly defined in the XML. Word applies formatting from the default `Normal` style when no other style is specified. This can lead to some minor formatting inconsistencies.
#### Word styles
Word styles are referenced by name inside the Word file's main `document.xml` file, and defined in the `styles.xml` file in the same directory, which contains style formatting and property information. XSweet's initial extraction captures applied Word styles as CSS `class` attributes, and extracts the relevant information from `styles.xml` into a CSS `<style>` tag in the HTML `<head>`:
```html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8" />
<style type="text/css">
.Title {
;
/* Normal*/
font-family: Arial Unicode MS;
/* Title*/
margin-bottom: 0pt;
text-align: center;
font-family: Times New Roman;
font-weight: bold;
font-variant: small-caps
}
</style>
</head>
<body>
<div class="docx-body">
<p class="Title" style="font-family: Times New Roman; font-variant: small-caps; font-weight: bold; margin-bottom: 0pt; text-align: center">Chapter Two</p>
...
```
Note that formatting information is often repeated, both inline as `style` and via `class`. There is analytical value in having the CSS inline (e.g. for [heading promotion](/htmlevator/#heading)).
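The `.Title` rule above lists `font-family` twice, once from `Normal` and once from `Title`. In CSS, a later declaration of the same property within a rule overrides an earlier one, so the `Title` value is the one that takes effect. The sketch below models that with plain dictionaries; `effective_declarations` is a hypothetical helper, and the values are copied from the example above.

```python
def effective_declarations(*layers: dict) -> dict:
    """Merge declaration layers in order; a later layer overrides an earlier one."""
    resolved: dict = {}
    for layer in layers:
        resolved.update(layer)
    return resolved


# Declarations copied from the `.Title` rule above, split by their comments.
normal = {"font-family": "Arial Unicode MS"}
title = {
    "margin-bottom": "0pt",
    "text-align": "center",
    "font-family": "Times New Roman",
    "font-weight": "bold",
    "font-variant": "small-caps",
}
# Inline `style` attribute of the "Chapter Two" paragraph above.
inline = {
    "font-family": "Times New Roman",
    "font-variant": "small-caps",
    "font-weight": "bold",
    "margin-bottom": "0pt",
    "text-align": "center",
}

from_class = effective_declarations(normal, title)
print(from_class["font-family"])  # Times New Roman
print(from_class == inline)       # True: the class and the inline style agree
```

In this example the inline `style` carries exactly the information the class resolves to, which is the repetition the note above refers to.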
### Links
Where they exist, hyperlinks from the Word document are preserved and passed through to the HTML as `<a href>`s. Link text formatting is preserved.
```html
<a href="https://www.nytimes.com/">
<span style="color: #0000FF; text-decoration: underline">
<span class="Hyperlink">https://www.nytimes.com/</span>
</span>
</a>
```
Internal-to-Word bookmarks are also extracted, although without additional processing (not provided by XSweet), they are merely placeholders and are removed during [Editoria Typescript](editoria-typescript).
```html
<a class="bookmarkStart" id="docx-bookmark_0">
<!-- bookmark ='_GoBack'-->
</a>
<a href="#docx-bookmark_0">
<!-- bookmark end -->
</a>
```
### Tables
Tables are created by the `docx-table-extract.xsl` sheet. This sheet is invoked from inside the `docx-html-extract.xsl` sheet. It recreates Word tables in HTML, capturing:
* horizontal and vertical cell text alignment
* horizontally and vertically merged cells
Word table:
{{< figure src="../images/table_word.png" >}}
HTML recreation:
{{< figure src="../images/table_html.png" >}}
Download the HTML table code here: [table-demo.html](/images/table-demo.html)
### Notes
In the `handle-notes.xsl` step, both endnotes and footnotes are added to the end of the HTML document and linked to the note callouts in the text. Notes come from the MS Word document's `endnotes.xml` and `footnotes.xml` files.
Endnotes and footnotes are indexed separately. Endnotes are listed first, referenced as sequential integers. Footnotes are ordered as alphabetical callouts (a-z, aa, ab, etc.), and listed at the end of the HTML file after endnotes.
Endnote and footnote references are normalized and renumbered by this step, in order of appearance. For example, even if the first endnote referenced in the Word document is labeled "2", the converted HTML will renumber it to "1". Notes that are never referenced in the text are removed.
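The labeling rules above (integers for endnotes, `a`-`z`, `aa`, `ab`, ... for footnotes, renumbering by order of appearance, dropping unreferenced notes) can be sketched as follows. This is an illustrative Python model, not the XSLT used by `handle-notes.xsl`; the function names and the data shapes (a list of callout ids and a dict of notes) are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def alpha_label(n: int) -> str:
    """Map 1 -> 'a', 26 -> 'z', 27 -> 'aa', 28 -> 'ab' (bijective base 26)."""
    if n < 1:
        raise ValueError("labels start at 1")
    letters = []
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters.append(chr(ord("a") + remainder))
    return "".join(reversed(letters))


def renumber_notes(
    callouts: List[str],
    notes: Dict[str, str],
    label: Callable[[int], str] = str,
) -> List[Tuple[str, str, str]]:
    """Return (new_label, original_id, text) in order of first appearance.

    Notes whose id never occurs in `callouts` are dropped.
    """
    ordered: List[str] = []
    seen = set()
    for note_id in callouts:
        if note_id in notes and note_id not in seen:
            seen.add(note_id)
            ordered.append(note_id)
    return [(label(i), nid, notes[nid]) for i, nid in enumerate(ordered, start=1)]


if __name__ == "__main__":
    word_notes = {"2": "first cited", "3": "never cited", "5": "second cited"}
    print(renumber_notes(["2", "5"], word_notes))               # endnote style
    print(renumber_notes(["2", "5"], word_notes, alpha_label))  # footnote style
```

Passing `alpha_label` as the `label` argument gives the footnote-style callouts; the default `str` gives the endnote-style integers.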
Endnotes are placed inside a `<div class="docx-endnotes">`. Footnotes are placed inside a `<div class="docx-footnotes">`. The extracted syntax is as follows:
```html
<div class="docx-endnotes">
<div class="docx-endnote" id="en1">
<p class="EndnoteText">
<span class="EndnoteReference">
<span class="endnoteRef">1</span>
</span> endnote</p>
</div>
</div>
<div class="docx-footnotes">
<div class="docx-footnote" id="fn1">
<p class="FootnoteText">
<span class="FootnoteReference">
<span class="footnoteRef">a</span>
</span> footnote</p>
</div>
</div>
```
The above code is the result of the `docx-extract/handle-notes.xsl` step. In the initial extraction, the `<span class="endnoteRef">` and `<span class="footnoteRef">` are left empty. The `handle-notes.xsl` step assigns each note sequential references.
### Images
The initial extraction sheet (`docx-extract/docx-html-extract.xsl`) will insert image references to the location of the unzipped file. This feature relies on image files being available locally, with paths assembled assuming you've used the [XSweet_runner_scripts](https://gitlab.coko.foundation/XSweet/XSweet_runner_scripts "https://gitlab.coko.foundation/XSweet") and have left the source files and unzipped .docx directory in their original location.
## Further cleanups
##### `docx-extract/scrub.xsl`
Previous steps generally pass elements through, even if they haven't seen them before. This step removes these extra tags as unhelpful for our purposes unless they are explicitly caught and passed through. Examples of tags removed include:
* position
* iCs
* lang
* vertAlign
* noProof
* kern
Empty inline elements are removed, as is formatting applied to whitespace only, such as tabs and spaces. CSS `style` properties are normalized and put in a consistent order.
##### `docx-extract/join-elements.xsl`
This step combines strings of elements into one element when:
* More than one element of the same type occurs in a row, and
* The two tags have similar style attributes
This step does **not** combine runs of `<div>`s, `<p>`s, or `<tab>`s.
Example:
```html
<p style="text-align: center">
<b>Part I: </b><b>United</b><b> and </b><b>Divided</b>
</p>
```
becomes
```html
<p style="text-align: center">
<b>Part I: United and Divided</b>
</p>
```
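The joining rule can be modeled with Python's standard `xml.etree.ElementTree`. This is a minimal sketch under stated assumptions, not the logic of `join-elements.xsl`: "similar style attributes" is simplified to "identical attributes", elements separated by any text are left alone, and `join_adjacent` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Element types that are never joined, per the rule above.
NEVER_JOIN = {"div", "p", "tab"}


def join_adjacent(parent: ET.Element) -> None:
    """Merge runs of sibling elements with the same tag and identical attributes."""
    kept = []
    for child in list(parent):
        prev = kept[-1] if kept else None
        if (
            prev is not None
            and prev.tag == child.tag
            and prev.tag not in NEVER_JOIN
            and prev.attrib == child.attrib
            and not prev.tail  # no text between the two elements
        ):
            # Append the child's leading text to the end of the previous element.
            if len(prev):
                prev[-1].tail = (prev[-1].tail or "") + (child.text or "")
            else:
                prev.text = (prev.text or "") + (child.text or "")
            prev.extend(list(child))  # move any nested elements across
            prev.tail = child.tail
            parent.remove(child)
        else:
            kept.append(child)
    for child in parent:
        join_adjacent(child)


if __name__ == "__main__":
    p = ET.fromstring(
        '<p style="text-align: center">'
        "<b>Part I: </b><b>United</b><b> and </b><b>Divided</b></p>"
    )
    join_adjacent(p)
    print(ET.tostring(p, encoding="unicode"))
```

Run on the example above, the four `<b>` elements collapse into one, while a paragraph containing two adjacent `<p>` or `<div>` children would be left untouched.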
##### `docx-extract/collapse-paragraphs.xsl`
In this step, inline formatting gets copied to the paragraph level wherever possible. Elements that contain only formatting information that can be expressed on the paragraph level are removed after the formatting has been moved (see the span in the example below).
Example:
```html
<p style="color: blue"><span style="font-weight: bold">blue bold text</span></p>
```
becomes
```html
<p style="color: blue; font-weight: bold">blue bold text</p>
```
Note that inline styling information isn't removed even if it is copied to a higher level. This allows maximum flexibility for further transformation, at the cost of a bit more "noise" in the HTML.
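The before/after example in this section can be reproduced with a short Python sketch. It covers only the narrowest case, a paragraph whose sole content is one `<span>` carrying nothing but a `style` attribute, and it is an assumption-laden illustration, not the behavior of `collapse-paragraphs.xsl`. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_style(style: str) -> dict:
    """Split 'a: b; c: d' into {'a': 'b', 'c': 'd'}, ignoring empty parts."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip()] = value.strip()
    return declarations


def format_style(declarations: dict) -> str:
    return "; ".join(f"{prop}: {value}" for prop, value in declarations.items())


def collapse(p: ET.Element) -> None:
    """Move the style of a sole child <span> up to the paragraph and unwrap the span."""
    children = list(p)
    if len(children) != 1:
        return
    span = children[0]
    if span.tag != "span" or set(span.attrib) != {"style"}:
        return
    if (p.text or "").strip() or (span.tail or "").strip():
        return  # the span does not cover all of the paragraph's text
    merged = parse_style(p.get("style", ""))
    merged.update(parse_style(span.get("style")))  # the inner value wins on conflict
    p.set("style", format_style(merged))
    p.remove(span)
    p.text = span.text
    p.extend(list(span))


if __name__ == "__main__":
    p = ET.fromstring(
        '<p style="color: blue"><span style="font-weight: bold">blue bold text</span></p>'
    )
    collapse(p)
    print(ET.tostring(p, encoding="unicode"))
```

Letting the span's value win on conflict matches how the text would have rendered before the move, since the innermost declaration applies to the text.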
## Lists
Lists are promoted according to the `xsweet-list-level` property, extracted from the MS Word `<w:ilvl w:val=[integer]>` property. Each list item becomes a `<li>` with a `<p>` inside it. XSweet currently extracts all lists as unordered lists.
##### `PROMOTE-lists.xsl`
A wrapper that runs `mark-lists.xsl` and then `itemize-lists.xsl` in sequence.
##### `mark-lists.xsl`
Wraps elements that have the `xsweet-list-level` property in a wrapper to mark them as lists, grouping and nesting according to `xsweet-list-level` value.
##### `itemize-lists.xsl`
Adds the `<ul>` and `<li>` wrappers to the lists marked by the previous step.
**Example**
`mark-lists.xsl` input:
```html
<p class="ListParagraph" style="margin-left: 36pt; xsweet-list-level: 0">top-level list item</p>
<p class="ListParagraph" style="margin-left: 36pt; xsweet-list-level: 1">nested list item</p>
```
`mark-lists.xsl` output:
```html
<xsw:list xmlns:xsw="http://coko.foundation/xsweet" level="0">
<p class="ListParagraph" style="margin-left: 36pt; xsweet-list-level: 0">top-level list item</p>
<xsw:list level="1">
<p class="ListParagraph" style="margin-left: 36pt; xsweet-list-level: 1">nested list item</p>
</xsw:list>
</xsw:list>
<p/>
```
`itemize-lists.xsl` output:
```html
<ul>
<li>
<p class="ListParagraph" style="margin-left: 36pt; xsweet-list-level: 0">top-level list item</p>
<ul>
<li>
<p class="ListParagraph" style="margin-left: 36pt; xsweet-list-level: 1">nested list item</p>
</li>
</ul>
</li>
</ul>
<p/>
```
See also the documentation for [plain text numbered list detection](/htmlevator/list-detection).
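The two-stage idea above (first group items by level, then wrap them in `<ul>` and `<li>`) can be sketched in Python. The input is simplified to `(level, text)` pairs standing in for paragraphs with an `xsweet-list-level` value; the function names and data shapes are assumptions for illustration and do not reflect how the XSLT sheets are written.

```python
from html import escape
from typing import Dict, List, Tuple


def nest_items(items: List[Tuple[int, str]]) -> List[Dict]:
    """Group a flat sequence of (level, text) pairs into a nested tree."""
    root: List[Dict] = []
    stack = [(-1, root)]  # (level, list that receives children at that depth)
    for level, text in items:
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1][0] >= level:
            stack.pop()
        node = {"text": text, "children": []}
        stack[-1][1].append(node)
        stack.append((level, node["children"]))
    return root


def to_html(nodes: List[Dict]) -> str:
    """Render the tree as nested unordered lists, one <p> per item."""
    if not nodes:
        return ""
    items = "".join(
        f"<li><p>{escape(node['text'])}</p>{to_html(node['children'])}</li>"
        for node in nodes
    )
    return f"<ul>{items}</ul>"


if __name__ == "__main__":
    flat = [(0, "top-level list item"), (1, "nested list item")]
    print(to_html(nest_items(flat)))
```

As in the `itemize-lists.xsl` output above, a nested list ends up inside the `<li>` of the item that precedes it.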
### Math
`xsweet_tei_omml2mml.xsl`
XSweet captures equations created with Word's built-in equation tool (Office Math Markup Language, OMML) and passes them through to the HTML as standard MathML. This feature should "just work" with MS Word-native equations.
Not all browsers treat MathML exactly the same, but JavaScript libraries such as MathJax can be used to standardize the appearance of equations. Also note that math in Word comes in many formats (LaTeX, MathType, images, etc.), not all of which are supported.
Math in Word:
{{< figure src="../images/math_docx.png" >}}
Extracted MathML in Firefox:
{{< figure src="../images/math_ff.png" >}}
### Final Rinse
`html-polish/final-rinse.xsl`
The last step in the XSweet pipeline targeting HTML (except serialization to HTML if desired):
* Removes redundant inline tags (e.g. `<b>`, `<i>`, `<u>`) that are expressed instead as CSS `style` attributes
* Inserts placeholder comments into empty `<div>`s and `<p>`s to ensure they are retained
* Removes extraneous noise from endnote and footnote references
* When possible, removes redundant styling repeated on child elements
## Additional XSL sheets
##### `css-abstract.xsl`
An XSL sheet that attempts to abstract style information specified on HTML tags into classes with similar formatting. This sheet is experimental and not actively maintained. For our purposes, it turns out having style information specified at a lower level enables much better analysis in other steps (e.g. heading promotion).
##### `html-analysis.xsl`
This sheet produces an analysis of the HTML document, including a tree structure and a count of the HTML elements.
### Serialization
##### `html-polish/html5-serialize.xsl`
Serializes an input into valid HTML5. The primary benefit is to help normalize errors that may have occurred in the conversion wherever possible. For example, this can catch orphaned closing HTML tags without an opening counterpart and insert one, preventing errors.
##### `html-polish/xhtml-serialize.xsl`
Serializes an input using XML syntax and HTML tags, nominally XHTML. As such it is syntactically well-formed, "standalone", and suitable for further processing using an XML parser (although not _necessarily_ valid to any particular XHTML schema).
##### `produce-plaintext/plaintext.xsl`
This step simply reads in an HTML or XML file and outputs the contents as plain text.
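The difference between the three kinds of output can be seen with any serializer that supports them. The sketch below uses Python's standard `xml.etree.ElementTree` purely as a stand-in; it is not what the XSweet sheets run, and the sample fragment is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# A small, invented fragment: one line break and one empty paragraph.
fragment = ET.fromstring("<div><p>One<br/>two</p><p/></div>")

# XML syntax: empty elements are written in self-closing form.
as_xml = ET.tostring(fragment, encoding="unicode", method="xml")

# HTML syntax: void elements such as <br> have no end tag,
# while an empty <p> gets an explicit start and end tag.
as_html = ET.tostring(fragment, encoding="unicode", method="html")

# Plain text: markup is dropped and only the text content remains.
as_text = ET.tostring(fragment, encoding="unicode", method="text")

print(as_xml)   # <div><p>One<br />two</p><p /></div>
print(as_html)  # <div><p>One<br>two</p><p></p></div>
print(as_text)  # Onetwo
```

The self-closing `<p />` in the XML form is well-formed XML but is not read as an empty paragraph by an HTML parser, which is one reason to serialize to HTML5 only as the very last step.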
\ No newline at end of file
---
title: "Get Involved"
---
\ No newline at end of file
---
title: "How to Get Involved"
draft: false
weight: 400
part: 1
Intro : "Editoria Typescript transforms HTML into a format required for the Coko Foundation's"
class: documentation
---
## Contributing
We would like it very much if you could help us improve XSweet. There are essentially two ways to do this.
1. XSLT Pros - you can help by tracking the [Issues in the XSweet GitLab](https://gitlab.coko.foundation/XSweet/XSweet/issues "gitlab.coko.foundation/XSweet/XSweet/issues") and either:
(A) participating in the discussions on a per-Issue basis, or
(B) cloning the repo and fixing some of the Issues that come up, then making a merge request for Wendell and Alex to review
2. QA Folk - if you run files through XSweet and discover issues, please log each one as an Issue [in GitLab](https://gitlab.coko.foundation/XSweet/XSweet/issues "gitlab.coko.foundation/XSweet/XSweet/issues").
## Specific Asks
At this moment we would love help with these specific items:
* QA - run your test corpus of docs through (even if small) and *carefully* document what fails. Write it up as an Issue in GitLab describing the problem, include XML and HTML snippets for clarification, and (if possible) include the test document.
* Dev - conversion of Math. This is a complex issue comprising many parts, including (but not limited to):
* conversion of MS Word non-standard MathML to standardised MathML
* conversion of MathType to MathML
* Dev - writing unit tests
## Contact us
The main avenue for communication is the Issues, but you can also reach us in the [Coko Mattermost XSweet chat channel](https://mattermost.coko.foundation/coko/channels/xsweet "mattermost.coko.foundation/coko/channels/xsweet").
You can also contact [Alex by email](mailto:charles.theg@gmail.com).