docx -> Editoria step manifest
Production of HTML from a .docx file input is accomplished by executing a sequence of XSLT 2.0 transformations (tested under Saxon HE). We do not specify how these transformations are accomplished, maintained or managed internally, or whether/where their intermediate results are exposed or made accessible. (At least in some scenarios it is expected that intermediate results may be of use or interest to consumers.)
A particular XSLT stylesheet (aka 'transformation specification' or 'transform') will be found in one of two Gitlab repositories (so far): XSweet or Editoria Typescript (this repository). Within the pipeline, these stylesheets are grouped functionally, but such grouping is internal, not exposed.
For the most part, the output of each XSLT in the sequence is designated as primary input to the next transformation. There may be occasional complications, for example a pipeline that generates an XSLT dynamically and then applies it.
The primary input of the first transformation in the sequence is assumed to be a document.xml document extracted from an MS Word .docx file (as a zip). Its neighbor files, including endnotes.xml, footnotes.xml and styles.xml, must be present for their contents or settings to be included.
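As a rough illustration of the extraction step, the sketch below pulls document.xml and whichever neighbor files exist out of a .docx archive using only the Python standard library. The function name extract_parts is hypothetical, and the word/ folder prefix is an assumption based on where Word conventionally stores these parts; this manifest does not itself prescribe how extraction is done.

```python
import io
import zipfile

# Assumption: Word conventionally stores these parts under "word/" in the zip.
MAIN_PART = "word/document.xml"
NEIGHBOR_PARTS = ["word/endnotes.xml", "word/footnotes.xml", "word/styles.xml"]


def extract_parts(docx_bytes):
    """Return {part name: bytes} for document.xml plus any neighbor files present.

    Hypothetical helper, not part of XSweet or Editoria Typescript.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        names = set(archive.namelist())
        if MAIN_PART not in names:
            raise ValueError("not a usable .docx: missing " + MAIN_PART)
        # Neighbors are optional; a missing one simply contributes nothing.
        wanted = [MAIN_PART] + [p for p in NEIGHBOR_PARTS if p in names]
        return {name: archive.read(name) for name in wanted}
```

A caller would write the returned parts side by side into one directory, so that the first transformation can find the neighbors next to document.xml.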
Formally, each intermediate file is expected to be (or can be regarded or expressed as) an XML file sans XML declaration, using HTML5 element names and semantics (as optimized for Editoria intake), but no DOCTYPE declaration. (Thus: the <html> tag is on the first line, with no prologue.) Such a file is "system agnostic" and can be read as either XML or HTML. (An exception to this rule is the XSLT that is generated dynamically for header promotion, which will of course not be HTML of any sort.)
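The shape described above (no XML declaration, no DOCTYPE, <html> first) can be checked cheaply before a file is handed to the next step. This is only a sketch: the function name looks_like_intermediate_html is hypothetical, and tolerating leading whitespace is an assumption, not something the manifest states.

```python
import re


def looks_like_intermediate_html(text):
    """True if the text opens directly with an <html> tag and has no prologue.

    Hypothetical check; leading whitespace is tolerated as an assumption.
    """
    head = text.lstrip()
    # An XML declaration or a DOCTYPE would both count as a prologue.
    if head.startswith("<?xml") or head.upper().startswith("<!DOCTYPE"):
        return False
    # Accept <html>, <html attr="...">, and <html/>.
    return re.match(r"<html[\s/>]", head) is not None
```

The dynamically generated header-promotion XSLT is expected to fail this check, since its root element is not <html>.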
XSLT Repositories
Find the repositories here:
- https://gitlab.coko.foundation/wendell/XSweet/tree/ink-api-publish, or a copy
- https://gitlab.coko.foundation/wendell/editoria_typescript/tree/ink-api-publish (this repository)
When fetching the files from Gitlab, expand the variables in the file paths listed below as follows:
- Expand $XSweet to https://gitlab.coko.foundation/wendell/XSweet/raw/ink-api-publish/applications
  - note the extra applications subdirectory
- Expand $editoria-typescript to https://gitlab.coko.foundation/wendell/editoria_typescript/raw/ink-api-publish
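The expansion rule above amounts to a prefix substitution. A minimal sketch follows; the names PREFIXES and expand are hypothetical, while the two base URLs are copied from the list above.

```python
# Base URLs as given in this manifest; the dictionary name is hypothetical.
PREFIXES = {
    "$XSweet": "https://gitlab.coko.foundation/wendell/XSweet/raw/ink-api-publish/applications",
    "$editoria-typescript": "https://gitlab.coko.foundation/wendell/editoria_typescript/raw/ink-api-publish",
}


def expand(path):
    """Replace a leading $XSweet or $editoria-typescript with its base URL."""
    for prefix, base in PREFIXES.items():
        if path.startswith(prefix + "/"):
            return base + path[len(prefix):]
    raise ValueError("unknown path prefix in " + path)
```

For example, expand("$XSweet/docx-extract/scrub.xsl") yields a URL that ends in /applications/docx-extract/scrub.xsl.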
Process sequence
- docx extraction
  - $XSweet/docx-extract/docx-html-extract.xsl
  - $XSweet/docx-extract/handle-notes.xsl
  - $XSweet/docx-extract/scrub.xsl
  - $XSweet/docx-extract/join-elements.xsl
  - $XSweet/docx-extract/collapse-paragraphs.xsl
- Header promotion (note process branch)
  - $XSweet/header-promote/digest-paragraphs.xsl
  - $XSweet/header-promote/make-header-escalator-xslt.xsl
  - Apply resulting XSLT back to collapse-paragraphs.xsl output (result) to produce HTML input to next step
- Finalize XSweet / HTML Typescript
  - $XSweet/html-polish/final-rinse.xsl
- Prep for Editoria (Editoria Typescript)
  - $editoria-typescript/editoria-notes.xsl
  - $editoria-typescript/editoria-basic.xsl
  - $editoria-typescript/editoria-reduce.xsl
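The sequence above, including the header-promotion branch, can be sketched as a small driver. Since this manifest deliberately does not specify how transformations are run, the two callables transform (apply a stylesheet identified by path) and transform_with (apply a stylesheet supplied as generated text) are hypothetical placeholders for whatever XSLT 2.0 processor is in use. The sketch also assumes that digest-paragraphs.xsl takes the collapse-paragraphs.xsl output as its input.

```python
EXTRACT = [
    "$XSweet/docx-extract/docx-html-extract.xsl",
    "$XSweet/docx-extract/handle-notes.xsl",
    "$XSweet/docx-extract/scrub.xsl",
    "$XSweet/docx-extract/join-elements.xsl",
    "$XSweet/docx-extract/collapse-paragraphs.xsl",
]
HEADER_PROMOTE = [
    "$XSweet/header-promote/digest-paragraphs.xsl",
    "$XSweet/header-promote/make-header-escalator-xslt.xsl",
]
FINALIZE = ["$XSweet/html-polish/final-rinse.xsl"]
EDITORIA = [
    "$editoria-typescript/editoria-notes.xsl",
    "$editoria-typescript/editoria-basic.xsl",
    "$editoria-typescript/editoria-reduce.xsl",
]


def chain(source, stylesheets, transform):
    """Feed each result to the next stylesheet as its primary input."""
    result = source
    for sheet in stylesheets:
        result = transform(result, sheet)
    return result


def run_pipeline(document_xml, transform, transform_with):
    """Run the full sequence; both callables are hypothetical processor hooks."""
    extracted = chain(document_xml, EXTRACT, transform)
    # Branch: these two steps yield an XSLT, not HTML.
    escalator_xslt = chain(extracted, HEADER_PROMOTE, transform)
    # Apply the generated XSLT back to the collapse-paragraphs.xsl output.
    promoted = transform_with(extracted, escalator_xslt)
    return chain(promoted, FINALIZE + EDITORIA, transform)
```

Keeping the processor behind two callables means the stylesheet order can be exercised with stand-in functions, independently of Saxon.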
XSweet also includes other modules and functionalities not presently being used for the Editoria load. These include html-tweak, css-abstract, and post-processes that produce plain text outputs or analytic profiles of inputs.