docx -> Editoria step manifest
Production of HTML from a .docx file input is accomplished by executing a sequence of XSLT 2.0 transformations (tested under Saxon HE). We do not specify how these transformations are accomplished, maintained or managed internally, or whether/where their intermediate results are exposed or made accessible. (At least in some scenarios it is expected that intermediate results may be of use or interest to consumers.)
A particular XSLT stylesheet (aka 'transformation specification' or 'transform') will be found in one of two Gitlab repositories (so far): XSweet or Editoria Typescript (this repository). Within the pipeline, these stylesheets are grouped functionally, but such grouping is internal, not exposed.
For the most part, the output of each XSLT in the sequence is designated as primary input to the next transformation. There may be occasional complications, for example a pipeline that generates an XSLT dynamically and then applies it.
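The chaining described above can be sketched as a small planner that builds one Saxon command line per stylesheet, feeding each result to the next step. This is only an illustration: the jar name `saxon-he.jar`, the `stepNN.html` naming of intermediate files, and the function name `plan_chain` are assumptions, not part of the pipeline as published. The `-s:`, `-xsl:` and `-o:` options are the standard Saxon command-line options for source, stylesheet and output.

```python
from pathlib import Path


def plan_chain(source, stylesheets, workdir, saxon_jar="saxon-he.jar"):
    """Return one Saxon command per stylesheet.

    Each step reads the file written by the previous step; the first
    step reads `source`. Intermediate file names are hypothetical.
    """
    commands = []
    current = Path(source)
    for index, stylesheet in enumerate(stylesheets, start=1):
        result = Path(workdir) / f"step{index:02d}.html"
        commands.append([
            "java", "-jar", saxon_jar,
            f"-s:{current.as_posix()}",    # primary input
            f"-xsl:{stylesheet}",          # transformation to apply
            f"-o:{result.as_posix()}",     # primary output
        ])
        current = result  # output becomes the next step's input
    return commands
```

Each returned command could then be handed to `subprocess.run(command, check=True)` in order. Keeping the intermediate files on disk is one way to make them available to consumers who want them.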
The primary input of the first transformation in the sequence is assumed to be a document.xml document extracted from an MS Word .docx file (as a zip). Its neighbor files including styles.xml must be present for their contents or settings to be included.
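As a rough sketch of the extraction step, the following unpacks the `word/` folder of a .docx archive so that document.xml keeps its neighbors (styles.xml and so on) beside it. The function name `extract_word_folder` is hypothetical; the `word/document.xml` location is the usual place for the main document part in a .docx file.

```python
import zipfile
from pathlib import Path


def extract_word_folder(docx, target_dir):
    """Extract every part under word/ and return the path to document.xml.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        members = [n for n in archive.namelist() if n.startswith("word/")]
        if "word/document.xml" not in members:
            raise ValueError("archive has no word/document.xml")
        # Only the word/ parts are written; other parts are skipped.
        archive.extractall(target_dir, members=members)
    return Path(target_dir) / "word" / "document.xml"
```

The returned path can serve as the primary input of the first transformation, with styles.xml available in the same directory.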
Formally, each intermediate file is expected to be (or can be regarded or expressed as) an XML file sans XML declaration, using HTML5 element names and semantics (as optimized for Editoria intake), but no DOCTYPE declaration. (Thus: the <html> tag is on the first line, with no prologue.) Such a file is "system agnostic" and can be read as either XML or HTML. (An exception to this rule is the XSLT that is generated dynamically for header promotion, which will of course not be HTML of any sort.)
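A consumer that wants to confirm an intermediate file follows this convention could use a check along these lines. It is a minimal sketch, not part of the pipeline: the function name `is_agnostic_html` is hypothetical, and the check only covers what is stated above (no prologue, well-formed XML, `html` root element).

```python
import xml.etree.ElementTree as ET


def is_agnostic_html(text):
    """True if `text` opens directly with an <html> tag and is well-formed XML."""
    # An XML declaration or DOCTYPE would put something before <html.
    if not text.startswith("<html"):
        return False
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # Ignore a namespace prefix such as {http://www.w3.org/1999/xhtml}.
    return root.tag.rsplit("}", 1)[-1] == "html"
```

For example, `is_agnostic_html("<html><body><p>Hi</p></body></html>")` holds, while the same text preceded by `<!DOCTYPE html>` is rejected.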
Find the repositories here:
- https://gitlab.coko.foundation/wendell/XSweet/tree/ink-api-publish, or a copy
- https://gitlab.coko.foundation/wendell/editoria_typescript/tree/ink-api-publish (this repository)
When calling the files in Gitlab, in the file paths listed:
https://gitlab.coko.foundation/wendell/XSweet/raw/ink-api-publish/applications- note extra
- docx extraction
- Header promotion (note process branch)
- Apply resulting XSLT back to collapse-paragraphs.xsl output (result) to produce HTML input to next step
- Finalize XSweet / HTML Typescript
- Prep for Editoria (Editoria Typescript)
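The process branch in header promotion can be pictured as two Saxon runs over the same file: one that writes a stylesheet, and one that applies that stylesheet back to the collapse-paragraphs.xsl output. The sketch below assumes the generating step reads that same output directly, which the list above does not state. All file names and the jar name `saxon-he.jar` are hypothetical placeholders.

```python
def plan_header_promotion(collapsed, generator_xslt, generated_xslt, result,
                          saxon_jar="saxon-he.jar"):
    """Return the two commands of the branch, in execution order."""
    base = ["java", "-jar", saxon_jar]
    return [
        # 1. Read the collapsed HTML and write a stylesheet (not HTML).
        base + [f"-s:{collapsed}", f"-xsl:{generator_xslt}", f"-o:{generated_xslt}"],
        # 2. Apply the generated stylesheet back to the same collapsed HTML.
        base + [f"-s:{collapsed}", f"-xsl:{generated_xslt}", f"-o:{result}"],
    ]
```

The second command's output is the HTML that the next step in the sequence takes as its primary input.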
XSweet also includes other modules and functionalities not presently being used for the Editoria load. These include css-abstract, as well as post-processes that produce plain text outputs or analytic profiles of inputs.