# Zorba sample

The pipeline currently has four steps, run in order by the script below and described under Steps.

## Script

(Note: untested as of 2016-09-19; please help debug. - wap)

```
# For producing HTML5 outputs via XSweet XSLT from sources extracted from .docx (Office Open XML)

# $DOCXdocumentXML is the 'word/document.xml' file extracted (unzipped) from a .docx file
# (Also, its neighbor files from the .docx package should be available.)
DOCXdocumentXML="path/to/docx/word/document.xml"

# $FILE is a short identifier.
FILE="Zorba"

# Command used to invoke Saxon HE; left unquoted below so the shell splits it into words.
saxonHE="java -jar path/to/saxon.jar"

EXTRACT="docx-html-extract1.xsl"
REFINE1="handle-notes.xsl"
REFINE2="scrub.xsl"
REFINE3="join-elements.xsl"

# Each step reads the previous step's output.
# The braces in ${EXTRACT}_out keep the shell from looking up a variable named EXTRACT_out.
$saxonHE -xsl:"$EXTRACT" -s:"$DOCXdocumentXML" -o:"${FILE}-${EXTRACT}_out.html"
$saxonHE -xsl:"$REFINE1" -s:"${FILE}-${EXTRACT}_out.html" -o:"${FILE}-${REFINE1}_out.html"
$saxonHE -xsl:"$REFINE2" -s:"${FILE}-${REFINE1}_out.html" -o:"${FILE}-${REFINE2}_out.html"
$saxonHE -xsl:"$REFINE3" -s:"${FILE}-${REFINE2}_out.html" -o:"${FILE}-${REFINE3}_out.html"
```
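
The four invocations in the script follow one pattern: each stylesheet reads the output of the step before it. As a minimal sketch of that chaining (not part of XSweet; `run_step`, `run_pipeline`, `SAXON_JAR` and `DRY_RUN` are hypothetical names introduced here), the loop below prints the Saxon commands by default, so the file naming can be checked without Saxon installed:

```shell
#!/bin/sh
# Sketch only: chain the stylesheets so each step reads the previous output.
# SAXON_JAR and DRY_RUN are assumptions of this sketch, not XSweet settings.
SAXON_JAR="${SAXON_JAR:-path/to/saxon.jar}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = run Saxon

# Run (or print) one transformation: stylesheet, source, output.
run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "java -jar $SAXON_JAR -xsl:$1 -s:$2 -o:$3"
  else
    java -jar "$SAXON_JAR" "-xsl:$1" "-s:$2" "-o:$3"
  fi
}

# Usage: run_pipeline ID SOURCE STYLESHEET...
# The parenthesized body runs in a subshell, so its variables stay local.
run_pipeline() (
  id="$1"
  src="$2"
  shift 2
  for xsl in "$@"; do
    out="${id}-${xsl}_out.html"
    run_step "$xsl" "$src" "$out" || exit 1
    src="$out"   # the next step reads this step's output
  done
)

run_pipeline "Zorba" "path/to/docx/word/document.xml" \
  docx-html-extract1.xsl handle-notes.xsl scrub.xsl join-elements.xsl
```

Setting `DRY_RUN=0` would run the transformations for real. Adding or reordering a step then means changing only the list of stylesheets.
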

## Steps

## `docx-html-extract.xsl`

Extracts data from the Word document as literally as we can manage, producing output that is nominally HTML5 (syntactically and idiomatically) while capturing all *relevant* information from the Word source (treated as a data object representing a formatted artifact in print or UI).

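
This step reads `word/document.xml`, so the .docx package has to be unpacked first. A minimal sketch, assuming the Info-ZIP `unzip` command is installed; the helper name `unpack_docx` and the example paths are hypothetical and not part of XSweet:

```shell
#!/bin/sh
# Sketch only: unpack a .docx (a zip archive) into a directory and print
# the path of word/document.xml, keeping its neighbor files alongside it.
# The parenthesized body runs in a subshell, so its variables stay local.
unpack_docx() (
  docx="$1"
  dest="$2"
  [ -f "$docx" ] || { echo "no such file: $docx" >&2; exit 1; }
  mkdir -p "$dest" || exit 1
  # -o overwrites without prompting, -q is quiet, -d names the target directory
  unzip -o -q "$docx" -d "$dest" || exit 1
  [ -f "$dest/word/document.xml" ] || { echo "not a .docx package: $docx" >&2; exit 1; }
  echo "$dest/word/document.xml"
)

# Example (paths are placeholders):
#   DOCXdocumentXML="$(unpack_docx Zorba.docx path/to/docx)"
```

Unpacking the whole archive, rather than the one file, keeps the neighbor files from the package available, which the script comments ask for.
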
## `handle-notes.xsl`

Resolves and re-renders `endnote` constructs from the Word document into a normalized form.

## `scrub.xsl`

Performs cleanup operations, such as regularizing the CSS on `@style` (one of the ways information is captured from the source) and removing other noise, e.g. spurious and redundant element types carried over from the Word document, or paragraphs and formatting wrappers with no contents.

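
To make the "no contents" case concrete, here is a rough shell approximation of one such cleanup: dropping paragraphs that are empty or hold only whitespace. It illustrates the effect with `sed` and is not how `scrub.xsl` works (that is XSLT); the function name `drop_empty_paragraphs` is hypothetical, and the pattern only matches attribute-less `<p>` elements that sit on a single line.

```shell
#!/bin/sh
# Sketch only: remove <p> elements that contain nothing but whitespace.
drop_empty_paragraphs() {
  sed 's#<p>[[:space:]]*</p>##g'
}

printf '%s\n' '<p>One</p><p> </p><p></p><p>Two</p>' | drop_empty_paragraphs
# prints: <p>One</p><p>Two</p>
```
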
## `join-elements.xsl`

Collapses runs of contiguous elements that have the same effect. For example, `<u>Moby </u><u>Dick</u>` is rewritten as `<u>Moby Dick</u>`. (Word files that have been edited heavily are especially prone to this.)
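
The Moby Dick example can be reproduced with a small shell approximation. This is only an illustration of the effect, not the XSLT that `join-elements.xsl` uses: the function name `join_adjacent` is hypothetical, and it handles only bare `<u>`, `<i>` and `<b>` tags with nothing between the close tag and the next open tag.

```shell
#!/bin/sh
# Sketch only: merge adjacent elements of the same name by deleting the
# close-tag/open-tag pair that separates them.
join_adjacent() {
  sed -e 's#</u><u>##g' -e 's#</i><i>##g' -e 's#</b><b>##g'
}

printf '%s\n' '<u>Moby </u><u>Dick</u>' | join_adjacent
# prints: <u>Moby Dick</u>
```
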