@@ -4,33 +4,9 @@ Ultimately the docs here may belong on a separate page (if/as the pipeline is us
## Script

(Note: untested as of 2016-09-19; please help debug. - wap)

See [ExtractandRefine.sh](ExtractandRefine.sh) for a Bash script, hopefully current.

```
#!/bin/bash

# For producing HTML5 outputs via XSweet XSLT from sources extracted from .docx (Office Open XML)

# $DOCXdocumentXML is the 'word/document.xml' file extracted (unzipped) from a .docx file
# (Also, its neighbor files from the .docx package should be available.)
DOCXdocumentXML="path/to/docx/word/document.xml"

# $FILE is a short identifier
FILE="Zorba"

# Left unquoted when invoked so that it splits into the command and its arguments
saxonHE="java -jar path/to/saxon.jar" # SaxonHE (XSLT 2.0 processor)
EXTRACT="docx-html-extract.xsl" # "Extraction" stylesheet
REFINE1="handle-notes.xsl" # "Refinement" stylesheets
REFINE2="scrub.xsl"
REFINE3="join-elements.xsl"

# Intermediate and final outputs (serializations) are all left on the file system
# Braces keep bash from reading e.g. $EXTRACT_out as one (unset) variable name
$saxonHE -xsl:"$EXTRACT" -s:"$DOCXdocumentXML" -o:"${FILE}-${EXTRACT}_out.html"
$saxonHE -xsl:"$REFINE1" -s:"${FILE}-${EXTRACT}_out.html" -o:"${FILE}-${REFINE1}_out.html"
$saxonHE -xsl:"$REFINE2" -s:"${FILE}-${REFINE1}_out.html" -o:"${FILE}-${REFINE2}_out.html"
$saxonHE -xsl:"$REFINE3" -s:"${FILE}-${REFINE2}_out.html" -o:"${FILE}-${REFINE3}_out.html"
```

Bash scripts are an expedient. Soon we should be able to run these transformations from INK.
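
Until then, it can help to check how the steps chain together before running them. The following is a minimal dry-run sketch, not part of XSweet: the helper name `plan_chain` is hypothetical, and the paths are the same placeholders used in the script above. It only prints the Saxon commands, showing that each step reads the output of the previous one.

```shell
#!/bin/bash
# Dry-run sketch (assumption: not part of the XSweet repository).
# Prints the Saxon command for each stylesheet in the chain without running it.

FILE="Zorba"                                      # short identifier, as in the script above
SOURCE="path/to/docx/word/document.xml"           # placeholder path
saxonHE="java -jar path/to/saxon.jar"             # placeholder path to the Saxon jar

# plan_chain is a hypothetical helper: its arguments are stylesheets, in order.
plan_chain() {
  input="$SOURCE"
  for xsl in "$@"; do
    output="${FILE}-${xsl}_out.html"              # same naming pattern as the script above
    printf '%s\n' "$saxonHE -xsl:$xsl -s:$input -o:$output"
    input="$output"                               # the next step reads this step's output
  done
}

plan_chain docx-html-extract.xsl handle-notes.xsl scrub.xsl join-elements.xsl
```

Each printed line can be inspected or pasted into a shell once the placeholder paths have been replaced with real ones.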
### Steps
@@ -49,3 +25,7 @@ Performs certain cleanup operations, such as regularizing CSS on `@style` (one o
## `join-elements.xsl`

Collapses runs of contiguous elements that have the same effect into a single element. For example, `<u>Moby </u><u>Dick</u>` is rewritten as `<u>Moby Dick</u>`. (Word files that have been heavily edited are especially prone to this problem.)
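
To illustrate the effect only, here is a rough shell approximation. It is not how the stylesheet works: the function name `join_adjacent` is hypothetical, and the `sed` substitution handles just the simplest case, a closing tag immediately followed by an opening tag of the same name with no attributes.

```shell
#!/bin/bash
# Illustration only (assumption: join-elements.xsl itself does this in XSLT).
# join_adjacent is a hypothetical helper: $1 is an inline tag name such as u, b, or i.
# It deletes every "</tag><tag>" pair on stdin, which joins the neighboring runs.
join_adjacent() {
  sed "s#</$1><$1>##g"
}

printf '%s\n' '<u>Moby </u><u>Dick</u>' | join_adjacent u
# prints: <u>Moby Dick</u>
```

A text-level substitution like this ignores attributes and nesting, which is one reason the pipeline does the job with an XSLT stylesheet instead.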
## `zorba-map.xsl`

Maps element patterns specific to the "Zorba" sample inputs, such as combinations of font, bold, and italic formatting, to headers.