... | ... | @@ -34,15 +34,25 @@ Development of a formal spec for such a format is an item tbd. For now, we inten |
|
|
|
|
|
"HTML slops" or HTML soup is what we call the output: it is messy, but nutritious. Broadly speaking, it should be considered a (fairly weak) form of HTML or HTML5 using XML syntax (albeit with an empty XML prologue), which makes it amenable to both XML and HTML parsers.
|
|
|
|
|
|
## Extract, then clean up.
|
|
|
|
|
|
In the first step, extraction, data is pulled from the Word document. Its representation is "folded" into a markup-idiomatic version via a peculiar kind of structural transformation (XSLT Level 10 spell).
|
|
|
|
|
|
The resulting HTML will be a correct and fairly transparent representation (albeit expressed in HTML-ese) of the content of the Word document. It will not be optimal: for example, it is likely to be very redundant and repetitive. We tolerate this in data extraction for the sake of transparency and traceability.
|
|
|
|
|
|
Following extraction, data (now HTML) may be piped through a sequence of steps, each performing a different task in data or markup processing, including cleanup. (For example, one such post-process 'collapses' tagging so that `<i>Wuthering</i><i> Heights</i>` becomes `<i>Wuthering Heights</i>`.)
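The 'collapsing' post-process mentioned above can be illustrated with a minimal sketch. It is written in Python rather than XSLT purely for illustration, and it is not the project's actual implementation; the function names `merge_into` and `collapse_adjacent`, and the choice of `i` and `b` as the tags to collapse, are assumptions. Siblings are merged only when they share a tag and attributes and no text at all sits between them.

```python
import xml.etree.ElementTree as ET


def merge_into(prev, nxt):
    """Move the content of nxt onto the end of prev (hypothetical helper)."""
    if len(prev):
        # prev already has child elements: nxt's leading text follows the last one.
        last = prev[-1]
        last.tail = (last.tail or "") + (nxt.text or "")
    else:
        prev.text = (prev.text or "") + (nxt.text or "")
    prev.extend(list(nxt))
    # Whatever followed nxt now follows the merged element.
    prev.tail = nxt.tail


def collapse_adjacent(parent, tags=("i", "b")):
    """Merge runs of directly adjacent siblings with the same tag and attributes."""
    i = 0
    while i + 1 < len(parent):
        a, b = parent[i], parent[i + 1]
        # 'not a.tail' means no text (not even whitespace) separates the two.
        if a.tag in tags and a.tag == b.tag and a.attrib == b.attrib and not a.tail:
            merge_into(a, b)
            parent.remove(b)
        else:
            i += 1
    for child in parent:
        collapse_adjacent(child, tags)


root = ET.fromstring("<p><i>Wuthering</i><i> Heights</i></p>")
collapse_adjacent(root)
print(ET.tostring(root, encoding="unicode"))
```

Elements separated by any text, including a single space, are deliberately left alone, so the step only removes redundancy and does not change the text content.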
|
|
|
|
|
|
The goal of the entire pipeline (not just the first step) is as clean and simple a representation as possible of the 'labeling' of document parts implicit in formatting and style names in the Word document, with minimal (ideally no) 'interpolation' (only representation) of (nominal) semantics as given in the source data.
|
|
|
|
|
|
## Iterative development model
|
|
|
|
|
|
Since many of the particular requirements for data capture and representation can only be defined in use, project feedback is essential to further development of these stylesheets.
|
|
|
|
|
|
For the time being this will involve old-fashioned comparisons between the Word source data (viewed both as a printed page and as an artifact in Word) and the HTML produced by the XSLT, as exposed in (or through tools in) PubSweet, followed by reports of lapses, discrepancies, and opportunities for improvement. (The first question is always: Did we get it all?)
|
|
|
|
|
|
As the system matures it should require less and less time from an XSLT developer to process a single Word document (or set of documents) for PubSweet. A sustainable system may permit configuration or tweaks to be performed routinely by team members with XSLT (or other) skills; but it should not require deep XSLT knowledge to keep it running or to extend it to new cases.
|
|
|
|
|
|
Since there is much to be done to get to that point, this means being vigilant for opportunities both for improvement and for skills development.
|
|
|
|
|
|
Keep in mind that another advantage of a pipelining architecture is that XSLT can be combined into pipelines with transformations implemented in other languages.
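As a rough sketch of what such a mixed pipeline could look like, the following Python treats every step as a plain function from a string of markup to a string of markup. The names `run_pipeline`, `drop_empty_inline`, and `collapse_spaces` are hypothetical and exist only for illustration; an XSLT transformation could be wrapped in a function of the same shape, but how that wrapping is done is left open here.

```python
import re
from functools import reduce


def run_pipeline(html, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda doc, step: step(doc), steps, html)


def drop_empty_inline(html):
    # Toy cleanup step: remove completely empty <i></i> or <b></b> pairs.
    return re.sub(r"<(i|b)></\1>", "", html)


def collapse_spaces(html):
    # Toy cleanup step: squeeze runs of spaces or tabs down to one space.
    return re.sub(r"[ \t]{2,}", " ", html)


steps = [drop_empty_inline, collapse_spaces]
print(run_pipeline("<p><i></i>Wuthering  <b></b>Heights</p>", steps))
```

Because each step has the same signature, steps can be reordered, removed, or replaced (by a step written in another language, for instance) without touching the rest of the pipeline.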
|
|
|
|
... | ... | |