A suite of XSLT stylesheets for .docx data extraction and representation.
|
|
|
|
|
(Needs a better name? It also includes other XSLTs, not just ones for .docx extraction. XSLTSweet?)
|
|
|
|
|
|
## Scope and goals
|
|
|
|
|
|
We aim to develop and share an open-source toolkit on a commodity platform (XSLT) that provides good-enough data extraction from arbitrary Word (.docx) files into publishing workflows (editing and production) that can ingest HTML markup.
|
|
|
|
|
|
"Good enough" means that the tools are serviceable (or better) in actual document conversion workflows, while producing results at least as good (for these purposes) as other open-source alternatives and pathways (e.g. Pandoc; OxGarage; OpenOffice).
|
|
|
|
|
|
An important consideration for these purposes is that these stylesheets need to work on arbitrary Word inputs, not just Word documents written to templates or other constraint sets. Another is that the results do not have to be good enough to publish, only good enough to be worth editing further.
|
|
|
|
|
|
## A pipeline, not a stylesheet
|
|
|
|
|
|
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations arranged in a sequence (pipeline). Considered as a black box, such a sequence (in which each XSLT reads its input from the result of the previous one) is equivalent to a single transformation. Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline), we can deploy this straightforwardly on INK, while also remaining platform independent with respect to pipelining technology. (I.e., the same XSLTs in the same sequence will work the same in another environment; this is commodity/standard XSLT 2.0.)
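A typical shape for one step in such a pipeline is the standard XSLT identity-copy idiom: everything passes through unchanged except what the step explicitly overrides. The sketch below is illustrative only (the match pattern and the override are hypothetical, not this project's actual code):

```xslt
<!-- Hypothetical pipeline step. The identity template copies every
     node through unchanged; step-specific templates override it. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: the default behavior for every node. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- One illustrative override for this step:
       drop empty paragraphs left over from extraction. -->
  <xsl:template match="p[not(normalize-space())]"/>

</xsl:stylesheet>
```

Because each step is a complete, standalone XSLT, the steps can be chained by any pipelining technology (an INK recipe, a shell script around an XSLT processor, etc.) and the whole sequence behaves like one transformation.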
|
However, an important consideration is that none of these are in scope for this toolkit:
|
|
|
|
|
* Any attempt to represent in the HTML result the "correct" _intellectual_ or _logical_ structure (typically chapters, sections, nested parts, etc.) beyond simply preserving paragraph style names. Document structure will be provided or "restored" in a subsequent development phase (an editorial step), after ingest of the data. This relieves the XSLT of having to perform a semantic/structural induction that is difficult to specify for any reasonable common case, and impossible to specify for the general case. While it may be possible to aid editing by providing some obvious structure for things like figures, boxed text, margin callouts, etc., where the Word source encoding does not have (explicit) structure, the HTML will not seek to represent it.
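  "Preserving paragraph style names" without interpreting them can be sketched roughly as follows. This is a sketch under assumptions (WordprocessingML input, namespace prefix `w:`), not the project's actual code:

  ```xslt
  <!-- Sketch only: map a Word paragraph to an HTML p, carrying the
       style name over as a class rather than inducing structure. -->
  <xsl:template match="w:p"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <p class="{w:pPr/w:pStyle/@w:val}">
      <xsl:apply-templates select="w:r"/>
    </p>
  </xsl:template>
  ```

  A paragraph styled "Heading1" thus becomes `<p class="Heading1">`, with the decision about whether that is really an `h1` deferred to the later editorial step.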
|
|
|
|
|
|
|
|
|
* Any effort to identify or extract metadata fields. Metadata will be provided via another channel; if we want, an additional pipeline step (XSLT) can provide for metadata injection (like any enhancement) but its source is not expected to be the Word (.docx) data.
|
|
|
|
|
|
* In general, any "semantic" interpretation of ad-hoc (local) names in the Word document. For example, a segment marked with style "Italic" may be so marked in the HTML (as a `<span class="Italic">`), but not tagged as HTML `i` or represented as italic in any other way.
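As a concrete illustration of that last point (the run content and style name here are invented for the example), a Word run carrying an ad-hoc character style:

```xml
<!-- Input: a WordprocessingML run with a local style name. -->
<w:r>
  <w:rPr><w:rStyle w:val="Italic"/></w:rPr>
  <w:t>per se</w:t>
</w:r>

<!-- Output: the name is preserved but not interpreted. -->
<span class="Italic">per se</span>
```

Whether `class="Italic"` should really mean emphasis, a title, or something else entirely is left for the later editorial step to decide.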
|
|
|
|
Development of a formal spec for such a format is an item TBD. For now, we inten…
|
|
|
|
|
"HTML slops" or HTML soup is what we call the output - it is messy, but nutritious. Broadly speaking, it should be considered a (fairly weak) form of HTML or HTML5, using XML syntax (albeit an empty XML prologue) making it amenable to both XML and HTML parsers.
|
|
|
|
|
|
|
|
|
## Extract, then refine.
|
|
|
|
|
|
In the first step, extraction, data is pulled from the Word document, and its representation is "folded" into a markup-idiomatic form via a peculiar kind of structural transformation (an XSLT Level 10 spell).
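One common shape such a "fold" takes (sketched here under assumptions — the element and class names are hypothetical, and this is not the project's actual code) is XSLT 2.0 grouping, which turns a flat run of siblings into a nested structure:

```xslt
<!-- Sketch only: fold adjacent flat list paragraphs into a single
     nested list, using the XSLT 2.0 for-each-group instruction. -->
<xsl:template match="body">
  <body>
    <xsl:for-each-group select="p"
        group-adjacent="@class = 'ListParagraph'">
      <xsl:choose>
        <xsl:when test="current-grouping-key()">
          <!-- A run of list paragraphs becomes one ul. -->
          <ul>
            <xsl:for-each select="current-group()">
              <li><xsl:apply-templates/></li>
            </xsl:for-each>
          </ul>
        </xsl:when>
        <xsl:otherwise>
          <!-- Everything else passes through as-is. -->
          <xsl:apply-templates select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </body>
</xsl:template>
```

This kind of grouping is exactly what distinguishes commodity XSLT 2.0 from 1.0 for this task: the flat, presentation-oriented sequence that Word emits can be folded into hierarchy in a single declarative pass.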
|
|
|
|