@@ -8,7 +8,7 @@ A suite of XSLT stylesheets for .docx data extraction and representation.
We aim to develop and share an open source toolkit on a commodity platform (XSLT) that provides good-enough data extraction from arbitrary Word (.docx) files into publishing workflows (editing and production) able to exploit HTML markup for ingest.
"Good enough" means that the tools are serviceable (or better) in actual document conversion workflows, while producing results at least as good (for these purposes) as other open-source alternatives and pathways (e.g. Pandoc; OxGarage; OpenOffice).
"Good enough" means that the tools are serviceable (or better) in actual document conversion workflows, while producing results at least as good (for these purposes) as available alternatives and pathways.
An important consideration for these purposes is that these stylesheets need to work on arbitrary Word inputs, not just Word documents written to templates or other constraint sets. Another is that the results do not have to be good enough to publish, just good enough to be worth editing further.
@@ -16,7 +16,7 @@ An important consideration for these purposes are that these stylesheets need to
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations arranged in a sequence (a pipeline). Considered as a black box, such a sequence (in which each XSLT reads its input from the result of the previous XSLT) is equivalent to a single transformation. Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline), we can deploy it straightforwardly on INK while remaining platform independent with respect to pipelining technology. (That is, the same XSLTs in the same sequence will work the same way in another environment; this is commodity/standard XSLT 2.0.)
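The black-box equivalence described above can be illustrated outside XSLT. The following is only a minimal sketch in Python, not part of the toolkit: the two stage functions (`drop_empty_paragraphs`, `promote_headings`) and the `Heading1` class name are hypothetical stand-ins for individual XSLT steps, chosen to show that a sequence of small steps composes into one callable transformation.

```python
from functools import reduce
from typing import Callable, Iterable
import xml.etree.ElementTree as ET

# A stage takes a document element and returns a new document element,
# much as each XSLT in a pipeline reads the previous step's result.
Stage = Callable[[ET.Element], ET.Element]


def compose(stages: Iterable[Stage]) -> Stage:
    """Return one transformation equivalent to running the stages in order."""
    stage_list = list(stages)

    def pipeline(doc: ET.Element) -> ET.Element:
        return reduce(lambda current, stage: stage(current), stage_list, doc)

    return pipeline


# Hypothetical stage: drop <p> children with no text and no child elements.
def drop_empty_paragraphs(doc: ET.Element) -> ET.Element:
    out = ET.Element(doc.tag, doc.attrib)
    for child in doc:
        is_empty = not (child.text or "").strip() and len(child) == 0
        if child.tag == "p" and is_empty:
            continue
        out.append(child)
    return out


# Hypothetical stage: turn <p class="Heading1"> into <h1>.
def promote_headings(doc: ET.Element) -> ET.Element:
    out = ET.Element(doc.tag, doc.attrib)
    for child in doc:
        if child.tag == "p" and child.get("class") == "Heading1":
            heading = ET.SubElement(out, "h1")
            heading.text = child.text
        else:
            out.append(child)
    return out


source = ET.fromstring(
    '<body><p class="Heading1">Title</p><p> </p><p>Text</p></body>'
)
recipe = compose([drop_empty_paragraphs, promote_headings])
result = ET.tostring(recipe(source), encoding="unicode")
print(result)
```

Because `recipe` is itself just a function from document to document, a caller cannot tell whether it wraps one step or several, and the same stages can be rearranged or reused in other sequences.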
Among the advantages this gives us is transparency. Since each XSLT does less, lapses and bugs are easier to diagnose and correct and extensions easier to make than in a single relatively opaque XSLT (which may run pipelines internally). A suite of smaller XSLTs should be easier to understand, maintain and modify than a single monolithic stylesheet.
Among the advantages this gives us is transparency. Since each XSLT does less, holes and bugs are easier to find and fill than in a single relatively opaque XSLT (which may run pipelines internally). A suite of smaller XSLTs should be easier to understand, maintain and modify than a single monolithic stylesheet.
Another advantage is flexibility. We may be able to deploy suites of modules to be used together and separately in "mix-and-match" combinations.