# DOCX-xslt
A suite of XSLT stylesheets for .docx data extraction and representation.
(Needs a better name?)
## A pipeline, not a stylesheet
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations arranged in a sequence (a pipeline). Considered as a black box, such a pipeline is the same as a single transformation. Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline), we can deploy it straightforwardly on INK while remaining platform-independent with respect to pipelining technology.
Among the advantages this gives the developer (and maintainer) is transparency. Because there are intermediate files to look at (when they are exposed), lapses and bugs are easier to diagnose and correct, and extensions are easier to make, than in a single relatively opaque XSLT (which may run pipelines internally). A suite of smaller XSLTs should be easier to understand, maintain and modify than a single monolithic stylesheet.
Another advantage is flexibility. We may end up with a suite of modules that are used together and separately in "mix-and-match" combinations.
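
As a rough illustration of the pipeline idea (not the project's actual XSLT, and not INK's API), the sketch below models each transformation as a function from one serialized document to the next. The step names `strip_whitespace` and `wrap_in_body` are hypothetical stand-ins for individual stylesheets, and `run_pipeline` is an assumed helper written only for this example. It shows the two properties described above: the pipeline behaves like a single transformation from the outside, and the intermediate results can be exposed for inspection.

```python
from typing import Callable, Dict, List, Tuple

# Each step maps one serialized document to the next, as a stylesheet would.
Step = Callable[[str], str]


def run_pipeline(
    source: str,
    steps: List[Tuple[str, Step]],
    keep_intermediates: bool = False,
) -> Tuple[str, Dict[str, str]]:
    """Apply each named step in order; optionally keep every intermediate result."""
    intermediates: Dict[str, str] = {}
    current = source
    for name, step in steps:
        current = step(current)
        if keep_intermediates:
            intermediates[name] = current
    return current, intermediates


# Hypothetical stand-ins for individual stylesheets in the suite.
def strip_whitespace(doc: str) -> str:
    return " ".join(doc.split())


def wrap_in_body(doc: str) -> str:
    return f"<body>{doc}</body>"


steps = [("normalize", strip_whitespace), ("wrap", wrap_in_body)]
result, trace = run_pipeline("  Hello   world ", steps, keep_intermediates=True)
print(result)              # <body>Hello world</body>
print(trace["normalize"])  # Hello world
```

Because the steps are just entries in a list, the same modules can be rearranged or reused in other sequences, which is the "mix-and-match" use mentioned above.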
If we run into performance issues due to overhead in this architecture (e.g. for parsing/serialization of temporary results) we can consider alternatives or improvements.
Experience has shown that Word .docx -> structured markup conversion is a hard problem. We believe one reason it has been difficult is that assumptions have been made regarding requirements which do not actually apply in many or most cases -- and which, in particular, do not apply in a situation in which a significant editing phase is planned for _after_ conversion.
## HTML as a target format
The main (only) operational target for the system is (so far) ingest by PubSweet. Any spinoff applications for the HTML produced by this pipeline are nice-to-haves, not presently requirements. This gives us a great deal of flexibility in the design of an optimal format for use in PubSweet by its users (and client applications).
An important consideration, however, is that none of the following is in scope for this transformation:
* Any attempt to represent in the HTML result the "correct" _intellectual_ or _logical_ structure (which is to say, typically, chapters, sections, nested parts etc.) beyond simply preserving paragraph style names. Document structure will be provided or "restored" in a subsequent development phase (editorial step), after ingest of the data. This relieves the XSLT of having to perform a semantic/structural induction that is difficult to specify for any reasonable common case, and impossible to specify for the general case.
* Any effort to identify or extract metadata fields (which again will be provided via another channel).
* In general, any "semantic" interpretation of ad-hoc (local) names in the Word document. For example a segment marked as style "Italic" will be so marked in the HTML (as a `<span class="Italic">`), but not marked as HTML `i` or represented as italic in any other way.
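
To make the "preserve names, do not interpret them" rule concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the intended mapping, not the project's XSLT. It assumes the usual WordprocessingML shape, in which a paragraph style is recorded as `w:pPr/w:pStyle` and a character style as `w:rPr/w:rStyle`; the helper names `style_of` and `paragraph_to_html` are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def style_of(element, props_tag, style_tag):
    """Return the w:val of the named style element, or None if there is none."""
    style = element.find(f"w:{props_tag}/w:{style_tag}", NS)
    return None if style is None else style.get(f"{{{W}}}val")


def paragraph_to_html(w_p):
    """Map one w:p to an HTML p, carrying style names over as class values only."""
    p = ET.Element("p")
    p_style = style_of(w_p, "pPr", "pStyle")
    if p_style:
        p.set("class", p_style)
    for w_r in w_p.findall("w:r", NS):
        text = "".join(t.text or "" for t in w_r.findall("w:t", NS))
        r_style = style_of(w_r, "rPr", "rStyle")
        if r_style:
            # The style name is recorded, never interpreted (no <i>, no <em>).
            ET.SubElement(p, "span", {"class": r_style}).text = text
        elif len(p):
            # Unstyled text after a span becomes that span's tail.
            p[-1].tail = (p[-1].tail or "") + text
        else:
            p.text = (p.text or "") + text
    return p


SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t xml:space="preserve">A </w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="Italic"/></w:rPr><w:t>styled</w:t></w:r>
  <w:r><w:t xml:space="preserve"> title</w:t></w:r>
</w:p>"""

html = ET.tostring(paragraph_to_html(ET.fromstring(SAMPLE)), encoding="unicode")
print(html)  # <p class="Heading1">A <span class="Italic">styled</span> title</p>
```

Note that the paragraph styled "Heading1" stays a plain `p`: deciding that it is a heading, and at what level, is left to the later editorial step.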
Indeed, as long as we do not lose data content coming across, the only thing we really care about in our format is that it be (a) legible and intelligible to target users and applications, (b) as economical and tractable as possible, and (c) 'truthful' in its representations. However, requirements b and c are at odds, since economy means leaving things out. We want to represent only what is both true and useful. Because this is not yet known (and indeed because it may vary from one case to the next), the particulars of the target format (as respects element types, attribute values etc.) are probably best defined "under load" (that is, in use). We like HTML because it is a vernacular and developers know what to expect from it -- so it gives us some (broad) boundaries going forward.
Development of a formal spec for such a format is an item tbd. For now, we intend to "produce pudding" that can be proven by eating it.
"HTML slops" or HTML soup is what we call the output - it is nutritious, but messy.
## Iterative development model
Since many of the particular requirements can only be defined in use, project feedback is essential to further development of these stylesheets.
For the time being this will involve old-fashioned comparisons between Word source data and HTML produced by the XSLT and as exposed in PubSweet, and reporting lapses and possible discrepancies.
We hope and intend to ride a comfortable slope, whereon the time required by an XSLT developer to process a single Word document (or set of documents) for PubSweet keeps dropping until it is nothing at all, unless you include configuration-level stuff or tweaks performed routinely by team members with XSLT (or other) skills.
Since there is much to be done to get to that point, this means being vigilant for opportunities for improvement.
## Big Questions
### What should come through and what shouldn't and how do we know?
No provision is made for passing through, for example, page headers, into the HTML, in any form.
However, at deployment time, no provision is made for handling tables, for example, and we know we will have to handle them. So we already know we will be fixing up the XSLT to work for these cases. But what about cases we haven't seen yet?
We need to have robust mechanisms for detecting problems in data extraction (or any transformation), _especially lost data_; for ameliorating such problems in the instance (sometimes they may not be fatal errors); and for maintaining and improving the XSLTs.
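
One possible detection mechanism, sketched here in Python as an assumption rather than as anything the suite currently provides, is to check that the text of every source paragraph still appears somewhere in the result. The function name `lost_paragraphs` is hypothetical, the result is assumed to be well-formed (XML-parseable) HTML, and whitespace is ignored so that differences in spacing are not reported as loss.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def squash(text):
    """Drop all whitespace so spacing differences are not mistaken for lost data."""
    return "".join(text.split())


def lost_paragraphs(document_xml, html):
    """Return the text of each source w:p whose content is absent from the result."""
    result_text = squash("".join(ET.fromstring(html).itertext()))
    lost = []
    for w_p in ET.fromstring(document_xml).iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in w_p.iter(f"{{{W}}}t"))
        if squash(text) and squash(text) not in result_text:
            lost.append(text)
    return lost


# A source with one ordinary paragraph and one paragraph inside a table cell.
SOURCE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t xml:space="preserve">Kept </w:t></w:r><w:r><w:t>text.</w:t></w:r></w:p>
  <w:tbl><w:tr><w:tc>
    <w:p><w:r><w:t>Cell text</w:t></w:r></w:p>
  </w:tc></w:tr></w:tbl>
</w:body>"""

# A result in which the table was silently dropped.
RESULT = '<body><p>Kept <span class="Italic">text.</span></p></body>'

print(lost_paragraphs(SOURCE, RESULT))  # ['Cell text']
```

This is deliberately coarse: it does not notice a repeated paragraph that comes through only once, or text that survives in the wrong place. It would, however, flag the unhandled-table case described above.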
Operationally, what will be the best way to specify corrections / improvements? (Could use Issues on this here gitlab.)
### Funky structures and math
### Citations and bibliography?