... | ... | @@ -6,11 +6,11 @@ A suite of XSLT stylesheets for .docx data extraction and representation. |
|
|
|
|
|
## A pipeline, not a stylesheet
|
|
|
|
|
|
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations to be arranged in a sequence (pipeline). Considered as a black box, such a pipeline is the same as a transformation. Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline) we can deploy this straightforwardly on INK, while also remaining platform independent with respect to pipelining technology.
|
|
|
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations to be arranged in a sequence (pipeline). Considered as a black box, such a sequence (in which each XSLT reads input from the result of the previous XSLT) is the same as a transformation. Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline) we can deploy this straightforwardly on INK, while also remaining platform independent with respect to pipelining technology. (I.e., the same XSLTs in the same sequence will work the same in another environment.)
|
|
|
|
|
|
Among the advantages this gives the developer (and maintainer) is transparency. Because there are intermediate files to look at (when they are exposed), lapses and bugs are easier to diagnose and correct, and extensions easier to make, than in a single relatively opaque XSLT (which may run pipelines internally). A suite of smaller XSLTs should be easier to understand, maintain and modify than a single monolithic stylesheet.
|
|
|
Among the advantages this gives us is transparency. Since each XSLT does less, lapses and bugs are easier to diagnose and correct, and extensions easier to make, than in a single relatively opaque XSLT (which may run pipelines internally). A suite of smaller XSLTs should be easier to understand, maintain and modify than a single monolithic stylesheet.
|
|
|
|
|
|
Another advantage is flexibility. We may end up with a suite of modules that are used together and separately in "mix-and-match" combinations.
|
|
|
Another advantage is flexibility. We may be able to deploy suites of modules to be used together and separately in "mix-and-match" combinations.
|
|
|
|
|
|
If we run into performance issues due to overhead in this architecture (e.g. for parsing/serialization of temporary results) we can consider alternatives or improvements.
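
The pipeline idea in this section can be sketched in a few lines of Python. This is only an illustration of the processing model, not the project's code: the two stage functions (`drop_bookmarks`, `paragraphs_to_html`) are hypothetical stand-ins for XSLT steps, and plain string replacement stands in for a real transformation. It shows that each stage reads the result of the previous one, that intermediate results can be kept for inspection, and that the whole sequence can be wrapped up as a single transformation.

```python
from functools import reduce
from typing import Callable, List, Tuple

# A stage takes a document (here simply a string) and returns a new document.
Stage = Callable[[str], str]


def run_pipeline(stages: List[Stage], source: str) -> Tuple[str, List[str]]:
    """Run the stages in order, keeping every intermediate result."""
    intermediates = []
    current = source
    for stage in stages:
        current = stage(current)       # each stage reads the previous result
        intermediates.append(current)  # exposed for debugging and inspection
    return current, intermediates


def compose(stages: List[Stage]) -> Stage:
    """Wrap a sequence of stages as one black-box transformation."""
    return lambda source: reduce(lambda doc, stage: stage(doc), stages, source)


# Hypothetical stand-ins for XSLT steps (assumptions, not real stylesheets).
def drop_bookmarks(doc: str) -> str:
    return doc.replace("<bookmark/>", "")


def paragraphs_to_html(doc: str) -> str:
    return doc.replace("<w:p>", "<p>").replace("</w:p>", "</p>")


stages = [drop_bookmarks, paragraphs_to_html]
source = "<w:p>Hello<bookmark/></w:p>"

result, intermediates = run_pipeline(stages, source)
black_box = compose(stages)

print(intermediates)      # one entry per stage
print(black_box(source))  # same as the final result of run_pipeline
```

Because the stages are just a list, the same functions can be reordered or reused in other combinations, which is the "mix-and-match" property described above.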
|
|
|
|
... | ... | @@ -26,7 +26,7 @@ However an important consideration is that none of these are in scope for this t |
|
|
|
|
|
* Any effort to identify or extract metadata fields (which again will be provided via another channel).
|
|
|
|
|
|
* In general, any "semantic" interpretation of ad-hoc (local) names in the Word document. For example a segment marked as style "Italic" will be so marked in the HTML (as a `<span class="Italic">`), but not marked as HTML `i` or represented as italic in any other way.
|
|
|
* In general, any "semantic" interpretation of ad-hoc (local) names in the Word document. For example a segment marked as style "Italic" may be so marked in the HTML (as a `<span class="Italic">`), but not marked as HTML `i` or represented as italic in any other way.
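
The last point can be illustrated with a minimal Python sketch, assuming a hypothetical helper `style_to_span` that is not part of the stylesheets: the Word style name is carried over verbatim as a class value, and nothing is inferred from what the name happens to say.

```python
from html import escape


def style_to_span(text: str, style_name: str) -> str:
    """Carry a Word style name over as a class value, with no interpretation.

    Hypothetical helper for illustration only: the name "Italic" is treated
    like any other ad-hoc style name and never becomes an <i> element.
    """
    return f'<span class="{escape(style_name, quote=True)}">{escape(text)}</span>'


print(style_to_span("emphasis added", "Italic"))
```

Any later step that wants to treat `class="Italic"` as italic is free to do so, but that decision sits outside this transformation.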
|
|
|
|
|
|
None of these rules are absolute. In particular, because it will be difficult to be both comprehensive and succinct (economical), the particulars of the target format (with respect to element types, attribute values etc.) are probably best defined "under load" (that is, in use). We like HTML because it is a vernacular and developers know what to expect from it -- so it gives us some (broad) boundaries going forward.
|
|
|
|
... | ... | |