# XSweet
A suite of XSLT stylesheets for .docx data extraction and representation.
(Needs a better name, since it also includes other XSLTs, not just those for .docx extraction? XSLTSweet?)
A suite of XSLT 2.0 stylesheets for `.docx` data extraction and refinement into HTML, for editorial workflows. (And perhaps eventually for other tasks. And also, possibly not always XSLT 2.0.)
## Scope and goals
@@ -22,7 +20,7 @@ Another advantage is flexibility. We may be able to deploy suites of modules to
If we run into performance issues due to overhead in this architecture (e.g. for parsing/serialization of temporary results) we can consider strategies for mitigation.
Experience has shown that the conversion Word .docx -> structured markup is a hard problem. We believe one reason it has been difficult is that assumptions have been made regarding requirements that do not actually apply in many or most cases -- and in particular, that do not apply in a situation in which a significant editing phase is planned for _after_ conversion.
Experience has shown that the transformation Word .docx -> structured markup is difficult to specify and implement. We believe one reason it has been difficult is that assumptions have been made regarding requirements that do not actually apply in many or most cases -- and in particular, that do not apply in a situation in which a significant editing phase is planned for _after_ conversion.
## HTML as a target format
@@ -32,7 +30,7 @@ However an important consideration is that none of these are in scope for this t
* Any attempt to represent in the HTML result the "correct" _intellectual_ or _logical_ structure (which is to say, typically, chapters, sections, nested parts etc.) beyond simply preserving paragraph style names. Document structure will be provided or "restored" in a subsequent development phase (editorial step), after ingest of the data. This relieves the XSLT of having to perform a semantic/structural induction that is difficult to specify for any reasonable common case, and impossible to specify for the general case. While it may be possible to aid editing by providing some obvious structure for things like figures, boxed text, margin callouts etc., where the Word source encoding does not have (explicit) structure, the HTML will not seek to represent it.
* Any effort to identify or extract metadata fields. Metadata will be provided via another channel; if we want, an additional pipeline step (XSLT) can provide for metadata injection (like any enhancement) but its source is not expected to be the Word (.docx) data.
* Any effort to identify or extract metadata fields. Metadata will be provided via another channel; if we want, an additional pipeline step (XSLT) can provide for metadata injection (like any enhancement) but its source is not expected necessarily to be the Word (.docx) data.
* In general, any "semantic" interpretation of ad-hoc (local) names in the Word document. For example a segment marked as style "Italic" may be so marked in the HTML (as a `<span class="Italic">`), but not marked as HTML `i` or represented as italic in any other way.
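
The last point above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not one of the XSweet stylesheets; the function name `run_to_html` and the sample run are hypothetical. It assumes the character style is recorded on the run as `w:rPr/w:rStyle/@w:val`, and it copies that name into a `class` attribute without interpreting it.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def run_to_html(run):
    """Map one w:r element to an HTML span, keeping the style name as a class.

    The style name is carried over verbatim; no attempt is made to decide
    that "Italic" should become <i> or be rendered in any particular way.
    """
    text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
    style = run.find(f"{{{W}}}rPr/{{{W}}}rStyle")
    span = ET.Element("span")
    name = style.get(f"{{{W}}}val") if style is not None else None
    if name:
        span.set("class", name)
    span.text = text
    return span


# Hypothetical sample input: one run marked with the character style "Italic"
sample = (
    f'<w:r xmlns:w="{W}"><w:rPr><w:rStyle w:val="Italic"/></w:rPr>'
    "<w:t>emphasis</w:t></w:r>"
)
html = ET.tostring(run_to_html(ET.fromstring(sample)), encoding="unicode")
print(html)  # <span class="Italic">emphasis</span>
```

Deciding what `class="Italic"` should mean is left to a later step, where an editor or a separate pipeline stage can make that call.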
@@ -60,7 +58,7 @@ Since many of the particular requirements for data capture and representation ca
For the time being this will involve old-fashioned comparisons between Word source data (viewed both as a printed page, and as an artifact in Word) and HTML produced by the XSLT and as exposed in / interfaced with (tools in) PubSweet, and reporting lapses, discrepancies, and opportunities for improvement. (First question is always: Did we get it all?)
As the system matures it should require less and less time from an XSLT developer to process a single Word document (or set of documents) for PubSweet. A sustainable system may permit configuration or tweaks to be performed routinely by team members with XSLT (or other) skills; but it should not require deep XSLT knowledge to keep it running or extend it to new cases.
As the system matures it should require less time from an XSLT developer to process a single Word document (or set of documents) for PubSweet. A sustainable system may permit configuration or tweaks to be performed routinely by team members with XSLT (or other) skills; but it should not require deep XSLT knowledge to keep it running or extend it to new cases.
Since there is much to be done to get to that point, this means being vigilant for opportunities both for improvement and for skills development.
@@ -70,11 +68,11 @@ Keep in mind that another advantage of a pipelining architecture is that XSLT ca
### What should come through and what shouldn't and how do we know?
While the transformation is lossless as respects "main document content" (that is, data that is stored as the nominal "document text" in the Word file), there is also much other data in a Word source file that should come with it (the most obvious: footnotes and figures), and due to the ("database-like") organization of the .docx source data, it is difficult or impossible to ensure a transformation will never drop data. Especially since some of the info in the Word document (page headers come to mind as an example) should arguably not come into the HTML in any case. (What can be ensured, at least theoretically, is that every new case of dropped data detected is also the last time that particular case is seen.)
While the transformation is lossless as respects "main document content" (that is, data that is stored as the nominal "document text" in the Word file), there is also much other data in a Word source file that should come with it (the most obvious: footnotes and figures), and due to the ("database-like") organization of the .docx source data, it is difficult or impossible to ensure a transformation will never drop data. (Especially since some of the info in the Word document -- page headers come to mind as an example -- should arguably not come into the HTML in any case.)
In addition to some means of representing and aligning specifications for the different XSLTs (such as inline documentation), we need to have robust mechanisms for detecting problems in data extraction _especially lost data_; for ameliorating such problems in the instance (sometimes they may not be fatal errors); and for maintaining and improving the XSLTs so they don't happen.
The solution is vigilance, contextual awareness (both processes and contingencies) and testing. In addition to some means of representing and aligning specifications for the different XSLTs (such as inline documentation), we need to have robust mechanisms for detecting problems in data extraction _especially lost data_; for ameliorating such problems in the instance (sometimes they may not be fatal errors); and for maintaining and improving the XSLTs so they don't happen.
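
One possible shape for such a detection mechanism is sketched below, in Python and for illustration only; it is not part of XSweet. The names `missing_paragraphs`, `source`, and `result` are hypothetical. The sketch assumes the check covers only the main document part (`w:p` / `w:t` elements), so footnotes and other parts would need the same treatment separately. It reports each source paragraph whose whitespace-normalized text cannot be found in the text of the HTML result.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class _TextCollector(HTMLParser):
    """Collect the character data of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def _normalize(text):
    # Collapse runs of whitespace so line wrapping does not cause mismatches
    return " ".join(text.split())


def missing_paragraphs(document_xml, html):
    """Return the text of source paragraphs that do not appear in the HTML."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    html_text = _normalize("".join(collector.chunks))

    missing = []
    for para in ET.fromstring(document_xml).iter(f"{{{W}}}p"):
        text = _normalize("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
        if text and text not in html_text:
            missing.append(text)
    return missing


# Hypothetical sample: the second paragraph was dropped during conversion
source = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
result = "<div><p>First paragraph.</p></div>"
dropped = missing_paragraphs(source, result)
print(dropped)  # ['Second paragraph.']
```

A check like this answers only "Did we get it all?" for running text. It does not say whether content that arrived was placed or marked up correctly, and it would flag material, such as page headers, that was left out on purpose.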
Operationally, what will be the best way to specify corrections and feature requests? (Could use Issues on this here gitlab thing? Chat room for Quick Assist?)
For now this is happening under [Issues](../../issues) on this-here wiki.
### Funky structures and math