... | @@ -2,11 +2,11 @@ |
|
|
|
|
|
A suite of XSLT stylesheets for .docx data extraction and representation.
|
|
|
|
|
|
|
|
(Needs a better name?)
|
|
(Needs a better name? Might it also include other XSLTs, not just for .docx? XSLTSweet?)
|
|
|
|
|
|
## A pipeline, not a stylesheet
|
|
|
|
|
|
|
|
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations to be arranged in a sequence (pipeline). Considered as a black box, such a sequence (in which each XSLT reads input from the result of the previous XSLT) is the same as a transformation. Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline) we can deploy this straightforwardly on INK, while also remaining platform independent with respect to pipelining technology. (I.e., the same XSLTs in the same sequence will work the same in another environment.)
|
|
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations to be arranged in a sequence (pipeline). Considered as a black box, such a sequence (in which each XSLT reads input from the result of the previous XSLT) is the same as a single transformation. Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline), we can deploy this straightforwardly on INK, while also remaining platform independent with respect to pipelining technology. (I.e., the same XSLTs in the same sequence will work the same in another environment; this is commodity/standard XSLT 2.0.)
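
The black-box point can be illustrated with a minimal Python sketch. It is not the project's code: plain strings stand in for XML documents, and the three step functions (`extract`, `wrap_paragraph`, `wrap_body`) are hypothetical stand-ins for individual XSLTs. The only thing it aims to show is that a sequence of steps, each reading the previous step's result, can be handled as one transformation.

```python
from functools import reduce
from typing import Callable, Iterable

# A step is any function from document text to document text.
# In the real system each step would apply one XSLT; here they are stand-ins.
Step = Callable[[str], str]


def make_pipeline(steps: Iterable[Step]) -> Step:
    """Compose steps so that each one reads the result of the previous one."""
    ordered = list(steps)

    def run(document: str) -> str:
        return reduce(lambda doc, step: step(doc), ordered, document)

    return run


# Hypothetical stand-in steps (assumptions for illustration only).
def extract(doc: str) -> str:
    return doc.strip()


def wrap_paragraph(doc: str) -> str:
    return f"<p>{doc}</p>"


def wrap_body(doc: str) -> str:
    return f"<body>{doc}</body>"


# Seen from outside, the pipeline is just another Step.
pipeline = make_pipeline([extract, wrap_paragraph, wrap_body])
result = pipeline("  Hello  ")
print(result)
```

Because `make_pipeline` returns something with the same shape as a single step, a pipeline can be inspected stage by stage when debugging, or reordered and extended, without its callers changing.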
|
|
|
|
|
|
Among the advantages this gives us is transparency. Since each XSLT does less, lapses and bugs are easier to diagnose and correct, and extensions easier to make, than in a single relatively opaque XSLT (which may run pipelines internally). A suite of smaller XSLTs should be easier to understand, maintain and modify than a single monolithic stylesheet.
|
|
|
|
|
|
|
... | @@ -22,7 +22,7 @@ The main (only) operational target for the system is (so far) ingest by PubSweet |
|
|
|
|
|
However, an important consideration is that none of the following are in scope for this transformation:
|
|
|
|
|
|
|
|
* Any attempt to represent in the HTML result the "correct" _intellectual_ or _logical_ structure (which is to say, typically, chapters, sections, nested parts etc.) beyond simply preserving paragraph style names. Document structure will be provided or "restored" in a subsequent development phase (editorial step), after ingest of the data. This relieves the XSLT of having to perform a semantic/structural induction that is difficult to specify for any reasonable common case, and impossible to specify for the general case. While it may be possible to aid editing by providing some obvious structure for things like figures and boxed text (callouts), where the Word does not have (explicit) structure, the HTML will not seek to represent it.
|
|
* Any attempt to represent in the HTML result the "correct" _intellectual_ or _logical_ structure (which is to say, typically, chapters, sections, nested parts etc.) beyond simply preserving paragraph style names. Document structure will be provided or "restored" in a subsequent development phase (editorial step), after ingest of the data. This relieves the XSLT of having to perform a semantic/structural induction that is difficult to specify for any reasonable common case, and impossible to specify for the general case. While it may be possible to aid editing by providing some obvious structure for things like figures, boxed text, margin callouts etc., where the Word source encoding does not have (explicit) structure, the HTML will not seek to represent it.
|
|
|
|
|
|
* Any effort to identify or extract metadata fields (which again will be provided via another channel).
|
|
|
|
|
|
|
... | @@ -38,7 +38,7 @@ Development of a formal spec for such a format is an item tbd. For now, we inten |
|
|
|
|
|
Since many of the particular requirements for data capture and representation can only be defined in use, project feedback is essential to further development of these stylesheets.
|
|
|
|
|
|
|
|
For the time being this will involve old-fashioned comparisons between Word source data and HTML produced by the XSLT and as exposed in PubSweet, and reporting lapses and possible discrepancies.
|
|
For the time being this will involve old-fashioned comparisons between Word source data (viewed both as a printed page, and as an artifact in Word) and HTML produced by the XSLT and as exposed in / interfaced with (tools in) PubSweet, and reporting lapses, discrepancies, and opportunities for improvement. (The first question is always: Did we get it all?)
|
|
|
|
|
|
As the system matures it should require less and less time from an XSLT developer to process a single (set of) Word documents for PubSweet. A sustainable system may permit configuration or tweaks to be performed routinely by team members with XSLT (or other) skills, but it should not require deep XSLT-fu to keep it running or extend to new cases.
|
|
|
|
|
|
|
... | @@ -54,7 +54,7 @@ While the transformation is lossless as respects "main document content" (that i |
|
|
|
|
|
In addition to some means of representing and aligning specifications for the different XSLTs (such as inline documentation), we need to have robust mechanisms for detecting problems in data extraction _especially lost data_; for ameliorating such problems in the instance (sometimes they may not be fatal errors); and for maintaining and improving the XSLTs so they don't happen.
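
One possible shape for a lost-data check is sketched below in Python. It is an illustration under stated assumptions, not an existing part of this suite: it assumes the HTML result is well-formed XML, compares only character data (ignoring all whitespace), and looks only at `w:t` elements in the main document part. The function names (`word_text`, `html_text`, `lost_nothing`) and the sample inputs are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements such as w:p, w:r and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def word_text(document_xml: str) -> str:
    """Concatenate the character data of every w:t element, in document order."""
    root = ET.fromstring(document_xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W_NS}}}t"))


def html_text(html: str) -> str:
    """Concatenate all text in the HTML result (assumed well-formed XML)."""
    return "".join(ET.fromstring(html).itertext())


def normalize(text: str) -> str:
    """Drop all whitespace so that layout differences do not count as loss."""
    return re.sub(r"\s+", "", text)


def lost_nothing(document_xml: str, html: str) -> bool:
    """True if the HTML carries the same characters as the Word source."""
    return normalize(word_text(document_xml)) == normalize(html_text(html))


# Hypothetical sample data for illustration.
sample_docx = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
good_html = "<div><p>Hello <b>world</b></p></div>"
bad_html = "<div><p>Hello </p></div>"

print(lost_nothing(sample_docx, good_html))
print(lost_nothing(sample_docx, bad_html))
```

A check this coarse only answers whether the main text survived. It says nothing about footnotes, field codes or other content stored outside `w:t`, which would need their own comparisons.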
|
|
|
|
|
|
|
|
Operationally, what will be the best way to specify corrections and feature requests? (Could use Issues on this here GitLab.)
|
|
Operationally, what will be the best way to specify corrections and feature requests? (Could use Issues on this here GitLab thing? Chat room for Quick Assist?)
|
|
|
|
|
|
### Funky structures and math
|
|
|
|
|
|
|
... | | ... | |