... | ... | @@ -2,7 +2,7 @@ |
|
|
|
|
|
A suite of XSLT stylesheets for .docx data extraction and representation.
|
|
|
|
|
|
(Needs a better name? Might it also include other XSLTs not just for .docx? XSLTSweet? )
|
|
|
(Needs a better name, as it also includes other XSLTs, not just those for .docx extraction? XSLTSweet?)
|
|
|
|
|
|
## A pipeline, not a stylesheet
|
|
|
|
... | ... | @@ -40,7 +40,7 @@ In the first step, extraction, data is pulled from the Word document. Its repres |
|
|
|
|
|
The resulting HTML will be a correct and fairly transparent representation (albeit expressed in HTML-ese) of the content of the Word document. It will not be optimal: for example, it is likely to be very redundant and repetitive. We tolerate this in data extraction for the sake of transparency and traceability.
|
|
|
|
|
|
Following extraction, data (now HTML) may be piped through a sequence of steps, each performing a different task in data or markup processing, including cleanup. (For example, one such post-process 'collapses' tagging so that <i>Wuthering</i><i> Heights</i> becomes <i>Wuthering Heights</i>.)
|
|
|
Following extraction, data (now HTML) may be piped through a sequence of steps, each performing a different task in data or markup processing, including cleanup. (For example, one such post-process 'collapses' tagging so that \<i>Wuthering\</i>\<i> Heights\</i> becomes \<i>Wuthering Heights\</i>.)
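
The 'collapsing' post-process mentioned above can be illustrated with a minimal sketch. It is written in Python only to show the intended effect on the markup; it is not the project's XSLT, and the function name `collapse_adjacent` and its rules (merge only attribute-free elements with no text between them) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def collapse_adjacent(parent, tag="i"):
    """Merge runs of directly adjacent <tag> children, recursing into descendants.

    Hypothetical helper: two siblings are merged only when nothing (not even
    whitespace) sits between them and neither carries attributes.
    """
    prev = None
    for child in list(parent):  # iterate over a copy, since we remove children
        collapse_adjacent(child, tag)
        mergeable = (
            prev is not None
            and child.tag == tag
            and prev.tag == tag
            and not prev.tail  # no text between the two elements
            and not child.attrib
            and not prev.attrib
        )
        if mergeable:
            # Append the second element's leading text to the end of the first.
            if len(prev):
                prev[-1].tail = (prev[-1].tail or "") + (child.text or "")
            else:
                prev.text = (prev.text or "") + (child.text or "")
            prev.extend(list(child))  # move over any child elements
            prev.tail = child.tail    # keep the text that followed the second element
            parent.remove(child)
        else:
            prev = child
    return parent


para = ET.fromstring("<p><i>Wuthering</i><i> Heights</i> is a novel.</p>")
print(ET.tostring(collapse_adjacent(para), encoding="unicode"))
```

Elements separated by any text, such as `<i>a</i> and <i>b</i>`, are left untouched under these assumptions, so the step only removes tagging that is redundant.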
|
|
|
|
|
|
The goal of the entire pipeline (not just the first step) is as clean and simple a representation as possible of the 'labeling' of document parts implicit in formatting and style names in the Word document, with minimal (ideally no) 'interpolation' (only representation) of (nominal) semantics as given in the source data.
|
|
|
|
... | ... | |