... | ... | @@ -42,15 +42,15 @@ Development of a formal spec for such a format is an item tbd. For now, we inten |
|
|
|
|
|
## Extract, then refine.
|
|
|
|
|
|
In the first step, extraction, data is pulled from the Word document. Its representation is "folded" into a markup-idiomatic version via a peculiar kind of structural transformation (XSLT Level 10 spell).
|
|
|
In the first step, extraction, data is pulled from the Word document. Because WordML and its ilk (Office Open XML) are less "markup languages" than object serializations, their representation of the text must be "folded" into a markup-idiomatic version via a peculiar kind of structural transformation (XSLT Level 10 spell).
|
|
|
|
|
|
The resulting HTML will be a correct and fairly transparent representation (albeit expressed in HTML-ese) of the content of the Word document. It will not be optimal: for example, it is likely to be very redundant and repetitive. We tolerate this in data extraction for the sake of transparency and traceability.
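To make the redundancy concrete, here is a minimal sketch of what a very naive extraction could look like. It is written in Python purely for illustration (the project itself describes XSLT for this step), and it handles only a tiny slice of WordML: paragraphs (`w:p`), runs (`w:r`), run properties (`w:rPr`) and text (`w:t`). The function name `extract_paragraph` and the `RUN_PROPS` mapping are hypothetical, not part of the project.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Office Open XML documents.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical mapping from run-property elements to HTML tags.
# Simplification: a property element counts as "on" whenever it is present;
# real WordML can switch a property off with w:val, which is ignored here.
RUN_PROPS = {"i": "i", "b": "b"}


def extract_paragraph(w_p):
    """Turn one w:p element into an HTML <p>, one wrapper per formatted run."""
    p = ET.Element("p")
    for run in w_p.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        holder = p
        rpr = run.find("w:rPr", NS)
        if rpr is not None:
            for prop, html_tag in RUN_PROPS.items():
                if rpr.find(f"w:{prop}", NS) is not None:
                    holder = ET.SubElement(holder, html_tag)
        if holder is p:
            # Unformatted run: append to the paragraph's running text.
            if len(p):
                p[-1].tail = (p[-1].tail or "") + text
            else:
                p.text = (p.text or "") + text
        else:
            holder.text = text
    return p


SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:rPr><w:i/></w:rPr><w:t>Wuthering</w:t></w:r>"
    '<w:r><w:rPr><w:i/></w:rPr><w:t xml:space="preserve"> Heights</w:t></w:r>'
    "</w:p>"
)

html = ET.tostring(extract_paragraph(ET.fromstring(SAMPLE)), encoding="unicode")
print(html)
```

Because each run is translated independently, two consecutive italic runs produce two separate `<i>` elements. That is faithful to the source and easy to trace back to it, which is the trade-off described above.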
|
|
|
|
|
|
Following extraction, data (now HTML) may be piped through a sequence of steps, each performing a different task in data or markup processing, including cleanup. (For example, one such post-process 'collapses' tagging so that \<i>Wuthering\</i>\<i> Heights\</i> becomes \<i>Wuthering Heights\</i>.)
|
|
|
Following extraction, data (now HTML) may be piped through a sequence of steps, each performing a different task in data or markup processing, including cleanup. (For example, one such post-process 'collapses' tagging so that `<i>Wuthering</i><i> Heights</i>` becomes `<i>Wuthering Heights</i>`.)
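The 'collapsing' step mentioned above can be sketched as follows. This is an illustrative Python version, not the project's own implementation (which is described as XSLT); the names `collapse_adjacent`, `collapse` and the `INLINE` set are assumptions made for the example. Two sibling elements are merged only when they have the same tag, the same attributes, and no text between them.

```python
import xml.etree.ElementTree as ET

# Assumption: only inline elements like these are safe to merge.
# Adjacent block elements (two <p> siblings, say) must stay separate.
INLINE = {"i", "b", "u", "em", "strong"}


def collapse_adjacent(parent):
    """Merge adjacent, identically tagged inline siblings under parent, in place."""
    kept = []
    for child in list(parent):
        prev = kept[-1] if kept else None
        if (
            prev is not None
            and child.tag in INLINE
            and prev.tag == child.tag
            and prev.attrib == child.attrib
            and not prev.tail  # no text sits between the two elements
        ):
            # Move the second element's text and children into the first.
            if len(prev):
                prev[-1].tail = (prev[-1].tail or "") + (child.text or "")
            else:
                prev.text = (prev.text or "") + (child.text or "")
            prev.extend(list(child))
            prev.tail = child.tail
            parent.remove(child)
        else:
            kept.append(child)
    # Recurse after merging, so newly adjacent descendants are collapsed too.
    for child in kept:
        collapse_adjacent(child)


def collapse(markup):
    """Parse a well-formed fragment, collapse it, and serialize it again."""
    root = ET.fromstring(markup)
    collapse_adjacent(root)
    return ET.tostring(root, encoding="unicode")


print(collapse("<p><i>Wuthering</i><i> Heights</i></p>"))
```

Elements separated by any text, even a single space, are left alone, since merging them would change the content rather than just the tagging.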
|
|
|
|
|
|
The goal of the entire pipeline (not just the first step) is as clean and simple a representation as possible of the 'labeling' of document parts implicit in formatting and style names in the Word document, with minimal (ideally no) 'interpolation' (only representation) of (nominal) semantics as given in the source data.
|
|
|
The goal of the entire pipeline (not just the first step) is as clean and simple a representation as possible of the 'labeling' of document parts implicit in formatting properties and style names assigned in the WordML data to regions of text, with minimal (ideally no) 'interpolation' of 'meaning' beyond what is given in the source data (either indirectly via formatting properties, or nominally via style names).
|
|
|
|
|
|
Note that separating requirements into extract and refine permits us to design each separately. Probably both phases will ultimately include customization layers. Initially our goal is to see how much we can do with only generic logic.
|
|
|
Note that separating requirements into "extract" and "refine" permits us to design each separately. Possibly either phase could support customization layers. Initially our goal is to see how much we can do with only generic logic.
|
|
|
|
|
|
## Iterative development model
|
|
|
|
... | ... | |