@@ -47,11 +47,11 @@ This is because WORDML IS NOT WHAT (practitioners call) GENERIC MARKUP

## Aiming for the right target

Short version: "Generic markup" (sometimes called 'descriptive markup') is what we are aiming for ultimately -- we can agree about that -- but the way to produce it is to go "up hill" (towards the goal of a putatively clean and economical representation of the document and all its constituent parts) only by stages.

Short version: "Generic markup" (sometimes called 'descriptive markup') is what we are aiming for ultimately -- we can agree about that -- but the way to produce it is to go "up hill" only by stages.

Because our intermediate formats will (also) be HTML, however, they may be immediately useful, or at least legible. That is, we take advantage of the fact that HTML (specifically HTML5) proves to be a fairly tractable carrier for the kind of *presentational encoding* found in the WordML source, in two ways. First, it is fairly forgiving: it gives us something that can be very loose and messy and still be HTML. Second, because we and our tools already "know what to do with it" (it is HTML), we can use HTML tools on it.

Interestingly enough, we can do this all with an XML and specifically an XSLT-based pipeline architecture. Not only that, but if we take care that our HTML5 outputs are also well-formed XML, we can attach the extraction component to further processes (including XSLT processes) to provide missing parts of a complete solution.

Interestingly enough, the necessary extractions and modifications to get data out of the XML format embedded in Word .docx (nominally 'WordML') can be implemented on an XML platform, and specifically with an XSLT-based pipeline architecture. Indeed, if we remain syntactically conformant to XML, ensuring at every point that our HTML5 outputs are also well-formed XML, we can attach the extraction component to further processes (including XSLT processes) to provide missing parts of a complete solution.

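To make the point about well-formedness concrete, here is a minimal sketch in Python (standing in for XSLT only so that it is easy to run; it is not the pipeline described here). It assumes a heavily simplified WordML paragraph. The names `WORDML`, `extract` and `count_italics` are hypothetical. Real WordML carries far more than this, and a real `w:i` may be switched off with a `w:val` attribute, which the sketch ignores. What it shows is only that an HTML result serialized as well-formed XML can be handed straight to another XML process.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, heavily simplified WordML paragraph: one plain run, one italic run.
WORDML = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>Moby-Dick</w:t></w:r>
</w:p>"""


def extract(wordml: str) -> str:
    """Stage 1: map a paragraph and its runs to HTML, keeping italics as plain <i>."""
    para = ET.fromstring(wordml)
    p = ET.Element("p")
    last = None  # most recently appended child of <p>, if any
    for run in para.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # Simplification: any w:i counts as italic (w:val is not inspected).
        if run.find("w:rPr/w:i", NS) is not None:
            last = ET.SubElement(p, "i")
            last.text = text
        elif last is None:
            p.text = (p.text or "") + text
        else:
            last.tail = (last.tail or "") + text
    return ET.tostring(p, encoding="unicode")


def count_italics(html: str) -> int:
    """Stage 2 (stand-in for any later XML process): re-parse and inspect."""
    return len(ET.fromstring(html).findall(".//i"))


html = extract(WORDML)
print(html)
print(count_italics(html))
```

Because the first stage emits something an XML parser accepts, the second stage needs no HTML-specific tooling at all; an XSLT step could take its place without changes to the first.
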
### Generic Markup (Considered as One of the Fine Arts)

@@ -151,6 +151,10 @@ These are all more or less the same or at any rate semantically equivalent inasm

It turns out that among these the choice is fairly clear: HTML5 is the clear winner among document formats as an initial target for a Word Extractor. Only a language designed specifically to expose the information (the 'made for purpose' tagging shown) even comes close. (Note *initial* format: we say nothing of what we might improve this into eventually.)

This leaves it to the editorial team to do what is really important. Rather than impose or infer any semantics, that is, we hold that the proper function of the extraction process is to reflect the distinctions already given in the WordML source, in whatever form they are given. These distinctions, being the necessary points of semantic inflection, may provide a basis on which semantics (whether of labels or of structural relations) can be exposed and expressed by an *editorial* process.

It is, in other words, the proper function not of an extractor but of its target editing environment to permit users to provide any structure not given explicitly in the source data, as well as to discover rules depending on rationales not given (rules that can distinguish, for example, a 'title.cited' from some other sort of italics). Since we expect to have an editing environment that can provide us this level of control and capability, our .docx extraction doesn't really have to worry about it.

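The division of labor can be sketched as follows, again in Python and again only as an illustration under stated assumptions. The extractor output in `EXTRACTED`, the set `KNOWN_TITLES`, the function `apply_editorial_rule`, and the choice of `cite` with a `title-cited` class to stand for 'title.cited' are all hypothetical. The point is that the rule, and the knowledge behind it, arrive after extraction and from outside it.

```python
import xml.etree.ElementTree as ET

# Extractor output as assumed here: italics preserved, no meaning inferred.
EXTRACTED = "<p>See <i>Moby-Dick</i>, which is <i>very</i> long.</p>"

# Hypothetical editorial knowledge, supplied by a person, not by the extractor.
KNOWN_TITLES = {"Moby-Dick"}


def apply_editorial_rule(html: str, titles: set) -> str:
    """Relabel childless <i> elements whose text is a known title; leave the rest alone."""
    root = ET.fromstring(html)
    for el in root.iter("i"):
        if len(el) == 0 and (el.text or "").strip() in titles:
            el.tag = "cite"
            el.set("class", "title-cited")
    return ET.tostring(root, encoding="unicode")


result = apply_editorial_rule(EXTRACTED, KNOWN_TITLES)
print(result)
```

The second italic run stays as it was: nothing in the source data says what it means, so nothing downstream pretends to know until an editor says so.
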
### Why HTML5 turns out to be 'most excellent'

For any target format serving a data conversion out of .docx, there would be three nice-to-haves: