|
|
|
|
|
The beauty of listing them in order is that we can see that at least the low end can be addressed as "sloppy HTML" or HTML slops (messy, but nutritious), while even the high end could be addressed using a carefully defined and validated profile of HTML or (better) HTML5 (because of `section` etc.)
|
|
|
|
|
|
|
|
|
This is because **WordML is not what practitioners call *generic markup***.
|
|
|
|
|
|
## Aiming for the right target
|
|
|
|
|
|
|
|
|
Short version: "Generic markup" (sometimes called 'descriptive markup') is what we are aiming for ultimately -- we can agree about that -- but the way to produce it is to go "up hill" (towards the goal of a putatively clean and economical representation of the document and all its constituent parts) only by stages.
|
|
|
|
|
|
Because our intermediate formats, however, will (also) be HTML, they may be immediately useful, or at least legible. That is, we take advantage of the fact that HTML (specifically HTML5) proves to be a fairly tractable carrier for the kind of *presentational encoding* found in the WordML source, in two ways: because it gives us something that can be very loose and messy (and still be HTML), it is fairly forgiving; and because we and our tools already "know what to do with it" (it is HTML), we can use HTML tools on it.
|
|
|
|
|
|
Interestingly enough, we can do this all with an XML and specifically an XSLT-based pipeline architecture. Not only that, but if we take care that our HTML5 outputs are also well-formed XML, we can attach the extraction component to further processes (including XSLT processes) to provide missing parts of a complete solution.
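To make the chaining concrete, here is a minimal sketch (the stage names and the `data-stage` annotation are invented for illustration; the real pipeline is XSLT). The point it demonstrates: because each stage emits well-formed XML that is also HTML5, the next stage can parse its predecessor's output with ordinary XML tooling.

```python
# Hypothetical two-stage sketch (the real pipeline is XSLT): stage 1
# "extracts" to well-formed HTML5-as-XML, so stage 2 can parse it directly.
import xml.etree.ElementTree as ET

def stage1_extract(wordml: str) -> str:
    """Stand-in extraction stage: emit a well-formed HTML5 fragment."""
    root = ET.fromstring(wordml)
    text = "".join(root.itertext())
    return '<p class="BodyText">' + text + '</p>'

def stage2_refine(html: str) -> str:
    """A downstream stage: consumes stage 1's output as ordinary XML."""
    p = ET.fromstring(html)       # parses because the HTML is well-formed XML
    p.set("data-stage", "2")      # e.g. annotate for the next step
    return ET.tostring(p, encoding="unicode")

result = stage2_refine(stage1_extract("<doc>Hello there</doc>"))
print(result)  # <p class="BodyText" data-stage="2">Hello there</p>
```

Each stage only needs to promise well-formedness to its successor, which is exactly what lets us bolt extraction onto "further processes (including XSLT processes)".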
|
|
|
|
|
|
### Generic markup (and its discontents)
|
|
|
|
|
|
|
|
|
As an illustration of our problem in general, consider a microcosmic view, an example reduced to the barest possible. (The rest of the problem is much like this, only greatly magnified in scale and complexity.) Consider the following line:
|
|
|
|
|
|
```
<b>Gene Roddenberry's <i>Star Trek</i></b>
```
|
|
|
|
|
|
|
|
|
We might prefer (for one reason or another) to have any of these as a nicely-tagged representation suitable for further processing in an appropriate toolchain -- these are all variant species of descriptive or generic markup:
|
|
|
|
|
|
|
|
|
- `<emphasis role="strong">Gene Roddenberry's <citation>Star Trek</citation></emphasis>` (DocBook)

- `<emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>` (TEI)

- `<b>Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></b>` (NISO JATS/BITS)

- `<b>Gene Roddenberry's <cite>Star Trek</cite></b>` (DITA)
|
|
|
|
|
|
What these all have in common, each one expressing it in its own way, is that they encode aspects and features of the text by identifying them according to their 'kind' (hence 'generic' markup). As such, they are fit not only for archiving but for arbitrary reuse, aggregation into collections (where consistent tagging semantics can support querying), republishing in manifold formats, and so on.
|
|
|
|
|
|
|
|
|
It is reasonable to stipulate any or all of these as worthwhile end points, if only because we know of real systems that use all of them. The question here is not whether to aim for markup of this quality, but how. Especially since despite their variations these all have another thing in common, namely how far they are from what is going to be discovered inside a Word document, such as the following (here is a sample of the XML we find buried deep in a .docx file):
|
|
|
|
|
|
WordML
|
|
|
```
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
    <w:i/>
  </w:rPr>
  <w:t>Star Trek</w:t>
</w:r>
```
|
|
|
|
|
|
|
|
|
or indeed (what is just as likely):
|
|
|
|
|
|
```
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
    <w:i/>
  </w:rPr>
  <w:t xml:space="preserve">Star </w:t>
</w:r>
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
    <w:i/>
    <w:iCs/>
  </w:rPr>
  <w:t>Trek</w:t>
</w:r>
```
|
|
|
|
|
|
|
|
|
(If you squint you can see that the code is starting to preserve a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism. And yes, in general, the more you edit the document, the worse this gets.)
|
|
|
|
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a metaphorical sense ('noise' as 'tag entropy'), and "promiscuous". (It turns out the same thing may be said in many different ways, as well as repeatedly. Again, the snippet offered merely hints at the actual complexity and verbosity of WordML internals.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. (After all, that is what happens inside Word itself, and it isn't magic.) But before we can get even to that point, putting this back together may take some care.
|
|
|
|
|
|
|
|
|
Even setting aside the noise and redundancy, however (and there are ways of dealing with them), there is a more crucial problem. Namely, the information we want simply isn't there. Even in the tiny example, there is nothing to indicate that the italicized string *Star Trek* is a 'title' or 'title.cited', as one or another of the descriptive encoding systems has it. This represents a more formidable barrier - how to know this from that, to apply the recommended tagging correctly to data that gives no explicit indication. (Since not every italicized bit of text will be a "cited title", by any means.) What is worse, we face this impediment everywhere we turn.
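To make the "how do we tell the difference" question concrete, here is a deliberately naive sketch. The context labels are invented for illustration; the point is that italics alone under-determine the target tag, so any mapping rule must lean on contextual evidence that WordML itself never records.

```python
# Naive illustration of the semantic-inference gap (context labels invented):
# the same italic rendition maps to different generic tags depending on
# evidence that lives outside the markup.
def classify_italic(text: str, context: str) -> str:
    """Guess a generic tag for an italicized span. Heuristic only."""
    if context in ("bibliography", "after-possessive") and text[:1].isupper():
        return "title.cited"   # e.g. "Gene Roddenberry's <i>Star Trek</i>"
    return "emphasis"          # the safe default when we cannot be sure

print(classify_italic("Star Trek", "after-possessive"))  # title.cited
print(classify_italic("really", "running-text"))         # emphasis
```

Any real rule set would be larger, but no rule set escapes the basic problem: the decisive signal is not in the source data.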
|
|
|
|
|
|
This problem is pervasive in WordML, which lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye (using our semantically-aware markup vocabulary). What the Word document (.docx file) represents in operation is something entirely outside the .docx itself (regarded as a file or data set), namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). And even this is not entirely true, or is at least further complicated by the way in which Word also becomes a kind of environment in its own right: the .docx is never 'fully itself' except when it is open in MS Word.
|
|
|
|
|
|
|
|
|
On the other hand we aren't actually interested in the Word document "fully itself", even as it is printed or displayed on screen -- but the book, article, research paper or other "thing" hiding inside it. (That is, as readers we see into and infer from the print, to construe the logical forms behind it.) This problem, in a data conversion system, is considered a species or variety of "mapping problem" (or a problem of "semantic inference" or "semantic attribution"): we see italics, we recognize a title. But the next time we see italics it might be something else. How do we tell the difference?
|
|
|
|
|
|
### Ascending step by step instead of all at once
|
|
|
|
|
|
Similarly, more complex kinds of structural inference -- especially for more "semantic" structures such as figures-with-captions, tables, pull quotes and what have you -- will present challenges, not just because the data is complex, messy and redundant, but also because those things are never given as such; they are present only implicitly. Despite the fact that as writers, readers and editors we find these things obvious enough (at least, we know how to interpret the cues we see), at the level of the encoding these configurations are rarely the same in any two Word documents. Indeed we may need to see them to really know what they look like in any given case.
|
|
|
|
|
|
|
|
|
However, recognizing the limitations here actually provides us a way forward. Rather than try to infer or construe information not given, the solution is to show ourselves the information we have, but in a more legible and tractable form. That is, to translate the Word into a form we can read, but not to try translating it all the way out of the language it uses in order to communicate what it says (whether to printer, PDF file generator or human reader) when it (just for example) puts something in italics. In other words, our first task is to *extract* the data into a format that removes and reduces all the redundancy, distilling the "warrants" and claims of the Word document ('this bit is italic, that bit is styled 'Header.2') into a form we can read and work with.
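A minimal sketch of such an extraction, assuming run shapes like the WordML snippets above (a production extraction, in XSLT or otherwise, would handle far more than bold and italic):

```python
# Sketch only: distill a WordML run's "claims" (bold? italic?) into flat
# presentational HTML, discarding the redundant wrapper machinery.
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def run_to_html(run: ET.Element) -> str:
    rpr = run.find(W + "rPr")
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    html = text
    if rpr is not None:
        if rpr.find(W + "i") is not None:
            html = "<i>" + html + "</i>"
        if rpr.find(W + "b") is not None:
            html = "<b>" + html + "</b>"
    return html

sample = (
    '<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:rPr><w:b/><w:i/></w:rPr><w:t>Star Trek</w:t></w:r>'
)
print(run_to_html(ET.fromstring(sample)))  # <b><i>Star Trek</i></b>
```

Note that nothing here is inference: the output states exactly and only what the input warranted, which is the whole point of the extraction step.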
|
|
|
|
|
|
|
|
|
Indeed only if we do *as little as possible* in changing the representation of the data, from Word's own obscure structures, into something relatively more legible and tractable, will we have a process that will be as transparent as we need it to be, end to end.
|
|
|
|
|
|
|
|
|
So far so good - but what will that format actually be? It is not hard to envision what it would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, most especially idioms related to inline markup. But instead of using the obscure and impenetrable Word vocabulary, it would (again) use either a standard or made-for-purpose vocabulary. The difference is that it would translate the renditional (presentational) properties more or less directly, maybe looking something like this:
|
|
|
|
|
|
|
|
|
- `<b>Gene Roddenberry's <i>Star Trek</i></b>` (HTML)

- `<b>Gene Roddenberry's <i>Star Trek</i></b>` (DITA)

- `<bold>Gene Roddenberry's <italic>Star Trek</italic></bold>` (JATS/BITS)

- `<emphasis role="bold">Gene Roddenberry's <emphasis>Star Trek</emphasis></emphasis>` (DocBook)

- `<hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>` (TEI)

- `<run format="b">Gene Roddenberry's <run format="i">Star Trek</run></run>` (made-for-purpose)
|
|
|
|
|
|
These are all more or less the same, or at any rate semantically equivalent, inasmuch as any one of them could be mapped onto any of the others. Also note that none of them presents quite the level of semantic richness of the TEI and JATS examples above (which tell us, for example, that 'Star Trek' is a title, not only italicized). This is merely what is called *presentational* markup. Yet maybe this is enough for a first step.
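The "any one could be mapped onto any of the others" claim is easy to demonstrate: for presentational markup this shallow, a tag-for-tag substitution suffices. A sketch (the mapping table shown covers only the HTML-to-JATS pair; the other pairs work the same way):

```python
# Sketch: presentational variants are mechanically inter-mappable.
# Tag-for-tag substitution carries HTML's b/i into JATS's bold/italic.
import xml.etree.ElementTree as ET

HTML_TO_JATS = {"b": "bold", "i": "italic"}

def remap(el: ET.Element, table: dict) -> ET.Element:
    for node in el.iter():
        node.tag = table.get(node.tag, node.tag)
    return el

frag = ET.fromstring("<b>Gene Roddenberry's <i>Star Trek</i></b>")
print(ET.tostring(remap(frag, HTML_TO_JATS), encoding="unicode"))
```

(Mappings involving attributes, as in the DocBook and TEI variants, take a few more lines but no more intelligence.)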
|
|
|
|
Or even (in the rare case) something like `<span class="title.cited">Star Trek</span>` --
|
|
|
|
|
if the Word user had assigned a paragraph style "title.cited" to this range of text.
|
|
|
|
|
|
(And it turns out, this is doable -- essentially a matter of listening for the signal in the noise.)
|
|
|
|
|
|
Why does HTML make a good target vocabulary?
|
|
|
|
|
|
|
|
|
- We can easily leave our documents 'flat' as long as we need to - structure can come later! (This is a key distinction vs our final target format)
|
|
|
- It has `@class` and `@style`, fantastic escape hatches!
|
|
|
|
|
|
- One of the escape hatches gives us CSS (useful, since we are describing presentational features), while the other can expose Word styles (since it is suited to user-driven semantic labeling)
|
|
|
- Yet at the same time, HTML semantics are not so rich as to be very arguable (anything will do)
|
|
|
- To top it off, HTML is a well-known vernacular --
|
|
|
|
|
|
|
|
|
- A custom vocabulary would have to be designed, tested, documented and learned by users; HTML lets us just fake it for now*;
|
|
|
- And since we are expecting to edit (at least initially) on an HTML platform, and go from there when it comes to other formats for interchange/archiving - we can just stick with that plan ...
|
|
|
- ... Illustrating the point: anyone can use HTML5 (especially well-formed XML HTML5), so let's use that
|
|
|
|
|
|
|
|
|
(* Later if need be we can come back to formalize the target format as a profile of HTML5.)
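For instance (the attribute values here are invented), the two escape hatches might carry a Word style name and some direct formatting side by side, while claiming nothing semantically:

```python
# Sketch of the two escape hatches in use (values invented): @class carries
# the Word style name verbatim, @style carries literal CSS for direct
# formatting. Neither makes any semantic claim yet.
def para_to_html(style_name: str, css: str, text: str) -> str:
    attrs = ""
    if style_name:
        attrs += ' class="%s"' % style_name
    if css:
        attrs += ' style="%s"' % css
    return "<p%s>%s</p>" % (attrs, text)

print(para_to_html("Header.2", "font-weight:bold", "Aiming for the right target"))
# <p class="Header.2" style="font-weight:bold">Aiming for the right target</p>
```

A browser renders this legibly today, while the `class` value keeps the Word style name available for the later, smarter mapping step.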
|
|
|
|
|
|
Note the non-canonical and arguably deprecated heavy use of `@style` - we justify this on the grounds that we are going *up hill* and *by the time we reach the top* we can *cast these properties aside as nothing more than the engine that has got us there*.
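Casting the engine aside at the top of the hill can be as blunt as this sketch suggests (the real pipeline would do it in XSLT, and only once semantic tagging is actually in place):

```python
# Sketch: once uphill inference has produced semantic tags, the @style and
# @class attributes that carried us there can simply be dropped.
import xml.etree.ElementTree as ET

def strip_presentational(root: ET.Element) -> ET.Element:
    for el in root.iter():
        el.attrib.pop("style", None)
        el.attrib.pop("class", None)
    return root

frag = ET.fromstring(
    '<p class="Para1" style="margin-left:1em">Done <i style="color:red">uphill</i></p>'
)
print(ET.tostring(strip_presentational(frag), encoding="unicode"))
# <p>Done <i>uphill</i></p>
```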
|
|
|
|
... | ... | |