This is because WordML is not what practitioners call "generic markup".
|
|
|
|
|
Short version: "Generic markup" (sometimes called 'descriptive markup') is what we are aiming for ultimately -- but the way to produce it is to go "up hill" (towards a putatively clean and economical representation of the document) only by stages.
|
|
|
|
|
|
|
|
|
### Generic markup (and its discontents)
|
|
|
|
|
|
|
|
|
As an illustration, consider a microcosmic view of the problem, reduced to the barest possible example. (The rest of the problem is much like this, only greatly magnified in scale and complexity.) Take the following line:
|
|
|
|
|
|
```
<b>Gene Roddenberry's <i>Star Trek</i></b>
```
|
|
|
|
We might prefer (for one reason or another) to have a nicely-tagged equivalent in any of several vocabularies:
|
|
|
|
|
- DocBook `<emphasis role="strong">Gene Roddenberry's <citetitle>Star Trek</citetitle></emphasis>`
|
|
|
- TEI `<emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>`
|
|
|
|
|
|
- JATS `<bold>Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></bold>`
|
|
|
- DITA `<b>Gene Roddenberry's <cite>Star Trek</cite></b>`
|
|
|
|
|
|
What these all have in common, each expressing it in a different way, is that they encode aspects and features of the text by identifying them by 'kind' (hence 'generic' markup). As such, they are fit not only for archiving but for arbitrary reuse, aggregation into collections (where tag semantics can support querying), republishing in manifold formats, etc. etc.
|
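For concreteness, here is a hypothetical reconstruction (not an actual Word export) of the kind of WordprocessingML such a line produces. Note the run properties restated on every run, and the `rsid` revision identifiers left over from editing sessions:

```
<w:p w:rsidR="00A42F11" w:rsidRDefault="00A42F11">
  <w:r w:rsidRPr="00C53D02">
    <w:rPr><w:b/></w:rPr>
    <w:t xml:space="preserve">Gene Roddenberry's </w:t>
  </w:r>
  <w:r w:rsidRPr="00C53D02">
    <w:rPr><w:b/><w:i/></w:rPr>
    <w:t>Star Trek</w:t>
  </w:r>
</w:p>
```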
|
|
|
|
|
(If you squint you can see that the code is starting to preserve a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism.)
|
|
|
|
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant: "noisy" in a metaphorical sense ('noise' as 'tag entropy') and "promiscuous" (the same thing may be said in many different ways, as well as repeatedly - and the snippet offered merely hints at the actual complexity and verbosity of WordML internals). The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. (After all, that is what happens inside Word itself, and it isn't magic either.) But before we can even get to that point, putting this back together may take some care.
|
|
|
|
|
|
|
|
|
Even setting aside this problem (and there are ways of dealing with all the noise and redundancy) there is a more crucial problem. Namely, the information we want simply isn't there. Even in the tiny example, there is nothing to indicate that the italicized string *Star Trek* is a 'title' or 'title.cited', as one or another of the descriptive encoding systems has it. This represents a more formidable barrier: how to tell this from that - how to apply the recommended tagging correctly to data that gives no explicit indication. (Since not every italicized bit of text is a "cited title", by any means.)
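To see the difficulty, compare two runs that come out of Word looking identical but that call for different generic tagging. (The TEI-style target elements shown here are illustrative, not a prescription:)

```
<i>Star Trek</i>       →  <title>Star Trek</title>
<i>sine qua non</i>    →  <foreign xml:lang="la">sine qua non</foreign>
```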
|
|
|
|
|
|
|
|
|
This problem is pervasive in WordML, which lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye (using our semantically-aware markup vocabulary). What the Word document (.docx) represents in operation is something entirely outside the .docx itself (regarded as a file or data set), namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). And even this is actually not entirely true or is, at least, further complicated by the way in which Word also becomes a kind of environment in its own right, and so the .docx is never 'fully itself' except when it is in MS Word itself.
|
|
|
|
|
|
Even in this little sample, a case of this is evident if you look closely at the *structure* of the outputs compared to the inputs. The clean markup samples all have the structure `<b>(bold <i>(italic)</i>)</b>`, where the 'italic' span is nested inside the bold span. In contrast, the Word document shows `<b>(bold)</b><b><i>(bold italic)</i></b>` - two spans next to each other, the second restating the bold property.
|
|
|
On the other hand we aren't actually interested in the Word document "fully itself", even once it is printed or displayed (correctly) on screen -- but the book, article, research paper or other "thing" hiding inside it.
|
|
|
|
|
|
There is no absolute rule that says this bit of text "is" one or the other: our problem is only that we need to get from one representation to the next. And indeed this particular transposition in the treatment of paragraph contents is one that can be handled with a general transformation, so it is "only painful". In other words, it is the kind of pain we can make go away with our black box, which has a component that performs this transposition for us.
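The transposition can be sketched in a few lines. This is a toy model, not the actual black box: runs are represented as `(text, properties)` pairs, the way WordML delivers them, and adjacent runs sharing the bold property are regrouped under a single nested span.

```python
# Toy sketch of the run-merging transposition: Word's flat sibling runs
# become nested inline markup. Not the actual pipeline component.
from itertools import groupby

def render_run(text, props):
    """Wrap one run's text in <i> if the run carries the italic property."""
    return f"<i>{text}</i>" if "i" in props else text

def merge_runs(runs):
    """Group adjacent runs by boldness, nesting italics inside bold spans."""
    out = []
    for is_bold, group in groupby(runs, key=lambda r: "b" in r[1]):
        body = "".join(render_run(text, props) for text, props in group)
        out.append(f"<b>{body}</b>" if is_bold else body)
    return "".join(out)

# Word's flat representation: two sibling runs, the second restating 'b'.
runs = [
    ("Gene Roddenberry's ", {"b"}),
    ("Star Trek", {"b", "i"}),
]
print(merge_runs(runs))  # <b>Gene Roddenberry's <i>Star Trek</i></b>
```

The point of the sketch is only that the regrouping is mechanical: no judgment about 'titles' is needed, so this pain really can be made to go away once and for all.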
|
|
|
### Ascending step by step instead of all at once
|
|
|
|
|
|
|
|
|
Similarly, we expect more complex kinds of structural inferencing - especially of more "semantic" structures such as figures-with-captions, tables, pull quotes and what have you - to be more difficult. Despite the fact that as writers, readers and editors we find these things obvious and transparent, at the level of the encoding these configurations are rarely the same in any two Word documents. Indeed we need to see them to really know what they look like in any given case.
|
|
|
|
|
|
And this is why it needs to be a team activity: because the outlines of that form, in all its implications, will be critical to giving the text its affordances in the electronic environment. Everything about the text in production will spill out from this formalizing activity.
|
|
|
|
|
|
|
|
|
Yet at the same time, asking team members to look at WordML code will never work. Better if we can *extract* it first into a format that removes the redundancy, distilling the "warrants" and claims of the Word document ("this bit is italic, that bit is styled 'Header.2'") into a form we can read and work with.
|
|
|
|
|
|
The way we approach discovering the outlines of this form is, first, by representing the Word data as itself, with no enhancement or improvement -- yet, in a language we can read. Only if we do *as little as possible* in changing the representation of the data, from Word's own obscure structures, into something relatively more legible and tractable, will we have a process that is as transparent as we need it to be.
|
|
|
|
|
|
So far so good - but what will that format actually be? It is not hard to envision what it would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, most especially idioms related to inline markup. But instead of using the obscure and impenetrable Word vocabulary, it would use either a standard or a made-for-purpose vocabulary, maybe looking something like this:
|
|
|
|
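For instance - these candidates are sketched for illustration only; the first is plain HTML, the second a hypothetical made-for-purpose vocabulary:

```
<b>Gene Roddenberry's <i>Star Trek</i></b>

<inline style="bold">Gene Roddenberry's <inline style="italic">Star Trek</inline></inline>
```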
|
|
|
|
|
These are all more or less the same, or at any rate semantically equivalent, inasmuch as any one of them could be mapped to any of the others. Also note that none of them presents quite the level of semantic richness of the TEI and JATS examples above (which tell us, for example, that 'Star Trek' is a title, not only italicized). This is merely what is called *presentational* markup. Yet maybe this is enough!
|
|
|
|
|
|
|
|
|
It turns out that, among these, the choice is fairly clear: HTML5 is the clear winner among document formats as an initial target for a Word extractor. (Note *initial* format - we say nothing of what we might improve this into eventually.)
|
|
|
|
|
|
### Why HTML5 turns out to be 'most excellent'
|
|
|
|
|
|
That is, we would like to say
|
|
|
|
|
|
```
<b>Gene Roddenberry's <i>Star Trek</i></b>
```

or perhaps `<b>Gene Roddenberry's <i class="title.cited">Star Trek</i></b>`, if the Word user had assigned a paragraph style "title.cited" to this range of text.
|
|
|
|
|
Why does HTML make a good target vocabulary?
|
|
|
|
|
|
|
|
|
(a) we can easily leave our documents 'flat' as long as we need to - structure can come later!
|
|
|
(b) it has these fantastic escape hatches!
|
|
|
(c) did I mention escape hatches? One of them escapes into CSS, while the other can expose Word styles!
|
|
|
(d) yet at the same time, HTML semantics are not so rich as to be very arguable (anything will do)
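Points (a)-(c) can be sketched together in a hypothetical extraction: a flat sequence of blocks with no section nesting, carrying Word style names through as `class` values for CSS to match. (The content and class names here are invented for illustration:)

```
<h2 class="Heading2">Origins</h2>
<p class="BodyText">A much-quoted line:</p>
<p class="BlockQuote"><b>Gene Roddenberry's <i class="title.cited">Star Trek</i></b></p>
```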
|