## Aiming for the right target

Short version: "Generic markup" (sometimes called 'descriptive markup') is what we are aiming for ultimately -- but the way to produce it is to go "up hill" (towards a putatively clean and economical representation of the document and all its constituent parts) only by stages.

### Generic markup (and its discontents)

There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant: "noisy" in a metaphorical sense ('noise' as 'tag entropy'), and "promiscuous". (It turns out the same thing may be said in many different ways, as well as repeatedly. And the snippet offered merely hints at the actual complexity and verbosity of WordML internals.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. (After all, that is what happens inside Word itself, and it isn't magic either.) But before we can get even to that point, putting this back together may take some care.
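By way of reminder, here is a drastically hand-simplified sketch of the WordML idiom for a bolded phrase containing an italicized title. (Actual Word output is considerably noisier than this, with revision identifiers, font hints and the rest.)

```xml
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t xml:space="preserve">Gene Roddenberry's </w:t>
  </w:r>
  <w:r>
    <w:rPr><w:b/><w:i/></w:rPr>
    <w:t>Star Trek</w:t>
  </w:r>
</w:p>
```

Note how the bold property must be restated on every run, and how the italicized phrase is not a contained element at all -- only a second run whose property set happens to differ from its neighbor's.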

Even setting aside this problem (and there are ways of dealing with all the noise and redundancy) there is a more crucial problem: the information we want simply isn't there. Even in the tiny example, there is nothing to indicate that the italicized string *Star Trek* is a 'title' or 'title.cited', as one or another of the descriptive encoding systems has it. This represents a more formidable barrier -- how to tell this from that, so as to apply the recommended tagging correctly to data that gives no explicit indication. (Since not every italicized bit of text will be a "cited title", by any means.)

This problem is pervasive in WordML, which lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye (using our semantically-aware markup vocabulary). What the Word document (.docx file) represents in operation is something entirely outside the .docx itself (regarded as a file or data set), namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). And even this is actually not entirely true or is, at least, further complicated by the way in which Word also becomes a kind of environment in its own right, and so the .docx is never 'fully itself' except when it is in MS Word itself.

On the other hand we aren't actually interested in the Word document "fully itself", even once it is printed or displayed (correctly) on screen -- but in the book, article, research paper or other "thing" hiding inside it. This problem is called the "mapping problem" (or a problem of "semantic inference" or "semantic attribution"): we see italics, we recognize a title. But the next time we see italics it might be something else. How do we tell the difference?
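A minimal sketch of why the naive mapping rule fails. (The rule, the sample spans and their classifications below are all hypothetical, purely for illustration.)

```python
# The naive "semantic attribution" rule: every italic span is a cited title.
def naive_rule(italic_text):
    return "title.cited"

# Italicized spans as a copyeditor would actually classify them (invented data).
italic_spans = [
    ("Star Trek", "title.cited"),  # a cited title: the rule gets this one right
    ("et al.",    "latinism"),     # italics marking a Latin phrase: wrong
    ("never",     "emphasis"),     # italics for emphasis: wrong again
]

for text, actual in italic_spans:
    guessed = naive_rule(text)
    verdict = "ok" if guessed == actual else "WRONG"
    print(f"{text!r}: guessed {guessed!r}, actually {actual!r} -> {verdict}")
```

Any real mapping has to bring in evidence beyond the italics themselves -- surrounding style names, position, vocabulary -- which is exactly the judgment the encoding does not record.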
### Ascending step by step instead of all at once

Similarly, more complex kinds of inferencing of structures -- especially more "semantic" structures such as figures-with-captions, tables, pull quotes and what have you -- all of these will present challenges, not just because the data is complex, messy and redundant, but also because those things (at least as such) are never given, and are present only implicitly. Despite the fact that as writers, readers and editors we find these things obvious and transparent, at the level of the encoding these configurations are rarely the same in any two Word documents. Indeed we may need to see them to really know what they look like in any given case.
And this is why it needs to be a team activity: because the outlines of that form, in all its implications, will be critical to giving the text its affordances in the electronic environment. Everything about the text in production will spill out from this formalizing activity.

Yet at the same time, asking team members to look at WordML code will never work. Better if we can *extract* it, first, into a format that removes and reduces all the redundancy, distilling the "warrants" and claims of the Word document ("this bit is italic, that bit is styled 'Header.2'") into a form we can read and work with.

Indeed only if we do *as little as possible* in changing the representation of the data, from Word's own obscure structures, into something relatively more legible and tractable, will we have a process that is as transparent as we need it to be.
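As a sketch of that minimal move (Python standard library here, purely for illustration -- this is not the actual extraction tooling), we copy Word's own formatting claims across into presentational tags and infer nothing:

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# A hand-simplified WordML paragraph; real Word output is far noisier.
WORDML = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Gene Roddenberry's </w:t></w:r>
  <w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>Star Trek</w:t></w:r>
</w:p>"""

def extract_run(run):
    """Report exactly the formatting claims Word makes for one run; infer nothing."""
    text = "".join(t.text or "" for t in run.findall(W + "t"))
    props = run.find(W + "rPr")
    if props is not None and props.find(W + "i") is not None:
        text = "<i>" + text + "</i>"
    if props is not None and props.find(W + "b") is not None:
        text = "<b>" + text + "</b>"
    return text

def extract_paragraph(p):
    return "<p>" + "".join(extract_run(r) for r in p.findall(W + "r")) + "</p>"

print(extract_paragraph(ET.fromstring(WORDML)))
# -> <p><b>Gene Roddenberry's </b><b><i>Star Trek</i></b></p>
```

Note that even the redundancy survives (two adjacent `<b>` elements rather than one): merging them would be an improvement, and improvements come later, in steps we can see and check.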

So far so good -- but what will that format actually be? It is not hard to envision what it would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, most especially idioms related to inline markup. But instead of using the obscure and impenetrable Word vocabulary, it would use either a standard or a made-for-purpose vocabulary, maybe looking something like this:
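For instance (reconstructing the flavor of the options; these particular renderings are illustrative), the same phrase could be tagged in HTML, in TEI used presentationally, in a JATS-like idiom, or in a made-for-purpose vocabulary:

```xml
<!-- HTML -->
<b>Gene Roddenberry's <i>Star Trek</i></b>

<!-- TEI, used presentationally -->
<hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>

<!-- JATS, used presentationally -->
<bold>Gene Roddenberry's <italic>Star Trek</italic></bold>

<!-- made for purpose (one possibility) -->
<b>Gene Roddenberry's <i class="title.cited">Star Trek</i></b>
```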

These are all more or less the same, or at any rate semantically equivalent, inasmuch as any one of them could be mapped to any of the others. Also note that none of them presents quite the level of semantic richness of the TEI and JATS examples above (which tell us, for example, that 'Star Trek' is a title, not only italicized). This is merely what is called *presentational* markup. Yet maybe this is enough!

Among these, it turns out, the choice is fairly clear. Only a made-for-purpose language designed specifically to expose the information (the 'made-for-purpose' tagging shown) even comes close. It turns out that HTML5 is the clear winner among document formats as an initial target for a Word Extractor. (Note *initial* format -- we say nothing of what we might improve this into eventually.)
### Why HTML5 turns out to be 'most excellent'

```
<b>Gene Roddenberry's <i>Star Trek</i></b>
```

Or even (in the rare case)
```
<b>Gene Roddenberry's <span class="title.cited">Star Trek</span></b>
```

if the Word user had assigned a paragraph style "title.cited" to this range of text.
Why does HTML make a good target vocabulary?

- We can easily leave our documents 'flat' as long as we need to - structure can come later!
- It has `@class` and `@style`, fantastic escape hatches!
- One of the escape hatches gives us CSS! While the other can expose Word Styles!
- Yet at the same time, HTML semantics are not so rich as to be very arguable (anything will do)
- To top it off, HTML is a well-known vernacular --
- A custom vocabulary would have to be designed, tested, documented and learned; HTML lets us just fake it*;
- And since we are expecting to edit (at least initially) on an HTML platform, and go from there when it comes to other formats for interchange/archiving - we can just stick with that plan
(* We can come back to formalize the target format later after we have some data and experience)
Note the non-canonical and arguably deprecated heavy use of @style - we justify this on the grounds that we are going *up hill* and *by the time we reach the top* we can *cast these properties aside as nothing more than the engine that has got us there*.
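Concretely, extracted output might lean on both escape hatches at once -- the Word style name surviving in `@class`, the formatting claims in `@style`. (The property values here are invented for illustration.)

```html
<p class="BodyText" style="margin-left: 0.5in; text-indent: -0.25in">
  Gene Roddenberry's <span style="font-style: italic">Star Trek</span>
</p>
```

The `@style` properties are scaffolding: once their information has been captured in better form (or judged disposable), they can simply be dropped.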

Next to these, the fact that HTML also has an element-type semantics, albeit an impoverished one, is a further convenience:
```
<p class="listing">Gene Roddenberry's <span class="title.cited">Star Trek</span></p>
```