We might prefer (for one reason or another) to have any of these as a nicely-tagged encoding, for example:
|
|
|
|
|
- `<emphasis role="strong">Gene Roddenberry's <citetitle>Star Trek</citetitle></emphasis>` (DocBook)
|
|
|
- `<emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>` (TEI)
|
|
|
- `<b>Gene Roddenberry's <named-content content-type="title-cited">Star Trek</named-content></b>` (NISO JATS/BITS)
|
|
|
- `<b>Gene Roddenberry's <cite>Star Trek</cite></b>` (DITA)
|
|
|
|
|
|
What these all have in common, though each expresses it in a different way, is that they encode aspects and features of the text that identify them by 'kind' (hence 'generic' markup). As such, they are fit not only for archiving but for arbitrary reuse, aggregation into collections (where consistent tagging semantics can support querying), republishing in manifold formats, and so on.
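The payoff of that consistency can be sketched in a few lines. Once cited titles are tagged as such (here using the JATS-style `named-content` convention from the list above), retrieving them is a mechanical query rather than a guessing game:

```python
# Minimal sketch: querying generically-tagged content. The snippet reuses
# the JATS-style example from the list above.
import xml.etree.ElementTree as ET

snippet = ('<p><b>Gene Roddenberry\'s '
           '<named-content content-type="title-cited">Star Trek'
           '</named-content></b></p>')

root = ET.fromstring(snippet)
titles = [el.text for el in root.iter("named-content")
          if el.get("content-type") == "title-cited"]
print(titles)  # ['Star Trek']
```

The same one-line query would find every cited title in a whole collection, which is exactly what presentational tagging (bold, italics) cannot support.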
|
or indeed (what is just as likely):
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a metaphorical sense ('noise' as 'tag entropy'), and "promiscuous". (It turns out the same thing may be said in many different ways as well as repeatedly. Again, the snippet offered merely hints at the actual complexity and verbosity of WordML internals.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. (After all, that is what happens inside Word itself, and it isn't magic.) But before we can get even to that point, putting this back together may take some care.
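A toy illustration of the kind of "putting back together" this takes: Word routinely fragments a single formatted phrase into several adjacent runs carrying identical properties, and a cleanup pass has to merge them before anything else can be done. (This is a deliberately simplified model of runs as property/text pairs, not real WordML, which is far more verbose.)

```python
# Toy model of WordML run cleanup: merge adjacent runs whose formatting
# properties are identical. Real WordML runs (<w:r> with <w:rPr>) are far
# noisier; this only illustrates the principle.

def merge_runs(runs):
    """Merge adjacent (properties, text) pairs that share properties."""
    merged = []
    for props, text in runs:
        if merged and merged[-1][0] == props:
            # Same formatting as the previous run: concatenate the text.
            merged[-1] = (props, merged[-1][1] + text)
        else:
            merged.append((props, text))
    return merged

# Word might store one italic phrase as two (or more) fragments:
runs = [("", "Gene Roddenberry's "),
        ("i", "Star "),
        ("i", "Trek")]
print(merge_runs(runs))
# [('', "Gene Roddenberry's "), ('i', 'Star Trek')]
```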
|
|
|
|
|
|
Even setting aside the noise and redundancy, however (and there are ways of dealing with them), there is a more crucial problem: the information we want simply isn't there. Even in this tiny example, nothing indicates that the italicized string *Star Trek* is a 'title' or 'title-cited', as one or another of the descriptive encoding systems has it. This is a more formidable barrier: how to know this from that, how to apply the recommended tagging correctly to data that gives no explicit indication. (Not every italicized bit of text will be a "cited title", by any means.)
|
|
|
|
|
|
What is worse, we face this gap everywhere we turn. The problem is pervasive in WordML, which lacks explicit indicators of intellectual structure, of any "universe of things" (such as titles or citations), or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye (using our semantically aware markup vocabulary). What the Word document (.docx file) represents in operation is something entirely outside the .docx itself (regarded as a file or data set): namely, the printed artifact produced when you hit the "Print" button (or today, the PDF you produce with a Save As). And even this is not entirely true, or is at least further complicated today by the way Word also becomes a kind of environment in its own right, so that the .docx is never 'fully itself' except when it is actually open and running in the application. A .docx Save As is almost like a Save Game.
|
|
|
|
|
|
Yet we aren't actually interested in the Word document "fully itself", even as it is printed or displayed on screen -- but in the book, article, research paper or other "thing" hiding inside it. (That is, as readers we see into and infer from the print, to construe the logical forms behind it.) In a data conversion system, this is considered a species of "mapping problem" (or a problem of "semantic inference" or "semantic attribution"): we see italics, we recognize a title. But the next time we see italics it might be something else. How do we tell the difference?
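One crude way to frame this in code: any mapping rule that promotes italics to something semantic has to consult context, and will still sometimes be wrong. The heuristics and the "authority list" below are invented purely for illustration; a real conversion pipeline needs many such rules, and human review besides.

```python
# A deliberately crude sketch of "semantic inference": formatting alone
# (italics) underdetermines meaning, so classification must look at
# context. All rules and names here are hypothetical illustrations.
import re

KNOWN_TITLES = {"Star Trek"}  # hypothetical authority list of cited works

def classify_italic(text, preceding):
    """Guess what an italic run 'means' from its textual context."""
    if text in KNOWN_TITLES:
        return "title-cited"
    if re.search(r"\b(so-called|term|word)\s*$", preceding):
        return "term"                # e.g. 'the so-called *markup*'
    return "emphasis"                # fallback: plain emphasis

print(classify_italic("Star Trek", "Gene Roddenberry's "))  # title-cited
print(classify_italic("very", "It matters "))               # emphasis
```

The point is not that such rules work reliably (they do not), but that *some* inference of this kind is unavoidable once the source data carries only presentational signals.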
|
|
|
|
|
|
### Ascending step by step instead of all at once
|
|
|
|