We might prefer (for one reason or another) to have any of these as a nicely-tagged encoding such as:
|
|
- JATS `<bold>Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></bold>`
|
|
|
- DITA `<b>Gene Roddenberry's <cite>Star Trek</cite></b>`
|
|
|
|
|
|
|
|
|
Basically, what these all have in common, albeit each one expresses it in a different way, is that they encode aspects and features of the text that identify them by 'kind' (hence 'generic' markup). As such, they are fit not only for archiving but for arbitrary reuse, aggregation into collections (where consistent tagging semantics can support querying), republishing in manifold formats, and so on.
|
|
|
|
|
|
It is reasonable to stipulate any or all of these as worthwhile end points only because we know of real systems that use all of them. The question here is not whether to aim for this, but how -- especially since, despite their variations, they all have one thing in common: namely, how far they are from what is going to be discovered inside a Word document, such as:
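By way of illustration, here is a sketch of roughly what Word records for the running example. The element names (`w:p`, `w:r`, `w:rPr`, `w:b`, `w:i`, `w:t`) are standard WordprocessingML; a real file carries far more properties and boilerplate than this, so treat it as a reconstruction, not actual output:

```
<w:p>
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t xml:space="preserve">Gene Roddenberry's </w:t>
  </w:r>
  <w:r>
    <w:rPr><w:b/><w:i/></w:rPr>
    <w:t>Star Trek</w:t>
  </w:r>
</w:p>
```

Note how the formatting is attached property by property to each run, with the bolding redundantly restated on the second run rather than expressed by nesting.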
|
|
|
|
On the other hand, we aren't actually interested in the Word document "fully itself".
|
|
|
|
|
### Ascending step by step instead of all at once
|
|
|
|
|
|
|
|
|
Similarly, inferring more complex kinds of structure -- especially more "semantic" structures such as figures-with-captions, tables, pull quotes and what have you -- will present challenges, not just because the data is complex, messy and redundant, but also because those things (at least as such) are never given, only implied. As writers, readers and editors we find them obvious enough (at least, we know how to interpret the cues we see), but at the level of the encoding these configurations are rarely the same in any two Word documents. Indeed we may need to see them to really know what they look like in any given case.
|
|
|
|
|
|
The solution here is to show ourselves the same information, but in a legible and tractable form: that is, to translate the Word document into a form we can read, without trying to translate it all the way out of the language it uses to communicate what it says (whether to a printer, a PDF generator or a human reader) when it puts something in italics. In other words, our first task is to *extract* the data into a format that removes and reduces all the redundancy, distilling the "warrants" and claims of the Word document ("this bit is italic, that bit is styled 'Header.2'") into a form we can read and work with.
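To make "distilling" concrete, here is a toy sketch of such an extraction pass (not the actual extractor discussed here). The element names are real WordprocessingML, but this toy handles only bold and italic run properties; a real extractor would also resolve styles, themes and much else:

```
# Toy "extraction" pass: reduce a WordprocessingML paragraph to a flat
# list of (text, formats) claims -- "this bit is bold", "that bit is
# bold+italic" -- discarding everything else Word records.
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def distill(paragraph_xml):
    """Return a (text, formats) pair for each run in the paragraph."""
    para = ET.fromstring(paragraph_xml)
    runs = []
    for run in para.iter(W + "r"):
        props = run.find(W + "rPr")
        formats = []
        if props is not None:
            if props.find(W + "b") is not None:
                formats.append("b")
            if props.find(W + "i") is not None:
                formats.append("i")
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        runs.append((text, formats))
    return runs

sample = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:rPr><w:b/></w:rPr>'
    '<w:t xml:space="preserve">Gene Roddenberry\'s </w:t></w:r>'
    '<w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>Star Trek</w:t></w:r>'
    '</w:p>'
)

print(distill(sample))
# [("Gene Roddenberry's ", ['b']), ('Star Trek', ['b', 'i'])]
```

The point of the sketch is only that the redundant, run-by-run property soup collapses into a handful of legible claims, which can then be serialized in whatever reduced format we choose.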
|
|
|
|
So far so good -- but what will that format actually be? It is not hard to envision several candidates:
|
|
- TEI `<hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>`
|
|
|
- made-for-purpose `<run format="b">Gene Roddenberry's <run format="i">Star Trek</run></run>`
|
|
|
|
|
|
|
|
|
These are all more or less the same, or at any rate semantically equivalent, inasmuch as any one of them could be mapped onto any of the others. Note, too, that none of them offers quite the level of semantic richness of the TEI and JATS examples above (which tell us, for example, that 'Star Trek' is a title, not merely italicized). This is merely what is called *presentational* markup. Yet maybe it is enough for a first step.
|
|
|
|
|
|
It turns out that among these, the choice is fairly clear. Only a made-for-purpose language designed specifically to expose this information (the 'made for purpose' tagging shown) even comes close. Indeed, HTML5 turns out to be the clear winner among document formats as an initial target for a Word extractor. (Note *initial* target: we say nothing of what we might improve this into eventually.)
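For the running example, that initial HTML5 target could be as simple as the following (a sketch of plausible output, not the actual extractor's):

```
<p><b>Gene Roddenberry's <i>Star Trek</i></b></p>
```

Still purely presentational, but already legible, non-redundant, and viewable in any browser without further processing.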
|
|
|
|
Next to these advantages, the fact that HTML also has an element-type semantics, albeit an informal one, is a further point in its favor: its `class` attribute gives us a place to record richer semantics when we can infer them, as in:
|
|
```
|
|
|
<p class="listing">Gene Roddenberry's <span class="title.cited">Star Trek</span></p>
|
|
|
```
|
|
|
|