... | ... | @@ -130,22 +130,26 @@ The way we approach discovering the outlines of this form is, first, by represen |
|
|
|
|
|
So far so good - but what will that format actually be? It is not hard to envision what it would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, especially idioms related to inline markup. And instead of using the obscure and impenetrable Word vocabulary, it would use either a standard or a made-for-purpose vocabulary, maybe looking something like this:
|
|
|
|
|
|
- HTML <b>Gene Roddenberry's <i>Star Trek</i></b>
|
|
|
- DITA <b>Gene Roddenberry's <i>Star Trek</i></b>
|
|
|
- JATS <bold>Gene Roddenberry's <italic>Star Trek</italic></bold>
|
|
|
- Docbook <emphasis role="bold">Gene Roddenberry's <emphasis>Star Trek</emphasis></emphasis>
|
|
|
- TEI <hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>
|
|
|
- made-for-purpose <run format="b">Gene Roddenberry's <run format="i">Star Trek</run></run>
|
|
|
- HTML `<b>Gene Roddenberry's <i>Star Trek</i></b>`
|
|
|
- DITA `<b>Gene Roddenberry's <i>Star Trek</i></b>`
|
|
|
- JATS `<bold>Gene Roddenberry's <italic>Star Trek</italic></bold>`
|
|
|
- DocBook `<emphasis role="bold">Gene Roddenberry's <emphasis>Star Trek</emphasis></emphasis>`
|
|
|
- TEI `<hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>`
|
|
|
- made-for-purpose `<run format="b">Gene Roddenberry's <run format="i">Star Trek</run></run>`
|
|
|
|
|
|
These are all more or less the same, or at any rate semantically equivalent, inasmuch as any one of them could be mapped to any of the others. Also note that none of them presents quite the level of semantic richness of the TEI and JATS examples above (which tell us, for example, that 'Star Trek' is a title, not merely that it is italicized). This is merely what is called *presentational* markup. Yet maybe this is enough!
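To make the equivalence claim concrete, here is a minimal sketch of such a mapping, using only Python's standard library. The `VOCABULARIES` table and the `convert` and `render` helpers are hypothetical names invented for this illustration, and only the `b` and `i` elements from the list above are covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables: HTML inline element -> (target name, attributes).
VOCABULARIES = {
    "JATS": {"b": ("bold", {}), "i": ("italic", {})},
    "TEI": {"b": ("hi", {"rend": "bold"}), "i": ("hi", {"rend": "italic"})},
}

def convert(elem, mapping):
    """Rebuild elem recursively, renaming any element found in mapping."""
    # Elements not in the mapping keep their own name and attributes.
    tag, attrs = mapping.get(elem.tag, (elem.tag, elem.attrib))
    out = ET.Element(tag, dict(attrs))
    out.text, out.tail = elem.text, elem.tail
    for child in elem:
        out.append(convert(child, mapping))
    return out

def render(html, vocabulary):
    """Parse a well-formed HTML fragment and serialize it in the named vocabulary."""
    root = convert(ET.fromstring(html), VOCABULARIES[vocabulary])
    return ET.tostring(root, encoding="unicode")

html = "<b>Gene Roddenberry's <i>Star Trek</i></b>"
print(render(html, "JATS"))
print(render(html, "TEI"))
```

Because the mapping is a plain rename (plus fixed attributes), nothing is gained or lost in either direction, which is what "semantically equivalent" amounts to here.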
|
|
|
|
|
|
It turns out that among these the choice is fairly clear: HTML5 is the clear winner among document formats as an initial target for a Word Extractor. (Note *initial* format - we say nothing of what we might improve this into eventually.) That is, we would like to say
|
|
|
|
|
|
```
|
|
|
<b>Gene Roddenberry's <i>Star Trek</i></b>
|
|
|
```
|
|
|
|
|
|
Or even (in the rare case when someone was at the other end of the Word document)
|
|
|
|
|
|
```
|
|
|
<b>Gene Roddenberry's <span class="title.cited">Star Trek</span></b>
|
|
|
```
|
|
|
|
|
|
if the Word user had assigned a character style "title.cited" to this range of text.
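The choice between the two forms above can be sketched as a small rule: a named Word style, when present, wins over bare italics. This is only an illustration under that assumption; the `inline` function and its parameters are hypothetical and not part of any extractor described here.

```python
from html import escape

def inline(text, italic=False, style_name=None):
    """Wrap a range of text in HTML inline markup.

    Assumption for this sketch: a named Word style, when present, is more
    informative than bare italics, so it is exposed as a class on a span.
    """
    body = escape(text, quote=False)
    if style_name:
        return f'<span class="{escape(style_name)}">{body}</span>'
    if italic:
        return f"<i>{body}</i>"
    return body

# The same range of text, without and with a named style:
print("<b>Gene Roddenberry's " + inline("Star Trek", italic=True) + "</b>")
print("<b>Gene Roddenberry's " + inline("Star Trek", italic=True, style_name="title.cited") + "</b>")
```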
|
|
|
|
... | ... | @@ -155,13 +159,17 @@ Why does HTML make a good target vocabulary? |
|
|
(b) it has these fantastic escape hatches!
|
|
|
(c) did I mention escape hatches? One of them (@style) escapes into CSS, while the other (@class) can expose Word styles!
|
|
|
(d) yet at the same time, HTML semantics are not so rich as to be very arguable (anything will do)
|
|
|
(e) to top it off, HTML is a well-known vernacular and we are expecting to edit on an HTML platform, so why not?
|
|
|
(e) to top it off, HTML is a well-known vernacular;
|
|
|
(f) a custom vocabulary would have to be designed, tested, documented and learned; HTML lets us just fake it;
|
|
|
(g) and since we expect to edit (at least initially) on an HTML platform, and to go from there when it comes to other formats for interchange or archiving, we can just stick with that plan.
|
|
|
|
|
|
Note the non-canonical and arguably deprecated heavy use of @style - we justify this on the grounds that we are going *uphill* and *by the time we reach the top* we can *cast these properties aside as nothing more than the engine that has got us there*.
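As a rough sketch of that idea: formatting properties can be carried in @style on the way up and stripped once they are no longer needed. The helper names `add_style` and `strip_styles` and the sample properties are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

def add_style(elem, props):
    """Record formatting properties on elem as a CSS style attribute."""
    elem.set("style", "; ".join(f"{name}: {value}" for name, value in props.items()))
    return elem

def strip_styles(root):
    """Cast the style attributes aside once they have done their work."""
    for elem in root.iter():
        elem.attrib.pop("style", None)
    return root

p = ET.fromstring("<p>Gene Roddenberry's <span>Star Trek</span></p>")
add_style(p.find("span"), {"font-style": "italic", "font-size": "14pt"})
with_style = ET.tostring(p, encoding="unicode")
without_style = ET.tostring(strip_styles(p), encoding="unicode")
print(with_style)
print(without_style)
```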
|
|
|
|
|
|
Next to these, the fact that HTML also has element-type semantics, albeit impoverished ones (p, ul, li, tables), is useful but not essential, as long as we have generic "hangers" we can use such as div, p and span.
|
|
|
|
|
|
<b>Gene Roddenberry's <i>Star Trek</i></b>
|
|
|
```
|
|
|
<p class="listing">Gene Roddenberry's <span class="title.cited">Star Trek</span></p>
|
|
|
```
|
|
|
|
|
|
|
|
|
|
... | ... | |