Nor is this exactly because the work hasn't been done: indeed, solutions exist. But the very terms of their success also show the problem.
The available solutions, both proprietary and open source, all constrain themselves to handle only a subset of Word documents, while at the same time more or less requiring some level of customization at the level of the document set or even the individual document. The required customizations can take the form of tool development, of handwork on the documents themselves, or both. This requires a level of expert assistance that often makes the work prohibitively difficult.
Not being able to provide a 100% solution, however, does not make an 80% solution -- if it really is that -- less useful or less valuable. On the contrary. And while we may not be able to do away with the need for expert assistance for best results, we do believe we may be able to advance the state of play.
The best way we think we can do this is at the level of the architecture. We must design from the outset with the idea that everything in XSweet should be designed for adaptation and reuse. XSweet should work like a black box, but you should also be able to open it up and rewire it -- completely, if need be.

WORDML IS NOT WHAT (PRACTITIONERS CALL) GENERIC MARKUP

What follows is a microcosmic view of the problem, reduced to the barest possible example. In reality it is like this, but greatly magnified in scale and complexity.
The problem - take the following line:
`<b>Gene Roddenberry's <i>Star Trek</i></b>`
We might prefer (for one reason or another) to have any of these --
- Docbook `<emphasis role="strong">Gene Roddenberry's <citation>Star Trek</citation></emphasis>`
- TEI `<emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>`
- JATS `<named-content content-type="styled.bold">Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></named-content>`
All of them are a very far cry from what the WordML will have, something like this:
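(The following is an illustrative, hand-made sketch rather than verbatim Word output -- real WordML carries namespace declarations and many more attributes -- but it gives the flavor.)

```xml
<w:p>
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t xml:space="preserve">Gene Roddenberry's </w:t>
  </w:r>
  <w:r>
    <w:rPr><w:b/><w:i/></w:rPr>
    <w:t>Star Trek</w:t>
  </w:r>
</w:p>
```

or (what is actually much more likely) something more like this, with the same formatting declared redundantly, run boundaries falling in arbitrary places, and `w:rsid*` attributes (revision-save identifiers; the values here are invented) recording editing history:

```xml
<w:p>
  <w:r w:rsidR="00B45F2D">
    <w:rPr><w:b/></w:rPr>
    <w:t>Gene Rodden</w:t>
  </w:r>
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t xml:space="preserve">berry's </w:t>
  </w:r>
  <w:r w:rsidRPr="00B45F2D">
    <w:rPr><w:b/><w:i/></w:rPr>
    <w:t xml:space="preserve">Star </w:t>
  </w:r>
  <w:r>
    <w:rPr><w:b/><w:i/></w:rPr>
    <w:t>Trek</w:t>
  </w:r>
</w:p>
```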
(If you squint you can see that Word is preserving a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism.)
But, as a Wikipedia author puts it, "In modern word-processing systems, presentational markup is often saved in descriptive-markup-oriented systems such as XML, and then processed procedurally by implementations."
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant -- "noisy" in a formal sense (as a measure of tag entropy would show) -- and 'promiscuous': the same thing may be said in many different ways, as well as repeatedly. The syntax is awful.
More importantly, it lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye. What the Word document (.docx) represents in operation is something entirely outside the .docx itself, namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with "Save As"). And even this is not entirely true, or is at least further complicated by the way in which Word also becomes a kind of environment in its own right, so that the .docx is never 'fully itself' except when it is in MS Word itself.
On the other hand, we aren't actually interested in the Word document "fully itself", but in the book, article, research paper or other "thing" hiding inside it.
In this microcosmic sample, this is apparent from close scrutiny of the *structure* of the outputs compared to the inputs. The clean markup samples all have the structure (bold (italic)), with the 'italic' span nested inside the bold span. In contrast, the Word document shows (bold)(bold italic): two spans next to each other.
There is no absolute rule that says this bit of text "is" one or the other: our problem is only that we need to get from one representation to the next.
However, by getting to any of them, we hope to be able to get to all of them -- although this is only the beginning of the "impedance mismatch" between Word and the conceivable target formats. Indeed, this particular transposition in the treatment of paragraph contents is one that can be handled with a general transformation, so it is "only painful". In other words, it is the kind of pain we can make go away with our black box, which has a component that performs this transposition for us.
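The core of that transposition -- turning Word's flat sequence of formatted runs into nested inline spans -- can be sketched in a few lines. This is purely an illustration of the logic, not XSweet's actual code (the names `runs_to_nested` and the format codes are invented for this sketch):

```python
def runs_to_nested(runs, order=("b", "i")):
    """Turn a flat list of (text, set-of-format-codes) runs into
    nested inline markup, opening and closing tags as the active
    format set changes from run to run."""
    out = []
    open_stack = []  # formats currently open, outermost first
    for text, fmts in runs:
        # Close open formats no longer wanted here, innermost first.
        while open_stack and not set(open_stack).issubset(fmts):
            out.append(f"</{open_stack.pop()}>")
        # Open newly required formats, in a fixed nesting order.
        for f in order:
            if f in fmts and f not in open_stack:
                out.append(f"<{f}>")
                open_stack.append(f)
        out.append(text)
    while open_stack:
        out.append(f"</{open_stack.pop()}>")
    return "".join(out)

# Word's (bold)(bold italic) pair of runs...
runs = [("Gene Roddenberry's ", {"b"}), ("Star Trek", {"b", "i"})]
print(runs_to_nested(runs))
# -> <b>Gene Roddenberry's <i>Star Trek</i></b>
```

The fixed nesting `order` is the design decision that makes the output deterministic: bold always wraps italic, never the reverse.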
We expect more complex kinds of structural inference -- especially of more "semantic" structures such as figures-with-captions, tables, pull quotes and what have you -- to be more difficult, since they are never the same in the wild. Indeed, we need to see them to really know what they look like. But here, agile development processes and an active feedback loop will help. The key is to be attuned at all times to the particular formal aspects of the work at hand.
And this is why it needs to be a team activity: because the outlines of that form, in all its implications, will be critical to giving the text its affordances in the electronic environment. Everything about the text in production will spill out from this formalizing activity.
The way we approach discovering the outlines of this form is, first, by representing the Word data as itself, with no enhancement or improvement. Only if we do *as little as possible* in changing the representation of the data -- from Word's own obscure structures into something relatively more legible and tractable -- will we have a process that is as transparent as we need it to be.
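If the pipeline is built in XSLT (as transformation pipelines of this kind typically are), "doing as little as possible" has a well-known expression: the identity transform, which copies everything through untouched, overridden only where something must change. A minimal sketch -- not XSweet's actual code, and the `w:p` override is just one hypothetical example:

```xslt
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <!-- Identity: copy every node and attribute through unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Override only what must change, e.g. renaming Word paragraphs. -->
  <xsl:template match="w:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

Every departure from the source is then an explicit, inspectable override -- which is exactly the transparency the process needs.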
So far so good - but what will that format actually be? It is not hard to envision what it would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, most especially idioms related to inline markup. But instead of using the obscure and impenetrable Word vocabulary, it would use either a standard or a made-for-purpose vocabulary, maybe looking something like this:
- HTML `<b>Gene Roddenberry's <i>Star Trek</i></b>`
- JATS `<bold>Gene Roddenberry's <italic>Star Trek</italic></bold>`
- Docbook `<emphasis role="bold">Gene Roddenberry's <emphasis>Star Trek</emphasis></emphasis>`
- TEI `<hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>`
- made-for-purpose `<run format="b">Gene Roddenberry's <run format="i">Star Trek</run></run>`
These are all more or less the same, or at any rate semantically equivalent, inasmuch as any one of them could be mapped to write any of the others. Also note that none of them offers quite the semantic richness of the Docbook, TEI and JATS examples given earlier (which tell us, for example, that 'Star Trek' is a title, not only italicized). This is merely what is called *presentational* markup. Yet maybe this is enough!
It turns out that among these the choice is fairly clear: HTML5 is the winner among document formats as an initial target for a Word extractor. (Note *initial* target - we say nothing of what we might improve this into eventually.) That is, we would like to say