... | ... | @@ -10,7 +10,7 @@ Works authored in Word are too variable in form and purpose, while at the same t |
|
|
|
|
|
Nor is this exactly because the work hasn't been done: indeed, solutions exist. But the very terms of their success also show the problem.
|
|
|
|
|
|
The available solutions, both proprietary and open source, all constrain themselves to handle only a subset of Word documents, while at the same time more or less requiring some level of customization at the level of the document set or even individual document. The required customizations can take the form either of tool development, or handwork on the documents themselves, or both. This typically requires a level of expert assistance that often makes the work prohibitively difficult.
|
|
|
The available solutions, both proprietary and open source, all limit themselves to a subset of Word documents, and even within that subset they usually require some customization, whether for a further subset or for an individual document. The required customizations can take the form of tool development, handwork on the documents themselves, or both. Either typically calls for a level of expert assistance that often makes the work prohibitively difficult.
|
|
|
|
|
|
Not being able to provide a 100% solution, however, does not make an 80% solution -- if it really is that -- less useful or less valuable. On the contrary. And while we may not be able to do away with the need for expert assistance for best results, we do believe we may be able to advance the state of play.
|
|
|
|
... | ... | @@ -24,21 +24,24 @@ In turn, this 'break-'em-up' approach to the problem domain will correspond to t |
|
|
|
|
|
If we break up the problem into detail, what are the pieces?
|
|
|
|
|
|
Can we prioritize them based on which of them are more universal / ubiquitous, vs which show the most variation (across documents, document types, publication types and workflows?)
|
|
|
Can we prioritize them based on which are more universal or ubiquitous, versus which show the most variation (across documents, document types, publication types and workflows) and will therefore be most problematic and peculiar?
|
|
|
|
|
|
- Main text including inline features such as bold, italics (and their warrants?)
|
|
|
- Footnotes, endnotes and textual apparatus with their cross-references
|
|
|
- 'Textual objects' including figures, tables, structured lists
|
|
|
(as represented typographically and by other means) with their cross-references
|
|
|
- Internal superstructure (parts, chapters, sections etc.)
|
|
|
- Internal superstructure (parts, chapters, sections etc.) - determines scope(s) of reference(s) - w/ ToC
|
|
|
- Bibliography / citations
|
|
|
- Specialized objects: math, formulae, drawings
|
|
|
- Specialized indexes
|
|
|
|
|
|
Low end of this can be addressed as "sloppy HTML" or HTML slops (messy, but nutritious)
|
|
|
The beauty of listing them in order is that we can see that at least the low end can be addressed as "sloppy HTML" or HTML slops (messy, but nutritious), while even the high end could be addressed using a carefully defined and validated profile of HTML or (better) HTML5 (because of `section` etc.).
|
|
|
|
|
|
WORDML IS NOT WHAT (PRACTITIONERS CALL) GENERIC MARKUP
|
|
|
This is because WORDML IS NOT WHAT (PRACTITIONERS CALL) GENERIC MARKUP
|
|
|
|
|
|
This is a microcosmic view of the problem, reduced to the barest possible example. In reality it's like this, but greatly magnified in scale and complexity.
|
|
|
-----
|
|
|
|
|
|
Here is a microcosmic view of the problem, reduced to the barest possible example. The rest of the problem is much like this, only greatly magnified in scale and complexity.
|
|
|
|
|
|
The problem - take the following line:
|
|
|
|
... | ... | @@ -48,7 +51,8 @@ We might prefer (for one reason or another) to have any of these -- |
|
|
|
|
|
- Docbook `<emphasis role="strong">Gene Roddenberry's <citetitle>Star Trek</citetitle></emphasis>`
|
|
|
- TEI `<emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>`
|
|
|
- JATS `<named-content content-type="styled.bold">Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></named-content>`
|
|
|
- JATS `<named-content content-type="styled.bold">Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></named-content>`
|
|
|
- DITA `<b>Gene Roddenberry's <cite>Star Trek</cite></b>`
|
|
|
|
|
|
|
|
|
All of them are a very far cry from what the WordML will have, something like this:
|
... | ... | @@ -106,13 +110,13 @@ or (what is actually much more likely): |
|
|
|
|
|
(If you squint you can see that Word is preserving a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism.)
|
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a formal sense (measure of tag entropy), and 'promiscuous'. (The same thing may be said in many different ways as well as repeatedly.) The syntax is awful.
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a formal sense (as a measure of tag entropy), and 'promiscuous': the same thing may be said in many different ways, as well as repeatedly. The syntax is awful. It is not hard to imagine how the data here could map into tractable objects in the right kind of programming environment. But XML really isn't it. Putting this back together may take some care.
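As a rough illustration of what "mapping into tractable objects" might look like, here is a minimal sketch in Python. The WordML fragment is a hand-simplified, hypothetical stand-in for real Word output, and the `Run` class and the `is_on` and `read_runs` helpers are names invented for this example. It reads only direct bold and italic properties on each run and ignores styles, inheritance and everything else Word can throw at us.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# The WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, hand-simplified paragraph; real Word output is far noisier.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Gene Roddenberry's </w:t></w:r>
  <w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>Star Trek</w:t></w:r>
</w:p>
"""


@dataclass(frozen=True)
class Run:
    """One run of text with the two formatting flags we care about here."""
    text: str
    bold: bool
    italic: bool


def is_on(rpr, tag):
    """True if the toggle property is present and not explicitly switched off."""
    if rpr is None:
        return False
    el = rpr.find(f"w:{tag}", NS)
    if el is None:
        return False
    return el.get(f"{{{W}}}val", "true") not in ("false", "0", "off")


def read_runs(paragraph_xml):
    """Turn the w:r children of one w:p element into a flat list of Run objects."""
    para = ET.fromstring(paragraph_xml)
    runs = []
    for r in para.findall("w:r", NS):
        rpr = r.find("w:rPr", NS)
        text = "".join(t.text or "" for t in r.findall("w:t", NS))
        runs.append(Run(text, is_on(rpr, "b"), is_on(rpr, "i")))
    return runs


runs = read_runs(SAMPLE)
for run in runs:
    print(run)
```

Once the data is in this shape, the noise of the source syntax is out of the way, and what remains is a flat sequence of runs whose structure still has to be worked out.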
|
|
|
|
|
|
More importantly, it lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye. What the Word document (.docx) represents in operation is something entirely outside the .docx itself, namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). Even this is not entirely true, or is at least further complicated by the way Word also becomes a kind of environment in its own right, so that the .docx is never 'fully itself' except when it is open in MS Word.
|
|
|
More importantly, WordML lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye. What the Word document (.docx) represents in operation is something entirely outside the .docx itself, namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). Even this is not entirely true, or is at least further complicated by the way Word also becomes a kind of environment in its own right, so that the .docx is never 'fully itself' except when it is open in MS Word.
|
|
|
|
|
|
On the other hand, we aren't actually interested in the Word document "fully itself", but in the book, article, research paper or other "thing" hiding inside it.
|
|
|
|
|
|
In this microcosmic sample, this is apparent on close scrutiny of the *structure* of the outputs compared to the inputs. The clean markup samples all have the structure (bold (italic)), where the 'italic' span is nested inside the bold span. In contrast, the Word document shows (bold)(bold italic): two spans next to each other.
|
|
|
Even in this little sample, a case of this is evident on close scrutiny of the *structure* of the outputs compared to the inputs. The clean markup samples all have the structure (bold (italic)), where the 'italic' span is nested inside the bold span. In contrast, the Word document shows (bold)(bold italic): two spans next to each other.
|
|
|
|
|
|
There is no absolute rule that says this bit of text "is" one or the other: our problem is only that we need to get from one representation to the next.
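Getting from (bold)(bold italic) to (bold (italic)) can be sketched in a few lines of Python. This is only one possible approach, under a loud assumption: that we fix a nesting order in advance (bold outside italic), which is a choice, not a rule. The `nest` function and the `(text, formats)` run representation are invented for this illustration.

```python
from itertools import groupby


def nest(runs, order=("bold", "italic")):
    """Regroup flat runs into nested spans.

    runs:  list of (text, frozenset_of_format_names), in document order.
    order: assumed nesting order, outermost format first.
    Returns a list whose items are plain strings or (format, children) nodes.
    """
    if not order:
        # No formats left to factor out: what remains is plain text.
        text = "".join(t for t, _ in runs)
        return [text] if text else []
    head, rest = order[0], order[1:]
    out = []
    # Split into maximal stretches of neighbours that agree on `head`.
    for on, chunk in groupby(runs, key=lambda run: head in run[1]):
        children = nest(list(chunk), rest)
        if on:
            out.append((head, children))  # wrap the whole stretch once
        else:
            out.extend(children)          # no wrapper needed
    return out


# The Word-style flat view: two adjacent spans, (bold)(bold italic).
flat = [
    ("Gene Roddenberry's ", frozenset({"bold"})),
    ("Star Trek", frozenset({"bold", "italic"})),
]

tree = nest(flat)
print(tree)
```

Reversing `order` produces a different and equally defensible tree for other inputs, which is the point: the nesting is something we impose on the data, not something we read off it.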
|
|
|
|
... | ... | @@ -127,6 +131,7 @@ The way we approach discovering the outlines of this form is, first, by represen |
|
|
So far so good - but what will that format actually be? It is not hard to envision what it would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, most especially idioms related to inline markup. But instead of using the obscure and impenetrable Word vocabulary, it would use either a standard or made-for-purpose vocabulary, maybe looking something like this:
|
|
|
|
|
|
- HTML `<b>Gene Roddenberry's <i>Star Trek</i></b>`
|
|
|
- DITA `<b>Gene Roddenberry's <i>Star Trek</i></b>`
|
|
|
- JATS `<bold>Gene Roddenberry's <italic>Star Trek</italic></bold>`
|
|
|
- Docbook `<emphasis role="bold">Gene Roddenberry's <emphasis>Star Trek</emphasis></emphasis>`
|
|
|
- TEI `<hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>`
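Because all of these share one nested shape, the choice of vocabulary can be reduced to a lookup table. The Python sketch below assumes a hypothetical in-memory tree of strings and `(format, children)` nodes; the `VOCAB` table and the `serialize` function are invented for this example, and the table covers only the bold and italic mappings listed above.

```python
from xml.sax.saxutils import escape, quoteattr

# vocabulary -> format name -> (element name, attributes)
VOCAB = {
    "HTML":    {"bold": ("b", {}), "italic": ("i", {})},
    "DITA":    {"bold": ("b", {}), "italic": ("i", {})},
    "JATS":    {"bold": ("bold", {}), "italic": ("italic", {})},
    "Docbook": {"bold": ("emphasis", {"role": "bold"}), "italic": ("emphasis", {})},
    "TEI":     {"bold": ("hi", {"rend": "bold"}), "italic": ("hi", {"rend": "italic"})},
}


def serialize(nodes, vocab):
    """Write a nested tree of strings and (format, children) nodes as markup."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(escape(node))  # escapes &, < and > in text
        else:
            name, children = node
            tag, attrs = VOCAB[vocab][name]
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            parts.append(f"<{tag}{attr_text}>{serialize(children, vocab)}</{tag}>")
    return "".join(parts)


# The (bold (italic)) structure, written out by hand.
tree = [("bold", ["Gene Roddenberry's ", ("italic", ["Star Trek"])])]

for vocab in VOCAB:
    print(vocab, serialize(tree, vocab))
```

The tree is written once and only the table varies, which suggests that the hard part of the problem is arriving at the nested structure, not choosing the tag names.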
|