|
# DOCX -> XML: The XSweet Approach
|
|
|
|
|
|
|
|
It's a hard problem: converting or extracting data encoded in Microsoft Word (`.docx`) format and pulling it into publication workflows. What does XSweet do differently from other tools and methods addressing the same problem?
|
|
|
|
|
|
|
|
We start with a couple of observations and assumptions --
|
|
|
|
|
|
|
|
Having a 'black box' solution to get data from MS Word into publishing workflows - even one that works well - is a necessary but not a sufficient condition for what will come next. It is difficult to make progress in developing these workflows without such a tool. Yet even when we have one that works tolerably well, it will not be enough: we can expect always to have to augment it.
|
|
|
|
|
|
|
|
Works authored in Word are too variable in form and purpose; at the same time, for some works, the creative process does not always (or even usually) end when the book goes to the designer -- or even, today, to the printer, or whatever the contemporary equivalents are. A one-size-fits-all (or even fits-most) solution would probably exist by now (much as HTML Tidy, cURL and other open source utilities exist for various common or ubiquitous tasks) if there were a clean, dependable and reliable way to get any kind of clean markup out of Word. Despite a couple of seeming near-misses, we don't have such a tool.
|
|
|
|
|
|
|
|
Nor is this exactly because the work hasn't been done: indeed, solutions exist. But the very terms of their success also show the problem.
|
|
|
|
|
|
|
|
The available solutions, both proprietary and open source, all constrain themselves to handling only a subset of Word documents, while at the same time more or less requiring some level of customization at the level of the document set, or even of the individual document. "Customization" can take the form of tool development, of handwork on the documents themselves, or both. This requires a level of expert assistance that often makes the work prohibitive.
|
|
|
|
|
|
|
|
Not being able to provide a 100% solution, however, does not make an 80% solution -- if it really is that -- less useful or less valuable. On the contrary. And while we may not be able to do away with the need for expert assistance for best results -- we do believe we may be able to advance the state of play.
|
|
|
|
|
|
|
|
The best way we think we can do this is at the level of the architecture. We must design from the outset with the idea that everything in XSweet should be designed for adaptation and reuse. XSweet should work like a black box but you should also be able to open it up and rewire it -- completely, if need be.
|
|
|
|
|
|
|
|
This means: we aim (first) for *useful*, not (yet) *complete* or *perfect*.
|
|
|
|
|
|
|
|
XSweet does not have to try to solve the entire problem at once. Instead, we break the problem into chunks and pieces, and address them serially, "in detail".
|
|
|
|
|
|
|
|
In turn, this 'break-'em-up' approach to the problem domain will correspond to the mix-and-match organization of our solution framework.
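
This mix-and-match organization can be sketched as a pipeline of small, single-purpose steps, any of which can be removed, replaced or reordered. A toy illustration in Python -- the function names and the two steps here are hypothetical stand-ins, not XSweet's actual components:

```python
from functools import reduce

def pipeline(*steps):
    """Compose small single-purpose transformation steps into one function.
    Any step can be swapped out or rewired -- the 'openable black box'."""
    return lambda doc: reduce(lambda d, step: step(d), steps, doc)

# Two toy steps standing in for real extraction passes:
def strip_whitespace(doc):
    # Trim leading/trailing whitespace from the whole document.
    return doc.strip()

def uppercase_headings(doc):
    # Upper-case any markdown-style heading line.
    return "\n".join(
        line.upper() if line.startswith("# ") else line
        for line in doc.splitlines()
    )

extract = pipeline(strip_whitespace, uppercase_headings)
```

The point of the design is the `pipeline(...)` call: the sequence is data, so a workflow that needs a different mix simply composes a different sequence.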
|
|
|
|
|
|
|
|
If we break up the problem into detail, what are the pieces?
|
|
|
|
|
|
|
|
Can we prioritize them based on which of them are more universal / ubiquitous, vs which show the most variation (across documents, document types, publication types and workflows?)
|
|
|
|
|
|
|
|
- Main text including inline features such as bold, italics (and their warrants?)
|
|
|
|
- Footnotes, endnotes and textual apparatus with their cross-references
|
|
|
|
- 'Textual objects' including figures, tables and structured lists (as represented typographically and by other means), with their cross-references
|
|
|
|
- Internal superstructure (parts, chapters, sections etc.)
|
|
|
|
- Bibliography / citations
|
|
|
|
- Specialized objects: math, formulae, drawings
|
|
|
|
|
|
|
|
The low end of this range can be addressed as "sloppy HTML" -- HTML slops (messy, but nutritious).
|
|
|
|
|
|
|
|
## WordML is not what practitioners call generic markup
|
|
|
|
|
|
|
|
This is a microcosmic view of the problem, reduced to the barest possible example. In reality it's like this, but greatly magnified in scale and complexity.
|
|
|
|
|
|
|
|
## The problem
|
|
|
|
|
|
|
|
We might prefer to have any of these --
|
|
|
|
|
|
|
|
```
Docbook  <emphasis role="strong">Gene Roddenberry's <citation>Star Trek</citation></emphasis>

TEI      <emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>

JATS     <named-content content-type="styled.bold">Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></named-content>
```
|
|
|
|
|
|
|
|
|
|
|
|
All of them are a very far cry from what the WordML will actually contain -- something like this:
|
|
|
|
|
|
|
|
WordML
|
|
|
|
```
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
  </w:rPr>
  <w:t>Gene Roddenberry's </w:t>
</w:r>
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
    <w:i/>
  </w:rPr>
  <w:t>Star Trek</w:t>
</w:r>
```
|
|
|
|
|
|
|
|
|
|
|
|
or (what is actually much more likely):
|
|
|
|
|
|
|
|
```
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
  </w:rPr>
  <w:t>Gene Roddenberr</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:b/>
  </w:rPr>
  <w:t>y</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:bookmarkEnd w:id="0"/>
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
  </w:rPr>
  <w:t xml:space="preserve">'s </w:t>
</w:r>
<w:r w:rsidRPr="007449A0">
  <w:rPr>
    <w:b/>
    <w:i/>
  </w:rPr>
  <w:t>Star Trek</w:t>
</w:r>
```
|
|
|
|
|
|
|
|
|
|
|
|
(If you squint you can see that Word is preserving a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism.)
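
One concrete cleanup task this implies: adjacent runs carrying identical formatting (like the three bold runs spelling out "Gene Roddenberry's " above) should be merged back together. A minimal sketch of that normalization using only Python's standard library -- an illustration of the technique, not XSweet's actual code:

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def run_key(run):
    """Serialize a run's <w:rPr> so runs with identical formatting compare equal."""
    rpr = run.find(f"{W}rPr")
    return ET.tostring(rpr) if rpr is not None else b""

def merge_runs(paragraph_xml):
    """Collapse adjacent <w:r> elements with identical run properties,
    concatenating their <w:t> text. Bookmarks and other siblings are ignored."""
    root = ET.fromstring(paragraph_xml)
    merged = []  # [key, text] pairs, in document order
    for run in root.iter(f"{W}r"):
        text = "".join(t.text or "" for t in run.iter(f"{W}t"))
        key = run_key(run)
        if merged and merged[-1][0] == key:
            merged[-1][1] += text
        else:
            merged.append([key, text])
    return [text for _, text in merged]

# The split-run example from above, reduced to one paragraph:
sample = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Gene Roddenberr</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>y</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">\'s </w:t></w:r>'
    '<w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>Star Trek</w:t></w:r>'
    '</w:p>'
)
```

Note that comparing the serialized `<w:rPr>` deliberately ignores attributes on `<w:r>` itself, such as the `w:rsidRPr` revision identifiers -- exactly the "edit history" noise we want to discard.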
|
|
|
|
|
|
|
|
But, as a Wikipedia author puts it, "In modern word-processing systems, presentational markup is often saved in descriptive-markup-oriented systems such as XML, and then processed procedurally by implementations."
|
|
|
|
|
|
|
|
Internally, WordML is sloppy and highly redundant -- "noisy" in a formal sense (a measure of tag entropy) -- and 'promiscuous': the same thing may be said in many different ways, as well as repeatedly.
|
|
|
|
|
|
|
|
More importantly, it lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye. What the Word document (.docx) represents in operation is something entirely outside the .docx itself: namely, the printed artifact produced when you hit the "Print" button (or, today, the PDF you produce with a Save As). And even this is not entirely true, or is at least further complicated by the way Word also becomes a kind of environment in its own right, so that the .docx is never 'fully itself' except when it is in MS Word.
|
|
|
|
|
|
|
|
On the other hand we aren't actually interested in the Word document "fully itself", but the book, article, research paper or other "thing" hiding inside it.
|
|
|
|
|
|
|
|
The way we approach discovering the outlines of this form is, first, by representing the Word data as itself, with no enhancement or improvement. Only if we do *as little as possible* in changing the representation of the data, from Word's own obscure structures, into something relatively more legible and tractable, will we have a process that is as transparent as we need it to be.
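
Such a literal first pass can be illustrated as a pure change of notation: every `w:p` becomes a `p`, every `w:r` a `span`, with the text carried through untouched. (A sketch only -- the names are ours, HTML escaping is omitted, and XSweet's actual extraction carries much more through.)

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def literal_pass(body_xml):
    """Change notation, nothing else: w:p -> <p>, w:r -> <span>.
    No merging, no inference about structure -- maximum transparency."""
    root = ET.fromstring(body_xml)
    html_paras = []
    for p in root.iter(f"{W}p"):
        spans = []
        for r in p.iter(f"{W}r"):
            text = "".join(t.text or "" for t in r.iter(f"{W}t"))
            spans.append(f"<span>{text}</span>")
        html_paras.append("<p>" + "".join(spans) + "</p>")
    return "\n".join(html_paras)

# A body with one paragraph split into two runs:
sample = (
    '<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:p>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Gene Roddenberr</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>y</w:t></w:r>'
    '</w:p>'
    '</w:body>'
)
```

Because nothing is inferred, anything lost later in the pipeline can be traced back to a deliberate, inspectable step rather than to this one.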
|
|
|
|
|
|
|
|
The only problem with this architecture is that it entails the specification of a new intermediary form. It is not hard to envision what this would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, most especially idioms related to inline markup. But instead of using the obscure and impenetrable Word vocabulary, it would use either a standard or a made-for-purpose vocabulary, perhaps looking something like this:
|
|
|
|
|
|
|
|
```
HTML             <b>Gene Roddenberry's <i>Star Trek</i></b>

JATS             <bold>Gene Roddenberry's <italic>Star Trek</italic></bold>

Docbook          <emphasis role="bold">Gene Roddenberry's <emphasis>Star Trek</emphasis></emphasis>

TEI              <hi rend="bold">Gene Roddenberry's <hi rend="italic">Star Trek</hi></hi>

made-for-purpose <run format="b">Gene Roddenberry's <run format="i">Star Trek</run></run>
```
|
|
|
|
|
|
|
|
These are all more or less the same, or at any rate semantically equivalent, inasmuch as any one of them could be mapped to write any of the others.
|
|
|
|
|
|
|
|
It turns out that, among these, the choice is fairly clear: HTML5 is the winner among document formats as an initial target for a Word extractor. (Note *initial* target -- we say nothing of what we might improve this into eventually.) That is, we would like to say
|
|
|
|
|
|
|
|
```
<b>Gene Roddenberry's <i>Star Trek</i></b>
```
|
|
|
|
|
|
|
|
Or even (in the rare case when someone was at the other end of the Word document)
|
|
|
|
|
|
|
|
```
<b>Gene Roddenberry's <span class="title.cited">Star Trek</span></b>
```
|
|
|
|
|
|
|
|
if the Word user had assigned a character style "title.cited" to this range of text.
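
A sketch of the mapping just described -- bold and italic run properties to `<b>` and `<i>`, a Word character style to an HTML class. Illustrative Python only (the function name is ours, and text is not HTML-escaped here as a real pass must do):

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def run_to_html(run):
    """Render one <w:r> as flat HTML: <w:b/> -> <b>, <w:i/> -> <i>,
    and a character style (<w:rStyle w:val="X"/>) -> <span class="X">."""
    text = "".join(t.text or "" for t in run.iter(f"{W}t"))
    rpr = run.find(f"{W}rPr")
    if rpr is None:
        return text
    style = rpr.find(f"{W}rStyle")
    if style is not None:
        text = f'<span class="{style.get(W + "val")}">{text}</span>'
    if rpr.find(f"{W}i") is not None:
        text = f"<i>{text}</i>"
    if rpr.find(f"{W}b") is not None:
        text = f"<b>{text}</b>"
    return text

# A bold run whose text also carries a "title.cited" character style:
sample = (
    '<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:rPr><w:b/><w:rStyle w:val="title.cited"/></w:rPr>'
    '<w:t>Star Trek</w:t></w:r>'
)
```

Rendering run by run keeps the output flat -- adjacent runs with the same formatting come out as adjacent `<b>` wrappers -- which is exactly the "leave it flat as long as we need to" posture; merging wrappers is a later, separate pass.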
|
|
|
|
|
|
|
|
Why does HTML make a good target vocabulary?
|
|
|
|
|
|
|
|
(a) we can easily leave our documents 'flat' as long as we need to
|
|
|
|
(b) it has these fantastic escape hatches!
|
|
|
|
(c) did I mention escape hatches? One of them escapes into CSS, while the other can expose Word styles!
|
|
|
|
(d) yet at the same time, HTML semantics are not so rich as to be very arguable (anything will do)
|
|
|
|
(e) to top it off, HTML is a well-known vernacular and we are expecting to edit on an HTML platform, so why not?
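
The two escape hatches in (c) amount to this: a Word style name survives as @class, and direct formatting survives as literal CSS in @style. A toy illustration (a hypothetical helper, not XSweet's API):

```python
def paragraph_to_html(style_name, css_props, text):
    """Escape hatch one: the Word style name lands in @class.
    Escape hatch two: direct formatting lands in @style, as literal CSS."""
    css = "; ".join(f"{k}: {v}" for k, v in css_props.items())
    return f'<p class="{style_name}" style="{css}">{text}</p>'
```

Nothing is interpreted or discarded: a later pass can promote `class="Heading1"` to `<h1>`, or decide that `font-style: italic` on this paragraph style means emphasis -- or ignore both.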
|
|
|
|
|
|
|
|
Note the non-canonical and arguably deprecated heavy use of @style -- we justify this on the grounds that we are going *uphill*, and *by the time we reach the top* we can *cast these properties aside as nothing more than the engine that got us there*.
|
|
|
|
|
|
|
|
Next to these, the fact that HTML also has element-type semantics -- albeit impoverished ones: p, ul, li, the lists, the tables -- is useful but not essential, as long as we have generic "hangers" we can use such as div, p and span.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
There are some interesting differences between XSweet and other tools and methods addressing the same problem: the conversion or extraction of documentary data encoded in Microsoft Word (`.docx`) format into publication workflows. Why is this so hard, and what might we be able to do to make it easier?
|
|
|
|
|
|
|
|
|
|
|