|
# DOCX -> XML: the XSweet Approach
|
|
|
|
|
|
|
|
## Background
|
|
|
|
|
|
It's a hard problem: conversion or extraction of data encoded in Microsoft Word (`.docx`) format, pulling them into publication workflows - what does XSweet do differently from other tools and methods addressing the same problem?
|
|
|
|
|
|
We start with a couple of observations and assumptions --
|
|
|
|
|
|
Having a 'black box' solution to get data from MS Word into publishing workflows - even one that works well - is a necessary but not a sufficient condition for what will come next. It is difficult to make progress in developing these workflows without such a tool. Yet even when we have one that works tolerably well, it will not be enough: we can expect always to have to augment it.
|
|
|
|
|
|
Works authored in Word are too variable in form and purpose, while at the same time, for some works, the creative process does not always (or even usually) end when the book goes to the designer -- or even, today, the printer, or whatever the contemporary equivalents are. A One Size Fits All (or even most) solution would probably exist by now (much as HTML Tidy, CURL or other open source utilities exist for various common or ubiquitous tasks) if there were a clean, dependable and reliable way to get any kind of clean markup out of Word. Despite a couple of seeming near-misses, we don't have such a tool.
|
|
Works authored in Word are too variable in form and purpose, while at the same time, for some works, the creative process does not always (or even usually) end when the book goes to the designer -- or even, today, the printer, or whatever the contemporary equivalents are. One Size Fits All (or even most) is a worthy goal, but if such a solution were possible, one imagines it would exist by now -- such is the demand for it -- much as HTML Tidy, CURL or other open source utilities exist for various common or ubiquitous tasks. Despite a couple of seeming near-misses and plenty of workaround-pathways, we don't have a tool that can reliably and simply deliver clean markup out of Word -- or at least, the clean markup we need, out of the Word data we have. Nor is this exactly because the work hasn't been done. But the very terms of their success also show the problem.
|
|
|
|
|
|
Nor is this exactly because the work hasn't been done: indeed, solutions exist. But the very terms of their success also show the problem.
|
|
|
|
|
|
|
|
The available solutions, both proprietary and open source, all constrain themselves to handle only a subset of Word documents, while at the same time more or less requiring some level of customization even within this subset, on further subsets or even on the individual document. The required customizations can take the form either of tool development or tuning (configuration, extension or modification of the tool), or handwork on the documents themselves, or both. This typically requires a level of expert assistance that often makes the work prohibitively difficult.
|
|
|
|
|
|
Not being able to provide a 100% solution, however, does not necessarily make an 80% solution -- if it really is that -- less useful or less valuable. On the contrary. And while we may not be able to do away with the need for expert assistance for best results, we do believe we may be able to advance the state of play.
|
|
|
|
|
|
|
|
## Towards a solution
|
|
|
|
|
|
The best way we think we can do this is at the level of the architecture. We must design from the outset with the idea that everything in XSweet should be designed for adaptation and reuse. XSweet should work like a black box but you should also be able to open it up and rewire it -- completely, if need be.
|
|
|
|
|
|
This means - we AIM (first) FOR USEFUL, NOT (yet) COMPLETE OR PERFECT
|
... | @@ -39,23 +41,26 @@ The beauty of listing them in order is that we can see that at least the low end |
|
|
|
|
|
This is because WORDML IS NOT WHAT (PRACTITIONERS CALL) GENERIC MARKUP
|
|
|
|
|
|
-----
|
|
## Aiming for the right target
|
|
|
|
|
|
|
|
Short version: "Generic markup" (sometimes called 'descriptive markup') is what we are aiming for ultimately -- but the way to produce it is to go "up hill" (towards a putatively clean and economical representation of the document) only by stages.
|
|
|
|
|
|
Here is a microcosmic view of the problem, reduced to the barest possible example. The rest of the problem is much like this, only greatly magnified in scale and complexity.
|
|
As an illustration, consider a microcosmic view of the problem, reduced to the barest possible example. (The rest of the problem is much like this, only greatly magnified in scale and complexity.)
|
|
|
|
|
|
The problem - take the following line:
|
|
|
|
|
|
<b>Gene Roddenberry's <i>Star Trek</i></b>
|
|
|
|
|
|
We might prefer (for one reason or another) to have any of these --
|
|
We might prefer (for one reason or another) to have any of these as a nicely-tagged representation suitable for further processing in an appropriate toolchain --
|
|
|
|
|
|
- DocBook `<emphasis role="strong">Gene Roddenberry's <citation>Star Trek</citation></emphasis>`
|
|
- TEI `<emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>`
|
|
- JATS `<named-content content-type="styled.bold">Gene Roddenberry's <named-content content-type="title.cited">Star Trek</named-content></named-content>`
|
|
- DITA `<b>Gene Roddenberry's <cite>Star Trek</cite></b>`
|
|
|
|
|
|
|
|
Basically what these all have in common, albeit each one expressing it in different ways, is that they encode aspects and features of the text that identify them by 'kind' (hence 'generic' markup). As such, they are fit not only for archiving but for arbitrary reuse, aggregation into collections (where tag semantics can support querying), republishing in manifold formats, etc. etc.
|
|
|
|
|
|
All of them are a very far cry from what the WordML will have, something like this:
|
|
It is reasonable to stipulate any or all of these as worthwhile end points only because we know of real systems that use all of them. The question here is not whether to aim for this, but how. Especially since despite their variations they all have one thing in common, namely how far they are from what is going to be discovered inside a Word document, such as:
|
|
|
|
|
|
WordML
|
|
```
... | @@ -74,7 +79,6 @@ WordML |
</w:r>
```
|
|
|
|
|
|
|
|
|
|
or (what is actually much more likely):
|
|
|
|
|
|
```
... | @@ -107,20 +111,17 @@ or (what is actually much more likely): |
</w:r>
```
|
|
|
|
|
|
|
|
(If you squint you can see that the code is starting to preserve a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism.)
|
|
|
|
|
|
(If you squint you can see that Word is preserving a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism.)
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a formal sense (measure of tag entropy), and 'promiscuous'. (It turns out the same thing may be said in many different ways as well as repeatedly. BTW the snippet offered merely hints at the actual complexity and verbosity of WordML internals.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. (After all, that is what happens inside Word itself, and it isn't magic either.) But before we can get even to that point, putting this back together may take some care.
|
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a formal sense (measure of tag entropy), and 'promiscuous'. (It turns out the same thing may be said in many different ways as well as repeatedly.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. But XML really isn't it. Putting this back together may take some care.
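To make "putting this back together" a little more concrete, here is a minimal sketch in Python of one cleanup step: collapsing adjacent runs that carry identical formatting. It is an illustration only, not XSweet's own code. The representation of a run as a `(text, formats)` pair, the format names `"b"` and `"i"`, and the function name `merge_runs` are all assumptions made for the example; real WordML runs carry far more properties than this.

```python
from itertools import groupby


def merge_runs(runs):
    """Collapse adjacent runs that carry identical formatting.

    Each run is a hypothetical (text, formats) pair, where formats is a
    frozenset of format names. This is a simplification for illustration.
    """
    merged = []
    # groupby only groups *consecutive* items with equal keys, which is
    # exactly what we want: runs that sit next to each other.
    for formats, group in groupby(runs, key=lambda run: run[1]):
        merged.append(("".join(text for text, _ in group), formats))
    return merged


# A line fragmented the way an edited document might fragment it.
runs = [
    ("Gene Rodden", frozenset({"b"})),
    ("berry's ", frozenset({"b"})),
    ("Star ", frozenset({"b", "i"})),
    ("Trek", frozenset({"b", "i"})),
]

for text, formats in merge_runs(runs):
    print(sorted(formats), repr(text))
```

With the sample input, the four fragments reduce to two runs: one bold, one bold and italic.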
|
|
|
|
|
|
|
|
More importantly, WordML lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye. What the Word document (.docx) represents in operation is something entirely outside the .docx itself, namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). And even this is actually not entirely true or is, at least, further complicated by the way in which Word also becomes a kind of environment in its own right, and so the .docx is never 'fully itself' except when it is in MS Word itself.
|
|
More importantly, WordML lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye (using our semantically-aware markup vocabulary). What the Word document (.docx) represents in operation is something entirely outside the .docx itself (regarded as a file or data set), namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). And even this is actually not entirely true or is, at least, further complicated by the way in which Word also becomes a kind of environment in its own right, and so the .docx is never 'fully itself' except when it is in MS Word itself.
|
|
|
|
|
|
On the other hand we aren't actually interested in the Word document "fully itself", even once it is printed or displayed (correctly) on screen -- but the book, article, research paper or other "thing" hiding inside it.
|
|
|
|
|
|
Even in this little sample, a case of this is evident if you look closely at the *structure* of the outputs compared to the inputs. The clean markup samples all have the structure (bold (italic)), where the 'italic' span is nested inside the bold span. In contrast, the Word document shows (bold)(bold italic) - two spans next to each other.
|
|
Even in this little sample, a case of this is evident if you look closely at the *structure* of the outputs compared to the inputs. The clean markup samples all have the structure <b>(bold <i>(italic)</i>)</b>, where the 'italic' span is nested inside the bold span. In contrast, the Word document shows <b>(bold)<i>(bold italic)</i></b> - two spans next to each other.
|
|
|
|
|
|
There is no absolute rule that says this bit of text "is" one or the other: our problem is only that we need to get from one representation to the next.
|
|
|
|
|
|
|
|
However, by getting to any of them, we hope to be able to get to all of them -- although this is only the beginning of the "impedance mismatch" between Word and the conceivable target formats. Indeed, this particular transposition in the treatment of paragraph contents is one that can be handled with a general transposition, so it is "only painful". In other words, it is the kind of pain we can make go away with our black box, which has a component that performs this transposition for us.
|
|
There is no absolute rule that says this bit of text "is" one or the other: our problem is only that we need to get from one representation to the next. And indeed, this particular transposition in the treatment of paragraph contents is one that can be handled with a general transposition, so it is "only painful". In other words, it is the kind of pain we can make go away with our black box, which has a component that performs this transposition for us.
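The transposition from (bold)(bold italic) to (bold (italic)) can be sketched in a few lines of Python. This is a hedged illustration of the general idea, not the component XSweet actually ships. It assumes runs have already been reduced to hypothetical `(text, formats)` pairs, and the function name `nest_runs`, the tag names and the fixed nesting `order` are choices made only for this example.

```python
from xml.sax.saxutils import escape


def nest_runs(runs, order=("b", "i", "u")):
    """Turn a flat sequence of formatted runs into nested inline markup.

    runs  -- hypothetical (text, formats) pairs; formats is a set of names
    order -- assumed nesting order, outermost first; names not listed here
             are ignored by this sketch
    """
    out = []
    stack = []  # tags currently open, outermost first
    for text, formats in runs:
        # Close innermost tags until every tag still open applies to this run.
        while any(tag not in formats for tag in stack):
            out.append(f"</{stack.pop()}>")
        # Open whatever this run needs that is not already open.
        for tag in order:
            if tag in formats and tag not in stack:
                stack.append(tag)
                out.append(f"<{tag}>")
        out.append(escape(text))
    # Close anything left open at the end of the paragraph.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


flat = [
    ("Gene Roddenberry's ", {"b"}),
    ("Star Trek", {"b", "i"}),
]
print(nest_runs(flat))
```

Because the bold tag stays open across both runs, the italic span ends up nested inside it rather than sitting beside it. A fixed nesting order is the simplest policy that works for this sample; other inputs might call for a different one.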
|
|
|
|
|
|
More complex kinds of structural inference -- especially of more "semantic" structures such as figures-with-captions, tables, pull quotes and what have you -- we expect to be more difficult, since they are never the same in the wild. Indeed, we need to see them to really know what they look like. But here, agile development processes and an active feedback loop will help. The key is to be attuned at all times to the particular formal aspects of the work at hand.
|
|
|
|
|
... | | ... | |