... | ... | @@ -6,13 +6,13 @@ It's a hard problem: conversion or extraction of data encoded in Microsoft Word |
|
|
|
|
|
We start with a couple of observations and assumptions --
|
|
|
|
|
|
Having a 'black box' solution to get data from MS Word into publishing workflows - even one that works well - is a necessary but not a sufficient condition for what will come next. It is difficult to make progress in developing these workflows without such a tool. Yet even when we have one that works tolerably well, it will not be enough: we can expect always to have to augment it.
|
|
|
Having a 'black box' solution to get data from MS Word into publishing workflows - even one that works well - is a necessary but not a sufficient condition for what will come next. It is difficult to make progress in developing these workflows without such a tool. Yet even when we have one that works tolerably well, it will not be enough: we can always expect to have to augment or tweak it, at least at the edges.
|
|
|
|
|
|
Works authored in Word are too variable in form and purpose, while at the same time, for some works, the creative process does not always (or even usually) end when the book goes to the designer -- or even, today, the printer, or whatever the contemporary equivalents are. One Size Fits All (or even most) is a worthy goal, but if such a solution were possible, one imagines it would exist by now -- such is the demand for it -- much as HTML Tidy, CURL or other open source utilities exist for various common or ubiquitous tasks. Despite a couple of seeming near-misses and plenty of workaround-pathways, we don't have a tool that can reliably and simply deliver clean markup out of Word -- or at least, the clean markup we need, out of the Word data we have. Nor is this exactly because the work hasn't been done: tools exist, and some succeed on their own terms. But the very terms of their success also show the problem.
|
|
|
|
|
|
The available solutions, both proprietary and open source, all constrain themselves to handle only a subset of Word documents, while at the same time more or less requiring some level of customization even within this subset, on further subsets or even on the individual document. The required customizations can take the form either of tool development or tuning (configuration, extension or modification of the tool), or handwork on the documents themselves, or both. This typically requires a level of expert assistance that often makes the work prohibitively difficult.
|
|
|
|
|
|
Not being able to provide a 100% solution, however, does not necessarily make an 80% solution -- if it really were that -- less useful or less valuable. On the contrary. And while we may not be able to do away with the need for expert assistance for best results, we do believe we may be able to advance the state of play.
|
|
|
Not being able to provide a 100% solution, however, does not necessarily make an 80% solution -- if it really were that -- less useful or less valuable. On the contrary -- getting most of the way up might be a big advance, at any rate if it presented us with options on how best to take a "data conversion" further. So there has to be editing, normalization and enhancement - that might be regarded as a feature and a situation to be encouraged, not a bug.
|
|
|
|
|
|
## Towards a solution
|
|
|
|
... | ... | @@ -49,7 +49,7 @@ Because our intermediate formats, however, will (also) be HTML, they may be imme |
|
|
|
|
|
Interestingly enough, we can do all of this with an XML-based, and specifically an XSLT-based, pipeline architecture. Not only that, but if we take care that our HTML5 outputs are also well-formed XML, we can attach the extraction component to further processes (including XSLT processes) to provide missing parts of a complete solution.
|
|
|
|
|
|
### Generic markup (and its discontents)
|
|
|
### Generic markup (Considered as one of the Fine Arts)
|
|
|
|
|
|
As an illustration of our problem in general, consider a microcosmic view: an example reduced to the barest possible minimum. (The rest of the problem is much like this, only greatly magnified in scale and complexity.) Consider the following line:
|
|
|
|
... | ... | |