Works authored in Word are too variable in form and purpose, while at the same time ...
|
|
|
|
|
The available solutions, both proprietary and open source, all constrain themselves to handling only a subset of Word documents, while at the same time more or less requiring some level of customization even within this subset -- on further subsets, or even on individual documents. The required customizations can take the form of tool development or tuning (configuration, extension or modification of the tool), of handwork on the documents themselves, or both. This requires a level of expert assistance that often makes the work prohibitively difficult.
|
|
|
|
|
|
|
|
|
Not being able to provide a 100% solution, however, does not necessarily make an 80% solution -- if it really were that -- less useful or less valuable. On the contrary. And while we may not be able to do away with the need for expert assistance for best results, we do believe we may be able to advance the state of play.
|
|
|
|
|
|
## Towards a solution
|
|
|
|
|
|
The best way we think we can do this is at the level of the architecture. We must design from the outset with the idea that everything in XSweet should be designed for adaptation and reuse. XSweet should work like a black box but you should also be able to open it up and rewire it -- completely, if need be.
|
|
|
|
|
|
|
|
|
And, because we already know that 'perfect for everyone all the time' is impossible, WE AIM (first) FOR USEFUL, NOT (yet) COMPLETE OR PERFECT.
|
|
|
|
|
|
XSweet does not have to try to solve the entire problem. Instead, we break the problem into chunks and pieces, and address them serially, "in detail".
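One way to picture this "chunks and pieces, addressed serially" approach is as a pipeline of small, independently replaceable transformation steps. The sketch below is hypothetical Python (XSweet itself is XSLT-based); the step names and bodies are placeholders, not actual XSweet components:

```python
from typing import Callable, List

# A document transformation is just a function from text to text.
Step = Callable[[str], str]

def run_pipeline(doc: str, steps: List[Step]) -> str:
    """Apply each step in series. Any step can be swapped out,
    reordered, or removed -- the 'rewire the black box' principle."""
    for step in steps:
        doc = step(doc)
    return doc

# Placeholder steps, each addressing one small chunk of the problem:
collapse_spaces = lambda s: " ".join(s.split())   # tidy whitespace noise
wrap_paragraph  = lambda s: "<p>" + s + "</p>"    # emit minimal ("sloppy") HTML

print(run_pipeline("  Gene   Roddenberry's  Star Trek ",
                   [collapse_spaces, wrap_paragraph]))
# -> <p>Gene Roddenberry's Star Trek</p>
```

Because each step is self-contained, fixing one "chunk" of the problem never requires touching the others.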
|
|
|
|
Can we prioritize them based on which of them are more universal / ubiquitous, versus more specialized?
|
|
- Specialized objects: math, formulae, drawings
|
|
|
- Specialized indexes
|
|
|
|
|
|
|
|
|
The beauty of listing them in order is that we can see that at least the low end can be addressed as "sloppy HTML" or HTML slops (messy, but nutritious) -- our target format of choice (see below).
|
|
|
|
|
|
This is because WORDML IS NOT WHAT (practitioners call) GENERIC MARKUP.
|
|
|
|
As an illustration of our problem in general, consider a microcosmic view, an example:
|
|
|
|
|
<b>Gene Roddenberry's <i>Star Trek</i></b>
|
|
|
|
|
|
|
|
|
We might prefer (for one reason or another) to have any of these as a nicely-tagged representation of the bit of text shown above. As such, any of them would be suitable for further processing in an appropriate toolchain. (These are all variant species of descriptive or generic markup.)
|
|
|
|
|
|
- `<emphasis role="strong">Gene Roddenberry's <citation>Star Trek</citation></emphasis>` (DocBook)
|
|
|
- `<emph rend="bold">Gene Roddenberry's <title>Star Trek</title></emph>` (TEI)
|
|
|
|
|
|
Basically what these all have in common, albeit each one expressing it in different ways, is that they encode aspects and features of the text that identify them by 'kind' (hence 'generic' markup). As such, they are fit not only for archiving but for arbitrary reuse, aggregation into collections (where consistent tagging semantics can support querying), republishing in manifold formats, etc. etc.
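To make the reuse point concrete: once content is identified by kind, re-rendering it into any particular vocabulary is a mechanical mapping. A toy Python sketch -- the mapping table and function name are illustrative only, not part of any of these encoding systems:

```python
# Toy illustration: generic markup identifies content by kind ('cited
# title'), so one source can feed many output vocabularies via a table.
TITLE_RENDERINGS = {
    "docbook": "<citation>{t}</citation>",
    "tei":     "<title>{t}</title>",
    "html":    "<cite>{t}</cite>",
}

def render_cited_title(text: str, target: str) -> str:
    """Render a 'cited title' in the requested target vocabulary."""
    return TITLE_RENDERINGS[target].format(t=text)

print(render_cited_title("Star Trek", "tei"))   # <title>Star Trek</title>
print(render_cited_title("Star Trek", "html"))  # <cite>Star Trek</cite>
```

The hard part, as the rest of this section shows, is getting to "identified by kind" in the first place.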
|
|
|
|
|
|
|
|
|
It is reasonable to stipulate any or all of these as worthwhile end points, if only because we know of real systems that use all of them. The question here is not whether to aim to acquire or derive markup of this quality, but how. Especially since despite their variations these all have another thing in common, namely how far they are from what is going to be discovered inside a Word document, such as (here is a sample of the XML we find buried deep in a .docx file):
|
|
|
|
|
|
```
|
|
|
<w:r w:rsidRPr="007449A0">
  ...
```
|
or indeed (what is just as likely):
|
|
|
|
|
(If you squint you can see that the code is starting to preserve a kind of 'edit history' by not cleaning up after itself -- as well as other artifacts of word-processorism. And yes, in general, the more you edit the document the worse this will get.)
|
|
|
|
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a metaphorical sense ('noise' as 'tag entropy'), and "promiscuous". (It turns out the same thing may be said in many different ways as well as repeatedly. Again, the snippet offered merely hints at the actual complexity and verbosity of WordML internals.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. (After all, that is what happens inside Word itself, and it isn't magic.) But before we can get even to that point, putting this back together may take some care.
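As a hint at what "putting this back together" can look like: adjacent runs that carry identical formatting properties can be merged. The sketch below is one possible tactic in Python's standard-library ElementTree, not XSweet's actual (XSLT-based) implementation, and the XML is a simplified stand-in for real WordML:

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def merge_runs(para):
    """Merge consecutive w:r elements whose run properties (w:rPr)
    serialize identically -- collapsing WordML's redundant run splits."""
    prev_run, prev_key = None, None
    for run in list(para.findall(W + "r")):
        rpr = run.find(W + "rPr")
        key = ET.tostring(rpr) if rpr is not None else b""
        if prev_run is not None and key == prev_key:
            # Same properties as the previous run: fold the text together.
            prev_t, t = prev_run.find(W + "t"), run.find(W + "t")
            if prev_t is not None and t is not None:
                prev_t.text = (prev_t.text or "") + (t.text or "")
            para.remove(run)
        else:
            prev_run, prev_key = run, key

# Two italic runs that Word needlessly split apart:
para = ET.fromstring(
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>Star </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>Trek</w:t></w:r>'
    '</w:p>')
merge_runs(para)
print(len(para.findall(W + "r")), para.find(W + "r").find(W + "t").text)
# -> 1 Star Trek
```

Comparing serialized properties byte-for-byte is crude but safe: runs are merged only when Word's own record of their formatting is literally identical.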
|
|
|
|
|
|
Even setting aside the noise and redundancy, however (and there are ways of dealing with them), there is a more crucial problem. Namely, the information we want simply isn't there. Even in the tiny example, there is nothing to indicate that the italicized string *Star Trek* is a 'title' or 'title.cited', as one or another of the descriptive encoding systems has it. This represents a more formidable barrier -- how to know this from that, how to apply the recommended tagging correctly to data that gives no explicit indication. (Not every italicized bit of text will be a 'cited title', by any means.) What is worse, we face this impediment everywhere we turn.
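This limitation suggests a conservative strategy: map italics to presentational markup first, and promote to semantic tagging only where some *additional* source of information licenses it. A hypothetical Python sketch -- the `known_titles` set stands in for whatever external knowledge (a style convention, a lookup list, a human editor) the Word file itself cannot supply:

```python
import re

def promote_italics(html: str, known_titles: set) -> str:
    """Promote <i>...</i> to <cite>...</cite> only when the content is a
    known cited title. The decision requires information that is simply
    not present in the WordML source."""
    def repl(m):
        inner = m.group(1)
        return f"<cite>{inner}</cite>" if inner in known_titles else m.group(0)
    return re.sub(r"<i>(.*?)</i>", repl, html)

print(promote_italics("<b>Gene Roddenberry's <i>Star Trek</i></b>", {"Star Trek"}))
# -> <b>Gene Roddenberry's <cite>Star Trek</cite></b>
print(promote_italics("a <i>very</i> good show", {"Star Trek"}))
# -> a <i>very</i> good show
```

Mere emphasis stays presentational; only text the external knowledge recognizes gets the semantic upgrade.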
|
|
|
|