... | ... | @@ -130,6 +130,8 @@ Yet we aren't actually interested in the Word document "fully itself", even a |
|
|
|
|
|
### Ascending step by step instead of all at once
|
|
|
|
|
|
![uphill-conversion.svg](/uploads/188a75fc4824095416e7bfe7fb7f69f1/uphill-conversion.svg)
|
|
|
|
|
|
Similarly, more complex kinds of structural inference -- especially of more "semantic" structures such as figures with captions, tables, pull quotes and the like -- present challenges not only because the data is complex, messy and redundant, but also because these structures are never given as such: they are present only implicitly. As writers, readers and editors we find them obvious enough (at least, we know how to interpret the cues we see), but at the level of the encoding these configurations are rarely the same in any two Word documents. Indeed, we may need to see them to know what they look like in any given case.
|
|
|
|
|
|
However, recognizing these limitations actually gives us a way forward. Rather than try to infer or construe information that is not given, we can show ourselves the information we do have, in a more legible and tractable form. That is, we translate the Word document into a form we can read, without trying to translate it all the way out of the language it uses to communicate what it says (whether to a printer, a PDF generator or a human reader) when, for example, it puts something in italics. In other words, our first task is to *extract* the data into a format that strips out the redundancy, distilling the "warrants" and claims of the Word document ("this bit is italic, that bit is styled 'Header.2'") into a form we can read and work with.
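To make the idea of extraction concrete, here is a minimal sketch. It is written in Python using only the standard library, purely for illustration; it is not the XSLT this project uses. The `SAMPLE` fragment, the `extract` function and the output vocabulary (`div`, `p`, `span`, `i`) are all assumptions made for the example. The point is only that the output records what the Word markup states (a style name, an italic run) without guessing what those things mean.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (as used in word/document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, hand-written fragment standing in for a real document body.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:pStyle w:val="Header.2"/></w:pPr>
    <w:r><w:t>Ascending </w:t></w:r>
    <w:r><w:rPr><w:i/></w:rPr><w:t>step by step</w:t></w:r>
  </w:p>
</w:body>
"""


def extract(body):
    """Return a flat <div> recording only what the Word markup states."""
    out = ET.Element("div")
    for p in body.findall("w:p", NS):
        para = ET.SubElement(out, "p")
        style = p.find("w:pPr/w:pStyle", NS)
        if style is not None:
            # Keep the style name as given; do not infer that it is a heading.
            para.set("class", style.get(f"{{{W}}}val"))
        for r in p.findall("w:r", NS):
            i = r.find("w:rPr/w:i", NS)
            # <w:i/> switches italics on unless its w:val says otherwise.
            italic = i is not None and i.get(f"{{{W}}}val") not in ("0", "false", "off")
            run = ET.SubElement(para, "i" if italic else "span")
            run.text = "".join(t.text or "" for t in r.findall("w:t", NS))
    return out


result = ET.tostring(extract(ET.fromstring(SAMPLE)), encoding="unicode")
print(result)
```

For the sample above, the result is a single paragraph carrying `class="Header.2"` with one plain run and one italic run. Deciding whether 'Header.2' should become a heading is deliberately left to a later step.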
|
... | ... | @@ -219,6 +221,7 @@ We like XSLT because it is |
|
|
* functional / declarative (good for maintenance/stability as well as performance)
|
|
|
* modular and homoiconic (useful in pipelining)
|
|
|
* capable for the job, even from scratch (i.e. no third-party libraries necessary) and using only standard features
|
|
|
* maybe *different from* and yet *complementary with* more mainstream approaches for web or HTML content (however well established in publishing back ends otherwise)
|
|
|
|
|
|
Our experience so far is that XSLT works for this application. Indeed, some XSLTs may be perfectly legible even to "non-programmers" who are also looking at inputs and outputs, while others perhaps are not. The collection of transformations (and its possible sequencings) may nonetheless make a useful object lesson.
|
|
|
|
... | ... | |