... | ... | @@ -118,7 +118,7 @@ or indeed (what is just as likely): |
|
|
</w:r>
|
|
|
```
|
|
|
|
|
|
(If you squint you can see that the code is starting to preserving a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism. And yes, in general, the more you edit the document the worse this will get.)
|
|
|
(If you squint you can see that the code is starting to preserve a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism. And yes, in general, the more you edit the document the worse this will get.)
|
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a metaphorical sense ('noise' as 'tag entropy'), and "promiscuous". (It turns out the same thing may be said in many different ways as well as repeatedly. Again, the snippet offered merely hints at the actual complexity and verbosity of WordML internals.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. (After all, that is what happens inside Word itself, and it isn't magic.) But before we can get even to that point, putting this back together may take some care.
|
|
|
|
... | ... | @@ -153,21 +153,14 @@ It turns out, among these the choice is fairly clear. Only a made-for-purpose la |
|
|
|
|
|
### Why HTML5 turns out to be 'most excellent'
|
|
|
|
|
|
That is, we would like to say
|
|
|
|
|
|
```
|
|
|
<b>Gene Roddenberry's <i>Star Trek</i></b>
|
|
|
```
|
|
|
|
|
|
Or even (in the rare case)
|
|
|
|
|
|
```
|
|
|
<b>Gene Roddenberry's <span class="title.cited">Star Trek</span></b>
|
|
|
```
|
|
|
|
|
|
if the Word user had assigned a character style "title.cited" to this range of text.
|
|
|
|
|
|
(And it turns out, this is doable -- essentially a matter of listening for the signal in the noise.)
|
|
|
as a target for a data conversion out of .docx, there would be three nice-to-haves:
|
|
|
Target format can capture presentational information (i.e. specs on how outputs should look on the 'page') - since that's the semantic domain of the .docx source, i.e. (much of) the information we have.
|
|
|
Target format may be unstructured or loosely structured; in any case structures are not enforced. That's because we know stuff coming in will have every structure and no structure.
|
|
|
Ideally it would be well-known, a lingua franca.
|
|
|
As you say, it turns out HTML/CSS is all these things.
|
|
|
Plus none of it is exclusive of XML discipline or even XML tools.
|
|
|
Indeed we could write schemas to validate whatever different 'flavors' or stages of HTML we wished to formalize.
|
|
|
|
|
|
Why does HTML make a good target vocabulary?
|
|
|
|
... | ... | @@ -176,13 +169,15 @@ Why does HTML make a good target vocabulary? |
|
|
- HTML @style invites us to use CSS! (And we are describing presentational features. Perfect.) Meanwhile, HTML @class can expose Word Styles (since it is for user-driven semantic labeling after all).
|
|
|
- Yet at the same time, HTML semantics are not so rich as to be very arguable (anything will do); we should be able to avoid complications regarding the purported conformance/orthodoxy of our outputs to applicable standards. (Also a separable problem.)
|
|
|
- To top it off, HTML is a well-known vernacular --
|
|
|
- A custom vocabulary would have to be designed, tested, documented and learned by users; HTML lets us just fake it for now*;
|
|
|
- A custom vocabulary would have to be designed, tested, documented and learned by users; HTML lets us just fake it for now;
|
|
|
- And since we are expecting to work further with our data on an HTML platform, and go from there when it comes to other formats for interchange/archiving - we can just stick with that plan ...
|
|
|
- ... Illustrating the point: anyone can use HTML5 (especially well-formed XML HTML5) so let's use that.
|
|
|
|
|
|
(* Later if need be we can come back to formalize the target format as a profile of HTML5+CSS.)
|
|
|
Note that choosing HTML/CSS as a vocabulary does not mean we need utterly to forsake XML discipline or even XML tools.
|
|
|
|
|
|
Indeed, over the medium or long term it could be useful to design XML schemas to describe any different stages or 'profiles' of HTML (perhaps, operational subsets of XHTML5/CSS) that we wished to formalize.
|
|
|
|
|
|
Note the non-canonical and arguably deprecated heavy use of @style - we justify this on the grounds that we are going *up hill* and *by the time we reach the top* we can *cast these properties aside as nothing more than the engine that has got us there*.
|
|
|
We make use of @style in ways that are deprecated and may be discouraged in other contexts. We justify our use of @style in this way as a means to an end - we represent information on @style so that it can be abstracted away (and the style can be gotten rid of).
|
|
|
|
|
|
Next to these, the fact that HTML also has an element-type semantics albeit an impoverished one - p, ul, li, the lists, tables - is useful, but not essential, as long as we have generic "hangers" we can use such as div, p and span.
|
|
|
|
... | ... | @@ -192,17 +187,17 @@ Next to these, the fact that HTML also has an element-type semantics albeit an i |
|
|
|
|
|
## How it will work
|
|
|
|
|
|
XSLT (specifically XSLT 2.0) is a great tool for this job. (See below.) The most powerful, flexible and generic (hence portable) approach would combine XSLT stylesheets in a pipeline, a multi-step process beginning with "data extraction" (reading the data from the WordML and re-expressing it, as 'literally' as possible, in HTML+CSS), and then proceeding through as many steps as necessary of subsequent "refinement", in which the markup would be cleaned up and enhanced. One advantage of a pipeline arrangement is how straightforward it makes it to extend and modify (by changing or adding steps only where necessary).
|
|
|
XSLT (specifically XSLT 2.0) is a great tool for this job. (See below.) The most powerful, flexible and generic (hence portable) approach would combine XSLT stylesheets in a pipeline, a multi-step process beginning with "data extraction" (reading the data from the WordML and re-expressing it, as 'literally' as possible, in HTML+CSS), and then proceeding through as many steps as necessary of subsequent "refinement", in which the markup is cleaned up and enhanced. One advantage of a pipeline arrangement is how straightforward it makes it to extend and modify (by changing or adding steps only where necessary), either to improve its capability generally, or to modify it for specific cases.
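
The pipeline shape described above can be sketched in a few lines. This is only an illustration in Python, not the project's XSLT: the function names (`run_pipeline`, `drop_empty_inlines`, `drop_blank_style_attributes`) and the two refinement steps are hypothetical, chosen to show how each step takes a tree and returns a tree, so that steps can be added, removed or reordered without touching the others.

```python
import xml.etree.ElementTree as ET
from functools import reduce
from typing import Callable, Iterable

# A pipeline step takes an HTML tree and returns a (possibly modified) tree.
Step = Callable[[ET.Element], ET.Element]

# Inline elements we are willing to discard when they carry no content.
INLINE = {"b", "i", "span"}


def run_pipeline(doc: ET.Element, steps: Iterable[Step]) -> ET.Element:
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda tree, step: step(tree), steps, doc)


def drop_empty_inlines(root: ET.Element) -> ET.Element:
    """Hypothetical refinement step: remove inline elements with no text and
    no children, keeping any text that followed them."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag in INLINE and len(child) == 0 and not child.text:
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


def drop_blank_style_attributes(root: ET.Element) -> ET.Element:
    """Hypothetical refinement step: delete style attributes that are blank."""
    for element in root.iter():
        if "style" in element.attrib and not element.attrib["style"].strip():
            del element.attrib["style"]
    return root


if __name__ == "__main__":
    extracted = ET.fromstring('<p>Gene <b/>Roddenberry<span style="">!</span></p>')
    cleaned = run_pipeline(extracted, [drop_empty_inlines, drop_blank_style_attributes])
    print(ET.tostring(cleaned, encoding="unicode"))
```

Modifying the pipeline for a specific case then amounts to editing the list of steps passed to `run_pipeline`.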
|
|
|
|
|
|
Pipelines are logical organizations and can be implemented in many different ways. While with our code base we will offer pipeline configurations using both Bash (the Bourne-again shell) and W3C XProc, for reference and component testing, we also expect to use our sibling project INK as a pipelining architecture, making XSweet available as a service to anyone using INK.
|
|
|
Pipelines are logical organizations and can be implemented in many different ways. While with our code base we can presently offer XSLT pipeline configurations using both Bash (the Bourne-again shell) and W3C XProc, for reference and component testing, we also expect to use our sibling project INK as a pipelining architecture, making XSweet available as a service to anyone using INK.
|
|
|
|
|
|
### HTML soup
|
|
|
|
|
|
As noted above, sloppy or soupy HTML is what we expect, in particular because, as we perform no structural induction, everything comes through "flat" - one 'p' after another. If there is any structure to what we emit, it is introduced only very carefully by XSLTs explicitly designated for that purpose.
|
|
|
As noted above, sloppy or soupy HTML is what we expect, in particular because, as we perform no structural induction, everything comes through "flat" - one 'p' after another. If there is any structure to what we emit, it is introduced only very carefully by transformations or processes (XSLT or other) explicitly designated for that purpose.
|
|
|
|
|
|
Even sloppy HTML, however, should be relatively idiomatic -- it should look not too much worse than the examples above. (To permit it to be very complicated will not be doing anyone favors.) It can, however -- within the limitations of that idiom -- express anything we want it to.
|
|
|
|
|
|
If we want it to "sound like" the source, only rectify all its inadequacies, this means translating the information we see, into a language we understand. At the level of structures in the Word document (either lines/paragraphs or inline 'text' i.e. arbitrary ranges), we have either or both of two sorts of evidence. First is a label or 'family' or 'class' assignment by virtue of the name of an attached Style (in Word, a so-called Paragraph or Text style). Alternatively, or in addition, we have explicit formatting properties as designated explicitly - including font shifts, bold/italics, whitespace properties and so forth. (Properties assigned by any other mechanism are likely to be document-wide in scope i.e. not differentiating, therefore not of interest to us.)
|
|
|
If we want it to "sound like" the source, while being prepared to rectify its inadequacies, this means translating the information we see, into a language we understand. At the level of structures in the Word document (either lines/paragraphs or inline 'text' i.e. arbitrary ranges), we have either or both of two sorts of evidence. First is a label or 'family' or 'class' assignment by virtue of the name of an attached Style (in Word, a so-called Paragraph or Text style). Alternatively and in addition, we have explicit formatting properties as designated using Word's formatting tools, whether they be font shifts, bold/italics, whitespace properties and so forth. (Properties assigned by any other mechanism are likely to be document-wide in scope i.e. not differentiating, therefore not of interest to us.)
|
|
|
|
|
|
As it happens, HTML's **class** and **style** mechanisms (permitted in valid HTML since the 1990s) give us everything we need, especially since the conventional use of **style** (and indeed of **class**) is to call to CSS (either embedded or bound externally) -- which (Fortune smiles) just happens to be an entire language, and a well-known vernacular, for expressing just the sort of information (more or less) we have at hand - namely things like font size and face shifts, etc.
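
To make the two sorts of evidence concrete, here is a minimal sketch of how a single WordML run might be re-expressed as an HTML `span`. It is written in Python for illustration only and is not the project's XSLT. The function name `run_to_span` is hypothetical, only three formatting properties are handled (`w:b`, `w:i` and `w:sz`, whose value is in half-points), and the style identifier found on `w:rStyle` is copied verbatim into **class**, which is an assumption of this sketch rather than a stated design decision.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(element):
    """Return the w:val attribute of an element, or None if it is absent."""
    return element.get(f"{{{W}}}val")


def _is_on(element):
    """A toggle such as w:b is on when present, unless w:val switches it off."""
    return element is not None and _val(element) not in ("0", "false", "off")


def run_to_span(run):
    """Hypothetical helper: map one w:r element to an HTML span.

    The named style goes to @class; explicit formatting goes to @style as CSS.
    """
    span = ET.Element("span")
    declarations = []
    props = run.find("w:rPr", NS)
    if props is not None:
        style = props.find("w:rStyle", NS)
        if style is not None and _val(style):
            span.set("class", _val(style))
        if _is_on(props.find("w:b", NS)):
            declarations.append("font-weight: bold")
        if _is_on(props.find("w:i", NS)):
            declarations.append("font-style: italic")
        size = props.find("w:sz", NS)
        if size is not None and (_val(size) or "").isdigit():
            # w:sz is measured in half-points.
            declarations.append(f"font-size: {int(_val(size)) / 2:g}pt")
    if declarations:
        span.set("style", "; ".join(declarations))
    span.text = "".join(t.text or "" for t in run.findall("w:t", NS))
    return span


if __name__ == "__main__":
    run = ET.fromstring(
        f'<w:r xmlns:w="{W}"><w:rPr><w:rStyle w:val="title.cited"/><w:i/>'
        '<w:sz w:val="24"/></w:rPr><w:t>Star Trek</w:t></w:r>'
    )
    print(ET.tostring(run_to_span(run), encoding="unicode"))
```

A later refinement step could then decide which of these **style** declarations to abstract away into element types or classes, and which to discard.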
|
|
|
|
... | ... | |