... | ... | @@ -190,6 +190,36 @@ Next to these, the fact that HTML also has an element-type semantics albeit an i |
|
|
|
|
|
## How it will work
|
|
|
|
|
|
XSLT (specifically XSLT 2.0) is a great tool for this job. (See below.) The most powerful, flexible and generic (hence portable) approach would combine XSLT stylesheets in a pipeline: a multi-step process that begins with "data extraction" (reading the data from the WordML and re-expressing it, as 'literally' as possible, in HTML+CSS) and then proceeds through as many subsequent "refinement" steps as necessary, in which the markup is cleaned up and enhanced. One advantage of a pipeline arrangement is how straightforward it is to extend and modify, since steps need to be changed or added only where necessary.
|
|
|
|
|
|
Pipelines are logical organizations and can be implemented in many different ways. With our code base we will offer pipeline configurations in both shell script (bash) and W3C XProc, for reference and component testing. We also expect to use our sibling project INK as a pipelining architecture, making XSweet available as a service to anyone using INK.
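
To make the pipeline idea concrete, here is a minimal sketch in Python (standard library only). It is not the XSweet code, which is XSLT; the step names, the toy `<doc>`/`<para>` input (a stand-in for WordML) and the `run_pipeline` helper are all assumptions made up for illustration. It shows only the logical shape: one extraction step followed by any number of refinement steps, each taking a tree and returning a tree.

```python
import xml.etree.ElementTree as ET
from functools import reduce


def extract(doc):
    """Extraction step (hypothetical): re-express each source paragraph as a flat HTML 'p'."""
    body = ET.Element("body")
    for para in doc.iter("para"):
        p = ET.SubElement(body, "p")
        p.text = para.text
    return body


def drop_empty_paragraphs(body):
    """Refinement step (hypothetical): remove paragraphs that contain only whitespace."""
    for p in list(body):
        if not (p.text or "").strip():
            body.remove(p)
    return body


def normalize_whitespace(body):
    """Refinement step (hypothetical): collapse runs of whitespace inside each paragraph."""
    for p in body.iter("p"):
        p.text = " ".join((p.text or "").split())
    return body


def run_pipeline(tree, steps):
    """Feed the output of each step to the next one, in order."""
    return reduce(lambda current, step: step(current), steps, tree)


# Extending the pipeline means editing this list and nothing else.
PIPELINE = [extract, drop_empty_paragraphs, normalize_whitespace]

# Toy input standing in for WordML; real input would be far more verbose.
source = ET.fromstring(
    "<doc><para>Hello   world</para><para>  </para><para>Bye</para></doc>"
)
result = run_pipeline(source, PIPELINE)
print(ET.tostring(result, encoding="unicode"))
```

Because every step has the same signature, a step can be tested on its own, reordered, or dropped without touching the others, which is the property the pipeline arrangement is meant to provide.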
|
|
|
|
|
|
### HTML soup
|
|
|
|
|
|
As noted above, sloppy or soupy HTML is what we expect, in particular because, as we perform no structural induction, everything comes through "flat" - one 'p' after another. If there is any structure to what we emit, it is introduced only very carefully by XSLTs explicitly designated for that purpose.
|
|
|
|
|
|
Even sloppy HTML, however, should be relatively idiomatic -- it should look not too much worse than the examples above. (Permitting it to become very complicated would do no one any favors.) It can, however -- within the limitations of that idiom -- express anything we want it to.
|
|
|
|
|
|
If we want it to "sound like" the source, only with all its inadequacies rectified, this means translating the information we see into a language we understand. At the level of structures in the Word document (either lines/paragraphs or inline 'text', i.e. arbitrary ranges), we have either or both of two sorts of evidence. First is a label or 'family' or 'class' assignment by virtue of the name of an attached Style (in Word, a so-called Paragraph or Text style). Alternatively, or in addition, we have formatting properties designated explicitly, including font shifts, bold/italics, whitespace properties and so forth. (Properties assigned by any other mechanism are likely to be document-wide in scope, i.e. not differentiating, and therefore not of interest to us.)
|
|
|
|
|
|
As it happens, HTML's **class** and **style** attributes (permitted in valid HTML since the 1990s) give us everything we need, especially since the conventional use of **style** (and indeed of **class**) is to call on CSS (either embedded or bound externally) -- which (Fortune smiles) just happens to be an entire language, and a well-known vernacular, for expressing just the sort of information (more or less) we have at hand - namely things like font size and face shifts, etc.
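
As an illustration of how the two sorts of evidence might land in HTML, the following Python sketch (standard library only) maps a style name to a **class** value and explicit formatting properties to an inline **style** value. This is not how XSweet does it (that work is done in XSLT); the function names `css_class` and `to_html_p`, the name-mangling rule, and the sample properties are assumptions chosen for the example.

```python
import re
import xml.etree.ElementTree as ET


def css_class(style_name):
    """Turn a style name such as 'Block Quote' into a CSS-safe class name (hypothetical rule)."""
    return re.sub(r"[^A-Za-z0-9]+", "-", style_name).strip("-").lower()


def to_html_p(text, style_name=None, props=None):
    """Build a flat 'p' carrying the named style as class and explicit formatting as CSS."""
    p = ET.Element("p")
    if style_name:
        p.set("class", css_class(style_name))
    if props:
        # Sort the declarations so the output is stable from run to run.
        p.set("style", "; ".join(f"{name}: {value}" for name, value in sorted(props.items())))
    p.text = text
    return p


p = to_html_p(
    "A quoted passage",
    style_name="Block Quote",
    props={"font-style": "italic", "font-size": "10pt"},
)
print(ET.tostring(p, encoding="unicode"))
```

Keeping the style name and the explicit properties in separate attributes means a later refinement step can still tell which evidence came from where.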
|
|
|
|
|
|
Not only that, but since HTML permits us (among other things) to mix headers at all levels in with structures (p, div), we can produce a result that passes the input through *as it is (not) structured* -- thereby postponing to a later step the sometimes vexing problem of structural induction (specific or generalized).
|
|
|
|
|
|
### Oh, other requirements for output format
|
|
|
|
|
|
We want our output to be HTML5, preferably valid. However -- not least because we are pipelining in XSLT! -- we also want it to be well-formed XML. We can do this by stipulating that our results will be (implicitly) XHTML5 - with NO DOCTYPE declaration (as per HTML5 Rec) and NO XML declaration (as per XML Rec), encoded in UTF-8.
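
The output constraints above can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` rather than an XSLT serializer, so it is an illustration of the constraints and not the project's own serialization settings; the helper name `serialize_xhtml5` and the sample content are assumptions. It writes elements in the XHTML namespace as UTF-8 bytes, with no XML declaration and no DOCTYPE, and then confirms the result parses again as well-formed XML.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize the XHTML namespace as the default namespace (no prefix).
ET.register_namespace("", XHTML_NS)


def serialize_xhtml5(root):
    """Return UTF-8 bytes with no XML declaration; ElementTree never emits a DOCTYPE."""
    return ET.tostring(root, encoding="utf-8", xml_declaration=False)


# Build a tiny, flat document in the XHTML namespace.
html = ET.Element(f"{{{XHTML_NS}}}html")
body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
p = ET.SubElement(body, f"{{{XHTML_NS}}}p")
p.text = "Caf\u00e9, flat and well-formed"

data = serialize_xhtml5(html)

# The checks an output step could run on its result.
assert not data.startswith(b"<?xml")
assert b"<!DOCTYPE" not in data
reparsed = ET.fromstring(data)  # raises ParseError if the bytes are not well-formed XML
print(data.decode("utf-8"))
```

Omitting the XML declaration is safe here because an XML parser assumes UTF-8 when no declaration or byte order mark is present.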
|
|
|
|
|
|
### Suitability and capability of XSLT
|
|
|
|
|
|
We like XSLT because it is
|
|
|
|
|
|
* standardized (W3C)
|
|
|
* available as FOSS (Saxon-HE), OS neutral and portable
|
|
|
* functional / declarative (good for maintenance/stability as well as performance)
|
|
|
* modular and homoiconic (useful in pipelining)
|
|
|
* capable for the job, even from scratch (i.e. no third-party libraries necessary) and using only standard features
|
|
|
|
|
|
Our experience so far is that XSLT works for this application. Indeed some XSLTs may be perfectly legible even to "non programmers" who are also looking at inputs and outputs - while others perhaps are not. The collection of transformations (and its possible sequencings) may however make a useful object lesson.
|
|
|
|
|
|
|