... | ... | @@ -153,22 +153,24 @@ It turns out, among these the choice is fairly clear. Only a made-for-purpose la |
|
|
|
|
|
### Why HTML5 turns out to be 'most excellent'
|
|
|
|
|
|
For any target format serving a data conversion out of .docx, there would be three nice-to-haves:
|
|
|
|
|
|
as a target for a data conversion out of .docx, there would be three nice-to-haves:
|
|
|
Target format can capture presentational information (i.e. specs on how outputs should look on the 'page') - since that's the semantic domain of the .docx source, i.e. (much of) the information we have.
|
|
|
Target format may be unstructured or loosely structured; in any case structures are not enforced. That's because we know stuff coming in will have every structure and no structure.
|
|
|
Ideally it would be well-known, a lingua franca.
|
|
|
As you say, it turns out HTML/CSS is all these things.
|
|
|
Plus none of it is exclusive of XML discipline or even XML tools.
|
|
|
Indeed we could write schemas to validate whatever different 'flavors' or stages of HTML we wished to formalize.
|
|
|
* In order to reflect the 'semantic domain' of the .docx source format (which ultimately describes 'presentation' or is at least bound largely to presentation behaviors), we need to be able to capture presentational information, i.e. specs on how outputs should look on the 'page'. This is not because this information is valuable in itself, but because it serves as the "context of distinction" or "terrain" within which any putative semantics must be inferred. (And presumably a putative semantics must be able to map back into this presentation.)
|
|
|
|
|
|
Why does HTML make a good target vocabulary?
|
|
|
* The target format must support unstructured or loosely structured data; in any case structure constraints are not enforced. That's because we know stuff coming in will have every structure and no structure.
|
|
|
|
|
|
* Ideally our format of choice would be well-known, a *lingua franca*.
|
|
|
|
|
|
These obviously suggest a late-model HTML. We can stipulate HTML5 in an XML syntax (so nominally an XHTML5).
|
|
|
|
|
|
- We can easily leave our documents 'flat' as long as we need to - structure can come later! (This is a key distinction vs our final target format)
|
|
|
- It has `@class` and `@style`, fantastic escape hatches!
|
|
|
- HTML @style invites us to use CSS! (And we are describing presentational features. Perfect.) Meanwhile, HTML @class can expose Word Styles (since it is for user-driven semantic labeling after all).
|
|
|
- Yet at the same time, HTML semantics are not so rich as to be very arguable (anything will do); we should be able to avoid complications regarding the purported conformance/orthodoxy of our outputs to applicable standards. (Also a separable problem.)
|
|
|
- To top it off, HTML is a well-known vernacular --
|
|
|
|
|
|
Additionally, as against possible alternatives:
|
|
|
|
|
|
- A custom vocabulary would have to be designed, tested, documented and learned by users; HTML lets us just fake it for now;
|
|
|
- And since we are expecting to work further with our data on an HTML platform, and go from there when it comes to other formats for interchange/archiving - we can just stick with that plan ...
|
|
|
- ... Illustrating the point: anyone can use HTML5 (especially well-formed XML HTML5), so let's use that.
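
To make the `@class` and `@style` escape hatches concrete, here is a minimal Python sketch of how one 'flat' paragraph might be emitted. It is not XSweet code: the function names, the rule for turning a Word style name into a class token, and the sample properties are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET


def class_from_style_name(name):
    """Turn a Word style name such as 'Heading 1' into a single class token.

    Hypothetical rule: drop every character that is not a letter, digit,
    hyphen or underscore.
    """
    token = re.sub(r"[^A-Za-z0-9_-]+", "", name)
    return token or "unnamed"


def style_from_properties(props):
    """Serialize explicit formatting properties as CSS declarations."""
    return "; ".join(f"{name}: {value}" for name, value in sorted(props.items()))


def flat_paragraph(text, style_name=None, props=None):
    """Build a <p> carrying the style name on @class and the formatting on @style."""
    p = ET.Element("p")
    if style_name:
        p.set("class", class_from_style_name(style_name))
    if props:
        p.set("style", style_from_properties(props))
    p.text = text
    return p


p = flat_paragraph(
    "Introduction",
    style_name="Heading 1",
    props={"font-size": "16pt", "font-weight": "bold"},
)
print(ET.tostring(p, encoding="unicode"))
# <p class="Heading1" style="font-size: 16pt; font-weight: bold">Introduction</p>
```

Note that the result is still only a `p`: nothing here decides that the paragraph is a heading. That decision is left to a later step, which can read `@class` and `@style` as evidence.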
|
... | ... | @@ -177,7 +179,7 @@ Note that choosing HTML/CSS as a vocabulary does not mean we need utterly to for |
|
|
|
|
|
Indeed, over the medium or long term it could be useful to design XML schemas to describe any different stages or 'profiles' of HTML (perhaps, operational subsets of XHTML5/CSS) that we wished to formalize.
|
|
|
|
|
|
We make use of @style in ways that are deprecated and may be discouraged in other contexts. We justify our use of @style in this way as a means to an end - we represent information on @style so that it can be abstracted away (and the style can be gotten rid of).
|
|
|
In the meantime, we have a suitable carrier format. (See below for more.) It does make use of @style in ways that are deprecated and may be discouraged in other contexts. We justify our use of @style in this way as a means to an end - we represent information on @style so that it can be abstracted away (and the style can be gotten rid of).
|
|
|
|
|
|
Next to these, the fact that HTML also has an element-type semantics albeit an impoverished one - p, ul, li, the lists, tables - is useful, but not essential, as long as we have generic "hangers" we can use such as div, p and span.
|
|
|
|
... | ... | @@ -199,25 +201,24 @@ Even sloppy HTML, however, should be relatively idiomatic -- it should look not |
|
|
|
|
|
If we want it to "sound like" the source, while being prepared to rectify its inadequacies, this means translating the information we see into a language we understand. At the level of structures in the Word document (either lines/paragraphs or inline 'text', i.e. arbitrary ranges), we have either or both of two sorts of evidence. First is a label or 'family' or 'class' assignment by virtue of the name of an attached Style (in Word, a so-called Paragraph or Text style). Alternatively or in addition, we have explicit formatting properties as designated using Word's formatting tools, whether they be font shifts, bold/italics, whitespace properties and so forth. (Properties assigned by any other mechanism are likely to be document-wide in scope, i.e. not differentiating, and therefore not of interest to us.)
|
|
|
|
|
|
As it happens, HTML's **class** and **style** mechanisms (permitted in valid HTML since the 1990s) give us everything we need, especially since the conventional use of **style** (and indeed of **class**) is to call to CSS (either embedded or bound externally) -- which (Fortune smiles) just happens to be an entire language, and a well-known vernacular, for expressing just the sort of information (more or less) we have at hand - namely things like font size and face shifts, etc.
|
|
|
As it happens, HTML's **class** and **style** mechanisms (permitted in valid HTML since the 1990s) give us just about everything we need, especially since the conventional use of **style** (and indeed of **class**) is to call to CSS (either embedded or bound externally) -- which (Fortune smiles) just happens to be an entire language, and a well-known vernacular, for expressing just the sort of information (more or less) we have at hand - namely things like font size and face shifts, etc. That is, in the form of super-added CSS we have at hand a fairly complete language for describing the sorts of features we are likely to be interested in preserving, in an extraction process that must necessarily be *selective* in order to hear the signal in the noise.
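
As a rough illustration of what 'selective' could mean in practice, the following Python sketch reads a `style` value back into properties and keeps only those a later step cares about. The helper names and the `KEEP` set are hypothetical, and the parser is deliberately naive: it would mis-split values containing `;` or `:` inside quotes or `url(...)`.

```python
def parse_style(style):
    """Split an inline style value into a {property: value} dict (naive)."""
    props = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed fragments
        name, value = declaration.split(":", 1)
        props[name.strip().lower()] = value.strip()
    return props


# Hypothetical choice of properties treated as signal; everything else is noise.
KEEP = {"font-family", "font-size", "font-style", "font-weight", "text-align"}


def select_properties(props, keep=KEEP):
    """Keep only the properties we have decided to preserve."""
    return {name: value for name, value in props.items() if name in keep}


raw = "font-size: 12pt; Font-Weight:bold; widows: 2; "
print(select_properties(parse_style(raw)))
# {'font-size': '12pt', 'font-weight': 'bold'}
```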
|
|
|
|
|
|
Not only that, but since HTML permits us (among other things) to mix headers at all levels in with structures (p, div), we can produce a result that passes the input through *as it is (not) structured* -- thereby postponing to a later step, the sometimes vexing problem of a structural induction (specific or generalized).
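
To show that the flat form loses nothing, here is a sketch in Python of one simple, generalized structural induction that could run as a later step: it groups a flat run of headers and paragraphs into nested `section` elements by header level. The function name is hypothetical, the elements are un-namespaced for brevity, and it assumes headers have already been identified as `h1` to `h6`, which is itself part of the 'vexing problem'.

```python
import xml.etree.ElementTree as ET

HEADER_LEVELS = {f"h{n}": n for n in range(1, 7)}


def induce_sections(body):
    """Return a new <body> whose flat children are nested into <section>s."""
    result = ET.Element("body")
    stack = [(0, result)]  # (header level, open container); level 0 is the body
    for child in body:
        level = HEADER_LEVELS.get(child.tag)
        if level is not None:
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            stack.append((level, section))
        stack[-1][1].append(child)
    return result


flat = ET.fromstring(
    "<body><p>intro</p><h1>A</h1><p>a</p><h2>A1</h2><p>a1</p><h1>B</h1></body>"
)
print(ET.tostring(induce_sections(flat), encoding="unicode"))
```

Because the flat document is left untouched, a different induction (or none) can be tried on the same input.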
|
|
|
|
|
|
### Oh, other requirements for output format
|
|
|
|
|
|
We want our output to be HTML5, preferably valid. However -- not least because we are pipelining in XSLT! -- we also want it to be well-formed XML. We can do this by stipulating that our results will be (implicitly) XHTML5 - with NO DOCTYPE declaration (as per HTML5 Rec) and NO XML declaration (as per XML Rec), encoded in UTF-8.
|
|
|
We want our output to be HTML5, preferably valid. However -- not least because we like pipelining in XSLT -- we also want it to be well-formed XML. We can do this by stipulating that our results will be (implicitly) XHTML5 - with NO DOCTYPE declaration (as per HTML5 Rec) and NO XML declaration (as per XML Rec), encoded in UTF-8.
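
For readers who want to check these serialization requirements outside XSLT, the following Python sketch writes a small document as namespaced XHTML with no DOCTYPE and no XML declaration, encoded in UTF-8, and then confirms that it parses as XML. The function name and the sample content are assumptions made for illustration; only the standard library is used.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize the XHTML namespace as the default namespace (no prefix).
# Note that this registration is global to the ElementTree module.
ET.register_namespace("", XHTML_NS)


def serialize_xhtml5(root):
    """Return UTF-8 bytes with no XML declaration and no DOCTYPE."""
    # encoding="unicode" yields a str and never emits an XML declaration.
    text = ET.tostring(root, encoding="unicode")
    return text.encode("utf-8")


html = ET.Element(f"{{{XHTML_NS}}}html")
body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
p = ET.SubElement(body, f"{{{XHTML_NS}}}p", {"class": "Normal"})
p.text = "caf\u00e9"

output = serialize_xhtml5(html)
# Well-formedness check: this raises ParseError if the bytes are not XML.
reparsed = ET.fromstring(output)
```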
|
|
|
|
|
|
### Suitability and capability of XSLT
|
|
|
|
|
|
We like XSLT because it is
|
|
|
|
|
|
* standardized (W3C)
|
|
|
* available as FOSS (SaxonHE), OS neutral and portable
|
|
|
* functional / declarative (good for maintenance/stability as well as performance)
|
|
|
* modular and homoiconic (useful in pipelining)
|
|
|
* capable for the job, even from scratch (i.e. no third-party libraries necessary) and using only standard features
|
|
|
* maybe *different from* and yet *complementary with* more mainstream approaches for web or HTML content (however well established in publishing back ends otherwise)
|
|
|
|
|
|
Our experience so far is that XSLT works for this application. Indeed some XSLTs may be perfectly legible even to "non programmers" who are also looking at inputs and outputs - while others perhaps are not. The collection of transformations (and its possible sequencings) may however make a useful object lesson.
|
|
|
* Standardized (W3C)
|
|
|
* Available as FOSS (SaxonHE), OS neutral and portable
|
|
|
* Functional and declarative (good for maintenance/stability as well as performance)
|
|
|
* Modular and homoiconic (useful in pipelining)
|
|
|
* In general, XSLT transformations work well within transparent pipelining architectures
|
|
|
* Deployed this way XSLT is capable for the job, even from scratch (i.e. no third-party libraries necessary) and using only standard features
|
|
|
|
|
|
Our experience so far is that XSLT works for this application. Ultimately, of course, any solution must not only provide functionality; it must also be useful when copied, cloned and adapted, or we will not have achieved our larger goal. This will require a solution whose productions are traceable, legible and modifiable. This means that XSweet will be more than just the XSLTs; it will also be the 'folkways' developed in using it.
|
|
|
|