NOTE: THIS REPO WIKI INFORMATION HAS BEEN SUPERSEDED BY DOCUMENTATION ON THE PROJECT WEBSITE. SEE THE XSWEET PROJECT WEBSITE FOR THE MOST UP-TO-DATE DOCUMENTATION.
XSweet
A set of tools supporting data acquisition, editorial, and document production workflows on an XML stack with XML/HTML/CSS interfaces. We like the XML stack (XSLT in particular) for these purposes because it is well suited to encapsulating discrete processes in document transformation, providing performant, scalable, reusable and robust solutions in a 'pluggable' way. XSweet should "just work". But it should also be adaptable.
Aims (specific, general, short- and long-term):
- Conversion of MS Word "Office Open XML" (aka WordML) into "HTML typescript"
- Arbitrary HTML / CSS mapping and munging (HTML tweak)
- Validation services against ad-hoc (project based) schemas and constraint sets
- Conversion from editorial system (enhanced typescript) into structured targets (e.g. TEI, JATS/BITS)
Design constraints:
- All open source (specifications and components)
- W3C XSLT 2.0 is okay
- INK will provide a pipelining infrastructure, but pipelines need to be operational outside INK as well.
XSweet components (so far)
Each of these will be an INK recipe?
docx-extract
Input: an unbundled .docx file, specifically its document.xml.
Output: an HTML file (probably valid). The HTML will conform to HTML Typescript, an informal profile of XHTML5 that interprets a WordML document as "fair copy" for a subsequent publication process.
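By way of illustration only (this is not the project's actual extraction stylesheet; the handling of runs and styles is greatly simplified, and the element selection is an assumption), the core of such an extraction might look like:

```xml
<!-- Minimal sketch only: maps WordprocessingML paragraphs and runs to HTML
     p/span elements, carrying style names through as @class. The real
     docx-extract step does much more (notes, tables, formatting properties). -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  exclude-result-prefixes="w">

  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="//w:body/w:p"/>
      </body>
    </html>
  </xsl:template>

  <!-- A w:p becomes an HTML p, labeled with its paragraph style (if any). -->
  <xsl:template match="w:p">
    <p class="{w:pPr/w:pStyle/@w:val}">
      <xsl:apply-templates select="w:r"/>
    </p>
  </xsl:template>

  <!-- A run becomes a span, labeled with its character style (if any). -->
  <xsl:template match="w:r">
    <span class="{w:rPr/w:rStyle/@w:val}">
      <xsl:value-of select="w:t"/>
    </span>
  </xsl:template>

</xsl:stylesheet>
```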
html-validate
This should include HTML5 validation plus any ad-hoc (HTML Typescript or project-oriented) validation.
html-tweak
Make any adjustments to HTML @class/@style indicated by an externally exposed (configurable) driver file. This is an "otherwise copy" (modified identity) transformation (i.e. its results are a modified copy of the source with the only differences being the specified interventions).
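A minimal sketch of the 'otherwise copy' pattern, with a single hard-coded intervention standing in for what the driver file would supply (the class mapping shown is hypothetical):

```xml
<!-- Sketch of a modified identity ("otherwise copy") transformation:
     everything is copied as-is except the specified interventions.
     The mapping below is a placeholder; in XSweet this kind of adjustment
     would be driven by an external (configurable) driver file. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Identity template: copy everything by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- The single intervention: rewrite one class value (placeholder mapping). -->
  <xsl:template match="@class[. = 'BodyTextIndent']">
    <xsl:attribute name="class">body-text</xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```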
header-promote
Convert p elements into h1-h6 based on heuristics. Otherwise copy.
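A toy version of such a promotion, assuming (purely for illustration) that heading paragraphs arrive with class names like "heading 1"; the real heuristics are more involved:

```xml
<!-- Sketch only: promotes paragraphs whose @class looks like "heading N"
     to the corresponding h1-h6 element, copying everything else unchanged.
     The naming convention matched here is an assumption for illustration. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Identity: copy everything by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- p class="heading 1" .. "heading 6" becomes h1 .. h6. -->
  <xsl:template match="p[matches(@class, '^heading [1-6]$', 'i')]">
    <xsl:element name="h{substring-after(@class, ' ')}">
      <xsl:apply-templates select="@* except @class | node()"/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```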
ucp-adjust
Adjustments warranted by/for UCP and its projects. (Otherwise copy.) For example, @class='Hypertext' becomes a hypertext link.
produce-plaintext
Produces a plain-text version of the HTML content extraction (for comparison/proofing). Intended as a 'terminal' process, i.e. its results are not expected to be inputs downstream (though no one will stop you).
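A minimal sketch of such a plain-text 'terminal' step (the actual stylesheet may treat whitespace, notes and tables quite differently):

```xml
<!-- Sketch: serialize only the text content, one block element per line pair.
     Real plain-text production needs more care with whitespace, notes and
     tables; this just shows the shape of a 'terminal' step. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Emit each paragraph-level element's text followed by a blank line. -->
  <xsl:template match="p | h1 | h2 | h3 | h4 | h5 | h6 | li">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>

  <!-- Suppress text not reached through one of the block elements above. -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```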
produce-analysis
Produces a synoptic/analytic 'map' of the HTML output (element/attribute usage in hierarchy). A 'terminal' process (the transformation delivers results of analysis, not a modified copy of the source).
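One simple shape such an analysis can take, sketched with XSLT 2.0 grouping (the report format here is invented for illustration):

```xml
<!-- Sketch: a 'terminal' analysis that reports how many times each element
     name and each @class value occurs in the input, as a small XML report. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <xsl:template match="/">
    <analysis>
      <elements>
        <!-- Count occurrences of each element name. -->
        <xsl:for-each-group select="//*" group-by="name()">
          <element name="{current-grouping-key()}" count="{count(current-group())}"/>
        </xsl:for-each-group>
      </elements>
      <classes>
        <!-- Count occurrences of each @class value. -->
        <xsl:for-each-group select="//@class" group-by="string(.)">
          <class value="{current-grouping-key()}" count="{count(current-group())}"/>
        </xsl:for-each-group>
      </classes>
    </analysis>
  </xsl:template>

</xsl:stylesheet>
```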
html-frame
Populates a structured HTML section skeleton with contents from flat HTML typescript, i.e. converts from unstructured to structured form, following an externally specified structure (a 'frame'). This is implemented as a 'pull', so some defensiveness (validation/analysis/testing) is called for to avoid dropped data.
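In outline, a 'pull' lets the target structure drive the result and fetches content from the flat typescript into each slot. A sketch, with the frame hard-coded and slots selected by @class (both assumptions for illustration):

```xml
<!-- Sketch of a 'pull': the structure of the output is given (here it is
     hard-coded; in practice the frame would be externally specified), and
     content is pulled out of the flat typescript to fill each slot.
     Anything the frame does not ask for is silently dropped -- hence the
     need for validation/analysis afterwards. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- The flat HTML typescript is the main input document. -->
  <xsl:variable name="typescript" select="/"/>

  <xsl:template match="/">
    <html>
      <body>
        <section class="frontmatter">
          <!-- Pull title/author paragraphs into the front matter slot. -->
          <xsl:copy-of select="$typescript//p[@class = 'Title' or @class = 'Author']"/>
        </section>
        <section class="body">
          <!-- Pull everything else into the body slot. -->
          <xsl:copy-of select="$typescript//p[not(@class = ('Title', 'Author'))]"/>
        </section>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```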
html-polish
Cleanup of arbitrary HTML: removes redundancies and normalizes anomalies. (Otherwise copy.)
(old contents: move some to docx-extract page)
A suite of XSLT 2.0 stylesheets for .docx data extraction and refinement into HTML, for editorial workflows. (And perhaps eventually for other tasks. And also, possibly not always XSLT 2.0.)
Scope and goals
We aim to develop and share an open source toolkit on a commodity platform (XSLT) that provides good-enough data extraction from arbitrary Word (.docx) files into publishing workflows (editing and production) able to exploit HTML markup for ingest.
"Good enough" means that the tools are serviceable (or better) in actual document conversion workflows, while producing results at least as good (for these purposes) as available alternatives and pathways.
An important consideration for these purposes is that these stylesheets need to work (or may need to work) on arbitrary Word inputs, not just Word documents written to templates or (implicitly or explicitly) other constraint sets. Another is that the results do not have to be good enough to publish, just good enough to be worth editing further.
A pipeline, not a stylesheet
For maximum flexibility and maintainability, we deploy an XSLT-based solution not as a single transformation but as a series of transformations to be arranged in a sequence (pipeline). Since this is exactly analogous to INK's processing model (an INK 'recipe' is a pipeline), for ongoing projects we can deploy this straightforwardly on INK, while also remaining platform independent with respect to pipelining technology. (I.e., the same XSLTs in the same sequence will work the same in another environment; this is commodity/standard XSLT 2.0.)
One advantage this gives us is transparency. Since each XSLT does less, holes and bugs are easier to find and fill than in a single relatively opaque XSLT (which may run pipelines internally). A suite of smaller XSLTs should be easier to understand, maintain and modify than a single monolithic stylesheet.
Another advantage is flexibility. We may be able to deploy suites of modules to be used together and separately in "mix-and-match" combinations.
If we run into performance issues due to overhead in this architecture (e.g. for parsing/serialization of temporary results) we can consider strategies for mitigation.
Experience has shown that the transformation Word .docx -> structured markup is difficult to specify and implement. We believe one reason it has been difficult is that assumptions have been made regarding requirements which do not actually apply in many or most cases -- and in particular, which do not apply when a significant editing phase is planned for after conversion.
HTML as a target format
The main (only) operational target for the system is (so far) ingest by PubSweet. Any spinoff applications for the HTML produced by this pipeline are nice-to-haves, not presently requirements. This gives us a great deal of flexibility in the design of an optimal format for use in PubSweet by its users (and client applications).
However, an important consideration is that none of these are in scope for this transformation:
- Any attempt to represent in the HTML result the "correct" intellectual or logical structure (which is to say, typically, chapters, sections, nested parts, etc.) beyond simply preserving paragraph style names. Document structure will be provided or "restored" in a subsequent development phase (editorial step), after ingest of the data. This relieves the XSLT of having to perform a semantic/structural induction that is difficult to specify for any reasonable common case, and impossible to specify for the general case. While it may be possible to aid editing by providing some obvious structure for things like figures, boxed text, margin callouts, etc., where the Word source encoding does not have (explicit) structure, the HTML will not seek to represent it.
- Any effort to identify or extract metadata fields. Metadata will be provided via another channel; if we want, an additional pipeline step (XSLT) can provide for metadata injection (like any enhancement), but its source is not expected necessarily to be the Word (.docx) data.
- In general, any "semantic" interpretation of ad-hoc (local) names in the Word document. For example, a segment marked with the style "Italic" may be so marked in the HTML (as a <span class="Italic">), but not marked up as an HTML i element or represented as italic in any other way.
None of these rules are absolute. In particular, because it will be difficult to be both comprehensive and succinct (economical), the particulars of the target format (as respects element types, attribute values, etc.) are probably best designed in use. HTML tagging works as a baseline because it is a vernacular and developers know what to expect from it -- so it gives us some (broad) boundaries going forward.
Development of a formal spec for such a format is an item tbd. For now, we intend to produce "good pudding" that can be proven by eating it.
"HTML slops" or HTML soup is what we call the output - it is messy, but nutritious. Broadly speaking, it should be considered a (fairly weak) form of HTML or HTML5, using XML syntax (albeit an empty XML prologue) making it amenable to both XML and HTML parsers.
Extract, then refine.
In the first step, extraction, data is pulled from the Word document. Because WordML and its ilk (Office Open XML) are less markup languages than object serializations, their representation of the text must be "folded" into a markup-idiomatic version via a peculiar kind of structural transformation (an XSLT Level 10 spell).
The resulting HTML will be a correct and fairly transparent representation (albeit expressed in HTML-ese) of the content of the Word document. It will not be optimal: for example, it is likely to be very redundant and repetitive. We tolerate this in data extraction for the sake of transparency and traceability.
Following extraction, data (now HTML) may be piped through a sequence of steps, each performing a different task in data or markup processing, including cleanup. (For example, one such post-process 'collapses' tagging so that <i>Wuthering</i><i> Heights</i> becomes <i>Wuthering Heights</i>.)
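A sketch of how such a collapse can be done with XSLT 2.0 grouping (what the real cleanup step counts as 'mergeable' may differ):

```xml
<!-- Sketch: merge runs of adjacent i elements into a single element, so
     <i>Wuthering</i><i> Heights</i> becomes <i>Wuthering Heights</i>.
     Everything else is copied unchanged. What counts as "identical"
     (attributes, other inline names) is simplified here for illustration. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Identity: copy everything by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- In element content, group adjacent i elements and emit one i per run. -->
  <xsl:template match="*[i]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:for-each-group select="node()" group-adjacent="boolean(self::i)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <i>
              <xsl:apply-templates select="current-group()/node()"/>
            </i>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```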
The goal of the pipeline as a whole (though not of the first step taken by itself) is as clean and simple a representation as possible of the 'labeling' of document parts implicit in formatting properties and style names assigned in the WordML data to regions of text, with minimal (ideally no) 'interpolation' of 'meaning' beyond what is given in the source data (either indirectly via formatting properties, or nominally via style names).
Note that separating requirements into "extract" and "refine" permits us to design each separately. Possibly either phase could support customization layers. Initially our goal is to see how much we can do with only generic logic.
Iterative development model
Since many of the particular requirements for data capture and representation can only be defined in use, project feedback is essential to further development of these stylesheets.
For the time being this will involve old-fashioned comparisons between the Word source data (viewed both as a printed page and as an artifact in Word) and the HTML produced by the XSLT as exposed in, and interfaced with tools in, PubSweet, along with reporting lapses, discrepancies, and opportunities for improvement. (The first question is always: did we get it all?)
As the system matures it should require less time from an XSLT developer to process a single (set of) Word documents for PubSweet. A sustainable system may permit configuration or tweaks to be performed routinely by team members with XSLT (or other) skills; but it should not require deep knowledge (of XSLT) to keep it running or to extend it to new cases. Ideally, it would be well enough documented that new users could teach themselves. (Something to aim for.)
Keep in mind that another advantage of a pipelining architecture is that XSLT can be combined into pipelines with transformations implemented in other languages. It is taken as an assumption from the outset that this tool will only succeed if it works well with other tools, most especially other tools that people actually use.
Big Questions
What should come through and what shouldn't and how do we know?
While the transformation is lossless as respects "main document content" (that is, data that is stored as the nominal "document text" in the Word file), there is also much other data in a Word source file that should come with it (the most obvious: footnotes and figures), and due to the ("database-like") organization of the .docx source data, it is difficult or impossible to ensure a transformation will never drop data. (Especially since some of the info in the Word document -- page headers come to mind as an example -- should arguably not come into the HTML in any case.)
The solution is vigilance, contextual awareness (of both processes and contingencies), and testing. In addition to some means of representing and aligning specifications for the different XSLTs (such as inline documentation), we need robust mechanisms for detecting problems in data extraction, especially lost data; for ameliorating such problems in the instance (sometimes they may not be fatal errors); and for maintaining and improving the XSLTs so they don't happen.
For now this is happening under Issues in this repository.
Expected problem areas? Of course, citations and references, math, cross-references.