Commit 862a6d9a authored by Wendell Piez

More docs

parent d2c1e2dc
......@@ -12,7 +12,7 @@ The stylesheets in XSweet include other XSLTs as well, maintained in the HTMLeva
Why use XSweet
- Works reasonably well with defaults
- Works reasonably well with defaults*
- Works reasonably well on arbitrary inputs
- Alternatively, you can assert control
- Deciding which phases (transformations) to include or exclude
......@@ -20,11 +20,22 @@ Why use XSweet
- Acts like a black box, but isn't: due to its "get insidability", XSweet is adaptable and extensible
- Versatile, powerful, scalable: doorway to XSLT
* You do have to "pick your pathway"
Also, please use XSweet in addition to other tools that do parts or all of the job. There is no wrong way to use XSweet. You might find you like another extractor, but like to use XSweet to clean up afterwards. Or maybe the other way around.
XSweet's various components are all written as straightforward "one-up" XSLT transformations with some meta-transformations and other advanced stuff mixed in. (The basic "extract from docx" transformation combines five separate transformation steps, operating on a zip file unzipped on the system.) Operations are broken out very discretely so a single XSLT typically has a single task to do. This enables very flexible mixing and matching the transformations called in for particular pipelines. If there's something you don't need, you simply leave it out. If there's something you want, you add it.
## Pick your pathway
Picking your pathway entails two choices: first, picking your framework or glue language; then, selecting and arranging the transformations to be executed on your data. Since XSweet's first challenge was Word data conversion, the working assumption is that your input data may be WordML (an XML format maintained internally in Word docx files); but components of XSweet might also be useful in other document processing architectures, especially XML-based HTML processing architectures.
Since this endeavor is open-ended and everyone's solution will be different, XSweet offers several "prefab pathways" using several different "glue" solutions, for inspection and emulation. An easy way to start is just to try whichever of these you think will work best for you.
## XSweet architecture: mix and match your XSLTs
XSweet's various components are all written as straightforward "one-up" XSLT transformations with some meta-transformations and other advanced stuff mixed in. (The basic "extract from docx" transformation combines five separate transformation steps, operating on a zip file unzipped on the system.) Operations are broken out discretely so a single XSLT typically has a single task to do. This enables flexible mixing and matching of the transformations called in for particular pipelines. If there's something you don't need, you simply leave it out. If there's something you want, you add it.
Using a glue language of some sort, transformations are strung together in chains or "pipelines". There are many ways of building and operating a pipeline of transformations, running all the way from old-fashioned scripts to full-fledged pipelining technologies such as those supporting XProc, a W3C specification that describes a pipeline language. As long as you can get them to work independently, XSweet will also work irrespective of the particular approach you take to putting its inputs and outputs together.
This does mean you need some kind of glue (language, environment) to string the transformations together. Fortunately there are many ways of building and operating a pipeline of transformations, running all the way from old-fashioned scripts to full-fledged pipelining technologies such as those supporting XProc, a W3C specification that describes a pipeline language. As long as you can get them to work independently, XSweet will also work irrespective of the particular approach you take to putting its inputs and outputs together.
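As a rough illustration of the "old-fashioned script" end of that range, the sketch below uses Python as the glue and builds one Saxon command line per step, each reading the file the previous step wrote. The `-s:`, `-xsl:` and `-o:` options are Saxon's command-line options; the jar name `saxon-he.jar`, the stylesheet names and the `stepNN.xml` file naming are assumptions made for the example, not XSweet conventions.

```python
from pathlib import Path


def build_pipeline_commands(source, stylesheets, workdir, saxon_jar="saxon-he.jar"):
    """Return one Saxon command per step; each step reads the previous step's output.

    The jar name and the stepNN.xml naming scheme are placeholders for this sketch.
    """
    commands = []
    current = Path(source)
    workdir = Path(workdir)
    for index, xsl in enumerate(stylesheets, start=1):
        result = workdir / f"step{index:02d}.xml"
        commands.append([
            "java", "-jar", saxon_jar,
            f"-s:{current}",   # source document for this step
            f"-xsl:{xsl}",     # stylesheet to apply
            f"-o:{result}",    # where this step writes its result
        ])
        current = result       # the next step consumes this output
    return commands


if __name__ == "__main__":
    # Hypothetical stylesheet names; substitute the XSLTs you actually want to chain.
    for command in build_pipeline_commands("document.xml", ["first.xsl", "second.xsl"], "out"):
        print(" ".join(command))
```

Each command list could then be handed to `subprocess.run(command, check=True)` in order. Writing every intermediate result to disk is simple to debug but adds I/O overhead compared with engines that pass results along in memory.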
A pipelining or "glue" framework must be able to:
......
......@@ -19,12 +19,22 @@ Note: it's a convention in this project to name stylesheets with components of t
## `css-abstract`
Does its best to rewrite style properties into CSS. YMMV.
## `docx-extract`
Pulls HTML out of .docx. Assumes `document.xml` as the (primary) source document. Several XSLTs here may also make reference to other XML documents in the (docx, unzipped) repository, such as `styles.xml`, `footnotes.xml` and the like.
See the readme for more info.
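Since the extraction works on an unzipped docx, a glue script has to unpack the archive first. The following is a minimal sketch of that preparatory step using only the Python standard library; the function names are made up for the example, and the sample archive it writes is a stand-in with placeholder XML, not a valid Word file.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_docx(docx_path, target_dir):
    """Unzip a .docx and report where the main document and sibling XML parts landed."""
    target = Path(target_dir)
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(target)
    word_dir = target / "word"
    # Sibling parts such as styles.xml or footnotes.xml sit next to document.xml.
    siblings = sorted(p.name for p in word_dir.glob("*.xml") if p.name != "document.xml")
    return word_dir / "document.xml", siblings


def write_sample_docx(path):
    """Write a tiny stand-in archive (not a valid Word file) for demonstration."""
    with zipfile.ZipFile(path, "w") as archive:
        for name in ("document.xml", "styles.xml", "footnotes.xml"):
            archive.writestr(f"word/{name}", "<placeholder/>")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.docx"
        write_sample_docx(sample)
        document, siblings = unzip_docx(sample, Path(tmp) / "unzipped")
        print(document.name, siblings)
```

The returned path to `document.xml` is what would be passed to the first transformation as its source document.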
## `html-polish`
Steps expected to be final or near-final.
## `list-promote`
Makes HTML `ol` and `ul` elements from WordML inputs (marked as list items).
## `local-fixup`
## `produce-analysis`
......
# XSLT Pipelines and pipeline languages
XSLT:
* A 4GL (fourth-generation language) specifically designed for and well-suited to this task
* Its "document transformation" architecture fits well with most publishing workflows (in some form or other)
* Standards basis and available open source tools render it effectively platform (vendor) independent
Pipelines:
The way XSLT (and its forerunner technologies) works is by specifying a *transformation*, but it does not answer the question of what that consists of, exactly. We change this into that, but it turns out that the "this" is detailed and complex (it is not just a tree, it is a trunk, bark and leaves) -- and so is the "that".
A tried and true method of handling complexity in transformations is to break it down into simpler parts, which can be developed and tested separately before they are coordinated together. XSLT, as a declarative and functional language, mandates no specific mechanism by which its own operations may be achieved -- its own deployment architectures are too various for that, and anyone who has the capability of running one transformation has the capability of running a second one after the first. If the input of the second transformation is the result of the first, this is a pipeline. So pipelining is what you do with XSLT, not the way you do it.
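The idea that a pipeline is just "the result of one step becomes the input of the next" can be sketched independently of any XSLT engine. In the example below, two small Python functions stand in for XSLT transformations (they are illustrative only, not XSweet steps), and `run_pipeline` is a hypothetical helper that folds a document through them in order.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def remove_empty_paragraphs(body):
    """Stand-in step 1: copy the document, leaving out whitespace-only paragraphs."""
    result = ET.Element(body.tag)
    for p in body:
        if (p.text or "").strip():
            ET.SubElement(result, p.tag).text = p.text
    return result


def normalize_whitespace(body):
    """Stand-in step 2: copy the document, collapsing runs of whitespace in each paragraph."""
    result = ET.Element(body.tag)
    for p in body:
        ET.SubElement(result, p.tag).text = " ".join((p.text or "").split())
    return result


def run_pipeline(document, steps):
    """Feed the result of each step to the next one, in order."""
    return reduce(lambda doc, step: step(doc), steps, document)


if __name__ == "__main__":
    source = ET.fromstring("<body><p>  Hello   world </p><p>   </p></body>")
    output = run_pipeline(source, [remove_empty_paragraphs, normalize_whitespace])
    print(ET.tostring(output, encoding="unicode"))
    # prints <body><p>Hello world</p></body>
```

Because each step does one small job and returns a new document, steps can be tested on their own and then added to or dropped from the list without touching the others.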
In the case of XSweet, pipelines may be considerably longer: as many as eight, ten, twelve or more XSLT transformations in a chain, each consuming the output of the last. Moreover, XSweet includes particular operations that require not a single transformation but a small coordination or choreography of several "document" inputs and outputs. For example, one operation (header promotion) requires that a document be run through a transformation producing an analysis, which is subsequently used to produce *an XSLT stylesheet*, which is then applied back again to the original document. (This way, features of the particular document can be encoded directly into the transformation to be applied to it.) Pipelines, in other words, can have branches, both on the input side (such as multiple document inputs, not just because you have a stack of chapters, but also because you have metadata or configuration that goes with all of them) and on the output side (multiple outputs for multiple purposes of analysis, representation and formatting).
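The shape of that analyze, generate, apply choreography can be sketched as follows. This is only an analogy: a Python closure stands in for the generated XSLT stylesheet, and the `size` attribute and the "larger than the most common size means header" heuristic are invented for the example and should not be read as XSweet's actual header-promotion logic.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def analyze(body):
    """Step 1: produce an analysis; here, a count of paragraphs per (hypothetical) size."""
    return Counter(int(p.get("size", "0")) for p in body.iter("p"))


def generate_transform(analysis):
    """Step 2: turn the analysis into a transformation (stand-in for a generated XSLT)."""
    if not analysis:
        return lambda body: body
    body_size = analysis.most_common(1)[0][0]
    larger = sorted((size for size in analysis if size > body_size), reverse=True)
    # Facts about this particular document are baked into the generated transformation.
    level_for_size = {size: f"h{rank}" for rank, size in enumerate(larger, start=1)}

    def transform(body):
        result = ET.Element(body.tag)
        for p in body.iter("p"):
            tag = level_for_size.get(int(p.get("size", "0")), "p")
            ET.SubElement(result, tag).text = p.text
        return result

    return transform


def promote_headers(body):
    """Step 3: apply the generated transformation back to the original document."""
    return generate_transform(analyze(body))(body)


if __name__ == "__main__":
    source = ET.fromstring(
        '<body><p size="24">Title</p><p size="12">One</p>'
        '<p size="18">Section</p><p size="12">Two</p></body>'
    )
    print(ET.tostring(promote_headers(source), encoding="unicode"))
    # prints <body><h1>Title</h1><p>One</p><h2>Section</h2><p>Two</p></body>
```

The point of the pattern is that the original document is read twice: once to learn about it, and once to be transformed by something built from what was learned.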
How is such a pipeline achieved? *Any way you like*. However, as with everything, there are tradeoffs. In particular, in the case of an XSLT application running over XML documents (stored in a file system or shared over the wire), there are considerations related to the overhead of moving and manipulating data in such a system. Sometimes the same operation that takes minutes to accomplish in one environment takes only seconds in another (that is, an order-of-magnitude difference can be observed), purely because of how the transformations are executed and chained. However, a slow pipeline technology might have other advantages: it might be quick and easy to set up using a language a programmer already knows.
XSweet was designed and tested using the industry-leading XSLT engine, SaxonHE, open source software written in Java, supporting XSLT and XPath 3.0. However, we have pipelined calls to execute particular transformations in numerous ways -- the only thing they have in common is XSLT and Saxon themselves.