... | ... | @@ -2,11 +2,22 @@ |
|
|
|
|
|
Not a formal subset or profile of HTML or any other formally specified language, but an *approach* to using HTML and web pages.
|
|
|
|
|
|
### Principles
|
|
|
HTML Typescript has nothing to do with Word documents or word processing documents - except that it is designed so that recognizable HTML Typescript documents are relatively *easy* and *straightforward* to produce reliably from Word OpenXML.
|
|
|
|
|
|
* Keep all the data. Where possible, we design everything to come through by default. But: we don't take this to an extreme; it doesn't actually mean we have to capture literally everything. For example, we treat certain parts of documentary apparatus (such as page headers or page layout settings) as incidental and dispensable by design (because part of our job is to decide those things). But we don't drop anything by accident.
|
|
|
* Expect the data to be flat. "One darn paragraph after another." Semantic distinctions are typically represented, if at all, only implicitly, by means of varying formatting properties. Sometimes we are lucky. For example, if the Word document comes in all nicely styled with a known template - we capture that info in the HTML. But we only copy labels, and don't try to assign semantic categories.
|
|
|
** Subpipelines (separate recipes) such as the header-promote pipeline can provide enhancements over and above this, but these function as optional add-ons.
|
|
|
In particular, this means that insofar as information available only from Word documents is ambiguous and underspecified, an HTML Typescript version must mirror (exactly) that ambiguity and underspecification. It should tell us (albeit in HTML+CSS, a language we can understand) exactly what we could know from the Word (if we understood WordML) - and no more. It should not make logical leaps, inferences, or even much in the way of translation.
|
|
|
|
|
|
### Design principles
|
|
|
|
|
|
A typewriter can be used to create an artifact (namely a typed MS or typescript) amenable to a type- and print-based publication process. Although a material object, the typescript also typically provides for a kind of *encoding* by which it can communicate intentions from author to editors. Of course a typescript is also a "platform" for changes, with a kind of "production loop" built around it (wherein it might be said to evolve in form, from fair copy to galleys to printed production).
|
|
|
|
|
|
We aim to provide the same sort of "paper functionality" in HTML. It is not an electronic scratchpad - what we see are recognizably documents, with the features of formatted documents. But it isn't a formal or even very regular arrangement. It reflects whatever regularities "emerge" from the habits and practices of its original authors - those and no more. This is good, since those regularities (such as they are) are exactly the ones we wish to see.
|
|
|
|
|
|
### Translation principles
|
|
|
|
|
|
* When extracting from a word processor -- wherever possible, we design everything to come through by default. But: we don't take this to an extreme; it doesn't actually mean we have to capture literally everything. For example, in a word processor document, we treat certain parts of documentary apparatus (such as page headers or page layout settings) as incidental and dispensable by design (because part of our job is to provide those things). But we don't drop anything by accident.
|
|
|
* But -- expect data coming out to be flat. "One darn paragraph after another." Semantic distinctions are typically represented, if at all, only implicitly, by means of varying formatting properties. Sometimes we are lucky. For example, if the Word document comes in all nicely styled with a known template - we can capture that info in the HTML, because HTML Typescript has a way to do this, no problem. But we only copy labels -- we don't try to assign semantic categories.
|
|
|
* Instead, we focus on capturing and correctly representing what we see, and presenting it in such form as a downstream process can do what's called for.
|
|
|
** Exception: subpipelines (separate recipes) such as the header-promote pipeline can provide enhancements over and above straight "extraction", but these function as optional add-ons.
|
|
|
* Cross-references are emulated for footnotes and endnotes (as regular structures created as such in the Word document), but by and large, no linking or cross-referencing will be implemented, even where the source warrants it. Where the source data has a string, we get a string and nothing more. (We expect robust linking mechanisms to come later when data is better structured and more regular.)
|
|
|
* `class` overloading is okay. `class` is convenient for any semantic labeling. We will be more or less shameless in adding whatever values we think we need, to communicate downstream.
|
|
|
* Similarly, `style` and other *presentational* formatting is not only acceptable - for some applications it is preferred. Word has the feature that any arbitrary span or segment of text may be assigned its own properties for presentation. (And writers using Word do this a lot.) This has a straightforward analogue in the HTML `@style` attribute, which is (for ordinary everyday purposes) deprecated or discouraged. It permits us to hang CSS to describe formatting basically wherever we like [example].
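
The `class` and `style` conventions in the list above can be sketched in a few lines. This is an illustration only, not XSweet code: the helper `styled_paragraph`, the style name `BodyText`, and the CSS values are all hypothetical. It shows a paragraph whose `class` simply copies a Word style name, with ad hoc formatting kept as inline CSS on the span where it occurred.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialize XHTML as the default namespace


def q(tag):
    """Qualify a local element name with the XHTML namespace."""
    return f"{{{XHTML_NS}}}{tag}"


def styled_paragraph(style_name, runs):
    """Build a <p> whose @class copies the Word style name as a label.

    `runs` is a list of (text, css) pairs; css may be None for plain runs.
    No semantic category is assigned: the label is copied, nothing more.
    """
    p = ET.Element(q("p"), {"class": style_name})
    for text, css in runs:
        span = ET.SubElement(p, q("span"))
        if css:
            span.set("style", css)  # presentational formatting stays inline
        span.text = text
    return p


p = styled_paragraph(
    "BodyText",  # hypothetical Word style name
    [("An ", None), ("emphatic", "font-style: italic"), (" word.", None)],
)
print(ET.tostring(p, encoding="unicode"))
```

The formatting is described where it was found and left for a downstream process to interpret.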
|
... | ... | @@ -17,33 +28,41 @@ Indeed, the very reasons why @style and presentational tagging in general should |
|
|
|
|
|
The expectation of HTML Typescript is that we wish the thing to be a basis for improvement, not a finished thing in itself.
|
|
|
|
|
|
## Syntactic requirements (these preserve interoperability with the XML transformation/query stack)
|
|
|
## XSweet implementation
|
|
|
|
|
|
### Syntactic requirements (these preserve interoperability with the XML transformation/query stack)
|
|
|
|
|
|
* XML syntax w/ no tag minimization. Everything in XHTML namespace; otherwise avoid namespace complications
|
|
|
* Implicitly (X)HTML5 w/ no XML declaration and no DOCTYPE declaration (thus serializable using XML serializers)
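
A rough way to check these syntactic requirements, assuming only the Python standard library, might look like the sketch below. The function name `syntax_problems` is hypothetical and the DOCTYPE test is a crude string match; it is not a substitute for real validation.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def syntax_problems(text):
    """Return a list of ways `text` departs from the syntactic requirements."""
    problems = []
    if text.lstrip().startswith("<?xml"):
        problems.append("XML declaration present")
    if "<!doctype" in text.lower():  # crude check, good enough for a sketch
        problems.append("DOCTYPE declaration present")
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        # Tag minimization (e.g. <br> with no close) ends up here.
        problems.append(f"not well-formed XML: {err}")
        return problems
    prefix = "{" + XHTML_NS + "}"
    foreign = sorted({el.tag for el in root.iter() if not el.tag.startswith(prefix)})
    if foreign:
        problems.append("elements outside the XHTML namespace: " + ", ".join(foreign))
    return problems


doc = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hi<br/></p></body></html>'
print(syntax_problems(doc))
```

A document that passes can be read and written by any XML parser or serializer, which is the point of the two requirements.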
|
|
|
|
|
|
## Validation
|
|
|
### Validation
|
|
|
|
|
|
Ideally, any HTML Typescript document will be valid to the HTML5 Rec and all relevant specs. However, we can't practically require every tool to produce valid output all the time, so formal validation to an external model is not in scope for HTML Typescript - even the HTML model.
|
|
|
|
|
|
This is because validation is so important and sensitive that it requires a toolkit of its own, with no interference here!
|
|
|
|
|
|
## High-level structure / scaffolding
|
|
|
### High-level structure / scaffolding
|
|
|
|
|
|
docx-extract section
|
|
|
|
|
|
xsweet-footnotes and xsweet-endnotes sections
|
|
|
|
|
|
## "Paragraph level", @class and @style
|
|
|
### "Paragraph level", @class and @style
|
|
|
|
|
|
Overloading class.
|
|
|
|
|
|
## Inline markup
|
|
|
### Inline markup
|
|
|
|
|
|
Element types, @class and @style
|
|
|
|
|
|
`u` and `i` vs (e.g.) `strong` and `em`. Escaping to CSS when tags are not available (e.g. small caps).
|
|
|
|
|
|
## CSS refactoring
|
|
|
### CSS (re)writing and refactoring
|
|
|
|
|
|
### Licenses to rewrite
|
|
|
|
|
|
XSweet components can be applied to perform certain regularizations, such as promotion of inline CSS to the paragraph when applied to everything in the paragraph, or grouping of like elements appearing in sequence.
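
The first of these regularizations, promoting inline CSS to the paragraph, can be sketched as follows. This is a simplified illustration in Python, not the XSweet implementation: the function `promote_shared_style` is hypothetical, it compares `style` strings literally, and it does not distinguish inherited from non-inherited CSS properties.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def promote_shared_style(p):
    """Move @style from the spans to <p> when it covers the whole paragraph.

    Promotion happens only if every child is a span, all spans carry the
    identical @style string, and no text sits outside the spans.
    Returns True if the paragraph was rewritten.
    """
    children = list(p)
    if not children or p.text:
        return False
    styles = set()
    for child in children:
        if local_name(child.tag) != "span" or child.tail:
            return False
        styles.add(child.get("style"))
    if len(styles) != 1 or None in styles:
        return False
    shared = styles.pop()
    existing = p.get("style", "").strip().rstrip(";")
    # Append after any existing declarations so the promoted values still win.
    p.set("style", "; ".join(part for part in (existing, shared) if part))
    for child in children:
        del child.attrib["style"]
    return True


para = ET.fromstring(
    '<p><span style="font-weight: bold">A</span>'
    '<span style="font-weight: bold">B</span></p>'
)
promote_shared_style(para)
print(ET.tostring(para, encoding="unicode"))
```

The strict test (no text at all outside the spans) errs on the side of leaving the source untouched, in keeping with the rule that nothing is inferred beyond what the data shows.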
|
|
|
|
|
|
## Sticky bits
|
|
|
### Sticky bits
|
|
|
|
|
|
|