... | ... | @@ -2,12 +2,48 @@ |
|
|
|
|
|
Not a formal subset or profile of HTML or any other formally specified language, but an *approach* to using HTML and web pages.
|
|
|
|
|
|
Principles
|
|
|
* We aim to keep all the data. Where possible, we design everything to come through by default. This doesn't mean we have to capture everything. For example, we treat certain parts of the documentary apparatus (such as page headers or page layout settings) as incidental and dispensable. But we don't drop anything by accident.
|
|
|
* Expect the data to be flat. One darn paragraph after another. Semantic distinctions are typically represented, if at all, only implicitly, by means of varying formatting properties.
|
|
|
* Cross-references are emulated for footnotes and endnotes, but by and large, there will be no linking or cross-referencing implemented - because they are simply not warranted in the source. Where the source data has a string, we get a string and nothing more.
|
|
|
### Principles
|
|
|
|
|
|
* Keep all the data. Where possible, we design everything to come through by default. We don't take this to an extreme: it doesn't mean we have to capture literally everything. For example, we treat certain parts of the documentary apparatus (such as page headers or page layout settings) as incidental and dispensable by design, because part of our job is to decide those things. But we don't drop anything by accident.
|
|
|
* Expect the data to be flat. "One darn paragraph after another." Semantic distinctions are typically represented, if at all, only implicitly, by means of varying formatting properties. Sometimes we are lucky: for example, if the Word document comes in nicely styled with a known template, we capture that information in the HTML. But we only copy labels, and don't try to assign semantic categories.
|
|
|
** Subpipelines (separate recipes) such as the header-promote pipeline can provide enhancements over and above this, but these function as optional add-ons.
|
|
|
* Cross-references are emulated for footnotes and endnotes (as regular structures created as such in the Word document), but by and large, no linking or cross-referencing will be implemented, because none is warranted in the source. Where the source data has a string, we get a string and nothing more. (We expect robust linking mechanisms to come later, when data is better structured and more regular.)
|
|
|
* `class` overloading is okay. `class` is convenient for any semantic labeling. We will be more or less shameless in adding whatever values we think we need, to communicate downstream.
|
|
|
* `style` and other *presentational* formatting is not only acceptable - for some applications it is preferred. Word has the feature that any arbitrary span or segment of text may be assigned its own properties for presentation. This has a straightforward analogue in the HTML `@style` attribute, which is (for ordinary everyday purposes) deprecated or discouraged. In effect, CSS provides us an entire language to expose a range of features in our source data, and it is a language familiar to developers and readily processed using tools they already have. So if the Word says "1 inch left margin", we can turn that into "margin-left: 72pt" or indeed "margin-left: 1in" in CSS.
|
|
|
The very reasons why @style and presentational tagging in general should be avoided in "good" markup -- because *at best*, they represent *work to be done*, while at worst they are misleading cruft -- these are the very same reasons why presentational tagging, `i`, `b` and all that species, are acceptable and indeed preferable as a representation (as 'transparent' as possible) of Word source data. Because that's what the Word data has, and we need to see exactly what is there prior to "casting" it into anything.
|
|
|
* Similarly, `style` and other *presentational* formatting is not only acceptable - for some applications it is preferred. Word has the feature that any arbitrary span or segment of text may be assigned its own properties for presentation. (And writers using Word do this a lot.) This has a straightforward analogue in the HTML `@style` attribute, which is (for ordinary everyday purposes) deprecated or discouraged. It permits us to hang CSS to describe formatting basically wherever we like [example].
|
|
|
|
|
|
In effect, CSS provides us an entire language to expose a range of features in our source data, and it is a language familiar to developers and readily processed using tools they already have. So if the Word says "1 inch left margin", we can turn that into "margin-left: 72pt" or indeed "margin-left: 1in" in CSS.
|
|
|
|
|
|
Indeed, the very reasons why @style and presentational tagging in general should be avoided in "good" markup -- because *at best*, they represent *work to be done*, while at worst they are misleading cruft -- these are the very same reasons why presentational tagging, `i`, `b` and all that species, are acceptable and indeed preferable as a representation (as 'transparent' as possible) of Word source data. Because that's what the Word data has, and we need to see exactly what is there prior to "casting" it into anything.
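A minimal sketch of the margin example above, in Python. It assumes the Word side hands us a paragraph style name and a left indent measured in twips (twentieths of a point, so 1440 to the inch). The helper names `twips_to_pt` and `paragraph_html` are hypothetical and not part of any toolchain described here; the point is only that the label goes into `@class` and the direct formatting into `@style`, with neither being interpreted.

```python
from html import escape

TWIPS_PER_POINT = 20  # assumption: indents arrive in twentieths of a point


def twips_to_pt(twips: int) -> str:
    """Render a twips measurement as a CSS length in points."""
    points = twips / TWIPS_PER_POINT
    # The "g" format drops a trailing ".0", so 72.0 is written as "72pt".
    return f"{points:g}pt"


def paragraph_html(text: str, style_name: str = "", left_indent_twips: int = 0) -> str:
    """Build a `p` element: the Word style name is copied into @class,
    the direct formatting into @style. No semantic category is assigned."""
    attrs = ""
    if style_name:
        attrs += f' class="{escape(style_name, quote=True)}"'
    if left_indent_twips:
        attrs += f' style="margin-left: {twips_to_pt(left_indent_twips)}"'
    return f"<p{attrs}>{escape(text)}</p>"


print(paragraph_html("One darn paragraph.", "BlockQuote", 1440))
```

A one-inch indent (1440 twips) comes out as `margin-left: 72pt`, matching the conversion described above. A real pipeline would also have to decide how to handle style names containing spaces, which this sketch ignores.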
|
|
|
|
|
|
HTML Typescript is meant to be a basis for improvement, not a finished thing in itself.
|
|
|
|
|
|
## Syntactic requirements (these preserve interoperability with the XML transformation/query stack)
|
|
|
|
|
|
* XML syntax w/ no tag minimization. Everything in XHTML namespace; otherwise avoid namespace complications
|
|
|
* Implicitly (X)HTML5 w/ no XML declaration and no DOCTYPE declaration (thus serializable using XML serializers)
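As an illustration of these two requirements, here is a sketch using Python's standard `xml.etree.ElementTree` as a stand-in for "an XML serializer". The function names `q` and `build_document` are hypothetical. It puts every element in the XHTML namespace and serializes with no XML declaration and no DOCTYPE.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize the XHTML namespace as the default (unprefixed) namespace.
ET.register_namespace("", XHTML_NS)


def q(local_name: str) -> str:
    """Qualify a local element name with the XHTML namespace."""
    return f"{{{XHTML_NS}}}{local_name}"


def build_document(paragraphs: list[str]) -> str:
    """Return a flat XHTML document, one `p` per input string."""
    html = ET.Element(q("html"))
    body = ET.SubElement(html, q("body"))
    for text in paragraphs:
        ET.SubElement(body, q("p")).text = text
    # encoding="unicode" returns a str without an XML declaration;
    # ElementTree does not write a DOCTYPE.
    return ET.tostring(html, encoding="unicode")


doc = build_document(["First paragraph.", "Second paragraph."])
print(doc)

# Round trip: the result parses with an ordinary XML parser.
assert ET.fromstring(doc).tag == q("html")
```

Because the output is plain well-formed XML, the same document can be handed to any XML parser, transformation, or query tool without an HTML-specific parsing step.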
|
|
|
|
|
|
## Validation
|
|
|
|
|
|
Ideally, any HTML Typescript document will be valid to the HTML5 Rec and all relevant specs. However, we can't practically require every tool to produce output that is valid to any given model all the time, so formal validation to an external model is not in scope for HTML Typescript - even the HTML model.
|
|
|
|
|
|
This is because validation is so important and sensitive that it requires a toolkit of its own, with no interference here!
|
|
|
|
|
|
## High-level structure / scaffolding
|
|
|
|
|
|
docx-extract section
|
|
|
|
|
|
xsweet-footnotes and xsweet-endnotes sections
|
|
|
|
|
|
## "Paragraph level", @class and @style
|
|
|
|
|
|
## Inline markup
|
|
|
|
|
|
Element types, @class and @style
|
|
|
|
|
|
`u` and `i` vs (e.g.) `strong` and `em`. Escaping to CSS when tags are not available (e.g. small caps).
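One way to picture this, as a sketch only: presentational properties that have an HTML tag get that tag (`b`, `i`, `u`, not `strong` or `em`), and those that do not, such as small caps, escape to CSS on a `span`. The property names and the function `run_html` are hypothetical illustrations, not the actual mapping used by any pipeline.

```python
from html import escape

# Hypothetical mapping from run properties to presentational tags.
TAG_FOR_PROPERTY = {"bold": "b", "italic": "i", "underline": "u"}

# Properties with no tag of their own escape to CSS on a `span`.
CSS_FOR_PROPERTY = {"small_caps": "font-variant: small-caps"}


def run_html(text: str, properties: list[str]) -> str:
    """Wrap a run of text in presentational markup for its properties."""
    html = escape(text)
    css = [CSS_FOR_PROPERTY[p] for p in properties if p in CSS_FOR_PROPERTY]
    if css:
        style = "; ".join(css)
        html = f'<span style="{style}">{html}</span>'
    # Tagged properties wrap the result; the first one listed ends up innermost.
    for p in properties:
        if p in TAG_FOR_PROPERTY:
            tag = TAG_FOR_PROPERTY[p]
            html = f"<{tag}>{html}</{tag}>"
    return html


print(run_html("Caesar", ["italic", "small_caps"]))
```

Nothing here decides whether the italics mean emphasis, a title, or a foreign word; the markup only records what the source formatting was.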
|
|
|
|
|
|
## CSS refactoring
|
|
|
|
|
|
## Sticky bits
|
|
|
|
|
|
|