... | ... | @@ -18,7 +18,7 @@ We'll do this using basic-brain-dead HTML/CSS: basically a few structural divs f |
|
|
|
|
|
HTML Typescript isn't one thing, because it is a transitional format. (So the same document may be in one form of HTML Typescript early in an editing process, another one later -- before being "lifted" out of HTML Typescript altogether.) There are a couple of consistent differences between HTML Typescript and other species of HTML, which make it recognizable:
|
|
|
|
|
|
* At least early in editing, it will be mostly flat. (No application-oriented scaffolding to speak of certainly not lots of deeply nested divs.)
|
|
|
* At least early in editing, it will be mostly flat. (No application-oriented scaffolding to speak of, and certainly not lots of deeply nested divs.) To the extent there is structure, it reflects how the data is extracted, not how it is correctly arranged or structured internally.
|
|
|
|
|
|
* Not much richness of tagging. No HTML5 'semantic' elements such as `header` or `aside`. Mostly just `p` elements with inline elements, `span` and the like.
|
|
|
|
... | ... | @@ -51,7 +51,7 @@ By 'presentational' of course we mean that tagging is devoted to describing pres |
|
|
|
|
|
Open this in any HTML browser and you will see something quite consistent with the source data. The only indicators that there is a shift in structure (which the eye can see as the "beginning of the list") are in the indent and margin settings.
|
|
|
|
|
|
In a subsequent processing phase, we might use a filter to remove the font settings (as not informative of useful distinctions) and rewritte the styles to reduce verbosity. (This is still HTML Typescript.)
|
|
|
In a subsequent processing phase, we might use a filter to remove the font settings (as not informative of useful distinctions) and rewrite the styles to reduce the inline verbosity. (This is still HTML Typescript.)
|
|
|
|
|
|
```
|
|
|
<p class="xsw_indent36pt">Take Emerson. Emerson is always on the verge of making himself exceptional—either exceptionally puny, ineffective, and futile, or exceptionally stable and transparent. He gave his “Laws of Writing” to the young George Woodbury one day in 1860. There are ten of them:</p>
|
... | ... | @@ -68,8 +68,9 @@ while our CSS has |
|
|
|
|
|
The intent is to reduce the "noise" -- turn down the background static so we can see represented exactly what the Word document original represents, the way it represents it. From there, our editorial process can go forward.
|
|
|
|
|
|
BTW since our "editorial process" is set up with tools, there's nothing to prevent us from deploying a filter that would turn the above into something more like what we know we want, maybe something like:
|
|
|
BTW, since our editorial environment is set up with tools, we have the means to deploy a filter that would turn the above into something more like what we know we want, maybe something like:
|
|
|
|
|
|
```
|
|
|
<p>Take Emerson. Emerson is always on the verge of making himself exceptional—either exceptionally puny, ineffective, and futile, or exceptionally stable and transparent. He gave his “Laws of Writing” to the young George Woodbury one day in 1860. There are ten of them:</p>
|
|
|
<ol>
|
|
|
<li>Write not at all unless you have something new.</li>
|
... | ... | @@ -77,20 +78,22 @@ BTW since our "editorial process" is set up with tools, there's nothing to preve |
|
|
<li>Have nothing of the plan visible—nor firstly, secondly, or thirdly. Show the body, not the ligaments.</li>
|
|
|
...</ol>
|
|
|
<p class="continuing">...</p>
|
|
|
```
|
|
|
|
|
|
But this is no longer HTML Typescript. (It's something more like "HTML Galley Proof".)
|
|
|
|
|
|
### Translation principles
|
|
|
|
|
|
* When extracting from a word processor -- wherever possible, we design everything to come through by default. But we don't take this to an extreme; it doesn't actually mean we have to capture literally everything. For example, in a word processor document, we treat certain parts of the documentary apparatus (such as page headers or page layout settings) as incidental and dispensable by design (because part of our job is to provide those things). But we don't drop anything by accident.
|
|
|
* But -- expect data coming out to be flat. "One darn paragraph after another." Semantic distinctions are typically represented, if at all, only implicitly, by means of varying formatting properties. Sometimes we are lucky. For example if the Word comes in all nicely styled with a known template - we can capture that info in the HTML. Because HTML Typescript has a way to do this, no problem. But we only copy labels -- we don't try to assign semantic categories.
|
|
|
* Instead, we focus on capturing and correctly representing what we see, and presenting it in such form as a downstream process can do what's called for.
|
|
|
* But -- expect data coming out to be flat. "One darn paragraph after another." Within the word processor, semantic distinctions are typically represented, if at all, only implicitly, by means of varying formatting properties. Sometimes we are lucky. For example, if the Word comes in all nicely styled with a known template, we can capture that info in the HTML, because HTML Typescript has a way to do this, no problem. But we only copy labels given in the Word -- we don't try to assign semantic categories. This means that if the Word says only "this is in italics", we translate that into italics.
|
|
|
* Consequently (and conveniently), CSS and its expressive limitations (property set) put an outer bound on what can be described in HTML Typescript. Sorry, if you use Word to make big letters with shadows and red outlining, HTML Typescript will probably not show that (since we can't describe it in CSS).
|
|
|
* Within these limitations we focus on capturing and correctly representing what we see, and presenting it in such form as a downstream process can do what's called for.
|
|
|
    * Exception: subpipelines (separate recipes) such as the header-promote pipeline can provide enhancements over and above straight "extraction", but these function as optional add-ons.
|
|
|
* Cross-references are emulated for footnotes and endnotes (as regular structures created as such in the Word document), but by and large, no linking or cross-referencing will be implemented, even where it is warranted in the source. Where the source data has a string, we get a string and nothing more. (We expect robust linking mechanisms to come later when data is better structured and more regular.)
|
|
|
* `class` overloading is okay. `class` is convenient for any semantic labeling. We will be more or less shameless in adding whatever values we think we need, to communicate downstream.
|
|
|
* Similarly, `style` and other *presentational* formatting is not only acceptable - for some applications it is preferred. Word has the feature that any arbitrary span or segment of text may be assigned its own properties for presentation. (And writers using Word do this a lot.) This has a straightforward analogue in the HTML `@style` attribute, which is (for ordinary everyday purposes) deprecated or discouraged. It permits us to hang CSS to describe formatting basically wherever we like [example]
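
As a rough sketch of the two points above (copied labels in `class`, ad hoc formatting in `@style`), a paragraph of HTML Typescript might look something like the following. The class value `Epigraph` is hypothetical, standing in for whatever paragraph style name the Word file happens to carry, and the particular CSS properties are likewise only illustrative:

```html
<!-- "Epigraph" is a hypothetical label copied from a Word paragraph style;
     it is carried through as-is, with no semantic category assigned. -->
<p class="Epigraph">
  Show the body,
  <!-- Span-level formatting set by hand in Word, described in CSS via @style. -->
  <span style="font-style: italic; color: #800000">not the ligaments</span>.
</p>
```

Nothing here says what the paragraph or the span *means*; it only records the label and the formatting found in the source, for a downstream process to interpret.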
|
|
|
|
|
|
In effect, CSS provides us an entire language to expose a range of features in our source data, except in a language familiar to developers and readily processed using tools they already have. So if the Word says, "1inch left margin" we can turn that into "left margin: 72pt" or indeed "left-margin: 1in" in CSS.
|
|
|
In effect, CSS provides us the language to expose a range of features in our source data, in a way familiar to developers and readily processed using tools they already have. So if the Word says, "1 inch left margin" we can turn that into "margin-left: 72pt" or indeed "margin-left: 1in" in CSS.
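
For illustration only, here is how that left margin might be written on a paragraph. The two forms are equivalent, since CSS defines 1in as 72pt; which one an extraction step would actually emit is an assumption here, not something stated above:

```html
<!-- Margin given in inches, as the word processor reports it. -->
<p style="margin-left: 1in">...</p>

<!-- The same margin expressed in points (1in = 72pt in CSS). -->
<p style="margin-left: 72pt">...</p>
```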
|
|
|
|
|
|
Indeed, the very reasons why @style and presentational tagging in general should be avoided in "good" markup -- because *at best*, they represent *work to be done*, while at worst they are misleading cruft -- these are the very same reasons why presentational tagging, `i`, `b` and all that species, are acceptable and indeed preferable as a representation (as 'transparent' as possible) of Word source data. Because that's what the Word data has, and we need to see exactly what is there prior to "casting" it into anything.
|
|
|
|
... | ... | |