'Push' mapping from XSweet HTML Typescript to Editoria (intake) HTML
Surveys the elements emitted by XSweet and documents what happens to them as they pass through the Editoria Typescript filter.
See the Pull mapping page for a (preliminary?) spec of elements recognized and used by Editoria.
The input structure (emitted by XSweet) looks like this:
html
  head [head stuff, including CSS]
  body
    div#docx-body
    div#docx-footnotes [when there are footnotes]
    div#docx-endnotes [when there are end notes]
The structure received by Editoria must be like this:
html
  head [head stuff, but excluding CSS?]
  body [contains a flat sequence of paragraphs, headers and block-level objects]
This reduction is achieved by resolving the notes with their anchors (references), as described next, and by removing the div element superstructure.
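As a rough illustration of the "getting rid of the div superstructure" step, here is a minimal Python sketch using only the standard library. It is not the XSweet or Editoria implementation. It assumes the notes have already been resolved (so only div#docx-body remains), that the input is well-formed XML with no namespace, and the names `SAMPLE` and `flatten_body` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input, shaped like the XSweet structure shown above,
# with the notes assumed to be resolved already.
SAMPLE = """<html><head><title>t</title></head><body>
<div id="docx-body"><p>One</p><h1>Head</h1><p>Two</p></div>
</body></html>"""


def flatten_body(html_text):
    """Replace each div directly under body with that div's own children."""
    root = ET.fromstring(html_text)
    body = root.find("body")
    flat = []
    for child in list(body):
        if child.tag == "div":
            # Lift the wrapper's children up one level.
            flat.extend(list(child))
        else:
            flat.append(child)
    # Rebuild the body from the flattened sequence.
    for child in list(body):
        body.remove(child)
    body.extend(flat)
    return root


if __name__ == "__main__":
    result = flatten_body(SAMPLE)
    print([child.tag for child in result.find("body")])
```

The sketch only unwraps one level of div directly under body, which matches the input structure described above; it does not attempt note resolution.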
Notes (end notes and footnotes)
WordML describes two sequences of notes, "footnotes" and "end notes". They are functionally identical except in the way they are laid out on the page.
But footnotes are comparatively rare, and are not yet implemented and tested in XSweet (as of writing). So this mapping really applies only to end notes.
XSweet will emit inline elements, sometimes nested (for example, <b>), when the WordML encoding in the source data warrants inline reformatting. This includes shifts in font weight (bold, normal), style (italic or roman), font family and (rarely) font size.
Similarly, inline styles applied in the Word data will be represented as <span class="TheStyle">, where 'TheStyle' is the name of the style in Word. (Could be anything.)
These result in any of the following in XSweet output:
- Elements such as <sub>, when they are HTML equivalents for formatting assigned in the Word data
- <span> with @style or (if CSS has been refactored) @class - this can happen for arbitrary Styles, but also for formatting that must be escaped into CSS to get into HTML, e.g. <span style="font-variant: small-caps">
- Rarely, an "unknown" element may be produced - this is what XSweet does with Word formatting that it does not know what else to do with. (Since we are treating the universe of WordML as potentially open-ended.) Of course, such elements will be invalid in HTML and (eventually) we should have ways of trapping and dealing with them.
For purposes of converting to Editoria, these fall into two groups: (1) those with explicit mappings, and (2) those that will be caught by the fallback rule (see below).
Here is the first group, with its mappings (as of writing):
XSweet produces flat paragraphs with no nested (<div> or <section>) structure, and that is what Editoria wants.
Note that many paragraphs may have a named class (indicating a Word style of that name) and/or paragraph-level formatting (expressed in CSS via @style, or via @class after CSS refactoring).
When mappings are available (see below for Extracts) these can be converted; otherwise they will be submitted to fallback treatment (and @style may be lost).
Extracts, block quotes and other "embedded content objects"
p elements with a "Quote" class will be converted to <extract>. (Not HTML5 but okay for Editoria.)
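The Quote-to-extract mapping could be sketched roughly as follows. This is only an illustration in Python, not the actual filter: it assumes the conversion is a plain rename of the element, which this page does not confirm, and the function name `quotes_to_extracts` is hypothetical.

```python
import xml.etree.ElementTree as ET


def quotes_to_extracts(root):
    """Rename every p whose class list includes "Quote" to extract.

    Assumption: a simple rename is enough. The real mapping may wrap or
    restructure the paragraph differently.
    """
    # Materialise the matches first so renaming cannot disturb the traversal.
    for p in list(root.iter("p")):
        classes = p.get("class", "").split()
        if "Quote" in classes:
            p.tag = "extract"
    return root


if __name__ == "__main__":
    body = ET.fromstring('<body><p class="Quote">Q</p><p>plain</p></body>')
    quotes_to_extracts(body)
    print(ET.tostring(body, encoding="unicode"))
```

The class attribute is deliberately left in place here, since removing @class is described as the job of the final cleanup phase.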
Requirements for other content objects will be developed over time. Note that coming out of XSweet, these are all just flat paragraphs with style names and formatting (by design of HTML Typescript).
-tbd- (For now, lists come through as flat paragraphs, with scant indications they could be lists. XSweet work on lists may be called for.)
Up until the last step in the transformation pipeline, the fallback rule is "copy everything through".
The last step, a cleanup phase, also offers an opportunity for a generally-applicable fallback rule for handling everything not known in advance to conform to Editoria's expectations.
For early development phases, this catch-all rule is: if an element is not otherwise known about, strip the element while leaving its contents. Similarly, the rule indiscriminately removes any @class from any element. Preventing otherwise unknown or unaccounted-for elements from getting into the editorial text reduces noise and complications related to these elements (which by definition will not have been mapped).
Later, we plan to relax this rule, permitting unknowns to pass through after flagging them in some way, for correction/emendation.
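The early-phase catch-all rule (strip unknown elements, keep their contents, remove every @class) might look something like the Python sketch below. It is an illustration under stated assumptions, not the real cleanup step: the `KNOWN` set is a hypothetical whitelist, since this page does not list the elements Editoria accepts, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist; the actual set recognised by Editoria is not given here.
KNOWN = {"body", "p", "h1", "h2", "h3", "i", "b", "extract"}


def _append_text(parent, index, text):
    """Attach text immediately before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def strip_unknown(elem):
    """Unwrap unknown descendants of elem and drop @class everywhere."""
    elem.attrib.pop("class", None)
    i = 0
    while i < len(elem):
        child = elem[i]
        if child.tag in KNOWN:
            strip_unknown(child)
            i += 1
            continue
        # Unknown element: splice its text, children and tail into elem at i.
        grandchildren = list(child)
        elem.remove(child)
        _append_text(elem, i, child.text)
        for offset, grandchild in enumerate(grandchildren):
            elem.insert(i + offset, grandchild)
        _append_text(elem, i + len(grandchildren), child.tail)
        # i is not advanced, so the spliced-in children are examined next.
    return elem


if __name__ == "__main__":
    doc = ET.fromstring('<body><p class="x">a<span class="s">b<i>c</i>d</span>e</p></body>')
    print(ET.tostring(strip_unknown(doc), encoding="unicode"))
```

Relaxing the rule later, as planned, would mean replacing the unwrap branch with something that keeps the element and flags it for correction instead.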