# 1. Extraction
|
|
|
stylesheet: /docx-extract/docx-html-extract.xsl
|
|
|
|
|
|
The first step extracts Word's underlying Office Open XML (WordprocessingML) into near-HTML (XHTML).
|
|
|
|
|
|
Certain properties specified at the paragraph level (inside `<w:pPr>`) of the XML are mapped to `<p>` attributes:
|
|
|
|
|
|
* named Word styles are mapped to a `<p class="STYLE">` attribute
|
|
|
* formatting information is mapped to `<p style="ATTRIBUTES">`. Attributes include:
|
|
|
* text alignment: `text-align`
|
|
|
* indentation: `text-indent`, `padding-left`
|
|
|
* top and bottom margin: `margin-top`, `margin-bottom`
|
|
|
* list level (for list items): `xsweet-list-level`
|
|
|
* Word outline level, if specified: `xsweet-outline-level`
|
|
|
* WENDELL - anything else that gets extracted?
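To make the mapping above concrete, here is a minimal Python sketch of the output shape only. XSweet itself does this in XSLT; the helper name `render_paragraph_open_tag` and the sample values are hypothetical, not taken from the stylesheet.

```python
from html import escape


def render_paragraph_open_tag(word_style, properties):
    """Build an opening <p> tag from a named Word style and a dict of
    CSS-like properties. Hypothetical helper, not part of XSweet."""
    attrs = []
    if word_style:
        attrs.append(f'class="{escape(word_style, quote=True)}"')
    if properties:
        css = "; ".join(f"{name}: {value}" for name, value in properties.items())
        attrs.append(f'style="{escape(css, quote=True)}"')
    return "<p" + "".join(" " + attr for attr in attrs) + ">"


# Sample values are made up for illustration.
tag = render_paragraph_open_tag(
    "Heading1",
    {"text-align": "center", "margin-top": "12pt", "xsweet-outline-level": "0"},
)
print(tag)
```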
|
|
|
|
|
|
Many paragraph-level declarations of font, font size, and certain other formatting are ignored as redundant, because the same properties are declared on the runs inside the paragraph.
|
|
|
|
|
|
XSweet represents runs of text inside paragraphs, together with their properties, as `<span>` elements inside a `<p>`:
|
|
|
* named Word styles are mapped as `<span class="STYLE">`
|
|
|
* formatting is mapped to a `<span style="ATTRIBUTES">`. Attributes include:
|
|
|
* font: `font-family`
|
|
|
* font size: `font-size`
|
|
|
* small caps: `font-variant: small-caps`
|
|
|
|
|
|
WENDELL:
|
|
|
* how do toggle styles like `<b>`, `<i>`, and `<kern>` get passed through? Since `kern` makes it through, I think that there must be a list of wrapper elements that get proactively removed, and anything else gets passed through. Is that right? If so, what's the list of elements that are ignored/removed by extract?
|
|
|
* what else does extraction do?
|
|
|
* do any of the attributes listed as extracted from runs (font, font size, small caps) ever come from the paragraph (w:pPr)?
|
|
|
|
|
|
# 2. Notes
|
|
|
|
|
|
Both endnotes and footnotes from Word are extracted. Inline endnote and footnote callouts become clickable links that correspond to their respective notes at the end of the document.
|
|
|
|
|
|
Unreferenced notes are removed and the endnotes and footnotes are automatically numbered, separately and in order of first reference.
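The numbering rule can be sketched in Python as follows. This is an illustration of the rule as described here, not the XSLT that XSweet runs; the function name and the data shapes are assumptions.

```python
def number_notes(callout_ids, notes):
    """Number notes in order of first reference and drop unreferenced ones.

    callout_ids: note ids in the order their callouts appear in the body.
    notes: dict mapping note id -> note text.
    Returns a list of (number, note_id, text) tuples.
    """
    numbered = []
    seen = set()
    for note_id in callout_ids:
        # Skip repeat references and callouts with no matching note.
        if note_id in seen or note_id not in notes:
            continue
        seen.add(note_id)
        numbered.append((len(numbered) + 1, note_id, notes[note_id]))
    return numbered


endnotes = {"a": "First stored note", "b": "Never referenced", "c": "Cited first"}
result = number_notes(["c", "a", "c"], endnotes)
print(result)
```

Because endnotes and footnotes are numbered separately, a function like this would be called once per note type.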
|
|
|
|
|
|
Example:
|
|
|
```html
|
|
|
<div class="docx-body">
|
|
|
<p>Here’s the introductory paragraph</p>
|
|
|
<p> Here is an endnote callout<span class="EndnoteReference">
|
|
|
<a class="endnoteReference" href="#en1">1</a></span>
|
|
|
</p>
|
|
|
</div>
|
|
|
<div class="docx-endnotes">
|
|
|
<div class="docx-endnote" id="en1">
|
|
|
<p class="EndnoteText"><span class="EndnoteReference" /> Text of the endnote</p>
|
|
|
</div>
|
|
|
</div>
|
|
|
```
|
|
|
|
|
|
# 3. Scrub
|
|
|
|
|
|
This step removes the following tags:
|
|
|
* position
|
|
|
* iCs
|
|
|
* lang
|
|
|
* vertAlign
|
|
|
* noProof
|
|
|
|
|
|
Empty inline elements are removed, and formatting applied to tags that contain only whitespace is removed.
|
|
|
|
|
|
CSS properties are normalized and put into a consistent order.
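One possible reading of "normalized and put into a consistent order", sketched in Python. The specific rules here (trimmed whitespace, lower-cased property names, the last duplicate winning, alphabetical order) are assumptions for illustration; the actual stylesheet may do something different.

```python
def normalize_style(style):
    """Return a style attribute value with tidy, alphabetically ordered
    declarations. Assumed rules only; not confirmed against XSweet."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" not in chunk:
            continue  # skip empty or malformed pieces
        name, value = chunk.split(":", 1)
        name = name.strip().lower()
        value = " ".join(value.split())  # collapse internal whitespace
        if name and value:
            declarations[name] = value  # a later declaration overrides an earlier one
    return "; ".join(f"{name}: {declarations[name]}" for name in sorted(declarations))


normalized = normalize_style("font-size:12pt;  COLOR : blue ;font-size: 14pt;")
print(normalized)
```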
|
|
|
|
|
|
Qs:
|
|
|
* does it do anything with caps and strike?
|
|
|
* it's unclear to me exactly what the css normalization does: "@style is rewritten to normalize its CSS."
|
|
|
|
|
|
# 4. Join
|
|
|
|
|
|
This step combines strings of elements into one element when:
|
|
|
1. More than one element of the same type occurs in a row, and
|
|
|
2. The adjacent elements have similar style attributes
|
|
|
|
|
|
Example:
|
|
|
```html
|
|
|
<p style="text-align: center">
|
|
|
<b>Part I: </b><b>United</b><b> and </b><b>Divided</b>
|
|
|
</p>
|
|
|
```
|
|
|
becomes:
|
|
|
```html
|
|
|
<p style="text-align: center">
|
|
|
<b>Part I: United and Divided</b>
|
|
|
</p>
|
|
|
```
|
|
|
This step does not combine runs of `div`s, `p`s, or `tab`s.
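A rough Python sketch of the join rule, using plain tuples instead of real elements. It treats "similar" as "identical style string", which is an assumption; the function name and data shape are made up for illustration.

```python
def join_adjacent(elements):
    """Merge neighbouring (tag, style, text) tuples that share tag and style.

    div, p and tab are never merged, per the rule stated above.
    """
    never_joined = {"div", "p", "tab"}
    joined = []
    for tag, style, text in elements:
        if joined and tag not in never_joined:
            prev_tag, prev_style, prev_text = joined[-1]
            if prev_tag == tag and prev_style == style:
                # Same kind of element with the same style: concatenate the text.
                joined[-1] = (tag, style, prev_text + text)
                continue
        joined.append((tag, style, text))
    return joined


runs = [("b", "", "Part I: "), ("b", "", "United"), ("b", "", " and "), ("b", "", "Divided")]
merged = join_adjacent(runs)
print(merged)
```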
|
|
|
|
|
|
# 5. Collapse paragraphs
|
|
|
|
|
|
In this step, inline formatting gets copied to the paragraph level wherever possible.
|
|
|
|
|
|
For example, this:
|
|
|
```html
|
|
|
<p style="color: blue"><span style="font-weight: bold">blue bold text</span></p>
|
|
|
```
|
|
|
is transformed into:
|
|
|
```html
|
|
|
<p style="color: blue; font-weight: bold">blue bold text</p>
|
|
|
```
|
|
|
Elements that contain only formatting information that can be pushed to the paragraph level are removed entirely, as in the `span` above.
|
|
|
|
|
|
Additionally, `<p>`-level `style` information is added to reflect these tags whenever they wrap all the contents of a paragraph:
|
|
|
`<i>` -> `font-style: italic`

`<b>` -> `font-weight: bold`

`<u>` -> `text-decoration: underline`
|
|
|
These tags are also left inline for now.
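The `span` case can be sketched with Python's standard `xml.etree.ElementTree`. This only covers the simplest situation (one style-only `span` wrapping all of the paragraph's text) and is an illustration, not the XSLT XSweet uses; `collapse_paragraph` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def collapse_paragraph(p):
    """Move the style of a single wrapping <span> up to its <p> and drop the span.

    Does nothing unless the span is the paragraph's only content, has no
    child elements, and carries no attribute other than style.
    """
    children = list(p)
    if len(children) != 1 or children[0].tag != "span":
        return p
    span = children[0]
    has_outer_text = (p.text or "").strip() or (span.tail or "").strip()
    if has_outer_text or len(span) or set(span.attrib) - {"style"}:
        return p
    styles = [s for s in (p.get("style"), span.get("style")) if s]
    if styles:
        p.set("style", "; ".join(styles))
    p.text = (p.text or "") + (span.text or "")
    p.remove(span)
    return p


p = ET.fromstring(
    '<p style="color: blue"><span style="font-weight: bold">blue bold text</span></p>'
)
collapsed = ET.tostring(collapse_paragraph(p), encoding="unicode")
print(collapsed)
```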
|
|
|
|
|
|
# 6. LISTS
|
|
|
|
|
|
# 7. Header promotion logic: formatting approach (WIP)
|
|
|
|
|
|
## 1. Create paragraph representations
|
|
|
Create a representation of all the `<p>` tags in the document, including all the properties relevant to header promotion:
|
[...]
|
|
* Less than 200 characters in average length
|
|
|
AND
|
|
|
* Average consecutive paragraph run is less than 2
|
|
|
* Promote if it is a paragraph of a type that _never_ ends in a period
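The three measurements named in the criteria above can be computed roughly as in this Python sketch. How the criteria combine with the parts of this section not shown here is not assumed; the sketch only reports each check separately. Function names and the `(key, text)` data shape are hypothetical.

```python
def average_run_length(flags):
    """Average length of the consecutive runs of True in a list of booleans."""
    runs, current = [], 0
    for flag in flags:
        if flag:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return sum(runs) / len(runs) if runs else 0.0


def header_checks(paragraphs, group_key):
    """paragraphs: list of (key, text) in document order, where key identifies
    a group of paragraphs with common properties."""
    texts = [text for key, text in paragraphs if key == group_key]
    if not texts:
        return {"short_and_isolated": False, "never_ends_in_period": False}
    avg_len = sum(len(text) for text in texts) / len(texts)
    avg_run = average_run_length([key == group_key for key, _ in paragraphs])
    return {
        "short_and_isolated": avg_len < 200 and avg_run < 2,
        "never_ends_in_period": not any(t.rstrip().endswith(".") for t in texts),
    }


doc = [
    ("h", "Chapter One"), ("b", "Body text."), ("b", "More body."), ("b", "Still body."),
    ("h", "Chapter Two"), ("b", "Closing text."), ("b", "The end."),
]
heading_checks = header_checks(doc, "h")
body_checks = header_checks(doc, "b")
```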
|
|
|
|
|
|
## 4. With the newly created list of all the headers, determine the most logical header level to promote the paragraphs grouped by common properties. Sort by these criteria, in sequence:
|
|
|
* Font size
|
[...]
|
|
* Underline
|
|
|
* Always caps
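Sorting "in sequence" can be read as a multi-key sort, sketched below in Python. Only criteria visible in this list are used (font size, underline, always caps); any others are left out rather than guessed. The direction of each comparison and the assignment of `h1`, `h2`, ... in sorted order are assumptions.

```python
def rank_header_groups(groups):
    """groups: list of dicts with 'name', 'font_size', 'underline', 'always_caps'.

    Returns a dict mapping group name -> header tag. Larger font, then
    underline, then all-caps are assumed to rank higher.
    """
    ordered = sorted(
        groups,
        key=lambda g: (g["font_size"], g["underline"], g["always_caps"]),
        reverse=True,
    )
    # HTML only defines h1 to h6, so deeper groups are capped at h6.
    return {g["name"]: f"h{min(level, 6)}" for level, g in enumerate(ordered, start=1)}


levels = rank_header_groups([
    {"name": "C", "font_size": 14, "underline": False, "always_caps": True},
    {"name": "A", "font_size": 18, "underline": False, "always_caps": False},
    {"name": "B", "font_size": 14, "underline": True, "always_caps": False},
])
print(levels)
```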
|
|
|
|
|
|
# 8. Final rinse
|
|
|
|
|
|
* Removes redundant inline tags (e.g. `<b>`, `<i>`, `<u>`) that are expressed instead as CSS `style`.
|
|
|
* Inserts placeholder comments into empty `<div>`s and `<p>`s to ensure they are retained
|
|
|
* Removes extraneous noise from endnote and footnote references
|
|
|
* Removes redundant styling repeated on child elements
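As an illustration of the placeholder step, here is a small Python sketch using the standard `xml.etree.ElementTree`. The placeholder text `empty` and the function name are assumptions; the comment XSweet actually inserts may differ.

```python
import xml.etree.ElementTree as ET


def add_placeholders(root, text="empty"):
    """Insert a comment into every <p> or <div> that has no children and no
    non-whitespace text, so the element is not dropped as empty."""
    # Materialize the element list first, since the tree is modified in the loop.
    for element in list(root.iter()):
        is_target = element.tag in ("p", "div")
        if is_target and len(element) == 0 and not (element.text or "").strip():
            element.text = None
            element.append(ET.Comment(text))
    return root


root = ET.fromstring("<div><p></p><p>text</p><div> </div></div>")
rinsed = ET.tostring(add_placeholders(root), encoding="unicode")
print(rinsed)
```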
|
|
|
|
|
|
# Link handling
|
|
|
|
|
|
XSweet can recognize hyperlinks. To create a hyperlink, XSweet must find:
|
|
|
1. text that looks like a URL (DEFINE)
|
|
|
2. text that is somehow distinguished from the surrounding text (e.g. by a different font or formatting) (DEFINE)
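Both conditions are still marked DEFINE, so the following Python sketch is only one possible interpretation. The URL pattern, the style comparison and the function name are all placeholders, not XSweet's actual rules.

```python
import re
from html import escape

# Placeholder definition of "looks like a URL": starts with http(s):// or www.
URL_PATTERN = re.compile(r"^(?:https?://|www\.)[^\s<>\"]+$", re.IGNORECASE)


def linkify_run(text, run_style, surrounding_style):
    """Wrap a run in <a> if it looks like a URL and its style differs from
    the surrounding text; otherwise return the text unchanged."""
    candidate = text.strip()
    if URL_PATTERN.match(candidate) and run_style != surrounding_style:
        href = candidate if "://" in candidate else "http://" + candidate
        return f'<a href="{escape(href, quote=True)}">{escape(candidate)}</a>'
    return text


linked = linkify_run("www.example.com", "color: blue", "")
print(linked)
```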