... | ... | @@ -47,3 +47,32 @@ XSweet can recognize hyperlinks. To create a hyperlink, XSweet must find: |
|
|
1. text that looks like a URL (DEFINE)
|
|
|
2. that text must be distinguished in some way from the surrounding text (e.g. by a different font or formatting) (DEFINE)
|
|
|
|
|
|
# 1. Extraction
|
|
|
stylesheet: /docx-extract/docx-html-extract.xsl
|
|
|
|
|
|
The first step extracts Word's underlying Office Open XML (OOXML) into near-HTML (XHTML).
|
|
|
|
|
|
Certain properties specified at the paragraph level (inside `<w:pPr>`) of the XML are mapped to attributes on `<p>`:
|
|
|
|
|
|
* named Word styles are mapped to `<p class="STYLE">`
|
|
|
* formatting information is mapped to `<p style="ATTRIBUTES">`. Attributes include:
|
|
|
* text alignment: `text-align`
|
|
|
* indentation: `text-indent`, `padding-left`
|
|
|
* top and bottom margin: `margin-top`, `margin-bottom`
|
|
|
* list level (for list items): `xsweet-list-level`
|
|
|
* Word outline level, if specified: `xsweet-outline-level`
|
|
|
* WENDELL - anything else that gets extracted?
|
|
|
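The paragraph-level mapping described above can be illustrated with a minimal Python sketch. This is not XSweet's implementation (the real work is done by the XSLT stylesheet); it only shows the shape of the mapping. The function name `paragraph_attributes`, the choice of `pt` units, and the pairing of `w:ind`/`w:spacing` values with particular CSS properties are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as <w:pPr>
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{W}}}{name}"


def twips_to_pt(value):
    """Word measures indents and spacing in twentieths of a point."""
    return f"{int(value) / 20:g}pt"


def paragraph_attributes(ppr):
    """Hypothetical helper: map a <w:pPr> element to <p> attributes."""
    attrs = {}
    css = []

    style = ppr.find(w("pStyle"))
    if style is not None:
        attrs["class"] = style.get(w("val"))

    jc = ppr.find(w("jc"))
    if jc is not None:
        align = jc.get(w("val"))
        # Word calls full justification "both"
        css.append(("text-align", "justify" if align == "both" else align))

    ind = ppr.find(w("ind"))
    if ind is not None:
        if ind.get(w("firstLine")):
            css.append(("text-indent", twips_to_pt(ind.get(w("firstLine")))))
        if ind.get(w("left")):
            css.append(("padding-left", twips_to_pt(ind.get(w("left")))))

    spacing = ppr.find(w("spacing"))
    if spacing is not None:
        if spacing.get(w("before")):
            css.append(("margin-top", twips_to_pt(spacing.get(w("before")))))
        if spacing.get(w("after")):
            css.append(("margin-bottom", twips_to_pt(spacing.get(w("after")))))

    ilvl = ppr.find(f"{w('numPr')}/{w('ilvl')}")
    if ilvl is not None:
        css.append(("xsweet-list-level", ilvl.get(w("val"))))

    outline = ppr.find(w("outlineLvl"))
    if outline is not None:
        css.append(("xsweet-outline-level", outline.get(w("val"))))

    if css:
        attrs["style"] = "; ".join(f"{prop}: {val}" for prop, val in css)
    return attrs


SAMPLE_PPR = f"""<w:pPr xmlns:w="{W}">
  <w:pStyle w:val="Quote"/>
  <w:jc w:val="center"/>
  <w:ind w:left="720" w:firstLine="360"/>
  <w:spacing w:before="240" w:after="120"/>
  <w:outlineLvl w:val="1"/>
</w:pPr>"""

if __name__ == "__main__":
    print(paragraph_attributes(ET.fromstring(SAMPLE_PPR)))
```

The sample input uses standard WordprocessingML elements (`w:pStyle`, `w:jc`, `w:ind`, `w:spacing`, `w:outlineLvl`); the exact values and units XSweet emits may differ.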
|
|
|
Many paragraph-level declarations of font, font size, and certain formatting are ignored as redundant, because they are also declared on the runs inside the paragraph.
|
|
|
|
|
|
XSweet represents runs of text inside paragraphs, along with their properties, as `<span>` elements inside a `<p>`:
|
|
|
* named Word styles are mapped as `<span class="STYLE">`
|
|
|
* formatting is mapped to a `<span style="ATTRIBUTES">`. Attributes include:
|
|
|
* font: `font-family`
|
|
|
* font size: `font-size`
|
|
|
* small caps: `font-variant: small-caps`
|
|
|
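The run-level mapping can be sketched the same way. Again, this is a hedged illustration in Python rather than the stylesheet's actual logic: `run_to_span` is a hypothetical helper, reading the font name from the `w:ascii` attribute of `w:rFonts` is an assumption, and the sketch deliberately ignores toggle properties such as bold and italic, which are still an open question in the notes that follow.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as <w:r> and <w:rPr>
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{W}}}{name}"


def run_to_span(run):
    """Hypothetical helper: turn a <w:r> run into a <span> element."""
    span = ET.Element("span")
    css = []

    rpr = run.find(w("rPr"))
    if rpr is not None:
        style = rpr.find(w("rStyle"))
        if style is not None:
            span.set("class", style.get(w("val")))

        fonts = rpr.find(w("rFonts"))
        if fonts is not None and fonts.get(w("ascii")):
            css.append(f"font-family: {fonts.get(w('ascii'))}")

        size = rpr.find(w("sz"))
        if size is not None:
            # <w:sz> is expressed in half-points
            css.append(f"font-size: {int(size.get(w('val'))) / 2:g}pt")

        caps = rpr.find(w("smallCaps"))
        # A bare <w:smallCaps/> means "on"; w:val="0" or "false" turns it off
        if caps is not None and caps.get(w("val"), "true") not in ("0", "false"):
            css.append("font-variant: small-caps")

    if css:
        span.set("style", "; ".join(css))

    # Concatenate the text of every <w:t> child of the run
    span.text = "".join(t.text or "" for t in run.findall(w("t")))
    return span


SAMPLE_RUN = f"""<w:r xmlns:w="{W}">
  <w:rPr>
    <w:rStyle w:val="Emphasis"/>
    <w:rFonts w:ascii="Times New Roman"/>
    <w:sz w:val="24"/>
    <w:smallCaps/>
  </w:rPr>
  <w:t>Hello</w:t>
</w:r>"""

if __name__ == "__main__":
    span = run_to_span(ET.fromstring(SAMPLE_RUN))
    print(ET.tostring(span, encoding="unicode"))
```

Keeping the named style in `class` and the direct formatting in `style` mirrors the split described in the list above, so later steps can treat the two kinds of information separately.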
|
|
|
WENDELL:
|
|
|
* how do toggle styles like `<b>`, `<i>`, and `<kern>` get passed through? Since `kern` makes it through, I think that there must be a list of wrapper elements that get proactively removed, and anything else gets passed through. Is that right? If so, what's the list of elements that are ignored/removed by extract?
|
|
|
* what else does extraction do?
|
|
|
* do any of the attributes listed as extracted from runs (font, font size, small caps) ever come from the paragraph (w:pPr)? |