Alex Theg · d9ba328f
--- a/documentation.md
+++ b/documentation.md
-# Header promotion logic: formatting approach (WIP)
+# 1. Extraction
+stylesheet: /docx-extract/docx-html-extract.xsl
+
+The first step extracts Word's underlying Office Open XML into near-HTML (XHTML).
+
+Certain attributes specified at the paragraph level (inside `<w:pPr>`) of the XML are mapped to `<p>` attributes:
+
+* named Word styles are mapped to a `class` attribute: `<p class="STYLE">`
+* formatting information is mapped to `<p style="ATTRIBUTES">`. Attributes include:
+  * text alignment: `text-align`
+  * indentation: `text-indent`, `padding-left`
+  * top and bottom margin: `margin-top`, `margin-bottom`
+  * list level (for list items): `xsweet-list-level`
+  * Word outline level, if specified: `xsweet-outline-level`
+  * WENDELL - anything else that gets extracted?
+
+Many paragraph-level declarations of font, font size, and certain formatting are ignored as redundant, because they are also declared on the runs inside the paragraph.
+
+XSweet represents runs of text inside paragraphs, together with their properties, as `<span>`s inside a `<p>`:
+* named Word styles are mapped as `<span class="STYLE">`
+* formatting is mapped to a `<span style="ATTRIBUTES">`. Attributes include:
+  * font: `font-family`
+  * font size: `font-size`
+  * small caps: `font-variant: small-caps`
+
+WENDELL:
+* how do toggle styles like `<b>`, `<i>`, and `<kern>` get passed through? Since `kern` makes it through, I think that there must be a list of wrapper elements that get proactively removed, and anything else gets passed through. Is that right? If so, what's the list of elements that are ignored/removed by extract?
+* what else does extraction do?
+* do any of the attributes listed as extracted from runs (font, font size, small caps) ever come from the paragraph (w:pPr)?
+
+# 2. Notes
+
+Both endnotes and footnotes from Word are extracted. Inline endnote and footnote callouts become clickable links that point to their respective notes at the end of the document.
+
+Unreferenced notes are removed, and endnotes and footnotes are numbered automatically, each sequence separately, in order of first reference.
+
+Example:
+```html
+<div class="docx-body">
+  <p>Here’s the introductory paragraph</p>
+  <p> Here is an endnote callout<span class="EndnoteReference">
+    <a class="endnoteReference" href="#en1">1</a></span>
+  </p>
+</div>
+<div class="docx-endnotes">
+  <div class="docx-endnote" id="en1">
+    <p class="EndnoteText"><span class="EndnoteReference" /> Text of the endnote</p>
+  </div>
+</div>
+```
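
The numbering rule described above (drop unreferenced notes, number the rest in order of first reference) can be sketched in a few lines of Python. This is only an illustration of the rule, not XSweet's stylesheet logic; the function name `number_notes` and the data shapes are assumptions made for the example.

```python
def number_notes(callouts, notes):
    """Number notes in order of first reference and drop unreferenced ones.

    callouts: note ids in the order their callouts appear in the body
              (hypothetical input shape, assumed for this sketch).
    notes:    dict mapping note id -> note text.
    Returns (numbering, kept): numbering maps note id -> new number,
    kept is the list of (number, text) pairs in output order.
    """
    numbering = {}
    for note_id in callouts:
        # Only the first callout for a given note assigns its number.
        if note_id in notes and note_id not in numbering:
            numbering[note_id] = len(numbering) + 1
    # Notes that never received a number were never referenced: drop them.
    kept = [(number, notes[note_id]) for note_id, number in numbering.items()]
    return numbering, kept


endnotes = {"a": "First written", "b": "Never cited", "c": "Cited first"}
numbering, kept = number_notes(["c", "a", "c"], endnotes)
```

Because endnotes and footnotes are numbered separately, a function like this would be called once per note type.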
+
+# 3. Scrub
+
+This step removes the following tags:
+* position
+* iCs
+* lang
+* vertAlign
+* noProof
+
+Empty inline elements are removed, as is formatting applied to elements that contain only whitespace.
+
+CSS properties are normalized and put into a consistent order.
+
+Qs:
+* does it do anything with caps and strike?
+* it's unclear to me exactly what the css normalization does: "@style is rewritten to normalize its CSS."
+
+# 4. Join
+
+This step combines a run of adjacent elements into one element when:
+1. More than one element of the same type occurs in a row, and
+2. The adjacent elements have similar style attributes
+
+Example:
+```html
+<p style="text-align: center">
+  <b>Part I: </b><b>United</b><b> and </b><b>Divided</b>
+</p>
+```
+becomes:
+```html
+<p style="text-align: center">
+  <b>Part I: United and Divided</b>
+</p>
+```
+This step does not combine runs of `div`s, `p`s, or `tab`s.
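
A minimal Python sketch of the joining rule follows, modelling inline content as `(tag, style, text)` triples. It treats "similar style attributes" as exactly equal style strings, which is an assumption; the triple representation and the name `join_runs` are likewise invented for the example.

```python
# Element types the text above says are never combined.
NEVER_JOINED = {"div", "p", "tab"}


def join_runs(elements):
    """Merge adjacent (tag, style, text) triples that share tag and style."""
    joined = []
    for tag, style, text in elements:
        if joined and tag not in NEVER_JOINED:
            prev_tag, prev_style, prev_text = joined[-1]
            # Assumption: "similar" styles means identical style strings.
            if (prev_tag, prev_style) == (tag, style):
                joined[-1] = (tag, style, prev_text + text)
                continue
        joined.append((tag, style, text))
    return joined


runs = [("b", "", "Part I: "), ("b", "", "United"), ("b", "", " and "), ("b", "", "Divided")]
merged = join_runs(runs)
```

Here `merged` holds a single `b` element with the text `Part I: United and Divided`, matching the HTML example above.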
+
+# 5. Collapse paragraphs
+
+In this step, inline formatting gets copied to the paragraph level wherever possible.
+
+For example, this:
+```html
+<p style="color: blue"><span style="font-weight: bold">blue bold text</span></p>
+```
+is transformed into:
+```html
+<p style="color: blue; font-weight: bold">blue bold text</p>
+```
+Elements that contain only formatting information that can be pushed to the paragraph level are removed entirely, as in the `span` above.
+
+Additionally, when one of these tags wraps the entire contents of a paragraph, the corresponding property is added to the `<p>`-level `style`:  
+`<i>` -> `font-style: italic`  
+`<b>` -> `font-weight: bold`  
+`<u>` -> `text-decoration: underline`  
+These tags are also left inline for now.
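
The two behaviors in this step can be sketched in Python as follows. The sketch only handles the simplest case, a single child wrapping the whole paragraph, and the data shapes (style dicts, `(tag, style, text)` children, `None` for bare text) and the name `collapse` are assumptions made for illustration.

```python
# Mapping from the section above: tag -> (CSS property, value).
TAG_STYLES = {
    "i": ("font-style", "italic"),
    "b": ("font-weight", "bold"),
    "u": ("text-decoration", "underline"),
}


def collapse(paragraph_style, children):
    """Copy formatting from a paragraph's only child up to the paragraph.

    paragraph_style: dict of CSS properties on the <p>.
    children: list of (tag, style_dict, text); tag None means bare text.
    Returns (new_paragraph_style, new_children).
    """
    style = dict(paragraph_style)
    if len(children) != 1:
        return style, children  # no single child wraps the whole paragraph
    tag, child_style, text = children[0]
    if tag == "span":
        # The span carries only formatting, so it is removed once copied up.
        style.update(child_style)
        return style, [(None, {}, text)]
    if tag in TAG_STYLES:
        name, value = TAG_STYLES[tag]
        style[name] = value  # the tag itself is left inline for now
    return style, children


result = collapse({"color": "blue"}, [("span", {"font-weight": "bold"}, "blue bold text")])
```

In this sketch `result` pairs the merged paragraph style (`color: blue` plus `font-weight: bold`) with a bare text child, mirroring the HTML example above.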
+
+# 6. Lists
+
+# 7. Header promotion logic: formatting approach (WIP)

 ## 1. Create paragraph representations
 Create a representation of all the `<p>` tags in the document, including all the properties relevant to header promotion:
@@ -33,7 +143,7 @@ AND
  * Less than 200 characters in average length
 AND
  * Average consecutive paragraph run is less than 2
-* Promote if it is a paragraph of a type that _never_ ends in a peroid
+* Promote if it is a paragraph of a type that _never_ ends in a period

 ## 4. With the newly created list of all the headers, determine the most logical header level to promote the paragraphs grouped by common properties.  Sort by these criteria, in sequence:
 * Font size
@@ -42,37 +152,15 @@ AND
 * Underline
 * Always caps

-# Link handling
-XSweet can recognize hyperlinks. To create a hyperlink, XSweet must find:
-1. text that looks like a URL (DEFINE)
-2. must be somehow distinguished from the surrounding text (e.g. different font or formatting) (DEFINE)
-
-# 1. Extraction 
-stylesheet: /docx-extract/docx-html-extract.xsl
-
-The first step extracts Word's underlying OpenOfficeXML into near-html (XHTML).
-
-Certain attributes specified on the paragraph-level (inside `<w:pPr>`) of the xml are mapped to `<p>` attributes:
-
-* named Word styles are mapped as a `<class="STYLE">` attribute
-* formatting information is mapped to `<p style="ATTRIBUTES">`. Attributes include:
-  * text alignment: `text-align`
-  * indentation: `text-indent`, `padding-left`
-  * top and bottom margin: `margin-top`, `margin-bottom`
-  * list level (for list items): `xsweet-list-level`
-  * Word outline level, if specified: `xsweet-outline-level`
-  * WENDELL - anything else that gets extracted?
+# 8. Final rinse

-Many paragraph-level declarations of font, font size, and certain formatting are ignored as redundat and declared on runs inside the paragraph.
+* Removes redundant inline tags (e.g. `<b>`, `<i>`, `<u>`) whose formatting is already expressed as CSS in the `style` attribute
+* Inserts placeholder comments into empty `<div>`s and `<p>`s to ensure they are retained
+* Removes extraneous noise from endnote and footnote references
+* Removes redundant styling repeated on child elements

-XSweet represents runs of text inside paragraphs and their properties these as `<spans>` inside a `<p>`:
-* named Word styles are mapped as `<span class="STYLE">`
-* formatting is mapped to a `<span style="ATTRIBUTES">`. Attributes include:
-  * font: `font-family`
-  * font size: `font-size`
-  * small caps: `font-variant: small-caps`
+# Link handling

-WENDELL:
-* how do toggle styles like `<b>`, `<i>`, and `<kern>` get passed through? Since `kern` makes it through, I think that there must be a list of wrapper elements that get proactively removed, and anything else gets passed through. Is that right? If so, what's the list of elements that are ignored/removed by extract?
-* what else does extraction do?
-* do any of the attributes listed as extracted from runs (font, font size, small caps) ever come from the paragraph (w:pPr)?
+XSweet can recognize hyperlinks. To create a hyperlink, XSweet must find:
+1. text that looks like a URL (DEFINE)
+2. text that is somehow distinguished from the surrounding text (e.g. by a different font or formatting) (DEFINE)
\ No newline at end of file