Expose properties assigned via paragraph style as literals, not just class assignments.
Here we have a sample (Friedman ch 1) that 'works' in header promotion by accident. Necessary properties are not being represented in property-based group assignment (for header promotion) because they are assigned only via style name and represented only out of line (a CSS @class reference to properties written into the HTML <style> as literal CSS). These properties need to be exposed at the element level for header promotion to work.
Either we need a CSS-rectification pass that would deal with this (which must be able to parse our CSS, among other things), or we need to emit the info at the element level via @style (as well as @class).
The latter is probably more robust, but it will also impact the CSS-abstract task, which becomes all the more important if/as @style gets marked promiscuously everywhere for header promotion ...
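To make the second option concrete, here is a minimal Python sketch of copying class-assigned properties onto @style while leaving @class in place. It is not the pipeline's actual code: the function names (`parse_declarations`, `parse_class_rules`, `inline_class_properties`) and the sample class name and values are invented for illustration, and it assumes the generated CSS contains only flat class selectors and that the HTML is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the generated CSS uses only simple rules such as
# ".ClassName { prop: value; ... }" (no nesting, no at-rules).
RULE = re.compile(r"\.([\w-]+)\s*\{([^}]*)\}")


def parse_declarations(text):
    """Turn 'prop: value; prop: value' into a dict of properties."""
    decls = {}
    for decl in text.split(";"):
        if ":" in decl:
            prop, value = decl.split(":", 1)
            decls[prop.strip()] = value.strip()
    return decls


def parse_class_rules(css_text):
    """Map each class name to the properties its rule declares."""
    return {name: parse_declarations(body) for name, body in RULE.findall(css_text)}


def inline_class_properties(root, rules):
    """Write class-assigned properties onto @style; @class is left untouched."""
    for el in root.iter():
        merged = {}
        for cls in el.get("class", "").split():
            merged.update(rules.get(cls, {}))
        # Properties already literal on the element win over class-derived ones.
        merged.update(parse_declarations(el.get("style", "")))
        if merged:
            el.set("style", "; ".join(f"{k}: {v}" for k, v in merged.items()))


# Hypothetical sample input, not taken from the Friedman document.
css = ".Heading1 { font-size: 16pt; font-weight: bold; text-align: center }"
doc = ET.fromstring('<div><p class="Heading1">Chapter One</p><p>Body text</p></div>')
inline_class_properties(doc, parse_class_rules(css))
print(ET.tostring(doc, encoding="unicode"))
```

Letting existing @style values override class-derived ones is a design guess here; it mirrors how inline style normally takes precedence over a class rule.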
Todo: capture styled properties from docx on HTML @style (i.e., not only into <style>); address css-abstract accordingly.
Alex's original bug report for this issue. See comment for above revision:
Here are 2 issues/observations, then a proposition for a comprehensive solution I think could be really powerful if it's possible:
- It looks like the styling had an effect on what was promoted. In this case, the author used the styles mostly right, but inconsistencies in styling caused some headings not to be promoted. Take the example of the first 3 headings ("Part I. Border Crossings", "Chapter One", "Documenting Sovereignty"). They all look the same, but the Word styles used mean the 1st one isn't promoted but the other two are.
- The promoted h2s are not formatted consistently - there are 3 distinct formats that have all been elevated to h2s.
Grand Solution Proposition:
- Run normal header promotion
- Check for consistency of appearance among each heading level (see #67 for more details), by extracting the important display information from each element that's been promoted, comparing this across all the same-level headers, and adjusting heading levels to resolve inconsistencies. I'd say there are 3 important display bits to match on: 1) alignment, 2) font size, 3) bold/italic/underline. This leaves us with a unique "appearance fingerprint" for each heading level.
- Combine heading levels that have the same appearance (#64)
- Finally, take another pass through the Word doc to promote uncaught headers that match a heading level on the important formatting
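
The consistency check and the merging of same-looking levels could be sketched roughly as below. This is only an illustration under stated assumptions: the `Fingerprint` fields follow the three display bits named above, but the CSS property names used as input, the default of left alignment, and the ranking rule (each distinct fingerprint becomes one level, ordered by the level it was most often promoted to, then by first appearance) are all guesses rather than anything the pipeline does today.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Fingerprint(NamedTuple):
    """Hypothetical appearance key for a heading."""
    align: str
    font_size: str
    bold: bool
    italic: bool
    underline: bool


def fingerprint(style):
    """Reduce a dict of CSS properties to the display bits we match on."""
    return Fingerprint(
        align=style.get("text-align", "left"),  # assumed default
        font_size=style.get("font-size", ""),
        bold=style.get("font-weight") == "bold",
        italic=style.get("font-style") == "italic",
        underline="underline" in style.get("text-decoration", ""),
    )


def regroup_levels(headers):
    """Return new levels so each distinct fingerprint maps to exactly one level.

    `headers` is a list of (initial_level, style_dict) pairs in document order.
    """
    levels_seen = defaultdict(Counter)
    first_seen = {}
    for index, (level, style) in enumerate(headers):
        fp = fingerprint(style)
        levels_seen[fp][level] += 1
        first_seen.setdefault(fp, index)

    def rank(fp):
        # Most frequent initial level; a tie in counts goes to the lower level number.
        typical = max(levels_seen[fp].items(), key=lambda kv: (kv[1], -kv[0]))[0]
        return (typical, first_seen[fp])

    ordered = sorted(levels_seen, key=rank)
    new_level = {fp: i + 1 for i, fp in enumerate(ordered)}
    return [new_level[fingerprint(style)] for _, style in headers]
```

With this rule, a heading promoted to the wrong level is pulled to the level its look-alikes mostly have, and several distinct formats that all landed on one level are split into consecutive levels.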
In this example:
- The headers are promoted as they are in the table
- All the things that were promoted to h2s get broken out into 3 different heading levels, each with their own display signature
- The "June 25, 2009" header that was initially promoted to an h1 would be seen to match the display of the other headings that should be the same level, and its heading level is changed to match
- Finally, check the Word doc for unpromoted headers, looking for things that share the same formatting as one of the heading levels. The "Part I. Border Crossings" heading that wasn't initially promoted gets promoted.
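
The last step in this walk-through, picking up unpromoted paragraphs that look like an existing heading level, might be sketched as follows. The data shape (dicts with "text", "style" and "level" keys), the function names, and the formatting values in the sample are all made up for illustration; only the heading texts come from the example above.

```python
def fingerprint(style):
    """Hypothetical appearance key: alignment, font size, bold/italic/underline."""
    return (
        style.get("text-align", "left"),  # assumed default
        style.get("font-size", ""),
        style.get("font-weight") == "bold",
        style.get("font-style") == "italic",
        "underline" in style.get("text-decoration", ""),
    )


def promote_uncaught(paragraphs):
    """Give unpromoted paragraphs the level of a heading they look like.

    "level" is None for paragraphs the first pass did not promote.
    Returns new dicts; the input list is not modified.
    """
    # Learn which level each appearance belongs to from promoted headings.
    level_for = {}
    for p in paragraphs:
        if p["level"] is not None:
            level_for.setdefault(fingerprint(p["style"]), p["level"])

    result = []
    for p in paragraphs:
        level = p["level"]
        if level is None:
            # Stays None when no heading level shares this appearance.
            level = level_for.get(fingerprint(p["style"]))
        result.append({**p, "level": level})
    return result


# Invented formatting values; levels are illustrative only.
look = {"text-align": "center", "font-size": "14pt", "font-weight": "bold"}
sample = [
    {"text": "Part I. Border Crossings", "style": look, "level": None},
    {"text": "Chapter One", "style": look, "level": 2},
    {"text": "Documenting Sovereignty", "style": look, "level": 2},
]
for p in promote_uncaught(sample):
    print(p["level"], p["text"])
```

One caveat with matching on appearance alone: any body paragraph that happens to share a heading's formatting would be promoted too, so a real pass would probably need an extra guard such as paragraph length.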
I like this example because it pulls together several different steps that would each be helpful on their own. If this is possible I think it would be a really big win. What do you think, Wendell? I'd be interested to hear your thoughts.