# Header promotion logic: formatting approach

## 1. Create paragraph representations

Create a representation of all the `<p>` tags in the document, including all the properties relevant to header promotion:

* Font size
* Font style
* Font weight
* Text-decoration
* Color
* Text-align
* Refined-style
* Average length of all paragraphs formatted the same way
* How often paragraphs formatted the same way appear in contiguous runs
* Whether it’s all caps
* How many paragraphs are formatted the same way
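
One way to model this representation is sketched below. It is only an illustration: the names `ParagraphFormat`, `Paragraph`, and `make_paragraph` are hypothetical, and it assumes the per-paragraph properties form a hashable key while the group-level figures (average length, run frequency, count) are computed later, once paragraphs have been grouped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParagraphFormat:
    """Formatting properties of a single <p>. Frozen, so instances are
    hashable and can be used as dictionary keys when grouping."""
    font_size: Optional[float] = None
    font_style: Optional[str] = None
    font_weight: Optional[str] = None
    text_decoration: Optional[str] = None
    color: Optional[str] = None
    text_align: Optional[str] = None
    refined_style: Optional[str] = None
    all_caps: bool = False


@dataclass
class Paragraph:
    text: str
    fmt: ParagraphFormat


def make_paragraph(text: str, **style) -> Paragraph:
    # str.isupper() is True only when the text has at least one cased
    # character and every cased character is uppercase.
    fmt = ParagraphFormat(all_caps=text.isupper(), **style)
    return Paragraph(text=text, fmt=fmt)
```

Two paragraphs with identical styling produce equal `ParagraphFormat` values, which is what makes grouping by shared formatting straightforward.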

## 2. Group paragraphs by shared formatting

Build a list of every unique combination of paragraph properties found in the document, then group the `<p>`s that share each combination.
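
The grouping step could look roughly like the following sketch. It assumes each paragraph has already been reduced to a `(format_key, text)` pair in document order, and it interprets "how often paragraphs appear in contiguous runs" as the average run length; `group_by_format` and the keys of the returned dictionaries are hypothetical names.

```python
from collections import defaultdict
from itertools import groupby
from statistics import mean


def group_by_format(paragraphs):
    """Group (format_key, text) pairs and compute per-group statistics.

    Returns {format_key: {"count", "avg_length", "avg_run"}}.
    """
    texts = defaultdict(list)
    for key, text in paragraphs:
        texts[key].append(text)

    # groupby only merges *adjacent* items with the same key, so each
    # group it yields is one contiguous run of identically styled <p>s.
    runs = defaultdict(list)
    for key, run in groupby(paragraphs, key=lambda pair: pair[0]):
        runs[key].append(sum(1 for _ in run))

    return {
        key: {
            "count": len(items),
            "avg_length": mean(len(t) for t in items),
            "avg_run": mean(runs[key]),
        }
        for key, items in texts.items()
    }
```

For example, a document whose paragraphs alternate between a heading style and short stretches of body text yields an average run of 1 for the heading style and a larger value for the body style.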

## 3. Examine the `<p>` tags grouped by properties, and promote some to headers

* Anything that is right-aligned is not considered for header promotion
* The most common type of paragraph in the document (i.e., the combination of paragraph properties that occurs most often) is not considered for header promotion
* Promote a paragraph format if all of the following hold:
    * The average run of consecutive paragraphs styled the same way is 4 or fewer (long runs of `<p>`s with the same styling suggest that the paragraphs are _not_ headings)
    * The font size specified is not the smallest font size found in the document
    * The average length of paragraphs with the given set of properties is not more than 120 characters
* Promote a paragraph format if all of the following hold:
    * It is centered
    * Its average length is less than 200 characters
    * Its average consecutive paragraph run is less than 2
* Promote if it is a paragraph of a type that _never_ ends in a period
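
The rules above can be sketched as a single pass over the grouped formats. This is a minimal illustration under stated assumptions, not a definitive implementation: it treats the three "promote" rules as alternatives (any one is sufficient), uses CSS-like values `"right"` and `"center"` for alignment, and breaks ties for "most common" arbitrarily. `FormatStats` and `formats_to_promote` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FormatStats:
    font_size: float
    text_align: str
    count: int            # paragraphs sharing this format
    avg_length: float     # average characters per paragraph
    avg_run: float        # average contiguous run length
    never_ends_in_period: bool


def formats_to_promote(groups):
    """groups: {name: FormatStats}. Returns the set of names to promote."""
    if not groups:
        return set()

    most_common = max(groups, key=lambda name: groups[name].count)
    smallest_size = min(g.font_size for g in groups.values())

    promoted = set()
    for name, g in groups.items():
        # Exclusions: right-aligned text and the dominant body format.
        if g.text_align == "right" or name == most_common:
            continue

        short_runs_rule = (
            g.avg_run <= 4
            and g.font_size != smallest_size
            and g.avg_length <= 120
        )
        centered_rule = (
            g.text_align == "center"
            and g.avg_length < 200
            and g.avg_run < 2
        )

        # Assumption: satisfying any one rule is enough to promote.
        if short_runs_rule or centered_rule or g.never_ends_in_period:
            promoted.add(name)
    return promoted
```

Keeping the exclusions ahead of the rules means a right-aligned or most-common format is never promoted, even if it happens to satisfy one of the three rules.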