Improve header level guesses by considering order of appearance
As mentioned in the 4/15/17 comment on ticket #81 (closed):
How can header promotion consider the order of appearance of the headers to improve the heading level it guesses? How is it currently deciding which level to promote things to? Does it consider the resulting structure of the document?
We'll probably want to implement some checks for this. My first thought is that once header promotion has identified all the different heading levels (say there are 3 different formats, so it knows for sure there should be 3 different heading levels), it could then order them according to whatever looks like it produces the most credible heading structure.
As an example, let's say it finds 3 levels of headers and initially promotes them as follows: h2 h1 h1 h3 h1 h3 h3 h1
A final step would recognize that reassigning the levels like this gives a more sensible structure, and would make the change: h1 h2 h2 h3 h2 h3 h3 h2
We'd need to give it a few rules it can use to score the candidate structures and choose the best one. A good start could be to say that generally (not rigidly) 1) lower level headers should be nested under higher level ones, and 2) consecutive headings should either stay at the same level or move up or down by 1 level.
I don't think the formatting itself will ever tell us much about what level a group of headings should be, because authors use formatting in so many different and entirely inconsistent ways. Considered in a vacuum, there's no reasonable way to say something like "bolding denotes a higher level heading than underlining, which denotes a higher level than italics". I think the best way to improve heading level inference will be to look at the order in which the headings appear and what that might say about the structure.
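To make the idea concrete, here's a rough sketch of the reordering step, not a proposal for the actual implementation. It assumes header promotion has already grouped the headings by format and given each group a provisional level. The function names (`score`, `best_relabeling`) and the penalty weights (1 point per violation) are made up for illustration. It tries every way of assigning the levels to the groups and keeps the one that breaks the two rules above the least.

```python
from itertools import permutations

def score(levels):
    """Penalty for a sequence of heading levels; lower is better, 0 breaks no rule."""
    penalty = 0
    shallowest_seen = None  # smallest level number (highest-level heading) seen so far
    for i, level in enumerate(levels):
        # Rule 1: a sub-heading should have some higher-level heading before it.
        if level > 1 and (shallowest_seen is None or shallowest_seen >= level):
            penalty += 1
        # Rule 2: consecutive headings should differ by at most one level.
        if i > 0:
            penalty += max(0, abs(level - levels[i - 1]) - 1)
        shallowest_seen = level if shallowest_seen is None else min(shallowest_seen, level)
    return penalty

def best_relabeling(levels):
    """Try every assignment of levels 1..n to the n heading groups; return the best one."""
    groups = sorted(set(levels))  # each distinct provisional level stands for one format group
    candidates = []
    for perm in permutations(range(1, len(groups) + 1)):
        mapping = dict(zip(groups, perm))
        candidates.append([mapping[level] for level in levels])
    return min(candidates, key=score)

initial = [2, 1, 1, 3, 1, 3, 3, 1]  # h2 h1 h1 h3 h1 h3 h3 h1
print(best_relabeling(initial))      # [1, 2, 2, 3, 2, 3, 3, 2]
```

With only a handful of heading formats per document, brute-forcing every permutation should be cheap (3 formats means 6 candidates). The penalties are soft, so a document that really does skip a level still gets an answer, just with a worse score.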
What do you think?