# Header Promotion Logic
A review of the issues that come up with header promotion suggests there are three distinct strategies that can be used, each suited to particular circumstances, to provide "intelligent" (or at least non-random) header promotion. We are already using a combination of these methods.

Header promotion logic works in two phases: identifying candidate header paragraphs, and assigning header levels to them. The two phases are logically interdependent, however: if we assign header levels via an explicit map, we can use the same map to determine the candidate headers.
## Methods
Testing with our first demo proof-of-concept XSLT suggests there are three distinct methods of header assignment.
### Header assignment by nominal class
This strategy requires an external map or rule set that binds nominal classes (style names) to header levels. It works well, but only for a subset of Word documents, for which other strategies may also work well. Our version 1 header promotion used this strategy whenever it found style names that looked meaningful to it, namely (and only) names such as Header1. Consequently, many documents have presumably been tagged using this strategy that should not have been, and that would give better results with another method.

This isn't our first-choice strategy, since it is robust only when carefully supervised, with inputs edited to match. (That is, the best results come from defining your styles up front and then ensuring they are used, for example by locking documents down to a template.) It will not work reliably on most "wild" Word .docx inputs.
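As a minimal sketch of the idea (written in Python rather than the project's XSLT, with hypothetical style names and an in-memory dictionary standing in for the external map), a nominal-class map can both select the candidate headers and assign their levels:

```python
# Hypothetical map from Word style names to HTML header levels.
# These style names are illustrative only, not taken from a real template.
STYLE_TO_LEVEL = {"Header1": 1, "Header2": 2, "Header3": 3}


def promote_by_style(paragraphs, style_map=STYLE_TO_LEVEL):
    """Map (style_name, text) pairs to (tag, text) pairs."""
    result = []
    for style, text in paragraphs:
        level = style_map.get(style)
        # A style present in the map is both a header candidate and its level;
        # anything else stays an ordinary paragraph.
        tag = f"h{level}" if level is not None else "p"
        result.append((tag, text))
    return result


doc = [("Header1", "Intro"), ("Normal", "Some text."), ("Header2", "Details")]
print(promote_by_style(doc))
```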
### Header assignment by properties (heuristic analysis)
We can apply a rule set to determine priorities among families of paragraphs identified as headers, for example by making font size a determiner.

This is what we have done so far, and it appears to be working. The details of the rule sets for (a) identifying candidate paragraphs and (b) ranking them for header levels need to be documented and described (keep reading).
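To illustrate what "making font size a determiner" could look like, here is a rough Python sketch. It assumes candidates have already been identified and arrive as hypothetical `(font_size, text)` pairs; the actual rule set in the XSLT is not reproduced here and may weigh other properties as well.

```python
def rank_by_font_size(candidates):
    """Assign header tags to (font_size, text) pairs: larger size, higher level."""
    # Distinct sizes, largest first; cap at six levels because HTML stops at h6.
    sizes = sorted({size for size, _ in candidates}, reverse=True)
    level_for = {size: min(i + 1, 6) for i, size in enumerate(sizes)}
    return [(f"h{level_for[size]}", text) for size, text in candidates]


candidates = [(18, "Title"), (14, "Section"), (12, "Subsection"), (14, "Another section")]
print(rank_by_font_size(candidates))
```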
### "Normalized" or "Regular Order" header level assignment
This is experimental: a concept, not yet implemented, that offers an approach to the interesting set of problems noted in Issue #83.

This method relies on heuristic analysis to identify candidate headers (that is, which paragraphs are likely to be headers based on detected formatting and other criteria). For the purpose of assigning levels to the headers, however (the most arbitrary and "noisiest" aspect of header promotion), it assumes they appear in "regular order", that is, following this pattern (expressed as a regular expression):
```
(h1, (h2, (h3, (h4, (h5, h6*)*)*)*)*)*
```
Once likely header lines are identified, ranking them under the assumption that this order is followed is straightforward, and comparing the results of such an ordering with the results of another strategy offers an important data point. If the two results are the same, then we know that (a) the document presents its headers in regular order, and (b) header promotion has succeeded in representing it. If they are different (probably the more likely case, unless the documents have been regulated somehow), then either header promotion's rules failed in this instance to register header levels correctly (perhaps it swapped levels, or had worse problems), or the header levels in the document are not in regular order to begin with (as far as header promotion can see). In any case, if the two results are different, both are available for examination and possibly for merging or reconciling. (If the two are the same, the merge is trivial.)

Note that only a subset of Word documents have their headers in regular order, and documents that depart from it are not wrong to do so. The result of imposing regular order on such documents will be "garbage", and detectable as such (inasmuch as it too violates the pattern described). Note also that this sense of regular order is the same sense that must be assumed by structural induction based on header placement, as in the HTMLevator project.
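Whether a given sequence of header levels follows the pattern can be checked mechanically. The following Python sketch is one way to do it (the function name is hypothetical, and this is not part of the XSLT): the pattern is equivalent to requiring that the first header be an h1 and that no header be more than one level deeper than the header before it.

```python
def is_regular_order(levels):
    """True if header levels (1-6) match (h1, (h2, (h3, (h4, (h5, h6*)*)*)*)*)*."""
    previous = 0  # before the first header, only h1 may follow
    for level in levels:
        if not 1 <= level <= 6:
            return False
        # A header may go at most one level deeper than its predecessor,
        # but may return to any shallower level.
        if level > previous + 1:
            return False
        previous = level
    return True


print(is_regular_order([1, 2, 3, 2, 1]))  # levels never skip downward
print(is_regular_order([1, 3]))           # h3 directly under h1 skips h2
```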
## Runtime configuration and interface
The interface to the XSLT is a runtime parameter exposed under the name `assignment`. It takes one of two kinds of value: either a path to an XML file, e.g. `assignment=headerstyles.xml`, or the simple value `regular-order` (`assignment=regular-order`). Any other value for this parameter, or no value, falls back on the default behavior, namely property-based heuristic assignment.

Internally, the stylesheet accepting this parameter operates as follows. When the string value given resolves to a file, this file is consulted to provide a map from nominal styles found in the Word document to header levels in the resulting HTML (the "header assignment by nominal class" strategy).

Otherwise, if the string has the value `regular-order` (presumably never the name of a configuration file that would pre-empt it), regular order is imposed.

Otherwise, we get the default mapping of header lines to levels based on property analysis.
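The decision order described above can be summarized in a short Python sketch. The function name and the returned strategy labels are hypothetical, and the real logic lives in the XSLT; the point is only that the file check comes first, then the literal value, then the default.

```python
import os


def choose_strategy(assignment=None):
    """Return a label for the strategy implied by the 'assignment' parameter."""
    # 1. A value that resolves to an existing file is treated as a style map.
    if assignment and os.path.isfile(assignment):
        return "nominal-class"
    # 2. The literal value selects regular-order assignment.
    if assignment == "regular-order":
        return "regular-order"
    # 3. Any other value, or no value, falls back to property analysis.
    return "property-heuristic"


print(choose_strategy("regular-order"))
print(choose_strategy())
```

Because the file check runs first, a file that happened to be named `regular-order` would pre-empt the literal value, which is the edge case the text notes.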
## Implementation (somewhat out of date as of 20170420)
(The notes below describe the XSLT internals. The current XSLT should be consulted; it now has better comments. Note that what is described is the implementation of the entire header promotion pipeline, whereas the issues under discussion bear only on the final step, namely the assignment of candidate header lines to categories.)

In `digest-paragraphs.xsl`, paragraphs in a document (resulting from XSweet extraction) are submitted to a sequence of operations to determine whether and where to promote certain paragraphs to HTML **h1** through **h6**.

The actual conversion is performed by a separate stylesheet, which is produced by filtering the results of this initial transform.