... | ... | @@ -30,10 +30,12 @@ This method relies on heuristic analysis to identify candidate headers (that is, |
|
|
(h1, (h2, (h3, (h4, (h5, h6*)*)*)*)*)*
|
|
|
```
|
|
|
|
|
|
Once likely header lines are identified, ranking them under the assumption that this order is followed is straightforward, and comparing the result of such an ordering with the result of another strategy offers an important information point. If the two results are the same (and in regular order), then we know that (a) the document presents its headers in regular order, and (b) header promotion has succeeded in representing it. If they are different (which is probably more likely unless the document's assignment of styles has been regulated somehow), we know that either header promotion's rules failed in this instance to register header levels correctly (maybe it swapped levels or had worse problems), or the header levels in the document do not follow "Regular Order" to begin with (as far as header promotion can see). In either case, if the two results are different, both are then available for examination and possibly merging or reconciling. (If the two are the same, the merge/reconcile is trivial.)
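As a rough illustration of the comparison step, the sketch below lists the positions where two strategies assign different levels to the same candidate headers. The function name and the representation of a result as a plain list of integer levels are assumptions made for this example; they are not taken from the stylesheets themselves.

```python
from typing import List, Tuple


def compare_assignments(
    promoted: List[int], regular: List[int]
) -> List[Tuple[int, int, int]]:
    """Return (position, promoted_level, regular_level) wherever the two disagree.

    Hypothetical helper: each list holds one level (1-6) per candidate header,
    in document order. An empty result means the two strategies agree.
    """
    if len(promoted) != len(regular):
        raise ValueError("both strategies must rank the same candidate headers")
    return [
        (i, p, r)
        for i, (p, r) in enumerate(zip(promoted, regular))
        if p != r
    ]
```

An empty list corresponds to the trivial merge; a non-empty list gives the positions worth examining or reconciling.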
|
|
|
|
|
|
Note that only a subset of Word documents have their headers in Regular Order, and documents that depart from it are not wrong to do so. The result of imposing regular order on such documents will be "garbage", and detectable as such (inasmuch as it too violates the pattern described). Note that this sense of regular order is the same sense that must be assumed by structural induction based on header placement, as in the HTMLevator project.
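The pattern can be checked mechanically. Read as a flat sequence of levels, the content model `(h1, (h2, (h3, (h4, (h5, h6*)*)*)*)*)*` amounts to two rules: the first header is an h1, and no header is more than one level deeper than the one before it. The following is a minimal sketch under that reading; the function name and the integer encoding of levels are assumptions for illustration.

```python
from typing import Sequence


def is_regular_order(levels: Sequence[int]) -> bool:
    """Check a flat sequence of header levels against the regular-order pattern.

    Hypothetical helper: levels are integers 1-6 in document order.
    """
    previous = 0  # so that the first header is required to be level 1
    for level in levels:
        if not 1 <= level <= 6:
            return False
        if level > previous + 1:  # skipped a level on the way down
            return False
        previous = level
    return True
```

For instance, `[1, 2, 3, 2, 1]` passes, while `[1, 3]` (a skipped level) and `[2, 3]` (no leading h1) do not.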
|
|
|
|
|
|
Also note that for this expansion into regular order to work, we assume that candidate headers have been identified correctly and that these correspond to the headers that appear in regular order. (That is, the document in question doesn't imply some "special notion" of regular order, as sometimes happens.) The step of making the first header type to appear into an h1 while the second type becomes h2, etc., is trivial, but if the step before has included "spurious headers", we have GIGO (garbage in, garbage out). Accordingly, comparing these results to other results may also help diagnose these issues in documents.
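The "trivial" step can be sketched as follows: each distinct header type is ranked by the order in which it first appears, and every header takes the rank of its type. This is only an illustration, not the project's XSLT; the function name, the use of style names as type keys, and the cap at level 6 are all assumptions.

```python
from typing import Dict, Hashable, List, Sequence


def assign_by_first_appearance(header_types: Sequence[Hashable]) -> List[int]:
    """Map each header to a level according to when its type first appeared.

    Hypothetical helper: header_types holds one key (e.g. a style name) per
    candidate header, in document order. Capping at 6 is an assumption.
    """
    rank: Dict[Hashable, int] = {}
    levels: List[int] = []
    for header_type in header_types:
        if header_type not in rank:
            rank[header_type] = min(len(rank) + 1, 6)
        levels.append(rank[header_type])
    return levels
```

With the made-up input `["Title", "Sub", "Sub", "Title", "Minor"]` this yields `[1, 2, 2, 1, 3]`. The result skips from level 1 to level 3, so it does not itself conform to the regular-order pattern, which is the kind of detectable mismatch discussed above. A spurious candidate at the start of the list would likewise shift every later type down a level.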
|
|
|
|
|
|
## Runtime configuration and interface
|
|
|
|
|
|
The interface to the XSLT is a runtime parameter exposed under the name 'assignment'. It takes one of two kinds of value: a path to an XML file (e.g. `assignment=headerstyles.xml`) or the simple value 'regular-order' (`assignment=regular-order`). Any other value for this parameter, or no value, falls back on the default behavior, namely property-based heuristic assignment.
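The three-way choice can be pictured with the sketch below. It is not the stylesheet's actual logic: the function name, the mode labels, and in particular the rule that a value ending in `.xml` counts as a file path are assumptions made for this example, since the text does not say how a path is recognized.

```python
from typing import Optional, Tuple


def resolve_assignment(value: Optional[str]) -> Tuple[str, Optional[str]]:
    """Classify an 'assignment' parameter value into one of three modes.

    Hypothetical helper. Returns (mode, path), where path is set only for
    the mapping-file mode.
    """
    if value == "regular-order":
        return ("regular-order", None)
    # Assumption: anything ending in .xml is treated as a path to a mapping file.
    if value and value.lower().endswith(".xml"):
        return ("mapping-file", value)
    # Any other value, or no value, falls back on heuristic assignment.
    return ("heuristic", None)
```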
|
... | ... | |