HTMLevator supports structural induction and "section type inferencing" in conversion of data from (appropriately coded) .docx files into HTML. It is designed to be used with XSweet.
HTMLevator currently includes three separate applications. They can be used together or separately, but they are best used in combination, since one of them may be less useful without another.
A general-purpose "mapper" enables class or style values on HTML elements to be mapped systematically. For example, every "font-style: italic" can be made class='emph'; write whatever mappings you need. This is very useful for data cleanup and consolidation, and/or as preparation for any other steps.
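As an illustration of the kind of mapping described above, here is a minimal Python sketch. It is not HTMLevator's implementation (which is XSLT); the mapping table, the function names and the 'strong' class are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping table: (CSS property, value) -> class name to add.
STYLE_TO_CLASS = {
    ("font-style", "italic"): "emph",
    ("font-weight", "bold"): "strong",
}

def parse_style(style):
    """Split a style attribute into a list of (property, value) pairs."""
    pairs = []
    for decl in style.split(";"):
        if ":" in decl:
            prop, value = decl.split(":", 1)
            pairs.append((prop.strip(), value.strip()))
    return pairs

def map_styles(root, mapping=STYLE_TO_CLASS):
    """Replace mapped style declarations with class names, in place."""
    for el in root.iter():
        style = el.get("style")
        if style is None:
            continue
        classes = el.get("class", "").split()
        kept = []  # declarations that have no mapping stay in the style attribute
        for prop, value in parse_style(style):
            cls = mapping.get((prop.lower(), value.lower()))
            if cls is None:
                kept.append(f"{prop}: {value}")
            elif cls not in classes:
                classes.append(cls)
        if classes:
            el.set("class", " ".join(classes))
        if kept:
            el.set("style", "; ".join(kept))
        else:
            del el.attrib["style"]
    return root

doc = ET.fromstring(
    '<div><p>Plain <span style="font-style: italic">stressed</span> and '
    '<span style="font-style: italic; color: red">both</span></p></div>'
)
map_styles(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Declarations without a mapping (here, the color) are left in place, so the step can be run repeatedly as the mapping table grows.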
XSweet can convert paragraph (p) elements in HTML into h1-h6 elements. It uses one of several means to determine which paragraphs receive this treatment. The most robust is to configure it yourself with a styles mapping file, which is another runtime configuration you set up yourself (and which can be made sensitive to consistent code points in your inputs). Or, if your data is sufficiently regular, another method may be less onerous: if asked, HTMLevator's header promotion will 'guess' appropriate headers based on a ranking of format (style) attributes in the inputs.
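The two approaches can be sketched roughly as follows, in Python rather than the project's XSLT. The class names in the mapping, the use of font-size as the only ranked attribute, and the 12pt body-size threshold are assumptions for illustration; the actual ranking HTMLevator applies is not described here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical explicit mapping: class value on a paragraph -> header element.
CLASS_TO_HEADER = {"Title": "h1", "Heading1": "h2"}

def promote_by_mapping(root, mapping=CLASS_TO_HEADER):
    """Rename p elements whose class appears in the mapping."""
    for p in list(root.iter("p")):
        target = mapping.get(p.get("class", ""))
        if target:
            p.tag = target
    return root

def font_size(p):
    """Font size in points from the style attribute, or None if absent."""
    m = re.search(r"font-size:\s*(\d+(?:\.\d+)?)pt", p.get("style", ""))
    return float(m.group(1)) if m else None

def promote_by_guess(root, body_size=12.0):
    """Rank font sizes above body text: the largest becomes h1, the next h2, ..."""
    found = {font_size(p) for p in root.iter("p")}
    sizes = sorted((s for s in found if s is not None and s > body_size), reverse=True)
    level = {size: f"h{i}" for i, size in enumerate(sizes[:6], start=1)}
    for p in list(root.iter("p")):
        target = level.get(font_size(p))
        if target:
            p.tag = target
    return root

mapped = promote_by_mapping(ET.fromstring(
    '<body><p class="Title">Report</p><p class="Normal">Body text.</p></body>'
))
guessed = promote_by_guess(ET.fromstring(
    '<body>'
    '<p style="font-size: 18pt">Report</p>'
    '<p style="font-size: 14pt">Background</p>'
    '<p style="font-size: 12pt">Body text.</p>'
    '<p style="font-size: 14pt">Findings</p>'
    '</body>'
))
print([el.tag for el in mapped])   # ['h1', 'p']
print([el.tag for el in guessed])  # ['h1', 'h2', 'p', 'h2']
```

The explicit mapping is predictable but needs to be written per input style set; the guess needs no configuration but depends on the formatting being applied consistently.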
(What follows remains from original notes as to requirements: see the repository readme for up to date description of the implementation.)
Any sequence of HTML elements leading with a header (h1-h6) is wrapped as a section. Within each section, the header and its (block-level) elements come first, followed by nested sections for any contiguous (subsequent) lower-level headers.
Paragraphs and all other elements travel with the immediately preceding header
Paragraphs and blocks preceding the first header appear without a section wrapper (before the first section)
Hence sequences with no headers are unchanged
The logic should apply to 'section' elements as well as to other wrapper elements
Hence, a properly sectioned HTML is returned unchanged, but one whose headers do not lead sections is "repaired".
When header levels are skipped (e.g. an h4 appearing before any h3), the extra section wrappers should not appear. Such a section comes wrapped as if it were at a higher level, although its header still indicates its 'presentation' level.
correct, leading with h1;
correct, leading with para contents then h1;
correct, leading with h3;
correct, leading with para contents then h3;
skipping levels at the front;
skipping levels inside
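The wrapping rules above can be sketched with a stack of open sections. This is a Python illustration under stated assumptions, not the XSLT that HTMLevator provides: it handles only flat input (a body whose children are headers and blocks) and leaves out the handling of existing 'section' and wrapper elements. The names induce_sections and outline are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

HEADER = re.compile(r"h([1-6])$")

def induce_sections(body):
    """Build a new tree in which each header-led run is wrapped in a section.

    Child elements are shared with the input tree, not copied.
    """
    out = ET.Element(body.tag, body.attrib)
    out.text = body.text
    stack = []  # open sections as (header level, section element), outermost first
    for child in body:
        m = HEADER.match(child.tag)
        if m:
            level = int(m.group(1))
            # A header closes every open section at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            parent = stack[-1][1] if stack else out
            stack.append((level, ET.SubElement(parent, "section")))
        # Every element travels with the innermost open section, if there is one.
        (stack[-1][1] if stack else out).append(child)
    return out

def outline(el):
    """Compact view of the result: tag names, with sections as nested lists."""
    return [outline(c) if c.tag == "section" else c.tag for c in el]

flat = ET.fromstring(
    "<body><p>intro</p><h1>A</h1><p>a</p><h2>A.1</h2><p>a1</p>"
    "<h4>deep</h4><p>d</p><h2>A.2</h2><h1>B</h1></body>"
)
print(outline(induce_sections(flat)))
# ['p', ['h1', 'p', ['h2', 'p', ['h4', 'p']], ['h2']], ['h1']]
```

Because nesting depth comes from the stack rather than from the header number, the h4 sits exactly one wrapper below the h2 that precedes it, and a document that opens with h3 gets a top-level section. The h4 element itself is left as it is, so its 'presentation' level is preserved.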
section type inferencing
(At time of writing, these requirements are not addressed by HTMLevator.)
Section type inferencing means recognizing, for example, a Conclusions section (by means of its title and/or other properties) and submitting it to appropriate handling -- including validation, to detect whether and where such a section is required, expected or permitted. HTMLevator currently does not provide for section type inferencing, except to note that it is a natural requirement and one that can be readily accomplished in this architecture.
On HTML files whose section levels are regularly and systematically indicated by a "regular order" of headers, the XSLT provided here will reliably create a nested section structure.
For future development - validation of structures/content types
Once structures have been induced (inferred or projected over the element sequence), they need to be validated against rule sets appropriate to their workflows.
For example, a journal may have validation rules such as these:
There must be top-level sections entitled "Introduction", "Methods and Materials", "Conclusion[s]" and "Bibliography". ("Conclusion[s]" means the 's' is optional.)
They must appear in that order.
An "Acknowledgements" section may optionally appear after "Conclusion[s]" but before "Bibliography".
Sections at lower levels may be named anything except the names given.
No section name can be repeated at the top level. (It's okay to repeat subsection names as long as they avoid the top-level section names.)
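To make the example rules concrete, here is a minimal Python sketch of a checker that runs over an already sectioned document. It is an illustration only: HTMLevator does not currently enforce these rules, and the plan stated here is to use XPath within XSLT. The function names are invented, and treating "Acknowledgements" as one of the reserved names is an assumption about what "the names given" covers.

```python
import re
import xml.etree.ElementTree as ET

# Canonical (lower-cased) names; "conclusion" also stands for "Conclusions".
REQUIRED = ["introduction", "methods and materials", "conclusion", "bibliography"]
OPTIONAL = "acknowledgements"
RESERVED = set(REQUIRED) | {OPTIONAL}  # assumption: the optional name is reserved too

def canonical(title):
    """Normalize whitespace and case; drop the optional 's' on Conclusion[s]."""
    name = " ".join(title.split()).lower()
    return "conclusion" if name == "conclusions" else name

def title_of(section):
    """Canonical text of the header leading a section, or '' if there is none."""
    if len(section) and re.fullmatch(r"h[1-6]", section[0].tag):
        return canonical("".join(section[0].itertext()))
    return ""

def validate(body):
    """Return a list of rule violations (empty when the document passes)."""
    errors = []
    sections = body.findall("section")
    top = [title_of(s) for s in sections]
    for name in REQUIRED:
        if name not in top:
            errors.append(f"missing required section: {name}")
    positions = [top.index(name) for name in REQUIRED if name in top]
    if positions != sorted(positions):
        errors.append("required sections are out of order")
    if OPTIONAL in top:
        i = top.index(OPTIONAL)
        after_ok = "conclusion" not in top or top.index("conclusion") < i
        before_ok = "bibliography" not in top or i < top.index("bibliography")
        if not (after_ok and before_ok):
            errors.append("acknowledgements must sit between conclusion and bibliography")
    for name in sorted(set(top)):
        if name and top.count(name) > 1:
            errors.append(f"repeated top-level section: {name}")
    for section in sections:
        for sub in section.findall(".//section"):  # descendants only
            if title_of(sub) in RESERVED:
                errors.append(f"reserved name used below top level: {title_of(sub)}")
    return errors

good = ET.fromstring(
    "<body><section><h1>Introduction</h1><section><h2>Scope</h2></section></section>"
    "<section><h1>Methods and Materials</h1></section>"
    "<section><h1>Conclusions</h1></section>"
    "<section><h1>Acknowledgements</h1></section>"
    "<section><h1>Bibliography</h1></section></body>"
)
bad = ET.fromstring(
    "<body><section><h1>Introduction</h1><section><h2>Bibliography</h2></section></section>"
    "<section><h1>Conclusion</h1></section>"
    "<section><h1>Bibliography</h1></section>"
    "<section><h1>Introduction</h1></section></body>"
)
print(validate(good))
print(validate(bad))
```

The rule data sits in a few module-level constants, which is one simple way to keep the rules separate from the checking logic if they later need to become configurable.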
The specific validation rules enforced by HTMLevator are TBD, based on requirements. We also expect to face a requirement to make these rules configurable. This may be done via a "driver" config file or a meta-stylesheet (a la Schematron).
For now, we presume we will use XPath within XSLT to test against constraints, piping validation results directly into outputs and/or into a separate report (TBD).