See the page on the original [Pre-alpha Specs](pre-alpha-specs) (before the redesign).

HTMLevator supports structural induction and "section type inferencing" in the conversion of data from (appropriately coded) `.docx` files into HTML. It is designed to be used with XSweet.

"Structural induction" means HTMLevator will produce section elements where needed to "wrap" (structure) unorganized contents. "Section type inferencing" means recognizing, for example, a "Conclusions" section and submitting it to appropriate handling (including validation, to detect whether and where such a section is permitted).

XSweet (a companion project), via its Header Promotion pathway, can already produce HTML h1-h6 for Word `.docx` files containing Paragraph Styles named "Header 1" through "Header 6".

Within Word, these styles are bound by default to the appropriate Outline level. Hence, the structure of the HTML file that HTMLevator will produce can be previewed directly in Word (before running) by using the Outline View.

Accordingly, preparation of any Word file for HTMLevator requires *only* ensuring that the Header 1 - Header 6 styles are assigned correctly to section titles at their respective levels of the hierarchy.

XSweet (header promotion) and HTMLevator do the rest: XSweet makes HTML with h1-h6, then HTMLevator makes nested sections for the detected headers, and goes from there.

HTMLevator can also be used on files with no such preparation, but YMMV: it all depends on whether and how well XSweet header promotion detects h1-h6 in your file.

Two parts:
## Structural induction

Interpolate `<section>` elements appropriately.

Any sequence of HTML elements leading with a header (h1-h6) is wrapped as a section. Within the section, the header plus its accompanying elements is followed by sections for the contiguous (subsequent) lower-level headers.

I.e. h1 h2 h2 h3 h1 h2 h3 becomes section (h1 section (h2) section (h2 section (h3) ) ) section (h1 section (h2 section (h3) ) ).

In
```
h1
h2
h3
h2
h3
h1
h2
h2
h3
```
(todo: make up samples, unit testing)

Out

```
section
  h1
  section
    h2
    section
      h3
  section
    h2
    section
      h3
section
  h1
  section
    h2
  section
    h2
    section
      h3
```
Notes:

* Paragraphs and all other elements travel with the immediately preceding header.
* Paragraphs and blocks preceding the first header appear without a section wrapper (before the first section).
* Hence, sequences with no headers are unchanged.
* The logic should apply to 'section' elements as well as to wrapper elements.
* Hence, a properly sectioned HTML is returned unchanged.
* When levels are skipped (e.g. h4 appearing before h3), the extra section wrappers should *not* appear. So such a section comes wrapped as if it were at a higher level, although its header still indicates its 'presentation' level.

(Examples: correct, leading with h1; correct, leading with para contents then h1; correct, leading with h3; correct, leading with para contents then h3; skipping levels at the front; skipping levels inside)
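
The nesting rule and the notes above can be sketched in a few lines. This is only an illustration in Python, not the HTMLevator implementation (which is expected to be XSLT); the function names `induce_sections` and `outline`, and the representation of elements as `(tag, text)` pairs, are assumptions made for the example. It also assumes flat input, so the note about input that already contains `section` elements is not covered here.

```python
import re

HEADER = re.compile(r"^h([1-6])$")


def induce_sections(elements):
    """Nest a flat list of (tag, text) pairs into section wrappers.

    Hypothetical data model: each section is a dict
    {"level": n, "children": [...]}, whose children are either
    (tag, text) pairs or nested section dicts.
    """
    root = []    # top-level output: leading blocks, then sections
    stack = []   # currently open sections, outermost first
    for tag, text in elements:
        match = HEADER.match(tag)
        if match:
            level = int(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack and stack[-1]["level"] >= level:
                stack.pop()
            section = {"level": level, "children": [(tag, text)]}
            # A skipped level gets no extra wrapper: the new section
            # simply nests inside whatever section is still open.
            (stack[-1]["children"] if stack else root).append(section)
            stack.append(section)
        else:
            # Non-header content travels with the preceding header, or
            # stays unwrapped if no header has been seen yet.
            (stack[-1]["children"] if stack else root).append((tag, text))
    return root


def outline(nodes, depth=0):
    """Render the nested result as an indented list of tag names."""
    lines = []
    for node in nodes:
        if isinstance(node, dict):
            lines.append("  " * depth + "section")
            lines.extend(outline(node["children"], depth + 1))
        else:
            lines.append("  " * depth + node[0])
    return lines


flat = [(tag, "") for tag in "h1 h2 h3 h2 h3 h1 h2 h2 h3".split()]
print("\n".join(outline(induce_sections(flat))))
```

Run on the "In" sequence, this prints the same indented tree as the "Out" listing. A stack of open sections is used because each header only needs to know which enclosing sections are still open; that also makes the skipped-level behaviour fall out without special cases.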
## Validation of structures/content types
Once structures have been induced (inferred or projected over the element sequence), they need to be validated against rule sets appropriate to their workflows.

For example, a journal may have validation rules such as these:
* There must be top-level sections entitled "Introduction", "Methods and Materials", "Conclusion[s]" and "Bibliography". ("Conclusion[s]" means the 's' is optional.)
* They must appear in that order.
* An "Acknowledgements" section may optionally appear after "Conclusion[s]" but before "Bibliography".
* No section name can be repeated at the top level. (It's okay for subsections.)
* Sections at lower levels may be named anything except the names given.

The specific validation rules enforced by HTMLevator are TBD, based on requirements. We may also face a requirement to make these rules configurable.

For now, we presume we will use XPath within XSLT to enforce and perhaps express constraints, piping validation results directly into outputs and/or into a separate report (TBD).
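
To make the example journal rules above concrete, here is a minimal sketch of how they could be checked. It is written in Python purely for illustration and is not the planned XPath/XSLT implementation. The names `validate`, `canonical`, `RESERVED` and `REQUIRED_ORDER` are hypothetical, and two points the rules leave open are settled by assumption: other top-level titles (such as "Results") are allowed, and "Acknowledgements" counts as one of the reserved names.

```python
import re

# Reserved titles (assumption: "Acknowledgements" is reserved too).
RESERVED = {
    "Introduction": re.compile(r"Introduction"),
    "Methods and Materials": re.compile(r"Methods and Materials"),
    "Conclusion[s]": re.compile(r"Conclusions?"),  # the 's' is optional
    "Acknowledgements": re.compile(r"Acknowledgements"),
    "Bibliography": re.compile(r"Bibliography"),
}
REQUIRED_ORDER = [
    "Introduction", "Methods and Materials", "Conclusion[s]", "Bibliography",
]


def canonical(title):
    """Return the reserved name a title matches, or None."""
    for name, pattern in RESERVED.items():
        if pattern.fullmatch(title.strip()):
            return name
    return None


def validate(top_level, subsections=()):
    """Check section titles against the example rules; return error messages."""
    errors = []
    names = [canonical(t) or t.strip() for t in top_level]

    # No section name may be repeated at the top level.
    for name in sorted(set(names)):
        if names.count(name) > 1:
            errors.append(f"repeated top-level section: {name}")

    # Required sections must all be present, in the stated order.
    missing = [n for n in REQUIRED_ORDER if n not in names]
    errors.extend(f"missing top-level section: {n}" for n in missing)
    if not missing:
        positions = [names.index(n) for n in REQUIRED_ORDER]
        if positions != sorted(positions):
            errors.append("required sections are out of order")

    # Acknowledgements, if present, sits between Conclusion[s] and Bibliography.
    if "Acknowledgements" in names and not missing:
        ack = names.index("Acknowledgements")
        if not names.index("Conclusion[s]") < ack < names.index("Bibliography"):
            errors.append(
                "Acknowledgements must come after Conclusion[s] "
                "and before Bibliography"
            )

    # Lower-level sections may not reuse a reserved name.
    for title in subsections:
        if canonical(title):
            errors.append(f"reserved name used below top level: {title}")
    return errors


print(validate(
    ["Introduction", "Methods and Materials", "Bibliography"],
    subsections=["Introduction"],
))
```

Returning a list of messages, rather than stopping at the first failure, matches the idea of piping validation results into the output or a separate report. Keeping the rules in plain data structures is one way the configurability requirement might be approached.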