|
|
# HTMLcognizer - draft specs
|
|
|
|
|
|
We will build an XSLT pipeline that will accept Word documents that conform to a certain defined 'styling profile' (subject to our definition), and produce HTML from these documents containing structures reflecting document organization (i.e., explicit `div` or `section` in the HTML), as indicated by their styling.
|
|
|
We will build an XSLT pipeline that will accept Word documents that conform to a particular 'styling profile' (subject to our definition), and produce HTML from these documents containing structures reflecting document organization (i.e., explicit `div` or `section` in the HTML), as indicated by their styling.
|
|
|
|
|
|
The first step of this pipeline can be executed in XSweet, which produces an HTML Typescript document from a `.docx` source. If XSweet needs to be extended or modified to support the functionality described here, such work is in scope for this project (although present requirements do not appear to necessitate this). Like XSweet, this pipeline will require only XSLT 2.0 and SaxonHE, and will be amenable to integration into INK.
|
|
|
|
|
|
Our target is an HTML file whose `body` is divided into a sequence of `section` elements (let's say), with no nested subsections. Further, based on the literal contents of the nominal section titles, each `<section>` is to be assigned a section type, captured as a `class` attribute value.
|
|
|
Our target is an HTML file whose `<body>` is divided into a sequence of `<section>` elements (let's say), with no nested subsections. Further, based on the literal contents of the nominal section titles, each `<section>` is to be assigned a section type, captured as a `class` attribute value.
|
|
|
|
|
|
The names of the section heads must be validatable against known constraints expressed (and ultimately configurable) in the pipeline. Initially, section names will be things like "Methods" and "Conclusion", as shown (for example) here:
|
|
|
|
|
|
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172213
|
|
|
|
|
|
Here is a provisional list of section headers, with tokens for indicating their (section) types in the result HTML:
|
|
|
Here is a provisional list of section headers, with proposed tokens for indicating their (section) types in the result HTML:
|
|
|
|
|
|
- Introduction `introduction`
|
|
|
- Methods and materials `methods`
|
... | ... | @@ -37,23 +37,23 @@ The pipeline will have exception handling logic for sections that are unrecogniz |
|
|
|
|
|
3. A subsequent process will validate these sections against constraints such as "is the text of the Section Title a recognized value?" (such as "Conclusions" or "Discussion").
|
|
|
|
|
|
This validation must be super-simple for purposes of process intelligibility and transparency. Ordinarily such validations check against such features as:
|
|
|
This validation should be super-simple for purposes of process intelligibility and transparency -- but this is an area where engineers (facing complex real-world requirements) can go a little nuts. Ordinarily such validations check against such features as:
|
|
|
|
|
|
1. Is the value permissible as a controlled value (e.g. "Conclusions" or "Discussion")?
|
|
|
|
|
|
1a. Are variants permitted (e.g. "Conclusion")? If so, are they normalized or left as-is?
|
|
|
|
|
|
2. Are cardinality constraints observed? For example, is it permissible to have a second "Conclusions" section?
|
|
|
2. Are cardinality constraints observed (can a section be repeated)? For example, is it permissible to have a second "Conclusions" section?
|
|
|
|
|
|
3. Do the sections (with types indicated) come in a permissible order?
|
|
|
|
|
|
4. Co-occurrence and other dependency constraints, for example 'If there is an Acknowledgements there may be no Funding section'.
|
|
|
4. Co-occurrence and other dependency constraints, for example 'If there is an Acknowledgements section there may be no Funding section'.
|
|
|
|
|
|
Recognized sections should be flagged in a way detectable by receiving software, for example with controlled @class values.
|
|
|
For now, HTMLcognizer will enforce only the first of these rule sets. The specific types and their criteria (recognized values) should be configurable. We can consider more complex requirements for validation of type assignments as they emerge.
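
As a rough illustration of the first rule set only (is the title a permitted controlled value?), here is a minimal sketch in Python. It is not the XSLT implementation this spec calls for. The names `RECOGNIZED_TITLES` and `section_type` are hypothetical, and the accepted titles per type are assumptions; only the tokens `introduction` and `methods` and the `UNKNOWN` marker come from this draft.

```python
# Hypothetical configuration: section-type token -> accepted title values.
# The tokens come from the provisional list in this spec; which titles each
# token accepts is an assumption made for illustration.
RECOGNIZED_TITLES = {
    "introduction": {"Introduction"},
    "methods": {"Methods and materials"},
}


def section_type(title, recognized=RECOGNIZED_TITLES):
    """Return the type token for a title, or "UNKNOWN" if it is not recognized."""
    text = title.strip()  # ignore leading/trailing whitespace only
    for token, titles in recognized.items():
        if text in titles:
            return token
    return "UNKNOWN"
```

Because the table is passed in as a parameter, swapping in a different set of recognized values does not require touching the checking logic.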
|
|
|
|
|
|
For now, we will enforce only the first of these rule sets. The specific types permitted are not enumerated here (tbd), but they should be configurable. (We can consider 2, 3 or even more complex requirements such as dependencies, but our requirements for them must be evaluated.)
|
|
|
When a section title is detected, but its value is not recognized as that of a known section title, we make the section but also inject a warning (into the main output) to go with it.
|
|
|
|
|
|
When a section title is detected, but its value is not recognized as that of a known section title, we make the section and emit a warning to go with it.
|
|
|
Recognized sections will be flagged in a way detectable by receiving software, for example with controlled @class values.
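
The flagging behavior could be sketched as follows, in Python for illustration only (the pipeline itself is to be XSLT). The function name `open_section` is hypothetical; the `htmlcog_alert` and `Section_Title` style values and the alert wording follow the example output shown under Examples.

```python
from html import escape


def open_section(title, section_class):
    """Return the opening lines of a section, with an alert if it is UNKNOWN."""
    lines = ['<section class="%s">' % escape(section_class)]
    if section_class == "UNKNOWN":
        # The alert is part of the main output, marked so it can be told
        # apart from source content.
        lines.append(
            '<p style="htmlcog_alert">HTMLCognizer alert: "%s" is not '
            "recognized as a section title</p>" % escape(title, quote=False)
        )
    # The title itself is passed through unchanged (apart from escaping).
    lines.append('<p style="Section_Title">%s</p>' % escape(title, quote=False))
    return lines
```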
|
|
|
|
|
|
## Examples
|
|
|
|
... | ... | @@ -81,7 +81,7 @@ The input document has a paragraph marked with style "Section_Title", but its va |
|
|
|
|
|
```
|
|
|
<section class="UNKNOWN">
|
|
|
<p style="htmlcog_alert">HTMLCognizer alert: "Conclsions" is unrecognized as a section title</p>
|
|
|
<p style="htmlcog_alert">HTMLCognizer alert: "Conclsions" is not recognized as a section title</p>
|
|
|
<p style="Section_Title">Conclsions</p>
|
|
|
<p>We conclude there is less than a 0.001% probability that the moon is made of green cheese.</p>
|
|
|
...
|
... | ... | @@ -90,13 +90,15 @@ The input document has a paragraph marked with style "Section_Title", but its va |
|
|
|
|
|
### Notes
|
|
|
|
|
|
If header promotion is applied, it occurs separately from the sectioning process (either before or after it). Sectioning is based *only* on the assigned style name, not on its features or contents.
|
|
|
If XSweet header promotion is applied, it occurs separately from the sectioning process (either before or after it). Sectioning is based *only* on the assigned style name, not on its features or contents.
|
|
|
|
|
|
Paragraphs appearing before the first Section Title indicated should appear in the results without a `<section>` element enclosure.
|
|
|
Paragraphs appearing before the first Section_Title indicated should appear in the results without a `<section>` element enclosure.
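
A minimal sketch of this grouping rule, in Python for illustration (in XSLT 2.0 the same effect would more likely come from grouping constructs). The input model, a list of `(style, text)` pairs, and the function name `group_into_sections` are assumptions.

```python
def group_into_sections(paragraphs, title_style="Section_Title"):
    """Split (style, text) pairs into leading paragraphs and flat sections.

    Sectioning looks only at the style name; the paragraph text is never
    inspected at this stage.
    """
    leading, sections = [], []
    for style, text in paragraphs:
        if style == title_style:
            # Every title paragraph starts a new, un-nested section.
            sections.append({"title": text, "body": []})
        elif sections:
            sections[-1]["body"].append(text)
        else:
            # Nothing has opened a section yet: leave the paragraph unwrapped.
            leading.append(text)
    return leading, sections
```

For example, a document consisting of an abstract paragraph, a `Section_Title` paragraph and one body paragraph yields one leading paragraph and one section.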
|
|
|
|
|
|
Only the contents of the Section_Title line (and nothing regarding its formatting) are evaluated to determine whether the section is recognized as a defined type.
|
|
|
Only the contents of the Section_Title line (and nothing regarding its formatting) are evaluated to determine whether the section is recognized as a defined type. Regex matching or other string testing is okay. For example, a rule that "any title starting with 'Conclusion' and not longer than 40 characters indicates a `conclusion` section" would be straightforward to implement. HTMLcognizer should not, however, normalize "deviant" forms even if it is programmed to recognize them.
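
The example rule above might be sketched like this, in Python for illustration only. The names `CONCLUSION_RULE` and `classify` are hypothetical. Note that the title is returned exactly as given, so a recognized variant is not normalized.

```python
import re

# "Conclusion" is 10 characters, so allowing at most 30 further characters
# keeps the whole title within the 40-character limit.
CONCLUSION_RULE = re.compile(r"Conclusion.{0,30}")


def classify(title):
    """Return (type token, title); the title is passed through unchanged."""
    if CONCLUSION_RULE.fullmatch(title):
        return ("conclusion", title)
    return ("UNKNOWN", title)
```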
|
|
|
|
|
|
HTMLcognizer does not intervene to "correct" anything. It is up to the user whether to modify the Word document and run it through XSweet+HTMLcognizer again, or whether to correct the file in the HTML.
|
|
|
HTMLcognizer should not be confused when footnote references occur inside titles. They must not throw off any string comparison. Other contents, however, are all construed as literals; no formatting (either paragraph-level or inline) is considered.
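
One way to keep footnote references out of the comparison is to drop them while flattening the title to a string. The sketch below is Python, for illustration only. It assumes that a footnote reference is an element carrying `class="footnoteReference"`; that marker is a placeholder, not a confirmed feature of XSweet output, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed marker for footnote references; adjust to the real input markup.
FOOTNOTE_CLASS = "footnoteReference"


def title_text(element):
    """Concatenate the text of an element, skipping footnote references."""
    parts = [element.text or ""]
    for child in element:
        if child.get("class") != FOOTNOTE_CLASS:
            # Inline formatting elements contribute their text only.
            parts.append(title_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def comparable_title(element):
    """Title string used for comparison, with whitespace runs collapsed."""
    return " ".join(title_text(element).split())
```

Inline formatting such as italics is ignored here simply because only text nodes are collected.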
|
|
|
|
|
|
HTMLcognizer does not intervene to "correct" anything. If conversion does not produce acceptable results, it is up to the user whether to modify the Word document and run it through XSweet+HTMLcognizer again, or whether to correct the file in the HTML.
|
|
|
|
|
|
Everything *added* by HTMLcognizer must be flagged somehow so we know it is not in the source data.
|
|
|
|
... | ... | @@ -106,9 +108,12 @@ In this draft spec, `class='UNKNOWN'` indicates a nominal section, whose title d |
|
|
|
|
|
## Development and testing
|
|
|
|
|
|
Punchlist
|
|
|
** Pre-finalize the name of the "magic style". Is **Section_Title** okay?
|
|
|
** Pre-finalize a list of controlled names for section types and their respective permissible title value(s), for demonstration. Pre-finalize the rules to be enforced regarding section type assignment, ordering, and title values.
|
|
|
** Pre-finalize the name of the "magic Style"
|
|
|
** Assemble a set of Word documents (actual and/or mocked up) that utilize that style, for testing
|
|
|
** Assemble a set of Word documents (actual and/or mocked up) that utilize this style and these values (and not), for testing
|
|
|
** Build a transformation pipeline that performs this conversion as described above
|
|
|
|
|
|
(n.b. 'pre-finalize' means "decide, but not finally" :stuck_out_tongue_closed_eyes: )
|
|
|
|
|
|
The transformation pipeline will have at least two phases: (1) marking the sections and (2) validating and tagging their types. These may be combined into a single XSLT for simplicity of application. However, even a single XSLT is likely to take two inputs: if we wish our constraint set to be configurable at runtime (for the second pass: e.g. which section titles indicate which types?), we will want to externalize these constraints in a separate (XML) file. |
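
To make the idea of an externalized constraint set concrete, here is a sketch of what such a file might look like and how it could be read. It is written in Python for illustration only; in the XSLT pipeline the file would more likely be supplied as a second input document. The element and attribute names (`section-types`, `type`, `token`, `title`) are invented for this sketch and are not a proposed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraints file; the vocabulary is an assumption.
CONFIG_XML = """
<section-types>
  <type token="introduction">
    <title>Introduction</title>
  </type>
  <type token="methods">
    <title>Methods and materials</title>
  </type>
</section-types>
"""


def load_constraints(xml_text):
    """Map each type token to the set of titles that indicate it."""
    root = ET.fromstring(xml_text)
    return {
        t.get("token"): {(e.text or "").strip() for e in t.findall("title")}
        for t in root.findall("type")
    }
```

Keeping the table in a separate file means the list of recognized titles can change without editing the transformation itself.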