# HTMLcognizer - draft specs
We will build a pipeline that accepts Word documents conforming to a defined 'styling profile' (the definition is ours to make) and produces HTML from them. The HTML will contain structures that reflect the document's organization (i.e., explicit `div` or `section` elements), as indicated by the styling in the Word source.

This is conceived as a post-process to XSweet (.docx file extraction), so its inputs are actually HTML Typescript as XSweet emits it. (If XSweet needs to be extended or modified to support this, such work is in scope for this project, but so far we think the HTML Typescript we have is good enough.)

Our initial target is an HTML file whose `body` is divided into a sequence of sections, with no nested subsections. Further, the sections are to be assigned to section types based on the literal contents of their nominal section titles.

The names of the section heads must be validatable against known constraints expressed (and ultimately configurable) in the pipeline. Initially, section names will be things like "Methods" and "Conclusion", as shown (for example) here:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172213

Here is a provisional list of section headers, with tokens for indicating their (section) types in the result HTML:

- Introduction `introduction`
- Methods and materials `methods`
- Results `results`
- Discussion `discussion`
- Conclusions `conclusions`
- Supporting information `supporting_info`
- Acknowledgments `acknowledgements`
- Author Contributions `author_contrib`
- References `references`
- **Further tbd**: for example, what about Abstract?

The pipeline will have exception-handling code for sections that are unrecognized or out of order.
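The provisional list above could be held as a simple lookup table. The following is a minimal Python sketch for illustration only (the project itself anticipates XSLT). The names `SECTION_TYPES`, `normalize_title` and `section_type_for` are hypothetical, and matching on lower-cased, whitespace-collapsed text is an assumption that this spec does not decide.

```python
import re

# Provisional title -> type-token table, transcribed from the list above.
# Keys are stored in normalized form (lower case, single spaces).
SECTION_TYPES = {
    "introduction": "introduction",
    "methods and materials": "methods",
    "results": "results",
    "discussion": "discussion",
    "conclusions": "conclusions",  # token spelled as in the Examples section
    "supporting information": "supporting_info",
    "acknowledgments": "acknowledgements",
    "author contributions": "author_contrib",
    "references": "references",
}


def normalize_title(text):
    """Collapse whitespace and lower-case a title (an assumed policy)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def section_type_for(title):
    """Return the type token for a title, or None if it is not recognized."""
    return SECTION_TYPES.get(normalize_title(title))
```

A caller would treat a `None` result as the "unrecognized" case and hand it to whatever exception handling the pipeline provides.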
## Prototype design
1. We designate a particular named Paragraph Style to be the "section title" style. For now, the style name will be hard-coded, perhaps even as **Section_Title**.

Note: we do not care about the styling on this paragraph, although we expect it will ordinarily be an A-level or top-level head. Nor do we care if the style has a local override: assignment of the **Section_Title** Paragraph Style by the user to a paragraph is the only thing required.

Note also: we do *not* presently propose to validate the permitted section titles inside Word, i.e. to provide feedback to users if they assign this style and use it incorrectly. The rule is GIGO (garbage in, garbage out): the way you tell that it is working is that the outputs are correct.

2. XSweet represents paragraph style names in HTML Typescript as class assignments. The pipeline will recognize this style name and section the HTML document (grouping its paragraphs) at the boundaries indicated by the flagged 'Section_Title' paragraphs. Each such paragraph will lead off a new section. (Note that these proposed names can be adjusted.)

3. A subsequent process will validate these sections against constraints such as "is the text of the Section Title the same as a recognized value?" (such as "Conclusions" or "Discussion").

This validation must be super-simple for purposes of transparency. Ordinarily such validations check features such as:

1. Is the value permissible as a controlled value (e.g. "Conclusions" or "Discussion")?

1a. Are variants permitted (e.g. "Conclusion")? If so, are they normalized or left as-is?

2. Are cardinality constraints observed? For example, is it permissible to have a second "Conclusions" section?

3. Do the sections (with types indicated) come in a permissible order?

4. Are co-occurrence and other dependency constraints observed? For example, 'If there is an Acknowledgements there may be no Funding section'.

Recognized sections should be flagged in a way detectable by receiving software, for example with controlled `@class` values.

For now, we will enforce only the first of these rule sets. The specific types permitted are not enumerated here (tbd), but they should be configurable. (We can consider rules 2 and 3, or even more complex requirements such as dependencies, but our requirements for them must be evaluated first.)

When a section title is detected but its value is not recognized as that of a known section title, we still make the section and emit a warning to go with it.
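Only rule 1 is to be enforced for now. Should rules 2 (cardinality) and 3 (order) be taken up later, checks along the following lines might be enough. This is a hedged Python sketch, not the project's XSLT: the `EXPECTED_ORDER` list, the "at most once" cardinality policy, and both function names are assumptions made for illustration, since this spec fixes neither the permitted order nor the permitted counts.

```python
from collections import Counter

# Assumed canonical order (hypothetical; the spec does not define one).
EXPECTED_ORDER = ["introduction", "methods", "results", "discussion", "conclusions"]


def cardinality_problems(types):
    """Report every type token that occurs more than once (assumed policy)."""
    counts = Counter(types)
    return [f"{t} occurs {n} times" for t, n in counts.items() if n > 1]


def order_problems(types):
    """Report each known section that appears after one it should precede."""
    rank = {t: i for i, t in enumerate(EXPECTED_ORDER)}
    problems = []
    highest = -1  # rank of the latest-ordered section seen so far
    for t in types:
        if t not in rank:
            continue  # unrecognized types are left to rule 1
        if rank[t] < highest:
            problems.append(f"{t} is out of order")
        highest = max(highest, rank[t])
    return problems
```

Both functions take the sequence of type tokens already assigned to the sections and return plain messages, so they report problems without changing the document.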
## Examples
### Conformant
The input document has a single paragraph marked with style "Section_Title", whose value is "Conclusions". This value is recognized as indicating a known section type, to be flagged as `conclusions`.

The HTML result looks like this:

```
<section class="conclusions">
  <p class="Section_Title">Conclusions</p>
  <p>We conclude there is less than a 0.001% probability that the moon is made of green cheese.</p>
  ...
</section>
```
Note: since this `<section>` element is introduced by the filter (transformation), there is no chance that its `class` attribute will clash with a class assignment (representing a Style name) given in the Word document. Its *value* may clash, however: for example, we may get a `<section class='conclusions'>` next to a series of `<p class='conclusions'>` if a Word author happened to have (and use) a paragraph style named "conclusions" in the Word document.

If this is considered to be a problem or potential problem, we could address it by using unlikely names such as `htmlcog_conclusions_sec` or something along those lines.
### Non-conformant
The input document has a paragraph marked with style "Section_Title", but its value ("Conclsions" in this example) is not a controlled value designating a section type. So the class is given as `UNKNOWN` and an alert is also produced:

```
<section class="UNKNOWN">
  <p class="htmlcog_alert">HTMLCognizer alert: "Conclsions" is unrecognized as a section title</p>
  <p class="Section_Title">Conclsions</p>
  <p>We conclude there is less than a 0.001% probability that the moon is made of green cheese.</p>
  ...
</section>
```
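The tagging behaviour shown in the two examples could be sketched as follows. This is an illustrative Python sketch rather than the project's XSLT. The function name `tag_section` and the small `KNOWN_TITLES` table are hypothetical, and the sketch assumes that the `<section>` wrapper already exists with the title paragraph as its first child, and that style names and the alert flag are carried in `class` attributes (as the prototype design says XSweet does for style names).

```python
import xml.etree.ElementTree as ET

# Illustrative subset of controlled titles (hypothetical; the real list is tbd).
KNOWN_TITLES = {"Conclusions": "conclusions", "Discussion": "discussion"}


def tag_section(section):
    """Set the section's class from its title; prepend an alert if unrecognized."""
    # Assumes the first child is the title paragraph; only its text is examined.
    title = "".join(section[0].itertext()).strip()
    token = KNOWN_TITLES.get(title)
    if token is None:
        section.set("class", "UNKNOWN")
        alert = ET.Element("p", {"class": "htmlcog_alert"})
        alert.text = f'HTMLCognizer alert: "{title}" is unrecognized as a section title'
        section.insert(0, alert)  # the alert precedes the title, as in the example
    else:
        section.set("class", token)
    return section


good = tag_section(ET.fromstring(
    '<section><p class="Section_Title">Conclusions</p><p>Text.</p></section>'))
bad = tag_section(ET.fromstring(
    '<section><p class="Section_Title">Conclsions</p><p>Text.</p></section>'))
print(ET.tostring(bad, encoding="unicode"))
```

The title paragraph itself is left untouched in both cases, so the only additions are the `class` on `<section>` and, for unrecognized titles, the flagged alert paragraph.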
### Notes
If header promotion is applied, it occurs separately from the sectioning process (either before or after). Sectioning is based *only* on the assigned style name, not on its features or contents.

Paragraphs appearing before the first Section Title should appear in the results without an enclosing `<section>` element.

Only the contents of the Section_Title line (and nothing regarding its formatting) are evaluated to determine whether the section is recognized as a defined type.

HTMLcognizer does not intervene to "correct" anything. It is up to the user whether to modify the Word document and run it through XSweet+HTMLcognizer again, or to correct the HTML file directly.

Everything *added* by HTMLcognizer must be flagged somehow so we know it is not in the source data.

We don't do nesting. For example, within the "Conclusions" section everything will be flat, with no subsections. Subsectioning must be accomplished by a different (probably subsequent) process.
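The sectioning rules in these notes (style name only, leading paragraphs left unwrapped, no nesting) could be sketched as below. This is a minimal Python illustration, not the XSLT the project anticipates. The function names are hypothetical, and the sketch assumes an un-namespaced `<body>` whose children carry style names as `class` values, with no significant text directly inside `<body>`.

```python
import xml.etree.ElementTree as ET

SECTION_TITLE_CLASS = "Section_Title"  # hard-coded style name, per the design


def is_section_title(elem):
    """True if the element carries the section-title class among its classes."""
    return SECTION_TITLE_CLASS in elem.get("class", "").split()


def section_body(body):
    """Regroup the children of <body> into flat <section> elements, in place."""
    children = list(body)
    for child in children:
        body.remove(child)
    current = None  # no open section until the first title is seen
    for child in children:
        if is_section_title(child):
            current = ET.SubElement(body, "section")  # each title opens a section
        # Before the first title, children stay directly under <body>.
        (current if current is not None else body).append(child)
    return body


html = (
    "<body>"
    '<p class="Title">On Cheese</p>'
    '<p class="Section_Title">Methods and materials</p>'
    "<p>We looked at the moon.</p>"
    '<p class="Section_Title">Conclusions</p>'
    "<p>It is not cheese.</p>"
    "</body>"
)
body = section_body(ET.fromstring(html))
print(ET.tostring(body, encoding="unicode"))
```

Because a new `<section>` is always appended to `<body>` itself, sections can never nest, which matches the flat structure required here.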
## Development and testing
- Pre-finalize a list of controlled names for section types and their respective permissible title value(s), for demonstration. Pre-finalize the rules to be enforced regarding section type assignment, ordering, and title values.
- Pre-finalize the name of the "magic Style".
- Assemble a set of Word documents (actual and/or mocked up) that use that style.
- Build a transformation pipeline that performs this conversion as described above.

The transformation pipeline will have at least two phases: (1) marking the sections and (2) validating and tagging their types. These may be combined into a single XSLT for simplicity of application. However, even a single XSLT is likely to take two inputs: if we wish our constraint set to be configurable at runtime (for the second pass: e.g. which section titles indicate which types?), we will want to externalize these constraints in an external (XML) file.
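To make the idea of an externalized constraints file concrete, here is one possible shape for it, read by a short Python sketch. Everything about the format is an assumption: the element names `section-types`, `type` and `title`, the `token` attribute, and the function `load_title_map` are invented for illustration and are not defined by this spec. Listing "Conclusion" as a second permitted title only shows how variants could be expressed if they are allowed at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraints document; in practice this would be a separate file.
CONSTRAINTS_XML = """
<section-types>
  <type token="methods">
    <title>Methods and materials</title>
  </type>
  <type token="conclusions">
    <title>Conclusions</title>
    <title>Conclusion</title>
  </type>
</section-types>
"""


def load_title_map(xml_text):
    """Build a {title text: type token} lookup from the constraints document."""
    root = ET.fromstring(xml_text.strip())
    return {
        title.text.strip(): type_elem.get("token")
        for type_elem in root.findall("type")
        for title in type_elem.findall("title")
    }


title_map = load_title_map(CONSTRAINTS_XML)
```

Keeping the titles in a file like this would let the second phase change which titles indicate which types without touching the transformation itself.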