... | ... | @@ -2,145 +2,11 @@ |
|
|
|
|
|
See page on the original [Pre-alpha Specs](pre-alpha-specs) (before redesign)
|
|
|
|
|
|
We will build an XSLT pipeline that will accept Word documents that conform to a particular 'styling profile' (subject to our definition), and produce HTML from these documents containing structures reflecting document organization (i.e., explicit `div` or `section` in the HTML), as indicated by their styling.
|
|
|
Two parts:
|
|
|
|
|
|
The first step of this pipeline can be executed in XSweet, which produces an HTML Typescript document from a `.docx` source. If XSweet needs to be extended or modified to support the functionality described here, such work is in scope for this project (although present requirements do not appear to necessitate this). Like XSweet, this pipeline will require only XSLT 2.0 and SaxonHE, and will be amenable to integration into INK.
|
|
|
## Structural induction
|
|
|
|
|
|
Our target is an HTML file whose `<body>` is divided into a sequence of `<section>` elements (let's say), with no nested subsections. Further, based on the literal contents of the nominal section titles, each `<section>` is to be assigned a section type, captured as a `class` attribute value.
|
|
|
Interpolate `<section>` elements appropriately
|
|
|
|
|
|
The names of the section heads must be validatable against known constraints expressed (and ultimately configurable) in the pipeline. Initially, section names will be things like "Methods" and "Conclusion", as shown (for example) here:
|
|
|
## Validation of structures/content types
|
|
|
|
|
|
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172213
|
|
|
|
|
|
Here is a provisional list of section headers, with proposed tokens for indicating their (section) types in the result HTML:
|
|
|
|
|
|
- Introduction `introduction`
- Methods and materials `methods`
- Results `results`
- Discussion `discussion`
- Conclusions `conclusion`
- Supporting information `supporting_info`
- Acknowledgments `acknowledgements`
- Author Contributions `author_contrib`
- References `references`
- **Further tbd** for example, what about Abstract?
|
|
|
|
|
|
The pipeline will have exception handling logic for sections that are unrecognized or otherwise problematic, as detailed.
|
|
|
|
|
|
## Prototype design
|
|
|
|
|
|
### 1
|
|
|
|
|
|
**We designate a particular named Paragraph Style to be the "section title" style.** For now, this style name will be hard coded, maybe even **Section_Title**. (It can, however, be parameterized so as to be declared at run time.)
|
|
|
|
|
|
Note: we do not care about the styling on this paragraph -- how it *appears* in the Word document or the XSweet results -- although we expect it will ordinarily be an A-level or top-level head. Nor do we care if the style has a local override: assignment of the **Section_Title** Paragraph Style by the user to a paragraph is the only thing required.
|
|
|
|
|
|
Note also: we do *not* presently propose to validate the permitted section titles inside Word - i.e. provide feedback to users if they assign this style and use it incorrectly. The principle is GIGO (garbage-in garbage-out) - the way you tell if it is working is, the outputs are correct.
|
|
|
|
|
|
### 2
|
|
|
|
|
|
Represented in HTML Typescript by XSweet as `class` (attribute) assignments, this style name will be recognized and **the HTML document divided into `<section>` elements** at the boundaries indicated by the flagged 'Section_Title' paragraphs. Each such paragraph will lead off a new section.
|
|
|
|
|
|
Consequently the entire document will be "tessellated" into a sequence of sections, each section beginning with its title and continuing until the next title is found.
|
|
|
|
|
|
Paragraphs appearing *before* the first Section_Title indicated, however, should appear in the results without a `<section>` element enclosure. So we have (in RNC notation; NB if header promotion has been performed on the inputs, some `p` might be `h1-h6`):
|
|
|
|
|
|
```
body {
  p*,
  section { p* }*
}
```
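The grouping rule in step 2 can be illustrated with a minimal sketch. It is written in Python purely for illustration (the pipeline itself is to be XSLT 2.0, where `xsl:for-each-group` with `group-starting-with` would be the natural tool). The function name `sectionize` is hypothetical, and the sketch assumes the style name arrives as a `class` token on each paragraph, as described above.

```python
import xml.etree.ElementTree as ET

SECTION_STYLE = "Section_Title"  # the hard-coded "magic style" name proposed in this spec


def sectionize(body):
    """Return a new <body> whose children are grouped into flat <section> elements.

    Hypothetical helper: every child carrying the Section_Title class token
    opens a new <section>; children before the first title stay unwrapped.
    """
    out = ET.Element("body")
    current = out  # until a title is seen, children go straight into <body>
    for child in list(body):
        if SECTION_STYLE in child.get("class", "").split():
            current = ET.SubElement(out, "section")
        current.append(child)
    return out


html = ET.fromstring(
    "<body><p>Preamble</p>"
    '<p class="Section_Title">Methods</p><p>We did things.</p>'
    '<p class="Section_Title">Results</p><p>It worked.</p></body>'
)
result = sectionize(html)
print(ET.tostring(result, encoding="unicode"))
```

Because the test looks only at the class token, it applies equally to `p` and to promoted `h1-h6` elements, and a document with no flagged paragraphs passes through with no `<section>` added.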
|
|
|
|
|
|
### 3
|
|
|
|
|
|
A subsequent process will **validate these sections against constraints** such as "is the text of the Section Title a recognized value?" (such as "Conclusions" or "Discussion"). Documents that fail this validation will be annotated in the results.
|
|
|
|
|
|
This validation should be super-simple for purposes of process intelligibility and transparency -- but this is an area where engineers (facing complex real-world requirements) can go a little nuts. Ordinarily such validations check against such features as:
|
|
|
|
|
|
1. Is the value permissible as a controlled value (e.g. "Conclusions" or "Discussion")?
   - 1a. Are variants permitted (e.g. "Conclusion")? If so, are they normalized or left as-is?
2. Are cardinality constraints observed (can a section be repeated)? For example, is it permissible to have a second "Conclusions" section?
3. Do the sections (with types indicated) come in a permissible order?
4. Are co-occurrence and other dependency constraints observed? For example, 'If there is an Acknowledgements section there may be no Funding section'.
|
|
|
|
|
|
For now, HTMLevator will enforce only the first of these rule sets. The specific types and their criteria (recognized values) should be configurable. We can consider more complex requirements for validation of type assignments as they emerge.
|
|
|
|
|
|
When a section title is detected, but its value is not recognized as that of a known section title, we make the section but also inject a warning (into the main output) to go with it.
|
|
|
|
|
|
Recognized sections will be flagged in a way detectable by receiving software, for example with controlled `class` attribute values.
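As a rough illustration of the first rule set and the warning behavior, here is a minimal Python sketch (not the XSLT implementation). The table `KNOWN_TITLES` and the function `tag_section` are hypothetical names; the entries are a subset of the provisional list above, and exact string equality is assumed as the matching rule.

```python
# Hypothetical lookup table: recognized title text -> class token.
# Entries are taken from the provisional list in this spec.
KNOWN_TITLES = {
    "Introduction": "introduction",
    "Results": "results",
    "Discussion": "discussion",
    "References": "references",
}


def tag_section(title):
    """Return (class_token, alert_or_None) for the text of a section title.

    Assumption: a title is recognized only if it equals a controlled value
    exactly (after trimming surrounding whitespace). The title itself is
    never rewritten; an unrecognized one only triggers an alert.
    """
    text = title.strip()
    token = KNOWN_TITLES.get(text)
    if token is not None:
        return token, None
    alert = f'[[[ HTMLevator alert: "{text}" is not recognized as a section title. ]]]'
    return "UNKNOWN", alert


print(tag_section("Discussion"))
print(tag_section("Conclsions"))
```

Keeping the table separate from the function is what would let the recognized values be made configurable later.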
|
|
|
|
|
|
## Examples
|
|
|
|
|
|
### Conformant
|
|
|
|
|
|
The input document has a single paragraph marked with style "Section_Title", whose value is "Conclusions", which is recognized as indicating a known section type, to be flagged as `conclusions`.
|
|
|
|
|
|
The HTML result looks like this:
|
|
|
|
|
|
```
<section class="conclusions">
  <p style="Section_Title">Conclusions</p>
  <p>We conclude there is less than a 0.001% probability that the moon is made of green cheese.</p>
  ...
</section>
```
|
|
|
|
|
|
Note: since this `<section>` element is introduced by the filter (transformation), there is no chance its assigned "class" value will clash with a class assignment (representing a Style name) given in the Word document. Its *value* may clash - for example, we may get a `<section class='conclusions'>` next to a series of `<p class='conclusions'>` if a Word author happened to have (and use) a paragraph style named "conclusions" in the Word document.
|
|
|
|
|
|
If this is considered to be a problem or potential problem we could address it by using unlikely names such as `htmlcog_conclusions_sec` or something along those lines.
|
|
|
|
|
|
### Non-conformant
|
|
|
|
|
|
#### 1
|
|
|
|
|
|
An input document has no paragraphs marked with style "Section_Title".
|
|
|
|
|
|
Its output through the HTMLevator pipeline represents the input without modification.
|
|
|
|
|
|
#### 2
|
|
|
|
|
|
An input document has a paragraph marked with style "Section_Title", but its value ("Conclsions" in this example) is not controlled as designating a section type.
|
|
|
|
|
|
In the result, a `<section>` is created with an assigned class of `UNKNOWN` and an alert is also produced:
|
|
|
|
|
|
```
<section class="UNKNOWN">
  <p style="htmlcog_alert">[[[ HTMLevator alert: "Conclsions" is not recognized as a section title. ]]]</p>
  <p style="Section_Title">Conclsions</p>
  <p>We conclude there is less than a 0.001% probability that the moon is made of green cheese.</p>
  ...
</section>
```
|
|
|
|
|
|
### Notes
|
|
|
|
|
|
If XSweet header promotion is applied, it occurs separately from the sectioning process (either before or after it). Sectioning is based *only* on the assigned style name, not on the paragraph's formatting or contents.
|
|
|
|
|
|
Only the string content of the Section_Title line is evaluated to determine whether the section is recognized as a defined type; nothing regarding its formatting is considered. Regex matching or other string testing is okay. For example, a rule that "any title starting with 'Conclusion' and not longer than 40 characters indicates a `conclusion` section" would be straightforward to implement. HTMLevator should not, however, normalize "deviant" forms or variants even if it is programmed to recognize them.
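The example rule quoted above could be sketched as follows, in Python for illustration only (XSLT 2.0 offers `matches()` for the same purpose). The function name `is_conclusion_title` is hypothetical, and "not longer than 40 characters" is read here as a limit on the whole title, including the word "Conclusion".

```python
import re

# "Conclusion" is 10 characters, so at most 30 more keeps the title within 40.
CONCLUSION_RULE = re.compile(r"Conclusion.{0,30}")


def is_conclusion_title(title):
    """True if the title starts with 'Conclusion' and is at most 40 characters long."""
    return CONCLUSION_RULE.fullmatch(title) is not None


for candidate in ["Conclusions", "Conclusion and outlook", "In conclusion"]:
    print(candidate, is_conclusion_title(candidate))
```

The rule only answers yes or no. The title text is left exactly as the author wrote it, in keeping with the no-normalization requirement.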
|
|
|
|
|
|
HTMLevator should not be confused when footnote references occur inside titles. They must not throw off any string comparison. Other contents, however, are all construed as literals; no formatting (either paragraph-level or inline) is considered.
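One way to keep footnote references out of the comparison is to compute the title's string value while skipping the footnote-reference elements. The Python sketch below is illustrative only. It assumes, hypothetically, that a footnote reference appears as an inline element carrying the class token `footnoteReference`; the actual markup produced by XSweet should be checked. The function name `title_text` is likewise invented for this example.

```python
import xml.etree.ElementTree as ET

FOOTNOTE_CLASS = "footnoteReference"  # assumed marker; verify against real XSweet output


def title_text(elem):
    """String value of a title paragraph, ignoring footnote references.

    Inline formatting elements contribute their text as plain literals.
    Whitespace runs are collapsed so line breaks in the source do not
    affect the comparison.
    """
    parts = [elem.text or ""]
    for child in elem:
        if FOOTNOTE_CLASS not in child.get("class", "").split():
            parts.append(title_text(child))
        parts.append(child.tail or "")  # text after any child always counts
    return " ".join("".join(parts).split())


p = ET.fromstring(
    '<p class="Section_Title"><b>Conclusions</b>'
    '<a class="footnoteReference" href="#fn1">1</a></p>'
)
print(title_text(p))
```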
|
|
|
|
|
|
HTMLevator does not intervene to "correct" anything. If conversion does not produce acceptable results, it is up to the user whether to modify the Word document and run it through XSweet+HTMLevator again, or whether to correct the file in the HTML.
|
|
|
|
|
|
Everything *added* by HTMLevator must be flagged somehow so it can be seen downstream where it comes from.
|
|
|
|
|
|
We don't do nesting. For example, within the "Conclusions" section everything will be flat, there are no subsections. Subsectioning must be accomplished by a different (probably subsequent) process.
|
|
|
|
|
|
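In this draft spec, `class='UNKNOWN'` indicates a nominal section whose title does not indicate a known section type. This is all upper-case for legibility, but note that `class` values in HTML are case-sensitive (CSS selectors match them case-insensitively only in quirks mode), so receiving software should match the value exactly. *Caveat receptor*.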
|
|
|
|
|
|
As described here, the transformation pipeline will have at least two phases, (1) marking the sections and (2) validating and tagging their types. These may be combined into a single XSLT for simplicity of application. However, even a single XSLT is likely to support two inputs: if we wish our section types and constraint set to be configurable at runtime (for the second pass: e.g. which section titles indicate which types?), it will be very helpful to externalize these constraints in the form of an external (XML) file (which can be modified without XSLT expertise).
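To make the idea of an externalized constraint file concrete, here is a minimal sketch. The XML vocabulary (`section-types`, `type`, `token`, `title`) is entirely hypothetical and is not a format defined by this spec. The loader is written in Python for illustration; in the XSLT itself, such a file would more likely be read with the `document()` function.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration format: one <type> per section type,
# with one or more permitted <title> values each.
CONFIG_XML = """
<section-types>
  <type token="introduction"><title>Introduction</title></type>
  <type token="methods"><title>Methods and materials</title></type>
  <type token="acknowledgements"><title>Acknowledgments</title></type>
</section-types>
"""


def load_title_map(xml_text):
    """Build a {title text: class token} mapping from the configuration XML."""
    root = ET.fromstring(xml_text)
    return {
        title.text.strip(): section_type.get("token")
        for section_type in root.findall("type")
        for title in section_type.findall("title")
    }


title_map = load_title_map(CONFIG_XML)
print(title_map)
```

A file of this shape could be edited by someone who knows the journal's section names but not XSLT, which is the point of externalizing it.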
|
|
|
|
|
|
## Development and testing
|
|
|
|
|
|
Punchlist
|
|
|
|
|
|
- Pre-finalize the name of the "magic style". Is **Section_Title** okay?
- Pre-finalize a list of controlled names for section types and their respective permissible title value(s), for demonstration. Pre-finalize the rules to be enforced regarding section type assignment, ordering, and title values.
- Assemble a set of Word documents (actual and/or mocked up) that utilize this style and these values (and not), for testing
- Build a transformation pipeline that performs this conversion as described above
|
|
|
|
|
|
(n.b. 'pre-finalize' means "decide, but not finally" :stuck_out_tongue_closed_eyes: )