... | ... | @@ -27,17 +27,33 @@ The pipeline will have exception handling logic for sections that are unrecogniz |
|
|
|
|
|
## Prototype design
|
|
|
|
|
|
1. We designate a particular named Paragraph Style to be the "section title" style. For now, this style name will be hard coded, maybe even **Section_Title**. (It can, however, be parameterized so as to be declared at run time.)
|
|
|
### 1
|
|
|
|
|
|
Note: we do not care about the styling on this paragraph -- how it *appears* in the Word document or the XSweet results -- although we expect it will ordinarily be an A-level or top-level head. Nor do we care if the style has a local override: assignment of the **Section_Title** Paragraph Style by the user to a paragraph is the only thing required.
|
|
|
**We designate a particular named Paragraph Style to be the "section title" style.** For now, this style name will be hard coded, maybe even **Section_Title**. (It can, however, be parameterized so as to be declared at run time.)
|
|
|
|
|
|
Note also: we do *not* presently propose to validate the permitted section titles inside Word, i.e. to provide feedback to users if they assign this style and use it incorrectly. The principle is GIGO (garbage in, garbage out): you can tell it is working when the outputs are correct.
|
|
|
Note: we do not care about the styling on this paragraph -- how it *appears* in the Word document or the XSweet results -- although we expect it will ordinarily be an A-level or top-level head. Nor do we care if the style has a local override: assignment of the **Section_Title** Paragraph Style by the user to a paragraph is the only thing required.
|
|
|
|
|
|
2. Represented in HTML Typescript by XSweet as `class` (attribute) assignments, this style name will be recognized and the HTML document sectioned (with paragraphs grouped) at the boundaries indicated by the flagged 'Section_Title' paragraphs. (Note that these proposed names can be adjusted.) Each such paragraph will lead off a new section.
|
|
|
Note also: we do *not* presently propose to validate the permitted section titles inside Word, i.e. to provide feedback to users if they assign this style and use it incorrectly. The principle is GIGO (garbage in, garbage out): you can tell it is working when the outputs are correct.
|
|
|
|
|
|
3. A subsequent process will validate these sections against constraints such as "is the text of the Section Title a recognized value?" (such as "Conclusions" or "Discussion").
|
|
|
### 2
|
|
|
|
|
|
This validation should be super-simple for purposes of process intelligibility and transparency -- but this is an area where engineers (facing complex real-world requirements) can go a little nuts. Ordinarily such validations check against such features as:
|
|
|
Represented in HTML Typescript by XSweet as `class` (attribute) assignments, this style name will be recognized and **the HTML document divided into `<section>` elements** at the boundaries indicated by the flagged 'Section_Title' paragraphs. Each such paragraph will lead off a new section.
|
|
|
|
|
|
Consequently the entire document will be "tessellated" into a sequence of sections, each section beginning with its title and continuing until the next title is found.
|
|
|
|
|
|
Paragraphs appearing *before* the first Section_Title indicated, however, should appear in the results without a `<section>` element enclosure. So we have (in RNC notation; NB if header promotion has been performed on the inputs, some `p` might be `h1-h6`):
|
|
|
|
|
|
```
|
|
|
body {
|
|
|
p*,
|
|
|
section { p* }* }
|
|
|
```
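The grouping rule above can be illustrated with a minimal Python sketch. It is not the XSLT the pipeline would actually use; the `Para` and `Section` types and the `tessellate` function are hypothetical names introduced only for this illustration, and the sketch assumes the style name arrives as a plain `class` value on each paragraph.

```python
from dataclasses import dataclass, field

# Assumed hard-coded style name, as proposed in this spec.
SECTION_TITLE_CLASS = "Section_Title"


@dataclass
class Para:
    """Hypothetical stand-in for a <p> (or promoted h1-h6) element."""
    text: str
    css_class: str = ""


@dataclass
class Section:
    """Hypothetical stand-in for a <section>; its first paragraph is the title."""
    paras: list = field(default_factory=list)


def tessellate(paras):
    """Split a flat paragraph list into (front_matter, sections).

    Paragraphs before the first Section_Title stay unwrapped; every
    Section_Title paragraph opens a new section that runs until the next one.
    """
    front, sections = [], []
    for p in paras:
        if p.css_class == SECTION_TITLE_CLASS:
            sections.append(Section([p]))      # title leads off a new section
        elif sections:
            sections[-1].paras.append(p)       # continue the current section
        else:
            front.append(p)                    # no section opened yet
    return front, sections
```

A document with no `Section_Title` paragraphs yields an empty section list and passes through as front matter only, which corresponds to the `p*` branch of the schema.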
|
|
|
|
|
|
### 3
|
|
|
|
|
|
A subsequent process will **validate these sections against constraints** such as "is the text of the Section Title a recognized value?" (such as "Conclusions" or "Discussion"). Documents that fail this validation will be annotated in the results.
|
|
|
|
|
|
This validation should be super-simple for purposes of process intelligibility and transparency -- but this is an area where engineers (facing complex real-world requirements) can go a little nuts. Ordinarily such validations check against such features as:
|
|
|
|
|
|
1. Is the value permissible as a controlled value (e.g. "Conclusions" or "Discussion")?
|
|
|
|
... | ... | @@ -53,7 +69,7 @@ For now, HTMLcognizer will enforce only the first of these rule sets. The specif |
|
|
|
|
|
When a section title is detected, but its value is not recognized as that of a known section title, we make the section but also inject a warning (into the main output) to go with it.
|
|
|
|
|
|
Recognized sections will be flagged in a way detectable by receiving software, for example with controlled @class values.
|
|
|
Recognized sections will be flagged in a way detectable by receiving software, for example with controlled `class` attribute values.
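As a rough sketch of the tagging behavior just described (Python, for illustration only), a lookup against a controlled list can return both the `class` value and an optional warning. The `KNOWN_TITLES` table, its class values, and the `tag_section` function are assumptions made for this example; the `UNKNOWN` value and the alert wording follow the examples elsewhere in this document.

```python
# Hypothetical controlled list: title text -> class value for the <section>.
KNOWN_TITLES = {
    "Conclusions": "conclusions",
    "Discussion": "discussion",
}


def tag_section(title_text):
    """Return (class_value, alert) for a section title.

    The section is always created; `alert` is None when the title is
    recognized, otherwise a warning string to inject into the main output.
    """
    section_class = KNOWN_TITLES.get(title_text)
    if section_class is not None:
        return section_class, None
    alert = (
        f'HTMLCognizer alert: "{title_text}" is not recognized as a section title'
    )
    return "UNKNOWN", alert
```

Calling `tag_section("Conclsions")` gives the `UNKNOWN` class together with an alert string, while a recognized title gives its controlled class and no alert.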
|
|
|
|
|
|
## Examples
|
|
|
|
... | ... | @@ -77,11 +93,21 @@ If this is considered to be a problem or potential problem we could address it b |
|
|
|
|
|
### Non-conformant
|
|
|
|
|
|
The input document has a paragraph marked with style "Section_Title", but its value ("Conclsions" in this example) is not controlled as designating a section type. So the class is given as `UNKNOWN` and an alert is also produced:
|
|
|
#### 1
|
|
|
|
|
|
An input document has no paragraphs marked with style "Section_Title".
|
|
|
|
|
|
Its output through the HTMLcognizer pipeline represents the input without modification.
|
|
|
|
|
|
#### 2
|
|
|
|
|
|
An input document has a paragraph marked with style "Section_Title", but its value ("Conclsions" in this example) is not controlled as designating a section type.
|
|
|
|
|
|
In the result, a `<section>` is created with an assigned class of `UNKNOWN` and an alert is also produced:
|
|
|
|
|
|
```
|
|
|
<section class="UNKNOWN">
|
|
|
<p style="htmlcog_alert">HTMLCognizer alert: "Conclsions" is not recognized as a section title</p>
|
|
|
<p style="htmlcog_alert">[[[ HTMLCognizer alert: "Conclsions" is not recognized as a section title. ]]]</p>
|
|
|
<p style="Section_Title">Conclsions</p>
|
|
|
<p>We conclude there is less than a 0.001% probability that the moon is made of green cheese.</p>
|
|
|
...
|
... | ... | @@ -92,28 +118,27 @@ The input document has a paragraph marked with style "Section_Title", but its va |
|
|
|
|
|
If XSweet header promotion is applied, it occurs separately from the sectioning process (either before or after it). Sectioning is based *only* on the assigned style name, not on its features or contents.
|
|
|
|
|
|
Paragraphs appearing before the first Section_Title indicated should appear in the results without a `<section>` element enclosure.
|
|
|
|
|
|
The contents of the Section_Title line, only (and nothing regarding its formatting) are evaluated to determine whether the section is recognized as a defined type. Regex matching or other string testing is okay. For example, a rule that "any title starting with 'Conclusion' and not longer than 40 characters indicates a `conclusion` section" would be straightforward to implement. HTMLcognizer should not, however, normalize "deviant" forms even if it is programmed to recognize them.
|
|
|
The (string) content of the Section_Title line, only (and nothing regarding its formatting) is evaluated to determine whether the section is recognized as a defined type. Regex matching or other string testing is okay. For example, a rule that "any title starting with 'Conclusion' and not longer than 40 characters indicates a `conclusion` section" would be straightforward to implement. HTMLcognizer should not, however, normalize "deviant" forms or variants even if it is programmed to recognize them.
|
|
|
|
|
|
HTMLcognizer should not be confused when footnote references occur inside titles. They must not throw off any string comparison. Other contents, however, are all construed as literals; no formatting (either paragraph-level or inline) is considered.
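The example rule and the footnote requirement can be sketched together in Python. This is a sketch under stated assumptions: the title is assumed to arrive as a list of `(text, is_footnote_ref)` pairs, a hypothetical representation that does not reflect how XSweet actually encodes footnote references, and `title_string` and `section_type` are illustrative names.

```python
import re

# "Conclusion" (10 characters) plus at most 30 more: 40 characters in total.
CONCLUSION_RULE = re.compile(r"Conclusion.{0,30}")


def title_string(runs):
    """Join inline runs into a plain string, skipping footnote references.

    `runs` is a list of (text, is_footnote_ref) pairs (assumed shape).
    """
    return "".join(text for text, is_ref in runs if not is_ref).strip()


def section_type(runs):
    """Return 'conclusion' if the title matches the example rule, else None.

    Only the type is reported; the title itself is left as written,
    so variant forms are recognized but never normalized.
    """
    if CONCLUSION_RULE.fullmatch(title_string(runs)):
        return "conclusion"
    return None
```

Because the function only reports a type and never rewrites the runs, a title such as "Conclusions and Future Work" would be typed as `conclusion` while its text stays unchanged in the output.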
|
|
|
|
|
|
HTMLcognizer does not intervene to "correct" anything. If conversion does not produce acceptable results, it is up to the user whether to modify the Word document and run it through XSweet+HTMLcognizer again, or whether to correct the file in the HTML.
|
|
|
|
|
|
Everything *added* by HTMLcognizer must be flagged somehow so we know it is not in the source data.
|
|
|
Everything *added* by HTMLcognizer must be flagged somehow so it can be seen downstream where it comes from.
|
|
|
|
|
|
We don't do nesting. For example, within the "Conclusions" section everything will be flat; there are no subsections. Subsectioning must be accomplished by a different (probably subsequent) process.
|
|
|
|
|
|
In this draft spec, `class='UNKNOWN'` indicates a nominal section, whose title does not indicate a known section type. This is all upper-case for legibility; note that HTML `class` values are matched case-sensitively (outside quirks mode), so receiving software should match the value exactly. *Caveat receptor*.
|
|
|
|
|
|
As described here, the transformation pipeline will have at least two phases, (1) marking the sections and (2) validating and tagging their types. These may be combined into a single XSLT for simplicity of application. However, even a single XSLT is likely to support two inputs: if we wish our section types and constraint set to be configurable at runtime (for the second pass: e.g. which section titles indicate which types?), it will be very helpful to externalize these constraints in the form of an external (XML) file (which can be modified without XSLT expertise).
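To make the idea of externalized constraints concrete, here is a small Python sketch that reads a title-to-type mapping from an XML file. The element and attribute names (`section-types`, `section-type`, `title`, `class`) are invented for this illustration and are not a proposed format; in the pipeline itself, the XSLT would presumably read such a file as its second input.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraints file: which title values indicate which section type.
CONFIG_XML = """
<section-types>
  <section-type class="conclusion">
    <title>Conclusions</title>
    <title>Conclusion</title>
  </section-type>
  <section-type class="discussion">
    <title>Discussion</title>
  </section-type>
</section-types>
"""


def load_title_map(xml_text):
    """Parse the constraints XML into a {title text: class value} mapping."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for section_type in root.findall("section-type"):
        for title in section_type.findall("title"):
            mapping[title.text.strip()] = section_type.get("class")
    return mapping
```

Keeping the mapping in a file like this means that adding a permitted title is an edit to data rather than to transformation logic.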
|
|
|
|
|
|
## Development and testing
|
|
|
|
|
|
Punchlist
|
|
|
|
|
|
** Pre-finalize the name of the "magic style". Is **Section_Title** okay?
|
|
|
** Pre-finalize a list of controlled names for section types and their respective permissible title value(s), for demonstration. Pre-finalize the rules to be enforced regarding section type assignment, ordering, and title values.
|
|
|
** Assemble a set of Word documents (actual and/or mocked up) that utilize this style and these values (and not), for testing
|
|
|
** Build a transformation pipeline that performs this conversion as described above
|
|
|
|
|
|
(n.b. 'pre-finalize' means "decide, but not finally" :stuck_out_tongue_closed_eyes: )
|
|
|
|
|
The transformation pipeline will have at least two phases, (1) marking the sections and (2) validating and tagging their types. These may be combined into a single XSLT for simplicity of application. However, even a single XSLT is likely to support two inputs: if we wish our constraint set to be configurable at runtime (for the second pass: e.g. which section titles indicate which types?), we will want to externalize these constraints in the form of an external (XML) file.