... | ... | @@ -2,9 +2,9 @@ |
|
|
|
|
|
We will build a pipeline that will accept Word documents that conform to a certain defined 'styling profile' (subject to our definition), and produce HTML from these documents containing structures reflecting document organization (i.e., explicit `div` or `section` in the HTML), as indicated by their styling.
|
|
|
|
|
|
This is conceived as a post-process to XSweet (.docx file extraction), so its inputs are actually HTML Typescript as XSweet emits them. If XSweet needs to be extended or modified to support this, such work is in scope for this project but so far we think the HTML Typescript we have, is good enough.)
|
|
|
This is conceived as a post-process to XSweet (.docx file extraction), so its inputs (the source format) are actually HTML Typescript as XSweet emits it. If XSweet needs to be extended or modified to support the functionality described here, such work is in scope for this project -- but so far we think the HTML Typescript we have is good enough.
|
|
|
|
|
|
Our initial target is an HTML file whose `body` is divided into a sequence of sections, no nested subsections. Further, based on the literal contents of the nominal section titles, the sections are to be assigned to section types.
|
|
|
Our target is an HTML file whose `body` is divided into a sequence of `section` elements (let's say), with no nested subsections. Further, based on the literal contents of the nominal section titles, each `<section>` is to be assigned a section type, captured as a `class` attribute value.
|
|
|
|
|
|
The names of the section heads must be validatable against known constraints expressed (and ultimately configurable) in the pipeline. Initially, section names will be things like "Methods" and "Conclusion", as shown (for example) here:
|
|
|
|
... | ... | @@ -23,21 +23,21 @@ Here is a provisional list of section headers, with tokens for indicating their |
|
|
- References `references`
|
|
|
- **Further tbd** for example, what about Abstract?
|
|
|
|
|
|
The pipeline will have exception handling code for sections that are unrecognized or out of order.
|
|
|
The pipeline will have exception handling logic for sections that are unrecognized or otherwise problematic, as detailed.
|
|
|
|
|
|
## Prototype design
|
|
|
|
|
|
1. We designate a particular named Paragraph Style to be the "section title" style. For now, the style name will be hard coded, maybe even **Section_Title**.
|
|
|
1. We designate a particular named Paragraph Style to be the "section title" style. For now, this style name will be hard coded, maybe even **Section_Title**. (It can, however, be parameterized so as to be declared at run time.)
|
|
|
|
|
|
Note: we do not care about the styling on this paragraph, although we expect it will ordinarily be an A-level or top-level head. Nor do we care if the style has a local override: assignment of the **Section_Title** Paragraph Style by the user to a paragraph is the only thing required.
|
|
|
Note: we do not care about the styling on this paragraph -- how it *appears* in the Word document or the XSweet results -- although we expect it will ordinarily be an A-level or top-level head. Nor do we care if the style has a local override: assignment of the **Section_Title** Paragraph Style by the user to a paragraph is the only thing required.
|
|
|
|
|
|
Note also: we do *not* presently propose to validate the permitted section titles inside Word - i.e. provide feedback to users if they assign this style and use it incorrectly. The rule is GIGO (garbage-in garbage-out) - the way you tell if it is working is, the outputs are correct.
|
|
|
Note also: we do *not* presently propose to validate the permitted section titles inside Word - i.e. provide feedback to users if they assign this style and use it incorrectly. The principle is GIGO (garbage-in garbage-out) - the way to tell whether it is working is that the outputs are correct.
|
|
|
|
|
|
2. Represented in HTML Typescript by XSweet as class assignments, this style name will be recognized and the HTML document sectioned (with paragraphs grouped) at the boundaries indicated by the flagged 'Section_Title' paragraphs. (Note that these proposed names can be adjusted.) Each such paragraph will lead off a new section.
|
|
|
2. Represented in HTML Typescript by XSweet as `class` (attribute) assignments, this style name will be recognized and the HTML document sectioned (with paragraphs grouped) at the boundaries indicated by the flagged 'Section_Title' paragraphs. (Note that these proposed names can be adjusted.) Each such paragraph will lead off a new section.
|
|
|
|
|
|
3. A subsequent process will validate these sections against constraints such as "is the text of the Section Title the same as a recognized value?" (such as "Conclusions" or "Discussion").
|
|
|
3. A subsequent process will validate these sections against constraints such as "is the text of the Section Title a recognized value?" (such as "Conclusions" or "Discussion").
|
|
|
|
|
|
This validation must be super-simple for purposes of transparency. Ordinarily such validations check against such features as:
|
|
|
This validation must be super-simple for purposes of process intelligibility and transparency. Ordinarily such validations check against such features as:
|
|
|
|
|
|
1. Is the value permissible as a controlled value (e.g. "Conclusions" or "Discussion")?
|
|
|
|
... | ... | @@ -102,11 +102,13 @@ Everything *added* by HTMLcognizer must be flagged somehow so we know it is not |
|
|
|
|
|
We don't do nesting. For example, within the "Conclusions" section everything will be flat: there are no subsections. Subsectioning must be accomplished by a different (probably subsequent) process.
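
The flat grouping can be illustrated with a minimal Python sketch. This is not the planned XSLT, only an illustration of the intended result. It assumes the input is well-formed XML, that the style arrives as a `Section_Title` token in the `class` attribute, and that namespaces can be ignored; the function name `section_body` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed style name, as provisionally proposed in this spec.
SECTION_TITLE_CLASS = "Section_Title"


def section_body(body):
    """Group the children of `body` into flat <section> elements.

    A new section starts at every child whose class list contains
    SECTION_TITLE_CLASS. Children before the first title are left
    ungrouped. No section is ever placed inside another section.
    """
    new_body = ET.Element(body.tag, dict(body.attrib))
    current = None
    for child in list(body):
        classes = child.get("class", "").split()
        if SECTION_TITLE_CLASS in classes:
            current = ET.SubElement(new_body, "section")
        target = current if current is not None else new_body
        target.append(child)
    return new_body


sample = ET.fromstring(
    "<body>"
    "<p>Front matter</p>"
    "<p class='Section_Title'>Methods</p><p>We did things.</p>"
    "<p class='Section_Title'>Conclusions</p><p>It worked.</p><p>Mostly.</p>"
    "</body>"
)
result = section_body(sample)
print(ET.tostring(result, encoding="unicode"))
```

Anything that looks like a subhead inside a section is simply carried along as an ordinary child, which is the behavior a later subsectioning process would build on.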
|
|
|
|
|
|
In this draft spec, `class='UNKNOWN'` indicates a nominal section whose title does not indicate a known section type. This is all upper-case for legibility; note that in HTML, `class` values are case-sensitive (outside quirks mode), so consumers must match `UNKNOWN` exactly. *Caveat receptor*.
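
As a rough Python sketch of the type assignment (again, not the planned XSLT): each top-level `section` receives a `class` derived from the text of its first child, falling back to `UNKNOWN`. Only the `references` token comes from the provisional list in this spec; the other tokens, the case-insensitive title matching, and the name `classify_sections` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Lower-cased title text -> class token. Only "references" is taken from the
# provisional list in this spec; the other tokens are hypothetical.
SECTION_TYPES = {
    "references": "references",
    "methods": "methods",
    "discussion": "discussion",
    "conclusions": "conclusions",
}


def classify_sections(body):
    """Set a class on each top-level <section> from its first child's text."""
    for section in body.findall("section"):
        title = section[0] if len(section) else None
        text = "".join(title.itertext()).strip() if title is not None else ""
        # Assumption: titles are compared case-insensitively.
        section.set("class", SECTION_TYPES.get(text.lower(), "UNKNOWN"))
    return body


doc = ET.fromstring(
    "<body>"
    "<section><p class='Section_Title'>Methods</p><p>Text.</p></section>"
    "<section><p class='Section_Title'>Musings</p><p>Text.</p></section>"
    "<section><p class='Section_Title'>References</p><p>Text.</p></section>"
    "</body>"
)
classify_sections(doc)
print([s.get("class") for s in doc.findall("section")])
```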
|
|
|
|
|
|
## Development and testing
|
|
|
|
|
|
** Pre-finalize a list of controlled names for section types and their respective permissible title value(s), for demonstration. Pre-finalize the rules to be enforced regarding section type assignment, ordering, and title values.
|
|
|
** Pre-finalize the name of the "magic Style"
|
|
|
** Assemble a set of Word documents (actual and/or mocked up) that utilize that style.
|
|
|
** Assemble a set of Word documents (actual and/or mocked up) that utilize that style, for testing
|
|
|
** Build a transformation pipeline that performs this conversion as described above
|
|
|
|
|
|
The transformation pipeline will have at least two phases, (1) marking the sections and (2) validating and tagging their types. These may be combined into a single XSLT for simplicity of application. However, even a single XSLT is likely to support two inputs: if we wish our constraint set to be configurable at runtime (for the second pass: e.g. which section titles indicate which types?), we will want to externalize these constraints in the form of an external (XML) file.
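
To make the idea of an externalized constraint set concrete, here is a small Python sketch of what the second pass might read. The XML vocabulary (`section-types`, `type`, `token`, `title`) is entirely hypothetical, as are the function names and every token except `references`; the actual file format is still to be decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraints file; element and attribute names are illustrative.
CONSTRAINTS_XML = """
<section-types>
  <type token="methods"><title>Methods</title></type>
  <type token="conclusions"><title>Conclusion</title><title>Conclusions</title></type>
  <type token="references"><title>References</title></type>
</section-types>
"""


def load_constraints(xml_text):
    """Return a dict mapping each permitted title to its section type token."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for type_el in root.findall("type"):
        for title_el in type_el.findall("title"):
            mapping[title_el.text.strip()] = type_el.get("token")
    return mapping


def tag_titles(titles, mapping):
    """Second pass: pair each section title with its type, or 'UNKNOWN'."""
    return [(t, mapping.get(t.strip(), "UNKNOWN")) for t in titles]


constraints = load_constraints(CONSTRAINTS_XML)
tagged = tag_titles(["Methods", "Acknowledgements", "Conclusion"], constraints)
print(tagged)
```

Keeping the title-to-type mapping in a separate file means the permitted titles can change without touching the transformation itself, which is the motivation for the second input described above.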