# HTMLcognizer - draft specs

We will build an XSLT pipeline that accepts Word documents conforming to a defined 'styling profile' (which we will specify) and produces HTML from them. The HTML will contain structures that reflect the document's organization (i.e., explicit `div` or `section` elements), as indicated by the styling.

This is conceived as a post-process to XSweet (`.docx` file extraction). The first step of the pipeline can be executed in XSweet, which produces an HTML Typescript document from a `.docx` source, so the actual inputs (the source format) for the work described here are HTML Typescript as XSweet emits it. If XSweet needs to be extended or modified to support the functionality described here, such work is in scope for this project, although present requirements do not appear to necessitate this: so far we think the HTML Typescript we have is good enough. Like XSweet, this pipeline will require only XSLT 2.0 and SaxonHE, and will be amenable to integration into INK.

Our target is an HTML file whose `body` is divided into a flat sequence of `section` elements (let's say), with no nested subsections. Further, based on the literal contents of its nominal section title, each `section` is to be assigned a section type, captured as a `class` attribute value.
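As a rough illustration of this target, the grouping and classification could be sketched in XSLT 2.0 along the following lines. This is a minimal sketch under several assumptions that are not settled by this spec: that the input is in the XHTML namespace, that section titles arrive as `p` elements carrying a hypothetical `class="SectionTitle"`, and that section types come from a small fixed vocabulary. The `hc` namespace, the function name `hc:section-type`, and the list of types are all placeholders.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:hc="http://example.org/ns/htmlcognizer"
    xmlns="http://www.w3.org/1999/xhtml"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xs hc">

  <!-- Identity template: copy anything not handled more specifically. -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Split the flat element children of body into groups, each group
       starting at a title paragraph. 'SectionTitle' is an assumed class
       name, not something defined by XSweet or by this spec. -->
  <xsl:template match="body">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:for-each-group select="*"
          group-starting-with="p[@class = 'SectionTitle']">
        <xsl:choose>
          <!-- The context item is the first element of the group. -->
          <xsl:when test="self::p[@class = 'SectionTitle']">
            <section class="{hc:section-type(.)}">
              <xsl:apply-templates select="current-group()"/>
            </section>
          </xsl:when>
          <xsl:otherwise>
            <!-- Anything before the first title is left unwrapped. -->
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Map the literal title text to a section type. The vocabulary
       below is purely illustrative. -->
  <xsl:function name="hc:section-type" as="xs:string">
    <xsl:param name="title" as="element()"/>
    <xsl:variable name="text"
        select="lower-case(normalize-space($title))"/>
    <xsl:sequence select="
        if ($text = ('introduction', 'methods', 'results', 'references'))
        then $text
        else 'other'"/>
  </xsl:function>

</xsl:stylesheet>
```

Because the grouping uses `group-starting-with` over the children of `body` only, the result is a flat run of `section` elements with no nesting, which matches the target described above. A stylesheet like this could be run with Saxon-HE from the command line, for example `java -jar saxon-he.jar -s:input.html -xsl:sections.xsl -o:output.html`, where the jar and file names are placeholders.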