... | ... | @@ -22,13 +22,13 @@ The main (only) operational target for the system is (so far) ingest by PubSweet |
|
|
|
|
|
However, an important consideration is that none of the following is in scope for this transformation:
|
|
|
|
|
|
* Any attempt to represent in the HTML result the "correct" _intellectual_ or _logical_ structure (typically chapters, sections, nested parts, etc.) beyond simply preserving paragraph style names. Document structure will be provided or "restored" in a subsequent development phase (an editorial step), after ingest of the data. This relieves the XSLT of having to perform a semantic/structural induction that is difficult to specify for any reasonable common case, and impossible to specify for the general case. While it may be possible to aid editing by providing some obvious structure for things like figures and boxed text (callouts), where the Word document does not have explicit structure, the HTML will not seek to represent it.
|
|
|
|
|
|
* Any effort to identify or extract metadata fields (which again will be provided via another channel).
|
|
|
|
|
|
* In general, any "semantic" interpretation of ad hoc (local) names in the Word document. For example, a segment marked as style "Italic" will be so marked in the HTML (as a `<span class="Italic">`), but not marked as an HTML `i` or represented as italic in any other way.
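
The last point can be illustrated with a small sketch. It is written in Python rather than XSLT purely for illustration, and it assumes the style name arrives as a `w:rStyle` value on a WordprocessingML run; the function name `run_to_span` and the sample input are hypothetical, not part of the actual pipeline.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}" notation.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def run_to_span(run):
    """Map one w:r element to a span, copying the style name without interpreting it."""
    span = ET.Element("span")
    style = run.find(f"{W}rPr/{W}rStyle")
    if style is not None:
        # The style name becomes a class value as-is: no mapping to <i>, <em>, etc.
        span.set("class", style.get(f"{W}val"))
    span.text = "".join(t.text or "" for t in run.iter(f"{W}t"))
    return span


# Hypothetical input: a run whose character style is named "Italic".
SAMPLE_RUN = (
    f'<w:r xmlns:w="{W_NS}">'
    '<w:rPr><w:rStyle w:val="Italic"/></w:rPr>'
    "<w:t>emphasis</w:t>"
    "</w:r>"
)

html = ET.tostring(run_to_span(ET.fromstring(SAMPLE_RUN)), encoding="unicode")
print(html)  # <span class="Italic">emphasis</span>
```

Whether "Italic" actually means italic is left for a later (editorial) step to decide; the sketch only carries the name across.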
|
|
|
|
|
|
None of these rules are absolute. However, as long as we do not lose data content coming across, the only thing we really care about in our format is that it be (a) legible and intelligible to target users and applications, (b) as economical and tractable as possible, and (c) 'truthful' in its representations. Requirements (b) and (c) are at odds, though, since economy means leaving things out. We want to represent only what is both true and useful. Because this is not yet known (and indeed may vary from one case to the next), the particulars of the target format (element types, attribute values, etc.) are probably best defined "under load" (that is, in use). We like HTML because it is a vernacular and developers know what to expect from it, so it gives us some (broad) boundaries going forward.
|
|
|
|
|
|
Development of a formal spec for such a format is still TBD. For now, we intend to "produce pudding" that can be proven by eating it.
|
|
|
|
... | ... | @@ -51,6 +51,7 @@ Since there is much to be done to get to that point, this means being vigilant f |
|
|
No provision is made for passing through, for example, page headers, into the HTML, in any form.
|
|
|
|
|
|
However, at deployment time, no provision is made for handling tables, for example, and we know we will have to handle them. So we already know we will be fixing up the XSLT to work for these cases. But what about cases we haven't seen yet?
|
|
|
|
|
|
We need to have robust mechanisms for detecting problems in data extraction (or any transformation), _especially lost data_; for ameliorating such problems in the instance (sometimes they may not be fatal errors); and for maintaining and improving the XSLTs.
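
One possible detection mechanism, sketched here in Python under stated assumptions: compare the text content of the source with the text content of the result, ignoring whitespace, and report anything present in the source but absent from the result. The function names and sample data are hypothetical, and the sketch assumes the HTML result is well-formed XML.

```python
import difflib
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_T = f"{{{W_NS}}}t"


def squeeze(chunks):
    """Join text chunks and drop all whitespace, so only content is compared."""
    return "".join("".join(chunks).split())


def lost_text(word_xml, html):
    """Return stretches of source text that do not appear in the result."""
    source = squeeze(t.text or "" for t in ET.fromstring(word_xml).iter(W_T))
    result = squeeze(ET.fromstring(html).itertext())
    matcher = difflib.SequenceMatcher(None, source, result, autojunk=False)
    return [
        source[i1:i2]
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("delete", "replace")
    ]


# Hypothetical source: one paragraph plus a table the transformation ignores.
SAMPLE_WORD = (
    f'<w:body xmlns:w="{W_NS}">'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell text</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "</w:body>"
)
SAMPLE_HTML = '<div><p>Hello <span class="Italic">world</span></p></div>'

lost = lost_text(SAMPLE_WORD, SAMPLE_HTML)
print(lost)  # ['celltext']
```

A check like this does not say why content went missing, only that it did. Its value is in flagging cases nobody has seen yet, so they can be reported and the XSLT corrected.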
|
|
|
|
|
|
Operationally, what will be the best way to specify corrections and improvements? (We could use Issues on this GitLab project.)
|
... | ... | |