... | ... | @@ -16,13 +16,13 @@ Not being able to provide a 100% solution, however, does not necessary make an 8 |
|
|
|
|
|
## Towards a solution
|
|
|
|
|
|
We are ambitious. We know we need something that will work tolerably and usefully well, even if run with all the default settings intact and no special configuration. Running the set of stylesheets "lights out" (which is to say, with no special inputs or supervision) should always be an option - and the outputs of such a process shouldn't be entirely useless.
|
|
|
We are ambitious. We know we need something that will work tolerably and usefully well, even if run with all the default settings intact and no special configuration. That is, any special switches or customizations should be regarded as improvements, not necessities.
|
|
|
|
|
|
However, we also want something that can be extended and adjusted to fit special cases and do special things - because special cases and special things are *regular* and *to be expected*. It is to be expected, we assert, that some loss of "semantic resolution" will occur - but the degree to which that happens can be responsive to the level of effort and skill put in. In other words, while a solution should give us something (reasonable) "lights out", it should also be able to do more if opened up and tinkered with.
|
|
|
However, we also want something that can be extended and adjusted to fit special cases and do special things - because special cases and special things are *regular* and *to be expected*. It is to be expected, we concede, that there may be some loss of "semantic resolution" in any (or at any rate, in any off-the-shelf) conversion -- but the degree to which that happens can also be responsive to the level of effort and skill put in. In other words, while a solution should give us something (reasonable) as a "black box", it should also be able to do more if opened up and tinkered with.
|
|
|
|
|
|
The best way we think we can provide this is at the level of the architecture. We must design from the outset with the idea that everything in XSweet should be designed for adaptation and reuse. XSweet should work like a black box but you should also be able to open it up and rewire it -- completely, if need be.
|
|
|
The best way we think we can provide this is at the level of the architecture. We must design from the outset with the idea that everything in XSweet should be designed for adaptation and reuse. XSweet should work like a black box but you should also be able to rewire it inside -- completely, if need be.
|
|
|
|
|
|
And, because we already know that 'perfect for everyone all the time' is impossible, WE AIM (first) FOR USEFUL, NOT (yet) COMPLETE OR PERFECT
|
|
|
And, because we already know that 'perfect for everyone all the time' is impossible, WE AIM (first) FOR USEFUL, NOT (yet) COMPLETE OR PERFECT.
|
|
|
|
|
|
XSweet does not have to try to solve the entire problem. Instead, we break the problem into chunks and pieces, and address them serially, "in detail".
|
|
|
|
... | ... | @@ -32,7 +32,7 @@ If we break up the problem into detail, what are the pieces? |
|
|
|
|
|
Can we prioritize them based on which of them are more universal / ubiquitous, vs which show the most variation (across documents, document types, publication types and workflows) and will therefore be most problematic and peculiar?
|
|
|
|
|
|
- Main text including inline features such as bold, italics (and their warrants?) as well as significant structural divisions (headers)
|
|
|
- Main text including features such as bold, italics (and their warrants where available?), font face and size shifts -- both between lines, and in line.
|
|
|
- Footnotes, endnotes and textual apparatus with their cross-references
|
|
|
- 'Textual objects' including figures, tables, structured lists
|
|
|
(as represented typographically and by other means) with their cross-references
|
... | ... | @@ -140,7 +140,7 @@ Indeed only if we do *as little as possible* in changing the representation of t |
|
|
|
|
|
So far so good - but what will that format actually be? It is not hard to envision what it would look like. It would differ from the Word source data in being 'idiomatic' with respect to markup structures, most especially idioms related to inline markup. But instead of using the obscure and impenetrable Word vocabulary, it would (again) use either a standard or made-for-purpose vocabulary. The difference would be that it would translate what we actually see in the Word document, whether it be "semantic" (that is, arbitrary labels such as we have with Word Styles) or merely renditional or presentational, maybe looking something like this:
|
|
|
|
|
|
- ` <b>Gene Roddenberry's <i>Star Trek</i></b>` (HTML)
|
|
|
- `<b>Gene Roddenberry's <i>Star Trek</i></b>` (HTML)
|
|
|
- `<b>Gene Roddenberry's <i>Star Trek</i></b>` (DITA)
|
|
|
- `<bold>Gene Roddenberry's <italic>Star Trek</italic></bold>` (JATS/BITS)
|
|
|
- `<emphasis role="bold">Gene Roddenberry's <emphasis>Star Trek</emphasis></emphasis>` (Docbook)
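As a minimal sketch of how little separates these renderings, the same nested bold/italic structure can be written out from one small lookup table. The table `VOCABULARIES` and the function `render` are hypothetical names invented for this illustration; they are not part of XSweet, and the tag names are simply those shown in the examples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: for each vocabulary, the (tag, attributes)
# used for bold and italic, copied from the examples listed above.
VOCABULARIES = {
    "HTML": {"bold": ("b", {}), "italic": ("i", {})},
    "DITA": {"bold": ("b", {}), "italic": ("i", {})},
    "JATS/BITS": {"bold": ("bold", {}), "italic": ("italic", {})},
    "Docbook": {"bold": ("emphasis", {"role": "bold"}),
                "italic": ("emphasis", {})},
}


def render(vocabulary: str) -> str:
    """Serialize the sample phrase (bold text wrapping an italic title)
    using the element names of the given vocabulary."""
    names = VOCABULARIES[vocabulary]
    bold_tag, bold_attrs = names["bold"]
    italic_tag, italic_attrs = names["italic"]

    outer = ET.Element(bold_tag, bold_attrs)
    outer.text = "Gene Roddenberry's "
    inner = ET.SubElement(outer, italic_tag, italic_attrs)
    inner.text = "Star Trek"
    return ET.tostring(outer, encoding="unicode")


if __name__ == "__main__":
    for name in VOCABULARIES:
        print(f"{name}: {render(name)}")
```

The point of the sketch is only that the structure is identical in every case and the vocabulary is a matter of naming.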
|
... | ... | @@ -161,11 +161,11 @@ For any target format serving a data conversion out of .docx, there would be thr |
|
|
|
|
|
* In order to reflect the 'semantic domain' of the .docx source format (which ultimately describes 'presentation' or is at least bound largely to presentation behaviors), we need to be able to capture presentational information, i.e. specs on how outputs should look on the 'page'. This is not because this information is valuable in itself, but because it serves as the "context of distinction" or "terrain" within which any putative semantics must be inferred. (And presumably a putative semantics must be able to map back into this presentation.)
|
|
|
|
|
|
* The target format must support unstructured or loosely structured data; in any case structure constraints are not enforced. That's because we know stuff coming in will have every structure and no structure.
|
|
|
* The target format must support unstructured or loosely structured data; in any case structural constraints are not enforced. That's because we know stuff coming in will have every structure and no structure. Since a great part of the editorial process is devoted to precisely this question of how to organize and represent the text, no structural semantics *per se* should be dictated by what is essentially a bridge format.
|
|
|
|
|
|
* Ideally our format of choice would be well-known, a *lingua franca*.
|
|
|
|
|
|
These obviously suggest a late-model HTML. We can stipulate HTML5 in an XML syntax (so nominally an XHTML5).
|
|
|
These obviously suggest a late-model HTML. We can stipulate HTML5 in an XML syntax (so nominally an XHTML5).
|
|
|
|
|
|
- We can easily leave our documents 'flat' as long as we need to - structure can come later! (This is a key distinction vs our final target format)
|
|
|
- It has `@class` and `@style`, fantastic escape hatches!
|
... | ... | @@ -175,7 +175,7 @@ These obviously suggest a late-model HTML. We can stipulate HTML5 in an XML synt |
|
|
|
|
|
Additionally, as against possible alternatives:
|
|
|
|
|
|
- A custom vocabulary would have to be designed, tested, documented and learned by users; HTML lets us just fake it for now;
|
|
|
- A home-brew, custom vocabulary would have to be designed, tested, documented and learned by users; HTML lets us just fake it for now;
|
|
|
- And since we are expecting to work further with our data on an HTML platform, and go from there when it comes to other formats for interchange/archiving - we can just stick with that plan ...
|
|
|
- ... Illustrating the point: anyone can use HTML5 (especially well-formed XML HTML5), so let's use that.
|
|
|
|
... | ... | @@ -183,8 +183,6 @@ Note that choosing HTML/CSS as a vocabulary does not mean we need utterly to for |
|
|
|
|
|
Indeed, over the medium or long term it could be useful to design XML schemas to describe the different stages or 'profiles' of HTML (perhaps, operational subsets of XHTML5/CSS) that we wish to formalize.
|
|
|
|
|
|
In the meantime, we have a suitable carrier format. (See below for more.) It does make use of @style in ways that are deprecated and may be discouraged in other contexts. We justify our use of @style in this way as a means to an end - we represent information on @style so that it can be abstracted away (and the style can be gotten rid of).
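To make "abstracted away" concrete, here is a rough sketch of a later pass that promotes recognized `@style` properties to `@class` values and drops them from `@style`. It is written in Python purely for illustration and is not how XSweet itself does this; the rule table `RULES`, the class names, and the function names are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: (CSS property, value) -> class name to assign.
# A real project would choose its own mappings.
RULES = {
    ("font-weight", "bold"): "strong-text",
    ("font-style", "italic"): "emphasized-text",
}


def parse_style(value):
    """Split a CSS declaration list ('a: b; c: d') into a dict."""
    props = {}
    for declaration in value.split(";"):
        if ":" in declaration:
            name, _, val = declaration.partition(":")
            props[name.strip().lower()] = val.strip()
    return props


def abstract_styles(root):
    """Replace recognized @style properties with @class values.

    Unrecognized properties stay on @style; @style is removed
    entirely once nothing is left on it.
    """
    for element in root.iter():
        style = element.attrib.pop("style", None)
        if style is None:
            continue
        classes = element.get("class", "").split()
        leftover = []
        for name, val in parse_style(style).items():
            class_name = RULES.get((name, val))
            if class_name is None:
                leftover.append(f"{name}: {val}")
            elif class_name not in classes:
                classes.append(class_name)
        if classes:
            element.set("class", " ".join(classes))
        if leftover:
            element.set("style", "; ".join(leftover))
    return root


if __name__ == "__main__":
    sample = ET.fromstring(
        '<p style="font-weight: bold; margin-left: 2em">Gene Roddenberry\'s '
        '<span style="font-style: italic">Star Trek</span></p>'
    )
    print(ET.tostring(abstract_styles(sample), encoding="unicode"))
```

Anything the rules do not recognize is left on `@style`, so no presentational information is lost before someone decides what it means.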
|
|
|
|
|
|
Next to these, the fact that HTML also has an element-type semantics, albeit an impoverished one - p, ul, li, the lists, tables - is useful, but not essential, as long as we have generic "hangers" we can use such as div, p and span.
|
|
|
|
|
|
```
|
... | ... | |