... | ... | @@ -110,11 +110,11 @@ or (what is actually much more likely): |
|
|
|
|
|
(If you squint you can see that Word is preserving a kind of 'edit history' by not cleaning up after itself - as well as other artifacts of word-processorism.)
|
|
|
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a formal sense (measure of tag entropy), and 'promiscuous'. (The same thing may be said in many different ways as well as repeatedly.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. But XML really isn't it. Putting this back together may take some care.
|
|
|
There are a couple of serious problems here. Internally, WordML is sloppy and highly redundant, "noisy" in a formal sense (measure of tag entropy), and 'promiscuous'. (It turns out the same thing may be said in many different ways as well as repeatedly.) The syntax is awful. It is not hard to imagine how the data here can map into tractable objects in the right kind of programming environment. But XML really isn't it. Putting this back together may take some care.
|
|
|
|
|
|
More importantly, WordML lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye. What the Word document (.docx) represents in operation is something entirely outside the .docx itself, namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As. And even this is actually not entirely true or is, at least, further complicated by the way in which Word also becomes a kind of environment in its own right, and so the .docx is never 'fully itself' except when it is in MS Word itself.)
|
|
|
More importantly, WordML lacks explicit indicators of intellectual structure, or at any rate indicators whose consistency we can depend on. This is because it is actually a representation *twice removed* from the logical structure we see in our mind's eye. What the Word document (.docx) represents in operation is something entirely outside the .docx itself, namely the printed artifact that is produced when you hit the "Print" button (or today, the PDF you produce with a Save As). And even this is not entirely true or is, at least, further complicated by the way in which Word also becomes a kind of environment in its own right, so that the .docx is never 'fully itself' except when it is open in MS Word.
|
|
|
|
|
|
On the other hand we aren't actually interested in the Word document "fully itself", but the book, article, research paper or other "thing" hiding inside it.
|
|
|
On the other hand, we aren't actually interested in the Word document "fully itself", even once it is printed or displayed (correctly) on screen, but in the book, article, research paper or other "thing" hiding inside it.
|
|
|
|
|
|
Even in this little sample, a case of this is evident if you closely compare the *structure* of the outputs with that of the inputs. The clean markup samples all have the structure (bold (italic)), where the 'italic' span is nested inside the bold span. In contrast, the Word document shows (bold)(bold italic): two spans next to each other.
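
To make the contrast concrete, here is a minimal Python sketch of how flat, Word-style runs might be regrouped into the nested form. It is an illustration only, not how any particular converter works: the function `nest`, the `NESTING_ORDER` tuple, and the sample text are all hypothetical, and it assumes we have already decided that bold should sit outside italic.

```python
from itertools import groupby

# Hypothetical choice: which property wraps which, outermost first.
NESTING_ORDER = ("bold", "italic")


def nest(runs, order=NESTING_ORDER):
    """Render flat (text, properties) runs as a nested-span string."""
    if not order:
        # No properties left to wrap with: emit the plain text.
        return "".join(text for text, _ in runs)
    prop, rest = order[0], order[1:]
    pieces = []
    # Merge adjacent runs that agree on whether they carry `prop`.
    for has_prop, group in groupby(runs, key=lambda run: prop in run[1]):
        inner = nest(list(group), rest)
        pieces.append(f"({prop} {inner})" if has_prop else inner)
    return "".join(pieces)


# Word-style input: two sibling runs, (bold)(bold italic). Sample text is made up.
flat_runs = [("Hello ", {"bold"}), ("world", {"bold", "italic"})]
nested = nest(flat_runs)
print(nested)
```

The point of the sketch is that the nesting is not present in the flat runs; it has to be inferred, and the result depends on the nesting order we choose to impose.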
|
|
|
|
... | ... | |