XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
Updated: 2018-05-21T17:47:38Z

Images
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/23
Updated: 2018-05-21T17:47:38Z
Author: Wendell Piez

The end of Zorba chapter 4 `b04%20Urban.docx` has photographs ... making this a reasonable case for analysis.

Assignee: Wendell Piez

Locate, implement and test footnotes conversion
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/22
Updated: 2017-08-16T18:59:57Z
Author: Wendell Piez

So-called (in Word) "endnotes" have gotten some attention but footnotes need analysis, maybe a bit of specification (should they work just like endnotes?) and implementation. Find or create an example with actual footnotes (i.e. not just placeholders).

Assignee: Wendell Piez

Assessing coverage by comparing plain-text diffs
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/21
Updated: 2018-03-15T20:02:17Z
Author: Wendell Piez

A stylesheet `plaintext.xsl` is now provided, which strips tagging from an XHTML (or other well-formed XML) document. To HTML it adds a bit of whitespace (line ends) for legibility.
It will be really interesting to compare the results of this transformation with what Word provides in a "Save As (Plain Text .txt)" over the same input file.

Milestone: 1.0.0
Assignee: Alex Theg

Early aggregator to call extractor over a sequence of files
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/20
Updated: 2017-08-09T19:08:49Z
Author: Wendell Piez

A simple file manifest source (`xsweet-manifest.xml`) could be used to drive the extractor and produce an assembly of HTML `section` elements within a single file, ready for subsequent pipelining. (This is a job that INK may ultimately be doing but we probably need the functionality sooner. Saxon can do the job if it can read the source data.) Such a single merged result for the book will help us in mapping since we'd have a comprehensive view of all the sources together. Plus, it may help to scale (or not) -- total throughput time should be lower since there is only one pipeline not one per chapter, so less overhead.
Something really simple to start such as
``` xml
<source-files>
<dir name="a1 Urban_tit"/>
<dir name="a2 Urban_toc"/>
<dir name="a3 Urban_pref"/>
<dir name="a4 Urban_Illus"/>
<dir name="b00 Urban_Intro"/>
<dir name="b01 Urban"/>
<dir name="b02 Urban"/>
<dir name="b03 Urban"/>
<dir name="b04 Urban"/>
<dir name="b05 Urban"/>
<dir name="b06 Urban"/>
<dir name="b07 Urban_Concl"/>
<dir name="z Urban_bib"/>
</source-files>
```
The idea being that a `word/document.xml` will be found inside each directory named, and its extracted results are to be merged as a `section` into a single HTML document.
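To make the manifest idea concrete, here is a minimal sketch in Python (not the XSLT/Saxon pipeline the project would actually use). It only resolves where each `word/document.xml` is expected and builds an empty `section` per source; the function names, the `data-source` attribute and the `base` directory are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Shortened copy of the manifest proposed above.
MANIFEST = """<source-files>
  <dir name="a1 Urban_tit"/>
  <dir name="b01 Urban"/>
  <dir name="z Urban_bib"/>
</source-files>"""

def document_paths(manifest_xml, base="."):
    """Return the expected word/document.xml path for each <dir>, in manifest order."""
    root = ET.fromstring(manifest_xml)
    return [PurePosixPath(base, d.get("name"), "word", "document.xml")
            for d in root.iter("dir")]

def section_skeleton(manifest_xml):
    """Build one HTML tree holding an empty <section> per manifest entry.

    The extracted content of each chapter would be placed inside its section;
    the data-source attribute is a hypothetical way to record the origin.
    """
    root = ET.fromstring(manifest_xml)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for d in root.iter("dir"):
        ET.SubElement(body, "section", {"data-source": d.get("name")})
    return html

for path in document_paths(MANIFEST, base="unzipped"):
    print(path)
```

Whether the unzipped directories are actually readable at runtime is exactly the open question below; the sketch simply assumes they are.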
Question: will this work in testing? What can we assume about the extraction process and availability of inputs (to Saxon) at runtime?

Assignee: Wendell Piez

Preliminary analysis suggesting 'Zorba' optimizations
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/19
Updated: 2017-08-09T19:08:49Z
Author: Wendell Piez

Analysis of one of the Zorba example(s):
[b04-html-analysis.html](/uploads/8a53b8021b4b0abbd70fb6c4e51cbd2a/b04-html-analysis.html)
Probably the following can be stripped (retaining their contents):
- `b/b` (directly nested `b` elements)
- `span[@class=('apple-converted-space','basket-total')]`
- `span[@class='Hypertext']//*` (any element e.g. `u` or `i` appearing inside 'Hypertext' spans, whose formatting should be provided by CSS) - at least when `normalize-space(.)=normalize-space(..)` (the element is co-extensive with its parent)
- `@class[.=('NormalWeb','bCs')]` (superfluous class values)
Also, apart from providing a mapping into `h1` and `h2`, font and color info is apparently not particularly helpful, so could probably safely be stripped in general...
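A rough Python sketch of three of the rules listed above (nested `b`, the two noisy `span` classes, and the superfluous class values). It is not the project's XSLT, and it leaves out the 'Hypertext' rule; the helper names `unwrap` and `strip_noise` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

STRIP_SPAN_CLASSES = {"apple-converted-space", "basket-total"}
DROP_CLASS_VALUES = {"NormalWeb", "bCs"}

def unwrap(parent, child):
    """Replace child by its own content (text, children, tail) inside parent."""
    index = list(parent).index(child)
    prev = parent[index - 1] if index > 0 else None

    def append_text(text):
        # Attach loose text to whatever currently precedes the insertion point.
        if prev is None:
            parent.text = (parent.text or "") + text
        else:
            prev.tail = (prev.tail or "") + text

    append_text(child.text or "")
    kids = list(child)
    parent.remove(child)
    for offset, kid in enumerate(kids):
        parent.insert(index + offset, kid)
    if kids:
        kids[-1].tail = (kids[-1].tail or "") + (child.tail or "")
    else:
        append_text(child.tail or "")

def next_target(root):
    """Find one (parent, child) pair that matches a strip rule, or None."""
    for parent in root.iter():
        for child in parent:
            nested_b = parent.tag == "b" and child.tag == "b"
            noisy_span = child.tag == "span" and child.get("class") in STRIP_SPAN_CLASSES
            if nested_b or noisy_span:
                return parent, child
    return None

def strip_noise(root):
    # Restart the search after every change so the tree is never edited mid-iteration.
    while (hit := next_target(root)) is not None:
        unwrap(*hit)
    for el in root.iter():
        if el.get("class") in DROP_CLASS_VALUES:
            del el.attrib["class"]
    return root
```

Running the same function over an aggregation of all chapters would show how often each rule fires before committing to it.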
We should also run this analysis on an aggregation of all Zorba samples.

Assignee: Wendell Piez

Consider adding another bit of "scrub" logic to scrub.xsl
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/18
Updated: 2017-04-03T21:07:12Z
Author: Wendell Piez

Oftentimes HTML results from .docx show how formatting was controlled at the inline level not paragraph level, so we get things like:
```
<p>
<span style="font-size: 18">A: Nobody can beat me! I am the best showman in the whole history of man. </span>
</p>
```
We might consider removing the `span` and promoting its properties to the `p`.
Don't do this when there's a `@class` collision; also think through `@style`.
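One possible reading of this rule, sketched in Python rather than in `scrub.xsl` itself. The function name is hypothetical, and refusing to merge when both elements already carry a `style` is a conservative assumption, since the `@style` case is still to be thought through.

```python
import xml.etree.ElementTree as ET

def promote_single_span(p):
    """Move a lone span's attributes up to its p and drop the span.

    Returns True when the promotion was applied, False when p was left alone.
    """
    kids = list(p)
    if len(kids) != 1 or kids[0].tag != "span":
        return False
    span = kids[0]
    # The span must cover all of the paragraph's text content.
    if (p.text or "").strip() or (span.tail or "").strip():
        return False
    # Skip on a @class collision; also skip when both carry @style (assumption).
    if any(name in p.attrib and name in span.attrib for name in ("class", "style")):
        return False
    p.attrib.update(span.attrib)
    p.text = span.text
    p.remove(span)
    p.extend(list(span))
    return True

p = ET.fromstring('<p>\n<span style="font-size: 18">A: Nobody can beat me! </span>\n</p>')
promote_single_span(p)
print(ET.tostring(p, encoding="unicode"))
```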

Assignee: Wendell Piez

Expose indents
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/17
Updated: 2017-04-03T21:07:12Z
Author: Wendell Piez

Samples in Zorba (see e.g. `2-b00Urban_Intro.docx`).
A related issue is whether we should start exposing formatting settings that are bound via (paragraph or inline) style.

Assignee: Wendell Piez

Mapping formatting properties to header elements in 'Zorba'
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/16
Updated: 2017-08-08T18:33:44Z
Author: Adam Hyde (adam@coko.foundation)

(Old title: 'UCP custom script config for headers')
Some interesting issues that denote headings (from chapter 3 of Zorba):
Item 1:
```
<p><span style="font-size: 24"><b>CHAPTER 3</b></span></p>
```
Item 2:
```
<p><span style="font-size: 24"><b>FROM SEX TO SUPERCONSCIOUSNESS</b></span></p>
```
Item 3:
```
<p><span style="font-size: 24"><b>Sexuality, Tantra, and Liberation in 1970s India</b></span></p>
```
Item 4:
```
<p><span style="font-size: 20"><b><i>From Sex to Superconsciousness: Rajneesh, Freud, Reich, and the Transmutation of Desire</i></b></span></p>
```
Items 1, 2 and 3 share the same formatting: the largest font size in the doc, bold, no period at the end of the line, and placed at the top. They are probably H1.
Item 4 I think is an H2, as it has the second largest font size (2 points larger than the body p), no period, and contains an `<i>` nested in a `<b>`.
The question is how specific this is to UCP, and how much of it is just this author. We need to check against other UCP books.
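As a starting point for testing that heuristic against other books, here is a small Python sketch. It treats a `p` as a heading candidate only when it has the exact shape of the items above (one `span` with a `font-size`, wrapping one `b`, no trailing period), then maps the largest size to `h1` and the next to `h2`. The function names and the `body_size=18` default are assumptions drawn from this one chapter.

```python
import re
import xml.etree.ElementTree as ET

SIZE = re.compile(r"font-size:\s*(\d+)")

def heading_size(p):
    """Return the font size if p looks like <p><span style="font-size: N"><b>..</b></span></p>."""
    if len(p) != 1 or p[0].tag != "span" or (p.text or "").strip():
        return None
    span = p[0]
    if len(span) != 1 or span[0].tag != "b" or (span.text or "").strip():
        return None
    text = "".join(p.itertext()).strip()
    match = SIZE.search(span.get("style", ""))
    if not match or not text or text.endswith("."):
        return None
    return int(match.group(1))

def map_headings(body, body_size=18):
    """Rename candidate paragraphs to h1/h2 by font-size rank, dropping the span and b wrappers."""
    sized = [(p, heading_size(p)) for p in list(body.iter("p"))]
    ranked = sorted({s for _, s in sized if s is not None and s > body_size}, reverse=True)
    levels = {size: f"h{rank}" for rank, size in enumerate(ranked[:2], start=1)}
    for p, size in sized:
        if size in levels:
            bold = p[0][0]
            p.remove(p[0])
            p.text = bold.text
            p.extend(list(bold))
            p.tag = levels[size]
    return body
```

Because the ranks are computed per document, the same code can be pointed at another UCP book to see whether the pattern holds or is specific to this author.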

Assignee: Wendell Piez

Remove spans with empty font-family attribute
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/15
Updated: 2017-04-03T21:07:12Z
Author: Adam Hyde (adam@coko.foundation)

I see this in the fourth chapter:
```
<p>Reich had been one of Freud's leading
disciples during the early years of psychoanalysis; unlike
Freud, however, Reich was a <span style=
"font-family:">socialist</span> who <span style=
"font-family:">thought</span> it <span style=
"font-family:">imperative</span> to combine <span style=
"font-family:">politicalactivism</span> and sexual theory.
Sexual repression, Reich argued, was the cornerstone of
totalitarianism, so in order to liberate people <span style=
"font-family:">politically</span> it was necessary to
<span style="font-family:">liberate themsexually</span>
first<span style="font-family:">.</span><a class=
"endnoteReference" href="#en21">21</a></p>
```
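For the sample above, a quick way to see the effect is a regular expression over the serialized HTML. This is only a sketch in Python, not how the XSLT pipeline would do it: it handles spans whose sole content is text (no nested elements), which is all the sample contains, and the function name is made up.

```python
import re

# A span whose style is nothing but an empty font-family declaration,
# wrapping plain text only. Whitespace after "style=" is tolerated because
# the sample breaks the attribute across lines.
EMPTY_FONT_SPAN = re.compile(r'<span\s+style=\s*"font-family:\s*;?\s*">([^<]*)</span>')

def drop_empty_font_spans(html):
    """Remove the redundant span tags and keep their text."""
    return EMPTY_FONT_SPAN.sub(r"\1", html)

sample = 'Reich was a <span style=\n"font-family:">socialist</span> who thought'
print(drop_empty_font_spans(sample))
```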
`<span style="font-family:">` is obviously redundant and should be removed.

Assignee: Wendell Piez

HTML5 namespace
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/14
Updated: 2017-08-08T18:27:52Z
Author: Wendell Piez

HTML5 has certain namespaces defined as per https://www.w3.org/TR/2011/WD-html5-20110405/namespaces.html.
In particular, it appears namespace `http://www.w3.org/1999/xhtml` should be bound to unprefixed names (explicitly in the document); in other words it should look like XHTML4 in this regard (though not in `DOCTYPE` declaration):
```
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>
```
We should probably do this even though tools haven't seemed to care so far. Indeed we may wish to do it from the extractor forward (i.e. html should be namespace-qualified throughout), so as to avoid confusing users.
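To illustrate what "namespace-qualified throughout" would mean for the output, a small Python sketch (the real change would live in the extractor XSLT; `qualify` is a hypothetical name):

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Serialize the XHTML namespace as the default (unprefixed) namespace.
ET.register_namespace("", XHTML)

def qualify(root):
    """Put every unqualified element into the XHTML namespace."""
    for el in list(root.iter()):
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = f"{{{XHTML}}}{el.tag}"
    return root

doc = qualify(ET.fromstring("<html><body><p>x</p></body></html>"))
print(ET.tostring(doc, encoding="unicode"))
```

Element names stay unprefixed in the serialized result; only the `xmlns` declaration on the root is new.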

Assignee: Wendell Piez

Need a rule mapping Word style names into HTML @class
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/13
Updated: 2017-04-03T21:07:12Z
Author: Wendell Piez

Since Word styles can be named anything ("My paragraph style") we need a mapping that will cast spaces and other forbidden characters into permissible substrings, for @class assignment in the HTML.

Assignee: Wendell Piez

Internal cross-references
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/12
Updated: 2018-03-15T20:05:00Z
Author: Wendell Piez

The discovery that processing references (to citations, i.e. endnotes) may require providing generated text where the .docx has only a placeholder suggests we are going to have to take similar care with cross-references. (Word supplies mechanisms to cross-reference to arbitrary sections or other targets, showing page numbers or generated text dynamically. These could well be represented by empty elements in the .docx.)
Specifically, we need to be sure that elements indicating cross-references are not removed in the `scrub.xsl` stage, probably by providing them with some kind of content in an earlier pipeline step.

Assignee: Wendell Piez

Document how to set up pipelines
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/11
Updated: 2017-04-03T21:07:12Z
Author: Wendell Piez

[The 'Zorba' sample doc page](sample-zorba), which documents the XSLT sequence for the conversion of the _Zorba_ sample, needs to describe a command-line sequence (calling Saxon) with details - and/or we need a page on reading and constructing calls to Saxon, so that users have a hope of debugging pipelines when they change.

Assignee: Wendell Piez

Wiki page documenting pipeline!
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/10
Updated: 2017-04-03T21:07:12Z
Author: Wendell Piez

Probably every sample document, at least initially, is going to need its own wiki page, describing the pipeline steps to be run on the .docx input along with any special issues. The individual XSLTs should be linked (for inspection) and explained at a high level.
(Presumably this could evolve into a page on the particular INK recipe for the document.)

Assignee: Wendell Piez

What is 'position'?
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/9
Updated: 2017-04-03T21:07:12Z
Author: Wendell Piez

Coming out of the Word (formatting) is an element called 'position'. What is this and can it be removed?
Examples easily found in Zorba ch 2 (e.g. in end notes).

Assignee: Wendell Piez

Add terminal stylesheet to emit plain text
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/8
Updated: 2018-05-01T20:41:10Z
Author: Wendell Piez

We want the option to be able to output plain text, for example for diffing purposes. This can be addressed with an identity XSLT with a serialization `method='text'`, to be run as the terminal XSLT in a pipeline.

Assignee: Wendell Piez

Remove iCs tag
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/7
Updated: 2017-04-03T21:07:12Z
Author: Adam Hyde (adam@coko.foundation)

It appears `<iCs>` should be removed. It is a flag to denote operations on complex italics which does not have an equivalent in HTML and does not by itself equate to italicisation.
https://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.italiccomplexscript(v=office.14).aspx
eg from Zorba, Intro, ln 840
```
<span style="font-family: Palatino-Roman; font-size: 20">
<i>
<iCs>Modernity at Large: Cultural Dimensions of Globalization</iCs>
</i>
```

Assignee: Wendell Piez

Remove noproof tag and container (if empty)
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/6
Updated: 2017-04-03T21:07:12Z
Author: Adam Hyde (adam@coko.foundation)

It seems that `noProof` is a wrapper for MS Word instructions and hence can be removed. If the parent contains no content then I think it can also be removed, along with all other extraneous elements.
"This element specifies that the contents of this run shall not report any errors when the document is scanned for spelling and grammar"
https://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.noproof(v=office.14).aspx
From Zorba, 2 - Intro line 3844
```
<p>
<span style="font-family: Palatino; font-size: 20">
<b>
<i>
<noProof>
<lang/>
</noProof>
</i>
</b>
</span>
</p>
```

Assignee: Wendell Piez

Simplify p tags?
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/5
Updated: 2017-04-03T21:07:12Z
Author: Adam Hyde (adam@coko.foundation)

What do we do about something like this complex nested `<p>`?
```
<p>
<span style="font-family: Palatino; font-size: 18; font-size: 18">Max Weber’s famous metaphor in </span>
<span style="font-family: Palatino; font-size: 18; font-size: 18">
<i>The Protestant Ethic</i>
</span>
<span style="font-family: Palatino; font-size: 18; font-size: 18"> of religion striding into the marketplace of worldly affairs and slamming the monastery door behind it becomes further transformed in modern society with religion placed very much in the consumer marketplace alongside other meaning complexes. </span>
</p>
```
Is it safe to assume that any span beneath a `<p>` is an inline style (i.e. not a block style like a heading), so that a `<span>` carrying *only* font size and family information can be treated as extraneous? Alternatively, is it possible to use XSLT matching to detect that *if* sequential spans carry the same information, they should all be merged into one span?
If, however, it contains another element (e.g. another nested `<p>`, as below) then it is another kind of problem and should be left alone.
```
<p>
<span style="font-family: Palatino; font-size: 20">The other great idea that the world wants from us today…is the eternal grand idea of the spiritual oneness of the whole universe…This is the dictate of Indian philosophy. This oneness is the rationale of all ethics and all spirituality. Europe wants it today just as much as our downtrodden masses do, and this great principle is even now unconsciously forming the basis of all the latest political and social aspirations that are coming up in England, in Germany, in France and in America.</span>
<span style="font-family: Palatino; font-size: 20">
<div class="endnote_fetched">
<p class="EndnoteText">
<span style="font-size: 20; font-size: 20">
<span class="EndnoteReference"/>
</span>
<span style="font-size: 20; font-size: 20"> Vivekananda,</span>
<span style="font-size: 20; font-size: 20"> </span>
<span style="font-size: 20; font-size: 20">
<i>The Complete Works of Swami Vivekananda,</i>
</span>
<span style="font-size: 20; font-size: 20"> vol.3 (Calcutta: Advaita Ashram, 1983), </span>
<span style="font-size: 20; font-size: 20">p.189</span>
<span style="font-size: 20; font-size: 20">.</span>
</p>
</div>
</span>
</p>
```
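The second question (merging sequential spans that carry the same information) can be sketched as follows. This is Python for illustration only, not the XSLT that would go into the pipeline; the function names are hypothetical, and whitespace between adjacent spans is assumed to be insignificant indentation. Spans that contain a `div` or `p`, as in the endnote example above, are left alone.

```python
import xml.etree.ElementTree as ET

BLOCKS = {"div", "p"}

def has_block(span):
    """True if the span wraps block-level content and should not be touched."""
    return any(el.tag in BLOCKS for el in span.iter())

def merge_adjacent_spans(parent):
    """Fold each span into the preceding sibling span when their attributes are identical."""
    previous = None
    for child in list(parent):
        mergeable = (
            previous is not None
            and previous.tag == "span" and child.tag == "span"
            and previous.attrib == child.attrib
            and not (previous.tail or "").strip()
            and not has_block(previous) and not has_block(child)
        )
        if not mergeable:
            previous = child
            continue
        # The merged span's leading text goes after whatever ends the previous span.
        if len(previous):
            previous[-1].tail = (previous[-1].tail or "") + (child.text or "")
        else:
            previous.text = (previous.text or "") + (child.text or "")
        previous.extend(list(child))
        previous.tail = child.tail
        parent.remove(child)
    return parent
```

Once a paragraph is down to a single span, the question of whether that span is extraneous becomes much easier to answer.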

Assignee: Wendell Piez

Handling colors
https://gitlab.coko.foundation/XSweet/XSweet/-/issues/4
Updated: 2017-08-16T19:27:59Z
Author: Wendell Piez

Chapter 2, at least, has some examples of color; however, they are not coming across very well.
```
<span class="Hyperlink">
<i>
<color>
<u>The Rajneesh Chronicles: The Story of the Cult that Unleashed the First Act of Bioterrorism on U.S. Soil </u>
</color>
</i>
</span>
```
Two items:
* Extend extract logic so we learn more about the color (is a value available?)
* Determine how we wish to handle color in this instance (and all chapters in this example):
  * Map it into something (a named class on a `span` or other element?)
  * Remove it as uninformative (for example, if `color` appears only with `span[@class='Hyperlink']`)
For the second, it would be helpful to poll the data.
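A poll of that kind could look like the following Python sketch, which counts the class of the nearest enclosing `span` for every `color` element. The function name and the placeholder labels are assumptions; the same question could equally be asked with an XPath over the aggregated HTML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def poll_color_contexts(root):
    """Tally the class of the nearest enclosing span for each <color> element."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    tally = Counter()
    for color in root.iter("color"):
        node = parents.get(color)
        while node is not None and node.tag != "span":
            node = parents.get(node)
        if node is None:
            tally["(no span)"] += 1
        else:
            tally[node.get("class", "(span without class)")] += 1
    return tally
```

If every hit lands under 'Hyperlink', removing `color` as uninformative looks safe; anything else is a case for mapping it.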

Assignee: Wendell Piez