XSweet issues
https://gitlab.coko.foundation/groups/XSweet/-/issues
2017-08-08T19:06:44Z

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/25
Multiple "style=font size" attributes for the same elements
2017-08-08T19:06:44Z
Alex Theg

From the title page of Bakker, some elements have multiple font sizes declared on them:
```
<div class="docx-body">
<p style="font-size: 28pt; font-size: 48pt">
<span style="font-size: 28pt; font-size: 48pt">Migrating into Financial Markets</span>
</p>
<p style="font-size: 24pt">
<span style="font-size: 24pt">How Remittances </span>
</p>
<p style="font-size: 24pt">
<span style="font-size: 24pt">Became a Development Tool</span>
</p>
<p style="font-size: 18pt; font-size: 28pt">
<span style="font-size: 18pt; font-size: 28pt">Matt Bakker</span>
</p>
</div>
```
Am I correct that when the same property is declared multiple times, the last declaration wins? Is it also correct that one of the refine steps orders the font-size declarations from smallest to largest?
The original doc has a 48pt title, 24pt subtitle, and 18pt author name.
The current extraction makes the title 48pt, subheading 24pt, and the author's name 28pt.
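On the first question: within a single `style` attribute, CSS does resolve repeated declarations of the same property in favor of the last one (when none is marked `!important`). A minimal Python sketch of collapsing duplicates on that basis follows. The function name is illustrative and not part of XSweet, whose pipeline is XSLT, and the sketch assumes simple values that contain no semicolons.

```python
def collapse_declarations(style):
    """Keep only the last declaration of each property in a style attribute value."""
    last = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed fragments
        prop, value = declaration.split(":", 1)
        prop = prop.strip().lower()
        last.pop(prop, None)  # re-insert so order follows the winning declaration
        last[prop] = value.strip()
    return "; ".join(f"{prop}: {value}" for prop, value in last.items())
```

This only reproduces what a browser would display. For the author line it yields 28pt, not the 18pt of the original document, so it does not by itself recover the size Word was showing.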
So the question is: how do we get from multiple font sizes specified on a tag to only one, while making sure we keep the font size that was actually displayed in the Word doc?
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/24
Preliminary "Header inferencing" XSLT
2018-05-01T20:44:41Z
Wendell Piez

The develop branch @aaad1151a46ab75771c5b6f92912680b610d4bd1 now has a demo pipeline showing dynamic attribution of header levels h1-hx to paragraphs in a document based on their formatting. It is fairly crude but seemingly effective.
It needs demo and discussion, and in particular the rules for what makes a header need to be shaken out.
However, it also needs an invocation script. (So far there is only an XProc pipeline, which does a little more work than the regular extraction pipeline, then generates and applies an XSLT to achieve this mapping.) So that comes next...
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/23
Images
2018-05-21T17:47:38Z
Wendell Piez

The end of Zorba chapter 4 `b04%20Urban.docx` has photographs ... making this a reasonable case for analysis.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/22
Locate, implement and test footnotes conversion
2017-08-16T18:59:57Z
Wendell Piez

So-called (in Word) "endnotes" have gotten some attention but footnotes need analysis, maybe a bit of specification (should they work just like endnotes?) and implementation. Find or create an example with actual footnotes (i.e. not just placeholders).
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/21
Assessing coverage by comparing plain-text diffs
2018-03-15T20:02:17Z
Wendell Piez

A stylesheet `plaintext.xsl` is now provided, which is able to strip tagging from an XHTML (or other well-formed XML) document. To HTML it adds a bit of whitespace (line ends) for legibility.
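As a rough Python analogue of that behavior (the actual `plaintext.xsl` is not reproduced here, so this is a sketch under assumptions): drop all tags, emit a line end after block-level elements, and diff the result against another plain-text rendering. The set of block-level tags and both function names are illustrative choices.

```python
import difflib
import xml.etree.ElementTree as ET

# Assumption: elements after which a line end is emitted for legibility.
BLOCK_TAGS = {"p", "div", "section", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]

def plain_text(xml_source):
    """Return the text content of a well-formed XML string, tags removed."""
    parts = []

    def walk(element):
        if element.text:
            parts.append(element.text)
        for child in element:
            walk(child)
            if child.tail:
                parts.append(child.tail)
        if local_name(element.tag) in BLOCK_TAGS:
            parts.append("\n")

    walk(ET.fromstring(xml_source))
    return "".join(parts)

def diff_report(ours, reference):
    """Line-based unified diff between two plain-text renderings."""
    return list(difflib.unified_diff(
        ours.splitlines(), reference.splitlines(),
        fromfile="extracted", tofile="reference", lineterm=""))
```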
It will be really interesting to compare the results of this transformation with what Word provides in a "Save As (Plain Text .txt)" over the same input file.
1.0.0
Alex Theg

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/20
Early aggregator to call extractor over a sequence of files
2017-08-09T19:08:49Z
Wendell Piez

A simple file manifest source (`xsweet-manifest.xml`) could be used to drive the extractor and produce an assembly of HTML `section` elements within a single file, ready for subsequent pipelining. (This is a job that INK may ultimately be doing, but we probably need the functionality sooner. Saxon can do the job if it can read the source data.) Such a single merged result for the book will help us in mapping, since we'd have a comprehensive view of all the sources together. Plus, it may help to scale (or not): total throughput time should be lower since there is only one pipeline rather than one per chapter, so less overhead.
Something really simple to start such as
``` xml
<source-files>
<dir name="a1 Urban_tit"/>
<dir name="a2 Urban_toc"/>
<dir name="a3 Urban_pref"/>
<dir name="a4 Urban_Illus"/>
<dir name="b00 Urban_Intro"/>
<dir name="b01 Urban"/>
<dir name="b02 Urban"/>
<dir name="b03 Urban"/>
<dir name="b04 Urban"/>
<dir name="b05 Urban"/>
<dir name="b06 Urban"/>
<dir name="b07 Urban_Concl"/>
<dir name="z Urban_bib"/>
</source-files>
```
The idea being that a `word/document.xml` will be found inside each directory named, and its extracted results are to be merged as a `section` into a single HTML document.
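To make the lookup concrete, here is a small Python sketch that reads such a manifest and lists the `word/document.xml` path expected under each named directory. The function name is hypothetical, and the real aggregator would presumably be XSLT run under Saxon.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

def document_paths(manifest_xml):
    """List the word/document.xml path for each <dir> in the manifest, in order."""
    root = ET.fromstring(manifest_xml)
    return [PurePosixPath(d.get("name")) / "word" / "document.xml"
            for d in root.iter("dir")]
```

Each path would then be handed to the extractor in manifest order, and each result wrapped in its own `section`.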
Question: will this work in testing? What can we assume about the extraction process and availability of inputs (to Saxon) at runtime?
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/19
Preliminary analysis suggesting 'Zorba' optimizations
2017-08-09T19:08:49Z
Wendell Piez

Analysis of one of the Zorba example(s):
[b04-html-analysis.html](/uploads/8a53b8021b4b0abbd70fb6c4e51cbd2a/b04-html-analysis.html)
Probably the following can be stripped (retaining their contents):
- `b/b` (directly nested `b` elements)
- `span[@class=('apple-converted-space','basket-total')]`
- `span[@class='Hypertext']//*` (any element e.g. `u` or `i` appearing inside 'Hypertext' spans, whose formatting should be provided by CSS) - at least when `normalize-space(.)=normalize-space(..)` (the element is co-extensive with its parent)
- `@class[.=('NormalWeb','bCs')]` (superfluous class values)
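Two of these rules, the directly nested `b` elements and the superfluous class values, can be sketched in Python with the standard library. This is an illustration under assumptions, not the project's XSLT, and the helper names are made up for the example. The care needed is in keeping text in document order when an element is unwrapped.

```python
import xml.etree.ElementTree as ET

SUPERFLUOUS_CLASSES = {"NormalWeb", "bCs"}

def unwrap(parent, child):
    """Replace child with its own contents inside parent, keeping text order."""
    index = list(parent).index(child)
    kids = list(child)
    lead = child.text or ""
    trail = child.tail or ""
    if kids:
        kids[-1].tail = (kids[-1].tail or "") + trail
    else:
        lead += trail
    if index > 0:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + lead
    else:
        parent.text = (parent.text or "") + lead
    parent.remove(child)
    for offset, kid in enumerate(kids):
        parent.insert(index + offset, kid)

def find_nested_b(root):
    """Return the first (parent, child) pair where a b sits directly inside a b."""
    for parent in root.iter("b"):
        for child in parent:
            if child.tag == "b":
                return parent, child
    return None

def scrub(root):
    """Unwrap b/b and drop class attributes whose whole value is superfluous."""
    while (hit := find_nested_b(root)) is not None:
        unwrap(*hit)
    for element in root.iter():
        if element.get("class") in SUPERFLUOUS_CLASSES:
            del element.attrib["class"]
```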
Also, apart from providing a mapping into `h1` and `h2`, font and color info is apparently not particularly helpful, so could probably safely be stripped in general...
We should also run this analysis on an aggregation of all Zorba samples.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/18
Consider adding another bit of "scrub" logic to scrub.xsl
2017-04-03T21:07:12Z
Wendell Piez

Oftentimes HTML results from .docx show how formatting was controlled at the inline level not paragraph level, so we get things like:
```
<p>
<span style="font-size: 18">A: Nobody can beat me! I am the best showman in the whole history of man. </span>
</p>
```
We might consider removing the `span` and promoting its properties to the `p`.
Don't do this when there's a `@class` collision; also think through `@style`.
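A conservative Python sketch of that promotion, written with `xml.etree.ElementTree` rather than as the XSLT that `scrub.xsl` would need. The function name is hypothetical. It assumes a `span` is promoted only when it is the sole content of the `p`, and it simply declines whenever `@class` or `@style` would collide, leaving the question of merging `@style` values open.

```python
import xml.etree.ElementTree as ET

def promote_sole_span(p):
    """Move a lone span's attributes and contents up to its p. Return True if done."""
    children = list(p)
    if len(children) != 1 or children[0].tag != "span":
        return False
    span = children[0]
    # The span must be co-extensive with the paragraph: no text around it.
    if (p.text or "").strip() or (span.tail or "").strip():
        return False
    # Only handle spans carrying class and/or style, and never overwrite.
    if set(span.attrib) - {"class", "style"}:
        return False
    if any(name in p.attrib for name in span.attrib):
        return False
    for name, value in span.attrib.items():
        p.set(name, value)
    p.remove(span)
    p.text = span.text
    p.extend(list(span))
    return True
```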
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/17
Expose indents
2017-04-03T21:07:12Z
Wendell Piez

Samples in Zorba (see e.g. `2-b00Urban_Intro.docx`).
A related issue is whether we should start exposing formatting settings that are bound via (paragraph or inline) style.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/16
Mapping formatting properties to header elements in 'Zorba'
2017-08-08T18:33:44Z
Adam Hyde
adam@coko.foundation

(Old title: 'UCP custom script config for headers')
Some interesting issues that denote headings (from chapter 3 of Zorba):
Item 1:
```
<p><span style="font-size: 24"><b>CHAPTER 3</b></span></p>
```
Item 2:
```
<p><span style="font-size: 24"><b>FROM SEX TO SUPERCONSCIOUSNESS</b></span></p>
```
Item 3:
```
<p><span style="font-size: 24"><b>Sexuality, Tantra, and Liberation in 1970s India</b></span></p>
```
Item 4:
```
<p><span style="font-size: 20"><b><i>From Sex to Superconsciousness: Rajneesh, Freud, Reich, and the Transmutation of Desire</i></b></span></p>
```
Items 1, 2 and 3 share the same formatting: the largest font size in the doc, bold, no period at the end of the line, and placed at the top. They are probably H1.
Item 4 is, I think, an H2: it has the second-largest font size (2 points larger than the body p), no period, and contains an ```<i>``` nested in a ```<b>```.
The question is how specific this is to UCP, and how much of it is just this author. We need to check against other UCP books.
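The heuristic described here can be written down as a small Python sketch: a paragraph is a heading candidate if it is bold, larger than body text, and does not end in a period, and candidates are ranked by distinct font size. The `Para` record, the function names, and the body size of 18 are assumptions made for illustration (18 is inferred from Item 4 being "2 points larger" at 20). Whether these rules hold beyond this one author is exactly the open question.

```python
from typing import NamedTuple

class Para(NamedTuple):
    text: str
    font_size: int  # as extracted, e.g. 24 for style="font-size: 24"
    bold: bool

def is_candidate(para, body_size):
    """Bold, larger than body text, and not ending in a period."""
    return (para.bold and para.font_size > body_size
            and not para.text.rstrip().endswith("."))

def infer_levels(paras, body_size):
    """Map each paragraph to 'h1', 'h2', ... by rank of font size, or None."""
    sizes = sorted({p.font_size for p in paras if is_candidate(p, body_size)},
                   reverse=True)
    level = {size: f"h{rank}" for rank, size in enumerate(sizes, start=1)}
    return [level[p.font_size] if is_candidate(p, body_size) else None
            for p in paras]
```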
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/15
Remove spans with empty font-family attribute
2017-04-03T21:07:12Z
Adam Hyde
adam@coko.foundation

I see this in the fourth chapter:
```
<p>Reich had been one of Freud's leading
disciples during the early years of psychoanalysis; unlike
Freud, however, Reich was a <span style=
"font-family:">socialist</span> who <span style=
"font-family:">thought</span> it <span style=
"font-family:">imperative</span> to combine <span style=
"font-family:">politicalactivism</span> and sexual theory.
Sexual repression, Reich argued, was the cornerstone of
totalitarianism, so in order to liberate people <span style=
"font-family:">politically</span> it was necessary to
<span style="font-family:">liberate themsexually</span>
first<span style="font-family:">.</span><a class=
"endnoteReference" href="#en21">21</a></p>
```
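A quick way to try out the removal on serialized output like the above is a regular expression that unwraps these spans. This is only a sketch: the names are illustrative, it assumes the spans contain no nested `span`, and a real fix would more likely live in an XSLT step.

```python
import re

# Matches <span style="font-family:"> ... </span>, tolerating line breaks
# between 'style=' and the attribute value as in the sample above.
EMPTY_FONT_SPAN = re.compile(
    r'<span\s+style=\s*"font-family:\s*"\s*>(.*?)</span>', re.DOTALL)

def unwrap_empty_font_spans(html):
    """Remove spans whose only attribute is an empty font-family, keeping contents."""
    return EMPTY_FONT_SPAN.sub(r"\1", html)
```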
```<span style="font-family:">``` is obviously redundant and should be removed.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/14
HTML5 namespace
2017-08-08T18:27:52Z
Wendell Piez

HTML5 has certain namespaces defined as per https://www.w3.org/TR/2011/WD-html5-20110405/namespaces.html.
In particular, it appears namespace `http://www.w3.org/1999/xhtml` should be bound to unprefixed names (explicitly in the document); in other words it should look like XHTML4 in this regard (though not in `DOCTYPE` declaration):
```
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>
```
We should probably do this even though tools haven't seemed to care so far. Indeed we may wish to do it from the extractor forward (i.e. html should be namespace-qualified throughout), so as to avoid confusing users.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/13
Need a rule mapping Word style names into HTML @class
2017-04-03T21:07:12Z
Wendell Piez

Since Word styles can be named anything ("My paragraph style") we need a mapping that will cast spaces and other forbidden characters into permissible substrings, for @class assignment in the HTML.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/12
Internal cross-references
2018-03-15T20:05:00Z
Wendell Piez

The discovery that processing references (to citations, i.e. endnotes) may require providing generated text where the .docx has only a placeholder suggests we are going to have to take similar care with cross-references. (Word supplies mechanisms to cross-reference to arbitrary sections or other targets, showing page numbers or generated text dynamically. These could well be represented by empty elements in the .docx.)
Specifically, we need to be sure that elements indicating cross-references are not removed in the `scrub.xsl` stage, probably by providing them with some kind of content in an earlier pipeline step.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/11
Document how to set up pipelines
2017-04-03T21:07:12Z
Wendell Piez

[The 'Zorba' sample doc page](sample-zorba), which documents the XSLT sequence for the conversion of the _Zorba_ sample, needs to describe a command-line sequence (calling Saxon) in detail, and/or we need a page on reading and constructing calls to Saxon, so that users have a hope of debugging pipelines when they change.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/10
Wiki page documenting pipeline!
2017-04-03T21:07:12Z
Wendell Piez

Probably every sample document, at least initially, is going to need its own wiki page, describing the pipeline steps to be run on the .docx input along with any special issues. The individual XSLTs should be linked (for inspection) and explained at a high level.
(Presumably this could evolve into a page on the particular INK recipe for the document.)
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/9
What is 'position'?
2017-04-03T21:07:12Z
Wendell Piez

Coming out of the Word (formatting) is an element called 'position'. What is this and can it be removed?
Examples easily found in Zorba ch 2 (e.g. in end notes).
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/8
Add terminal stylesheet to emit plain text
2018-05-01T20:41:10Z
Wendell Piez

We want the option to be able to output plain text, for example for diffing purposes. This can be addressed with an identity XSLT with a serialization `method='text'`, to be run as the terminal XSLT in a pipeline.
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/7
Remove iCs tag
2017-04-03T21:07:12Z
Adam Hyde
adam@coko.foundation

It appears ```<iCs>``` should be removed. It is a flag to denote operations on complex-script italics, which does not have an equivalent in HTML and does not by itself equate to italicisation.
https://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.italiccomplexscript(v=office.14).aspx
eg from Zorba, Intro, ln 840
```
<span style="font-family: Palatino-Roman; font-size: 20">
<i>
<iCs>Modernity at Large: Cultural Dimensions of Globalization</iCs>
</i>
```
Wendell Piez

https://gitlab.coko.foundation/XSweet/XSweet/-/issues/6
Remove noproof tag and container (if empty)
2017-04-03T21:07:12Z
Adam Hyde
adam@coko.foundation

It seems that ```noProof``` is a wrapper for MS Word instructions and hence can be removed. If the parent contains no content then I think it can also be removed, along with all other extraneous elements.
"This element specifies that the contents of this run shall not report any errors when the document is scanned for spelling and grammar"
https://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.noproof(v=office.14).aspx
From Zorba, 2 - Intro line 3844
```
<p>
<span style="font-family: Palatino; font-size: 20">
<b>
<i>
<noProof>
<lang/>
</noProof>
</i>
</b>
</span>
</p>
```
Wendell Piez
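For the empty-container case in the `noProof` issue above, a bottom-up prune can be sketched in Python as follows. This is an illustration, not XSweet code. The names and the `KEEP_EMPTY` set are assumptions, and any empty element that carries meaning (such as a cross-reference placeholder) would need to be added to that set or given content first.

```python
import xml.etree.ElementTree as ET

# Assumption: empty elements that are meaningful and must survive pruning.
KEEP_EMPTY = {"img", "br", "hr"}

def is_dispensable(element):
    """True if element holds no text and nothing worth keeping beneath it."""
    if element.tag in KEEP_EMPTY:
        return False
    if (element.text or "").strip():
        return False
    return all(is_dispensable(child) and not (child.tail or "").strip()
               for child in element)

def prune(element):
    """Remove dispensable descendants bottom-up; element itself stays in place."""
    for child in list(element):
        prune(child)
        if is_dispensable(child) and not (child.tail or "").strip():
            element.remove(child)
```

Applied to the sample above, the whole `p` is removed, since nothing inside it carries text.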