---
title: "HTMLevator"
draft: false
weight: 220
part: 1
Intro : "HTMLevator is a series of enhancement utilities for improving HTML documents"
class: documentation
---
HTMLevator is a series of enhancement utilities for improving HTML documents. These tools can be used on HTML files made from `.docx`s with XSweet, or on other, arbitrary HTML files. HTMLevator features include:
* Semantic structure inferring, adding headings and sections to flat HTML files
* Copyediting cleanups to normalize text
* Tools to add customized transformations in simplified syntax
## Contents
* [Heading promotion](#heading)
  * [Format-based analysis](#format-based)
  * [Outline-level heading promotion](#outline-based)
  * [Custom configuration](#custom-hp)
* [Plain text URL linking](#url)
* [Plain text list tagging](#list-detection)
* [Copyediting cleanups](#ucp)
  * [ucp-text-macros.xsl](#text-cleanup)
  * [ucp-mappings.xsl](#ucp-mappings)
* [Custom transformations (experimental)](#custom-trans)
* [Section inferrer (experimental)](#sections)
## Heading promotion
HTMLevator includes a feature that attempts to infer which elements are headings, transforming them from `<p>`s into headings: `<h1>` through `<h6>`. This is more art than science, as the input is generally not semantically tagged and structured. It is sometimes trivial to infer headers but it is also frequently quite difficult or impossible to do so unassisted or programmatically. As such, heading promotion will not catch all headings all the time, and it will work better on some documents than on others.
There are 3 heading promotion strategies built into XSweet:
1. Format-based analysis (default)
2. Outline-level heading promotion
3. Custom configuration for named Word styles or specific text
The `header-promote/header-promotion-CHOOSE.xsl` sheet will try to pick the best approach to use for a given document:
* If no custom configurations are supplied, `header-promotion-CHOOSE.xsl` checks to see whether outline levels appear to have been used. If outline level data exists, it is used as the basis for heading promotion
* If outline levels have not been used, format-based analysis is used to infer and promote headings
Alternatively, you can specify the header promotion method to use by passing it as a runtime parameter with `header-promotion-CHOOSE.xsl`:
* `method=ranked-format`
* `method=outline-level`
* `method=my-styles.xml`
### Format-based analysis
As a rule, authors indicate headings with visual formatting far more commonly than by applying named MS Word styles. It's not possible to have a discrete list of what kind of formatting indicates a heading, as it changes from file to file and is highly contextual. Instead, each individual document and its formatting must be analyzed as a whole before making guesses about headings. Format-based heading promotion does just this.
This approach works well for some documents and poorly for others. One size does not fit all, and the approach is simply to optimize for what works well with the greatest number of documents. Table of contents and reference files often contain many short paragraphs, leading to erroneous heading promotion.
The `header-promote/digest-paragraphs.xsl` sheet performs this file analysis. It makes a representation of every `<p>` in the document with relevant formatting properties:
* `font-size`
* `font-style`
* `font-weight`
* `text-decoration`
* `color`
* `text-align`
Next, it sorts paragraphs into groups that share identical formatting, one group for each distinct combination of properties. These groups are candidates for promotion from `<p>` to `<h1-6>`. HTMLevator considers:
* How many paragraphs are formatted the same way
* The average paragraph length in each format group
* How often paragraphs in one format group appear in continuous runs (and thus probably aren't headings)
* Whether paragraphs are all caps
Decisions about what to consider headings are made as follows:
* Anything that is right-aligned is not considered for heading promotion
* The most common type of paragraph in the document (i.e. the combination of paragraph properties that occurs the most) is not considered for heading promotion
* Promote a paragraph group to headings if:
  * The average run of consecutive paragraphs styled the same way is 4 or fewer (long runs of `<p>`s with the same styling suggest the paragraphs aren't headings), AND
  * The font size specified is not the smallest font size found in the document, AND
  * The average length of paragraphs with the given set of properties is not more than 120 characters
* Promote a paragraph group if it is:
  * Centered, AND
  * Less than 200 characters in average length, AND
  * The average consecutive paragraph run is less than 2
* Promote a paragraph group if it never ends in a period
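The decision rules above can be sketched in Python to make their interaction easier to follow. This is only an illustration, not the logic of `digest-paragraphs.xsl` itself: the `FormatGroup` class, its field names, and the choice to treat the three "promote" rules as alternatives (any one is enough) are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class FormatGroup:
    """Hypothetical summary of all paragraphs sharing one formatting combination."""
    text_align: str             # e.g. "left", "center", "right"
    font_size_pt: float
    avg_length: float           # average characters per paragraph
    avg_run: float              # average number of consecutive paragraphs in this format
    never_ends_in_period: bool
    is_most_common: bool        # True for the dominant (body text) format


def is_heading_candidate(group: FormatGroup, smallest_font_pt: float) -> bool:
    # Exclusions: right-aligned text and the most common format are never promoted.
    if group.text_align == "right" or group.is_most_common:
        return False
    # Rule 1: short runs, not the smallest font, short paragraphs.
    if (group.avg_run <= 4
            and group.font_size_pt > smallest_font_pt
            and group.avg_length <= 120):
        return True
    # Rule 2: centered, fairly short, and rarely consecutive.
    if (group.text_align == "center"
            and group.avg_length < 200
            and group.avg_run < 2):
        return True
    # Rule 3: no paragraph in the group ends in a period.
    return group.never_ends_in_period
```

A body-text group (most common format, long paragraphs) is rejected, while a short, centered, larger-font group is accepted.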
After HTMLevator has identified paragraph groups to mark as headings, it must guess the outline level. It does so based on the following attributes, in this order:
1. Font size (bigger = higher heading level)
2. Italics
3. Bold
4. Underline
5. Always caps
Generally speaking, HTMLevator's heading detection does a better job detecting headings than it does at guessing the heading's level.
#### XSLT sequence
This is the default heading promotion method, run if outline level data is not present. You can also select it explicitly by passing `method=ranked-format`.
1. First, `header-promote/digest-paragraphs.xsl` makes the paragraph groupings, and guesses what formats should be headings (and what level those headings should be).
2. The `header-promote/make-header-escalator-xslt.xsl` sheet uses the `digest-paragraphs.xsl` output as its input, which it uses to produce a bespoke `XSL` sheet.
3. Running this sheet on the original HTML file implements the heading promotion, replacing the `<p>`s thought to be headings with `<h1-6>`.
### Outline-level heading promotion
An outline level can be specified on a paragraph in Word (often inherited from a named Word style). Some writers use this outlining functionality in Word, either deliberately, or implicitly through careful use of named styles. In these instances, outline levels are often a reliable indicator of headings and heading levels.
When outline levels are specified in Word's XML (e.g. `<w:outlineLvl w:val="0"/>`), they are extracted by XSweet as an `-xsweet-outline-level` property on the `<p>`.
When this property is present at least twice in the HTML document, the `header-promote/header-promotion-CHOOSE.xsl` sheet will elect to use outline levels to promote headings.
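As a rough illustration of the "present at least twice" check, the following Python sketch counts outline-level properties in an HTML string. It assumes the property is serialized inside a `style` attribute as `-xsweet-outline-level: N`; the function names are hypothetical, and the sketch ignores custom configurations, which take precedence in `header-promotion-CHOOSE.xsl`.

```python
import re

# Assumed serialization: style="-xsweet-outline-level: 0"
OUTLINE_LEVEL = re.compile(r"-xsweet-outline-level\s*:\s*(\d+)")


def outline_levels(html: str) -> list[int]:
    """Return every outline-level value found in the HTML text."""
    return [int(value) for value in OUTLINE_LEVEL.findall(html)]


def choose_method(html: str) -> str:
    """Pick outline-level promotion only if the property occurs at least twice."""
    if len(outline_levels(html)) >= 2:
        return "outline-level"
    return "ranked-format"
```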
### Custom configuration
To create a custom configuration:
1. Create a custom mapping (`my-styles.xml` or what have you). See the example provided in `config-mockup.xml` for syntax.
2. Run the `header-promotion-CHOOSE.xsl` sheet, passing the custom mapping `.xml` sheet as a runtime parameter (`method=my-styles.xml`)
3. The `make-header-mapper-xslt.xsl` sheet will generate and apply a custom XSL sheet based on your XML file
## Plain text URL linking
`hyperlink-inferencer/hyperlink-inferencer.xsl`
This sheet searches for plain-text URLs and automatically links them. It can recognize links with the following TLDs:
* .com
* .org
* .net
* .gov
* .mil
* .edu
* .io
* .foundation
* country TLDs
XSweet looks for a top-level domain preceded by one or more strings that contain only letters, numbers, underscores, and dashes (no spaces or other punctuation). These strings can be separated by periods ("."). Note that this rule will capture a `www.` if it is present.
XSweet recognizes the protocol, if it has been specified (`http://`, `https://`, `ftp:`), and includes it in the link. If the protocol has not been specified, `http://` is prepended to the link's `href`.
This sheet will also capture query strings on links.
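The matching rule described above can be approximated with a regular expression. The Python sketch below is not the pattern used by `hyperlink-inferencer.xsl`; it makes several assumptions: country TLDs are approximated as any two letters, the FTP protocol is written as `ftp://`, and the names `URL` and `link_urls` are invented for illustration.

```python
import re

# TLDs named in the text; "country TLDs" is approximated as any two letters.
TLD = r"(?:com|org|net|gov|mil|edu|io|foundation|[a-z]{2})"

URL = re.compile(
    r"(?P<protocol>(?:https?|ftp)://)?"               # optional protocol
    r"(?P<host>(?:[A-Za-z0-9_-]+\.)+" + TLD + r")\b"  # dotted labels, then a TLD
    r"(?P<path>/[^\s<>?]*)?"                          # optional path
    r"(?P<query>\?[^\s<>]*)?",                        # optional query string
    re.IGNORECASE,
)


def link_urls(text: str) -> str:
    """Wrap each plain-text URL in an <a> element."""
    def to_anchor(match: re.Match) -> str:
        url = match.group(0)
        # Prepend http:// to the href when no protocol was written.
        href = url if match.group("protocol") else "http://" + url
        return f'<a href="{href}">{url}</a>'
    return URL.sub(to_anchor, text)
```

The visible link text keeps whatever the author typed; only the `href` gains the default protocol.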
## Plain text list tagging
`DETECT-ITEMIZE-LISTS.xsl`
This module recognizes plain text that looks like a numbered list and marks up the corresponding list (as an `<ol>`) and list items (as `<li>`s).
`DETECT-ITEMIZE-LISTS.xsl` runs three separate sheets in sequence:
* `detect-numbered-lists.xsl`, which detects lists and bookends them with `<xsw:list xmlns:xsw="http://coko.foundation/xsweet" level="0">`
* `itemize-detected-lists.xsl`, which converts the `<xsw:list>` tags to `<ol>`, and wraps each paragraph in `<li>`s
* `scrub-literal-numbering-lists.xsl`, which removes from each list item the leading whitespace, literal text numbering, the period, and the whitespace after it
Lists must match the following pattern to be detected and marked as a numbered list:
* Each list item paragraph may start with any amount of white space (including none), followed by
* a string of one or more numerals, followed by
* a period, followed by
* one or more white space characters.
* Further, at least two consecutive paragraphs must meet these criteria to be marked as a list
List items that meet these criteria are scrubbed of their literal numbering (and the following white space) in favor of automatically generated `<ol>` numbering.
Note that this feature creates a flat list (one level), rather than nested lists based on indentation.
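The detection pattern and the two-paragraph minimum can be sketched as follows. This is an illustrative Python approximation of the three-sheet pipeline, not a port of it: paragraphs are modeled as plain strings, a detected list is modeled as a nested Python list, and the names `NUMBERED` and `itemize` are assumptions.

```python
import re

# Optional leading white space, one or more digits, a period, then white space.
NUMBERED = re.compile(r"^\s*\d+\.\s+")


def itemize(paragraphs: list[str]) -> list:
    """Group runs of two or more numbered paragraphs into flat lists."""
    result: list = []
    run: list[str] = []

    def flush() -> None:
        if len(run) >= 2:
            # A real list: strip the literal numbering from each item.
            result.append([NUMBERED.sub("", p) for p in run])
        else:
            # A lone numbered paragraph is left untouched.
            result.extend(run)
        run.clear()

    for paragraph in paragraphs:
        if NUMBERED.match(paragraph):
            run.append(paragraph)
        else:
            flush()
            result.append(paragraph)
    flush()
    return result
```

Note that `3.Not an item` is not matched because no white space follows the period, and a single numbered paragraph on its own is not turned into a list.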
This module can be run before or after the `PROMOTE-lists.xsl` feature in XSweet Core. To use it, you can modify the `execute_chain.sh` file of the XSweet_runner_scripts to include this step before the `final-rinse.xsl` step.
See also the documentation for [marked list handling](/xsweet-core/#lists).
## Copyediting cleanups
`ucp-cleanup/ucp-text-macros.xsl`
This sheet contains a suite of text cleanups, built specifically for use by the [University of California Press](https://www.ucpress.edu/ "www.ucpress.edu"). It automates many copyediting improvements:
* Hyphens between numerals are converted to en dashes
* Two or more consecutive spaces are converted to a single space
* Any number of spaces before or after em dashes are removed
* Series of periods are converted to ellipses
* Two adjacent hyphens become an em dash
* En dashes surrounded on both sides by spaces are converted to an em dash
* Equal signs are normalized to be surrounded by one space on either side
* Spaces adjacent to tabs are removed
* Spaces at the beginning and end of paragraphs are removed
* Tabs at the end of paragraphs are removed
* Empty paragraphs are removed
* Single and double quotation marks (including backticks) are converted to directional quotation marks
* Hair spaces are inserted between single and double quotation marks
* Punctuation marks are coerced to match the formatting of the previous word; e.g. `<i>extraordinary</i>!` becomes `<i>extraordinary!</i>`. This rule applies to the following punctuation marks:
  * "
  * '
  * :
  * ;
  * ?
  * !
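A few of the text-level rules above can be illustrated with regular expressions. The Python sketch below covers only the dash, ellipsis, and space rules; the order in which the substitutions run and the reading of "series of periods" as three or more are assumptions, and the function name `cleanup` is hypothetical rather than taken from `ucp-text-macros.xsl`.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"
ELLIPSIS = "\u2026"


def cleanup(text: str) -> str:
    """Apply a small subset of the copyediting rules to one paragraph of text."""
    text = re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)         # hyphen between numerals
    text = text.replace("--", EM_DASH)                      # two adjacent hyphens
    text = text.replace(f" {EN_DASH} ", EM_DASH)            # spaced en dash
    text = re.sub(f" *{EM_DASH} *", EM_DASH, text)          # spaces around em dashes
    text = re.sub(r"\.{3,}", ELLIPSIS, text)                # series of periods
    text = re.sub(r" {2,}", " ", text)                      # runs of spaces
    return text.strip(" ")                                  # leading/trailing spaces
```

For example, `cleanup("pages 10-12  are missing -- sadly...")` yields the same sentence with an en dash in the page range, a single space before "are", a closed-up em dash, and an ellipsis character.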
`ucp-cleanup/ucp-mappings.xsl`
In this step, underlining and bolding are converted to italics, either as inline tags or as `style` CSS:
* `<b>`s and `<u>`s are replaced with `<i>`s
* `style="font-weight: bold"` and `style="text-decoration: underline"` become `style="font-style: italic"`
Short and sweet.
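For readers who think in terms of string substitution, the two mappings can be sketched in Python as below. This is a simplification under stated assumptions: it handles only attribute-free `<b>` and `<u>` tags and the exact property names shown above, whereas the real sheet works on the parsed document. The function name `map_to_italics` is hypothetical.

```python
import re


def map_to_italics(html: str) -> str:
    """Rewrite bold and underline markup as italics."""
    # <b>, </b>, <u>, </u> become <i> or </i>.
    html = re.sub(r"<(/?)[bu]>", r"<\1i>", html)
    # Bold and underline CSS declarations become italic declarations.
    html = re.sub(
        r"font-weight:\s*bold|text-decoration:\s*underline",
        "font-style: italic",
        html,
    )
    return html
```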
## Custom transformations (experimental)
The files in the `html-tweak` folder can be used to extend XSweet, by defining custom transformations to apply to the text. This can be done on a per-document basis, or to implement generic rules according to your use case.
Use is as follows:
1. Define the custom transformations to be applied in an `.xml` file
2. Run the `APPLY-html-tweaks.xsl` sheet, referencing the above transformations defined in your `xml` file. This:
(A) reads the user-defined transformations from your `.xml` file
(B) creates a new XSL sheet based on the `.xml` file that will implement the specified transformation (done with the `make-html-tweak-xslt.xsl` sheet)
(C) applies the created XSL sheet to the input file
Example use (the exact command will depend on how you are running your XSLT):
`XSLT my-source.html APPLY-html-tweaks.xsl config=my-html-tweaks.xml`
### Tweak definition syntax
The user-specified tweaks work by establishing matches between categories of HTML elements (most commonly but certainly not limited to `<p>`s or `<span>`s), as indicated by:
* CSS property or CSS property-value (on a `style` attribute), or
* Named classes (the `class` attribute)
The syntax to define HTML tweaks uses the following components:
* `where`: a wrapper for a rule
* `match`: conditions on an element for it to match
* `style`: a `style` property name or `property-name: value` combination
* `class`: a class value (name token)
### Example 1
Remove `Default` classes from HTML elements where they appear:
```html
<p class="Default">Here is default class paragraph</p>
```
becomes:
```html
<p>Here is default class paragraph</p>
```
HTML tweak rule:
```html
<where>
<match><class>Default</class></match>
<remove><class>Default</class></remove>
</where>
```
### Example 2
Remove a specific styling property wherever it's present:
```html
<p style="text-indent:1em; margin-bottom: 1em">Styling includes a property</p>
```
becomes:
```html
<p style="text-indent:1em">Styling includes a property</p>
```
HTML tweak rule:
```html
<where>
<match><style>margin-bottom</style></match>
<remove><style>margin-bottom</style></remove>
</where>
```
### Example 3
Remove a `style` property if it has a given value:
```html
<p style="font-family: Helvetica; font-size: 12pt">Remove a property if it has a specific value</p>
```
becomes:
```html
<p style="font-size: 12pt">Remove a property if it has a specific value</p>
```
HTML tweak rule:
```html
<where>
<match><style>font-family: Helvetica</style></match>
<remove><style>font-family</style></remove>
</where>
```
### Example 4
The following tweak rule will map a specific `class` and `style` to another `class` and `style`:
```html
<where>
<match>
<style>font-size: 18pt</style>
<class>FreeForm</class>
</match>
<remove>
<style>font-size</style>
<class>FreeForm</class>
</remove>
<add>
<class>FreeFormNew</class>
<style>color: red</style>
</add>
</where>
```
For further examples, see the demo files included in the repository:
* `html-tweak-map.xml` defines example transformation definitions
* `html-tweak-demo.xsl` is the resulting XSL sheet made by the `make-html-tweak-xslt.xsl`, which will effect the specified transformation. (This relies on the `html-tweak-lib.xsl` file as a dependency)
## Section inferrer (experimental)
This utility uses headings (`<h1-6>`) as markers and attempts to add `<section>`s to an HTML file. It is run as a single XSL sheet, `induce-sections/induce-sections.xsl`, which returns the document HTML file unchanged except for the addition of `<section>` tags.
* Sections are only added when higher-level headings wrap lower-level ones. Lower-level headings wrapping higher-level ones are not captured as `<section>`s
* Paragraphs and blocks preceding the first header appear without a section wrapper (before the first section)
* Paragraphs and all other elements travel with the immediately preceding header
* Files with no headings are unchanged
* Your document must be wrapped in a `<div class="docx-body">` for this sheet to work. It will be wrapped if it has been extracted by XSweet; otherwise you will have to add this element yourself
* If headings skip levels, a note will be added: `<!-- Headers out of regular order: h1, h2, h3, h1, h3-->`
Example:
```html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"></meta>
<title>sections</title>
</head>
<body>
<div class="docx-body">
<h1>h1</h1>
<p>h1 para</p>
<h2>h2</h2>
<p>h2 para</p>
<h3>h3</h3>
<p>h3 para</p>
<p>h3 para</p>
<h1>h1</h1>
<p>h1 para</p>
<h3>h3</h3>
<p>h3 para</p>
</div>
</body>
</html>
```
becomes
```html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>sections</title>
</head>
<body>
<div class="docx-body">
<!-- Headers out of regular order: h1, h2, h3, h1, h3-->
<section>
<h1>h1</h1>
<p>h1 para</p>
<section>
<h2>h2</h2>
<p>h2 para</p>
<section>
<h3>h3</h3>
<p>h3 para</p>
<p>h3 para</p>
</section>
</section>
</section>
<section>
<h1>h1</h1>
<p>h1 para</p>
<section>
<h3>h3</h3>
<p>h3 para</p>
</section>
</section>
</div>
</body>
</html>
```
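The nesting behavior in this example can be reproduced with a simple stack of open sections. The Python sketch below is an illustration of that idea, not a port of `induce-sections.xsl`: elements are modeled as hypothetical `(tag, text)` pairs, the `Section` class is invented for the sketch, and the out-of-order comment is not generated.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"h([1-6])$")


@dataclass
class Section:
    level: int                                   # heading level that opened the section
    children: list = field(default_factory=list)


def induce_sections(elements: list[tuple[str, str]]) -> list:
    """Nest a flat list of (tag, text) pairs into sections keyed on headings."""
    root = Section(level=0)                      # stands in for <div class="docx-body">
    stack = [root]
    for tag, text in elements:
        match = HEADING.match(tag)
        if match:
            level = int(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(level)
            stack[-1].children.append(section)
            stack.append(section)
        # Every element travels with the most recently opened section.
        stack[-1].children.append((tag, text))
    return root.children
```

Elements that precede the first heading are appended directly to the root, so they appear without a section wrapper, and an `h3` that follows an `h1` directly is nested inside the `h1` section, as in the second half of the example.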
`mark-sections.xsl` and `nest-sections.xsl` are deprecated; the `induce-sections.xsl` sheet encapsulates the functionality from both.