Add UCP cleanup ingest/output macros to Typescript
UCP runs a series of cleanup macros on book chapters, both when they are ingested before editing, and also again at output to prevent any of the same cleanups from being accidentally reintroduced. Since these cleanups are to be used at the beginning and the end of the editing process, and since different presses will have slightly different cleanups, these changes should all be housed in a single XSLT sheet.
Here are several cleanups to implement - there may be more to follow. I am checking this with Erich at UCP, so we will probably add to this list going forward:
- Hyphens between numerals should be converted to en dashes: "2-3" -> "2–3"
Double spaces should be converted to single spaces, anywhere they're found:
"...touches.  However, the..." -> "...touches. However, the..."
Spaces around em dashes should be removed (any number of consecutive spaces before or after an em dash)
- "that sentence —as I’ve done" -> "that sentence—as I’ve done"
- "that sentence — as I’ve done" -> "that sentence—as I’ve done"
Series of periods converted to ellipses
- "..." -> "…"
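To pin down the expected behaviour of these first four cleanups, here is a minimal Python sketch. It is only an illustration: the real implementation would live in the XSLT sheet, and the function name `basic_cleanup` is hypothetical. It assumes that "series of periods" means exactly three, as in the example above.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"
ELLIPSIS = "\u2026"


def basic_cleanup(text: str) -> str:
    """Apply the first batch of cleanups to a string (illustrative only)."""
    # Hyphen with a digit on both sides becomes an en dash: "2-3" -> "2–3".
    # Lookarounds keep the digits out of the match, so "1-2-3" is fully converted.
    text = re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)
    # Three periods become one ellipsis character (assumption: exactly three).
    text = text.replace("...", ELLIPSIS)
    # Remove any number of spaces before or after an em dash.
    text = re.sub(" *" + EM_DASH + " *", EM_DASH, text)
    # Collapse runs of two or more spaces into one.
    text = re.sub(r" {2,}", " ", text)
    return text


print(basic_cleanup("See pp. 2-3.  That sentence \u2014 as I've done..."))
```

One thing to double check: the numeral rule as written would also touch things like ISO dates ("2020-01-15"), so the pattern may need narrowing if those show up in chapters.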
I have gotten through a lot of the macro, but it is long. Here are some of the remaining cleanups; I will post the rest tomorrow morning. These are in no special order. @wendell, you can draw from these and start checking them off as you can. Let me know if any of these need clarification. There are plenty more rules left. The most complicated is probably smart quotes: the macro actually deletes all the "s and 's, inserts them again, and lets Word's auto-formatting determine their direction, which I don't think is an option for us. In any event, more to come!
- Two adjacent hyphens become an em dash: "--" -> "—"
- An en dash surrounded on both sides by spaces should be converted to an em dash: " – " -> " — "
- Equal signs should be surrounded on either side by one and only one space: " = "
- Replace runs of multiple consecutive spaces with just one space
- Replace runs of multiple consecutive tabs with just one tab. Update: we scrub these out anyway in preparation for Wax, so this is not necessary
- Spaces touching tabs should be removed
Remove spaces at the very beginning and ends of
- Remove tabs that end a paragraph (not ones that start)
- Delete empty paragraphs (I believe we are already doing this)
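Here is a rough Python sketch of this second batch, operating on a list of paragraph strings. It is an illustration of the intended results, not the XSLT itself. The name `second_pass` is hypothetical, "empty paragraph" is assumed to mean "nothing but whitespace", the multiple-tabs rule is skipped since it is marked unnecessary, and the leading/trailing-space rule is left out because its scope is not spelled out above.

```python
import re
from typing import List

EN_DASH = "\u2013"
EM_DASH = "\u2014"


def second_pass(paragraphs: List[str]) -> List[str]:
    """Apply the second batch of cleanups paragraph by paragraph (illustrative only)."""
    cleaned = []
    for p in paragraphs:
        # Two adjacent hyphens become an em dash.
        p = p.replace("--", EM_DASH)
        # An en dash with a space on both sides becomes a spaced em dash.
        p = p.replace(" " + EN_DASH + " ", " " + EM_DASH + " ")
        # Exactly one space on either side of an equal sign.
        p = re.sub(r" *= *", " = ", p)
        # Spaces touching a tab are removed.
        p = re.sub(r" *\t *", "\t", p)
        # Runs of spaces collapse to a single space.
        p = re.sub(r" {2,}", " ", p)
        # Tabs that end a paragraph are removed; leading tabs are kept.
        p = p.rstrip("\t")
        # Assumption: a paragraph with only whitespace left counts as empty.
        if p.strip():
            cleaned.append(p)
    return cleaned


print(second_pass(["x=1", "\tindented\t\t", "", "a--b"]))
```

The equal-sign pattern is deliberately naive: a doubled sign such as "==" would come out as " = = ", so that case would need its own handling if it ever occurs in a manuscript.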
All straight, non-directional single and double quotes should be converted into "smart" directional quotes, depending on context. Since the original macro uses Word's auto-formatting, we'll have to make the rules for determining which direction they should point.
Straight quotation marks:
- u0022: quotation mark
- u0027: apostrophe
Should all be replaced by one of the following:
- u2018: left single quotation mark
- u2019: right single quotation mark
- u201c: left double quotation mark
- u201d: right double quotation mark
Replacement rules from macro:
- ' -> right or left single quotation mark (u2018 or u2019)
- '' -> right or left double quotation mark (u201c or u201d)
- ` -> right or left single quotation mark (u2018 or u2019)
- `` -> right or left double quotation mark (u201c or u201d)
- em dash+right double quote (u2014+u201d) -> em dash+left double quote (u2014+u201c)
- left double quote+em dash (u201c+u2014) -> right double quote+em dash (u201d+u2014)
The following 3 search patterns should look for a straight single quote or a left single quote and replace it with a right single quote:
- " 'em" or " ‘em" (space+u0027+"em" or space+u2018+"em") -> " ’em" (space+u2019+"em")
- "'n'" or "‘n‘" (u0027+"n"+u0027 or u2018+"n"+u2018) -> "’n’" (u2019+"n"+u2019)
- " 'tis" or " ‘tis" (space+u0027+"tis" or space+u2018+"tis") -> " ’tis" (space+u2019+"tis")
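As a quick sanity check on the three contraction patterns, here is a small Python sketch (illustrative only, the real rules would be written in XSLT). The function name `fix_contractions` is hypothetical. One assumption goes beyond the macro: a word boundary is required after "em" and "tis" so that something like " 'emerald" is left alone.

```python
import re

LSQUO = "\u2018"  # left single quotation mark
RSQUO = "\u2019"  # right single quotation mark


def fix_contractions(text: str) -> str:
    """Force a right single quote in 'em, 'n' and 'tis (illustrative only)."""
    # space + straight-or-left quote + "em" -> space + right quote + "em"
    # Assumption: \b stops the rule from firing inside longer words.
    text = re.sub(" ['" + LSQUO + r"]em\b", " " + RSQUO + "em", text)
    # 'n' or left-quote n left-quote -> right-quote n right-quote
    text = re.sub("'n'|" + LSQUO + "n" + LSQUO, RSQUO + "n" + RSQUO, text)
    # space + straight-or-left quote + "tis" -> space + right quote + "tis"
    text = re.sub(" ['" + LSQUO + r"]tis\b", " " + RSQUO + "tis", text)
    return text


print(fix_contractions("Go get 'em, rock 'n' roll, and 'tis done."))
```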
Insert a hair space (u200a) between pairs of single/double quotes. Note that order of operations matters; this assumes that straight quotes and apostrophes have already been replaced with their directional counterparts. Update: tracking in #28 (moved)
- left single quote+left double quote (u2018+u201c)
- left double quote+left single quote (u201c+u2018)
- right single quote+right double quote (u2019+u201d)
- right double quote+right single quote (u201d+u2019)
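The four pairs above can be handled with plain string replacement. Here is a minimal Python sketch of that idea (illustrative only, with the hypothetical name `insert_hair_spaces`); it assumes the quotes are already directional when it runs.

```python
HAIR_SPACE = "\u200a"

# The four adjacent quote pairs listed above, as two-character strings.
QUOTE_PAIRS = [
    "\u2018\u201c",  # left single + left double
    "\u201c\u2018",  # left double + left single
    "\u2019\u201d",  # right single + right double
    "\u201d\u2019",  # right double + right single
]


def insert_hair_spaces(text: str) -> str:
    """Put a hair space between adjacent directional quotes (illustrative only)."""
    for pair in QUOTE_PAIRS:
        text = text.replace(pair, pair[0] + HAIR_SPACE + pair[1])
    return text


sample = "\u201c\u2018Hi,\u2019\u201d she said."
print(insert_hair_spaces(sample))
```

Running it a second time changes nothing, because the pairs are no longer adjacent once the hair space is in place. That matters here since the same cleanup is meant to run at both ingest and output.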
Here are my proposed rules for direction. They would have to be executed before all of the above rules from the macro:
First, replace all 4 directional quotation marks with their non-directional counterparts:
- u2018 and u2019 -> u0027
- u201c and u201d -> u0022
- also ` and `` to their respective u0027 and u0022
Then, assign a direction based on the neighboring character:
- apostrophe+alphabetical character (u0027+letter) -> left single quotation mark (u2018+letter)
- alphabetical character+apostrophe (letter+u0027) -> alphabetical character+right single quotation mark (letter+u2019)
- quotation mark+alphabetical character (u0022+letter) -> left double quotation mark+alphabetical character (u201c+letter)
- alphabetical character+quotation mark (letter+u0022) -> alphabetical character+right double quotation mark (letter+u201d)
In any case, these will probably need some refinement but double check me and let me know what you think!
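As a way of double checking the proposal, here is a Python sketch of the flatten-then-direct idea (illustrative only; the names `flatten_quotes` and `direct_quotes` are hypothetical). It deviates from the listed order in one respect, and that is my assumption rather than part of the proposal: the "letter before the quote" rules run first. Applied strictly in the listed order, the apostrophe in "don't" is followed by a letter and would become a left quote.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter: a word character that is not a digit or underscore


def flatten_quotes(text: str) -> str:
    """Replace directional quotes and backticks with straight quotes."""
    text = text.replace("``", '"')  # double backtick first, then single
    text = text.replace("`", "'")
    text = re.sub("[\u2018\u2019]", "'", text)
    text = re.sub("[\u201c\u201d]", '"', text)
    return text


def direct_quotes(text: str) -> str:
    """Assign a direction to straight quotes from the adjacent letter (illustrative only)."""
    text = flatten_quotes(text)
    # Assumption: closing rules first, so contractions like "don't" close correctly.
    text = re.sub("(?<=" + LETTER + ")'", "\u2019", text)  # letter + apostrophe
    text = re.sub("'(?=" + LETTER + ")", "\u2018", text)   # apostrophe + letter
    text = re.sub("(?<=" + LETTER + ')"', "\u201d", text)  # letter + quotation mark
    text = re.sub('"(?=' + LETTER + ")", "\u201c", text)   # quotation mark + letter
    return text


print(direct_quotes("I don't like the \"word\" here."))
```

A quote with no letter on either side stays straight under these rules. For example, the closing mark in `"Hello," she said.` follows a comma, so it is untouched; that is probably the first refinement needed.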
- Convert underlining to italics
- Convert bold to italics. Update: tracking in #29 (closed)
We currently convert literal <u> tags into <i>s in the “Editoria basic” step. But that can sometimes get scrubbed out in the “Editoria reduce” step. We should also catch underlining, italics, and bold when they are specified in the CSS style, which we’re not currently doing. Wax looks for an <em> tag for italics. So, the following should all be converted into text wrapped in <em>:
- <p style="font-weight: bold">
- <p style="font-style: italic">
- <p style="text-decoration: underline">
Once this is implemented, we should also update the “Push mappings” to reflect this.
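To make the style check concrete, here is a small Python sketch of the test that decides whether an inline style should be flattened to <em>. It is an illustration only: the actual check would be an XSLT match on the style attribute, and the names `parse_style` and `wants_em` are hypothetical. It only recognizes the three keyword values listed above, so a numeric weight such as "font-weight: 700" is not caught.

```python
from typing import Dict


def parse_style(style: str) -> Dict[str, str]:
    """Split an inline style attribute into lower-cased property/value pairs."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" in chunk:
            prop, value = chunk.split(":", 1)
            declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def wants_em(style: str) -> bool:
    """True if the style asks for bold, italic, or underline (illustrative only)."""
    d = parse_style(style)
    return (
        d.get("font-weight") == "bold"
        or d.get("font-style") == "italic"
        # text-decoration can carry several keywords, e.g. "underline dotted".
        or "underline" in d.get("text-decoration", "").split()
    )


for style in ["font-weight: bold", "font-style: italic; color: red", "color: red"]:
    print(style, "->", wants_em(style))
```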
Force punctuation to match formatting of preceding word. Tracking in #27 (moved)
Since we're porting into Wax, we don't need to worry about fonts/font size. The only thing I can think to catch is formatting (italics, bold, underline). And, since all of these should get flattened to <em>s, I think this could be as simple as ensuring that if the preceding word is <em>, the trailing punctuation is as well. These are the punctuation marks that this rule should apply to:
Rules already implemented
The following cleanups don't require any additional coding, since XSweet is handling these as it should already:
- Remove page breaks and section breaks
  - Page breaks are extracted as <br class="br">, and the pipeline replaces these by breaking paragraphs on
  - Section breaks are dropped, since we’re not explicitly catching them
- Remove any comments: already happens, since we don’t handle them
- Delete headers and footers: we’re already dropping these
- Remove soft hyphens: these do not come through into the html.