Chapter title promotion

Let’s see if we can identify the titles in most book chapters. If we can implement good “title promotion” logic, we could automatically rename Editoria Book Builder components (chapters) to reflect our best guess at the chapter’s title. For example, if someone uploads a chapter into one of the “Untitled” components in the screenshot below, Editoria would extract the chapter title and put it in place of “Untitled”.

That would make the overall book assembly experience much nicer.

Here are a few considerations:

We want to extract the title of the chapter, but not the chapter number (“Chapter One”, “Chapter 3: ”, etc.). In the screenshot above, the numbers 4 and 5 are added automatically by Editoria: chapter components can be reordered by clicking and dragging them, and they are renumbered automatically. So if we see “Chapter Two: How Cars Work”, we’d want the extracted title to be simply “How Cars Work”.
Unlike header promotion, we don’t actually want to overwrite anything in the HTML. This is really more of a metadata extraction: we’d want to pass the extracted chapter title to Editoria. What’s the best way to do that without actually changing any of the existing HTML elements?

Here’s my proposal for a starting point. I’m open to suggestions and different approaches, and these rules will probably need to change a bit no matter what. I think this step would happen near the end of the pipeline, probably as the last step before the clean HTML. Very interested to hear your thoughts!

1. Normal numbered chapters

Step one: search for elements that start with “chapter” (case insensitive), plus a number, either in numerals (“12”) or English (“seven”). To start, this is a required cue that a chapter title is in the vicinity; if no element starts with “chapter,” then no chapter title is returned.

From here, there are a few scenarios:

1.1: Chapter number and chapter title appear in same element, separated by a delimiter

If the starting “chapter + number” pattern is followed by

a colon, with space(s) after it, then more text, or
space(s) + [en dash/em dash/any dash] + space(s) + more text

then call the title everything after the delimiter. E.g.

“Chapter One: How Cars Work” - extracted title is “How Cars Work”
“Chapter 2 - Love and Other Drugs” - extracted title is “Love and Other Drugs”
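
As a rough illustration of rule 1.1, here is a minimal Python sketch. It is only a way to test the rule on sample strings, not a proposal for how XSweet should implement it. The names NUMBER_WORDS, CHAPTER_WITH_TITLE and title_from_same_element are made up for this example, the list of spelled-out numbers is deliberately short, and treating U+2010 through U+2015 as “any dash” is an assumption.

```python
import re

# Hypothetical, deliberately short list of spelled-out numbers; extend as needed.
NUMBER_WORDS = (
    "one|two|three|four|five|six|seven|eight|nine|ten|"
    "eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|"
    "eighteen|nineteen|twenty"
)

# "chapter" + number, then a colon followed by space(s) OR a dash surrounded
# by spaces, then the title text. The dash class covers the ASCII hyphen and
# U+2010..U+2015, which includes the en dash and the em dash (an assumption).
CHAPTER_WITH_TITLE = re.compile(
    r"^\s*chapter\s+(?:\d+|" + NUMBER_WORDS + r")"
    r"(?::\s+|\s+[-\u2010-\u2015]\s+)"
    r"(?P<title>\S.*?)\s*$",
    re.IGNORECASE,
)


def title_from_same_element(text):
    """Return the title that follows "chapter + number + delimiter", or None."""
    match = CHAPTER_WITH_TITLE.match(text)
    return match.group("title") if match else None


print(title_from_same_element("Chapter One: How Cars Work"))
print(title_from_same_element("Chapter 2 - Love and Other Drugs"))
print(title_from_same_element("Chapter One"))  # no delimiter, so no title here
```

Compound numbers such as “Twenty-One” are not handled by this sketch and would need a fuller number parser.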

1.2: Chapter number and chapter title are in different elements

Example:

<h1>Chapter One</h1>
<h2>A Walk Down Memory Lane</h2>

If the element ends after

“chapter + number”, or
“chapter + number + delimiter”, where the delimiter is a colon or some variation of “ - “

then look for the chapter title in the next element:

If the next element is a heading, the text of this element becomes the title. From the example, this rule extracts “A Walk Down Memory Lane”. In deciding what the “next element” is, XSweet should skip empty elements, breaks, etc.
If the next element is not a heading, stop looking for a title to extract - perhaps the chapter is unnamed. At this point, we give up. Editoria can decide what to do next.
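
To make rule 1.2 concrete, here is a small Python sketch of the “look at the next element” logic. It assumes the document has already been flattened into a list of (tag, text) pairs in document order, which is a simplification made up for this example; the real pipeline works on HTML. The names CHAPTER_ONLY, HEADINGS, SKIPPED and title_from_next_element are hypothetical, and the number-word list is intentionally short.

```python
import re

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED = {"br", "hr"}  # assumed set of "breaks, etc." to step over

# Hypothetical short list of spelled-out numbers.
NUMBER_WORDS = (
    "one|two|three|four|five|six|seven|eight|nine|ten|"
    "eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|"
    "eighteen|nineteen|twenty"
)

# An element that contains only "chapter + number", optionally followed by a
# trailing colon or dash and nothing else.
CHAPTER_ONLY = re.compile(
    r"^\s*chapter\s+(?:\d+|" + NUMBER_WORDS + r")\s*[:\-\u2010-\u2015]?\s*$",
    re.IGNORECASE,
)


def title_from_next_element(elements):
    """elements: list of (tag, text) pairs in document order."""
    for i, (_tag, text) in enumerate(elements):
        if not CHAPTER_ONLY.match(text):
            continue
        for next_tag, next_text in elements[i + 1:]:
            if next_tag in SKIPPED or not next_text.strip():
                continue  # skip breaks and empty elements
            if next_tag in HEADINGS:
                return next_text.strip()
            return None  # next real element is not a heading: give up
        return None  # nothing follows the chapter element
    return None  # no "chapter + number" element found


sample = [("h1", "Chapter One"), ("p", ""), ("h2", "A Walk Down Memory Lane")]
print(title_from_next_element(sample))
```

Returning None in the “give up” case leaves the decision about what to display to Editoria, as described above.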

2. Specific front/back matter parts to handle differently

2.1. Front and back matter parts that might have a title too

If the “chapter + number” search pattern doesn’t find anything, look for the following words at the beginning of an element (case insensitive):

Preface
Introduction
Conclusion

If an element that starts with one of the above words is followed by

a delimiter (as above) + more text
a heading as the next element
a delimiter AND also a heading afterwards

then concatenate the matched word (Preface, Introduction, or Conclusion) and the text or heading that follows it, with a “: ” in the middle.

Otherwise, just extract the title as “Preface”, “Introduction”, or “Conclusion”

Examples:
<h1>Conclusion: A Sad Tale</h1> - extracted title: “Conclusion: A Sad Tale”

<h1>Introduction</h1>
<h1>Before Our Story</h1>

gets extracted as “Introduction: Before Our Story”

and <h1>Introduction</h1> gets extracted as “Introduction”
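
The three examples above can be reproduced with a short Python sketch. As before, it assumes a made-up flattened list of (tag, text) pairs rather than real HTML, and the names KEYWORD, DELIM_TEXT, DELIM_ONLY and front_back_matter_title are hypothetical. Normalizing the matched word with capitalize() is also an assumption, not something the rules specify.

```python
import re

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# One of the three words at the start of an element, plus whatever follows.
KEYWORD = re.compile(
    r"^\s*(preface|introduction|conclusion)\b(.*)$",
    re.IGNORECASE | re.DOTALL,
)
# A colon followed by space(s), or a dash surrounded by spaces, then text.
DELIM_TEXT = re.compile(r"^(?::\s+|\s+[-\u2010-\u2015]\s+)(\S.*?)\s*$", re.DOTALL)
# Nothing at all, or only a trailing delimiter.
DELIM_ONLY = re.compile(r"^\s*[:\-\u2010-\u2015]?\s*$")


def front_back_matter_title(elements):
    """elements: list of (tag, text) pairs in document order."""
    for i, (_tag, text) in enumerate(elements):
        match = KEYWORD.match(text)
        if not match:
            continue
        word, rest = match.group(1).capitalize(), match.group(2)
        same_element = DELIM_TEXT.match(rest)
        if same_element:
            # Delimiter + more text in the same element.
            return f"{word}: {same_element.group(1)}"
        if DELIM_ONLY.match(rest):
            # Word alone (or word + delimiter): check the next non-empty element.
            for next_tag, next_text in elements[i + 1:]:
                if not next_text.strip():
                    continue
                if next_tag in HEADINGS:
                    return f"{word}: {next_text.strip()}"
                break
        return word  # otherwise, just the word itself
    return None


print(front_back_matter_title([("h1", "Conclusion: A Sad Tale")]))
print(front_back_matter_title([("h1", "Introduction"), ("h1", "Before Our Story")]))
print(front_back_matter_title([("h1", "Introduction")]))
```

One consequence of following the rules literally: a body paragraph that merely begins with one of these words would also be extracted as the bare word, so restricting the search to headings may be worth discussing.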

2.2. Front and back matter parts that don’t have a title of their own

Finally, if there are no elements that start with “chapter”, “preface”, “introduction”, or “conclusion”, then look for elements that start with:

Acknowledgements
References
Bibliography

If there is one, extract the entire element as the title (“Acknowledgements”, “References”, “Bibliography”). That’s it.
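
For completeness, the last fallback could look something like the following Python sketch. It again assumes a hypothetical flattened list of (tag, text) pairs, and the names FALLBACK_WORDS and fallback_title are made up. Only the three spellings listed above are matched, and the caller is assumed to have already tried the “chapter”, “preface”, “introduction”, and “conclusion” rules.

```python
# Words from section 2.2, lowercased for a case-insensitive prefix check.
FALLBACK_WORDS = ("acknowledgements", "references", "bibliography")


def fallback_title(elements):
    """elements: list of (tag, text) pairs in document order."""
    for _tag, text in elements:
        stripped = text.strip()
        # str.startswith accepts a tuple of prefixes.
        if stripped.lower().startswith(FALLBACK_WORDS):
            return stripped  # the entire element text becomes the title
    return None


print(fallback_title([("p", "Some text"), ("h1", "Bibliography")]))
```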