Ornament detector
Authors often include ornaments to create divisions within chapters in their Word files. These are things like:
chapter content chapter content chapter content.
* * *
Back to more chapter content
Authors can and do implement these in a few different ways:
- Any number of text dividers: ***, *****, - - -, etc.
- Using a horizontal rule in Word
It would be good as an enhancement step (not extraction) to be able to port these into Wax, so I propose we implement the following:
- Add an optional enhancement step to HTMLevator that can recognize a range of ornaments and convert them into <hr>s
- Then, add a step into Editoria Typescript to convert <hr>s into ornaments for Wax. There's a ticket in for implementing this in Wax (wax/wax#178 (closed)), so we'll need to wait for this to be implemented to have a target format for ornaments. But it should be a straightforward mapping.
But we can start on the first part: ornament recognition.
I think there are 2 parts to this that would get us most of the way there:
1. Recognizing text ornaments
I think the rule for this is pretty simple: any paragraph that contains ONLY any combination of
- asterisks
- spaces
- hyphens
- en dashes
- em dashes
- tabs
is an ornament. The paragraph and its content should be clobbered and replaced with an <hr>.
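The rule above can be sketched in a few lines. This is only an illustration in Python, not HTMLevator code: the function names `is_text_ornament` and `replace_ornaments` are hypothetical, and it works on plain paragraph strings rather than real markup. One assumption goes beyond the rule as written: a paragraph made up only of spaces and tabs is treated as empty rather than as an ornament, since otherwise every blank paragraph would become an <hr>.

```python
import html
import re

# Characters allowed in a text ornament, per the list above:
# asterisk, hyphen, space, tab, en dash (U+2013), em dash (U+2014).
ORNAMENT_RE = re.compile(r"[*\- \t\u2013\u2014]+")


def is_text_ornament(paragraph_text):
    """Return True if the paragraph contains only ornament characters."""
    # Assumption (not in the ticket): whitespace-only paragraphs are
    # blank lines, not ornaments, so they are rejected here.
    if not paragraph_text.strip(" \t"):
        return False
    return ORNAMENT_RE.fullmatch(paragraph_text) is not None


def replace_ornaments(paragraphs):
    """Turn ornament paragraphs into <hr>; wrap everything else in <p>."""
    return [
        "<hr>" if is_text_ornament(p) else "<p>" + html.escape(p) + "</p>"
        for p in paragraphs
    ]


print(replace_ornaments(["chapter content.", "* * *", "Back to more chapter content"]))
# prints ['<p>chapter content.</p>', '<hr>', '<p>Back to more chapter content</p>']
```

If blank paragraphs should count as ornaments after all, dropping the whitespace check restores the rule exactly as stated.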
2. Convert horizontal rules to <hr>s
In Word, on a new line, typing 3 or more hyphens in a row then hitting enter creates a horizontal rule. Under the hood, it's achieved by applying a bottom border to the previous paragraph, like so:
How it looks in Word:
Content
content
The OOXML:
<w:p w14:paraId="4F67C0DD" w14:textId="77777777" w:rsidR="00B82E58" w:rsidRDefault="00C8440C">
  <w:pPr>
    <w:pBdr>
      <w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/>
    </w:pBdr>
  </w:pPr>
  <w:r>
    <w:t>Content</w:t>
  </w:r>
</w:p>
<w:p w14:paraId="5A5EC17D" w14:textId="77777777" w:rsidR="00C8440C" w:rsidRDefault="00C8440C">
  <w:r>
    <w:t>content</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
</w:p>
So, HTMLevator would need to recognize this bottom border, and add an <hr> after the end of that paragraph.
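As a rough illustration of that detection rule, here is a Python sketch that reads the OOXML directly. This is an assumption made for the sake of a self-contained example: HTMLevator works on already-extracted HTML, where the border would show up in whatever form the extraction step preserves it. The helper names `has_bottom_border` and `paragraphs_to_html` are hypothetical, and the sample omits the `w14:` attributes to keep the namespace setup short.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def has_bottom_border(paragraph):
    """True if the paragraph has a visible w:pBdr/w:bottom border."""
    bottom = paragraph.find("w:pPr/w:pBdr/w:bottom", NS)
    if bottom is None:
        return False
    # "nil" and "none" are border values that mean "no border".
    return bottom.get("{" + W + "}val") not in (None, "nil", "none")


def paragraphs_to_html(body):
    """Emit <p> per paragraph, plus <hr> after bottom-bordered ones."""
    out = []
    for p in body.findall("w:p", NS):
        text = "".join(t.text or "" for t in p.iter("{" + W + "}t"))
        out.append("<p>" + text + "</p>")
        if has_bottom_border(p):
            out.append("<hr>")
    return out


SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:pBdr>
      <w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/>
    </w:pBdr></w:pPr>
    <w:r><w:t>Content</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>content</w:t></w:r></w:p>
</w:body>"""

print(paragraphs_to_html(ET.fromstring(SAMPLE)))
# prints ['<p>Content</p>', '<hr>', '<p>content</p>']
```

The sketch treats any visible bottom border as a rule. A paragraph that also has top, left, and right borders is more likely a boxed paragraph than an ornament, so a real implementation would probably want to check for a bottom border on its own.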
What do you think?