Dropped spaces due to long strings of repeated tags
The book by Green has 4 parts, each with an introductory section. Part 1 comes through rinsed really nicely and very clean, but Parts 2, 3, and 4 drop almost all the spaces.
For whatever reason, large portions of this book are being extracted with each word wrapped in its own tag. A big part of Part 1 gets extracted as long strings of <iCs> tags, with empty iCs tags for spaces:
<iCs>remained</iCs>
<iCs></iCs>
<iCs>the</iCs>
<iCs></iCs>
<iCs>majority</iCs>
<iCs></iCs>
<iCs>population</iCs>
<iCs></iCs>
<iCs>of</iCs>
<iCs></iCs>
<iCs>many</iCs>
These all get collapsed into a p tag with spaces preserved and the HTML looks great.
Parts 2, 3, and 4, though, have strings of spans instead:
<span style="font-size: 12pt">Safavids,</span>
<span style="font-size: 12pt"></span>
<span style="font-size: 12pt">and</span>
<span style="font-size: 12pt"></span>
<span style="font-size: 12pt">Uzbeks—seized</span>
<span style="font-size: 12pt"></span>
<span style="font-size: 12pt">control</span>
<span style="font-size: 12pt"></span>
When this happens, the content ends up in one long p tag with no spaces.
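To make the expected behavior concrete, here is a minimal sketch of the collapsing step as I would expect it to work for any tag name. It is not the pipeline's actual code: the function name `collapse_word_tags` and the regex are hypothetical, and it assumes that an element with no text stands for a single space, which is what seems to happen for the iCs case.

```python
import re

# Matches one per-word element such as <iCs>word</iCs> or
# <span style="...">word</span>; the closing tag must match the opening name.
WORD_TAG = re.compile(r"<(\w+)(?:\s[^>]*)?>(.*?)</\1>", re.DOTALL)


def collapse_word_tags(markup: str) -> str:
    """Join a run of per-word elements into one p tag (hypothetical helper)."""
    pieces = []
    for match in WORD_TAG.finditer(markup):
        text = match.group(2)
        # Assumption: an empty (or whitespace-only) element represents a space.
        pieces.append(text if text.strip() else " ")
    # Squeeze runs of spaces and trim the ends before wrapping in <p>.
    joined = re.sub(r" {2,}", " ", "".join(pieces)).strip()
    return "<p>" + joined + "</p>"


sample = (
    '<span style="font-size: 12pt">Safavids,</span>\n'
    '<span style="font-size: 12pt"></span>\n'
    '<span style="font-size: 12pt">and</span>\n'
)
print(collapse_word_tags(sample))  # prints <p>Safavids, and</p>
```

If the real collapsing logic only treats empty iCs elements as spaces, handling empty span elements the same way might be enough to fix Parts 2, 3, and 4.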
The introduction has a long string of <lang> tags, similar to the iCs tags above. They don’t cause dropped spaces.
This is related to #35 (closed) but may be caused by a different underlying issue.