Early aggregator to call extractor over a sequence of files
A simple file manifest source (xsweet-manifest.xml) could be used to drive the extractor and produce an assembly of HTML section elements within a single file, ready for subsequent pipelining. (This is a job that INK may ultimately be doing, but we probably need the functionality sooner; Saxon can do it, provided it can read the source data.) A single merged result for the book will also help with mapping, since it gives us a comprehensive view of all the sources together. It may help with scaling too (or not): total throughput time should be lower, since there is one pipeline for the whole book rather than one per chapter, and therefore less overhead.
Something really simple could serve to start, such as:
<source-files>
  <dir name="a1 Urban_tit"/>
  <dir name="a2 Urban_toc"/>
  <dir name="a3 Urban_pref"/>
  <dir name="a4 Urban_Illus"/>
  <dir name="b00 Urban_Intro"/>
  <dir name="b01 Urban"/>
  <dir name="b02 Urban"/>
  <dir name="b03 Urban"/>
  <dir name="b04 Urban"/>
  <dir name="b05 Urban"/>
  <dir name="b06 Urban"/>
  <dir name="b07 Urban_Concl"/>
  <dir name="z Urban_bib"/>
</source-files>
The idea is that a word/document.xml will be found inside each named directory, and its extracted results will be merged as a section element into a single HTML document.
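As a sketch of how such a driver might look in Saxon, assuming each directory's extracted result has already been written out as a standalone HTML file (the filename extracted.html here is a placeholder, not an agreed convention), a stylesheet run with the manifest as its source document could pull each result in and wrap it in a section:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:html="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="html">

  <xsl:output method="xhtml" indent="yes"/>

  <!-- Invocation sketch (paths hypothetical):
       java -jar saxon.jar -s:xsweet-manifest.xml -xsl:assemble.xsl -o:book.html -->
  <xsl:template match="/source-files">
    <html>
      <body>
        <xsl:for-each select="dir">
          <section>
            <!-- Resolve each directory relative to the manifest's own location;
                 encode-for-uri() handles the spaces in names like "a1 Urban_tit".
                 'extracted.html' stands in for whatever the extraction step emits. -->
            <xsl:copy-of select="doc(resolve-uri(
                concat(encode-for-uri(@name), '/extracted.html'),
                base-uri(/)))/html:html/html:body/node()"/>
          </section>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

This assumes extraction has already run per directory; alternatively the for-each could load each word/document.xml directly and apply the extraction templates in-process, if they are imported into the driver.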
Question: will this work in testing? What can we assume about the extraction process and the availability of inputs (to Saxon) at runtime?