Timeouts when uploading converted file >10 MB
We need development work to support converted files over 10 MB in the chapter-processed and wholebook workflows.
Some details provided by NCBI about converted files:
- The average converted file is ~350 KB.
- The dia3ed, hhv, and pmh_iqwig books have converted files over 10 MB (see book_sizes2.csv), and those books have chapters over 10 MB (see chapter_sizes.csv).
Book | Chapter |
---|---|
gbd | A592.pxml |
niceng12 | appf.pxml |
healthus09 | trendtables.pxml |
healthus05 | trend-tables.pxml |
healthus11 | trendtables.pxml |
@John.kopanas 's suggestion, so this issue can be resolved within a reasonable time without making too many changes or replacing the main library we use (at least for the MVP), is the following:
The problem and solution
The problem is that the file we give to cheerio is too big for it to handle. The good part is that our codebase does not necessarily need the whole file in order to parse the needed information. It is organized so that, to reach the contributors tag, we first find the metadata tag, then the contributors tag that lives inside it, and then each individual author/editor. That means we can split the file into several parts and pass along only the information that is necessary for each case. For example, if we want to parse the affiliation, degrees, email, etc. from the contributors tag, all we have to give as input to the class in /ncbi/server/services/xmlBaseModel/contributor.js is that specific part of the file, not the whole file, which minimizes the size of the input we give to cheerio.
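As an illustration (this is not the actual contributor.js code, and the element names are made up), this is the kind of fragment-only parsing the approach relies on: cheerio only ever loads the small contributors chunk, never the full converted file.

```js
// Sketch only: element names and structure are illustrative, not the real schema.
const cheerio = require('cheerio');

// Stands in for just the <contributors> portion of a large converted file.
const contributorsFragment = `
  <contributors>
    <contributor>
      <name>Jane Doe</name>
      <degrees>PhD</degrees>
      <email>jane.doe@example.org</email>
    </contributor>
  </contributors>`;

// cheerio parses only this small fragment, not the whole 10 MB+ document.
const $ = cheerio.load(contributorsFragment, { xmlMode: true });

const contributors = $('contributor').toArray().map((el) => ({
  name: $(el).find('name').text(),
  degrees: $(el).find('degrees').text(),
  email: $(el).find('email').text(),
}));
```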
Of course, the remaining problem is the first level of parsing, where we need to load the whole file at the beginning. For example, for wholebooks we need to parse the body, book-meta, and named-book-part-body at the first level (see server/services/xmlBaseModel/book.js). To solve that, we can stream the file instead of loading it all at once, using a library like https://github.com/isaacs/sax-js. The current codebase helps a lot here, since it already marks the tag we want to match, so it is easy to select just a specific part of the file rather than the whole file.
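A minimal sketch of the streaming idea, assuming Node.js and the sax package; the helper name `extractFragment` and its details are illustrative, not existing code. The file is streamed, and only the subtree for the requested tag is rebuilt and kept in memory.

```js
const fs = require('fs');
const sax = require('sax');

// Streams the converted file and rebuilds just the subtree of `targetTag`,
// so only that fragment (not the whole 10 MB+ file) is held in memory.
function extractFragment(filePath, targetTag) {
  return new Promise((resolve, reject) => {
    const saxStream = sax.createStream(true); // strict mode keeps tag names as-is
    let depth = 0;     // > 0 while we are inside the target tag
    const chunks = []; // rebuilt markup for the fragment

    saxStream.on('opentag', (node) => {
      if (node.name === targetTag) depth += 1;
      if (depth > 0) {
        // Naive re-serialization of the tag; fine for a sketch, no escaping.
        const attrs = Object.entries(node.attributes)
          .map(([key, value]) => ` ${key}="${value}"`)
          .join('');
        chunks.push(`<${node.name}${attrs}>`);
      }
    });

    saxStream.on('text', (text) => {
      if (depth > 0) chunks.push(text);
    });

    saxStream.on('closetag', (name) => {
      if (depth > 0) chunks.push(`</${name}>`);
      if (name === targetTag) depth -= 1;
    });

    saxStream.on('error', reject);
    saxStream.on('end', () => resolve(chunks.join('')));

    fs.createReadStream(filePath).pipe(saxStream);
  });
}
```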
How to implement
Create a wrapper around the streaming library so it fits our needs, and call it in the constructor of server/services/xmlBaseModel/xmlBase.js. That file is an abstract base that all the tag models we want to parse extend. For example, we could pass the wrapper the location (tag) we want to find in the stream; once we find that part and save it to memory, we can pass that part of the file to cheerio to do the real parsing, as we have been doing up to now.
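Roughly how that could plug together. This is a sketch under assumptions: the class shape below is not the real xmlBase.js, and `extractFragment` is the hypothetical sax-based helper sketched above.

```js
const cheerio = require('cheerio');

class XmlBase {
  constructor(filePath, targetTag) {
    this.filePath = filePath;
    this.targetTag = targetTag; // e.g. 'book-meta', 'contributors'
  }

  async load() {
    // Stream only the subtree we care about out of the large converted file...
    const fragment = await extractFragment(this.filePath, this.targetTag);
    // ...then hand that small fragment to cheerio, as the existing parsing does.
    this.$ = cheerio.load(fragment, { xmlMode: true });
    return this.$;
  }
}
```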
Hi @lathrops1
We have recently encountered, for the first time in our testing, converted files around 10 MB in size. Files of this size are timing out when uploaded to the converted file section, due to limitations in the library responsible for parsing the file and extracting certain data from the XML.
We have some thoughts on how to extend our handling for larger converted files, but those need more investigation and work.
The first thing we want to do is get an idea from NCBI of the general maximum sizes of converted files. Can you please do a data call and provide that info?
QA Steps
- Create a PDF Wholebook (the process is the same for other workflows)
- Upload the following file as the converted file: bookjaspers2011.xml
- It will take a few seconds because it is a big file, but it should upload successfully without error popups.
These steps can be replicated with the other files from the list of those over 10 MB.