NCBI Word workflow integration proposal
After internal discussion with the Bookshelf / PMC developers, we propose to decouple internal NCBI processing from the CoKo database and to rely instead on file-based job submissions via FTP and on completion notifications. We will set up FTP access for the CoKo service account and create directories for Word conversion and PMC ingest.
In short, we propose the following workflow for Coko review:
Word conversion:
- CoKo deposits a Word package and a meta JSON file to
/ftp/convert/word/in
- That triggers the Task Manager (TM) session book_convert_word (to be developed by NCBI) which will run asynchronously for several minutes.
- TM deposits the output package in
/ftp/convert/word/out
and sends a success / failure notification JSON back to CoKo. The exact notification mechanism is TBD: it may be an HTTP POST to a CoKo CMS API endpoint, a JSON file uploaded to an AWS S3 bucket that triggers a CoKo AWS Lambda function to load the file from S3, or something else.
- CoKo picks up the converter output and loads it into the database.
Proposed file format for the Word conversion:
a) JSON meta file - assaygui.microplates.docx.9876543210.2020_05_15-09_25_45.json
{
"job_id": 9876543210, // Reference ID for the Word conversion job. It's generated by Coko and used by NCBI to manage the conversion queue and report back
// the status of the conversion.
"user_name": "jordandc", // Name of a user initiating a conversion
"domain": "assaygui", // Name of the book domain in PMCBook that the chapter being converted belongs to.
"package": "assaygui.microplates.docx.9876543210.2020_05_15-09_25_45.input.zip", // Name of the package with Word .doc(x) file, in domain.filename.ext.job_id.timestamp format
"target_server": "dev", // Valid values are 'dev', 'test', 'prod', aliases to which server(s) to send a conversion request
"timeout": 720, // Default is 720 sec, conversion jobs longer than that will be aborted
"citation_type": 0, // Citation style used by eXtyles, valid values are 0 - 4 (Harvard, numbered square, numbered parentheses, Harvard numbered square,
// Harvard numbered parentheses)
"notification_recipients": { // List of emails for NCBI Task Manager (not CoKo) to send notifications to
"success": ["bookshelf@ncbi.nlm.nih.gov","fritz@publisher.org"],
"failure": ["bookshelf@ncbi.nlm.nih.gov"]
}
}
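The naming convention and meta file above can be sketched in a few lines of Python. This is only an illustration of the proposed format, not an agreed implementation: the function name build_conversion_job and its parameters are hypothetical, and the recipient lists are left empty for the caller to fill in.

```python
import json
from datetime import datetime

VALID_SERVERS = {"dev", "test", "prod"}

def build_conversion_job(domain, filename, job_id, submitted_at, user_name,
                         target_server="dev", timeout=720, citation_type=0):
    """Return (meta_file_name, meta) for a Word conversion job.

    File names follow the proposed domain.filename.ext.job_id.timestamp
    pattern. The function itself is a hypothetical helper.
    """
    if target_server not in VALID_SERVERS:
        raise ValueError(f"unknown target_server: {target_server!r}")
    if citation_type not in range(5):  # valid values are 0 - 4
        raise ValueError(f"citation_type must be 0-4, got {citation_type!r}")

    stamp = submitted_at.strftime("%Y_%m_%d-%H_%M_%S")
    base = f"{domain}.{filename}.{job_id}.{stamp}"
    meta = {
        "job_id": job_id,
        "user_name": user_name,
        "domain": domain,
        "package": f"{base}.input.zip",
        "target_server": target_server,
        "timeout": timeout,
        "citation_type": citation_type,
        "notification_recipients": {"success": [], "failure": []},
    }
    return f"{base}.json", meta

meta_name, meta = build_conversion_job(
    "assaygui", "microplates.docx", 9876543210,
    datetime(2020, 5, 15, 9, 25, 45), "jordandc",
)
print(meta_name)
print(json.dumps(meta, indent=2))
```

Note that json.dumps writes plain JSON, so the // comments shown in the examples here are documentation only and would not appear in a deposited file.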
b) Successful response - assaygui.microplates.docx.9876543210.2020_05_15-09_25_45.success.json
{
"job_id": 9876543210, // Word conversion job reference ID
"status": 0, // Success = 0, Waiting = 1, Loading = 2, Converting = 2, Error = 3, SevereError = 4, Timeout = 5, Killing = 9, Killed = 10
"timestamp": "2020-11-07 15:14:59", // Completion time
"converted_files": "assaygui.microplates.docx.9876543210.2020_05_15-09_30_19.output.zip", // domain.filename.ext.job_id.timestamp. Contains converted BXML and image files
"notices": [ // An array of error and/or warning notices. Should be treated as a key/value pairs dictionary. "severity" and "message" are always
// present, all other attributes are optional.
{
"filename": "microplates.docx",
"severity": "warning", // Can be in either lower or upper case
"message": "eXtyles Warning: Content model for sec does not allow element p here in unnamed entity at line 539 char 3 of file:///Program Files (x86)/eXtyles/TEMP/TMP/RXPin.xml"
}
]
}
c) Failed response - assaygui.microplates.docx.9876543210.2020_05_15-09_25_45.failure.json
{
"job_id": 9876543210, // Word conversion job reference ID
"status": 3, // Success = 0, Waiting = 1, Loading = 2, Converting = 2, Error = 3, SevereError = 4, Timeout = 5, Killing = 9, Killed = 10
"timestamp": "2020-11-07 15:14:59", // Completion time
"converted_files": "assaygui.microplates.docx.9876543210.2020_05_15-09_30_19.output.zip", // domain.filename.ext.job_id.timestamp. Contains converted BXML and image files.
// May be returned even in case of an error, to assist with troubleshooting
"notices": [
{
"filename": "microplates.docx",
"severity": "ERROR",
"message": "Mismatched end tag: expected </td>, got </list-item> in unnamed entity at line 79 char 155 of file:///Users/msword/AppData/Local/Temp/EXTYLES/TMP/RXPin.xml"
},
{
"filename": "microplates.docx",
"severity": "ERROR",
"message": "eXtyles Error: Mismatched end tag: expected </p>, got </sec> in unnamed entity at line 1038 char 6 of file:///Program Files (x86)/eXtyles/TEMP/TMP/RXPin.xml"
},
{
"severity": "WARNING",
"message": "eXtyles Error: cannot resolve PubMed references"
}
]
}
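On the receiving side, the "notices" array needs some care: "severity" may arrive in either case, and "filename" is optional. A minimal sketch of how a consumer might summarize a response, assuming the response has already been parsed into a dictionary (the function name summarize_notices and the "(no file)" placeholder are hypothetical):

```python
from collections import Counter

def summarize_notices(response):
    """Count notices per severity and group messages by file name.

    Severity is lower-cased because it can be sent in either case.
    Only "severity" and "message" are assumed to be present; a missing
    "filename" is grouped under a placeholder key.
    """
    counts = Counter()
    by_file = {}
    for notice in response.get("notices", []):
        counts[notice["severity"].lower()] += 1
        name = notice.get("filename", "(no file)")
        by_file.setdefault(name, []).append(notice["message"])
    return counts, by_file

sample = {
    "job_id": 9876543210,
    "status": 3,
    "notices": [
        {"filename": "microplates.docx", "severity": "ERROR", "message": "Mismatched end tag"},
        {"filename": "microplates.docx", "severity": "error", "message": "Mismatched end tag"},
        {"severity": "WARNING", "message": "cannot resolve PubMed references"},
    ],
}
counts, by_file = summarize_notices(sample)
print(dict(counts))
print(sorted(by_file))
```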
assaygui.microplates.docx.9876543210.2020_05_15-09_25_45.json
assaygui.microplates.docx.9876543210.2020_05_15-09_30_19.failure.json
assaygui.microplates.docx.9876543210.2020_05_15-09_30_19.success.json
assaygui.microplates.docx.9876543210.2020_05_15-09_30_19.output.zip
assaygui.microplates.docx.9876543210.2020_05_15-09_25_45.input.zip
Load to PMC
- CoKo deposits a chapter package and a meta JSON file to
/ftp/ingest/chapter/
- That triggers the TM session ingest_book_chapter (to be configured by NCBI from existing components) which will run asynchronously for several minutes.
- TM loads the chapter to PMC and sends a success / failure notification JSON back to CoKo using the same mechanism as for Word conversion.
Proposed file format for Load to PMC:
a) JSON meta file - assayguide.microplates.2.123457890.2020_05_15-09_30_19.json
{
"package_id": 1234567890, // Reference ID for the package (and thus the chapter version). It's generated by Coko and
// used by NCBI mainly for reporting back the status of the package ingest.
"domain": "assayguide", // Name of the book domain in PMCBook that the chapter being loaded belongs to.
"chapter": "microplates", // Chapter ID as found in the XML
"version": 2, // Version number of the chapter. NCBI needs domain, chapter, and version in order to build the ingest directory.
"xml_file": "microplates.xml", // Name of the chapter XML file, in case there are supplementary XML files.
"package": "assayguide.microplates.2.123457890.2020_05_15-09_30_19.zip", // Name of the package data file. The package includes an XML <book-part-wrapper>
// (converter output from the eXtyles conversion), images and supplementary files
"target_database": "prod", // Values are "prod", "preview", or "dev". These are aliases for which database to
// load to (prod=PMCBook, preview=PMCBookTest, dev=PMCA3Book)
"release": true, // Boolean flag for whether to release the chapter version or not
"notification_recipients": { // List of emails for NCBI Task Manager (not CoKo) to send notifications to
"success": ["bookshelf@ncbi.nlm.nih.gov","fritz@publisher.org"],
"failure": ["bookshelf@ncbi.nlm.nih.gov"]
},
"custom1": "value", // Placeholder for other processing parameters that may become necessary
"custom2": "value",
"customN": "value"
}
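Before depositing a chapter package, CoKo could run a quick sanity check on the meta file. The sketch below is an assumption-laden illustration: the proposal does not say which keys are mandatory, so the REQUIRED_KEYS tuple and the function name check_ingest_meta are hypothetical. Only the database aliases are taken from the description above.

```python
# Aliases as described for "target_database".
TARGET_DATABASES = {"prod": "PMCBook", "preview": "PMCBookTest", "dev": "PMCA3Book"}

# Assumption: these keys are treated as mandatory for illustration only.
REQUIRED_KEYS = ("package_id", "domain", "chapter", "version",
                 "xml_file", "package", "target_database", "release")

def check_ingest_meta(meta):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in meta]
    if "target_database" in meta and meta["target_database"] not in TARGET_DATABASES:
        problems.append(f"unknown target_database: {meta['target_database']!r}")
    if "release" in meta and not isinstance(meta["release"], bool):
        problems.append("release must be a boolean")
    return problems

good = {
    "package_id": 1234567890, "domain": "assayguide", "chapter": "microplates",
    "version": 2, "xml_file": "microplates.xml", "package": "package.zip",
    "target_database": "prod", "release": True,
}
print(check_ingest_meta(good))
```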
b) Successful response - assayguide.microplates.2.123457890.2020_05_15-09_30_19.success.json
{
"package_id": 1234567890, // Loading job reference ID
"status": 0, // Success = 0, any other value is a failure
"timestamp": "2020-11-07 15:14:59", // Completion time
"url": "https://www.ncbi.nlm.nih.gov/books/NBK123.2", // URL to view the loaded chapter
"notices": [
{
"filename": "microplates.docx",
"severity": "ERROR",
"message": "Mismatched end tag: expected </td>, got </list-item> in unnamed entity at line 79 char 155 of file:///Users/msword/AppData/Local/Temp/EXTYLES/TMP/RXPin.xml"
},
{
"filename": "microplates.docx",
"severity": "ERROR",
"message": "eXtyles Error: Mismatched end tag: expected </p>, got </sec> in unnamed entity at line 1038 char 6 of file:///Program Files (x86)/eXtyles/TEMP/TMP/RXPin.xml"
},
{
"severity": "WARNING",
"message": "eXtyles Error: cannot resolve PubMed references"
}
]
}
c) Failed response - assayguide.microplates.2.123457890.2020_05_15-09_30_19.failure.json
{
"package_id": 1234567890, // Loading job reference ID
"status": 2, // Status code, non-zero means an error
"timestamp": "2020-11-07 15:14:59", // Completion time
"url": "http://ipmc-prod.be-md.ncbi.nlm.nih.gov:5701/internal/utils/tm/index.fcgi?s=monitor&sel=1&sessid=8904088", // Task Manager URL to view the session log with errors
"notices": [
{
"filename": "microplates.docx",
"severity": "ERROR",
"message": "Mismatched end tag: expected </td>, got </list-item> in unnamed entity at line 79 char 155 of file:///Users/msword/AppData/Local/Temp/EXTYLES/TMP/RXPin.xml"
},
{
"filename": "microplates.docx",
"severity": "ERROR",
"message": "eXtyles Error: Mismatched end tag: expected </p>, got </sec> in unnamed entity at line 1038 char 6 of file:///Program Files (x86)/eXtyles/TEMP/TMP/RXPin.xml"
},
{
"severity": "WARNING",
"message": "eXtyles Error: cannot resolve PubMed references"
}
]
}
assayguide.microplates.2.123457890.2020_05_15-09_30_19.success.json
assayguide.microplates.2.123457890.2020_05_15-09_30_19.failure.json
assayguide.microplates.2.123457890.2020_05_15-09_30_19.zip
assayguide.microplates.2.123457890.2020_05_15-09_30_19.json
The JSON formats described above may be extended in the future to add more options / processing parameters as necessary.
Note about TM jobs and notifications
TM users can “abort” a session:
- A user starts the load or conversion, the session fails, and a failure notice is sent. The user decides that it was not a valid session anyway and “aborts” it, so another notification is sent. It is then no longer a failed session.
TM users can “restart” a session:
- A user starts the load or conversion, the session fails, and a failure notice is sent. A user fixes something and restarts the session, which now succeeds, so a success message is sent.
So, CoKo needs to be prepared to receive several notifications per package.
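One way for CoKo to cope with repeated notifications is to keep every notification it receives and treat the one with the latest "timestamp" as the current state of the job. The sketch below assumes notifications are dictionaries shaped like the responses above, keyed by "job_id" or "package_id", with timestamps in the "YYYY-MM-DD HH:MM:SS" form shown in the examples. The class name NotificationLog is hypothetical, and the shape of an "abort" notification is not specified in this proposal.

```python
from datetime import datetime

class NotificationLog:
    """Keeps all notifications per job; the newest timestamp wins."""

    def __init__(self):
        self._latest = {}   # job key -> (timestamp, notification)
        self._history = {}  # job key -> list of notifications, in arrival order

    def record(self, notification):
        key = notification.get("job_id", notification.get("package_id"))
        when = datetime.strptime(notification["timestamp"], "%Y-%m-%d %H:%M:%S")
        self._history.setdefault(key, []).append(notification)
        current = self._latest.get(key)
        # A notification that arrives late but is older does not override.
        if current is None or when >= current[0]:
            self._latest[key] = (when, notification)
        return self._latest[key][1]

    def current_status(self, key):
        entry = self._latest.get(key)
        return None if entry is None else entry[1]["status"]

    def history(self, key):
        return list(self._history.get(key, []))

log = NotificationLog()
log.record({"job_id": 9876543210, "status": 3, "timestamp": "2020-11-07 15:14:59"})
log.record({"job_id": 9876543210, "status": 0, "timestamp": "2020-11-07 15:40:00"})
print(log.current_status(9876543210))
```

Comparing timestamps rather than arrival order is a design choice made here so that a delayed failure notice cannot overwrite a later success. Whether that matches the eventual notification mechanism remains to be decided.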