Set up - Metadata use cases
Hi @DioneMentis -
I am using this issue to document our Metadata use cases for review and discussion by all necessary parties so the technical details can be sorted out systematically.
WORD
Chapter
Currently, ALL chapter metadata (EXCEPT FUNDING INFORMATION) is handled via tables in the source Word documents and converted to book part metadata via our extyles2bxml script. This will not change in early phases of the BCMS. There has been some very preliminary discussion about curating this content and standardizing that curation (such as with our medical genetics variation program) for consistent integration with other NCBI databases, but that has not been fully scoped, nor have the requirements been clearly defined.
We currently have one Word book that requires a grant from GrantHub to be linked to it. The award contract must be applied to each chapter in the book. The contract for the Word conversion book remains the same for a period of years until it is renewed, so it could be automatically cloned from the book to each chapter until it is renewed, at which point the new contract is applied to new and updated chapters only.
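To make the cloning rule concrete, here is a rough Python sketch. All class and field names (`Book`, `Chapter`, `award_id`, `award_start`, `sync_awards`) are hypothetical and are not part of the BCMS or GrantHub; the sketch assumes a renewal is recorded by updating the award and its start date on the book.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Chapter:
    chapter_id: str
    last_updated: date
    award_id: Optional[str] = None  # None means no award has been applied yet


@dataclass
class Book:
    book_id: str
    award_id: str      # contract currently attached to the book (hypothetical field)
    award_start: date  # date the current contract took effect (hypothetical field)
    chapters: List[Chapter] = field(default_factory=list)


def sync_awards(book: Book) -> None:
    """Clone the book's award to chapters that are new or updated since renewal.

    Chapters that already carry an award and have not been touched since the
    current contract started keep the award they were published under.
    """
    for chapter in book.chapters:
        is_new = chapter.award_id is None
        touched_since_renewal = chapter.last_updated >= book.award_start
        if is_new or touched_since_renewal:
            chapter.award_id = book.award_id
```

Under these assumptions, running `sync_awards` after a renewal leaves untouched chapters on the old contract and moves only new or updated chapters to the new one.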
Book and Collection
ALL book and collection metadata (including cover and funding) must be entered manually. In Phase I, the UI needs to support the metadata fields required for indexing and scientific discovery in PubMed; any additional fields will need to be applied through an XML or Excel file upload or by editing the converted BXML.
All book and collection metadata fields (except cover and funding) must be written into each book-part-wrapper whenever that metadata is modified in any way.
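A minimal sketch of that write-through step, using Python's standard `xml.etree.ElementTree`. It assumes each chapter file is a `book-part-wrapper` whose `book-meta` child sits after any `collection-meta` elements; the function names and the in-memory dictionary of files are illustrative only, and cover and funding are deliberately left out since they go to the JSON instead.

```python
import copy
import xml.etree.ElementTree as ET
from typing import Dict


def write_book_meta(wrapper_xml: str, book_meta: ET.Element) -> str:
    """Return wrapper_xml with its <book-meta> replaced by a copy of book_meta."""
    wrapper = ET.fromstring(wrapper_xml)
    new_meta = copy.deepcopy(book_meta)
    old_meta = wrapper.find("book-meta")
    if old_meta is not None:
        # Keep the same position and trailing whitespace as the old element.
        position = list(wrapper).index(old_meta)
        new_meta.tail = old_meta.tail
        wrapper.remove(old_meta)
    else:
        # Assumption: book-meta belongs right after any collection-meta elements.
        position = len(wrapper.findall("collection-meta"))
    wrapper.insert(position, new_meta)
    return ET.tostring(wrapper, encoding="unicode")


def propagate(book_meta_xml: str, wrappers: Dict[str, str]) -> Dict[str, str]:
    """Apply the updated book metadata to every wrapper (file name -> XML text)."""
    book_meta = ET.fromstring(book_meta_xml)
    return {name: write_book_meta(xml, book_meta) for name, xml in wrappers.items()}
```

Calling `propagate` whenever book metadata is saved would keep every wrapper in step with the book record; whether that happens synchronously or as a background job is an open design question.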
Funding and cover need to be written in the JSON per provided technical specifications any time they are added or modified.
PDF
Currently, Bookshelf staff manage what they call "meta file templates" by ORGANIZATION AND / OR COLLECTION. These templates hold shared fields across ALL PDF2XML converted files within that ORG / COLLECTION, which may include the following (hopefully not missing anything significant; Susie / Diana please comment if so):
- publisher name
- publisher location
- ncbi specific metadata (ncbi collection information and source type custom meta)
- ncbi domain / pmcid
- permissions not in the PDFs per agreements (copyright / license information)
- ncbi custom notes for funded information
- internal ncbi processing instructions (TBD - are these metadata or settings???)
There are some unique use cases, such as order of investigators, published series title, or monograph abstracts that are actually blurbs from publisher book catalogs. We will work with the PDF2XML taggers to take on those edge cases themselves IF they can, OR with publishers to ensure that information is provided in the source files (even if that means their own additional metadata XML / EXCEL file to be forwarded to the taggers).
In these cases those metadata templates are provided to the PDF2XML taggers to add into the converted files at time of conversion. If anything needs to be changed AFTER conversion, staff download the converted BXML and edit it.
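As a rough illustration of how an organization-level and a collection-level "meta file template" might be combined before being handed to the taggers, here is a small Python sketch. The field names and the rule that collection values override organization values are assumptions for discussion, not current behavior.

```python
from typing import Dict, Optional

Template = Dict[str, Optional[str]]


def resolve_template(org_template: Template,
                     collection_template: Optional[Template] = None) -> Template:
    """Merge shared fields; a collection value wins over the org value.

    Fields left empty (None) at the collection level fall back to the org level.
    """
    merged = dict(org_template)
    if collection_template:
        for name, value in collection_template.items():
            if value is not None:
                merged[name] = value
    return merged
```

For example, a collection that only sets its own NCBI domain would still inherit publisher name and location from its organization.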
Similarly, Bookshelf staff currently manage settings for whether a cover template from a publisher or collection will be used, OR a cover supplied by the publisher for the book will be used, OR the PDF2XML taggers will create the cover from a cover page in the PDF.
For chapter-processed books there are two cases on top of the above:
- database-like books, in which the book metadata is handled like WORD - all its book metadata MUST be entered manually. TBD - would that book metadata be provided to the PDF2XML taggers to add to the book-part-wrappers consistent with "meta file" template metadata? That seems most consistent / logical from their perspective.
- for funded chapters in a funded collection, the investigator completes a form with all metadata fields required for PubMed indexing. If there is a discrepancy between what they enter and what is provided in a source PDF or manuscript, the PDF2XML taggers raise a query, ideally for the investigator to resolve. With that form, the investigator selects their grant from an API (to be GrantHub) for approval by the funder for inclusion.
Note that funded books are currently handled just like funded chapters, EXCEPT that chapter metadata is not required.
IF metadata is managed at set up and PDF2XML conversion, then the metadata fields for that BOOK / CHAPTER must be locked after PDF2XML tagging and any edits handled only by editing the BXML to avoid versioning issues.
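The locking behavior could look something like the sketch below. `MetadataRecord` and `LockedMetadataError` are hypothetical names; the only point is that once tagging is complete, UI-style field edits are refused and the BXML becomes the single place to make changes.

```python
from typing import Dict, Optional


class LockedMetadataError(Exception):
    """Raised when a field edit is attempted after PDF2XML tagging."""


class MetadataRecord:
    def __init__(self, fields: Dict[str, str]) -> None:
        self._fields = dict(fields)
        self.locked = False

    def lock(self) -> None:
        # Called once PDF2XML tagging is complete.
        self.locked = True

    def set_field(self, name: str, value: str) -> None:
        if self.locked:
            raise LockedMetadataError(
                f"'{name}' is locked after tagging; edit the BXML instead"
            )
        self._fields[name] = value

    def get_field(self, name: str) -> Optional[str]:
        return self._fields.get(name)
```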
The only edge case I'm not sure about: if we edit the PUBLISHER / COLLECTION metadata templates, do those edits only get inherited by FUTURE content created within those containers? This seems cleanest to me, and most valid, except based on experience, I know sometimes mistakes happen and we need to "bulk download" content, edit it by script, and reupload it to fix things across all books in a collection / publisher or chapters in a book, etc. If that needs to happen, can we reuse any migration script @deniskar creates?
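One way to picture the "future content only" option alongside an explicit bulk fix is the Python sketch below. `Container`, `create_item`, `update_template`, and `bulk_fix` are made-up names, and this is not a description of any existing migration script: the idea is simply that each item takes a snapshot of the template at creation, so later template edits never reach existing items unless someone deliberately runs a correction.

```python
import copy
from typing import Dict, List


class Container:
    """A publisher or collection holding a template and the items created in it."""

    def __init__(self, template: Dict[str, str]) -> None:
        self.template = dict(template)
        self.items: List[dict] = []

    def create_item(self, item_id: str) -> dict:
        # Snapshot the template so later template edits do not affect this item.
        item = {"id": item_id, "metadata": copy.deepcopy(self.template)}
        self.items.append(item)
        return item

    def update_template(self, changes: Dict[str, str]) -> None:
        # Affects only items created after this call.
        self.template.update(changes)


def bulk_fix(container: Container, changes: Dict[str, str]) -> int:
    """Explicit, opt-in correction applied to every existing item."""
    for item in container.items:
        item["metadata"].update(changes)
    return len(container.items)
```

Keeping the bulk correction as a separate, deliberate step would preserve the clean default while still covering the cases where a mistake has to be fixed across a whole collection.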
XML
Currently, as in the PDF cases, Bookshelf staff manage what they call "meta file templates" by ORGANIZATION AND / OR COLLECTION. These templates hold shared fields across ALL XML submissions within that ORG / COLLECTION, which may include the following (hopefully not missing anything significant; Susie / Diana please comment if so):
- publisher name
- publisher location
- ncbi specific metadata (ncbi collection information and source type custom meta)
- ncbi domain / pmcid
- FOR LEGACY CONTENT, permissions not in the PDFs per agreements (copyright / license information)
- ncbi custom notes for funded information
- internal ncbi processing instructions (TBD - are these metadata or settings???)
There are some unique use cases in which Bookshelf staff will add additional XML fields, particularly abstracts not contained in the source XML published files, but, as with PDF, we should work with providers to send this information in their XML. The ONLY exception is funding information for funder collections: funders have difficulty providing this at time of submission since it is not provided by their authors or editors during authoring. We will need a way for funding to be selected via GrantHub for XML funded chapters and books (in Phase 1 by Bookshelf staff or Org Admin) and for those selections to be written to the JSON files per technical specifications.
XML submitters are required to provide either a COVER for their book OR to select a COVER for an organization / collection to be inherited by every book in that Org / Collection.
For chapter-processed books there are two possibilities of which I'm aware (after the template metadata is inherited):
- Publisher provides book and their own internal series information in their book-part-wrappers
- Bookshelf staff creates the book metadata to be inherited by every book-part-wrapper in that book component. (Here too, funding and covers are handled differently, as described above.) TBD: How do we process both cases consistently? If Bookshelf staff creates book metadata like Word, does BCMS write it into every book-part-wrapper?
This edge case from PDF content also applies for XML:
If we edit the PUBLISHER / COLLECTION metadata templates, do those edits only get inherited by FUTURE content created within those containers? This seems cleanest to me, and most valid, except based on experience, I know sometimes mistakes happen and we need to "bulk download" content, edit it by script, and reupload it to fix things across all books in a collection / publisher or chapters in a book, etc. If that needs to happen, can we reuse any migration script ...
--
In terms of display of metadata in the UI, Bookshelf staff are currently able to run a "metadata check" in our CMS to display key metadata fields for indexing in PubMed and Bookshelf and to ensure those fields meet Bookshelf style requirements. I think in Phase 1, at minimum, the UI should display those essential fields for quick QA without having to download files. We can look into adding further automated tagging-quality checks through Schematron written by @jordandc.
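For discussion, a minimal "metadata check" could be sketched in Python as below. The three fields and their element paths are illustrative assumptions only; they are not the actual list of key fields or the Bookshelf style rules, which would need to come from staff.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Illustrative subset only: label -> path within a book-part-wrapper.
KEY_FIELDS = {
    "book title": "book-meta/book-title-group/book-title",
    "publisher name": "book-meta/publisher/publisher-name",
    "publisher location": "book-meta/publisher/publisher-loc",
}


def metadata_check(wrapper_xml: str) -> Dict[str, Optional[str]]:
    """Return each key field's text, or None when it is missing or empty."""
    root = ET.fromstring(wrapper_xml)
    report: Dict[str, Optional[str]] = {}
    for label, path in KEY_FIELDS.items():
        element = root.find(path)
        # itertext() also collects text inside inline markup such as <italic>.
        text = "".join(element.itertext()).strip() if element is not None else ""
        report[label] = text or None
    return report
```

A UI panel could render this report directly, flagging any field whose value is `None` so staff can spot gaps without downloading the file.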
Please let me know if you have any questions or if I can try to make this clearer.