Skip to content
Snippets Groups Projects
Code owners
Assign users and groups as approvers for specific file changes. Learn more.
conversion-pipeline.md 1.96 KiB

Experimental Pipeline

This pipeline is currently under development. It uses the CRF or computer vision model trained by ScienceBeam Gym.

What you need before you can proceed:

To use the CRF model together with the CV model, the CRF model will have to be trained with the CV predictions.

The following command will process files locally:

python -m sciencebeam_gym.convert.conversion_pipeline \
  --data-path=./data \
  --pdf-file-list=./data/file-list-validation.tsv \
  --crf-model=path/to/crf-model.pkl \
  --cv-model-export-dir=./my-model/export \
  --output-path=./data-results \
  --pages=1 \
  --limit=100

The following command would process the first 100 files in the cloud using Dataflow:

python -m sciencebeam_gym.convert.conversion_pipeline \
  --data-path=gs://my-bucket/data \
  --pdf-file-list=gs://my-bucket/data/file-list-validation.tsv \
  --crf-model=path/to/crf-model.pkl \
  --cv-model-export-dir=gs://mybucket/my-model/export \
  --output-path=gs://my-bucket/data-results \
  --pages=1 \
  --limit=100 \
  --cloud

You can also enable the post processing of the extracted authors and affiliations using Grobid by adding --use-grobid. In that case Grobid will be started automatically. To use an existing version also add --grobid-url=<api url> with the url to the Grobid API. If the --use-grobid option is used without a CRF or CV model, then it will use Grobid to translate the PDF to XML.

Note: using Grobid as part of this pipeline is considered deprecated and will likely be removed.

For a full list of parameters:

python -m sciencebeam.examples.conversion_pipeline --help