ScienceBeam Trainer for GROBID

The Trainer for GROBID is a thin wrapper and Docker container around the GROBID training commands. The container is not complete yet (it supports the header model only), but it is cloud-ready.

Prerequisites

Recommended

Using the Docker Container

Header Model Training with Default Dataset

Training with the default dataset isn't very useful unless you want to re-train the model, but it is a good test to see how long training takes.

Using Docker:

docker run --rm -it \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --use-default-dataset

Using Kubernetes:

kubectl run --rm --attach --restart=Never --generator=run-pod/v1 \
    --image=elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model -- \
    train-header-model.sh \
        --use-default-dataset
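
Since the main use of the default-dataset run is to see how long training takes, the command can be wrapped in a small timer. The following is a minimal sketch; `run_timed` is a hypothetical helper name and is not part of the trainer.

```shell
#!/bin/sh
# run_timed: hypothetical helper (not part of the trainer).
# Runs the given command, reports wall-clock seconds on stderr,
# and returns the command's own exit status.
run_timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s (exit status $status)" >&2
    return "$status"
}

# Example (requires Docker, so it is left commented out):
# run_timed docker run --rm \
#     elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
#     train-header-model.sh --use-default-dataset
```

The helper only has one-second resolution, which should be sufficient if training runs for minutes or longer.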

Header Model Training with your own dataset

Using a mounted volume:

docker run --rm -it \
    -v /data/mydataset:/data/mydataset \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --dataset /data/mydataset \
        --use-default-dataset

Instead of a mounted volume, you can also specify a cloud location that gsutil understands (assuming that the credentials are mounted too).

The --use-default-dataset flag is optional.

You may also add --cloud-models-path <cloud path> to copy the resulting model to cloud storage.
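
The cloud options above can be combined into a single command. The sketch below only prints the command so that it can be inspected before running. The bucket paths are hypothetical placeholders, and the credentials mount is an assumption: it presumes that gsutil in the image comes from the Google Cloud SDK and reads its configuration from /root/.config/gcloud, which the source does not confirm.

```shell
#!/bin/sh
# Hypothetical placeholder values: replace with your own bucket paths.
DATASET_PATH="${DATASET_PATH:-gs://my-bucket/mydataset}"
MODELS_PATH="${MODELS_PATH:-gs://my-bucket/models}"
# Assumption: local gcloud configuration directory holding the credentials.
GCLOUD_CONFIG_DIR="${GCLOUD_CONFIG_DIR:-${HOME:-/root}/.config/gcloud}"
IMAGE="elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4"

# Prints the docker command (one argument after another) without running it.
build_train_command() {
    printf '%s ' docker run --rm -it \
        -v "$GCLOUD_CONFIG_DIR:/root/.config/gcloud:ro" \
        "$IMAGE" \
        train-header-model.sh \
        --dataset "$DATASET_PATH" \
        --cloud-models-path "$MODELS_PATH"
    printf '\n'
}

build_train_command
```

Only --dataset and --cloud-models-path are taken from the text above; how the credentials actually need to be mounted depends on how the image is set up.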

Make Targets

Example End-to-End

make example-data-processing-end-to-end

Downloads an example PDF, converts it to training data and runs the training. The resulting model won't be of much use; it merely serves as an example.

Get Example Data

make get-example-data

Downloads an example PDF to the data Docker volume.

Generate GROBID Training Data

make generate-grobid-training-data

Converts the previously downloaded PDF from the data volume to GROBID training data. The TEI files will be stored in tei-raw in the dataset. Training on the raw XML wouldn't be of much use, as it contains only the annotations the model already knows. Usually one would review and correct the generated XML files using the annotation guidelines. The final TEI files should be stored in the tei sub directory of the corpus in the dataset.
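
While reviewing, it can help to see which generated files have not yet made it into the final directory. The following is a minimal sketch; `list_unreviewed` is a hypothetical helper, both directories are passed in by the caller because the exact dataset layout is not shown here, and it assumes that reviewed files keep the same file name as their raw counterparts.

```shell
#!/bin/sh
# list_unreviewed: hypothetical helper (not part of the trainer).
# Prints the names of files in the raw directory ($1) that have no
# same-named file in the reviewed directory ($2).
list_unreviewed() {
    raw_dir=$1
    reviewed_dir=$2
    for raw_file in "$raw_dir"/*; do
        # Skip sub directories and the unexpanded pattern of an empty directory.
        [ -f "$raw_file" ] || continue
        name=$(basename "$raw_file")
        [ -f "$reviewed_dir/$name" ] || echo "$name"
    done
}

# Example (placeholder paths):
# list_unreviewed path/to/tei-raw path/to/tei
```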

Copy Raw Header Training Data to TEI