ScienceBeam Gym
This is where the ScienceBeam model is trained.
You can read more about the computer vision model in the Wiki.
Pre-requisites
- Python 2.7 (currently Apache Beam doesn't support Python 3)
- Apache Beam
- TensorFlow with Google Cloud support
- gsutil
Cython
Run:
python setup.py build_ext --inplace
Local vs. Cloud
Almost all of the commands can be run locally or in the cloud. Add --cloud to a command to run it in the cloud. gsutil must be installed even when running locally.
Before running anything in the cloud, run upload-config.sh to copy the required configuration to the cloud.
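To illustrate the --cloud convention, here is a minimal sketch of how a script could detect the flag. The run_mode helper is hypothetical and is not taken from the repository's scripts, which may parse their arguments differently.

```shell
#!/bin/bash
# Hypothetical helper (not taken from the repository's scripts):
# prints "cloud" if --cloud is among the arguments, otherwise "local".
run_mode() {
  local arg
  for arg in "$@"; do
    if [ "$arg" = "--cloud" ]; then
      echo "cloud"
      return 0
    fi
  done
  echo "local"
}

echo "Running in $(run_mode "$@") mode"
```

Only an exact --cloud argument switches the mode; any other argument is ignored by this helper.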
Configuration
The default configuration is in the prepare-shell.sh script. Some settings can be overridden by adding a .config file that sets the corresponding variables, e.g.:
#!/bin/bash
export TRAINING_SUFFIX=-gan-1-l1-100
export TRAINING_ARGS="--gan_weight=1 --l1_weight=100"
export USE_SEPARATE_CHANNELS=true
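The override mechanism can be pictured as "set defaults, then source the override file if it exists". The following is a sketch under that assumption only: the load_overrides helper and the default values are hypothetical, and prepare-shell.sh may load .config differently.

```shell
#!/bin/bash
# Illustrative defaults only (assumed, not the project's real defaults).
export TRAINING_SUFFIX=""
export TRAINING_ARGS=""
export USE_SEPARATE_CHANNELS=false

# Hypothetical helper: source an override file if it exists, so that
# any variables it exports replace the defaults set above.
load_overrides() {
  local config_file="${1:-./.config}"
  if [ -f "$config_file" ]; then
    . "$config_file"
  fi
}

load_overrides "./.config"
echo "TRAINING_SUFFIX=$TRAINING_SUFFIX"
echo "TRAINING_ARGS=$TRAINING_ARGS"
echo "USE_SEPARATE_CHANNELS=$USE_SEPARATE_CHANNELS"
```

Variables that the override file does not mention keep their default values.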
Inspecting Configuration
The configuration can be inspected by running source prepare-shell.sh, e.g. the following sequence of commands will print the data directory:
source prepare-shell.sh
echo $DATA_PATH
The following sections may refer to variables defined by that script.
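To see several path variables at once, the shell's own variable listing can be filtered. This is a rough sketch: the show_path_vars helper is hypothetical, and it simply assumes that the relevant variables have names ending in _PATH (as DATA_PATH and PREPROC_PATH do). Unrelated variables such as LD_LIBRARY_PATH would also be listed.

```shell
#!/bin/bash
# Load the configuration if the script is present in the current directory.
if [ -f ./prepare-shell.sh ]; then
  . ./prepare-shell.sh
fi

# Hypothetical helper: list shell variables whose names end in _PATH.
show_path_vars() {
  set | grep -E '^[A-Z_]+_PATH='
}

show_path_vars || echo "no *_PATH variables are set"
```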
Pipeline
The TensorFlow training pipeline is illustrated in the following diagram:
The steps from the diagram are detailed below.
Generate PNG
This step is currently not part of this repository (it will be made available in the future).
Instead, you will need access to the annotated PNGs. You can download the example data, which is the CV training data for the first pages of the PMC_sample_1943 dataset (see Grobid End-to-end evaluation).
The data needs to be made available in $GCS_DATA_PATH or $LOCAL_DATA_PATH, depending on whether you are running in the cloud or locally.
Running ./upload-data.sh (optional) will copy files from $LOCAL_DATA_PATH to $GCS_DATA_PATH.
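For orientation, a copy from the local data directory to the bucket could be done with gsutil rsync. The sketch below is an assumption about what such a copy might look like, not the contents of upload-data.sh, which may use a different gsutil command. The sync_data helper and the example paths are hypothetical; by default the helper only prints the command.

```shell
#!/bin/bash
# Hypothetical helper: print (dry run, the default) or execute a
# recursive, parallel gsutil sync from a local directory to a bucket path.
sync_data() {
  local src="$1" dst="$2" dry_run="${3:-true}"
  if [ "$dry_run" = "true" ]; then
    echo "gsutil -m rsync -r $src $dst"
  else
    gsutil -m rsync -r "$src" "$dst"
  fi
}

# Example paths are placeholders used when the variables are not set.
sync_data "${LOCAL_DATA_PATH:-./data}" "${GCS_DATA_PATH:-gs://example-bucket/data}"
```

Passing false as the third argument runs the command instead of printing it.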
Generate TFRecords
To make the training more efficient, it is recommended to use TFRecords for the training data.
The following script will resize the images from $DATA_PATH to the required size and generate TFRecords, which will be written to $PREPROC_PATH:
./preprocess.sh [--cloud]
You can inspect some details (e.g. count) of the resulting TFRecords by running the following command:
./inspect-tf-records.sh [--cloud]
Train TF Model
Running the following command will train the model:
./train.sh [--cloud]
Export Inference Model
This step is currently not implemented.
TensorBoard
Run TensorBoard with the correct path:
./tensorboard.sh [--cloud]
Visual Studio Code Setup
If you are using Visual Studio Code with a virtual environment for Python, you can add the following entry to .vscode/settings.json:
"python.pythonPath": "${workspaceRoot}/venv/bin/python"
Then create a link to the virtual environment named venv.
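The link can be a symbolic link in the workspace root. A minimal sketch follows; the link_venv helper is hypothetical, and the path in the usage comment is a placeholder for wherever your virtual environment actually lives.

```shell
#!/bin/bash
# Hypothetical helper: create (or replace) a symbolic link, named "venv"
# by default, that points at an existing virtual environment directory.
link_venv() {
  local target="$1" link_name="${2:-venv}"
  ln -sfn "$target" "$link_name"
}

# Usage, run from the workspace root (placeholder path):
# link_venv "$HOME/.virtualenvs/sciencebeam-gym"
```

After linking, venv/bin/python resolves to the interpreter of the virtual environment, which is what the settings entry above expects.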