diff --git a/README.md b/README.md
index 6ad7d7464f39a990948d346ba6c6bcf9659e23c2..1bf927a1b1461575ad17511a8e20e907188fdf53 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,7 @@ This is where the [ScienceBeam](https://github.com/elifesciences/sciencebeam) mo
 
 You can read more about the computer vision model in the [Wiki](https://github.com/elifesciences/sciencebeam-gym/wiki/Computer-Vision-Model).
 
-Pre-requisites
---------------
+# Pre-requisites
 
 - Python 2.7 ([currently Apache Beam doesn't support Python 3](https://issues.apache.org/jira/browse/BEAM-1373))
 - [Apache Beam](https://beam.apache.org/)
@@ -20,17 +19,16 @@ Run:
 python setup.py build_ext --inplace
 ```
 
-Local vs. Cloud
----------------
+# Local vs. Cloud
 
 Almost all of the commands can be run locally or in the cloud. Simply add `--cloud` to the command to run it in the cloud. You will have to have [gsutil](https://cloud.google.com/storage/docs/gsutil) installed even when running locally.
 
 Before running anything in the cloud, please run `upload-config.sh` to copy the required configuration to the cloud.
 
-Configuration
--------------
+# Configuration
 
 The default configuration is in the [prepare-shell.sh](prepare-shell.sh) script. Some of the configuration can be overriden by adding a `.config` file which overrides some of the variables, e.g.:
+
 ```bash
 #!/bin/bash
 
@@ -51,8 +49,7 @@ echo $DATA_PATH
 
 The following sections may refer to variables defined by that script.
 
-Pipeline
---------
+# Pipeline
 
 The TensorFlow training pipeline is illustrated in the following diagram:
 
@@ -60,33 +57,99 @@ The TensorFlow training pipeline is illustrated in the following diagram:
 
 The steps from the diagram are detailed below.
-### Generate PNG
+## Preprocessing
+
+The individual steps performed as part of the preprocessing are illustrated in the following diagram:
+
+![Preprocessing steps](doc/sciencebeam-preprocessing.png)
+
+### Find File Pairs
+
+The preferred input layout is a directory per manuscript, containing a gzipped PDF (`.pdf.gz`) and a gzipped XML (`.nxml.gz`), e.g.:
+
+* manuscript_1/
+  * manuscript_1.pdf.gz
+  * manuscript_1.nxml.gz
+* manuscript_2/
+  * manuscript_2.pdf.gz
+  * manuscript_2.nxml.gz
+
+Using compressed files is optional but recommended to reduce storage cost.
+
+The per-manuscript parent directory is optional; if it is omitted, the file names before the extension must be identical (which is recommended in general anyway).
+
+Run:
-This step is currently not part of this repository (it will be made available in the future).
+```bash
+python -m sciencebeam_lab.preprocess.find_file_pairs --data-path <source directory> --pdf-pattern *.pdf.gz --xml-pattern *.nxml.gz --out <output file list csv/tsv>
+```
-Instead you will need access to the annotated PNGs. You can download the [example data](https://storage.googleapis.com/elife-public-data/PMC_sample_1943-page1-cv-training-data.zip) which is CV training for first pages of the PMC_sample_1943 dataset (see [Grobid End-to-end evaluation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/)).
+e.g.:
-The data need to be made available in `$GCS_DATA_PATH` or `$LOCAL_DATA_PATH` depending on whether running it in the cloud.
+```bash
+python -m sciencebeam_lab.preprocess.find_file_pairs --data-path gs://some-bucket/some-dataset --pdf-pattern *.pdf.gz --xml-pattern *.nxml.gz --out gs://some-bucket/some-dataset/file-list.tsv
+```
-Running `./upload-data.sh` (optional) will copy files from `$LOCAL_DATA_PATH` to `$GCS_DATA_PATH`.
+That will create the tab-separated (TSV) file `file-list.tsv` with the following columns:
-### Generate TFRecords
+* _pdf_url_
+* _xml_url_
-To make the training more efficient, it is recommended to use TFRecords for the training data.
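If you prefer to generate the file list yourself, the pairing logic can be sketched in a few lines of Python. This is a hypothetical illustration following the layout described above; `find_file_pairs` and `write_file_list` here are stand-ins, not the actual `sciencebeam_lab` implementations:

```python
import csv
import os


def find_file_pairs(data_path, pdf_suffix='.pdf.gz', xml_suffix='.nxml.gz'):
    # Yield (pdf_url, xml_url) pairs for files sharing the same name before the extension.
    for root, _, file_names in os.walk(data_path):
        pdf_names = {f[:-len(pdf_suffix)]: f for f in file_names if f.endswith(pdf_suffix)}
        xml_names = {f[:-len(xml_suffix)]: f for f in file_names if f.endswith(xml_suffix)}
        for name in sorted(set(pdf_names) & set(xml_names)):
            yield os.path.join(root, pdf_names[name]), os.path.join(root, xml_names[name])


def write_file_list(file_pairs, out_filename):
    # Write the tab-separated file list with the pdf_url / xml_url columns.
    with open(out_filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(['pdf_url', 'xml_url'])
        writer.writerows(file_pairs)
```

Pairing by shared name prefix is what makes the identical-base-name convention useful when the per-manuscript directories are omitted.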
+That file could also be generated using any other preferred method.
 
-The following script will resize the images from `$DATA_PATH` to the required size and generate TFRecords, which will be written to `$PREPROC_PATH`:
+### Split File List
+
+To separate the file list into a _training_, _validation_ and _test_ dataset, the following script can be used:
+
+```bash
+python -m sciencebeam_gym.preprocess.split_csv_dataset --input <csv/tsv file list> --train 0.5 --validation 0.2 --test 0.3 --random --fill
+```
+
+e.g.:
+
+```bash
+python -m sciencebeam_gym.preprocess.split_csv_dataset --input gs://some-bucket/some-dataset/file-list.tsv --train 0.5 --validation 0.2 --test 0.3 --random --fill
+```
+
+That will create three separate files in the same directory:
+
+* `file-list-train.tsv`
+* `file-list-validation.tsv`
+* `file-list-test.tsv`
+
+The file pairs will be selected at random (_--random_), and one group will also receive any remaining file pairs that would otherwise be dropped due to rounding (_--fill_).
+
+As with the previous step, you may decide to use your own process instead.
+
+Note: once generated, these file lists should not be changed, so that the training, validation and test sets stay consistent.
+
+### Preprocess
+
+The output of this step is a set of [TFRecord](https://www.tensorflow.org/programmers_guide/datasets) files used by the training process. TFRecord files are somewhat like binary CSV files.
+
+The input files are pairs of PDF and XML files (using the file lists generated in the previous steps).
+
+Run:
+
+```bash
+./preprocess.sh [--cloud]
+```
+
+That will run the preprocessing pipeline for:
+
+* the training dataset, using `file-list-train.tsv`
+* the validation dataset, using `file-list-validation.tsv`
+* the qualitative dataset, using the first _n_ files and only the first page of `file-list-validation.tsv` (optional)
+
+Part of the preprocessing is an auto-annotation step which aligns text from the XML with the text in the PDF to tag the corresponding regions appropriately.
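To picture what the alignment behind the auto-annotation does, here is a minimal character-level sketch of Smith-Waterman local alignment scoring. This is a simplified toy version with assumed match/mismatch/gap scores, not the actual sciencebeam-gym implementation:

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the score of the best local alignment between sequences a and b."""
    previous = [0] * (len(b) + 1)  # scoring matrix row for a[:i-1]
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a new local alignment here
                previous[j - 1] + score,  # align a[i-1] with b[j-1]
                previous[j] + gap,        # gap in b
                current[j - 1] + gap,     # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

The dynamic programming table is quadratic in the lengths of the two texts, which helps explain why this step takes a few seconds per page.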
+It uses the [Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith_waterman). It may take some time (roughly 6 seconds per page) and will make some mistakes, but for the samples we used it was accurate enough.
+
 You can inspect some details (e.g. count) of the resulting TFRecords by running the following command:
 
 ```bash
 ./inspect-tf-records.sh [--cloud]
 ```
 
-### Train TF Model
+## Train TF Model
 
 Running the following command will train the model:
 
@@ -94,11 +157,11 @@ Running the following command will train the model:
 ./train.sh [--cloud]
 ```
 
-### Export Inference Model
+## Export Inference Model
 
 This step is currently not implemented.
 
-### TensorBoard
+## TensorBoard
 
 Run the TensorBoard with the correct path:
 
@@ -106,8 +169,7 @@ Run the TensorBoard with the correct path:
 ./tensorboard.sh [--cloud]
 ```
 
-Visual Studio Code Setup
-------------------------
+# Visual Studio Code Setup
 
 If you are using [Visual Studio Code](https://code.visualstudio.com/) and are using a virtual environment for Python, you can add the following entry to `.vscode/settings.json`:
 
 ```json
diff --git a/doc/sciencebeam-preprocessing.png b/doc/sciencebeam-preprocessing.png
new file mode 100644
index 0000000000000000000000000000000000000000..5029ef16a51044629b041fc377fdb8910a853f32
Binary files /dev/null and b/doc/sciencebeam-preprocessing.png differ