diff --git a/README.md b/README.md
index 6ad7d7464f39a990948d346ba6c6bcf9659e23c2..1bf927a1b1461575ad17511a8e20e907188fdf53 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,7 @@ This is where the [ScienceBeam](https://github.com/elifesciences/sciencebeam) mo
 
 You can read more about the computer vision model in the [Wiki](https://github.com/elifesciences/sciencebeam-gym/wiki/Computer-Vision-Model).
 
-Pre-requisites
---------------
+# Pre-requisites
 
 - Python 2.7 ([currently Apache Beam doesn't support Python 3](https://issues.apache.org/jira/browse/BEAM-1373))
 - [Apache Beam](https://beam.apache.org/)
@@ -20,17 +19,16 @@ Run:
 python setup.py build_ext --inplace
 ```
 
-Local vs. Cloud
----------------
+# Local vs. Cloud
 
 Almost all of the commands can be run locally or in the cloud. Simply add `--cloud` to the command to run it in the cloud. You will need [gsutil](https://cloud.google.com/storage/docs/gsutil) installed even when running locally.
 
 Before running anything in the cloud, please run `upload-config.sh` to copy the required configuration to the cloud.
 
-Configuration
--------------
+# Configuration
 
 The default configuration is in the [prepare-shell.sh](prepare-shell.sh) script. Some of the configuration can be overridden by adding a `.config` file that redefines some of the variables, e.g.:
+
 ```bash
 #!/bin/bash
 
@@ -51,8 +49,7 @@ echo $DATA_PATH
 
 The following sections may refer to variables defined by that script.
 
-Pipeline
---------
+# Pipeline
 
 The TensorFlow training pipeline is illustrated in the following diagram:
 
@@ -60,33 +57,99 @@ The TensorFlow training pipeline is illustrated in the following diagram:
 
 The steps from the diagram are detailed below.
 
-### Generate PNG
+## Preprocessing
+
+The individual steps performed as part of the preprocessing are illustrated in the following diagram:
+
+![Preprocessing Pipeline](doc/sciencebeam-preprocessing.png)
+
+### Find File Pairs
+
+The preferred input layout is a directory per manuscript, containing a gzipped PDF (`.pdf.gz`) and a gzipped XML (`.nxml.gz`) file, e.g.:
+
+* manuscript_1/
+  * manuscript_1.pdf.gz
+  * manuscript_1.nxml.gz
+* manuscript_2/
+  * manuscript_2.pdf.gz
+  * manuscript_2.nxml.gz
+
+Using compressed files is optional but recommended to reduce file storage cost.
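+
+For example, existing uncompressed files could be compressed in place with gzip:
+
+```bash
+gzip manuscript_1/manuscript_1.pdf manuscript_1/manuscript_1.nxml
+```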
+
+The per-manuscript parent directory is optional. Without it, the file names of a pair must be identical up to the extension (which is recommended in general).
+
+Run:
 
-This step is currently not part of this repository (it will be made available in the future).
+```bash
+python -m sciencebeam_gym.preprocess.find_file_pairs --data-path <source directory> --pdf-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' --out <output file list csv/tsv>
+```
 
-Instead you will need access to the annotated PNGs. You can download the [example data](https://storage.googleapis.com/elife-public-data/PMC_sample_1943-page1-cv-training-data.zip) which is CV training for first pages of the PMC_sample_1943 dataset (see [Grobid End-to-end evaluation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/)).
+e.g.:
 
-The data need to be made available in `$GCS_DATA_PATH` or `$LOCAL_DATA_PATH` depending on whether running it in the cloud.
+```bash
+python -m sciencebeam_gym.preprocess.find_file_pairs --data-path gs://some-bucket/some-dataset --pdf-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' --out gs://some-bucket/some-dataset/file-list.tsv
+```
 
-Running `./upload-data.sh` (optional) will copy files from `$LOCAL_DATA_PATH` to `$GCS_DATA_PATH`.
+That will create the TSV (tab-separated) file `file-list.tsv` with the following columns:
 
-### Generate TFRecords
+* `pdf_url`
+* `xml_url`
 
-To make the training more efficient, it is recommended to use TFRecords for the training data.
+That file could also be generated using any other preferred method.
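+
+For illustration, the resulting `file-list.tsv` might look like this (hypothetical paths; columns are tab-separated):
+
+```
+pdf_url	xml_url
+gs://some-bucket/some-dataset/manuscript_1/manuscript_1.pdf.gz	gs://some-bucket/some-dataset/manuscript_1/manuscript_1.nxml.gz
+```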
 
-The following script will resize the images from `$DATA_PATH` to the required size and generate TFRecords, which will be written to `$PREPROC_PATH`:
+### Split File List
+
+To split the file list into _training_, _validation_ and _test_ datasets, the following script can be used:
 
 ```bash
+python -m sciencebeam_gym.preprocess.split_csv_dataset --input <csv/tsv file list> --train 0.5 --validation 0.2 --test 0.3 --random --fill
+```
+
+e.g.:
+
+```bash
+python -m sciencebeam_gym.preprocess.split_csv_dataset --input gs://some-bucket/some-dataset/file-list.tsv --train 0.5 --validation 0.2 --test 0.3 --random --fill
+```
+
+That will create three separate files in the same directory:
+
+* `file-list-train.tsv`
+* `file-list-validation.tsv`
+* `file-list-test.tsv`
+
+The file pairs will be randomly selected (_--random_) and one group will also include all remaining file pairs that wouldn't otherwise get included due to rounding (_--fill_).
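+
+The semantics of _--random_ and _--fill_ could be sketched as follows (an illustration only, not the actual implementation):
+
+```python
+import random
+
+def split_rows(rows, train=0.5, validation=0.2, seed=None):
+    rows = list(rows)
+    random.Random(seed).shuffle(rows)  # --random: shuffle before splitting
+    n_train = int(len(rows) * train)
+    n_validation = int(len(rows) * validation)
+    # --fill: the last group gets all remaining rows,
+    # so rounding doesn't drop any file pairs
+    return (
+        rows[:n_train],
+        rows[n_train:n_train + n_validation],
+        rows[n_train + n_validation:]
+    )
+```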
+
+As with the previous step, you may decide to use your own process instead.
+
+Note: once those file lists have been used, they shouldn't change anymore, so that the datasets stay consistent.
+
+### Preprocess
+
+The output of this step is a set of [TFRecord](https://www.tensorflow.org/programmers_guide/datasets) files used by the training process. TFRecord files are a bit like binary CSV files.
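+
+As a rough illustration of the format (assuming TensorFlow 1.x and hypothetical feature names), a record could be written and read back like this:
+
+```python
+import tensorflow as tf
+
+# write a single example with one bytes feature
+with tf.python_io.TFRecordWriter('sample.tfrecord') as writer:
+    example = tf.train.Example(features=tf.train.Features(feature={
+        'input_image': tf.train.Feature(
+            bytes_list=tf.train.BytesList(value=[b'...png bytes...']))
+    }))
+    writer.write(example.SerializeToString())
+
+# read the records back
+for record in tf.python_io.tf_record_iterator('sample.tfrecord'):
+    print(tf.train.Example.FromString(record))
+```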
+
+The input files are pairs of PDF and XML files (using file lists generated in the previous steps).
+
+Run:
+
+```bash
 ./preprocess.sh [--cloud]
 ```
 
+That will run the preprocessing pipeline for:
+
+* the training dataset, using `file-list-train.tsv`
+* the validation dataset, using `file-list-validation.tsv`
+* a qualitative dataset, using the first _n_ files and only the first page of `file-list-validation.tsv` (optional)
+
+Part of the preprocessing is an auto-annotation step which aligns text from the XML with the text in the PDF to tag the corresponding regions appropriately. It uses the [Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith_waterman) and may take some time (roughly 6 seconds per page). It will also make mistakes, but for the samples we used it was good enough.
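+
+For reference, the core of the Smith-Waterman score can be sketched in a few lines (a simplified illustration, not the actual alignment code used here):
+
+```python
+def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
+    # H[i][j] is the best score of a local alignment ending at a[:i] / b[:j]
+    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
+    best = 0
+    for i in range(1, len(a) + 1):
+        for j in range(1, len(b) + 1):
+            similarity = match if a[i - 1] == b[j - 1] else mismatch
+            H[i][j] = max(
+                0,  # a local alignment may restart anywhere
+                H[i - 1][j - 1] + similarity,  # align a[i-1] with b[j-1]
+                H[i - 1][j] + gap,  # gap in b
+                H[i][j - 1] + gap  # gap in a
+            )
+            best = max(best, H[i][j])
+    return best
+```
+
+The quadratic dynamic programming table is also what makes this step relatively slow on long pages.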
+
 You can inspect some details (e.g. the record count) of the resulting TFRecords by running the following command:
 
 ```bash
 ./inspect-tf-records.sh [--cloud]
 ```
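+
+Alternatively, record counts can be obtained directly with a few lines of TensorFlow 1.x Python (the path below is a placeholder):
+
+```python
+import glob
+import tensorflow as tf
+
+for filename in glob.glob('path/to/preproc/train/*.tfrecord'):
+    count = sum(1 for _ in tf.python_io.tf_record_iterator(filename))
+    print('%s: %d' % (filename, count))
+```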
 
-### Train TF Model
+## Train TF Model
 
 Running the following command will train the model:
 
@@ -94,11 +157,11 @@ Running the following command will train the model:
 ./train.sh [--cloud]
 ```
 
-### Export Inference Model
+## Export Inference Model
 
 This step is currently not implemented.
 
-### TensorBoard
+## TensorBoard
 
 Run TensorBoard with the correct path:
 
@@ -106,8 +169,7 @@ Run the TensorBoard with the correct path:
 ./tensorboard.sh [--cloud]
 ```
 
-Visual Studio Code Setup
-------------------------
+# Visual Studio Code Setup
 
 If you are using [Visual Studio Code](https://code.visualstudio.com/) and are using a virtual environment for Python, you can add the following entry to `.vscode/settings.json`:
 ```json
diff --git a/doc/sciencebeam-preprocessing.png b/doc/sciencebeam-preprocessing.png
new file mode 100644
index 0000000000000000000000000000000000000000..5029ef16a51044629b041fc377fdb8910a853f32
Binary files /dev/null and b/doc/sciencebeam-preprocessing.png differ