- Oct 13, 2021
-
-
Daniel Ecer authored
* log progress while matching figure image to each pdf page * yield empty image object match to keep total * added simple MultiLevelCache * implemented disk cache * added TestRegisterPickleFunction * increased default memory cache size
-
- Oct 11, 2021
-
-
Daniel Ecer authored
* moved get_image_array_with_max_resolution * added --pdf-scale-to argument * convert pdf from local pdf * show progress when loading pdf images * workaround for auto-scaling * log progress while checking for existing files
-
Daniel Ecer authored
-
- Sep 20, 2021
-
-
Daniel Ecer authored
* renamed --skip-errors to --ignore-unmatched-graphics * implemented --skip-errors * run test_should_annotate_using_jats_xml using beam * fixed MapOrLog usage when skipping errors
-
- Sep 15, 2021
-
-
Daniel Ecer authored
-
Daniel Ecer authored
* moved bbox main to separate module * renamed test module to match main module under test
-
Daniel Ecer authored
* conditionally write dummy copy of xml file * renamed to output_json_path * added coords attribute * using namespace for coords attribute * add namespace to nsmap
-
- Sep 14, 2021
-
-
Daniel Ecer authored
* process pdf and xml file lists * allow sub directories in output path * make output_annotated_images_path relative to output * make sure that output directory is created * use write_bytes in favour of explicit makedirs (cloud ready) * make --pdf-base-path required * added pipeline * not extending ABC due to serialization errors * log pipeline options * added test to check serialization * changed super import to avoid one of the serialization errors in Dataflow * moved most functionality to separate module * added libgl1 to setup.py * added PreventFusion * minor import grouping * added TransformAndCount * reverted super __init__ call * use parse args, not ignoring unknown args * expose all of the worker arguments
-
- Sep 09, 2021
-
-
Daniel Ecer authored
* added BoundingBoxScoreSummary * added TestGetBoundingBoxMatchScoreSummary * implemented algorithm to adjust final bounding box
-
- Sep 08, 2021
-
-
Daniel Ecer authored
* specify cache key for object keypoints * made image id required for lower level functions * added logging to find bounding boxes * 2nd iteration to find bounding box without bounding box * use fixed size when calculating image similarity
-
- Sep 07, 2021
-
-
Daniel Ecer authored
* calculate structural similarity using skimage * fixed crop_image_to_bounding_box * output score in JSON * optionally output images with bounding boxes * display bbox label inside if not enough space above * sort by score, then key points * fixed cache issue by using explicit cache key prefix (otherwise ids may have been reused after memory being freed)
-
- Sep 03, 2021
-
-
Daniel Ecer authored
* raise GraphicImageNotFoundError * allow skipping errors
-
Daniel Ecer authored
* enable debug logging for tests * added cli scaffolding * extract images from pdf * fixed type hint * added bounding box to_list * converted bounding box to named tuple * added tests for validate * implemented bounding box intersection * implemented finding bounding boxes of single image * added test for smaller partial image * added libgl1 for open cv * linting: use with statement for Popen * added support for multiple image files * added support for xml files * join graphic href with xml dirname * renamed cv2 to cv * using ObjectDetectorMatcher * moved functions to image object matching module * added TestGetObjectMatch * added ImageObjectMarchResult * added test_should_match_smaller_image * added test_should_match_smaller_rotated_90_image * fixed typo ImageObjectMatchResult * moved object_detector_matcher parameter down * added get_image_list_object_match * added su...
-
- Aug 26, 2021
-
-
Daniel Ecer authored
* added mypy dependency * added dev-mypy * added mypy make target * declare EMPTY class prop * removed incorrect tensors type hint * added type hint to excluded_tokens * removed unused ProcessedWrapper * replacing backports.tempfile with builtin * added T_ArgumentParserOrGroup * fixed iter_tokenized_tokens return type hint * added types-requests * added type to DEFAULT_ANNOTATORS * removed blank line * ignore distutils import * replaced T_Element with etree.ElementBase * changed type check back to etree._Element * replaced project_tests.sh * removed second mypy make target dependency
-
dependabot[bot] authored
* Bump pylint from 2.8.3 to 2.10.2 Bumps [pylint](https://github.com/PyCQA/pylint) from 2.8.3 to 2.10.2. - [Release notes](https://github.com/PyCQA/pylint/releases) - [Changelog](https://github.com/PyCQA/pylint/blob/main/ChangeLog) - [Commits](https://github.com/PyCQA/pylint/compare/v2.8.3...v2.10.2) --- updated-dependencies: - dependency-name: pylint dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * make installing dependencies more predictable * downgrade numpy due to conflict with apache beam * downgraded numpy further due to conflict with tensorflow * linting: use dict literal * linting: pass in encoding to open function * linting: pcoll renamed to input_or_inputs * linting: iterate over list * linting: use from .. import * added pyarrow as explicit dependency * downgrade google-cloud-bigquery Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Daniel Ecer <de-code@users.noreply.github.com>
-
- May 18, 2021
-
-
Daniel Ecer authored
* initial create vocabulary utility * extract vocabulary from embeddings * renamed to --output-word-count-file * added main call * extracted iter_tokenized_tokens * avoid empty tokens * using tokenizer from delft * optionally sort by count * added file list support * added support for remote files * added limit argument * added fsspec dependency * optionally use multi threading or processing * included full github link * renamed to create_vocabulary * moved to tools vocabulary * filter embeddings * renamed to embeddings * using fsspec to open embeddings file when extracting * use fsspec when filtering embeddings * document tools * added link to tools.md
-
- May 13, 2021
-
-
dependabot-preview[bot] authored
* Bump apache-beam[gcp] from 2.28.0 to 2.29.0 Bumps [apache-beam[gcp]](https://github.com/apache/beam) from 2.28.0 to 2.29.0. - [Release notes](https://github.com/apache/beam/releases) - [Changelog](https://github.com/apache/beam/blob/master/CHANGES.md) - [Commits](https://github.com/apache/beam/compare/v2.28.0...v2.29.0) Signed-off-by: dependabot-preview[bot] <support@dependabot.com> * replaced mock with unittest.mock Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com> Co-authored-by: Daniel Ecer <de-code@users.noreply.github.com>
-
Daniel Ecer authored
* removed usage of six * removed six dependency
-
dependabot-preview[bot] authored
* Bump pylint from 2.4.4 to 2.8.2 Bumps [pylint](https://github.com/PyCQA/pylint) from 2.4.4 to 2.8.2. - [Release notes](https://github.com/PyCQA/pylint/releases) - [Changelog](https://github.com/PyCQA/pylint/blob/master/ChangeLog) - [Commits](https://github.com/PyCQA/pylint/compare/pylint-2.4.4...v2.8.2) Signed-off-by: dependabot-preview[bot] <support@dependabot.com> * ignore lxml element builder not callable * addressed most linting issues * ignore false positive unsubscriptable-object Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com> Co-authored-by: Daniel Ecer <de-code@users.noreply.github.com>
-
dependabot-preview[bot] authored
* Bump flake8 from 3.7.9 to 3.9.2 Bumps [flake8](https://gitlab.com/pycqa/flake8) from 3.7.9 to 3.9.2. - [Release notes](https://gitlab.com/pycqa/flake8/tags) - [Commits](https://gitlab.com/pycqa/flake8/compare/3.7.9...3.9.2) Signed-off-by: dependabot-preview[bot] <support@dependabot.com> * addressed linting Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com> Co-authored-by: Daniel Ecer <de-code@users.noreply.github.com>
-
- Dec 07, 2020
-
-
Daniel Ecer authored
* added lxml end-to-end test * configured test logging * prefer /usr/bin/timeout * added debug logging * added pdf end to end preprocessing test * fixed serialization error * added debug logging * check pipeline results * test png output * fixed element type check * install poppler-utils into docker image * add --assume-yes
-
- Jan 17, 2020
-
-
dependabot-preview[bot] authored
* Bump pylint from 2.3.1 to 2.4.4 Bumps [pylint](https://github.com/PyCQA/pylint) from 2.3.1 to 2.4.4. - [Release notes](https://github.com/PyCQA/pylint/releases) - [Changelog](https://github.com/PyCQA/pylint/blob/master/ChangeLog) - [Commits](https://github.com/PyCQA/pylint/compare/pylint-2.3.1...pylint-2.4.4) Signed-off-by: dependabot-preview[bot] <support@dependabot.com> * removed explicit astroid dependency * linting * more flake8 linting * explicit pylint dependency Co-authored-by: Daniel Ecer <de-code@users.noreply.github.com>
-
- Sep 09, 2019
-
-
Daniel Ecer authored
* upgraded to python 3 * upgrade pylint and pytest * replaced StandardError * exclude useless-object-inheritance * python3 compatibilities uncovered by linting * fixed tests * fixed more python3 test incompatibilities
-
Daniel Ecer authored
-
- Jun 05, 2019
-
-
Daniel Ecer authored
* added dev-venv target * added subextract model * added nltk dependency * flake8 ignore line break before binary operator * moved dev dependencies up * added nltk punkt download * added nltk download to dev-venv; pytest and pytest-not-slow target * added subextract training pipeline * added optional xpath namespaces * log failed xml file * use recover parser option * added subextract app * start subextract server * renamed to autocut * declare slow and very_slow pytest markers * make autocut main test as slow * fixed post data * updated README * also build non-dev image as part of ci * added pytest.ini to dev image
-
- Jun 03, 2019
-
-
Daniel Ecer authored
-