  1. Oct 13, 2021
  2. Oct 11, 2021
  3. Sep 20, 2021
  4. Sep 15, 2021
  5. Sep 14, 2021
    • added bounding box pipeline (#403) · 603edca9
      Daniel Ecer authored
      * process pdf and xml file lists
      
      * allow sub directories in output path
      
      * make output_annotated_images_path relative to output
      
      * make sure that output directory is created
      
      * use write_bytes in favour of explicit makedirs (cloud ready)
      
      * make --pdf-base-path required
      
      * added pipeline
      
      * not extending ABC due to serialization errors
      
      * log pipeline options
      
      * added test to check serialization
      
      * changed super import to avoid one of the serialization errors in Dataflow
      
      * moved most functionality to separate module
      
      * added libgl1 to setup.py
      
      * added PreventFusion
      
      * minor import grouping
      
      * added TransformAndCount
      
      * reverted super __init__ call
      
      * use parse args, not ignoring unknown args
      
      * expose all of the worker arguments
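
The switch to "parse args, not ignoring unknown args" noted above can be illustrated with a minimal sketch using the standard `argparse` module. Only the `--pdf-base-path` option name comes from the commit notes; the parser, the program name, and the helper functions `parse_strict` and `parse_lenient` are hypothetical and are not taken from the project.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only --pdf-base-path is named in the commit notes.
    parser = argparse.ArgumentParser(prog="bounding-box-pipeline")
    parser.add_argument("--pdf-base-path", required=True)
    return parser


def parse_strict(argv):
    # parse_args exits with an error on any argument it does not recognise.
    return build_parser().parse_args(argv)


def parse_lenient(argv):
    # parse_known_args returns unrecognised arguments instead of failing,
    # so a mistyped flag can go unnoticed.
    return build_parser().parse_known_args(argv)
```

With the strict variant a mistyped option stops the run immediately, whereas the lenient variant hands it back in a list of leftovers that the caller has to inspect.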
  6. Sep 09, 2021
  7. Sep 08, 2021
  8. Sep 07, 2021
    • check bounding box; fixed image cache key (#400) · f69743c2
      Daniel Ecer authored
      * calculate structural similarity using skimage
      
      * fixed crop_image_to_bounding_box
      
      * output score in JSON
      
      * optionally output images with bounding boxes
      
      * display bbox label inside if not enough space above
      
      * sort by score, then key points
      
      * fixed cache issue by using explicit cache key prefix
      
      (otherwise ids may have been reused after memory was freed)
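
The cache fix described above (an explicit key prefix instead of relying on object ids) might look roughly like the following sketch. In CPython, `id()` values can be reused once an object has been garbage collected, so a cache keyed on them can return a stale entry for a different object. The class `PrefixedCache` and its method names are hypothetical, not the project's actual code.

```python
from typing import Callable, Dict, Hashable, Tuple


class PrefixedCache:
    """Hypothetical cache keyed by (caller-supplied prefix, item key), not id(obj)."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, Hashable], object] = {}
        self.miss_count = 0

    def get_or_compute(
        self, prefix: str, key: Hashable, compute: Callable[[], object]
    ) -> object:
        # The explicit prefix keeps entries from different sources apart,
        # even if the item keys (or object ids) happen to coincide.
        cache_key = (prefix, key)
        if cache_key not in self._values:
            self.miss_count += 1
            self._values[cache_key] = compute()
        return self._values[cache_key]
```

The design choice is that the caller names what is being cached, so the key stays stable for as long as that name is meaningful rather than for as long as a particular object happens to live in memory.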
  9. Sep 03, 2021
    • raise error when figure bounding box could not be found (#399) · 46e14b3c
      Daniel Ecer authored
      * raise GraphicImageNotFoundError
      
      * allow skipping errors
    • figure image bounding box annotation for single document (#389) · 0bf9e780
      Daniel Ecer authored
      * enable debug logging for tests
      
      * added cli scaffolding
      
      * extract images from pdf
      
      * fixed type hint
      
      * added bounding box to_list
      
      * converted bounding box to named tuple
      
      * added tests for validate
      
      * implemented bounding box intersection
      
      * implemented finding bounding boxes of single image
      
      * added test for smaller partial image
      
      * added libgl1 for open cv
      
      * linting: use with statement for Popen
      
      * added support for multiple image files
      
      * added support for xml files
      
      * join graphic href with xml dirname
      
      * renamed cv2 to cv
      
      * using ObjectDetectorMatcher
      
      * moved functions to image object matching module
      
      * added TestGetObjectMatch
      
      * added ImageObjectMarchResult
      
      * added test_should_match_smaller_image
      
      * added test_should_match_smaller_rotated_90_image
      
      * fixed typo ImageObjectMatchResult
      
      * moved object_detector_matcher parameter down
      
      * added get_image_list_object_match
      
      * added su...
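
Several of the visible items above concern a bounding box type: a named tuple with `to_list` and an intersection operation. A minimal sketch of that idea follows. The field names (`x`, `y`, `width`, `height`) and the behaviour for non-overlapping boxes are assumptions; the commit notes only mention the named tuple, `to_list`, and intersection.

```python
from typing import List, NamedTuple


class BoundingBox(NamedTuple):
    # Field names are assumptions; the commit notes only say "named tuple".
    x: float
    y: float
    width: float
    height: float

    def to_list(self) -> List[float]:
        return [self.x, self.y, self.width, self.height]

    def intersection(self, other: "BoundingBox") -> "BoundingBox":
        left = max(self.x, other.x)
        top = max(self.y, other.y)
        right = min(self.x + self.width, other.x + other.width)
        bottom = min(self.y + self.height, other.y + other.height)
        # Clamp to zero so that non-overlapping boxes yield an empty box.
        return BoundingBox(left, top, max(0, right - left), max(0, bottom - top))
```

For example, `BoundingBox(0, 0, 10, 10).intersection(BoundingBox(5, 5, 10, 10))` gives `BoundingBox(5, 5, 5, 5)` under these assumptions.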
  10. Aug 26, 2021
  11. May 18, 2021
    • create vocabulary (#349) · e3ec9802
      Daniel Ecer authored
      * initial create vocabulary utility
      
      * extract vocabulary from embeddings
      
      * renamed to --output-word-count-file
      
      * added main call
      
      * extracted iter_tokenized_tokens
      
      * avoid empty tokens
      
      * using tokenizer from delft
      
      * optionally sort by count
      
      * added file list support
      
      * added support for remote files
      
      * added limit argument
      
      * added fsspec dependency
      
      * optionally use multi threading or processing
      
      * included full github link
      
      * renamed to create_vocabulary
      
      * moved to tools vocabulary
      
      * filter embeddings
      
      * renamed to embeddings
      
      * using fsspec to open embeddings file when extracting
      
      * use fsspec when filtering embeddings
      
      * document tools
      
      * added link to tools.md
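
The vocabulary steps listed above (tokenize, avoid empty tokens, optionally sort by count, apply a limit) can be sketched as follows. The names `iter_tokenized_tokens` and `create_vocabulary` appear in the commit notes, but the signatures here are assumptions, and the regular-expression tokenizer is only a stand-in for the delft tokenizer the commit refers to.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_tokenized_tokens(lines: Iterable[str]) -> Iterator[str]:
    # Stand-in tokenizer (assumption): runs of word characters,
    # or single non-space symbols.
    for line in lines:
        for token in re.findall(r"\w+|[^\w\s]", line):
            if token:  # avoid empty tokens
                yield token


def create_vocabulary(
    lines: Iterable[str],
    sort_by_count: bool = False,
    limit: Optional[int] = None,
) -> List[Tuple[str, int]]:
    counts = Counter(iter_tokenized_tokens(lines))
    # most_common() orders by descending count; otherwise keep first-seen order.
    items = counts.most_common() if sort_by_count else list(counts.items())
    return items[:limit] if limit is not None else items
```

Called as `create_vocabulary(["a b a", "b a."], sort_by_count=True, limit=2)`, this sketch returns the two most frequent tokens with their counts.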
  12. May 13, 2021
  13. Dec 07, 2020
    • fix serialisation issue (#280) · 3d7ba502
      Daniel Ecer authored
      * added lxml end-to-end test
      
      * configured test logging
      
      * prefer /usr/bin/timeout
      
      * added debug logging
      
      * added pdf end to end preprocessing test
      
      * fixed serialization error
      
      * added debug logging
      
      * check pipeline results
      
      * test png output
      
      * fixed element type check
      
      * install poppler-utils into docker image
      
      * add --assume-yes
  14. Jan 17, 2020
  15. Sep 09, 2019
  16. Aug 27, 2019
  17. Aug 21, 2019
  18. Jun 10, 2019
  19. Jun 05, 2019
    • added autocut model (#106) · 85754f2d
      Daniel Ecer authored
      * added dev-venv target
      
      * added subextract model
      
      * added nltk dependency
      
      * flake8 ignore line break before binary operator
      
      * moved dev dependencies up
      
      * added nltk punkt download
      
      * added nltk download to dev-venv; pytest and pytest-not-slow target
      
      * added subextract training pipeline
      
      * added optional xpath namespaces
      
      * log failed xml file
      
      * use recover parser option
      
      * added subextract app
      
      * start subextract server
      
      * renamed to autocut
      
      * declare slow and very_slow pytest markers
      
      * mark autocut main test as slow
      
      * fixed post data
      
      * updated README
      
      * also build non-dev image as part of ci
      
      * added pytest.ini to dev image
  20. Jun 03, 2019
  21. May 06, 2019
  22. Jan 31, 2019
  23. Nov 02, 2018
    • pylint and flake8 checking (#39) · 91e1c0d0
      Daniel Ecer authored
      * added pylint check
      
      * added pylintrc to docker image
      
      * reduced excessive Apache Beam debug logging
      
      * configured pylint, addressed linting
      
      * enabled flake8 checks
      
      * downgrade pycodestyle to 2.3.1 due to error
      
      * switch to 4 spaces indent
      
      * autopep8
      
      * more flake8
      
      * added new line to .flake8
  24. Aug 24, 2018
  25. Jul 06, 2018
  26. Mar 20, 2018
  27. Feb 01, 2018
  28. Jan 30, 2018