  1. Oct 13, 2021
  2. Oct 11, 2021
  3. Sep 20, 2021
  4. Sep 15, 2021
  5. Sep 14, 2021
    • added bounding box pipeline (#403) · 603edca9
      Daniel Ecer authored
      * process pdf and xml file lists
      
      * allow sub directories in output path
      
      * make output_annotated_images_path relative to output
      
      * make sure that output directory is created
      
      * use write_bytes in favour of explicit makedirs (cloud ready)
      
      * make --pdf-base-path required
      
      * added pipeline
      
      * not extending ABC due to serialization errors
      
      * log pipeline options
      
      * added test to check serialization
      
      * changed super import to avoid one of the serialization errors in Dataflow
      
      * moved most functionality to separate module
      
      * added libgl1 to setup.py
      
      * added PreventFusion
      
      * minor import grouping
      
      * added TransformAndCount
      
      * reverted super __init__ call
      
      * use parse args, not ignoring unknown args
      
      * expose all of the worker arguments
  6. Sep 09, 2021
  7. Sep 08, 2021
    • improve bounding box accuracy, use second pass (#401) · 022a6549
      Daniel Ecer authored
      * specify cache key for object keypoints
      
      * made image id required for lower level functions
      
      * added logging to find bounding boxes
      
      * 2nd iteration to find bounding box without bounding box
      
      * use fixed size when calculating image similarity
  8. Sep 07, 2021
    • check bounding box; fixed image cache key (#400) · f69743c2
      Daniel Ecer authored
      * calculate structural similarity using skimage
      
      * fixed crop_image_to_bounding_box
      
      * output score in JSON
      
      * optionally output images with bounding boxes
      
      * display bbox label inside if not enough space above
      
      * sort by score, then key points
      
      * fixed cache issue by using explicit cache key prefix
      
      (otherwise ids may have been reused after memory being freed)
  9. Sep 03, 2021
    • raise error when figure bounding box could not be found (#399) · 46e14b3c
      Daniel Ecer authored
      * raise GraphicImageNotFoundError
      
      * allow skipping errors
    • figure image bounding box annotation for single document (#389) · 0bf9e780
      Daniel Ecer authored
      * enable debug logging for tests
      
      * added cli scaffolding
      
      * extract images from pdf
      
      * fixed type hint
      
      * added bounding box to_list
      
      * converted bounding box to named tuple
      
      * added tests for validate
      
      * implemented bounding box intersection
      
      * implemented finding bounding boxes of single image
      
      * added test for smaller partial image
      
      * added libgl1 for open cv
      
      * linting: use with statement for Popen
      
      * added support for multiple image files
      
      * added support for xml files
      
      * join graphic href with xml dirname
      
      * renamed cv2 to cv
      
      * using ObjectDetectorMatcher
      
      * moved functions to image object matching module
      
      * added TestGetObjectMatch
      
      * added ImageObjectMarchResult
      
      * added test_should_match_smaller_image
      
      * added test_should_match_smaller_rotated_90_image
      
      * fixed typo ImageObjectMatchResult
      
      * moved object_detector_matcher parameter down
      
      * added get_image_list_object_match
      
      * added su...
  10. Aug 26, 2021
  11. May 18, 2021
    • create vocabulary (#349) · e3ec9802
      Daniel Ecer authored
      * initial create vocabulary utility
      
      * extract vocabulary from embeddings
      
      * renamed to --output-word-count-file
      
      * added main call
      
      * extracted iter_tokenized_tokens
      
      * avoid empty tokens
      
      * using tokenizer from delft
      
      * optionally sort by count
      
      * added file list support
      
      * added support for remote files
      
      * added limit argument
      
      * added fsspec dependency
      
      * optionally use multi threading or processing
      
      * included full github link
      
      * renamed to create_vocabulary
      
      * moved to tools vocabulary
      
      * filter embeddings
      
      * renamed to embeddings
      
      * using fsspec to open embeddings file when extracting
      
      * use fsspec when filtering embeddings
      
      * document tools
      
      * added link to tools.md
  12. May 13, 2021
  13. Dec 07, 2020
    • fix serialisation issue (#280) · 3d7ba502
      Daniel Ecer authored
      * added lxml end-to-end test
      
      * configured test logging
      
      * prefer /usr/bin/timeout
      
      * added debug logging
      
      * added pdf end to end preprocessing test
      
      * fixed serialization error
      
      * added debug logging
      
      * check pipeline results
      
      * test png output
      
      * fixed element type check
      
      * install poppler-utils into docker image
      
      * add --assume-yes
  14. Jan 17, 2020
  15. Sep 09, 2019
    • switched to python3 (#145) · b3473e4c
      Daniel Ecer authored
      * upgraded to python 3
      
      * upgrade pylint and pytest
      
      * replaced StandardError
      
      * exclude useless-object-inheritance
      
      * python3 compatibilities uncovered by linting
      
      * fixed tests
      
      * fixed more python3 test incompatibilities
    • normalize apos (#144) · d564a66c
      Daniel Ecer authored
      v0.0.1
  16. Jun 05, 2019
    • added autocut model (#106) · 85754f2d
      Daniel Ecer authored
      * added dev-venv target
      
      * added subextract model
      
      * added nltk dependency
      
      * flake8 ignore line break before binary operator
      
      * moved dev dependencies up
      
      * added nltk punkt download
      
      * added nltk download to dev-venv; pytest and pytest-not-slow target
      
      * added subextract training pipeline
      
      * added optional xpath namespaces
      
      * log failed xml file
      
      * use recover parser option
      
      * added subextract app
      
      * start subextract server
      
      * renamed to autocut
      
      * declare slow and very_slow pytest markers
      
      * mark autocut main test as slow
      
      * fixed post data
      
      * updated README
      
      * also build non-dev image as part of ci
      
      * added pytest.ini to dev image
  17. Jun 03, 2019