  1. Sep 15, 2021
  2. Sep 14, 2021
    • added bounding box pipeline (#403) · 603edca9
      Daniel Ecer authored
      * process pdf and xml file lists
      
      * allow sub directories in output path
      
      * make output_annotated_images_path relative to output
      
      * make sure that output directory is created
      
      * use write_bytes in favour of explicit makedirs (cloud ready)
      
      * make --pdf-base-path required
      
      * added pipeline
      
      * not extending ABC due to serialization errors
      
      * log pipeline options
      
      * added test to check serialization
      
      * changed super import to avoid one of the serialization errors in Dataflow
      
      * moved most functionality to separate module
      
      * added libgl1 to setup.py
      
      * added PreventFusion
      
      * minor import grouping
      
      * added TransformAndCount
      
      * reverted super __init__ call
      
      * use parse args, not ignoring unknown args
      
      * expose all of the worker arguments
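
The switch to strict argument parsing noted above ("use parse args, not ignoring unknown args") can be illustrated with the standard library. This is a minimal sketch, not the project's actual CLI: the parser, its `prog` name and both helper functions are hypothetical, and only the `--pdf-base-path` flag is taken from the log. It assumes the earlier behaviour relied on `parse_known_args`, which the log does not state.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only --pdf-base-path appears in the commit log.
    parser = argparse.ArgumentParser(prog="bounding-box-pipeline")
    parser.add_argument("--pdf-base-path", required=True)
    return parser


def parse_strict(argv: List[str]) -> argparse.Namespace:
    # parse_args() exits with an error when it sees an unknown argument,
    # so a mistyped flag is reported instead of being silently dropped.
    return build_parser().parse_args(argv)


def parse_lenient(argv: List[str]) -> Tuple[argparse.Namespace, List[str]]:
    # parse_known_args() returns the unrecognised arguments separately,
    # which makes it easy to ignore them by accident.
    return build_parser().parse_known_args(argv)
```

With strict parsing, a call such as `parse_strict(["--pdf-base-path", "data", "--typo-flag", "1"])` raises `SystemExit`, whereas `parse_lenient` returns the unknown arguments alongside the namespace.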
  3. Sep 07, 2021
    • check bounding box; fixed image cache key (#400) · f69743c2
      Daniel Ecer authored
      * calculate structural similarity using skimage
      
      * fixed crop_image_to_bounding_box
      
      * output score in JSON
      
      * optionally output images with bounding boxes
      
      * display bbox label inside if not enough space above
      
      * sort by score, then key points
      
      * fixed cache issue by using explicit cache key prefix
      
      (otherwise ids may have been reused after memory was freed)
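
The cache fix described above can be sketched as follows. In CPython, `id()` values are only unique among objects that are alive at the same time, so a cache keyed on an object's id may return a stale entry once that object has been freed and a new one receives the same id. The class and method names below are hypothetical and only illustrate the idea of a caller-supplied key prefix; they are not taken from the repository.

```python
from typing import Any, Callable, Dict


class PrefixedCache:
    """Cache keyed by an explicit, caller-supplied prefix plus an operation name."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}

    def get_or_compute(
        self,
        key_prefix: str,
        name: str,
        compute: Callable[[], Any],
    ) -> Any:
        # The key never depends on id(obj), so it stays valid even if the
        # underlying object is garbage collected and its id is reused.
        key = f"{key_prefix}-{name}"
        if key not in self._values:
            self._values[key] = compute()
        return self._values[key]
```

A caller might use a stable prefix such as `"image-0"` (for example, the position of an image in the input list), so that two different images can never share a cache entry.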
  4. Sep 03, 2021
    • raise error when figure bounding box could not be found (#399) · 46e14b3c
      Daniel Ecer authored
      * raise GraphicImageNotFoundError
      
      * allow skipping errors
    • figure image bounding box annotation for single document (#389) · 0bf9e780
      Daniel Ecer authored
      * enable debug logging for tests
      
      * added cli scaffolding
      
      * extract images from pdf
      
      * fixed type hint
      
      * added bounding box to_list
      
      * converted bounding box to named tuple
      
      * added tests for validate
      
      * implemented bounding box intersection
      
      * implemented finding bounding boxes of single image
      
      * added test for smaller partial image
      
      * added libgl1 for open cv
      
      * linting: use with statement for Popen
      
      * added support for multiple image files
      
      * added support for xml files
      
      * join graphic href with xml dirname
      
      * renamed cv2 to cv
      
      * using ObjectDetectorMatcher
      
      * moved functions to image object matching module
      
      * added TestGetObjectMatch
      
      * added ImageObjectMarchResult
      
      * added test_should_match_smaller_image
      
      * added test_should_match_smaller_rotated_90_image
      
      * fixed typo ImageObjectMatchResult
      
      * moved object_detector_matcher parameter down
      
      * added get_image_list_object_match
      
      * added su...
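
Several entries above concern the bounding box type (conversion to a named tuple, `to_list`, and intersection). The following is a minimal sketch of what such a type could look like. The field names, the `is_empty` helper and the choice of an `(x, y, width, height)` layout are assumptions made for illustration; the log does not show the actual implementation.

```python
from typing import List, NamedTuple


class BoundingBox(NamedTuple):
    # Assumed layout: top-left corner plus size.
    x: float
    y: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.x + self.width

    @property
    def bottom(self) -> float:
        return self.y + self.height

    def is_empty(self) -> bool:
        return self.width <= 0 or self.height <= 0

    def intersection(self, other: "BoundingBox") -> "BoundingBox":
        # The overlap starts at the larger of the two origins and ends at
        # the smaller of the two far edges; sizes are clamped at zero so
        # that disjoint boxes yield an empty box.
        x = max(self.x, other.x)
        y = max(self.y, other.y)
        right = min(self.right, other.right)
        bottom = min(self.bottom, other.bottom)
        return BoundingBox(x, y, max(0, right - x), max(0, bottom - y))

    def to_list(self) -> List[float]:
        return [self.x, self.y, self.width, self.height]
```

Because a named tuple is immutable and hashable, instances can be compared with `==` in tests and used as dictionary keys.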
  5. Aug 26, 2021
    • added mypy linting (#388) · e975eede
      Daniel Ecer authored
      * added mypy dependency
      
      * added dev-mypy
      
      * added mypy make target
      
      * declare EMPTY class prop
      
      * removed incorrect tensors type hint
      
      * added type hint to excluded_tokens
      
      * removed unused ProcessedWrapper
      
      * replacing backports.tempfile with builtin
      
      * added T_ArgumentParserOrGroup
      
      * fixed iter_tokenized_tokens return type hint
      
      * added types-requests
      
      * added type to DEFAULT_ANNOTATORS
      
      * removed blank line
      
      * ignore distutils import
      
      * replaced T_Element with etree.ElementBase
      
      * changed type check back to etree._Element
      
      * replaced project_tests.sh
      
      * removed second mypy make target dependency
  6. May 18, 2021
    • create vocabulary (#349) · e3ec9802
      Daniel Ecer authored
      * initial create vocabulary utility
      
      * extract vocabulary from embeddings
      
      * renamed to --output-word-count-file
      
      * added main call
      
      * extracted iter_tokenized_tokens
      
      * avoid empty tokens
      
      * using tokenizer from delft
      
      * optionally sort by count
      
      * added file list support
      
      * added support for remote files
      
      * added limit argument
      
      * added fsspec dependency
      
      * optionally use multi threading or processing
      
      * included full github link
      
      * renamed to create_vocabulary
      
      * moved to tools vocabulary
      
      * filter embeddings
      
      * renamed to embeddings
      
      * using fsspec to open embeddings file when extracting
      
      * use fsspec when filtering embeddings
      
      * document tools
      
      * added link to tools.md
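
The word-count part of the vocabulary utility listed above (tokenize, avoid empty tokens, optionally sort by count, apply a limit) can be sketched with the standard library. The name `iter_tokenized_tokens` comes from the log, but its signature here is a guess, `create_word_counts` is hypothetical, and a plain whitespace split stands in for the tokenizer from delft that the log mentions.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_tokenized_tokens(lines: Iterable[str]) -> Iterator[str]:
    # Stand-in tokenizer: splits on whitespace and skips empty tokens.
    for line in lines:
        for token in re.split(r"\s+", line):
            if token:
                yield token


def create_word_counts(
    lines: Iterable[str],
    sort_by_count: bool = False,
    limit: Optional[int] = None,
) -> List[Tuple[str, int]]:
    counts = Counter(iter_tokenized_tokens(lines))
    # most_common() orders by descending count; otherwise keep first-seen order.
    items = counts.most_common() if sort_by_count else list(counts.items())
    # Slicing with limit=None keeps every entry.
    return items[:limit]
```

For example, `create_word_counts(["a b a", "  b a "], sort_by_count=True)` returns `[("a", 3), ("b", 2)]`.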
  7. Jan 17, 2020
  8. Jun 03, 2019