  1. Jul 19, 2021
  2. Jul 14, 2021
  3. Jul 09, 2021
  4. Jul 07, 2021
  5. Jun 28, 2021
  6. Jun 23, 2021
  7. Jun 14, 2021
  8. Jun 09, 2021
  9. Jun 01, 2021
  10. May 25, 2021
  11. May 24, 2021
  12. May 18, 2021
    • create vocabulary (#349) · e3ec9802
      Daniel Ecer authored
      * initial create vocabulary utility
      
      * extract vocabulary from embeddings
      
      * renamed to --output-word-count-file
      
      * added main call
      
      * extracted iter_tokenized_tokens
      
      * avoid empty tokens
      
      * using tokenizer from delft
      
      * optionally sort by count
      
      * added file list support
      
      * added support for remote files
      
      * added limit argument
      
      * added fsspec dependency
      
      * optionally use multithreading or multiprocessing
      
      * included full github link
      
      * renamed to create_vocabulary
      
      * moved to tools vocabulary
      
      * filter embeddings
      
      * renamed to embeddings
      
      * using fsspec to open embeddings file when extracting
      
      * use fsspec when filtering embeddings
      
      * document tools
      
      * added link to tools.md
  13. May 13, 2021
  14. May 12, 2021
  15. May 06, 2021
  16. May 05, 2021
  17. May 03, 2021
  18. Apr 26, 2021
  19. Apr 22, 2021
  20. Apr 14, 2021
  21. Apr 08, 2021
  22. Apr 06, 2021
  23. Apr 05, 2021
  24. Mar 22, 2021
  25. Mar 11, 2021
  26. Mar 08, 2021
  27. Feb 26, 2021
  28. Feb 23, 2021
  29. Feb 22, 2021
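
The "create vocabulary (#349)" commit (e3ec9802, May 18, 2021) describes a utility that tokenizes text, skips empty tokens, counts words, and optionally sorts by count. A minimal sketch of that idea follows; it is not the commit's actual code. The function name `iter_tokenized_tokens` is taken from the commit message, but its body, the regex tokenizer (a stand-in for the delft tokenizer the commit uses), the helper `create_word_counts`, and the meaning given to `limit` (capping the number of returned entries) are all assumptions made for illustration. Remote file support via fsspec is left out to keep the sketch self-contained.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumption: a simple regex stands in for the delft tokenizer named in the commit.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def iter_tokenized_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the non-empty tokens of each input line, in order."""
    for line in lines:
        for token in TOKEN_PATTERN.findall(line):
            if token:  # avoid empty tokens
                yield token


def create_word_counts(
    lines: Iterable[str],
    sort_by_count: bool = False,
    limit: Optional[int] = None,
) -> List[Tuple[str, int]]:
    """Count tokens and return (token, count) pairs.

    Hypothetical helper: `limit` here caps the number of returned entries,
    which may differ from what the commit's limit argument does.
    """
    counts = Counter(iter_tokenized_tokens(lines))
    if sort_by_count:
        items = counts.most_common()  # highest count first
    else:
        items = list(counts.items())  # first-seen order
    return items[:limit]  # a limit of None keeps every entry
```

For example, `create_word_counts(["a b b", "c c c"], sort_by_count=True, limit=2)` returns `[("c", 3), ("b", 2)]`, while the unsorted call keeps first-seen order.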