    create vocabulary (#349) · e3ec9802
    Daniel Ecer authored
    * initial create vocabulary utility
    
    * extract vocabulary from embeddings
    
    * renamed to --output-word-count-file
    
    * added main call
    
    * extracted iter_tokenized_tokens
    
    * avoid empty tokens
    
    * using tokenizer from delft
    
    * optionally sort by count
    
    * added file list support
    
    * added support for remote files
    
    * added limit argument
    
    * added fsspec dependency
    
    * optionally use multithreading or multiprocessing
    
    * included full GitHub link
    
    * renamed to create_vocabulary
    
    * moved to tools vocabulary
    
    * filter embeddings
    
    * renamed to embeddings
    
    * using fsspec to open embeddings file when extracting
    
    * use fsspec when filtering embeddings
    
    * document tools
    
    * added link to tools.md
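
    For orientation, a minimal sketch of the kind of word-count vocabulary
    utility the items above describe. It is not the code from this commit:
    the regex tokenizer stands in for the DeLFT tokenizer, the function
    signatures are hypothetical, and treating the limit as a cap on input
    lines is an assumption. Only the names iter_tokenized_tokens and
    create_vocabulary come from the commit message.

```python
import re
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumption: a simple regex stands in for the DeLFT tokenizer.
# It splits text into runs of word characters and single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def iter_tokenized_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the non-empty tokens of each input line."""
    for line in lines:
        for token in TOKEN_PATTERN.findall(line):
            token = token.strip()
            if token:  # avoid empty tokens
                yield token


def create_vocabulary(
    lines: Iterable[str],
    limit: Optional[int] = None,
    sort_by_count: bool = False,
) -> List[Tuple[str, int]]:
    """Count tokens in up to `limit` lines (hypothetical signature).

    Returns (token, count) pairs, either in first-seen order or,
    if `sort_by_count` is set, with the most frequent tokens first.
    """
    counts = Counter(iter_tokenized_tokens(islice(lines, limit)))
    if sort_by_count:
        return counts.most_common()
    return list(counts.items())


if __name__ == "__main__":
    sample = ["the cat, the dog", "the end"]
    for token, count in create_vocabulary(sample, sort_by_count=True):
        print(f"{token}\t{count}")
```

    Because create_vocabulary only needs an iterable of lines, the same
    sketch would also accept a text file object, for example one opened
    with fsspec.open(path, "rt") for remote files.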