Daniel Ecer authored
* initial create vocabulary utility
* extract vocabulary from embeddings
* renamed to --output-word-count-file
* added main call
* extracted iter_tokenized_tokens
* avoid empty tokens
* using tokenizer from delft
* optionally sort by count
* added file list support
* added support for remote files
* added limit argument
* added fsspec dependency
* optionally use multi threading or processing
* included full github link
* renamed to create_vocabulary
* moved to tools vocabulary
* filter embeddings
* renamed to embeddings
* using fsspec to open embeddings file when extracting
* use fsspec when filtering embeddings
* document tools
* added link to tools.md
Unverified e3ec9802