README.md · develop · Sciencebeam / Sciencebeam Gym

Daniel Ecer authored May 18, 2021

* initial create vocabulary utility

* extract vocabulary from embeddings

* renamed to --output-word-count-file

* added main call

* extracted iter_tokenized_tokens

* avoid empty tokens

* using tokenizer from delft

* optionally sort by count

* added file list support

* added support for remote files

* added limit argument

* added fsspec dependency

* optionally use multithreading or multiprocessing

* included full GitHub link

* renamed to create_vocabulary

* moved to tools vocabulary

* filter embeddings

* renamed to embeddings

* using fsspec to open embeddings file when extracting

* use fsspec when filtering embeddings

* document tools

* added link to tools.md

e3ec9802