README.md · 7de581aa08842c92f2e20242ae09d949b3d7a94a · Sciencebeam / Sciencebeam Gym

Daniel Ecer authored 3 years ago

* initial create vocabulary utility

* extract vocabulary from embeddings

* renamed to --output-word-count-file

* added main call

* extracted iter_tokenized_tokens

* avoid empty tokens

* using tokenizer from delft

* optionally sort by count

* added file list support

* added support for remote files

* added limit argument

* added fsspec dependency

* optionally use multi threading or processing

* included full github link

* renamed to create_vocabulary

* moved to tools vocabulary

* filter embeddings

* renamed to embeddings

* using fsspec to open embeddings file when extracting

* use fsspec when filtering embeddings

* document tools

* added link to tools.md

e3ec9802
