transformers.js

History

Joshua Lochner db7d0f0f83 Tokenization improvements (#234)	2023-08-08 12:11:35 +02:00
vite.svg	Tokenization improvements (#234)	2023-08-08 12:11:35 +02:00

* Create basic tokenizer playground app

* Default to no display when user adding large body of text

* Optimize BPE algorithm

- Use map instead of object for `bpe_ranks`
- Replace reduction in BPE algorithm with for loop
- Avoid conversions between sets and arrays

* Use for loop to avoid stack issues with `.push(...items)`

* Fix `mergeArrays` typing

* Remove unnecessary try-catch block in BPE

* Add Llama, T5, and BERT tokenizers to the playground

* Improve how BERT/T5 tokens are displayed

* Improve how token margins are displayed

* Use `Map` for cache

* Add efficient heap-based priority queue implementation

* Add more unit tests for LlamaTokenizer

Selected from https://github.com/belladoreai/llama-tokenizer-js/blob/master/llama-tokenizer.js#L381-L452

* Implement priority-queue-based BPE algorithm

* Remove old code

* Update `bpe` docstring

* Add `data-structures` page to docs

* Update JSDoc for data-structures.js

* Update data-structures.js

* Move `TokenLattice` and `CharTrie` to data-structures module

* Minor refactoring
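
The "Use for loop to avoid stack issues with `.push(...items)`" item can be illustrated with a minimal sketch. It assumes the problem is the usual one: spreading a very large array into `push` passes each element as a separate call argument, which can raise a `RangeError` in some engines. The names `appendAll` and `mergeArraysSketch` are hypothetical and are not the library's actual code.

```javascript
// Hypothetical helper for illustration only; not the transformers.js implementation.
// Appends every element of `items` to `target` without using spread.
function appendAll(target, items) {
    // `target.push(...items)` would pass each element as its own argument,
    // which may exceed the engine's argument/stack limit for large inputs.
    for (let i = 0; i < items.length; ++i) {
        target.push(items[i]);
    }
    return target;
}

// Hypothetical sketch of merging several arrays with the same loop-based approach.
function mergeArraysSketch(...arrays) {
    const merged = [];
    for (const arr of arrays) {
        appendAll(merged, arr);
    }
    return merged;
}

// Usage: appending one million elements works regardless of argument limits.
const big = new Array(1_000_000).fill(0);
const out = appendAll([1, 2], big);
console.log(out.length); // 1000002
```

The loop trades a little brevity for predictable behaviour on large inputs, which matters when tokenizing a large body of text.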
