* Add `M2M100Tokenizer`
* Allow `added_tokens` list to be empty
* Apply hot-fix for issue in HF's `M2M100Tokenizer`
* Skip M2M100 tokenizer tests for now
TODO: Remove when https://github.com/huggingface/transformers/pull/25478 is merged
* Fix `_build_translation_inputs` for `M2M100Tokenizer`
* Add example code in JSDoc for `TranslationPipeline`
* Update supported_models.py
* Create basic tokenizer playground app
* Default to no display when the user adds a large body of text
* Optimize BPE algorithm
- Use map instead of object for `bpe_ranks`
- Replace reduction in BPE algorithm with for loop
- Avoid conversions between sets and arrays
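The switch to a `Map` for `bpe_ranks` might look like the following sketch. The key format (the two merge tokens joined by a space, as they appear in a merges file) and the `getRank` helper name are assumptions for illustration, not the library's actual API.

```javascript
// Merge ranks keyed by "left right" pair strings, as read from a merges file.
// A Map avoids prototype-chain pitfalls of plain objects (e.g. a pair that
// happens to be named "constructor") and is faster for frequent lookups.
const bpe_ranks = new Map([
    ['e r', 0],
    ['er </w>', 1],
]);

// Hypothetical helper: look up a pair's merge rank; unknown pairs never merge.
function getRank(pair) {
    return bpe_ranks.get(pair) ?? Infinity;
}
```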
* Use for loop to avoid stack issues with `.push(...items)`
* Fix `mergeArrays` typing
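The stack-safety issue above comes from `.push(...items)` spreading every element into function arguments, which can overflow the call stack for very large arrays. A plain loop avoids this; a minimal sketch of a `mergeArrays`-style helper (signature assumed):

```javascript
// Concatenate any number of arrays without spreading elements as arguments,
// so very large inputs cannot overflow the call stack.
function mergeArrays(...arrs) {
    const result = [];
    for (const arr of arrs) {
        for (const item of arr) {
            result.push(item);
        }
    }
    return result;
}
```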
* Remove unnecessary try-catch block in BPE
* Add Llama, T5, and BERT tokenizers to the playground
* Improve how BERT/T5 tokens are displayed
* Improve how token margins are displayed
* Use `Map` for cache
* Add efficient heap-based priority queue implementation
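A heap-based priority queue for BPE might be sketched as below. This is a generic binary-heap implementation under an assumed comparator-based API, not the library's exact class:

```javascript
// Minimal binary-heap priority queue. The comparator returns true when `a`
// should be popped before `b` (default: max-heap on plain values).
class PriorityQueue {
    constructor(comparator = (a, b) => a > b) {
        this._heap = [];
        this._comparator = comparator;
    }
    get size() { return this._heap.length; }
    push(value) {
        this._heap.push(value);
        this._siftUp(this._heap.length - 1);
    }
    pop() {
        const top = this._heap[0];
        const last = this._heap.pop();
        if (this._heap.length > 0) {
            this._heap[0] = last;
            this._siftDown(0);
        }
        return top;
    }
    _siftUp(i) {
        while (i > 0) {
            const parent = (i - 1) >> 1;
            if (!this._comparator(this._heap[i], this._heap[parent])) break;
            [this._heap[i], this._heap[parent]] = [this._heap[parent], this._heap[i]];
            i = parent;
        }
    }
    _siftDown(i) {
        const n = this._heap.length;
        while (true) {
            let best = i;
            const l = 2 * i + 1, r = 2 * i + 2;
            if (l < n && this._comparator(this._heap[l], this._heap[best])) best = l;
            if (r < n && this._comparator(this._heap[r], this._heap[best])) best = r;
            if (best === i) break;
            [this._heap[i], this._heap[best]] = [this._heap[best], this._heap[i]];
            i = best;
        }
    }
}
```

For BPE, the comparator would compare merge ranks so the lowest-rank (highest-priority) merge is popped first; both `push` and `pop` are O(log n).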
* Add more unit tests for LlamaTokenizer
Selected from https://github.com/belladoreai/llama-tokenizer-js/blob/master/llama-tokenizer.js#L381-L452
* Implement priority-queue-based BPE algorithm
* Remove old code
* Update `bpe` docstring
* Add `data-structures` page to docs
* Update JSDoc for data-structures.js
* Update data-structures.js
* Move `TokenLattice` and `CharTrie` to data-structures module
* Minor refactoring
* Update extension to be module
* Update example extension
* Allow user to specify a custom cache system
* Implement custom cache system
Emulates the Web Cache API using Chrome's local storage API
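A hypothetical sketch of such a cache: it exposes the `match`/`put` shape of the Web Cache API, but the storage adapter (in the extension, presumably `chrome.storage.local`) is abstracted here as an injected async key-value store, and values are stored as text because local storage is JSON-based.

```javascript
// Minimal Web-Cache-API-shaped wrapper over any async key-value store.
// `storage` is assumed to provide `get(key)` and `set(key, value)`.
class CustomCache {
    constructor(storage) {
        this.storage = storage;
    }
    async match(request) {
        const stored = await this.storage.get(request);
        // Rewrap the serialized body so callers see a normal Response.
        return stored === undefined ? undefined : new Response(stored);
    }
    async put(request, response) {
        // Local storage only holds JSON-serializable values, hence text.
        await this.storage.set(request, await response.text());
    }
}
```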
* Use custom cache system in extension
* Fix serialization
* Remove old folders
* Update extension readme
* Add note about JSON requirement for local storage
* Define custom CLIP ONNX configs
* Update conversion script
* Support specifying custom model file name
* Use int64 for CLIP input ids
* Add support for CLIP text and vision models
* Fix JSDoc
* Add docs for `CLIPTextModelWithProjection`
* Add docs for `CLIPVisionModelWithProjection`
* Add unit test for CLIP text models
* Add unit test for CLIP vision models
* Set resize precision to 3 decimal places
* Fix `RawImage.save()` function
* Throw error when reading image and status != 200
* Create basic semantic image search application
* Separate out components
* Add `update-database` script
* Update transformers.js version
* Add new tokenizer unit test (#199)
* Perform `NFKC` normalization for sentencepiece models w/ precompiled charmap
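JavaScript exposes Unicode normalization natively, so the standard NFKC part of this step can be approximated with `String.prototype.normalize` (a precompiled charmap may carry additional model-specific mappings beyond this; the helper name below is illustrative):

```javascript
// Apply standard Unicode NFKC normalization: compatibility characters
// collapse to their plain equivalents.
function nfkcNormalize(text) {
    return text.normalize('NFKC');
}

nfkcNormalize('ﬁ');  // 'fi' (U+FB01 ligature)
nfkcNormalize('①'); // '1' (U+2460 circled digit)
```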
* Fix JSDoc indentation
* Add problematic string to unit tests
* Use consistent BPE split token
* Add second problematic string
* Support outputting attentions in generate function
* Add unit tests for concatenating tensors
* Implement `cat` for `dim>0`
* Add `cat` unit tests for > 2 tensors
* Allow for negative indexing + bounds checking
* Add test case for `cat` with negative indexing
* Clean up `safeIndex` helper function
* Allow indexing error message to include dimension
* Reuse `safeIndex` helper function for `normalize_`
* Optimize `cat` indexing
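The bounds-checked negative-indexing helper described above might look like this sketch (name and signature inferred from the commit messages, not confirmed against the source):

```javascript
// Convert a possibly-negative index into a bounds-checked non-negative one.
// `dimension` is optional and only enriches the error message.
function safeIndex(index, size, dimension = null) {
    if (index < -size || index >= size) {
        throw new Error(
            `IndexError: index ${index} is out of bounds for dimension`
            + `${dimension === null ? '' : ' ' + dimension} with size ${size}`
        );
    }
    // Negative indices count from the end, as in Python/PyTorch.
    return index < 0 ? index + size : index;
}
```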
* Implement `stack` tensor operation
+ add unit tests
* Add TODOs
* Implement `mean` tensor operation
* Implement `std_mean` tensor ops
* Fix order of `std_mean` returns
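Over a flat array, `std_mean` can be sketched as below. It mirrors `torch.std_mean`, which returns the standard deviation first and the mean second (the ordering the fix above aligns with); the Bessel-corrected default is an assumption.

```javascript
// Return [std, mean] of a flat numeric array, in PyTorch's std_mean order.
// `correction = 1` gives the Bessel-corrected (sample) standard deviation.
function std_mean(data, correction = 1) {
    const mean = data.reduce((a, b) => a + b, 0) / data.length;
    const variance = data.reduce((a, b) => a + (b - mean) ** 2, 0)
        / (data.length - correction);
    return [Math.sqrt(variance), mean];
}
```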
* Implement median filter
* Implement dynamic time warping
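A 1-D median filter of the kind used to smooth cross-attention weights for word-level timestamps can be sketched as follows; reflect padding at the boundaries is an assumption here:

```javascript
// Sliding-window median over a 1-D array with reflect padding at the edges.
// The window size must be a positive odd number so the median is well-defined.
function medianFilter(data, windowSize) {
    if (windowSize % 2 === 0 || windowSize <= 0) {
        throw new Error('Window size must be a positive odd number');
    }
    const half = Math.floor(windowSize / 2);
    const output = new Array(data.length);
    for (let i = 0; i < data.length; ++i) {
        const window = [];
        for (let j = -half; j <= half; ++j) {
            let idx = i + j;
            // Reflect out-of-range indices back into the array.
            if (idx < 0) idx = -idx;
            if (idx >= data.length) idx = 2 * data.length - idx - 2;
            window.push(data[idx]);
        }
        window.sort((a, b) => a - b);
        output[i] = window[half];
    }
    return output;
}
```

The filter suppresses isolated spikes (like the `100` below) while leaving smooth regions mostly unchanged.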
* Implement `neg` tensor op
* Throw error if audio sent to processor is not a `Float32Array`
* Add `round` helper function
* [WIP] Implement basic version of word-level-timestamps
Known issues:
- timestamps not correct for index > 0
- punctuation not the same as in the Python version
* Fix typo
* Fix timestamps
* Round to 2 decimals
* Fix punctuation
* Fix typing
* Remove debug statements
* Cleanup code
* Cleanup
* Remove debug statements
* Update JSDoc for extract token timestamps function
* Add return type for `std_mean` tensor function
* Improve typing of private whisper tokenizer functions
* Indicate method is private
* Allow whisper feature extractor to be called with Float64Array input
* Fix typo
* Throw error if `cross_attentions` are not present in model output when extracting token timestamps
* Throw error during generate function
* Allow whisper models to be exported with `output_attentions=True`
* Add alignment heads to generation config
* Remove print statement
* Update versions
* Override protobufjs version
* Update package-lock.json
* Require onnx==1.13.1 for conversion
Will update once onnxruntime-web supports onnx IR version 9
* Add unit test for word-level timestamps
* Extract add attentions function out of `generate`
* Fix `findLongestCommonSequence` return types
* Downgrade back to onnxruntime 1.14.0
1.15.1 is a little too unstable right now.
* Cleanup
- use `.map`
- rename variables
* Update comments
* Add examples for how to transcribe w/ word-level timestamps
* Add example for transcribing/translating audio longer than 30 seconds
* Make example more compact
* Add example code for running text-generation models
* Fix non-greedy sampling functions
* Update samplers
* Remove duplicate requirement
`onnxruntime` is specified in `optimum[onnxruntime]`
* Align `generate` function output with python library
Include starting tokens in output
* [docs] Add example text-generation code
* Update demo site text streaming for causal language models
* Override default code highlighting for operators
* Fix order of link
* Link to the conversion Space for maximum simplicity
* Add some types to script (very optional)
* Fix typo
* No need for trailing slash here
* Node is also a valid option
* Document how to find a compatible checkpoint on the hub
* Update README
* Fix typing
* Update docs index
---------
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Recursively replace tensors with custom class
* Add mobile vit models
* Add example code for `ImageClassificationPipeline`
* Fix example urls
* Add MobileViT models and processors
* Update optimum requirement in conversion script
Previous name is deprecated
* Update supported models
* Update supported_models.py
* Update supported_models.py
* Update tokenizer test generator script
* Add special test case for falcon tokenizers
* Update tokenizer test script
* Add support for `FalconTokenizer`
* Update `BertPreTokenizer` call parameter types
* Add `GPTNeoXTokenizer` tokenizer (mpt)
* Use transformers from source when testing
* Reuse `prepare_model_inputs` function type
Better than using `@see {@link ... }`, since it works with IntelliSense.
* Allow user to set `per_channel` and `reduce_range` quantization parameters (#156)
Also save quantization options
* Get operators of graph and subgraphs
* Only run encoder with required inputs
* Add basic whisper unit tests
* Add newline after heading for docs
* Add unit test for transcribing english with timestamps
* Add multilingual test case