Commit Graph

1043 Commits

Author SHA1 Message Date
Joshua Lochner 5216fb461d
Fix `ByteLevel` pretokenizer
* Re-enable other whisper tests

* Fix `ByteLevel` pretokenizer

Only add a prefix space to the first word, when the option is enabled.
2023-09-10 00:37:04 +02:00
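
The `ByteLevel` fix above changes where the prefix space is applied. A minimal sketch of the corrected behaviour (not the library's actual code; the split pattern is simplified):

```js
// Prepend the prefix space once, to the start of the input, rather than to
// every pre-tokenized word, and only when the option is enabled.
function byteLevelPreTokenize(text, { add_prefix_space = true } = {}) {
    if (add_prefix_space && !text.startsWith(' ')) {
        text = ' ' + text;
    }
    // Simplified stand-in for the GPT-2 byte-level split pattern.
    return text.match(/ ?\S+/gu) ?? [];
}
```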
Joshua Lochner ad7e8758bc [version] Update to 2.6.0 2023-09-08 15:41:59 +02:00
Joshua Lochner 9a3339239e
New models and refactoring (#276)
* Add `CodeLlamaTokenizer`

* Add `codellama` for testing

* Update default quantization settings

* Refactor `PretrainedModel`

* Remove unnecessary error message

* Update llama-code-tokenizer test

* Add support for `GPTNeoX` models

* Fix `GPTNeoXPreTrainedModel` config

* Add support for `GPTJ` models

* Add support for `WavLM` models

* Update list of supported models

- CodeLlama
- GPT NeoX
- GPT-J
- WavLM

* Add support for XLM models

* Add support for `ResNet` models

* Add support for `BeiT` models

* Fix casing of `BeitModel`

* Remove duplicate code

* Update variable name

* Remove `ts-ignore`

* Remove unnecessary duplication

* Update demo model sizes

* [demo] Update default summarization parameters

* Update default quantization parameters for new models

* Remove duplication in mapping

* Update list of supported marian models

* Add support for `CamemBERT` models

* Add support for `MBart` models

* Add support for `OPT` models

* Add `MBartTokenizer` and `MBart50Tokenizer`

* Add example of multilingual translation with MBart models

* Add `CamembertTokenizer`

* Add support for `HerBERT` models

* Add support for `XLMTokenizer`

* Fix `fuse_unk` config

* Do not remove duplicate keys for `Unigram` models

See https://huggingface.co/camembert-base for an example of a Unigram tokenizer that has two tokens with the same value (`<unk>`)

* Update HerBERT supported model text

* Update generate_tests.py

* Update list of supported models

* Use enum object instead of classes for model types

Fixes https://github.com/xenova/transformers.js/issues/283

* Add link to issue

* Update dependencies for unit tests

* Add `sentencepiece` as a testing requirement

* Add `protobuf` to test dependency

* Remove duplicated models to test
2023-09-08 15:17:05 +02:00
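
The multilingual translation example added in the commit above looks roughly like the following; the checkpoint name and mBART-50 language codes are assumptions, so consult the library docs for the exact identifiers:

```js
import { pipeline } from '@xenova/transformers';

const translator = await pipeline('translation', 'Xenova/mbart-large-50-many-to-many-mmt');
const output = await translator('The head of the United Nations says there is no military solution in Syria', {
    src_lang: 'en_XX', // English
    tgt_lang: 'fr_XX', // French
});
```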
Joshua Lochner 109a7f9711 Fix unit test 2023-09-04 23:53:05 +02:00
Joshua Lochner dbea8a2990 Update to `checkout@v4`
See https://github.com/actions/checkout/issues/1448 for more info.
2023-09-04 23:20:57 +02:00
Hermann Rolfes 1488079f81
Make // @ts-ignore obsolete for _call overrides by respecting LSP (#278)
* Make // @ts-ignore obsolete for _call overrides by respecting LSP

* oops can't be undefined, back to how it was

* Use `...unused` instead to fix LSP errors
2023-09-04 23:06:44 +02:00
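
The pattern behind the `...unused` change is sketched below with illustrative classes (not the library's actual `Callable` implementation): overrides of `_call` keep a trailing rest parameter so their narrower signatures still satisfy the Liskov Substitution Principle, and TypeScript no longer needs `// @ts-ignore`.

```js
class Callable {
    /** @param {...any} args */
    _call(...args) {
        throw new Error('Must implement _call method in subclass');
    }
}

class TextPipeline extends Callable {
    /**
     * @param {string} text
     * @param {...any} unused
     */
    _call(text, ...unused) {
        return `processed: ${text}`;
    }
}
```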
Joshua Lochner 57f2b5cd17
Add support for MPT models (Fixes #166) (#272)
* Add support for MPT models

* Fix `use_cache_branch`

* Update list of supported models
2023-09-02 22:17:01 +02:00
Joshua Lochner 96b9143b33 Update masked-lm tests 2023-09-02 03:47:06 +02:00
Joshua Lochner 9077c21540
Add support for BLOOM models (#273)
* Add support for Bloom models

* Update `BloomTokenizer` to fix the default (invalid) regex

* Update supported models

* Update default quantization settings for bloom models

* Fix `use_cache_branch`
2023-09-01 22:07:04 +02:00
Joshua Lochner 62159eb383 Fix `CustomWhisperOnnxConfig` 2023-09-01 16:14:49 +02:00
Joshua Lochner 0c2dcc7498 [version] Update to 2.5.4 2023-08-28 20:07:06 +02:00
Joshua Lochner 09cf91abd0
Add `DeiT`, `Swin`, and `Yolos` vision models (#262)
* Add support for `DeiT` models

* Add `Swin` models for image classification

* Add support for `yolos` models

* Add `YolosFeatureExtractor`

* Remove unused import

* Update list of supported models

* Remove SAM for now

Move SAM support to the next release
2023-08-28 17:29:15 +02:00
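
Usage of the new vision models follows the existing pipeline API; a sketch, assuming converted checkpoints such as the ones named below exist on the Hub:

```js
import { pipeline } from '@xenova/transformers';

// Image classification with a Swin (or DeiT) checkpoint.
const classifier = await pipeline('image-classification', 'Xenova/swin-tiny-patch4-window7-224');
// Object detection with a YOLOS checkpoint.
const detector = await pipeline('object-detection', 'Xenova/yolos-tiny');

const url = 'https://example.com/cats.jpg';
console.log(await classifier(url));
console.log(await detector(url, { threshold: 0.9 }));
```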
Joshua Lochner f0573175fd Add `DeiTFeatureExtractor` 2023-08-26 23:54:27 +02:00
Per Harald Borgen 76b8556110
Rename how-to guides to developer guides (#261) 2023-08-25 17:56:18 +02:00
Joshua Lochner 7076c8e401 [version] Update to 2.5.3 2023-08-22 23:31:00 +02:00
josephrocca 9bb6923242
[docs] Add links and compatible models to supported tasks table (#257) 2023-08-22 23:19:48 +02:00
Joshua Lochner 3fab8265cb
Update whisper unit test (#258) 2023-08-22 22:18:17 +02:00
Joshua Lochner 9c449c151c
Fix caching for LFS files from the Hugging Face Hub (#251)
* Fix model caching for LFS files from the HF Hub

* Ignore local model check on demo site
2023-08-22 18:28:37 +02:00
Joshua Lochner f61cc66e0e Fix link to API reference 2023-08-22 17:19:49 +02:00
Joshua Lochner c3af596443
Fix word-level timestamps for non-English languages w/ Whisper (#253)
* Fix language detection

* Remove debug statement

* Fix punctuation regex for whisper decoding (Closes #223)

* Fix word-level timestamps for audio < 30 seconds

Issue in python library: https://github.com/huggingface/transformers/issues/25605
PR for above: https://github.com/huggingface/transformers/pull/25607

* Add multilingual transcription w/ word-level timestamps unit test

* Fix unit tests
2023-08-22 15:50:30 +02:00
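
Word-level timestamps for multilingual transcription are requested through the ASR pipeline; a sketch (the checkpoint name and audio URL are placeholders):

```js
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-small');
const output = await transcriber('https://example.com/french-audio.wav', {
    language: 'french',
    task: 'transcribe',
    return_timestamps: 'word',
});
// output.chunks contains { text, timestamp: [start, end] } entries per word.
```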
Joshua Lochner 276bdd06b8
Improve pipeline docs (w/ example code) - closes #134 (#255)
* Add example code for zero shot image classification

* Add example code for text classification pipeline

* Fix links to custom usage from pipelines docs

Reported on discord https://discord.com/channels/879548962464493619/1142943169068154950/1142943169068154950

* Fix relative links

* Rename .mdx -> .md

GitHub recently changed how mdx files are displayed, breaking a lot of the formatting. So, we just use .md now (same as transformers)

* Add example code for token classification pipeline

* Add example code for fill-mask pipeline

* Add text2text and summarization pipeline examples

* Add example code for image segmentation pipeline

* Remove redundant `@extends Pipeline`

* Add example code for image-to-text pipeline

* Cleanup example code outputs

* Cleanup JSDoc

* Cleanup pipeline example code

* Update codegen example
2023-08-22 04:30:56 +02:00
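
The example code added throughout the pipeline docs follows the same shape; for instance, zero-shot image classification (model name, image URL, and scores are illustrative):

```js
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline('zero-shot-image-classification', 'Xenova/clip-vit-base-patch32');
const output = await classifier('https://example.com/tiger.jpg', ['tiger', 'horse', 'dog']);
// e.g. [{ score: 0.98, label: 'tiger' }, { score: 0.01, label: 'horse' }, { score: 0.01, label: 'dog' }]
```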
Joshua Lochner 254e99ef9a [version] Update to 2.5.2 2023-08-14 22:55:54 +02:00
Joshua Lochner d479953a62
[WIP] Add MMS and Wav2Vec2 models (Closes #209) (#220)
* Add example `wav2vec2` models

* Add support for `CTCDecoder` and `Wav2Vec2CTCTokenizer`

* Generate tokenizer.json files for wav2vec2 models

* Fix wav2vec2 custom tokenizer generation

* Implement wav2vec2 audio-speech-recognition

* Add `Wav2Vec2` as a supported architecture

* Update README.md

* Update generate_tests.py

* Ignore invalid tests

* Update supported wav2vec2 models

* Update supported_models.py

* Simplify pipeline construction

* Implement basic audio classification pipeline

* Update default topk value for audio classification pipeline

* Add example usage for the audio classification pipeline

* Move `loadAudio` to utils file

* Add audio classification unit test

* Add wav2vec2 ASR unit test

* Improve generated wav2vec2 tokenizer json

* Update supported_models.py

* Allow `added_tokens_regex` to be null

* Support exporting mms vocabs

* Support nested vocabularies

* Update supported tasks and models

* Add warnings to ignore language and task for wav2vec2 models

Will add in a future release

* Mark internal methods as private

* Add typing to audio variable

* Update node-audio-processing.mdx

* Move node-audio-processing to guides

* Update table of contents

* Add example code for performing feature extraction w/ `Wav2Vec2Model`

NOTE: feature extraction of MMS models is currently broken in the python library, but it works correctly here. See
https://github.com/huggingface/transformers/issues/25485 for more info

* Refactor `Pipeline` class params

* Fix `pipeline` function

* Fix typo in `pipeline` JSDoc

* Fix second typo
2023-08-14 22:18:44 +02:00
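
With the wav2vec2 work above, speech recognition and audio classification both go through pipelines; a sketch (checkpoint names are assumptions):

```js
import { pipeline } from '@xenova/transformers';

// CTC-based speech recognition with a converted wav2vec2 checkpoint.
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/wav2vec2-base-960h');
console.log(await transcriber('https://example.com/speech.wav'));

// Audio classification (e.g. MMS language identification).
const classifier = await pipeline('audio-classification', 'Xenova/mms-lid-126');
console.log(await classifier('https://example.com/speech.wav', { topk: 5 }));
```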
Joshua Lochner 060ac830fc
Add M2M100 tokenizer (Closes #235) (#250)
* Add `M2M100Tokenizer`

* Allow `added_tokens` list to be empty

* Apply hot-fix for issue in HF's `M2M100Tokenizer`

* Skip M2M100 tokenizer tests for now

TODO: Remove when https://github.com/huggingface/transformers/pull/25478 is merged

* Fix `_build_translation_inputs` for `M2M100Tokenizer`

* Add example code in JSDoc for `TranslationPipeline`

* Update supported_models.py
2023-08-14 17:22:20 +02:00
Joshua Lochner cc4b857d54
Add problem type (Fixes #248) (#249)
* Add support for `problem_type` in text classification

* Add unit test for `multi_label_classification` problem type

* Update supported_models.py
2023-08-14 16:35:13 +02:00
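
The `problem_type` support boils down to choosing which activation is applied to the logits; a rough sketch of the idea (not the library's code):

```js
function scoresFromLogits(logits, problem_type) {
    if (problem_type === 'multi_label_classification') {
        // Multi-label: independent sigmoid per class.
        return logits.map(x => 1 / (1 + Math.exp(-x)));
    }
    // Single-label (default): softmax over all classes.
    const max = Math.max(...logits);
    const exps = logits.map(x => Math.exp(x - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(x => x / sum);
}
```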
Joshua Lochner d7a734342c Update tokenizer example documentation (Closes #245) 2023-08-13 23:24:50 +02:00
Joshua Lochner 2f70a5d37c
Fix typo in supported-tasks snippet 2023-08-11 01:45:39 +02:00
Celso Dias cfdfe9c6f1
Correct word in readme (#247) 2023-08-11 01:41:20 +02:00
Joshua Lochner b420a8841e [version] Update to 2.5.1 2023-08-09 22:25:53 +02:00
Joshua Lochner 46dd49064f
[Llama + LLama2] Add model support (#232)
* Add support for llama models

* Fix JSDoc
2023-08-09 13:35:28 +02:00
Joshua Lochner 1e157ba2d8
Add support for Deberta models (#244)
* add documentation for zero shot classification

* add multi_label example

* review comments

* edit examples data

* Add deberta and deberta-v2 model definitions

* Update model mapping

* Implement missing `Strip` normalizer

* Add deberta and deberta-v2 tokenizers

* Add fast path to `Strip` normalizer

* Add token types to deberta tokenizer output

* Update supported_models.py

* Fix default Precompiled normalization

* Update supported models list

* Update JSDoc

* Support `not_entailment` label

* Update multi-label example JSDoc

---------

Co-authored-by: Aschen <amaret93@gmail.com>
2023-08-09 11:58:16 +02:00
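
The zero-shot classification documentation added in this PR covers the `multi_label` option; a sketch of such a call (the model name is illustrative):

```js
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline('zero-shot-classification', 'Xenova/mobilebert-uncased-mnli');
const output = await classifier(
    'I have a problem with my iphone that needs to be resolved asap!',
    ['urgent', 'not urgent', 'phone', 'tablet', 'computer'],
    { multi_label: true },
);
// output.labels and output.scores are sorted by score; with multi_label, scores are independent.
```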
Joshua Lochner db7d0f0f83
Tokenization improvements (#234)
* Create basic tokenizer playground app

* Default to no display when user adding large body of text

* Optimize BPE algorithm

- Use map instead of object for `bpe_ranks`
- Replace reduction in BPE algorithm with for loop
- Avoid conversions between sets and arrays

* Use for loop to avoid stack issues with `.push(...items)`

* Fix `mergeArrays` typing

* Remove unnecessary try-catch block in BPE

* Add Llama, T5, and BERT tokenizers to the playground

* Improve how BERT/T5 tokens are displayed

* Improve how token margins are displayed

* Use `Map` for cache

* Add efficient heap-based priority queue implementation

* Add more unit tests for LlamaTokenizer

Selected from https://github.com/belladoreai/llama-tokenizer-js/blob/master/llama-tokenizer.js#L381-L452

* Implement priority-queue-based BPE algorithm

* Remove old code

* Update `bpe` docstring

* Add `data-structures` page to docs

* Update JSDoc for data-structures.js

* Update data-structures.js

* Move `TokenLattice` and `CharTrie` to data-structures module

* Minor refactoring
2023-08-08 12:11:35 +02:00
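
The core of the BPE optimization is storing merge ranks in a `Map` and repeatedly applying the best-ranked merge; a simplified greedy sketch (the library's version additionally uses the heap-based priority queue mentioned above):

```js
// `ranks` maps a space-joined symbol pair (e.g. 'h e') to its merge priority.
function bpe(word, ranks) {
    let symbols = Array.from(word);
    while (symbols.length > 1) {
        let best = null;
        for (let i = 0; i < symbols.length - 1; ++i) {
            const rank = ranks.get(symbols[i] + ' ' + symbols[i + 1]);
            if (rank !== undefined && (best === null || rank < best.rank)) {
                best = { rank, index: i };
            }
        }
        if (best === null) break; // no applicable merges left
        symbols.splice(best.index, 2, symbols[best.index] + symbols[best.index + 1]);
    }
    return symbols;
}
```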
Joshua Lochner ebc9722305 Update supported_models.py 2023-08-01 22:27:40 +02:00
Joshua Lochner d2a0aa9133 Add link to semantic image search application 2023-08-01 18:56:51 +02:00
Joshua Lochner a9a955c76f Update .env.local.example 2023-08-01 18:55:46 +02:00
Joshua Lochner 99db37864d Update semantic image search example README 2023-08-01 18:55:41 +02:00
Joshua Lochner b1537e28dc Create package-lock.json 2023-08-01 15:30:52 +02:00
Joshua Lochner 9aa1a29dac [version] Update to 2.5.0 2023-08-01 14:24:56 +02:00
Joshua Lochner f867226c7e
Improve browser extension sample/template (#196)
* Update extension to be module

* Update example extension

* Allow user to specify a custom cache system

* Implement custom cache system

Emulates the Web Cache API using chrome's local storage API

* Use custom cache system in extension

* Fix serialization

* Remove old folders

* Update extension readme

* Add note about JSON requirement for local storage
2023-08-01 14:23:21 +02:00
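
The custom cache system described above mirrors the two Web Cache API methods the library relies on; a sketch, assuming the extension wires an object like this into the library's custom-cache setting (exact option names may differ):

```js
// Emulate `Cache.match` / `Cache.put` on top of chrome.storage.local.
class ChromeStorageCache {
    async match(request) {
        const url = typeof request === 'string' ? request : request.url;
        const stored = await chrome.storage.local.get(url);
        return stored[url] === undefined ? undefined : new Response(stored[url]);
    }
    async put(request, response) {
        const url = typeof request === 'string' ? request : request.url;
        // chrome.storage.local only persists JSON-serializable values,
        // hence the note above about the JSON requirement.
        await chrome.storage.local.set({ [url]: await response.text() });
    }
}
```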
Joshua Lochner 2fde656791
Add support for computing CLIP image and text embeddings separately (Closes #148) (#227)
* Define custom CLIP ONNX configs

* Update conversion script

* Support specifying custom model file name

* Use int64 for CLIP input ids

* Add support for CLIP text and vision models

* Fix JSDoc

* Add docs for `CLIPTextModelWithProjection`

* Add docs for `CLIPVisionModelWithProjection`

* Add unit test for CLIP text models

* Add unit test for CLIP vision models

* Set resize precision to 3 decimal places

* Fix `RawImage.save()` function

* Throw error when reading image and status != 200

* Create basic semantic image search application

* Separate out components

* Add `update-database` script

* Update transformers.js version
2023-08-01 14:01:04 +02:00
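
Computing CLIP text embeddings on their own now looks roughly like this (the model id is an assumption; a matching `CLIPVisionModelWithProjection` call exists for images):

```js
import { AutoTokenizer, CLIPTextModelWithProjection } from '@xenova/transformers';

const model_id = 'Xenova/clip-vit-base-patch16';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);

const inputs = tokenizer(['a photo of a cat', 'a photo of a dog'], { padding: true, truncation: true });
const { text_embeds } = await text_model(inputs);
```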
Joshua Lochner 27920d8483 [version] Update to 2.4.4 2023-07-28 13:28:37 +02:00
Joshua Lochner 2015c685c7
Add Starcoder model support + demo (#225)
* Add support for `gpt_bigcode` models

* Create basic code-completion sample application

* Update sidebar

* Remove debug statement

* Disable 1B model (for now)

* Display progress bars

* Reuse config if not specified

* Update supported_models.py

* Update comment

* Add temperature/sample/topk generation params

* Update sidebar

* Add `gpt_bigcode` to supported models list

* Add code playground example

* Update title

* Cleanup

* Ignore `bigcode/starcoderbase-1b` from tests

* Update transformers.js version for demo
2023-07-28 13:24:32 +02:00
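
The temperature/sample/top-k parameters added for the demo map onto the text-generation pipeline's generation options; a sketch (the checkpoint name is illustrative):

```js
import { pipeline } from '@xenova/transformers';

const generator = await pipeline('text-generation', 'Xenova/tiny_starcoder_py');
const output = await generator('def fibonacci(n):', {
    max_new_tokens: 60,
    temperature: 0.7,
    do_sample: true,
    top_k: 5,
});
```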
Joshua Lochner da67f41434 [version] Update to 2.4.3 2023-07-27 06:06:42 +02:00
Joshua Lochner 961c0cf860 Add MPNet to README 2023-07-27 06:01:50 +02:00
Joshua Lochner f163f1a318
Add support for `mpnet` models (#221) 2023-07-27 05:59:23 +02:00
Joshua Lochner 09ff83b90e
Create example next.js application (Closes #210) (#211)
* Create example next app

* Link to example app

* Update next configs

* Create tutorial for next.js application

* Update next.js tutorial

* Rename project `next` -> `next-client`

* Clone `next-server` from `next-client`

* Update next.config.js for server-side inference

* Create basic server-side next.js application

* Update example links

* Update subheading for client-side next.js app

* Update next.config.js files

* Create example Dockerfile

* Update next tutorial to include server-side inference

* Improve wording

* Update Dockerfile

* Add step to create a Dockerfile

* Update examples snippet

* Fix wording
2023-07-26 01:48:13 +02:00
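
For the client-side variant of the Next.js example, the `next.config.js` changes amount to roughly the following; this is a sketch of the tutorial's approach, and the option values are assumptions:

```js
/** @type {import('next').NextConfig} */
const nextConfig = {
    output: 'export', // static export: inference runs entirely in the browser
    webpack: (config) => {
        config.resolve.alias = {
            ...config.resolve.alias,
            // Node-only optional dependencies that must not be bundled for the browser.
            'sharp$': false,
            'onnxruntime-node$': false,
        };
        return config;
    },
};

module.exports = nextConfig;
```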
Joshua Lochner f181e135d4 [version] Update to 2.4.2 2023-07-22 05:08:12 +02:00
Joshua Lochner 1165f04a9f
Fix BPE tokenization for weird whitespace characters (Closes #199) (#208)
* Add new tokenizer unit test (#199)

* Perform `NFKC` normalization for sentencepiece models w/ precompiled charmap

* Fix JSDoc indentation

* Add problematic string to unit tests

* Use consistent BPE split token

* Add second problematic string
2023-07-22 04:51:11 +02:00
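
The precompiled-charmap fix approximates sentencepiece's normalization with Unicode NFKC, which folds problematic whitespace and full-width characters into their plain forms; for example:

```js
const input = 'ｗｅｉｒｄ\u00a0spaces'; // full-width letters plus a no-break space
console.log(input.normalize('NFKC'));  // "weird spaces"
```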
Joshua Lochner 86e68bf9c0
Add support for private/gated model access (Closes #198) (#202)
* Allow user to specify HF token as an environment variable

* Add documentation for how to make authorized requests

* Improve docs
2023-07-21 17:31:37 +02:00
Joshua Lochner 00c0e2935e Fix documentation (Closes #201) 2023-07-21 16:57:46 +02:00