Commit Graph

1070 Commits

Joshua Lochner 060ac830fc
Add M2M100 tokenizer (Closes #235) (#250)
* Add `M2M100Tokenizer`

* Allow `added_tokens` list to be empty

* Apply hot-fix for issue in HF's `M2M100Tokenizer`

* Skip M2M100 tokenizer tests for now

TODO: Remove when https://github.com/huggingface/transformers/pull/25478 is merged

* Fix `_build_translation_inputs` for `M2M100Tokenizer`

* Add example code in JSDoc for `TranslationPipeline`

* Update supported_models.py
2023-08-14 17:22:20 +02:00
Joshua Lochner cc4b857d54
Add problem type (Fixes #248) (#249)
* Add support for `problem_type` in text classification

* Add unit test for `multi_label_classification` problem type

* Update supported_models.py
2023-08-14 16:35:13 +02:00
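For context on the `problem_type` change above: `multi_label_classification` means each class score is computed with an independent sigmoid rather than a softmax over all classes, so scores need not sum to 1. A minimal sketch of the two post-processing functions (not the library's actual implementation):

```javascript
// For problem_type === 'multi_label_classification', each logit is
// squashed independently with a sigmoid instead of a softmax over all
// classes. Illustrative sketch only.
function sigmoid(logits) {
  return logits.map((x) => 1 / (1 + Math.exp(-x)));
}

function softmax(logits) {
  const max = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Multi-label scores are independent and need not sum to 1:
const scores = sigmoid([2.0, -1.0, 0.0]);
```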
Joshua Lochner d7a734342c Update tokenizer example documentation (Closes #245) 2023-08-13 23:24:50 +02:00
Joshua Lochner 2f70a5d37c
Fix typo in supported-tasks snippet 2023-08-11 01:45:39 +02:00
Celso Dias cfdfe9c6f1
Correct word in README (#247) 2023-08-11 01:41:20 +02:00
Joshua Lochner b420a8841e [version] Update to 2.5.1 2023-08-09 22:25:53 +02:00
Joshua Lochner 46dd49064f
[Llama + Llama2] Add model support (#232)
* Add support for llama models

* Fix JSDoc
2023-08-09 13:35:28 +02:00
Joshua Lochner 1e157ba2d8
Add support for Deberta models (#244)
* add documentation for zero shot classification

* add multi_label example

* review comments

* edit examples data

* Add deberta and deberta-v2 model definitions

* Update model mapping

* Implement missing `Strip` normalizer

* Add deberta and deberta-v2 tokenizers

* Add fast path to `Strip` normalizer

* Add token types to deberta tokenizer output

* Update supported_models.py

* Fix default Precompiled normalization

* Update supported models list

* Update JSDoc

* Support `not_entailment` label

* Update multi-label example JSDoc

---------

Co-authored-by: Aschen <amaret93@gmail.com>
2023-08-09 11:58:16 +02:00
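The `Strip` normalizer implemented for the DeBERTa tokenizers trims whitespace from one or both ends of the input; the "fast path" commit covers the common both-ends case with a single `trim()`. A hedged sketch — the option names (`strip_left`, `strip_right`) mirror typical `tokenizer.json` configs but are assumptions here, not the library's exact code:

```javascript
// Sketch of a sentencepiece-style `Strip` normalizer with a fast path
// when both ends are stripped. Option names are assumed, illustrative.
class StripNormalizer {
  constructor({ strip_left = true, strip_right = true } = {}) {
    this.strip_left = strip_left;
    this.strip_right = strip_right;
  }

  normalize(text) {
    if (this.strip_left && this.strip_right) {
      // Fast path: strip both ends at once.
      return text.trim();
    }
    if (this.strip_left) text = text.trimStart();
    if (this.strip_right) text = text.trimEnd();
    return text;
  }
}
```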
Joshua Lochner db7d0f0f83
Tokenization improvements (#234)
* Create basic tokenizer playground app

* Default to no display when user adding large body of text

* Optimize BPE algorithm

- Use map instead of object for `bpe_ranks`
- Replace reduction in BPE algorithm with for loop
- Avoid conversions between sets and arrays

* Use for loop to avoid stack issues with `.push(...items)`

* Fix `mergeArrays` typing

* Remove unnecessary try-catch block in BPE

* Add Llama, T5, and BERT tokenizers to the playground

* Improve how BERT/T5 tokens are displayed

* Improve how token margins are displayed

* Use `Map` for cache

* Add efficient heap-based priority queue implementation

* Add more unit tests for LlamaTokenizer

Selected from https://github.com/belladoreai/llama-tokenizer-js/blob/master/llama-tokenizer.js#L381-L452

* Implement priority-queue-based BPE algorithm

* Remove old code

* Update `bpe` docstring

* Add `data-structures` page to docs

* Update JSDoc for data-structures.js

* Update data-structures.js

* Move `TokenLattice` and `CharTrie` to data-structures module

* Minor refactoring
2023-08-08 12:11:35 +02:00
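The "efficient heap-based priority queue" above lets the BPE merge loop always pop the best-ranked pending merge in O(log n) instead of scanning all candidates. A self-contained binary-heap sketch of that kind of structure (not the library's exact code):

```javascript
// Minimal binary min-heap priority queue. In the BPE use case the
// comparator would order candidate merges by their `bpe_ranks` score,
// so the lowest-ranked (highest-priority) merge is popped first.
class PriorityQueue {
  constructor(comparator = (a, b) => a < b) {
    this._heap = [];
    this._comparator = comparator;
  }
  get size() { return this._heap.length; }

  push(value) {
    this._heap.push(value);
    this._siftUp(this._heap.length - 1);
  }

  pop() {
    const top = this._heap[0];
    const last = this._heap.pop();
    if (this._heap.length > 0) {
      this._heap[0] = last;
      this._siftDown(0);
    }
    return top;
  }

  _siftUp(i) {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (!this._comparator(this._heap[i], this._heap[parent])) break;
      [this._heap[i], this._heap[parent]] = [this._heap[parent], this._heap[i]];
      i = parent;
    }
  }

  _siftDown(i) {
    const n = this._heap.length;
    for (;;) {
      const left = 2 * i + 1, right = 2 * i + 2;
      let best = i;
      if (left < n && this._comparator(this._heap[left], this._heap[best])) best = left;
      if (right < n && this._comparator(this._heap[right], this._heap[best])) best = right;
      if (best === i) break;
      [this._heap[i], this._heap[best]] = [this._heap[best], this._heap[i]];
      i = best;
    }
  }
}
```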
Joshua Lochner ebc9722305 Update supported_models.py 2023-08-01 22:27:40 +02:00
Joshua Lochner d2a0aa9133 Add link to semantic image search application 2023-08-01 18:56:51 +02:00
Joshua Lochner a9a955c76f Update .env.local.example 2023-08-01 18:55:46 +02:00
Joshua Lochner 99db37864d Update semantic image search example README 2023-08-01 18:55:41 +02:00
Joshua Lochner b1537e28dc Create package-lock.json 2023-08-01 15:30:52 +02:00
Joshua Lochner 9aa1a29dac [version] Update to 2.5.0 2023-08-01 14:24:56 +02:00
Joshua Lochner f867226c7e
Improve browser extension sample/template (#196)
* Update extension to be module

* Update example extension

* Allow user to specify a custom cache system

* Implement custom cache system

Emulates the Web Cache API using Chrome's local storage API

* Use custom cache system in extension

* Fix serialization

* Remove old folders

* Update extension readme

* Add note about JSON requirement for local storage
2023-08-01 14:23:21 +02:00
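The custom cache system for the extension emulates the Web Cache API's `match`/`put` pair on top of `chrome.storage.local`, which (per the note above) can only hold JSON-serializable values — hence the serialization fix. A hedged, self-contained sketch; here a `Map` stands in for the `chrome.storage.local` backend so the example runs anywhere:

```javascript
// Sketch of a cache mimicking the Web Cache API shape (`match`/`put`)
// over a string key-value store. In the extension the backend would be
// chrome.storage.local and the body a JSON-serialized response; a Map
// stands in here. Illustrative only, not the extension's exact code.
class CustomCache {
  constructor(backend = new Map()) {
    this._backend = backend;
  }

  // Resolve to the cached body, or undefined on a cache miss.
  // (The real Web Cache API resolves to a Response object instead.)
  async match(url) {
    return this._backend.get(url);
  }

  // Store the (already serialized) response body under its URL.
  async put(url, body) {
    this._backend.set(url, body);
  }
}
```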
Joshua Lochner 2fde656791
Add support for computing CLIP image and text embeddings separately (Closes #148) (#227)
* Define custom CLIP ONNX configs

* Update conversion script

* Support specifying custom model file name

* Use int64 for CLIP input ids

* Add support for CLIP text and vision models

* Fix JSDoc

* Add docs for `CLIPTextModelWithProjection`

* Add docs for `CLIPVisionModelWithProjection`

* Add unit test for CLIP text models

* Add unit test for CLIP vision models

* Set resize precision to 3 decimal places

* Fix `RawImage.save()` function

* Throw error when reading image and status != 200

* Create basic semantic image search application

* Separate out components

* Add `update-database` script

* Update transformers.js version
2023-08-01 14:01:04 +02:00
Joshua Lochner 27920d8483 [version] Update to 2.4.4 2023-07-28 13:28:37 +02:00
Joshua Lochner 2015c685c7
Add Starcoder model support + demo (#225)
* Add support for `gpt_bigcode` models

* Create basic code-completion sample application

* Update sidebar

* Remove debug statement

* Disable 1B model (for now)

* Display progress bars

* Reuse config if not specified

* Update supported_models.py

* Update comment

* Add temperature/sample/topk generation params

* Update sidebar

* Add `gpt_bigcode` to supported models list

* Add code playground example

* Update title

* Cleanup

* Ignore `bigcode/starcoderbase-1b` from tests

* Update transformers.js version for demo
2023-07-28 13:24:32 +02:00
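The "temperature/sample/topk generation params" added to the code-completion demo combine in the usual way: scale logits by temperature, keep the top-k candidates, softmax, then sample proportionally. A sketch under those assumptions (not the library's sampler):

```javascript
// Illustrative top-k + temperature sampling over a logits array.
// `random` is injectable so the function is deterministic in tests.
function sampleTopK(logits, { topK = 5, temperature = 1.0, random = Math.random } = {}) {
  // Keep only the topK highest logits, remembering original indices.
  const indexed = logits.map((logit, index) => ({ logit, index }));
  indexed.sort((a, b) => b.logit - a.logit);
  const top = indexed.slice(0, topK);

  // Softmax over the temperature-scaled logits of the survivors.
  const max = top[0].logit / temperature;
  const exps = top.map(({ logit }) => Math.exp(logit / temperature - max));
  const sum = exps.reduce((a, b) => a + b, 0);

  // Sample one candidate proportionally to its probability.
  let r = random() * sum;
  for (let i = 0; i < top.length; ++i) {
    r -= exps[i];
    if (r <= 0) return top[i].index;
  }
  return top[top.length - 1].index;
}
```

With `topK = 1` this degenerates to greedy decoding; higher temperatures flatten the distribution before sampling.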
Joshua Lochner da67f41434 [version] Update to 2.4.3 2023-07-27 06:06:42 +02:00
Joshua Lochner 961c0cf860 Add MPNet to README 2023-07-27 06:01:50 +02:00
Joshua Lochner f163f1a318
Add support for `mpnet` models (#221) 2023-07-27 05:59:23 +02:00
Joshua Lochner 09ff83b90e
Create example next.js application (Closes #210) (#211)
* Create example next app

* Link to example app

* Update next configs

* Create tutorial for next.js application

* Update next.js tutorial

* Rename project `next` -> `next-client`

* Clone `next-server` from `next-client`

* Update next.config.js for server-side inference

* Create basic server-side next.js application

* Update example links

* Update subheading for client-side next.js app

* Update next.config.js files

* Create example Dockerfile

* Update next tutorial to include server-side inference

* Improve wording

* Update Dockerfile

* Add step to create a Dockerfile

* Update examples snippet

* Fix wording
2023-07-26 01:48:13 +02:00
Joshua Lochner f181e135d4 [version] Update to 2.4.2 2023-07-22 05:08:12 +02:00
Joshua Lochner 1165f04a9f
Fix BPE tokenization for weird whitespace characters (Closes #199) (#208)
* Add new tokenizer unit test (#199)

* Perform `NFKC` normalization for sentencepiece models w/ precompiled charmap

* Fix JSDoc indentation

* Add problematic string to unit tests

* Use consistent BPE split token

* Add second problematic string
2023-07-22 04:51:11 +02:00
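The whitespace fix above applies NFKC normalization before BPE for sentencepiece models with a precompiled charmap. JavaScript's built-in `String.prototype.normalize` illustrates why this matters: "weird" whitespace like a no-break space (U+00A0) becomes a plain ASCII space under NFKC, so it tokenizes like ordinary whitespace.

```javascript
// NFKC folds compatibility characters into their canonical forms,
// e.g. a no-break space (U+00A0) becomes a plain space (U+0020).
const weird = 'hello\u00A0world'; // contains a no-break space
const normalized = weird.normalize('NFKC');
// normalized now contains a plain ASCII space between the words.
```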
Joshua Lochner 86e68bf9c0
Add support for private/gated model access (Closes #198) (#202)
* Allow user to specify HF token as an environment variable

* Add documentation for how to make authorized requests

* Improve docs
2023-07-21 17:31:37 +02:00
Joshua Lochner 00c0e2935e Fix documentation (Closes #201) 2023-07-21 16:57:46 +02:00
Joshua Lochner a298de39f3 Update list of examples
- Add "Doodle Dash"
- Reorder
2023-07-11 16:12:27 +02:00
Joshua Lochner 2e812458e4 Fix object-detection demo 2023-07-11 16:07:01 +02:00
Joshua Lochner 4e947aa657 [version] Update to 2.4.1 2023-07-11 02:12:28 +02:00
Joshua Lochner f112349a28
Object-detection pipeline improvements + better documentation (#189)
* Fix variable name

* Add pipeline loading options section

* Align object detection pipeline output with python library

* Update unit tests

* Update batched object detection unit test

* Relax object detection unit tests
2023-07-11 02:09:03 +02:00
Joshua Lochner 13efa96122
Fix padding and truncation in pipelines (#190) 2023-07-11 02:07:53 +02:00
Joshua Lochner 316d10e6ec [version] Update to 2.4.0 2023-07-10 00:14:44 +02:00
Joshua Lochner 4e21189a0a
Fix loading of grayscale images in node.js (#181)
* Ensure the image loaded by sharp.js has the correct number of channels

* Do not assume default channels
2023-07-09 23:22:08 +02:00
Joshua Lochner 86de50d0f2
Whisper word-level timestamps (#184)
* Support outputting attentions in generate function

* Add unit tests for concatenating tensors

* Implement `cat` for `dim>0`

* Add `cat` unit tests for > 2 tensors

* Allow for negative indexing + bounds checking

* Add test case for `cat` with negative indexing

* Clean up `safeIndex` helper function

* Allow indexing error message to include dimension

* Reuse `safeIndex` helper function for `normalize_`

* Optimize `cat` indexing

* Implement `stack` tensor operation

+ add unit tests

* Add TODOs

* Implement `mean` tensor operation

* Implement `std_mean` tensor ops

* Fix order of `std_mean` returns

* Implement median filter

* Implement dynamic time warping

* Implement `neg` tensor op

* Throw error if audio sent to processor is not a `Float32Array`

* Add `round` helper function

* [WIP] Implement basic version of word-level-timestamps

Known issues:
- timestamps not correct for index > 0
- punctuation not same as python version

* Fix typo

* Fix timestamps

* Round to 2 decimals

* Fix punctuation

* Fix typing

* Remove debug statements

* Cleanup code

* Cleanup

* Remove debug statements

* Update JSDoc for extract token timestamps function

* Add return type for `std_mean` tensor function

* Improve typing of private whisper tokenizer functions

* Indicate method is private

* Allow whisper feature extractor to be called with Float64Array input

* Fix typo

* Throw error if `cross_attentions` are not present in model output when extracting token timestamps

* Throw error during generate function

* Allow whisper models to be exported with `output_attentions=True`

* Add alignment heads to generation config

* Remove print statement

* Update versions

* Override protobufjs version

* Update package-lock.json

* Require onnx==1.13.1 for conversion

Will update once onnxruntime-web supports onnx IR version 9

* Add unit test for word-level timestamps

* Extract add attentions function out of `generate`

* Fix `findLongestCommonSequence` return types

* Downgrade back to onnxruntime 1.14.0

1.15.1 is a little too unstable right now.

* Cleanup

- use `.map`
- rename variables

* Update comments

* Add examples for how to transcribe w/ word-level timestamps

* Add example for transcribing/translating audio longer than 30 seconds

* Make example more compact
2023-07-09 23:21:43 +02:00
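The dynamic time warping step implemented above is the core of word-level timestamps: it finds the minimum-cost monotonic alignment between token positions and audio frames through a (tokens × frames) cost matrix derived from the cross-attentions. A hedged, self-contained sketch of that algorithm (not the library's exact code):

```javascript
// Dynamic time warping over a 2D cost matrix: returns the pair of
// aligned index sequences [tokenIndices, timeIndices] along the
// cheapest monotonic path from the top-left to the bottom-right cell.
function dynamicTimeWarping(matrix) {
  const rows = matrix.length, cols = matrix[0].length;
  // cost[i][j] = cheapest cost to reach cell (i-1, j-1); trace records
  // which of the three moves produced it.
  const cost = Array.from({ length: rows + 1 }, () => new Array(cols + 1).fill(Infinity));
  const trace = Array.from({ length: rows + 1 }, () => new Array(cols + 1).fill(0));
  cost[0][0] = 0;

  for (let i = 1; i <= rows; ++i) {
    for (let j = 1; j <= cols; ++j) {
      const c0 = cost[i - 1][j - 1]; // diagonal: advance token and frame
      const c1 = cost[i - 1][j];     // advance token only
      const c2 = cost[i][j - 1];     // advance frame only
      let c, t;
      if (c0 <= c1 && c0 <= c2) { c = c0; t = 0; }
      else if (c1 <= c2) { c = c1; t = 1; }
      else { c = c2; t = 2; }
      cost[i][j] = matrix[i - 1][j - 1] + c;
      trace[i][j] = t;
    }
  }

  // Backtrace from the bottom-right corner.
  const tokenIndices = [], timeIndices = [];
  let i = rows, j = cols;
  while (i > 0 && j > 0) {
    tokenIndices.push(i - 1);
    timeIndices.push(j - 1);
    const t = trace[i][j];
    if (t === 0) { --i; --j; }
    else if (t === 1) { --i; }
    else { --j; }
  }
  return [tokenIndices.reverse(), timeIndices.reverse()];
}
```

In the pipeline, the time indices are then scaled by the frame duration (and rounded, per the commits above) to produce per-token start/end times.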
Joshua Lochner aceab9bf3d Update supported models list 2023-07-01 04:46:09 +02:00
Joshua Lochner 1563b434bc [version] Update to 2.3.1 2023-07-01 04:15:13 +02:00
Joshua Lochner f2a2aeea44
Add xlm-roberta models (Fixes #177) (#178) 2023-07-01 03:50:57 +02:00
lsb 3c8b15e39e
Update onnx.js (#174)
* Update onnx.js

* Update regex test for iOS 16.4 user agent

---------

Co-authored-by: Joshua Lochner <admin@xenova.com>
2023-07-01 03:28:49 +02:00
Joshua Lochner 1bf7958cfa
Add example code for running text-generation models (#175)
* Add example code for running text-generation models

* Fix non-greedy sampling functions

* Update samplers

* Remove duplicate requirement

`onnxruntime` is specified in `optimum[onnxruntime]`

* Align `generate` function output with python library

Include starting tokens in output

* [docs] Add example text-generation code

* Update demo site text streaming for causal language models

* Override default code highlighting for operators

* Fix order of link
2023-07-01 03:04:00 +02:00
Joshua Lochner 1914c0784d Fix conversion to grayscale 2023-06-29 23:38:52 +02:00
Julien Chaumond 6eb924b7b1
Add `RobertaForTokenClassification` and an example checkpoint on Hub (#170) 2023-06-29 20:15:58 +02:00
Joshua Lochner 27d7ea489b
Improvements to documentation (#172)
* link to the conversion Space for maximum simplicity

* add some types to script (very optional)

* typo

* no need for trailing slash here

* Node is also a valid option

* Document how to find a compatible checkpoint on the hub

* Update README

* Fix typing

* Update docs index

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
2023-06-29 19:32:17 +02:00
Joshua Lochner a5ca113d51
[WIP] New model/tokenizer types (#165)
* Recursively replace tensors with custom class

* Add mobile vit models

* Add example code for `ImageClassificationPipeline`

* Fix example urls

* Add MobileViT models and processors

* Update optimum requirement in conversion script

Previous name is deprecated

* Update supported models

* Update supported_models.py

* Update supported_models.py

* Update tokenizer test generator script

* Add special test case for falcon tokenizers

* Update tokenizer test script

* Add support for `FalconTokenizer`

* Update `BertPreTokenizer` call parameter types

* Add `GPTNeoXTokenizer` tokenizer (mpt)

* Use transformers from source when testing

* Reuse `prepare_model_inputs` function type

Better than using `@see {@link ... }` since it works with IntelliSense.
2023-06-28 15:14:44 +02:00
Joshua Lochner 8d6622ef9b [version] Update to 2.3.0 2023-06-22 15:36:02 +02:00
Joshua Lochner c491c2661f Do not use browser cache if inaccessible (Fixes #162) 2023-06-22 15:32:52 +02:00
Pushpender Saini 15854f9cd6
Set chunk timestamp to rounded time (#160) 2023-06-22 01:01:06 +02:00
Joshua Lochner f628b841a8
Allow user to set `per_channel` and `reduce_range` quantization params (#156) (#157)
* Allow user to set `per_channel` and `reduce_range` quantization parameters (#156)

Also save quantization options

* Get operators of graph and subgraphs
2023-06-22 00:43:43 +02:00
Joshua Lochner d90f58110a
Add whisper unit tests (#155)
* Only run encoder with required inputs

* Add basic whisper unit tests

* Add newline after heading for docs

* Add unit test for transcribing english with timestamps

* Add multilingual test case
2023-06-21 23:58:16 +02:00
Joshua Lochner 4804171180
Do not use spread operator to concatenate large arrays (Closes #153) (#154)
* Do not use spread operator for merging large arrays (Fix #153)

* Add unit test for encoding long strings
2023-06-21 01:21:14 +02:00
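Why the spread operator was the bug here: `target.push(...source)` passes every element of `source` as a separate call argument, so a large enough array exceeds the engine's argument/call-stack limit and throws a `RangeError`. Merging with plain loops, as the fix does, has no such limit. A sketch of that approach (`mergeArrays` is illustrative, not necessarily the library's signature):

```javascript
// Concatenate any number of arrays without spreading them into call
// arguments, so arbitrarily large inputs cannot overflow the stack.
function mergeArrays(...arrays) {
  const merged = [];
  for (const arr of arrays) {
    for (const item of arr) {
      merged.push(item); // one element per call: no argument-count limit
    }
  }
  return merged;
}
```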