Commit Graph

3332 Commits

Author SHA1 Message Date
Lysandre Debut d3eb7d23a4
Pipeline doc (#3055)
* Pipeline doc initial commit

* pipeline abstraction

* Remove modelcard argument from pipeline

* Task-specific pipelines can be instantiated with no model or tokenizer

* All pipelines doc
2020-03-02 14:07:10 -05:00
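As a quick illustration of the pipeline abstraction documented in the commit above, a task-specific pipeline can be instantiated with no model or tokenizer, in which case a default checkpoint for the task is used (a minimal sketch; the sample sentence and printed output are illustrative):

```python
from transformers import pipeline

# No model or tokenizer given: the task's default checkpoint is downloaded and used.
nlp = pipeline("sentiment-analysis")
print(nlp("We are very happy to show you the 🤗 Transformers library."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```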
Manuel Romero 2c7749784c Update README.md
- Add example of usage
- Update metrics
2020-03-02 13:35:34 -05:00
Julien Chaumond 0e56b37e80 rm bogus file
cc @patrickvonplaten
2020-03-02 12:27:12 -05:00
Patrick von Platen 2fdc7f6ce8
correct greedy generation when doing beam search (#3078)
* correct greedy generation when doing beam search

* improve comment
2020-03-02 12:00:09 -05:00
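The fix above concerns `generate` when greedy decoding is combined with beam search; a minimal sketch of that setting (model choice and prompt are illustrative):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The weather today is", return_tensors="pt")
# do_sample=False with num_beams > 1 is exactly the code path the commit corrects.
output = model.generate(input_ids, max_length=30, num_beams=5, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```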
Julien Chaumond 13afb71208 [ci] Ensure that TF does not preempt all GPU memory for itself
see https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

Co-Authored-By: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
2020-03-02 11:56:45 -05:00
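The linked TensorFlow guide suggests enabling memory growth so TF allocates GPU memory on demand instead of reserving all of it upfront; a short sketch of that pattern:

```python
import tensorflow as tf

# Allocate GPU memory on demand rather than preempting the whole card.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```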
Patrick von Platen c0135194eb
Force pad_token_id to be set before padding for standard tokenizer (#3035)
* force pad_token_id to be set before padding

* fix tests and forbid padding without having a pad_token_id set
2020-03-02 10:53:55 -05:00
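With the change above, padding fails unless the tokenizer has a pad token, so tokenizers that ship without one (GPT-2, for instance) need one assigned first. A minimal sketch, reusing the EOS token as pad, which is one common workaround rather than the only option:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token by default, so padding would now be refused.
tokenizer.pad_token = tokenizer.eos_token
encoded = tokenizer.encode_plus("Hello world", max_length=16, pad_to_max_length=True)
print(encoded["input_ids"])
```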
Sam Shleifer b54ef78d0c
Bart-CNN (#3059)
`generate` code that produces 99% identical summarizations to fairseq on CNN test data, with caching.
2020-03-02 10:35:53 -05:00
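A hedged sketch of what that `generate` path looks like from user code; the class and checkpoint names (`BartForConditionalGeneration`, `facebook/bart-large-cnn`) follow later releases and may have differed slightly at the time of this commit:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "New York (CNN) ..."  # a long news article would go here
inputs = tokenizer.encode(article, return_tensors="pt", max_length=1024)
# Beam search with caching is what makes the summaries track fairseq so closely.
summary_ids = model.generate(inputs, num_beams=4, max_length=140)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```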
Victor SANH 6b1ff25084
fix n_gpu count when no_cuda flag is activated (#3077)
* fix n_gpu count when no_cuda flag is activated

* someone was left behind
2020-03-02 10:20:21 -05:00
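The gist of the fix, sketched outside the actual example scripts (the `args` namespace here is a stand-in for the scripts' argparse setup): the GPU count must be forced to zero whenever --no_cuda is passed, even if CUDA devices are visible.

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--no_cuda", action="store_true")
args = parser.parse_args([])  # empty list just for illustration

# Respect --no_cuda even on a machine with visible GPUs.
args.device = torch.device("cpu" if args.no_cuda or not torch.cuda.is_available() else "cuda")
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
```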
Julien Chaumond 298bed16a8 make style 2020-03-01 14:08:01 -05:00
VictorSanh 852e032ca6 include roberta in run_squad_w_distillation - cc @graviraja 2020-03-01 01:56:50 +00:00
VictorSanh b5509abb36 --do_lower_case will always trick me... 2020-03-01 01:39:24 +00:00
Julien Chaumond d6ef587a10
[ci] Fixup e36bd94345 2020-02-28 23:19:17 -05:00
Julien Chaumond e36bd94345
[ci] Run all tests on (self-hosted) GPU (#3020)
* Create self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* do not run slow tests, for now

* [ci] For comparison with circleci, let's also run CPU-tests

* [ci] reorganize

* clearer filenames

* [ci] Final tweaks before merging

* rm slow tests on circle ci

* Trigger CI

* On GPU this concurrency was way too high
2020-02-28 21:11:08 -05:00
srush 908fa43b54
Changes to NER examples for PLT and TPU (#3053)
* changes to allow for tpu training

* black

* tpu

* tpu
2020-02-27 16:45:32 -05:00
Lysandre Debut 8bcb37bfb8
NER support for Albert in run_ner.py and NerPipeline (#2983)
* Added support for Albert when fine-tuning for NER

* Added support for Albert in NER pipeline

* Added command-line options to examples/ner/run_ner.py to better control tokenization

* Added class AlbertForTokenClassification

* Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens

* Added ,

* Now passes style guide enforcement

* Changes from reviews.

* Code now passes style enforcement

* Added test for AlbertForTokenClassification

* Added test for AlbertForTokenClassification
2020-02-27 10:22:55 -05:00
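A minimal sketch of the new pieces working together: `AlbertForTokenClassification` loaded directly and handed to the NER pipeline. The checkpoint below carries no fine-tuned NER head, so the predictions are only illustrative:

```python
from transformers import AlbertForTokenClassification, AlbertTokenizer, pipeline

model = AlbertForTokenClassification.from_pretrained("albert-base-v2", num_labels=9)
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

# NerPipeline now accepts Albert models; per the commit above, output tokens
# come from convert_ids_to_tokens rather than decode.
ner = pipeline("ner", model=model, tokenizer=tokenizer)
print(ner("Hugging Face is based in New York City"))
```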
Sam Shleifer 6a37588041
spelling: strictly (#3042) 2020-02-27 10:22:35 -05:00
Cola f4ff44a6d9
Fix batch_encode_plus (#3041) 2020-02-27 09:56:47 -05:00
Martin Malmsten f71157529e Added test for AlbertForTokenClassification 2020-02-27 12:24:20 +01:00
Martin Malmsten aceb6a0907 Added test for AlbertForTokenClassification 2020-02-27 11:52:46 +01:00
Martin Malmsten d762d4289c Code now passes style enforcement 2020-02-26 23:50:40 +01:00
Martin Malmsten 9495d38b0d Changes from reviews. 2020-02-26 23:36:39 +01:00
Julien Chaumond b370cc7e99 [gpu] Fixup fdd61b1992 2020-02-26 21:48:49 +00:00
Julien Chaumond f5516805c2 Fix bart slow test 2020-02-26 20:47:49 +00:00
Andrew Walker 5bc99e7f33
fix several typos in Distil* readme (#3034) 2020-02-26 12:39:54 -05:00
Patrick von Platen fdd61b1992
Fix attn mask gpt2 when using past (#3033)
* fix issue and add some tests

* fix issue and add some tests

* updated doc string gpt2
2020-02-26 12:04:37 -05:00
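A hedged sketch of the situation the fix addresses: incremental decoding with a cached `past`, where the attention mask has to cover the cached tokens as well as the new input (argument names follow the 2.x era; `past` was later renamed `past_key_values`):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Hello, my dog", return_tensors="pt")
attention_mask = torch.ones_like(input_ids)

outputs = model(input_ids, attention_mask=attention_mask)
logits, past = outputs[0], outputs[1]

for _ in range(5):
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    # Grow the mask so it also covers the tokens already stored in `past`.
    attention_mask = torch.cat([attention_mask, torch.ones_like(next_token)], dim=-1)
    outputs = model(next_token, past=past, attention_mask=attention_mask)
    logits, past = outputs[0], outputs[1]
```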
Julien Chaumond 9cda3620b6
Fix (non-slow) tests on GPU (torch) (#3024)
* Fix tests on GPU (torch)

* Fix bart slow tests

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-02-26 11:59:25 -05:00
Sam Shleifer 9df74b8bc4
Delete all mentions of Model2Model (#3019) 2020-02-26 11:36:27 -05:00
Lysandre Debut bb7c468520
Documentation (#2989)
* All Tokenizers

BertTokenizer + few fixes
RobertaTokenizer
OpenAIGPTTokenizer + Fixes
GPT2Tokenizer + fixes
TransfoXLTokenizer
Correct rst for TransformerXL
XLMTokenizer + fixes
XLNet Tokenizer + Style
DistilBERT + Fix XLNet RST
CTRLTokenizer
CamemBERT Tokenizer
FlaubertTokenizer
XLMRobertaTokenizer
cleanup

* cleanup
2020-02-25 18:43:36 -05:00
Patrick von Platen c913eb9c38
Add integration tests for xlm roberta modelling and xlm roberta tokenizer (#3014)
* add first files

* add xlm roberta integration tests

* make style

* flake 8 issues solved
2020-02-25 16:51:25 -05:00
srush e8ce63ff21
Change masking to direct labeling for TPU support. (#2982)
* change masking to direct labeling

* fix black

* switch to ignore index

* .

* fix black
2020-02-25 14:47:43 -05:00
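The technique is the standard PyTorch one: instead of boolean-masking the loss (which creates dynamic shapes that XLA/TPU handles poorly), ignored positions are labeled with the loss function's ignore index so every tensor keeps a static shape. A small self-contained sketch:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 5)                             # (num_tokens, num_labels)
labels = torch.tensor([1, 2, -100, 0, 4, -100, 3, 1])  # -100 marks ignored positions

# CrossEntropyLoss skips positions whose label equals ignore_index,
# so no masking or dynamic reshaping is needed.
loss = nn.CrossEntropyLoss(ignore_index=-100)(logits, labels)
```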
Jhuo IH 7a7ee28cb9
missing ner link (#2967) 2020-02-25 14:06:57 -05:00
Lysandre Debut 65e7c90a77
Adding usage examples for common tasks (#2850)
* Usage: Sequence Classification & Question Answering

* Pipeline example

* Language modeling

* TensorFlow code for Sequence classification

* Custom TF/PT toggler in docs

* QA + LM for TensorFlow

* Finish Usage for both PyTorch and TensorFlow

* Addressing Julien's comments

* More assertive

* cleanup

* Favicon
- added favicon option in conf.py along with the favicon image
- updated 🤗 logo. Slightly smaller and should appear more consistent across editing programs (no more tongue on the outside of the mouth)

Co-authored-by: joshchagani <joshua@joshuachagani.com>
2020-02-25 13:48:24 -05:00
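In the spirit of the new usage pages, a short sequence-classification example (the MRPC-finetuned checkpoint and the sentence pair are illustrative choices, not lifted from the docs themselves):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

# Paraphrase classification on a sentence pair.
inputs = tokenizer.encode_plus("The company is doing well.",
                               "Business is going great.",
                               return_tensors="pt")
logits = model(**inputs)[0]
print(torch.softmax(logits, dim=-1))
```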
Julien Chaumond e693cd1e87 [ci] Run slow tests every day 2020-02-24 19:54:47 -05:00
Julien Chaumond 4fc63151af [ci] Attempt to fix #2844 2020-02-24 19:51:34 -05:00
Lysandre Debut b90745c590
Test correct tokenizers after default switch (#3003) 2020-02-24 18:45:53 -05:00
Lysandre Debut 3716c3d8af
False by default (#3002) 2020-02-24 18:30:57 -05:00
Lysandre f9ec5ca90b Release: v2.5.1 2020-02-24 18:22:54 -05:00
Funtowicz Morgan 4cd9c0971c
Fix for fast tokenizers save_pretrained compatibility with Python. (#2933)
* Renamed file generated by tokenizers when calling save_pretrained to match Python.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added save_vocabulary tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove python quick and dirty fix for clean Rust impl.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.5.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added some save_pretrained / from_pretrained unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.5.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality and format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* flake8

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Making sure there is really a bug in unittest

* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-24 18:20:42 -05:00
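A round trip makes the point of the fix concrete: files written by a fast (Rust-backed) tokenizer should be loadable by the plain Python class again. A minimal sketch, with an illustrative path and model id:

```python
import os
from transformers import BertTokenizer, BertTokenizerFast

fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
os.makedirs("./my-bert-tokenizer", exist_ok=True)
fast.save_pretrained("./my-bert-tokenizer")

# After the fix the saved files follow the Python naming scheme,
# so the slow tokenizer can read them back.
slow = BertTokenizer.from_pretrained("./my-bert-tokenizer")
```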
Sandro Cavallari ee60840ee6
fix _update_memory fn call in transformer-xl (#2971) 2020-02-24 17:50:24 -05:00
Patrick von Platen 6a50d501ec
add explaining example to XLNet LM modeling (#2997)
* add explaining example to XLNet LM modeling

* improve docstring for xlnet
2020-02-24 15:42:38 -05:00
Patrick von Platen 65d74c4965
Add preprocessing step for transfo-xl tokenization to avoid tokenizing words followed by punctuation to <unk> (#2987)
* add preprocessing to add space before punctuation for transfo_xl

* improve warning messages

* make style

* compile regex at instantiation of tokenizer object
2020-02-24 15:11:10 -05:00
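A sketch of the idea only, not the tokenizer's actual implementation: inserting a space before punctuation keeps a word-level vocabulary like transfo-xl's from mapping "Henson?" to <unk>.

```python
import re

PUNCTUATION = r"[.,!?;:]"
text = "Who was Jim Henson? Jim Henson was a puppeteer."

# Split punctuation off the preceding word so each piece is a known token.
spaced = re.sub(rf"(\S)({PUNCTUATION})", r"\1 \2", text)
print(spaced)  # Who was Jim Henson ? Jim Henson was a puppeteer .
```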
Bram Vanroy a143d9479e
Add local_files_only parameter to pretrained items (#2930)
* Add disable_outgoing to pretrained items

Setting disable_outgoing=True disables outgoing traffic:
- etags are not looked up
- models are not downloaded

* parameter name change

* Remove forgotten print
2020-02-24 14:58:15 -05:00
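Usage of the new parameter under its final name, `local_files_only`; this only succeeds if the files are already in the local cache, since no network requests are made:

```python
from transformers import AutoModel, AutoTokenizer

# No etag lookups, no downloads: everything must already be cached locally.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", local_files_only=True)
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)
```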
Manuel Romero 286d1ec746 Create README.md 2020-02-24 14:33:49 -05:00
Lysandre Debut 7984a70ee4
kwargs are passed to both model and configuration in AutoModels (#2998) 2020-02-24 14:19:39 -05:00
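A short snippet shows the behavior: configuration-level kwargs passed to an AutoModel's `from_pretrained` now end up on the configuration object.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
assert model.config.output_attentions  # the kwarg reached the configuration
```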
Lysandre Debut 21d8b6a33e
Testing that batch_encode_plus is the same as encode_plus (#2973)
* Testing that encode_plus and batch_encode_plus behave the same way

Spoiler alert: they don't

* Testing rest of arguments in batch_encode_plus

* Test tensor return in batch_encode_plus

* Addressing Sam's comments

* flake8

* Simplified with `num_added_tokens`
2020-02-24 12:09:46 -05:00
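The property under test, sketched directly (exact return containers can vary between versions, but the ids should agree):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentences = ["Hello world", "Transformers are great"]

one_by_one = [tokenizer.encode_plus(s)["input_ids"] for s in sentences]
batched = tokenizer.batch_encode_plus(sentences)["input_ids"]
assert one_by_one == batched  # encode_plus and batch_encode_plus should agree
```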
Patrick von Platen 17c45c39ed
Add slow generate tests for pretrained lm models (#2909)
* add slow generate lm_model tests

* fix conflicts

* merge conflicts

* fix conflicts

* add slow generate lm_model tests

* make style

* delete unused variable

* fix conflicts

* fix conflicts

* fix conflicts

* delete unused variable

* fix conflicts

* finished hard coded tests
2020-02-24 11:51:57 -05:00
Lysandre Debut 8194df8e0c
Warning on `add_special_tokens` (#2966)
Warning on `add_special_tokens` when passed to `encode`, `encode_plus` and `batch_encode_plus`
2020-02-24 08:42:54 -05:00
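For reference, what the flag controls when passed to `encode`; the new warning itself fires under the conditions described in the PR, and this snippet only shows the effect of the flag:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.encode("Hello world", add_special_tokens=True))   # includes [CLS] ... [SEP]
print(tokenizer.encode("Hello world", add_special_tokens=False))  # wordpiece ids only
```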
Patrick von Platen 38f5fe9e02
add_ctags_to_git_ignore (#2984) 2020-02-23 16:55:32 -05:00
Martin Malmsten 105dcb4162 Now passes style guide enforcement 2020-02-23 21:47:59 +01:00
Martin Malmsten 33eb8a165d Added , 2020-02-23 21:43:31 +01:00