Commit Graph

5543 Commits

Author SHA1 Message Date
XiaoqiJiao 890e790e16
[model_cards] TinyBERT (HUAWEI Noah's Ark Lab) (#7775) 2020-10-14 09:31:01 -04:00
Jonathan Chang 121dd4332b
Add batch inferencing support for GPT2LMHeadModel (#7552)
* Add support for gpt2 batch inferencing

* add test

* remove typo

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2020-10-14 13:40:24 +02:00
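The entry above adds batched generation support for GPT2LMHeadModel. As a hedged illustration of that usage pattern (a sketch, not the PR's diff), the snippet below left-pads a batch of prompts, reuses the EOS token as padding, and passes the attention mask to generate(); the checkpoint name and generation settings are placeholders.

```python
# Illustrative sketch only: batched generation with GPT2LMHeadModel.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has no pad token; reuse EOS and pad on the left so the last
# position of every prompt is a real token rather than padding.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["Hello, my name is", "The quick brown fox"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # lets the model ignore pad positions
    max_length=30,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```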
Quentin Lhoest 0c64b18840
Fix bert position ids in DPR convert script (#7776)
* fix bert position ids in DPR convert script

* style
2020-10-14 05:30:02 -04:00
Sylvain Gugger 7968051aba
Fix typo 2020-10-13 17:30:46 -04:00
Sam Shleifer 2977bd528f
Faster pegasus tokenization test with reduced data size (#7762) 2020-10-13 16:22:29 -04:00
François Lagunas 2d6e2ad4fa
Adding optional trial argument to model_init (#7759)
* Adding optional trial argument to model_init

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-13 17:07:02 +02:00
Tiger 7e73c12805
fixed lots of typos. (#7758) 2020-10-13 10:00:20 -04:00
Noam Wies 8cb4ecca25
Avoid unnecessary DDP synchronization when gradient_accumulation_steps > 1 (#7742)
* use DDP no_sync when possible

* fix is_nlp_available addition mistake

* reformat trainer.py

* reformat trainer.py

* drop support for pytorch < 1.2

* return support for pytorch < 1.2
2020-10-13 09:46:44 -04:00
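The entry above wraps the backward pass of intermediate gradient-accumulation steps in DDP's no_sync() so gradients are only all-reduced on the step that actually updates the weights. Below is a hedged sketch of that pattern, assuming `model` is already wrapped in torch.nn.parallel.DistributedDataParallel and that `dataloader`, `criterion`, `optimizer`, and `accum_steps` are defined elsewhere.

```python
# Sketch of the DDP no_sync pattern; not the trainer.py diff itself.
import contextlib

for step, (inputs, targets) in enumerate(dataloader):
    sync_now = (step + 1) % accum_steps == 0
    # no_sync() skips the gradient all-reduce; only the step that will
    # call optimizer.step() needs synchronized gradients.
    ctx = contextlib.nullcontext() if sync_now else model.no_sync()
    with ctx:
        loss = criterion(model(inputs), targets) / accum_steps
        loss.backward()
    if sync_now:
        optimizer.step()
        optimizer.zero_grad()
```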
Lysandre Debut 52f7d74398
Do not softmax when num_labels==1 (#7726)
* Do not softmax when num_labels==1

* Update src/transformers/pipelines.py

Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2020-10-13 09:42:27 -04:00
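The entry above stops the text-classification pipeline from applying softmax when the model has a single label, since softmax over one logit always returns 1.0. A minimal sketch of that branch (not the actual pipelines.py code):

```python
# Sketch: single-label heads are regression/binary scores, so use sigmoid
# instead of softmax; multi-label heads keep the usual softmax.
import numpy as np

def postprocess(logits: np.ndarray, num_labels: int) -> np.ndarray:
    if num_labels == 1:
        return 1.0 / (1.0 + np.exp(-logits))  # sigmoid score
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)  # softmax over labels
```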
Patrick von Platen 82b09a8481
[Rag] Fix loading of pretrained Rag Tokenizer (#7756)
* fix rag

* Update tokenizer save_pretrained

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-13 14:34:22 +02:00
Patrick von Platen 2d4e928d97
Update PULL_REQUEST_TEMPLATE.md
Putting my name on a couple more issues to directly redirect them to me
2020-10-13 12:18:31 +02:00
Felipe Curti dcba9ee03b
Gpt1 for sequence classification (#7683)
* Add Documentation for GPT-1 Classification

* Add GPT-1 with Classification head

* Add tests for GPT-1 Classification

* Add GPT-1 For Classification to auto models

* Remove authorized missing keys, change checkpoint to openai-gpt
2020-10-13 05:06:15 -04:00
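The entry above adds a sequence-classification head for the original GPT. A hedged usage example, assuming the class follows the library's usual naming (OpenAIGPTForSequenceClassification) and a transformers version whose model outputs expose .logits; the example text and num_labels are placeholders.

```python
# Hedged usage sketch for the GPT-1 classification head added by this PR.
import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTForSequenceClassification

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTForSequenceClassification.from_pretrained("openai-gpt", num_labels=2)

inputs = tokenizer("a charming and often affecting journey", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, num_labels)
print(logits)
```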
Lysandre Debut f34b4cd1bd
ElectraTokenizerFast (#7754) 2020-10-13 04:50:41 -04:00
Sam Shleifer 9c2b2db2cd
[marian] Automate Tatoeba-Challenge conversion (#7709) 2020-10-12 12:24:25 -04:00
Alex Combessie aacac8f708
Add license info to nlptown/bert-base-multilingual-uncased-sentiment (#7738) 2020-10-12 11:56:10 -04:00
Lysandre Debut 1f1d950b28
Fix #7331 (#7732) 2020-10-12 09:10:52 -04:00
Julien Plu d9ffb87efb
Fix tf text class (#7724)
* Fix test

* fix generic text classification

* fix test

* Fix tests
2020-10-12 08:45:15 -04:00
sgugger d6175a4268 Fix code quality 2020-10-12 08:22:27 -04:00
Jonathan Chang 1d5ea34f6a
Fix trainer callback (#7720)
Fixes a bug that happens when subclassing Trainer and
overriding evaluate() without calling prediction_loop().
2020-10-12 07:45:12 -04:00
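The entry above fixes callback handling for Trainer subclasses that override evaluate() without going through prediction_loop(). A hedged sketch of the subclassing pattern in question; the metric name and logging call are illustrative, not the PR's fix.

```python
# Sketch of an evaluate() override that bypasses prediction_loop().
from transformers import Trainer

class MyTrainer(Trainer):
    def evaluate(self, eval_dataset=None, **kwargs):
        metrics = {"eval_custom_metric": 0.0}  # custom evaluation logic here
        self.log(metrics)  # report metrics through the Trainer's logging path
        return metrics
```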
Kelvin f176e70723
The input training data files (multiple files in glob format). (#7717)
Splitting large files into smaller ones can often prevent the tokenizer from running out of memory in environments like Colab that have no swap memory.
2020-10-12 07:44:02 -04:00
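The entry above lets the training script take multiple input files as a glob pattern, so no single huge file has to be read and tokenized at once. A minimal sketch of that idea; the paths are illustrative and the per-line step stands in for tokenization.

```python
# Sketch: expand a glob into shard files and process them one at a time.
import glob

train_files = sorted(glob.glob("data/shards/train-*.txt"))

num_lines = 0
for path in train_files:
    with open(path, encoding="utf-8") as f:
        for line in f:
            num_lines += 1  # a real script would tokenize `line` here
print(f"{len(train_files)} files, {num_lines} lines")
```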
AndreaSottana 34fcfb44e3
Update tokenization_utils_base.py (#7696)
Minor spelling corrections in docstrings. "information" is uncountable in English and has no plural.
2020-10-12 06:09:20 -04:00
fteufel 2f34bcf3e7
check for tpu availability in save_pretrained (#7699)
Added is_torch_tpu_available() to the condition
for saving a model as an XLA model. The "xla_device"
property of the config can also be True on a non-XLA
device when loading a checkpoint that was trained
on XLA before.

Resolves #7695
2020-10-12 04:10:17 -04:00
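The entry above guards the XLA saving path behind is_torch_tpu_available(). A hedged sketch of that guard (not the actual save_pretrained diff); the helper's import path is the one used around this release and moved later, and the function name is illustrative.

```python
# Sketch: only take the XLA save path when a TPU is actually available,
# since config.xla_device can remain True after loading an XLA-trained checkpoint.
import torch
from transformers.file_utils import is_torch_tpu_available

def save_weights(model, path):
    if getattr(model.config, "xla_device", False) and is_torch_tpu_available():
        import torch_xla.core.xla_model as xm
        xm.save(model.state_dict(), path)  # XLA-aware save on TPU
    else:
        torch.save(model.state_dict(), path)  # regular CPU/GPU save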
Sylvain Gugger 13c1857718
Fix typo in all model docs (#7714) 2020-10-12 04:06:59 -04:00
Berowne 83086858f8
fixed typo in warning line 207. (#7718)
Replaces 'men_len' with 'mem_len' to match the parameter name.
2020-10-12 03:58:58 -04:00
Miguel Victor 03ec02a667
Corrected typo: maked → masked (#7703) 2020-10-11 16:45:00 -04:00
Sam Shleifer 827c519494
[examples] bump pl=0.9.0 (#7053) 2020-10-11 16:39:38 -04:00
Alexandr Maslov ba4bbd92bc
Fix docstring in AutoModel class (#7694) 2020-10-10 21:08:08 -04:00
Andrew Kane 26d5475d4b
Added license information for default and distilbert models (#7688) 2020-10-10 03:55:11 -04:00
Sylvain Gugger c6e18de9f8
Fix flaky test in test_trainer (#7689) 2020-10-09 20:01:15 -04:00
Sylvain Gugger 2c9e83f7b8
Fix title level in Blenderbot doc (#7687) 2020-10-09 19:24:10 -04:00
Doug Blank 9618cd6964
Import integration libraries first (#7650)
* Import integration libraries first

* isort and black happiness

* flake8 happiness

* Add a test

* Black reformat

* Ignore import order in tests

* A heavy-handed method of disabling comet for tests

* Remove comet_ml tests

* Run black on setup.py
2020-10-09 12:13:22 -04:00
sgugger 4dcc424de3 Complete release instruction 2020-10-09 12:12:03 -04:00
Sylvain Gugger a3cea6a8cc
Better links for models in README and doc index (#7680) 2020-10-09 11:17:16 -04:00
Sam Shleifer 0af53b1ef9
Delete extra test file (#7681) 2020-10-09 11:16:35 -04:00
Stas Bekman b0f05e0c4c
[pegasus] Faster tokenizer tests (#7672) 2020-10-09 11:10:32 -04:00
sgugger bc00b37a0d Revert "Better model links in the README and index"
This reverts commit 76e05518bb.
2020-10-09 10:56:13 -04:00
sgugger 76e05518bb Better model links in the README and index 2020-10-09 10:54:40 -04:00
Julien Plu 9ad830596d
Fix dataset cardinality (#7678)
* Fix test

* Fix cardinality issue

* Fix test
2020-10-09 10:38:25 -04:00
Joe Davison a1ac082879
add license to xlm-roberta-large-xnli card 2020-10-09 09:16:06 -04:00
Funtowicz Morgan 21ed3a6b99
Reintroduce clean_text on BertTokenizer call which was removed by mistake in #4723 (#5749)
* Reintroduce clean_text call which was removed by mistake in #4723

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added unittest for clean_text parameter on Bert tokenizer.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Better unittest name.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Adapt unittest to use untrained tokenizer.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Code quality + update test

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-10-09 08:07:28 -04:00
Noah Trenaman 5668fdb09e
Update XLM-RoBERTa details (#7669) 2020-10-09 05:16:58 -04:00
guhur 0578a91300
fix nn.DataParallel compatibility with PyTorch 1.5 (#7671)
The same type of error as in https://github.com/huggingface/transformers/pull/4300
2020-10-09 05:15:08 -04:00
Sam Shleifer 297233fa92
[s2s] Switch README urls to cdn (#7670) 2020-10-08 21:22:22 -04:00
Sam Shleifer a1ecc90d6b
[pseudo] Switch URLS to CDN (#7661) 2020-10-08 14:12:39 -04:00
Suraj Patil 06a973fd2a
[s2s] configure lr_scheduler from command line (#7641) 2020-10-08 13:06:35 -04:00
Lysandre Debut 4a00613c24
Fix RobertaForCausalLM docs (#7642)
* Fix RobertaForCausalLM docs

* Apply review suggestion

Co-authored-by: sgugger <sylvain.gugger@gmail,com>
2020-10-08 08:36:00 -04:00
Thomas Wolf 55cb2ee62e
Green tests: update torch-hub test dependencies (add protobuf and pin tokenizer 0.9.0-RC2) (#7658)
* pin torch-hub test

* add protobuf dep
2020-10-08 13:21:15 +02:00
Thomas Wolf 9aeacb58ba
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)
* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenizer implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limit tokenizer warning to one occurrence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2020-10-08 11:32:16 +02:00
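The entry above introduces Rust-backed ("fast") tokenizers for SentencePiece-based models such as T5, mBART and XLM-R, and drops the Transfo-XL fast tokenizer. A hedged usage sketch, assuming sentencepiece and protobuf are installed; the checkpoint name is illustrative and use_fast=True is the opt-in flag of that era.

```python
# Sketch: load the fast tokenizer for a SentencePiece model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small", use_fast=True)
enc = tok("translate English to German: Hello, world!", return_tensors="pt")
print(type(tok).__name__)        # e.g. T5TokenizerFast
print(enc["input_ids"].shape)
```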
Piero Molino 4d04120c6d
Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load (#6935)
* Replaced torch.load for loading the pretrained vocab of TransformerXL to pickle.load

* Replaced torch.save with pickle.dump when saving the vocabulary

* updating transformer-xl

* uploaded on S3 - compatibility

* fix tests

* style

* Address review comments

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-10-08 10:16:10 +02:00
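The entry above swaps torch.save/torch.load for pickle when (de)serializing the Transformer-XL vocabulary. A minimal sketch of that pattern with a toy vocabulary; the file name and vocabulary contents are illustrative, not the library's format.

```python
# Sketch: persist a plain-Python vocabulary with pickle instead of torch
# serialization, so loading it does not depend on torch.load.
import pickle

vocab = {"counter": {"the": 10, "of": 7}, "idx2sym": ["the", "of"]}

with open("vocab.pkl", "wb") as f:
    pickle.dump(vocab, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("vocab.pkl", "rb") as f:
    restored = pickle.load(f)
assert restored == vocab
```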
Sam Shleifer aba4e22944
[pseudolabels] cleanup markdown table (#7653) 2020-10-07 23:04:18 -04:00