transformers

Commit Graph

Author	SHA1	Message	Date
Teven	f8208fa456	Correct transformers-cli env call	2020-04-09 09:03:19 +02:00
Lysandre Debut	6435b9f908	Updating the TensorFlow models to work as expected with tokenizers v3.0.0 (#3684 ) * Updating modeling tf files; adding tests * Merge `encode_plus` and `batch_encode_plus`	2020-04-08 16:22:44 -04:00
LysandreJik	500aa12318	close #3699	2020-04-08 14:32:47 -04:00
Julien Chaumond	a594ee9c84	More doc for model cards (#3698 ) see https://github.com/huggingface/transformers/pull/3679#pullrequestreview-389368270	2020-04-08 12:12:52 -04:00
Julien Chaumond	83703cd077	Update doc for {Summarization,Translation}Pipeline and other tweaks	2020-04-08 09:45:00 -04:00
Seyone Chithrananda	a1b3b4167e	Created README.md for model card ChemBERTa (#3666 ) * created readme.md * update readme with fixes Fixes from PR comments	2020-04-08 09:10:20 -04:00
Lorenzo Ampil	747907dc5e	Fix typo in FeatureExtractionPipeline docstring	2020-04-08 09:08:56 -04:00
Sam Shleifer	715aa5b135	[Bart] Replace config.output_past with use_cache kwarg (#3632 )	2020-04-07 19:08:26 -04:00
Sam Shleifer	e344e3d402	[examples] SummarizationDataset cleanup (#3451 )	2020-04-07 19:05:58 -04:00
Patrick von Platen	b0ad069517	[Tokenization] fix edge case for bert tokenization (#3517 ) * fix egde gase for bert tokenization * add Lysandres comments for improvement * use new is_pretokenized_flag	2020-04-07 16:26:31 -04:00
Patrick von Platen	80fa0f7812	[Examples, Benchmark] Improve benchmark utils (#3674 ) * improve and add features to benchmark utils * update benchmark style * remove output files	2020-04-07 16:25:57 -04:00
Michael Pang	05deb52dc1	Optimize causal mask using torch.where (#2715 ) * Optimize causal mask using torch.where Instead of multiplying by 1.0 float mask, use torch.where with a bool mask for increased performance. * Maintain compatiblity with torch 1.0.0 - thanks for PR feedback * Fix typo * reformat line for CI	2020-04-07 22:19:18 +02:00
Sam Shleifer	0a4b1068e1	Speedup torch summarization tests (#3663 )	2020-04-07 14:01:30 -04:00
Myle Ott	5aa8a278a3	Fix roberta checkpoint conversion script (#3642 )	2020-04-07 12:03:23 -04:00
Julien Chaumond	11cc1e168b	[model_cards] Turn down spurious warnings Close #3639 + spurious warning mentioned in #3227 cc @lysandrejik @thomwolf	2020-04-07 10:20:19 -04:00
Teven	0a9d09b42a	fixed TransfoXLLMHeadModel documentation (#3661 ) Co-authored-by: TevenLeScao <teven.lescao@gmail.com>	2020-04-07 00:47:51 +02:00
Funtowicz Morgan	96ab75b8dd	Tokenizers v3.0.0 (#3185 ) * Renamed num_added_tokens to num_special_tokens_to_add Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Make fast tokenizers unittests work on Windows. * Entirely refactored unittest for tokenizers fast. * Remove ABC class for CommonFastTokenizerTest * Added embeded_special_tokens tests from allenai @dirkgr * Make embeded_special_tokens tests from allenai more generic * Uniformize vocab_size as a property for both Fast and normal tokenizers * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin) * Ensure providing None input raise the same ValueError than Python tokenizer + tests. * Fix invalid input for assert_padding when testing batch_encode_plus * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter. * Ensure tokenize() correctly forward add_special_tokens to rust. * Adding None checking on top on encode / encode_batch for TransfoXLTokenizerFast. Avoid stripping on None values. * unittests ensure tokenize() also throws a ValueError if provided None * Added add_special_tokens unittest for all supported models. * Style * Make sure TransfoXL test run only if PyTorch is provided. * Split up tokenizers tests for each model type. * Fix invalid unittest with new tokenizers API. * Filter out Roberta openai detector models from unittests. * Introduce BatchEncoding on fast tokenizers path. This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward. * Introduce BatchEncoding on slow tokenizers path. Backward compatibility. 
* Improve error message on BatchEncoding for slow path * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases. * Style and format. * Added typing on all methods for PretrainedTokenizerFast * Style and format * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast. * Style and format * encode_plus now supports pretokenized inputs. * Remove user warning about add_special_tokens when working on pretokenized inputs. * Always go through the post processor. * Added support for pretokenized input pairs on encode_plus * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError. * Added pretokenized inputs support on batch_encode_plus * Update BatchEncoding methods name to match Encoding. * Bump setup.py tokenizers dependency to 0.7.0rc1 * Remove unused parameters in BertTokenizerFast * Make sure Roberta returns token_type_ids for unittests. * Added missing typings * Update add_tokens prototype to match tokenizers side and allow AddedToken * Bumping tokenizers to 0.7.0rc2 * Added documentation for BatchEncoding * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods. * Added higher-level typing for tokenize / encode_plus / batch_encode_plus. * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers. * Fix text-classification pipeline using the wrong tokenizer * Make pipelines works with BatchEncoding * Turn off add_special_tokens on tokenize by default. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove add_prefix_space from tokenize call in unittest. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Style and quality Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Correct message for batch_encode_plus none input exception. 
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix invalid list comprehension for offset_mapping overriding content every iteration. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * TransfoXL uses Strip normalizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizers dependency to 0.7.0rc3 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Support AddedTokens for special_tokens and use left stripping on mask for Roberta. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * SpecilaTokenMixin can use slots to faster access to underlying attributes. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove update_special_tokens from fast tokenizers. * Ensure TransfoXL unittests are run only when torch is available. * Style. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Style * Style 🙏🙏 * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol. * Remove Roberta warning on __init__. * Move documentation to Google style. Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>	2020-04-07 00:29:15 +02:00
Ethan Perez	e52d1258e0	Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631 ) * Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py `convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it might be helpful if someone checked who is more familiar with this part of the codebase. * Simplifying change to match recent commits	2020-04-06 16:52:22 -04:00
ktrapeznikov	0ac33ddd8d	Create README.md	2020-04-06 16:35:29 -04:00
Manuel Romero	326e6ebae7	Add model card	2020-04-06 16:30:01 -04:00
Manuel Romero	43eca3f878	Add model card	2020-04-06 16:29:51 -04:00
Manuel Romero	6bec88ca42	Create README.md	2020-04-06 16:29:44 -04:00
Manuel Romero	769b60f935	Add model card (#3655 ) * Add model card * Fix model name in fine-tuning script	2020-04-06 16:29:36 -04:00
Manuel Romero	c4bcb01906	Create model card (#3654 ) * Create model card * Fix model name in fine-tuning script	2020-04-06 16:29:25 -04:00
Manuel Romero	6903a987b8	Create README.md	2020-04-06 16:29:02 -04:00
MichalMalyska	760872dbde	Create README.md (#3662 )	2020-04-06 16:27:50 -04:00
jjacampos	47e1334c0b	Add model card for BERTeus (#3649 ) * Add model card for BERTeus * Update README	2020-04-06 16:21:25 -04:00
Suchin	529534dc2f	BioMed Roberta-Base (AllenAI) (#3643 ) * added model card * updated README * updated README * updated README * added evals * removed pico eval * Tweaks Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-04-06 16:12:09 -04:00
Lysandre Debut	261c4ff4e2	Update notebooks (#3620 ) * Update notebooks * From local to global link * from local links to actual global links	2020-04-06 14:32:39 -04:00
Julien Chaumond	39a34cc375	[model_cards] ELECTRA (w/ examples of usage) Co-Authored-By: Kevin Clark <clarkkev@users.noreply.github.com> Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>	2020-04-06 11:43:33 -04:00
LysandreJik	ea6dba2787	Re-pin isort	2020-04-06 10:09:54 -04:00
LysandreJik	11c3257a18	unpin isort for pypi	2020-04-06 10:06:41 -04:00
LysandreJik	36bffc81b3	Release: v2.8.0	2020-04-06 10:03:53 -04:00
Patrick von Platen	2ee410560e	[Generate, Test] Split generate test function into beam search, no beam search (#3601 ) * split beam search and no beam search test * fix test * clean generate tests	2020-04-06 10:37:05 +02:00
Patrick von Platen	1789c7daf1	fix argument order (#3637 )	2020-04-05 12:33:41 +02:00
Patrick von Platen	b809d2f073	Fix TF T5 docstring (#3636 )	2020-04-05 12:23:09 +02:00
Timo Moeller	4ab8ab4f50	Adjust model card to reflect changes to vocabulary (cherry picked from commit `8e25c4bf28`)	2020-04-04 15:27:41 -04:00
ktrapeznikov	ac40eed1a5	Create README.md adding readme for ktrapeznikov/albert-xlarge-v2-squad-v2	2020-04-04 15:18:54 -04:00
ktrapeznikov	fd9995ebc5	Create README.md	2020-04-04 15:18:31 -04:00
Julien Chaumond	5d912e7ed4	Tweak typing for #3566	2020-04-04 15:04:03 -04:00
Julien Chaumond	94eb68d742	weigths*weights	2020-04-04 15:03:26 -04:00
Manuel Romero	243e687be6	Create model card	2020-04-04 08:20:34 -04:00
Julien Chaumond	3e4b4dd190	[model_cards] Link to ExBERT visualisation Hat/tip @bhoov @HendrikStrobelt @sebastianGehrmann Also cc @srush and @thomwolf	2020-04-03 20:03:29 -04:00
Max Ryabinin	c6acd246ec	Speed up GELU computation with torch.jit (#2988 ) * Compile gelu_new with torchscript * Compile _gelu_python with torchscript * Wrap gelu_new with torch.jit for torch>=1.4	2020-04-03 15:20:21 -04:00
Lysandre Debut	d5d7d88612	ELECTRA (#3257 ) * Electra wip * helpers * Electra wip * Electra v1 * ELECTRA may be saved/loaded * Generator & Discriminator * Embedding size instead of halving the hidden size * ELECTRA Tokenizer * Revert BERT helpers * ELECTRA Conversion script * Archive maps * PyTorch tests * Start fixing tests * Tests pass * Same configuration for both models * Compatible with base + large * Simplification + weight tying * Archives * Auto + Renaming to standard names * ELECTRA is uncased * Tests * Slight API changes * Update tests * wip * ElectraForTokenClassification * temp * Simpler arch + tests Removed ElectraForPreTraining which will be in a script * Conversion script * Auto model * Update links to S3 * Split ElectraForPreTraining and ElectraForTokenClassification * Actually test PreTraining model * Remove num_labels from configuration * wip * wip * From discriminator and generator to electra * Slight API changes * Better naming * TensorFlow ELECTRA tests * Accurate conversion script * Added to conversion script * Fast ELECTRA tokenizer * Style * Add ELECTRA to README * Modeling Pytorch Doc + Real style * TF Docs * Docs * Correct links * Correct model intialized * random fixes * style * Addressing Patrick's and Sam's comments * Correct links in docs	2020-04-03 14:10:54 -04:00
Yohei Tamura	8594dd80dd	BertJapaneseTokenizer accept options for mecab (#3566 ) * BertJapaneseTokenizer accept options for mecab * black * fix mecab_option to Option[str]	2020-04-03 11:12:19 -04:00
HUSEIN ZOLKEPLI	216e167ce6	Added albert-base-bahasa-cased README and fixed tiny-bert-bahasa-cased README (#3613 ) * add bert bahasa readme * update readme * update readme * added xlnet * added tiny-bert and fix xlnet readme * added albert base	2020-04-03 09:28:43 -04:00
ahotrod	1ac6a246d8	Update README.md (#3604 ) Update AutoModel & AutoTokernizer loading.	2020-04-03 09:28:25 -04:00
ahotrod	e91692f4a3	Update README.md (#3603 )	2020-04-03 09:27:57 -04:00
HenrykBorzymowski	8e287d507d	corrected mistake in polish model cards (#3611 ) * added model_cards for polish squad models * corrected mistake in polish design cards Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>	2020-04-03 09:07:15 -04:00

3710 Commits, All Branches