[XLNet] Fix mems behavior (#8567)
* fix mems in xlnet
* fix use_mems
* fix use_mem_len
* fix use mems
* clean docs
* fix tf typo
* make xlnet tf for generation work
* fix tf test
* refactor use cache
* add use cache for missing models
* correct use_cache in generate
* correct use cache in tf generate
* fix tf
* correct getattr typo
* make sylvain happy
* change in docs as well
* do not apply to cookie cutter statements
* fix tf test
* make pytorch model fully backward compatible
This commit is contained in: parent 369f1d77b4, commit 2a6fbe6a40
@ -97,6 +97,6 @@ You should check out our [swift-coreml-transformers](https://github.com/huggingf
|
|||
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
|
||||
`DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
|
||||
|
||||
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch or
|
||||
At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or
|
||||
TensorFlow 2.0 to productizing them in CoreML, or prototype a model or an app in CoreML then research its
|
||||
hyperparameters or architecture from PyTorch or TensorFlow 2.0. Super exciting!
@ -10,7 +10,7 @@ Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Ali
|
|||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By
|
||||
*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
|
||||
warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
|
||||
benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
|
||||
Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
|
|||
position, respectively, and the attention weights among words are computed using disentangled matrices on their
|
||||
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
|
||||
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
|
||||
of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
|
||||
of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
|
||||
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
|
||||
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
|
||||
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
|
||||
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
@ -18,9 +18,9 @@ operating these large models in on-the-edge and/or under constrained computation
|
|||
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
|
||||
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
|
||||
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
|
||||
knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
|
||||
knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
|
||||
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
|
||||
biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
|
||||
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
|
||||
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
|
||||
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
|
||||
study.*
@ -12,14 +12,14 @@ identify which tokens were replaced by the generator in the sequence.
|
|||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
|
||||
[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
|
||||
*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
|
||||
and then train a model to reconstruct the original tokens. While they produce good results when transferred to
|
||||
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
|
||||
more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
|
||||
more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
|
||||
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
|
||||
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
|
||||
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
|
||||
demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
|
||||
demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
|
||||
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
|
||||
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
|
||||
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
@ -19,7 +19,7 @@ representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018;
|
|||
heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
|
||||
Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
|
||||
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
|
||||
time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
|
||||
time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
|
||||
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
|
||||
community for further reproducible experiments in French NLP.*
@ -14,7 +14,7 @@ The abstract from the paper is the following:
|
|||
*Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
|
||||
semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
|
||||
labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
|
||||
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
|
||||
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
|
||||
language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
|
||||
contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
|
||||
effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
@ -6,19 +6,19 @@ Overview
|
|||
|
||||
The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
|
||||
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
|
||||
Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
|
||||
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
|
||||
information extraction tasks, such as form understanding and receipt understanding.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
|
||||
widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
|
||||
widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
|
||||
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
|
||||
the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
|
||||
which is beneficial for a great number of real-world document image understanding tasks such as information extraction
|
||||
from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
|
||||
LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
|
||||
framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
|
||||
framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
|
||||
including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
|
||||
classification (from 93.07 to 94.42).*
|
||||
|
||||
|
|
|
@ -19,7 +19,7 @@ Encoder Representations from Transformers) framework to learn these vision-and-l
|
|||
build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
|
||||
encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
|
||||
semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
|
||||
pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
|
||||
pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
|
||||
cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
|
||||
cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
|
||||
results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
|
||||
|
|
|
@ -13,7 +13,7 @@ The MBart model was presented in `Multilingual Denoising Pre-training for Neural
|
|||
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
||||
|
||||
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
|
||||
corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
|
||||
corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
|
||||
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
|
||||
on the encoder, decoder, or reconstructing parts of the text.
|
||||
|
||||
|
|
|
@ -17,7 +17,7 @@ the next token.
|
|||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
|
||||
*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
|
||||
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
|
||||
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
|
||||
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
|
||||
|
@ -25,7 +25,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
|
|||
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
|
||||
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
|
||||
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
|
||||
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
|
||||
state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
|
||||
|
||||
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
|
||||
|
||||
|
|
|
@ -17,7 +17,7 @@ The abstract from the paper is the following:
|
|||
task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
|
||||
has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
|
||||
transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
|
||||
text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
|
||||
text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer
|
||||
approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
|
||||
with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
|
||||
summarization, question answering, text classification, and more. To facilitate future work on transfer learning for
|
||||
|
|
|
@ -19,7 +19,7 @@ just the next token. Its architecture is identical to ProphetNet, but the model
|
|||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
|
||||
*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
|
||||
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
|
||||
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
|
||||
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
|
||||
|
@ -27,7 +27,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
|
|||
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
|
||||
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
|
||||
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
|
||||
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
|
||||
state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
|
||||
|
||||
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
|
||||
|
||||
|
|
|
@ -527,7 +527,7 @@ Pegasus
|
|||
<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
|
||||
|
||||
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
|
||||
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
|
||||
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
|
||||
objective, called Gap Sentence Generation (GSG).
|
||||
|
||||
* MLM: encoder input tokens are randomly replaced by a mask token and have to be predicted by the encoder (like in
|
||||
|
@ -609,7 +609,7 @@ MT5
|
|||
`mT5: A massively multilingual pre-trained text-to-text transformer <https://arxiv.org/abs/2010.11934>`_, Linting Xue
|
||||
et al.
|
||||
|
||||
The model architecture is the same as T5. mT5's pre-training objective includes T5's self-supervised training, but not T5's
The model architecture is the same as T5. mT5's pretraining objective includes T5's self-supervised training, but not T5's
|
||||
supervised training. mT5 is trained on 101 languages.
|
||||
|
||||
The library provides a version of this model for conditional generation.
|
||||
|
@ -630,8 +630,8 @@ MBart
|
|||
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
|
||||
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
||||
|
||||
The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
|
||||
for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
|
||||
The model architecture and pretraining objective is same as BART, but MBart is trained on 25 languages and is intended
|
||||
for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a complete
|
||||
sequence-to-sequence model by denoising full texts in multiple languages,
|
||||
|
||||
The library provides a version of this model for conditional generation.
|
||||
|
@ -658,7 +658,7 @@ ProphetNet
|
|||
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
|
||||
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
|
||||
|
||||
ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
|
||||
ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
|
||||
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
|
||||
time step instead of just the single next token. The future n-gram prediction explicitly encourages the model
|
||||
to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
|
||||
|
@ -683,8 +683,8 @@ XLM-ProphetNet
|
|||
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
|
||||
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
|
||||
|
||||
XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
|
||||
pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
|
||||
XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained
|
||||
on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
|
||||
|
||||
The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
|
||||
versions for headline generation and question generation, respectively.
|
||||
|
|
|
@ -305,7 +305,7 @@ Language modeling is the task of fitting a model to a corpus, which can be domai
|
|||
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
|
||||
GPT-2 with causal language modeling.
|
||||
|
||||
Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
|
||||
Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
|
||||
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
|
||||
on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
|
||||
|
||||
|
|
|
@ -55,8 +55,6 @@ class PretrainedConfig(object):
|
|||
Whether or not the model should return all hidden-states.
|
||||
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not the model should return all attentions.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
return_dict (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return a :class:`~transformers.file_utils.ModelOutput` instead of a plain
|
||||
tuple.
|
||||
|
@ -168,7 +166,6 @@ class PretrainedConfig(object):
|
|||
self.return_dict = kwargs.pop("return_dict", True)
|
||||
self.output_hidden_states = kwargs.pop("output_hidden_states", False)
|
||||
self.output_attentions = kwargs.pop("output_attentions", False)
|
||||
self.use_cache = kwargs.pop("use_cache", True) # Not used by all models
|
||||
self.torchscript = kwargs.pop("torchscript", False) # Only used by PyTorch models
|
||||
self.use_bfloat16 = kwargs.pop("use_bfloat16", False)
|
||||
self.pruned_heads = kwargs.pop("pruned_heads", {})
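A minimal sketch of how these popped defaults surface on a config instance, instantiating the base class directly purely for illustration:

    from transformers import PretrainedConfig

    # Unset kwargs fall back to the defaults popped above; use_cache stays True
    # unless explicitly overridden.
    config = PretrainedConfig(output_attentions=True)
    print(config.use_cache)           # True (default)
    print(config.output_attentions)   # True (explicitly set)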
@ -229,7 +229,7 @@ class LineByLineWithSOPTextDataset(Dataset):
|
|||
# to `block_size` anyways, so short sequences are generally wasted
|
||||
# computation. However, we *sometimes*
|
||||
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
|
||||
# sequences to minimize the mismatch between pre-training and fine-tuning.
|
||||
# sequences to minimize the mismatch between pretraining and fine-tuning.
|
||||
# The `target_seq_length` is just a rough target however, whereas
|
||||
# `block_size` is a hard limit.
|
||||
target_seq_length = max_num_tokens
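A rough illustration of the sampling described in the comment above; the names and values are placeholders, not the dataset's actual fields:

    import random

    # With probability short_seq_prob, aim for a shorter sequence so that
    # pretraining occasionally sees variable-length inputs, as fine-tuning does.
    short_seq_prob = 0.1
    max_num_tokens = 510
    target_seq_length = max_num_tokens
    if random.random() < short_seq_prob:
        target_seq_length = random.randint(2, max_num_tokens)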
@ -425,7 +425,7 @@ class TextDatasetForNextSentencePrediction(Dataset):
|
|||
# to `block_size` anyways, so short sequences are generally wasted
|
||||
# computation. However, we *sometimes*
|
||||
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
|
||||
# sequences to minimize the mismatch between pre-training and fine-tuning.
|
||||
# sequences to minimize the mismatch between pretraining and fine-tuning.
|
||||
# The `target_seq_length` is just a rough target however, whereas
|
||||
# `block_size` is a hard limit.
|
||||
target_seq_length = max_num_tokens
|
||||
|
|
|
@ -38,6 +38,7 @@ class TFGenerationMixin:
|
|||
|
||||
def _use_cache(self, outputs, use_cache):
|
||||
"""During generation, decide whether to pass the `past` variable to the next forward pass."""
|
||||
use_cache = getattr(self.config, "use_cache", False)
|
||||
if len(outputs) <= 1 or use_cache is False:
|
||||
return False
|
||||
if hasattr(self.config, "mem_len") and self.config.mem_len == 0:
|
||||
|
@ -194,7 +195,6 @@ class TFGenerationMixin:
|
|||
min_length = min_length if min_length is not None else self.config.min_length
|
||||
do_sample = do_sample if do_sample is not None else self.config.do_sample
|
||||
early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
|
||||
use_cache = use_cache if use_cache is not None else self.config.use_cache
|
||||
num_beams = num_beams if num_beams is not None else self.config.num_beams
|
||||
temperature = temperature if temperature is not None else self.config.temperature
|
||||
top_k = top_k if top_k is not None else self.config.top_k
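A hedged usage sketch of the fallback logic above; the model, tokenizer and input_ids are assumed to exist, and omitted arguments fall back to model.config:

    # Explicit arguments take precedence over the config defaults resolved above.
    generated = model.generate(
        input_ids,
        max_length=40,
        num_beams=4,
        use_cache=True,
    )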
@ -224,7 +224,6 @@ class TFGenerationMixin:
|
|||
assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer."
|
||||
assert isinstance(do_sample, bool), "`do_sample` should be a boolean."
|
||||
assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean."
|
||||
assert isinstance(use_cache, bool), "`use_cache` should be a boolean."
|
||||
assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer."
|
||||
assert temperature > 0, "`temperature` should be strictly positive."
|
||||
assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer."
|
||||
|
|
|
@ -462,7 +462,6 @@ class GenerationMixin:
|
|||
pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
|
||||
bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
|
||||
eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
|
||||
use_cache = use_cache if use_cache is not None else self.config.use_cache
|
||||
|
||||
if input_ids is None:
|
||||
# init `input_ids` with bos_token_id
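A minimal sketch of that fallback, assuming a plain PyTorch tensor and a placeholder BOS id rather than the library's exact code:

    import torch

    bos_token_id = 0  # placeholder; normally read from self.config.bos_token_id
    # start generation from a single BOS token per sequence
    input_ids = torch.full((1, 1), bos_token_id, dtype=torch.long)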
@ -730,7 +730,7 @@ class AlbertModel(AlbertPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a
|
||||
Albert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
|
||||
`sentence order prediction (classification)` head.
|
||||
""",
|
||||
ALBERT_START_DOCSTRING,
|
||||
|
|
|
@ -809,7 +809,7 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Albert Model with two heads on top for pre-training: a `masked language modeling` head and a `sentence order
|
||||
Albert Model with two heads on top for pretraining: a `masked language modeling` head and a `sentence order
|
||||
prediction` (classification) head.
|
||||
""",
|
||||
ALBERT_START_DOCSTRING,
|
||||
|
|
|
@ -108,6 +108,8 @@ class BartConfig(PretrainedConfig):
|
|||
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only
|
||||
:obj:`True` for `bart-large-cnn`.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
"""
|
||||
model_type = "bart"
|
||||
keys_to_ignore_at_inference = ["past_key_values"]
|
||||
|
@ -134,9 +136,6 @@ class BartConfig(PretrainedConfig):
|
|||
classifier_dropout=0.0,
|
||||
num_labels=3,
|
||||
is_encoder_decoder=True,
|
||||
pad_token_id=1,
|
||||
bos_token_id=0,
|
||||
eos_token_id=2,
|
||||
normalize_before=False,
|
||||
add_final_layer_norm=False,
|
||||
do_blenderbot_90_layernorm=False,
|
||||
|
@ -145,6 +144,10 @@ class BartConfig(PretrainedConfig):
|
|||
static_position_embeddings=False,
|
||||
add_bias_logits=False,
|
||||
force_bos_token_to_be_generated=False,
|
||||
use_cache=True,
|
||||
pad_token_id=1,
|
||||
bos_token_id=0,
|
||||
eos_token_id=2,
|
||||
**common_kwargs
|
||||
):
|
||||
r"""
|
||||
|
@ -208,6 +211,8 @@ class BartConfig(PretrainedConfig):
|
|||
|
||||
self.do_blenderbot_90_layernorm = do_blenderbot_90_layernorm
|
||||
|
||||
self.use_cache = use_cache
|
||||
|
||||
@property
|
||||
def num_attention_heads(self) -> int:
|
||||
return self.encoder_attention_heads
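A small example of the flag added above, assuming the public BartConfig API:

    from transformers import BartConfig

    # Disable the decoder key/value cache through the config.
    config = BartConfig(use_cache=False)
    assert config.use_cache is False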
@ -888,7 +888,7 @@ class BertModel(BertPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a `next
|
||||
Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next
|
||||
sentence prediction (classification)` head.
|
||||
""",
|
||||
BERT_START_DOCSTRING,
|
||||
|
|
|
@ -90,7 +90,7 @@ TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
|
|||
|
||||
class TFBertPreTrainingLoss:
|
||||
"""
|
||||
Loss function suitable for BERT-like pre-training, that is, the task of pretraining a language model by combining
|
||||
Loss function suitable for BERT-like pretraining, that is, the task of pretraining a language model by combining
|
||||
NSP + MLM. .. note:: Any label of -100 will be ignored (along with the corresponding logits) in the loss
|
||||
computation.
|
||||
"""
@ -878,7 +878,7 @@ class TFBertModel(TFBertPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Bert Model with two heads on top as done during the pre-training:
|
||||
Bert Model with two heads on top as done during the pretraining:
|
||||
a `masked language modeling` head and a `next sentence prediction (classification)` head.
|
||||
""",
|
||||
BERT_START_DOCSTRING,
|
||||
|
|
|
@ -80,7 +80,7 @@ class BertweetTokenizer(PreTrainedTokenizer):
|
|||
normalization (:obj:`bool`, `optional`, defaults to :obj:`False`)
|
||||
Whether or not to apply a normalization preprocess.
|
||||
bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
|
||||
The beginning of sequence token that was used during pre-training. Can be used as a sequence classifier token.
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
|
||||
|
||||
.. note::
|
||||
|
||||
|
|
|
@ -61,6 +61,9 @@ class CTRLConfig(PretrainedConfig):
|
|||
The epsilon to use in the layer normalization layers
|
||||
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
|
||||
|
||||
Examples::
|
||||
|
||||
|
@ -98,6 +101,7 @@ class CTRLConfig(PretrainedConfig):
|
|||
summary_activation=None,
|
||||
summary_proj_to_labels=True,
|
||||
summary_first_dropout=0.1,
|
||||
use_cache=True,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
|
@ -119,6 +123,7 @@ class CTRLConfig(PretrainedConfig):
|
|||
self.summary_activation = summary_activation
|
||||
self.summary_first_dropout = summary_first_dropout
|
||||
self.summary_proj_to_labels = summary_proj_to_labels
|
||||
self.use_cache = use_cache
|
||||
|
||||
@property
|
||||
def max_position_embeddings(self):
|
||||
|
|
|
@ -772,7 +772,7 @@ DEBERTA_START_DOCSTRING = r"""
|
|||
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
||||
<https://arxiv.org/abs/2006.03654>`_ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It's built on top of
BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two
improvements, it outperforms BERT/RoBERTa on a majority of tasks with 80GB pre-training data.
improvements, it outperforms BERT/RoBERTa on a majority of tasks with 80GB pretraining data.
|
||||
|
||||
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
|
||||
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
|
||||
|
|
|
@ -891,8 +891,7 @@ class ElectraForSequenceClassification(ElectraPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Electra model with a binary classification head on top as used during pre-training for identifying generated
|
||||
tokens.
|
||||
Electra model with a binary classification head on top as used during pretraining for identifying generated tokens.
|
||||
|
||||
It is recommended to load the discriminator checkpoint into that model.
|
||||
""",
|
||||
|
|
|
@ -789,8 +789,7 @@ class TFElectraModel(TFElectraPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Electra model with a binary classification head on top as used during pre-training for identifying generated
|
||||
tokens.
|
||||
Electra model with a binary classification head on top as used during pretraining for identifying generated tokens.
|
||||
|
||||
Even though both the discriminator and generator may be loaded into this model, the discriminator is the only model
|
||||
of the two to have the correct classification head to be used for this model.
|
||||
|
|
|
@ -109,6 +109,8 @@ class FSMTConfig(PretrainedConfig):
|
|||
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
|
||||
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop the beam
|
||||
search when at least ``num_beams`` sentences are finished per batch or not.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
|
||||
Examples::
|
||||
|
||||
|
@ -142,9 +144,6 @@ class FSMTConfig(PretrainedConfig):
|
|||
dropout=0.1,
|
||||
activation_dropout=0.0,
|
||||
init_std=0.02,
|
||||
pad_token_id=1,
|
||||
bos_token_id=0,
|
||||
eos_token_id=2,
|
||||
decoder_start_token_id=2,
|
||||
is_encoder_decoder=True,
|
||||
scale_embedding=True,
|
||||
|
@ -152,6 +151,10 @@ class FSMTConfig(PretrainedConfig):
|
|||
num_beams=5,
|
||||
length_penalty=1.0,
|
||||
early_stopping=False,
|
||||
use_cache=True,
|
||||
pad_token_id=1,
|
||||
bos_token_id=0,
|
||||
eos_token_id=2,
|
||||
**common_kwargs
|
||||
):
|
||||
if "hidden_size" in common_kwargs:
|
||||
|
@ -196,6 +199,8 @@ class FSMTConfig(PretrainedConfig):
|
|||
self.activation_dropout = activation_dropout
|
||||
self.dropout = dropout
|
||||
|
||||
self.use_cache = use_cache
|
||||
|
||||
@property
|
||||
def num_attention_heads(self) -> int:
|
||||
return self.encoder_attention_heads
|
||||
|
|
|
@ -1241,7 +1241,7 @@ class TFFunnelModel(TFFunnelPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Funnel model with a binary classification head on top as used during pre-training for identifying generated tokens.
|
||||
Funnel model with a binary classification head on top as used during pretraining for identifying generated tokens.
|
||||
""",
|
||||
FUNNEL_START_DOCSTRING,
|
||||
)
|
||||
|
|
|
@ -104,6 +104,8 @@ class GPT2Config(PretrainedConfig):
|
|||
The dropout ratio to be used after the projection and activation.
|
||||
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
|
||||
Example::
|
||||
|
||||
|
@ -142,9 +144,10 @@ class GPT2Config(PretrainedConfig):
|
|||
summary_activation=None,
|
||||
summary_proj_to_labels=True,
|
||||
summary_first_dropout=0.1,
|
||||
gradient_checkpointing=False,
|
||||
use_cache=True,
|
||||
bos_token_id=50256,
|
||||
eos_token_id=50256,
|
||||
gradient_checkpointing=False,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
|
||||
|
@ -168,6 +171,7 @@ class GPT2Config(PretrainedConfig):
|
|||
self.summary_first_dropout = summary_first_dropout
|
||||
self.summary_proj_to_labels = summary_proj_to_labels
|
||||
self.gradient_checkpointing = gradient_checkpointing
|
||||
self.use_cache = use_cache
|
||||
|
||||
self.bos_token_id = bos_token_id
|
||||
self.eos_token_id = eos_token_id
|
||||
|
|
|
@ -1013,7 +1013,7 @@ class LxmertModel(LxmertPreTrainedModel):
|
|||
|
||||
|
||||
@add_start_docstrings(
|
||||
"""Lxmert Model with a specified pre-training head on top. """,
|
||||
"""Lxmert Model with a specified pretraining head on top. """,
|
||||
LXMERT_START_DOCSTRING,
|
||||
)
|
||||
class LxmertForPreTraining(LxmertPreTrainedModel):
|
||||
|
@ -1024,7 +1024,7 @@ class LxmertForPreTraining(LxmertPreTrainedModel):
|
|||
self.num_qa_labels = config.num_qa_labels
|
||||
self.visual_loss_normalizer = config.visual_loss_normalizer
|
||||
|
||||
# Use of pre-training tasks
|
||||
# Use of pretraining tasks
|
||||
self.task_mask_lm = config.task_mask_lm
|
||||
self.task_obj_predict = config.task_obj_predict
|
||||
self.task_matched = config.task_matched
|
||||
|
|
|
@ -1176,7 +1176,7 @@ class TFLxmertForPreTraining(TFLxmertPreTrainedModel):
|
|||
self.num_qa_labels = config.num_qa_labels
|
||||
self.visual_loss_normalizer = config.visual_loss_normalizer
|
||||
|
||||
# Use of pre-training tasks
|
||||
# Use of pretraining tasks
|
||||
self.task_mask_lm = config.task_mask_lm
|
||||
self.task_obj_predict = config.task_obj_predict
|
||||
self.task_matched = config.task_matched
|
||||
|
|
|
@ -933,7 +933,7 @@ class MobileBertModel(MobileBertPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
MobileBert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a
|
||||
MobileBert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
|
||||
`next sentence prediction (classification)` head.
|
||||
""",
|
||||
MOBILEBERT_START_DOCSTRING,
|
||||
|
|
|
@ -1014,7 +1014,7 @@ class TFMobileBertModel(TFMobileBertPreTrainedModel):
|
|||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
MobileBert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a
|
||||
MobileBert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
|
||||
`next sentence prediction (classification)` head.
|
||||
""",
|
||||
MOBILEBERT_START_DOCSTRING,
|
||||
|
|
|
@ -96,6 +96,9 @@ class OpenAIGPTConfig(PretrainedConfig):
|
|||
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
|
||||
|
||||
The dropout ratio to be used after the projection and activation.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
|
||||
|
||||
Examples::
|
||||
|
||||
|
@ -133,6 +136,7 @@ class OpenAIGPTConfig(PretrainedConfig):
|
|||
summary_activation=None,
|
||||
summary_proj_to_labels=True,
|
||||
summary_first_dropout=0.1,
|
||||
use_cache=True,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
|
@ -155,6 +159,7 @@ class OpenAIGPTConfig(PretrainedConfig):
|
|||
self.summary_activation = summary_activation
|
||||
self.summary_first_dropout = summary_first_dropout
|
||||
self.summary_proj_to_labels = summary_proj_to_labels
|
||||
self.use_cache = use_cache
|
||||
|
||||
@property
|
||||
def max_position_embeddings(self):
|
||||
|
|
|
@ -90,6 +90,8 @@ class ProphetNetConfig(PretrainedConfig):
|
|||
eps (:obj:`float`, `optional`, defaults to 0.0):
|
||||
Controls the ``epsilon`` parameter value for label smoothing in the loss calculation. If set to 0, no label
|
||||
smoothing is performed.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
"""
|
||||
model_type = "prophetnet"
|
||||
keys_to_ignore_at_inference = ["past_key_values"]
|
||||
|
@ -112,15 +114,16 @@ class ProphetNetConfig(PretrainedConfig):
|
|||
init_std=0.02,
|
||||
is_encoder_decoder=True,
|
||||
add_cross_attention=True,
|
||||
pad_token_id=0,
|
||||
bos_token_id=1,
|
||||
eos_token_id=2,
|
||||
decoder_start_token_id=0,
|
||||
ngram=2,
|
||||
num_buckets=32,
|
||||
relative_max_distance=128,
|
||||
disable_ngram_loss=False,
|
||||
eps=0.0,
|
||||
use_cache=True,
|
||||
pad_token_id=0,
|
||||
bos_token_id=1,
|
||||
eos_token_id=2,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(
|
||||
|
@ -156,6 +159,8 @@ class ProphetNetConfig(PretrainedConfig):
|
|||
self.activation_dropout = activation_dropout
|
||||
self.dropout = dropout
|
||||
|
||||
self.use_cache = use_cache
|
||||
|
||||
@property
|
||||
def num_attention_heads(self) -> int:
|
||||
return self.num_encoder_attention_heads
|
||||
|
|
|
@ -72,6 +72,8 @@ RAG_CONFIG_DOC = r"""
|
|||
output_retrieved(:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
If set to ``True``, :obj:`retrieved_doc_embeds`, :obj:`retrieved_doc_ids`, :obj:`context_input_ids` and
|
||||
:obj:`context_attention_mask` are returned. See returned tensors for more detail.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
"""
|
||||
|
||||
|
||||
|
@ -107,6 +109,7 @@ class RagConfig(PretrainedConfig):
|
|||
exclude_bos_score=False,
|
||||
do_marginalize=False,
|
||||
output_retrieved=False,
|
||||
use_cache=True,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(
|
||||
|
@ -156,6 +159,8 @@ class RagConfig(PretrainedConfig):
|
|||
|
||||
self.do_deduplication = do_deduplication
|
||||
|
||||
self.use_cache = use_cache
|
||||
|
||||
@classmethod
|
||||
def from_question_encoder_generator_configs(
|
||||
cls, question_encoder_config: PretrainedConfig, generator_config: PretrainedConfig, **kwargs
|
||||
|
|
|
@ -138,6 +138,8 @@ class ReformerConfig(PretrainedConfig):
|
|||
:obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`.
|
||||
tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether to tie input and output embeddings.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
|
||||
Examples::
|
||||
|
||||
|
@ -188,6 +190,7 @@ class ReformerConfig(PretrainedConfig):
|
|||
pad_token_id=0,
|
||||
vocab_size=320,
|
||||
tie_word_embeddings=False,
|
||||
use_cache=True,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(
|
||||
|
@ -226,3 +229,4 @@ class ReformerConfig(PretrainedConfig):
|
|||
self.axial_norm_std = axial_norm_std
|
||||
self.chunk_size_lm_head = chunk_size_lm_head
|
||||
self.attn_layers = attn_layers
|
||||
self.use_cache = use_cache
|
||||
|
|
|
@ -69,6 +69,8 @@ class T5Config(PretrainedConfig):
|
|||
feed_forward_proj (:obj:`string`, `optional`, defaults to :obj:`"relu"`):
|
||||
Type of feed forward layer to be used. Should be one of :obj:`"relu"` or :obj:`"gated-gelu"`. T5v1.1 uses
|
||||
the :obj:`"gated-gelu"` feed forward projection. Original T5 uses :obj:`"relu"`.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
"""
|
||||
model_type = "t5"
|
||||
keys_to_ignore_at_inference = ["past_key_values"]
|
||||
|
@ -88,6 +90,7 @@ class T5Config(PretrainedConfig):
|
|||
initializer_factor=1.0,
|
||||
feed_forward_proj="relu",
|
||||
is_encoder_decoder=True,
|
||||
use_cache=True,
|
||||
pad_token_id=0,
|
||||
eos_token_id=1,
|
||||
**kwargs
|
||||
|
@ -112,6 +115,7 @@ class T5Config(PretrainedConfig):
|
|||
self.layer_norm_epsilon = layer_norm_epsilon
|
||||
self.initializer_factor = initializer_factor
|
||||
self.feed_forward_proj = feed_forward_proj
|
||||
self.use_cache = use_cache
|
||||
|
||||
@property
|
||||
def hidden_size(self):
|
||||
|
|
|
@ -884,7 +884,7 @@ T5_INPUTS_DOCSTRING = r"""
|
|||
:func:`transformers.PreTrainedTokenizer.__call__` and :func:`transformers.PreTrainedTokenizer.encode` for
|
||||
details.
|
||||
|
||||
To know more on how to prepare :obj:`inputs` for pre-training take a look at `T5 Training
|
||||
To know more on how to prepare :obj:`inputs` for pretraining take a look at `T5 Training
|
||||
<./t5.html#training>`__.
|
||||
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
|
||||
Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for
|
||||
|
|
|
@ -15,6 +15,8 @@
|
|||
# limitations under the License.
|
||||
""" XLNet configuration """
|
||||
|
||||
import warnings
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
@ -106,12 +108,18 @@ class XLNetConfig(PretrainedConfig):
|
|||
Used in the SQuAD evaluation script.
|
||||
end_n_top (:obj:`int`, `optional`, defaults to 5):
|
||||
Used in the SQuAD evaluation script.
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should return the last pre-computed hidden states.
|
||||
use_mems_eval (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not the model should make use of the recurrent memory mechanism in evaluation mode.
|
||||
use_mems_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not the model should make use of the recurrent memory mechanism in train mode.
|
||||
|
||||
.. note::
|
||||
This flag behaves differently from with other models: it just controls the inference behavior, during
|
||||
training the model always uses ``use_cache=True``.
|
||||
For pretraining, it is recommended to set ``use_mems_train`` to :obj:`True`. For fine-tuning, it is
|
||||
recommended to set ``use_mems_train`` to :obj:`False` as discussed `here
|
||||
<https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587>`__. If ``use_mems_train`` is set
|
||||
to :obj:`True`, one has to make sure that the train batches are correctly pre-processed, `e.g.`
|
||||
:obj:`batch_1 = [[This line is], [This is the]]` and :obj:`batch_2 = [[ the first line], [ second
|
||||
line]]` and that all batches are of equal size.
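A short sketch of the recommended settings described above; the flag names follow the documentation, the values are illustrative:

    from transformers import XLNetConfig

    # Memory on for evaluation/generation, off for fine-tuning.
    config = XLNetConfig(mem_len=512, use_mems_eval=True, use_mems_train=False)

    # If use_mems_train were enabled, consecutive batches would have to continue
    # each other, for example:
    batch_1 = [["This line is"], ["This is the"]]
    batch_2 = [[" the first line"], [" second line"]]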
Examples::
|
||||
|
||||
|
@ -145,6 +153,8 @@ class XLNetConfig(PretrainedConfig):
|
|||
dropout=0.1,
|
||||
mem_len=512,
|
||||
reuse_len=None,
|
||||
use_mems_eval=True,
|
||||
use_mems_train=False,
|
||||
bi_data=False,
|
||||
clamp_len=-1,
|
||||
same_length=False,
|
||||
|
@ -197,6 +207,16 @@ class XLNetConfig(PretrainedConfig):
|
|||
self.pad_token_id = pad_token_id
|
||||
self.eos_token_id = eos_token_id
|
||||
|
||||
if "use_cache" in kwargs:
|
||||
warnings.warn(
|
||||
"The `use_cache` argument is deprecated and will be removed in a future version, use `use_mems_eval` instead.",
|
||||
FutureWarning,
|
||||
)
|
||||
use_mems_eval = kwargs["use_cache"]
|
||||
|
||||
self.use_mems_eval = use_mems_eval
|
||||
self.use_mems_train = use_mems_train
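A sketch of the backward-compatibility path above: the deprecated kwarg still works but now drives use_mems_eval and emits a FutureWarning:

    from transformers import XLNetConfig

    config = XLNetConfig(use_cache=False)   # emits a FutureWarning
    print(config.use_mems_eval)             # False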
@property
|
||||
def max_position_embeddings(self):
|
||||
return -1
|
||||
|
|
|
@ -440,6 +440,9 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||
self.layer = [TFXLNetLayer(config, name="layer_._{}".format(i)) for i in range(config.n_layer)]
|
||||
self.dropout = tf.keras.layers.Dropout(config.dropout)
|
||||
|
||||
self.use_mems_eval = config.use_mems_eval
|
||||
self.use_mems_train = config.use_mems_train
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.word_embedding
|
||||
|
||||
|
@ -489,14 +492,23 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||
return ret
|
||||
|
||||
def cache_mem(self, curr_out, prev_mem):
|
||||
"""cache hidden states into memory."""
|
||||
# cache hidden states into memory.
|
||||
if self.reuse_len is not None and self.reuse_len > 0:
|
||||
curr_out = curr_out[: self.reuse_len]
|
||||
|
||||
if prev_mem is None:
|
||||
new_mem = curr_out[-self.mem_len :]
|
||||
if self.mem_len is None or self.mem_len == 0:
|
||||
# If :obj:`use_mems` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
|
||||
# and returns all of the past and current hidden states.
|
||||
cutoff = 0
|
||||
else:
|
||||
new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len :]
|
||||
# If :obj:`use_mems` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
|
||||
# states. This is the preferred setting for training and long-form generation.
|
||||
cutoff = -self.mem_len
|
||||
if prev_mem is None:
|
||||
# if :obj:`use_mems` is active and `mem_len` is defined, the model
|
||||
new_mem = curr_out[cutoff:]
|
||||
else:
|
||||
new_mem = tf.concat([prev_mem, curr_out], 0)[cutoff:]
|
||||
|
||||
return tf.stop_gradient(new_mem)
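A toy illustration of the cutoff logic above, with made-up shapes:

    import tensorflow as tf

    prev_mem = tf.zeros((2, 1, 8))   # 2 cached positions
    curr_out = tf.ones((3, 1, 8))    # 3 new positions
    mem_len = 3
    new_mem = tf.concat([prev_mem, curr_out], axis=0)[-mem_len:]
    print(new_mem.shape)             # (3, 1, 8): only the last mem_len states are kept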
@ -569,7 +581,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||
input_mask=None,
|
||||
head_mask=None,
|
||||
inputs_embeds=None,
|
||||
use_cache=True,
|
||||
use_mems=None,
|
||||
output_attentions=None,
|
||||
output_hidden_states=None,
|
||||
return_dict=None,
|
||||
|
@ -587,7 +599,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||
input_mask=input_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
use_mems=use_mems,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
|
@ -602,6 +614,11 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||
)
|
||||
return_dict = inputs["return_dict"] if inputs["return_dict"] is not None else self.return_dict
|
||||
|
||||
if training:
|
||||
use_mems = use_mems if use_mems is not None else self.use_mems_train
|
||||
else:
|
||||
use_mems = use_mems if use_mems is not None else self.use_mems_eval
|
||||
|
||||
# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
|
||||
# but we want a unified interface in the library with the batch size on the first dimension
|
||||
# so we move here the first dimension (batch) to the end
|
||||
|
@ -737,7 +754,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||
hidden_states = [] if output_hidden_states else None
|
||||
for i, layer_module in enumerate(self.layer):
|
||||
# cache new mems
|
||||
if self.mem_len is not None and self.mem_len > 0 and use_cache:
|
||||
if use_mems:
|
||||
new_mems = new_mems + (self.cache_mem(output_h, inputs["mems"][i]),)
|
||||
if output_hidden_states:
|
||||
hidden_states.append((output_h, output_g) if output_g is not None else output_h)
|
||||
|
@ -768,7 +785,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
||||
output = tf.transpose(output, perm=(1, 0, 2))
|
||||
|
||||
if not (self.mem_len is not None and self.mem_len > 0 and use_cache):
|
||||
if not use_mems:
|
||||
new_mems = None
|
||||
if output_hidden_states:
|
||||
if output_g is not None:
|
||||
|
@ -1066,7 +1083,7 @@ XLNET_INPUTS_DOCSTRING = r"""
|
|||
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
|
||||
as they have already been computed.
|
||||
|
||||
:obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`.
:obj:`use_mems` has to be set to :obj:`True` to make use of :obj:`mems`.
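A hedged usage sketch of the mems/use_mems interplay; the model and the tokenized segments are assumed to exist:

    # Feed segments one after another, reusing the returned memory.
    out_1 = model(segment_1_ids, use_mems=True)
    out_2 = model(segment_2_ids, mems=out_1.mems, use_mems=True)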
|
||||
perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
|
||||
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
|
||||
|
||||
|
@ -1147,7 +1164,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
|
|||
input_mask=None,
|
||||
head_mask=None,
|
||||
inputs_embeds=None,
|
||||
use_cache=True,
|
||||
use_mems=None,
|
||||
output_attentions=None,
|
||||
output_hidden_states=None,
|
||||
return_dict=None,
|
||||
|
@ -1165,7 +1182,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
|
|||
input_mask=input_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
use_mems=use_mems,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
|
@ -1182,7 +1199,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
|
|||
input_mask=inputs["input_mask"],
|
||||
head_mask=inputs["head_mask"],
|
||||
inputs_embeds=inputs["inputs_embeds"],
|
||||
use_cache=inputs["use_cache"],
|
||||
use_mems=inputs["use_mems"],
|
||||
output_attentions=inputs["output_attentions"],
|
||||
output_hidden_states=inputs["output_hidden_states"],
|
||||
return_dict=inputs["return_dict"],
|
||||
|
@ -1207,7 +1224,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||
def get_output_embeddings(self):
|
||||
return self.lm_loss.input_embeddings
|
||||
|
||||
def prepare_inputs_for_generation(self, inputs, past, **kwargs):
|
||||
def prepare_inputs_for_generation(self, inputs, past, use_mems=None, **kwargs):
|
||||
# Add dummy token at the end (no attention on this one)
|
||||
|
||||
# At every pass, the attention values for the new token and the two last generated tokens
|
||||
|
@ -1238,7 +1255,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||
"input_ids": inputs,
|
||||
"perm_mask": perm_mask,
|
||||
"target_mapping": target_mapping,
|
||||
"use_cache": kwargs["use_cache"],
|
||||
"use_mems": kwargs.get("use_mems"),
|
||||
}
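For the dummy-token trick mentioned above, a rough sketch with illustrative shapes, not the exact library code:

    import tensorflow as tf

    inputs = tf.constant([[10, 11, 12]])
    dummy = tf.zeros_like(inputs[:, :1])                    # one dummy position per row
    inputs_with_dummy = tf.concat([inputs, dummy], axis=1)  # shape (1, 4)
    # the permutation mask built during generation keeps the model from attending to it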
# if past is defined in model kwargs then use it for faster decoding
|
||||
|
@ -1260,7 +1277,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||
input_mask=None,
|
||||
head_mask=None,
|
||||
inputs_embeds=None,
|
||||
use_cache=True,
|
||||
use_mems=None,
|
||||
output_attentions=None,
|
||||
output_hidden_states=None,
|
||||
return_dict=None,
|
||||
|
@ -1309,7 +1326,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||
input_mask=input_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
use_mems=use_mems,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
|
@ -1328,7 +1345,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||
input_mask=inputs["input_mask"],
|
||||
head_mask=inputs["head_mask"],
|
||||
inputs_embeds=inputs["inputs_embeds"],
|
||||
use_cache=inputs["use_cache"],
|
||||
use_mems=inputs["use_mems"],
|
||||
output_attentions=inputs["output_attentions"],
|
||||
output_hidden_states=inputs["output_hidden_states"],
|
||||
return_dict=return_dict,
|
||||
|
@ -1395,7 +1412,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
|
|||
input_mask=None,
|
||||
head_mask=None,
|
||||
inputs_embeds=None,
|
||||
use_cache=True,
|
||||
use_mems=None,
|
||||
output_attentions=None,
|
||||
output_hidden_states=None,
|
||||
return_dict=None,
|
||||
|
@ -1420,7 +1437,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
|
|||
input_mask=input_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
use_mems=use_mems,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
|
@ -1439,7 +1456,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
|
|||
input_mask=inputs["input_mask"],
|
||||
head_mask=inputs["head_mask"],
|
||||
inputs_embeds=inputs["inputs_embeds"],
|
||||
use_cache=inputs["use_cache"],
|
||||
use_mems=inputs["use_mems"],
|
||||
output_attentions=inputs["output_attentions"],
|
||||
output_hidden_states=inputs["output_hidden_states"],
|
||||
return_dict=return_dict,
|
||||
|
@@ -1512,7 +1529,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
target_mapping=None,
head_mask=None,
inputs_embeds=None,
use_cache=True,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,

@@ -1526,6 +1543,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
num_choices]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
:obj:`input_ids` above)
"""

inputs = input_processing(
func=self.call,
input_ids=input_ids,

@@ -1537,7 +1555,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,

@@ -1579,7 +1597,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
flat_input_mask,
inputs["head_mask"],
flat_inputs_embeds,
inputs["use_cache"],
inputs["use_mems"],
inputs["output_attentions"],
inputs["output_hidden_states"],
return_dict=return_dict,
@@ -1639,7 +1657,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
input_mask=None,
head_mask=None,
inputs_embeds=None,
use_cache=True,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,

@@ -1663,7 +1681,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,

@@ -1682,7 +1700,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
input_mask=inputs["input_mask"],
head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"],
use_cache=inputs["use_cache"],
use_mems=inputs["use_mems"],
output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"],
return_dict=return_dict,
@@ -1739,7 +1757,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
input_mask=None,
head_mask=None,
inputs_embeds=None,
use_cache=True,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,

@@ -1769,7 +1787,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,

@@ -1789,7 +1807,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
input_mask=inputs["input_mask"],
head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"],
use_cache=inputs["use_cache"],
use_mems=inputs["use_mems"],
output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"],
return_dict=return_dict,
@@ -16,6 +16,7 @@
"""
PyTorch XLNet model.
"""
import warnings
from dataclasses import dataclass
from typing import List, Optional, Tuple

@@ -876,7 +877,7 @@ XLNET_INPUTS_DOCSTRING = r"""
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
as they have already been computed.

:obj::obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`.
:obj:`use_mems` has to be set to :obj:`True` to make use of :obj:`mems`.
perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
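The docstring change above concerns reusing `mems`: with `use_mems` enabled, the returned memories can be fed back on the next call so tokens whose past is already cached are not passed again. A minimal sketch, assuming the `xlnet-base-cased` checkpoint and a transformers release containing this commit:

# Reusing XLNet memories across two PyTorch forward passes.
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased", mem_len=512)
model.eval()

first = tokenizer("XLNet caches hidden states as memories.", return_tensors="pt")
with torch.no_grad():
    out1 = model(**first, use_mems=True, return_dict=True)

# Pass the cached states back together with only the new tokens.
second = tokenizer("They speed up long-form decoding.", return_tensors="pt")
with torch.no_grad():
    out2 = model(**second, mems=out1.mems, use_mems=True, return_dict=True)

print(len(out2.mems), out2.mems[0].shape)  # one memory tensor per layer, shaped [cached_len, batch, hidden]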
@@ -997,15 +998,15 @@ class XLNetModel(XLNetPreTrainedModel):
curr_out = curr_out[: self.reuse_len]

if self.mem_len is None or self.mem_len == 0:
# If :obj:`use_cache` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
# If :obj:`use_mems` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
# and returns all of the past and current hidden states.
cutoff = 0
else:
# If :obj:`use_cache` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
# If :obj:`use_mems` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
# states. This is the preferred setting for training and long-form generation.
cutoff = -self.mem_len
if prev_mem is None:
# if :obj:`use_cache` is active and `mem_len` is defined, the model
# if :obj:`use_mems` is active and `mem_len` is defined, the model
new_mem = curr_out[cutoff:]
else:
new_mem = torch.cat([prev_mem, curr_out], dim=0)[cutoff:]
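For reference, the caching rule the comments above describe, written as a small standalone helper (a hypothetical sketch, not the library's `cache_mem` itself): with no `mem_len` the whole history is kept, otherwise only the last `mem_len` positions survive the concatenation.

# Hypothetical helper mirroring the cutoff logic described in the comments above.
import torch

def cache_mem_sketch(curr_out, prev_mem, mem_len=None):
    # curr_out / prev_mem: [seq_len, batch, hidden] hidden states of one layer
    if mem_len is None or mem_len == 0:
        cutoff = 0          # keep everything (GPT-2-like behaviour at inference)
    else:
        cutoff = -mem_len   # keep only the last `mem_len` positions
    if prev_mem is None:
        new_mem = curr_out[cutoff:]
    else:
        new_mem = torch.cat([prev_mem, curr_out], dim=0)[cutoff:]
    return new_mem.detach()

mem = cache_mem_sketch(torch.randn(10, 2, 8), None, mem_len=6)   # -> shape [6, 2, 8]
mem = cache_mem_sketch(torch.randn(5, 2, 8), mem, mem_len=6)     # stays [6, 2, 8]
print(mem.shape)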
@@ -1080,10 +1081,11 @@ class XLNetModel(XLNetPreTrainedModel):
input_mask=None,
head_mask=None,
inputs_embeds=None,
use_cache=None,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs, # delete after depreciation warning is removed
):

output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions

@@ -1091,7 +1093,18 @@ class XLNetModel(XLNetPreTrainedModel):
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)

if "use_cache" in kwargs:
warnings.warn(
"The `use_cache` argument is deprecated and will be removed in a future version, use `use_mems` instead.",
FutureWarning,
)
use_mems = kwargs["use_cache"]

if self.training:
use_mems = use_mems if use_mems is not None else self.config.use_mems_train
else:
use_mems = use_mems if use_mems is not None else self.config.use_mems_eval

# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
# but we want a unified interface in the library with the batch size on the first dimension
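The block above is the backward-compatibility shim: a caller that still passes `use_cache` gets a `FutureWarning` and the value is remapped to `use_mems`; otherwise the default comes from `config.use_mems_train` during training and `config.use_mems_eval` at inference. A minimal sketch with a tiny randomly initialised model (no pretrained weights needed), assuming a release containing this commit:

# The deprecated `use_cache` kwarg still works but warns and maps to `use_mems`.
import warnings
import torch
from transformers import XLNetConfig, XLNetModel

config = XLNetConfig(vocab_size=32, d_model=16, n_layer=2, n_head=2, d_inner=32)
model = XLNetModel(config)
model.eval()

input_ids = torch.randint(0, 32, (1, 5))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = model(input_ids, use_cache=True, return_dict=True)  # old keyword, caught by **kwargs

print(any(issubclass(w.category, FutureWarning) for w in caught))  # True
print(out.mems is not None)                                        # behaves like use_mems=True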
@@ -1222,7 +1235,7 @@ class XLNetModel(XLNetPreTrainedModel):
attentions = [] if output_attentions else None
hidden_states = [] if output_hidden_states else None
for i, layer_module in enumerate(self.layer):
if use_cache:
if use_mems:
# cache new mems
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
if output_hidden_states:

@@ -1253,7 +1266,7 @@ class XLNetModel(XLNetPreTrainedModel):
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
output = output.permute(1, 0, 2).contiguous()

if not use_cache:
if not use_mems:
new_mems = None

if output_hidden_states:
@@ -1299,7 +1312,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
def get_output_embeddings(self):
return self.lm_loss

def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
def prepare_inputs_for_generation(self, input_ids, past=None, use_mems=None, **kwargs):
# Add dummy token at the end (no attention on this one)

effective_batch_size = input_ids.shape[0]

@@ -1332,7 +1345,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
"input_ids": input_ids,
"perm_mask": perm_mask,
"target_mapping": target_mapping,
"use_cache": use_cache,
"use_mems": use_mems,
}

# if past is defined in model kwargs then use it for faster decoding
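The comments and the returned dict above show what generation needs at every step: a dummy token is appended, a `perm_mask` keeps every token from attending to that dummy position, and a `target_mapping` marks it as the position to predict, with `use_mems` carried along. A hand-built sketch of those tensors (hypothetical example values; shapes follow the XLNet docstrings):

# Building one-step generation inputs by hand, mirroring prepare_inputs_for_generation.
import torch

input_ids = torch.tensor([[35, 12, 99]])                  # (batch, seq_len), arbitrary ids
dummy = torch.zeros((1, 1), dtype=torch.long)
input_ids = torch.cat([input_ids, dummy], dim=1)          # append the dummy token
seq_len = input_ids.shape[1]

perm_mask = torch.zeros((1, seq_len, seq_len))            # 1.0 means "may not attend"
perm_mask[:, :, -1] = 1.0                                  # nobody attends to the dummy

target_mapping = torch.zeros((1, 1, seq_len))              # predict only the last position
target_mapping[:, 0, -1] = 1.0

inputs = {
    "input_ids": input_ids,
    "perm_mask": perm_mask,
    "target_mapping": target_mapping,
    "use_mems": True,
}
print({k: v.shape if hasattr(v, "shape") else v for k, v in inputs.items()})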
@@ -1355,10 +1368,11 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
head_mask=None,
inputs_embeds=None,
labels=None,
use_cache=None,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`):

@@ -1407,7 +1421,6 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
>>> next_token_logits = outputs.logits # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)

transformer_outputs = self.transformer(
input_ids,

@@ -1419,10 +1432,11 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
**kwargs,
)

logits = self.lm_loss(transformer_outputs[0])
@@ -1483,10 +1497,11 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
head_mask=None,
inputs_embeds=None,
labels=None,
use_cache=None,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):

@@ -1495,7 +1510,6 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)

transformer_outputs = self.transformer(
input_ids,

@@ -1507,10 +1521,11 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
**kwargs,
)
output = transformer_outputs[0]
@@ -1576,10 +1591,11 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
head_mask=None,
inputs_embeds=None,
labels=None,
use_cache=None,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):

@@ -1588,7 +1604,6 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)

outputs = self.transformer(
input_ids,

@@ -1600,7 +1615,7 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
@@ -1673,10 +1688,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
head_mask=None,
inputs_embeds=None,
labels=None,
use_cache=None,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):

@@ -1685,7 +1701,7 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
:obj:`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)

num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]

flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None

@@ -1708,10 +1724,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
target_mapping=target_mapping,
head_mask=head_mask,
inputs_embeds=flat_inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
**kwargs,
)

output = transformer_outputs[0]
@@ -1775,10 +1792,11 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
inputs_embeds=None,
start_positions=None,
end_positions=None,
use_cache=None,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):

@@ -1791,7 +1809,6 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)

outputs = self.transformer(
input_ids,

@@ -1803,10 +1820,11 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
**kwargs,
)

sequence_output = outputs[0]
@@ -1885,10 +1903,11 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
is_impossible=None,
cls_index=None,
p_mask=None,
use_cache=None,
use_mems=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):

@@ -1926,7 +1945,6 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
>>> loss = outputs.loss
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)

transformer_outputs = self.transformer(
input_ids,

@@ -1938,10 +1956,11 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
input_mask=input_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
use_mems=use_mems,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
**kwargs,
)
hidden_states = transformer_outputs[0]
start_logits = self.start_logits(hidden_states, p_mask=p_mask)
@@ -153,7 +153,7 @@ class TFXLNetModelTester:
inputs = [input_ids_1, input_mask]
result = model(inputs)

config.mem_len = 0
config.use_mems_eval = False
model = TFXLNetModel(config)
no_mems_outputs = model(inputs)
self.parent.assertEqual(len(no_mems_outputs), 1)
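The TF tester now turns memories off through the new `config.use_mems_eval` flag instead of zeroing `mem_len`; with it disabled the model output contains only the hidden states, which is what the `len(...) == 1` assertion checks. A sketch with a tiny random config, assuming a release containing this commit:

# With use_mems_eval=False, the TF model output no longer includes mems.
import tensorflow as tf
from transformers import XLNetConfig, TFXLNetModel

config = XLNetConfig(vocab_size=32, d_model=16, n_layer=2, n_head=2, d_inner=32, use_mems_eval=False)
model = TFXLNetModel(config)

outputs = model(tf.constant([[1, 2, 3, 4, 5]]))
print(len(outputs))  # 1: only the last hidden state is returned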
@@ -206,7 +206,36 @@ class XLNetModelTester:
[(self.seq_length, self.batch_size, self.hidden_size)] * self.num_hidden_layers,
)

def create_and_check_xlnet_model_use_cache(
def create_and_check_use_mems_train(
self,
config,
input_ids_1,
input_ids_2,
input_ids_q,
perm_mask,
input_mask,
target_mapping,
segment_ids,
lm_labels,
sequence_labels,
is_impossible_labels,
token_labels,
):
model = XLNetForSequenceClassification(config)
model.to(torch_device)
model.train()

train_size = input_ids_1.shape[0]

batch_size = 4
for i in range(train_size // batch_size + 1):
input_ids = input_ids_1[i : (i + 1) * batch_size]
labels = sequence_labels[i : (i + 1) * batch_size]
outputs = model(input_ids=input_ids, labels=labels, return_dict=True)
self.parent.assertIsNone(outputs.mems)
self.parent.assertIsNotNone(outputs.loss)

def create_and_check_xlnet_model_use_mems(
self,
config,
input_ids_1,
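The new `create_and_check_use_mems_train` test above pins down the training behaviour: unless `use_mems` is requested explicitly, nothing is cached while training (the config default `use_mems_train` is `False`), so `outputs.mems` is `None` while the loss is still computed. A minimal reproduction with a tiny random model, assuming a release containing this commit:

# During training, mems default to None; the loss is unaffected.
import torch
from transformers import XLNetConfig, XLNetForSequenceClassification

config = XLNetConfig(vocab_size=32, d_model=16, n_layer=2, n_head=2, d_inner=32, num_labels=2)
model = XLNetForSequenceClassification(config)
model.train()

input_ids = torch.randint(0, 32, (4, 7))
labels = torch.randint(0, 2, (4,))
outputs = model(input_ids=input_ids, labels=labels, return_dict=True)

print(outputs.mems)              # None
print(outputs.loss is not None)  # True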
@@ -234,8 +263,8 @@ class XLNetModelTester:
device=torch_device,
)
causal_mask = torch.triu(causal_mask, diagonal=0)
outputs_cache = model(input_ids_1, use_cache=True, perm_mask=causal_mask)
outputs_no_cache = model(input_ids_1, use_cache=False, perm_mask=causal_mask)
outputs_cache = model(input_ids_1, use_mems=True, perm_mask=causal_mask)
outputs_no_cache = model(input_ids_1, use_mems=False, perm_mask=causal_mask)
outputs_conf = model(input_ids_1)

self.parent.assertTrue(len(outputs_cache) == len(outputs_conf))
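The renamed equivalence check above compares runs with `use_mems=True` and `use_mems=False`. For a single forward pass the flag only controls whether memories are returned, not the hidden states themselves. A sketch with a tiny random model, assuming a release containing this commit:

# use_mems only toggles whether memories come back; the hidden states are identical.
import torch
from transformers import XLNetConfig, XLNetModel

torch.manual_seed(0)
config = XLNetConfig(vocab_size=32, d_model=16, n_layer=2, n_head=2, d_inner=32, mem_len=16)
model = XLNetModel(config)
model.eval()

tokens = torch.randint(0, 32, (1, 8))
with torch.no_grad():
    with_mems = model(tokens, use_mems=True, return_dict=True)
    without_mems = model(tokens, use_mems=False, return_dict=True)

print(with_mems.mems is not None)   # True, one tensor per layer
print(without_mems.mems is None)    # True
print(torch.allclose(with_mems.last_hidden_state, without_mems.last_hidden_state))  # True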
@@ -525,11 +554,15 @@ class XLNetModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_xlnet_base_model(*config_and_inputs)

def test_xlnet_base_model_use_cache(self):
# checking that in auto-regressive mode, :obj:`use_cache` gives the same results
def test_xlnet_base_model_use_mems(self):
# checking that in auto-regressive mode, :obj:`use_mems` gives the same results
self.model_tester.set_seed()
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_xlnet_model_use_cache(*config_and_inputs)
self.model_tester.create_and_check_xlnet_model_use_mems(*config_and_inputs)

def test_seq_classification_use_mems_train(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_use_mems_train(*config_and_inputs)

def test_xlnet_base_model_with_att_output(self):
self.model_tester.set_seed()