## A
### attention mask
The attention mask is an optional argument used when batching sequences together.
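
For example, when padding a batch so all sequences have the same length, the mask tells the model which tokens are real and which are padding. A minimal sketch (the `bert-base-cased` checkpoint here is just an assumption; any tokenizer works):

```python
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> batch = tokenizer(["Short sequence.", "A noticeably longer sequence."], padding=True)
>>> batch["attention_mask"]  # 1 = attend to this token, 0 = padding to ignore
```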
### autoencoding models
see [masked language modeling](#masked-language-modeling)
### autoregressive models
see [causal language modeling](#causal-language-modeling)
## B
### backbone
The backbone is the network (embeddings and layers) that outputs the raw hidden states or features. It is usually connected to a [head](#head) which accepts the features as its input to make a prediction. For example, [`ViTModel`] is a backbone without a specific head on top. Other models can also use [`ViTModel`] as a backbone such as [DPT](model_doc/dpt).
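
As a sketch of the idea (a random tensor stands in for a preprocessed image):

```python
>>> import torch
>>> from transformers import ViTModel

>>> backbone = ViTModel.from_pretrained("google/vit-base-patch16-224")
>>> pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
>>> features = backbone(pixel_values).last_hidden_state  # raw hidden states, no task-specific head
>>> features.shape  # one embedding per image patch, plus the CLS token
torch.Size([1, 197, 768])
```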
## C

### channel
Color images are made up of some combination of values in three channels - red, green, and blue (RGB) - and grayscale images only have one channel. In 🤗 Transformers, the channel can be the first or last dimension of an image's tensor: [`n_channels`, `height`, `width`] or [`height`, `width`, `n_channels`].
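
For example, converting between the two conventions is just a transpose of the dimensions:

```python
>>> import numpy as np

>>> image = np.zeros((224, 224, 3))  # channels last: [height, width, n_channels]
>>> image.transpose(2, 0, 1).shape  # channels first: [n_channels, height, width]
(3, 224, 224)
```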
### causal language modeling
A pretraining task where the model reads the texts in order and has to predict the next word. It's usually done by
reading the whole sentence but using a mask inside the model to hide the future tokens at a certain timestep.
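
A minimal sketch of such a mask, where position *i* can only attend to positions up to and including *i*:

```python
>>> import torch

>>> seq_length = 4
>>> torch.tril(torch.ones(seq_length, seq_length))  # 1 = visible, 0 = hidden future token
tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])
```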
### connectionist temporal classification (CTC)
An algorithm which allows a model to learn without knowing exactly how the input and output are aligned; CTC calculates the distribution of all possible outputs for a given input and chooses the most likely output from it. CTC is commonly used in speech recognition tasks because speech doesn't always cleanly align with the transcript for a variety of reasons such as a speaker's different speech rates.
### convolution
A type of layer in a neural network where the input matrix is multiplied element-wise by a smaller matrix (kernel or filter) and the values are summed up in a new matrix. This is known as a convolutional operation which is repeated over the entire input matrix. Each operation is applied to a different segment of the input matrix. Convolutional neural networks (CNNs) are commonly used in computer vision.
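
A minimal sketch in PyTorch: a 3x3 kernel slides over a 32x32 image, and each window is multiplied by the kernel and summed:

```python
>>> import torch
>>> from torch import nn

>>> conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
>>> image = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
>>> conv(image).shape  # 16 filters, each applied over every 3x3 window
torch.Size([1, 16, 30, 30])
```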
## D

### decoder input IDs
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
way specific to each model.
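
A common construction - the exact rule is model-specific - is to shift the labels one position to the right and prepend a decoder start token. A sketch, with `0` standing in as a hypothetical start token ID:

```python
>>> import torch

>>> labels = torch.tensor([[42, 17, 93, 5]])  # target token IDs
>>> start_token = torch.tensor([[0]])  # hypothetical decoder start token ID
>>> torch.cat([start_token, labels[:, :-1]], dim=1)  # shifted right
tensor([[ 0, 42, 17, 93]])
```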
### deep learning

Machine learning algorithms which use neural networks with several layers.

## F

### feed forward chunking

In each residual attention block in transformers, the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
`bert-base-uncased`).

For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the number of output
embeddings that are computed in parallel and thus defines the trade-off between memory and time complexity. If
`chunk_size` is set to 0, no feed forward chunking is done.
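
Because the feed forward layers act on each position independently, chunking the sequence dimension trades peak memory for extra passes while leaving the result unchanged. A manual sketch of the idea:

```python
>>> import torch
>>> from torch import nn

>>> feed_forward = nn.Linear(256, 1024)  # intermediate size larger than the hidden size
>>> hidden_states = torch.randn(1, 64, 256)  # (batch, seq_length, hidden_size)
>>> chunks = [feed_forward(chunk) for chunk in hidden_states.chunk(4, dim=1)]  # 4 chunks of 16 positions
>>> torch.allclose(torch.cat(chunks, dim=1), feed_forward(hidden_states))
True
```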
## H

### head
The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example:
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
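
At its core, a head is often just a small projection over the backbone's features. A schematic sketch in plain PyTorch (not the exact internals of any particular model):

```python
>>> import torch
>>> from torch import nn

>>> hidden_states = torch.randn(1, 10, 768)  # raw features from a base model
>>> classification_head = nn.Linear(768, 2)  # projects the features onto 2 labels
>>> logits = classification_head(hidden_states[:, 0])  # e.g., classify from the first (CLS) token
>>> logits.shape
torch.Size([1, 2])
```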
## I

### image patch
Vision-based Transformer models split an image into smaller patches which are linearly embedded, and then passed as a sequence to the model. You can find the `patch_size` - or resolution - of the model in its configuration.
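
For example, with the default ViT configuration, a 224x224 image is cut into 16x16 patches, which determines the sequence length the model sees:

```python
>>> from transformers import ViTConfig

>>> config = ViTConfig()  # defaults to image_size=224, patch_size=16
>>> (config.image_size // config.patch_size) ** 2  # number of patches fed to the model
196
```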
### input IDs

The input ids are often the only required parameters to be passed to the model as input. They are token indices,
numerical representations of tokens building the sequences that will be used as input by the model.

<Youtube id="VFp38yj8h3A"/>
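
For example, tokenizing a sequence (a minimal sketch, assuming the `bert-base-cased` checkpoint):

```python
>>> from transformers import BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> sequence = "A Titan RTX has 24GB of VRAM"
>>> inputs = tokenizer(sequence)
```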
The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The
token indices are under the key `input_ids`:
```python
>>> encoded_sequence = inputs["input_ids"]
```

## L

### labels
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
predictions and the expected value (the label).

These labels are different according to the model head, for example:
- For sequence classification models ([`BertForSequenceClassification`]), the model expects a tensor of dimension
`(batch_size)` with each value of the batch corresponding to the expected label of the entire sequence.
- For token classification models ([`BertForTokenClassification`]), the model expects a tensor of dimension
`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token.
- For masked language modeling ([`BertForMaskedLM`]), the model expects a tensor of dimension `(batch_size,
seq_length)` with each value corresponding to the expected label of each individual token: the labels being the token
ID for the masked token, and values to be ignored for the rest (usually -100).
- For sequence to sequence tasks ([`BartForConditionalGeneration`], [`MBartForConditionalGeneration`]), the model
expects a tensor of dimension `(batch_size, tgt_seq_length)` with each value corresponding to the target sequences
associated with each input sequence. During training, both BART and T5 will make the appropriate
`decoder_input_ids` and decoder attention masks internally. They usually do not need to be supplied. This does not
apply to models leveraging the Encoder-Decoder framework.
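
For instance, passing a sequence-level label lets a sequence classification model return its loss directly. A minimal sketch (the `bert-base-cased` checkpoint is an assumption, and its classification head is freshly initialized):

```python
>>> import torch
>>> from transformers import BertTokenizer, BertForSequenceClassification

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
>>> inputs = tokenizer("This movie was great!", return_tensors="pt")
>>> outputs = model(**inputs, labels=torch.tensor([1]))  # one label for the whole sequence
>>> outputs.loss  # the model computed the loss itself
```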
<Tip>
Each model's labels may be different, so be sure to always check the documentation of each model for more information
about their specific labels!

</Tip>

The base models ([`BertModel`]) do not accept labels, as these are the base transformer models, simply outputting
features.
## M

### masked language modeling
A pretraining task where the model sees a corrupted version of the texts, usually done by
masking some tokens randomly, and has to predict the original text.
### multimodal
A task that combines texts with another kind of inputs (for instance images).

## N
### natural language generation

All tasks related to generating text (for instance talk with transformers, translation).
### natural language processing

A generic way to say "deal with texts".
### natural language understanding

All tasks related to understanding what is in a text (for instance classifying the
whole text, individual words).
## P

### pixel values
A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from a feature extractor.
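
A minimal sketch (the ViT checkpoint is an assumption, and a random array stands in for a real image):

```python
>>> import numpy as np
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
>>> image = np.random.randint(0, 256, (300, 400, 3), dtype=np.uint8)  # any RGB image
>>> feature_extractor(images=image, return_tensors="pt")["pixel_values"].shape  # resized and normalized
torch.Size([1, 3, 224, 224])
```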
### pooling
An operation that reduces a matrix into a smaller matrix, either by taking the maximum or average of the pooled dimension(s). Pooling layers are commonly found between convolutional layers to downsample the feature representation.
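
A sketch of max pooling in PyTorch, which keeps the maximum of every 2x2 window and thus halves each spatial dimension:

```python
>>> import torch
>>> from torch import nn

>>> pool = nn.MaxPool2d(kernel_size=2)
>>> feature_map = torch.randn(1, 16, 32, 32)
>>> pool(feature_map).shape  # height and width are halved
torch.Size([1, 16, 16, 16])
```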
### position IDs
Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
each token. Therefore, the position IDs (`position_ids`) are used by the model to identify each token's position in the
list of tokens.

Some models use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
### pretrained model
A model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods involve a
self-supervised objective, which can be reading the text and trying to predict the next word (see [causal language
modeling](#causal-language-modeling)) or masking some words and trying to predict them (see [masked language
modeling](#masked-language-modeling)).
Speech and vision models have their own pretraining objectives. For example, Wav2Vec2 is a speech model pretrained on a contrastive task which requires the model to identify the "true" speech representation from a set of "false" speech representations. On the other hand, BEiT is a vision model pretrained on a masked image modeling task which masks some of the image patches and requires the model to predict the masked patches (similar to the masked language modeling objective).
## R

### recurrent neural network
A type of model that uses a loop over a layer to process texts.
## S
### sampling rate
A measurement in hertz of the number of samples (the audio signal) taken per second. The sampling rate is a result of discretizing a continuous signal such as speech.
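
For example, two seconds of audio sampled at 16 kHz - a common rate for speech models - becomes an array of 32,000 values:

```python
>>> duration_seconds = 2.0
>>> sampling_rate = 16_000  # 16 kHz
>>> int(duration_seconds * sampling_rate)  # number of samples in the audio array
32000
```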
### self-attention
Each element of the input finds out which other elements of the input it should attend to.
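
A bare-bones sketch of the computation, with the input used as its own queries, keys, and values:

```python
>>> import torch

>>> x = torch.randn(1, 5, 64)  # (batch, seq_length, hidden_size)
>>> scores = x @ x.transpose(-2, -1) / 64**0.5  # similarity of every element with every other
>>> weights = scores.softmax(dim=-1)  # each row sums to 1: how much to attend to each element
>>> attended = weights @ x  # every output is a weighted mix of the whole input
```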
### sequence-to-sequence (seq2seq)
Models that generate a new sequence from an input, like translation models, or summarization models (such as
[Bart](model_doc/bart) or [T5](model_doc/t5)).
### stride
In [convolution](#convolution) or [pooling](#pooling), the stride refers to the distance the kernel is moved over a matrix. A stride of 1 means the kernel is moved one pixel over at a time, and a stride of 2 means the kernel is moved two pixels over at a time.
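
The effect on the output size is easy to see in PyTorch:

```python
>>> import torch
>>> from torch import nn

>>> image = torch.randn(1, 3, 32, 32)
>>> nn.Conv2d(3, 16, kernel_size=3, stride=1)(image).shape  # kernel moves one pixel at a time
torch.Size([1, 16, 30, 30])
>>> nn.Conv2d(3, 16, kernel_size=3, stride=2)(image).shape  # kernel moves two pixels at a time
torch.Size([1, 16, 15, 15])
```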
## T
### token
A part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords) or a
punctuation symbol.
### token type IDs
Some models' purpose is to do classification on pairs of sentences or question answering.