Update TF fine-tuning docs (#18654)
* Update TF fine-tuning docs
* Fix formatting
* Add some section headers so the right sidebar works better
* Squiggly it
* Update docs/source/en/training.mdx (applied review suggestions, co-authored by Steven Liu and Sylvain Gugger)
* Explain things in the text, not the comments
* Make the two dataset creation methods into a list
* Move the advice about collation out of a <Tip>
* Edits for clarity
* Replace `to_tf_dataset` with `prepare_tf_dataset` in the fine-tuning pages
* Restructure the page a little bit

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent d842f2d5b9
commit 2b9513fdab
@@ -245,20 +245,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 
 ```py
->>> tf_train_set = lm_dataset["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_train_set = model.prepare_tf_dataset(
+...     lm_dataset["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_test_set = lm_dataset["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_test_set = model.prepare_tf_dataset(
+...     lm_dataset["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
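For context on the new API: `prepare_tf_dataset` is a method on the model, so it can inspect the model's input signature and select the usable columns itself, which is why the explicit `columns` and `dummy_labels` arguments disappear above. Below is a minimal sketch of the full pattern this page converges on, assuming a `distilgpt2` checkpoint and a `DataCollatorForLanguageModeling` collator (neither appears in this hunk; `lm_dataset` is the tokenized dataset built earlier on the docs page):

```py
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TFAutoModelForCausalLM,
    create_optimizer,
)

# Checkpoint is an assumption for illustration, not part of this diff
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
# return_tensors="tf" so the collator produces TensorFlow tensors
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")

# prepare_tf_dataset inspects the model to pick usable input columns
tf_train_set = model.prepare_tf_dataset(
    lm_dataset["train"],  # built earlier on the docs page
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

optimizer, _ = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=500)
model.compile(optimizer=optimizer)  # no loss: the model's internal loss is used
model.fit(tf_train_set, epochs=3)
```

Because no loss is passed to `compile`, the model falls back to its internal task-appropriate loss, matching the simplified `compile` calls elsewhere in this commit.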
@@ -352,20 +350,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 
 ```py
->>> tf_train_set = lm_dataset["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_train_set = model.prepare_tf_dataset(
+...     lm_dataset["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_test_set = lm_dataset["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
-...     dummy_labels=True,
+>>> tf_test_set = model.prepare_tf_dataset(
+...     lm_dataset["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -224,21 +224,19 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 
 ```py
 >>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
->>> tf_train_set = tokenized_swag["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids"],
-...     label_cols=["labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_swag["train"],
 ...     shuffle=True,
 ...     batch_size=batch_size,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_validation_set = tokenized_swag["validation"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids"],
-...     label_cols=["labels"],
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_swag["validation"],
 ...     shuffle=False,
 ...     batch_size=batch_size,
 ...     collate_fn=data_collator,
@@ -273,10 +271,7 @@ Load BERT with [`TFAutoModelForMultipleChoice`]:
 Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
 
 ```py
->>> model.compile(
-...     optimizer=optimizer,
-...     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
-... )
+>>> model.compile(optimizer=optimizer)
 ```
 
 Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
 
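The simplified `compile` call relies on the automatic loss selection described later in this commit: when no `loss` argument is given, the TF models choose a task-appropriate loss internally. A hedged sketch of the resulting setup, assuming a `bert-base-uncased` checkpoint and the `create_optimizer` utility (neither shown in this hunk):

```py
# Sketch only: checkpoint and optimizer settings are illustrative assumptions
from transformers import TFAutoModelForMultipleChoice, create_optimizer

model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=1000)

model.compile(optimizer=optimizer)  # no loss argument: the model's internal loss is used
```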
@@ -199,20 +199,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 
 ```py
->>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
-...     dummy_labels=True,
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_squad["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
-...     dummy_labels=True,
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_squad["validation"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -144,18 +144,19 @@ At this point, only three steps remain:
 </Tip>
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
+
 
 ```py
->>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "label"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_imdb["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_validation_set = tokenized_imdb["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "label"],
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_imdb["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
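For the classification pages, the `data_collator` referenced above is typically a padding collator that pads each batch to its own longest sample. A sketch under the assumption of a `distilbert-base-uncased` checkpoint and `DataCollatorWithPadding` (neither appears in this hunk; `tokenized_imdb` is built earlier on the docs page):

```py
from transformers import AutoTokenizer, DataCollatorWithPadding, TFAutoModelForSequenceClassification

# Checkpoint and label count are illustrative assumptions
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],  # tokenized dataset from earlier on the page
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```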
@@ -159,18 +159,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 
 ```py
->>> tf_train_set = tokenized_billsum["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_billsum["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_test_set = tokenized_billsum["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_test_set = model.prepare_tf_dataset(
+...     tokenized_billsum["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
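For the seq2seq pages (summarization, translation), the `data_collator` above would be a seq2seq collator that pads inputs and labels together and prepares decoder inputs. A sketch assuming a `t5-small` checkpoint and `DataCollatorForSeq2Seq` (neither shown in this hunk):

```py
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, TFAutoModelForSeq2SeqLM

# Checkpoint is an illustrative assumption
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Pads inputs and labels per batch and returns TensorFlow tensors
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")

tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],  # tokenized dataset from earlier on the page
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```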
@@ -199,18 +199,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 
 ```py
->>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_wnut["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_validation_set = model.prepare_tf_dataset(
+...     tokenized_wnut["validation"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -175,18 +175,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 
 ```py
->>> tf_train_set = tokenized_books["train"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_train_set = model.prepare_tf_dataset(
+...     tokenized_books["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
 
->>> tf_test_set = tokenized_books["test"].to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+>>> tf_test_set = model.prepare_tf_dataset(
+...     tokenized_books["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -216,7 +216,7 @@ Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
 Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
 
 ```py
->>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
+>>> model.fit(tf_train_set, validation_data=tf_test_set, epochs=3)
 ```
 </tf>
 </frameworkcontent>
@@ -65,10 +65,16 @@ If you like, you can create a smaller subset of the full dataset to fine-tune on
 
 ## Train
 
+At this point, you should follow the section corresponding to the framework you want to use. You can use the links
+in the right sidebar to jump to the one you want - and if you want to hide all of the content for a given framework,
+just use the button at the top-right of that framework's block!
+
 <frameworkcontent>
 <pt>
 <Youtube id="nvBXf7s7vTI"/>
 
+## Train with PyTorch Trainer
+
 🤗 Transformers provides a [`Trainer`] class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [`Trainer`] API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.
 
 Start by loading your model and specify the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels:
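The `Trainer` setup this hunk refers to looks roughly like the following sketch; the checkpoint and label count come from the surrounding page, while `output_dir` and the `small_*` dataset names (which appear elsewhere on the page) are assumptions here:

```py
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Five labels, per the Yelp Review dataset card cited above
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
training_args = TrainingArguments(output_dir="test_trainer")  # output path is illustrative

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,  # subsets built earlier on the page
    eval_dataset=small_eval_dataset,
)
trainer.train()
```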
@@ -151,66 +157,113 @@ Then fine-tune your model by calling [`~transformers.Trainer.train`]:
 
 <Youtube id="rnTGBy2ax1c"/>
 
-🤗 Transformers models also supports training in TensorFlow with the Keras API.
+## Train a TensorFlow model with Keras
 
-### Convert dataset to TensorFlow format
+You can also train 🤗 Transformers models in TensorFlow with the Keras API!
 
-The [`DefaultDataCollator`] assembles tensors into a batch for the model to train on. Make sure you specify `return_tensors` to return TensorFlow tensors:
+### Loading data for Keras
+
+When you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that
+Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras.
+Let's try that first before we do anything more complicated.
+
+First, load a dataset. We'll use the CoLA dataset from the [GLUE benchmark](https://huggingface.co/datasets/glue),
+since it's a simple binary text classification task, and just take the training split for now.
 
 ```py
->>> from transformers import DefaultDataCollator
+from datasets import load_dataset
 
->>> data_collator = DefaultDataCollator(return_tensors="tf")
+dataset = load_dataset("glue", "cola")
+dataset = dataset["train"]  # Just take the training split for now
 ```
 
+Next, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0 and 1s,
+so we can just convert that directly to a NumPy array without tokenization!
+
+```py
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+tokenized_data = tokenizer(dataset["text"], return_tensors="np", padding=True)
+
+labels = np.array(dataset["label"])  # Label is already an array of 0 and 1
+```
+
+Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model:
+
+```py
+from transformers import TFAutoModelForSequenceClassification
+from tensorflow.keras.optimizers import Adam
+
+# Load and compile our model
+model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
+# Lower learning rates are often better for fine-tuning transformers
+model.compile(optimizer=Adam(3e-5))
+
+model.fit(tokenized_data, labels)
+```
+
 <Tip>
 
-[`Trainer`] uses [`DataCollatorWithPadding`] by default so you don't need to explicitly specify a data collator.
+You don't have to pass a loss argument to your models when you `compile()` them! Hugging Face models automatically
+choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always
+override this by specifying a loss yourself if you want to!
 
 </Tip>
 
-Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`:
+This approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why?
+Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn't handle
+"jagged" arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole
+dataset. That's going to make your array even bigger, and all those padding tokens will slow down training too!
+
+### Loading data as a tf.data.Dataset
+
+If you want to avoid slowing down training, you can load your data as a `tf.data.Dataset` instead. Although you can write your own
+`tf.data` pipeline if you want, we have two convenience methods for doing this:
+
+- [`~TFPreTrainedModel.prepare_tf_dataset`]: This is the method we recommend in most cases. Because it is a method
+on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and
+discard the others to make a simpler, more performant dataset.
+- [`~datasets.Dataset.to_tf_dataset`]: This method is more low-level, and is useful when you want to exactly control how
+your dataset is created, by specifying exactly which `columns` and `label_cols` to include.
+
+Before you can use [`~TFPreTrainedModel.prepare_tf_dataset`], you will need to add the tokenizer outputs to your dataset as columns, as shown in
+the following code sample:
 
 ```py
->>> tf_train_dataset = small_train_dataset.to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "token_type_ids"],
-...     label_cols=["labels"],
-...     shuffle=True,
-...     collate_fn=data_collator,
-...     batch_size=8,
-... )
+def tokenize_dataset(data):
+    # Keys of the returned dictionary will be added to the dataset as columns
+    return tokenizer(data["text"])
 
->>> tf_validation_dataset = small_eval_dataset.to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "token_type_ids"],
-...     label_cols=["labels"],
-...     shuffle=False,
-...     collate_fn=data_collator,
-...     batch_size=8,
-... )
+
+dataset = dataset.map(tokenize_dataset)
 ```
 
-### Compile and fit
+Remember that Hugging Face datasets are stored on disk by default, so this will not inflate your memory usage! Once the
+columns have been added, you can stream batches from the dataset and add padding to each batch, which greatly
+reduces the number of padding tokens compared to padding the entire dataset.
 
-Load a TensorFlow model with the expected number of labels:
-
 ```py
->>> import tensorflow as tf
->>> from transformers import TFAutoModelForSequenceClassification
-
->>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
+>>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)
 ```
 
-Then compile and fine-tune your model with [`fit`](https://keras.io/api/models/model_training_apis/) as you would with any other Keras model:
+Note that in the code sample above, you need to pass the tokenizer to `prepare_tf_dataset` so it can correctly pad batches as they're loaded.
+If all the samples in your dataset are the same length and no padding is necessary, you can skip this argument.
+If you need to do something more complex than just padding samples (e.g. corrupting tokens for masked language
+modelling), you can use the `collate_fn` argument instead to pass a function that will be called to transform the
+list of samples into a batch and apply any preprocessing you want. See our
+[examples](https://github.com/huggingface/transformers/tree/main/examples) or
+[notebooks](https://huggingface.co/docs/transformers/notebooks) to see this approach in action.
+
+Once you've created a `tf.data.Dataset`, you can compile and fit the model as before:
 
 ```py
->>> model.compile(
-...     optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
-...     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
-...     metrics=tf.metrics.SparseCategoricalAccuracy(),
-... )
+model.compile(optimizer=Adam(3e-5))
 
->>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
+model.fit(tf_dataset)
 ```
 
 </tf>
 </frameworkcontent>
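Assembled from the added lines in the hunk above, the new tf.data path reads as one runnable sequence (imports gathered in one place; everything here comes from the diff itself):

```py
from datasets import load_dataset
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

dataset = load_dataset("glue", "cola")["train"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_dataset(data):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(data["text"])


dataset = dataset.map(tokenize_dataset)

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
# Passing the tokenizer lets prepare_tf_dataset pad each batch as it is loaded
tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)

model.compile(optimizer=Adam(3e-5))  # no loss: the model selects one internally
model.fit(tf_dataset)
```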