# Zero-shot classifier distillation

Author: @joeddav

This script provides a way to improve the speed and memory performance of a zero-shot classifier by training a more
efficient student model from the zero-shot teacher's predictions over an unlabeled dataset.

The zero-shot classification pipeline uses a model pre-trained on natural language inference (NLI) to determine the
compatibility of a set of candidate class names with a given sequence. This serves as a convenient out-of-the-box
classifier without the need for labeled training data. However, for a given sequence, the method requires each
possible label to be fed through the large NLI model separately. Thus for `N` sequences and `K` classes, a total of
`N*K` forward passes through the model are required. This requirement slows inference considerably, particularly as
`K` grows.
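
To make that cost concrete, here is a rough illustration (not the pipeline's internal code) of how many premise/hypothesis pairs the NLI teacher has to score; the sequences and labels are made up:

```python
# Each (sequence, candidate label) pair becomes one NLI forward pass,
# so N sequences and K candidate labels cost N*K passes through the teacher.
sequences = ["Stocks rallied after the earnings report", "The striker scored twice"]  # N = 2
class_names = ["business", "sports", "science/tech"]                                  # K = 3
hypothesis_template = "This example is {}."

nli_inputs = [
    (sequence, hypothesis_template.format(label))  # (premise, hypothesis)
    for sequence in sequences
    for label in class_names
]
print(len(nli_inputs))  # 6 == N * K
```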

Given (1) an unlabeled corpus and (2) a set of candidate class names, the provided script trains a student model
with a standard classification head with `K` output dimensions. The resulting student model can then be used for
classifying novel text instances with a significant boost in speed and memory performance while retaining similar
classification performance to the original zero-shot model.
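
For intuition, here is a minimal sketch of what such a student looks like before distillation; the model name and class names are just illustrative assumptions:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative class names; in the script these come from <class_names.txt>.
class_names = ["business", "sports", "science/tech"]

# A standard sequence classifier with K output dimensions. The script fine-tunes
# a model like this to reproduce the teacher's soft label distributions.
student = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(class_names),
    id2label=dict(enumerate(class_names)),
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```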

### Usage

A teacher NLI model can be distilled to a more efficient student model by running [`distill_classifier.py`](https://github.com/huggingface/transformers/blob/main/examples/research_projects/zero-shot-distillation/distill_classifier.py):

```bash
python distill_classifier.py \
--data_file <unlabeled_data.txt> \
--class_names_file <class_names.txt> \
--output_dir <output_dir>
```

`<unlabeled_data.txt>` should be a text file with a single unlabeled example per line. `<class_names.txt>` is a text file with one class name per line.
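
For concreteness, here is a tiny, made-up sketch of writing the two input files in the expected format (the example lines and class names are not from any real dataset):

```python
# Made-up example inputs: one unlabeled example per line, one class name per line.
with open("unlabeled_data.txt", "w") as f:
    f.write("A new moon has been discovered in Jupiter's orbit\n")
    f.write("Stocks rallied after the earnings report\n")

with open("class_names.txt", "w") as f:
    f.write("science/tech\n")
    f.write("business\n")
```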

Other optional arguments include:

- `--teacher_name_or_path` (default: `roberta-large-mnli`): The name or path of the NLI teacher model.
- `--student_name_or_path` (default: `distilbert-base-uncased`): The name or path of the student model which will
be fine-tuned to copy the teacher predictions.
- `--hypothesis_template` (default `"This example is {}."`): The template used to turn each label into an NLI-style
hypothesis when generating teacher predictions. This template must include a `{}` or similar syntax for the
candidate label to be inserted into the template. For example, the default template is `"This example is {}."` With
the candidate label `sports`, this would be fed into the model like `[CLS] sequence to classify [SEP] This example
is sports . [SEP]`.
- `--multi_class`: Whether or not multiple candidate labels can be true. By default, the scores are normalized such
that the sum of the label likelihoods for each sequence is 1. If `--multi_class` is passed, the labels are
considered independent and probabilities are normalized for each candidate by doing a softmax of the entailment
score vs. the contradiction score. This is sometimes called "multi-class multi-label" classification.
- `--temperature` (default: `1.0`): The temperature applied to the softmax of the teacher model predictions. A
higher temperature results in a student with smoother (lower-confidence) predictions than the teacher, while a value
`<1` results in a higher-confidence, peaked distribution. The default `1.0` is equivalent to no smoothing (see the
short sketch after this list).
- `--teacher_batch_size` (default: `32`): The batch size used for generating a single set of teacher predictions.
Does not affect training. Use `--per_device_train_batch_size` to change the training batch size.
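
To illustrate the effect of `--temperature`, here is a standalone sketch (the logits are made up) showing how a higher temperature flattens the teacher's distribution:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature before the softmax:
    # T > 1 smooths the distribution, T < 1 sharpens it, T = 1 leaves it unchanged.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [3.0, 1.0, 0.5]  # made-up scores for three candidate labels
print(softmax_with_temperature(teacher_logits, temperature=1.0))  # ~[0.82, 0.11, 0.07]
print(softmax_with_temperature(teacher_logits, temperature=2.0))  # ~[0.60, 0.22, 0.17] (smoother)
```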

Any of the arguments in the 🤗 Trainer's
[`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html?#trainingarguments) can also be
modified, such as `--learning_rate`, `--fp16`, `--no_cuda`, `--warmup_steps`, etc. Run `python distill_classifier.py
-h` for a full list of available arguments or consult the [Trainer
documentation](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

> **Note**: Distributed and TPU training are not currently supported. Single-node multi-GPU is supported, however,
and will run automatically if multiple GPUs are available.

### Example: Topic classification

> A full colab demo notebook of this example can be found [here](https://colab.research.google.com/drive/1mjBjd0cR8G57ZpsnFCS3ngGyo5nCa9ya?usp=sharing).

Let's say we're interested in classifying news articles into one of four topic categories: "the world", "sports",
"business", or "science/tech". We have an unlabeled dataset, [AG's News](https://huggingface.co/datasets/ag_news),
which corresponds to this problem (in reality AG's News is annotated, but we will pretend it is not for the sake of
example).

We can use an NLI model like `roberta-large-mnli` for zero-shot classification like so:

```python
>>> from transformers import pipeline

>>> class_names = ["the world", "sports", "business", "science/tech"]
>>> hypothesis_template = "This text is about {}."
>>> sequence = "A new moon has been discovered in Jupiter's orbit"

>>> zero_shot_classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")
>>> zero_shot_classifier(sequence, class_names, hypothesis_template=hypothesis_template)
{'sequence': "A new moon has been discovered in Jupiter's orbit",
 'labels': ['science/tech', 'the world', 'business', 'sports'],
 'scores': [0.7035840153694153, 0.18744826316833496, 0.06027870625257492, 0.04868902638554573]}
```

Unfortunately, inference is slow since each of our 4 class names must be fed through the large model for every
sequence to be classified. But with our unlabeled data we can distill the model to a small distilbert classifier to
make future inference much faster.

To run the script, we will need to put each training example (text only) from AG's News on its own line in
`agnews/unlabeled.txt`, and each of the four class names in the newline-separated `agnews/class_names.txt`.
Then we can run distillation with the following command:

```bash
python distill_classifier.py \
--data_file ./agnews/unlabeled.txt \
--class_names_file ./agnews/class_names.txt \
--teacher_name_or_path roberta-large-mnli \
--hypothesis_template "This text is about {}." \
--output_dir ./agnews/distilled
```
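
For reference, here is one way the two input files above could be generated from the `ag_news` dataset with the 🤗 Datasets library (a sketch; any preprocessing that yields one example per line works):

```python
import os
from datasets import load_dataset

os.makedirs("agnews", exist_ok=True)
dataset = load_dataset("ag_news", split="train")

# One unlabeled training example per line (newlines inside an example are flattened).
with open("agnews/unlabeled.txt", "w") as f:
    for text in dataset["text"]:
        f.write(text.replace("\n", " ") + "\n")

# One class name per line, matching the four AG's News topics.
with open("agnews/class_names.txt", "w") as f:
    f.write("\n".join(["the world", "sports", "business", "science/tech"]) + "\n")
```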

The script will generate a set of soft zero-shot predictions from `roberta-large-mnli` for each example in
`agnews/unlabeled.txt`. It will then train a student distilbert classifier on the teacher predictions and
save the resulting model in `./agnews/distilled`.

The resulting model can then be loaded and used like any other pre-trained classifier:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./agnews/distilled")
tokenizer = AutoTokenizer.from_pretrained("./agnews/distilled")
```

and even used trivially with a `TextClassificationPipeline`:

```python
>>> from transformers import TextClassificationPipeline

>>> distilled_classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
>>> distilled_classifier(sequence)
[[{'label': 'the world', 'score': 0.14899294078350067},
  {'label': 'sports', 'score': 0.03205857425928116},
  {'label': 'business', 'score': 0.05943061783909798},
  {'label': 'science/tech', 'score': 0.7595179080963135}]]
```

> Tip: pass `device=0` when constructing a pipeline to run on a GPU.

As we can see, the results of the student closely resemble those of the teacher despite never having seen this
example during training. Now let's do a quick & dirty speed comparison simulating 16K examples with a batch size of
16:

```python
%%time
for _ in range(1000):
    zero_shot_classifier([sequence] * 16, class_names)
# runs in 1m 23s on a single V100 GPU
```

```python
%%time
for _ in range(1000):
    distilled_classifier([sequence] * 16)
# runs in 10.3s on a single V100 GPU
```

As we can see, the distilled student model runs an order of magnitude faster than its teacher NLI model. This is
also a setting where we only have `K=4` possible labels. The higher the number of classes for a given task, the more
drastic the speedup will be, since the zero-shot teacher's complexity scales linearly with the number of classes.

Since we secretly have access to ground-truth labels for AG's News, we can evaluate the accuracy of each model. The
original zero-shot model `roberta-large-mnli` gets an accuracy of 69.3% on the held-out test set. After training a
student on the unlabeled training set, the distilled model gets a similar score of 70.4%.
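
If you want to reproduce that kind of number yourself, here is a minimal evaluation sketch for the distilled student (it assumes `agnews/class_names.txt` was written in the same order as the AG's News label ids: world, sports, business, science/tech):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./agnews/distilled").eval()
tokenizer = AutoTokenizer.from_pretrained("./agnews/distilled")
test = load_dataset("ag_news", split="test")

correct = 0
for start in range(0, len(test), 32):
    batch = test[start : start + 32]  # slicing a Dataset returns a dict of lists
    inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        predictions = model(**inputs).logits.argmax(dim=-1)
    correct += (predictions == torch.tensor(batch["label"])).sum().item()

print(f"student accuracy: {correct / len(test):.3f}")
```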

Lastly, you can share the distilled model with the community and/or use it with our inference API by [uploading it
to the 🤗 Hub](https://huggingface.co/transformers/model_sharing.html). We've uploaded the distilled model from this
example at
[joeddav/distilbert-base-uncased-agnews-student](https://huggingface.co/joeddav/distilbert-base-uncased-agnews-student).
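
For example, a distilled model loaded as above can be pushed with something like the following (after `huggingface-cli login`; the repository name is just an example):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./agnews/distilled")
tokenizer = AutoTokenizer.from_pretrained("./agnews/distilled")

# Creates (or updates) a model repository under your username on the Hub.
model.push_to_hub("distilbert-base-uncased-agnews-student")
tokenizer.push_to_hub("distilbert-base-uncased-agnews-student")
```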