adjusted formating and some wording in the readme

This commit is contained in:
thomwolf 2019-03-27 11:53:44 +01:00
parent 24e67fbf75
commit cea8ba1d59
1 changed file with 28 additions and 62 deletions

# BERT Model Finetuning using Masked Language Modeling objective

## Introduction
The three example scripts in this folder can be used to **fine-tune** a pre-trained BERT model using the pretraining objective (a combination of masked language modeling and next sentence prediction loss). In general, pretrained models like BERT are first trained with a pretraining objective (masked language modeling and next sentence prediction for BERT) on a large and general natural language corpus. A classifier head is then added on top of the pre-trained architecture, and the model is quickly fine-tuned on a target task while still (hopefully) retaining its general language understanding. This greatly reduces overfitting and yields state-of-the-art results, especially when training data for the target task are limited.
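
For orientation, here is a minimal sketch (not one of the scripts in this folder) of the two stages described above, using the model classes from the `pytorch_pretrained_bert` package; the checkpoint name and `num_labels=2` are purely illustrative assumptions.

```
# Sketch of the two-stage workflow described above (illustrative only).
from pytorch_pretrained_bert import BertForPreTraining, BertForSequenceClassification

# Stage 1: fine-tune on unlabelled domain text using the pretraining heads
# (masked language modeling + next sentence prediction). The scripts in this
# folder perform this stage.
lm_model = BertForPreTraining.from_pretrained('bert-base-uncased')

# Stage 2: add a classifier head and fine-tune on the labelled target task.
# In practice you would point from_pretrained() at the directory saved by
# stage 1 rather than at the stock checkpoint used here.
clf_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
```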
The [ULMFiT paper](https://arxiv.org/abs/1801.06146) took a slightly different approach, however, and added an intermediate step in which the model is fine-tuned on text **from the same domain as the target task, still using the pretraining objective**, before the final stage in which the classifier head is added and the model is trained on the target task itself. The paper reported significantly improved results from this step, and found that high-quality classifications were possible even with only tiny numbers (<1000) of labelled training examples, as long as a lot of unlabelled data from the target domain was available.

The BERT model has more capacity than the LSTM models used in the ULMFiT work, but the [BERT paper](https://arxiv.org/abs/1810.04805) did not test fine-tuning using the pretraining objective, and at present there aren't many examples of this approach being used for Transformer-based language models. As such, it's hard to predict what effect this step will have on final model performance, but it's reasonable to conjecture that it can improve the final classification performance, especially when a large unlabelled corpus from the target domain is available, labelled data is limited, or the target domain is very unusual and different from 'normal' English text. If you are aware of any literature on this subject, please feel free to add it here, or open an issue and tag me (@Rocketknight1) and I'll include it.
## Input format
The scripts in this folder expect a single file as input, consisting of untokenized text, with one **sentence** per line, and one blank line between documents. The reason for the sentence splitting is that part of BERT's training involves a _next sentence_ objective in which the model must predict whether two sequences of text are contiguous text from the same document or not, and to avoid making the task _too easy_, the split point between the sequences is always at the end of a sentence. The linebreaks in the file are therefore necessary to mark the points where the text can be split.
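
As a concrete illustration, the snippet below writes a tiny made-up corpus in the expected format (one sentence per line, a blank line between documents) to `my_corpus.txt`, the file name used in the example commands further down; the sentences themselves are placeholders.

```
# Write a toy corpus in the format the fine-tuning scripts expect:
# one sentence per line, documents separated by a blank line.
toy_corpus = """This is the first sentence of the first document.
Here is a second sentence from the same document.

The second document starts after a blank line.
It also contains more than one sentence.
"""

with open("my_corpus.txt", "w", encoding="utf-8") as f:
    f.write(toy_corpus)
```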
## Usage
There are two ways to fine-tune a language model using these scripts. The first _quick_ approach is to use [`simple_lm_finetuning.py`](./simple_lm_finetuning.py). This script does everything in a single step, but generates training instances that consist of just two sentences. This is quite different from the BERT paper, where (confusingly) the NextSentence task concatenated sentences together from each document to form two long multi-sentence segments, which the paper simply referred to as _sentences_. The difference between this simple approach and the original paper's approach becomes significant at long sequence lengths, since two sentences will usually be much shorter than the maximum sequence length. In that case, most of each training example consists of blank padding characters, which wastes a lot of computation and results in a model that isn't really training on long sequences.
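
To make the padding issue concrete, here is a back-of-the-envelope calculation; the average sentence length used below is an assumption chosen purely for illustration.

```
# Rough illustration of how much of a long training example ends up as padding
# when each instance is only two sentences. Numbers are assumptions.
avg_tokens_per_sentence = 15                        # assumed average length in wordpiece tokens
example_tokens = 2 * avg_tokens_per_sentence + 3    # two sentences plus [CLS] and two [SEP] tokens
max_seq_len = 512

padding_fraction = 1 - example_tokens / max_seq_len
print(f"~{padding_fraction:.0%} of the example is padding")   # roughly 94%
```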
As such, the preferred approach (assuming you have documents containing multiple contiguous sentences from your target domain) is to use [`pregenerate_training_data.py`](./pregenerate_training_data.py) to pre-process your data into training examples following the methodology used for LM training in the original BERT paper and repository. Since there is a significant random component to training data generation for BERT, this script includes an option to generate multiple _epochs_ of pre-processed data, to avoid training on the same random splits each epoch. Generating an epoch of data for each training epoch should result in a better final model, and so we recommend doing so.

You can then train on the pregenerated data using [`finetune_on_pregenerated.py`](./finetune_on_pregenerated.py), pointing it to the folder created by [`pregenerate_training_data.py`](./pregenerate_training_data.py). Note that you should use the same `bert_model` and case options for both! Also note that `max_seq_len` does not need to be specified for the [`finetune_on_pregenerated.py`](./finetune_on_pregenerated.py) script, as it is inferred from the training examples.

There are various options that can be tweaked, but they are mostly set to the values from the BERT paper/repository, and the defaults should be sensible. The most relevant ones are:
- `--max_seq_len`: Controls the length of training examples (in wordpiece tokens) seen by the model. Defaults to 128 but can be set as high as 512. Higher values may yield stronger language models at the cost of slower and more memory-intensive training.
- `--fp16`: Enables fast half-precision training on recent GPUs.

In addition, if memory usage is an issue, especially when training on a single GPU, reducing `--train_batch_size` from the default 32 to a lower number (4-16) can be helpful, or leaving `--train_batch_size` at the default and increasing `--gradient_accumulation_steps` to 2-8. Changing `--gradient_accumulation_steps` may be preferable, as alterations to the batch size may require corresponding changes in the learning rate to compensate. There is also a `--reduce_memory` option for both the `pregenerate_training_data.py` and `finetune_on_pregenerated.py` scripts that spills data to disk in shelf objects or numpy memmaps rather than retaining it in memory, which significantly reduces memory usage with little performance impact.
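
As a quick sanity check on the gradient accumulation advice (numbers below are illustrative, not recommendations): lowering the per-step batch while raising the accumulation steps keeps the effective batch size, and therefore the appropriate learning rate, unchanged.

```
# Gradient accumulation trades per-step memory for extra forward/backward passes
# while keeping the effective batch size constant. Illustrative values only.
default_train_batch_size = 32

train_batch_size = 8               # smaller batch actually held in GPU memory per step
gradient_accumulation_steps = 4    # accumulate gradients over 4 mini-batches before stepping

effective_batch_size = train_batch_size * gradient_accumulation_steps
assert effective_batch_size == default_train_batch_size   # optimizer still sees batches of 32
```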
## Examples
### Simple fine-tuning
```
python3 simple_lm_finetuning.py
--train_corpus my_corpus.txt
...
--output_dir finetuned_lm/
```
### Pregenerating training data
```
python3 pregenerate_training_data.py
--train_corpus my_corpus.txt
...
--max_seq_len 256
```
### Training on pregenerated data
```
python3 finetune_on_pregenerated.py
--pregenerated_data training/