# 👾 PyTorch-Transformers
[![CircleCI](https://circleci.com/gh/huggingface/pytorch-transformers.svg?style=svg)](https://circleci.com/gh/huggingface/pytorch-transformers)

PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:
1. **[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
2. **[GPT](https://github.com/openai/finetune-transformer-lm)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
3. **[GPT-2](https://blog.openai.com/better-language-models/)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).

| Section | Description |
|-|-|
| [Installation](#installation) | How to install the package |
| [Quick tour: Usage](#quick-tour-usage) | Tokenizers & models usage: Bert and GPT-2 |
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers |
| [Documentation](https://huggingface.co/pytorch-transformers/) | Full API documentation and more |
## Installation
This repo is tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 0.4.1 to 1.1.0.
### With pip
PyTorch-Transformers can be installed with pip as follows:
```bash
pip install pytorch-transformers
```
### From source
Clone the repository and run:
```bash
pip install [--editable] .
```
### Tests
A series of tests is included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/pytorch-transformers/tree/master/examples).

These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).

You can run the tests from the root of the cloned repository with the commands:
```bash
python -m pytest -sv ./pytorch_transformers/tests/
python -m pytest -sv ./examples/
```
## Quick tour: Usage
Here are two quick-start examples using `Bert` and `GPT2` with pre-trained models.

See the [documentation](#documentation) for the details of all the models and classes.
### BERT example
First, let's prepare a tokenized input from a text string using `BertTokenizer`:
```python
import torch
from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated with the 1st and 2nd sentences (see the paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
```
Let's see how to use `BertModel` to encode our inputs into hidden states:
```python
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models' docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # PyTorch-Transformers models always output tuples.
    # See the models' docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]

# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)
```
And how to use `BertForMaskedLM` to predict a masked token:
```python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

# Confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
```
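The assertion above only checks the single most likely token. If you want to look at a few alternative candidates, a small optional extension of the snippet (reusing the `predictions` tensor and `tokenizer` from above; the choice of `k=5` is arbitrary) could look like this:

```python
# Inspect the top-5 candidate tokens for the masked position (k=5 is an arbitrary choice)
top_values, top_indices = torch.topk(predictions[0, masked_index], k=5)
print(tokenizer.convert_ids_to_tokens(top_indices.tolist()))  # 'henson' should be among the candidates
```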
### OpenAI GPT-2
Here is a quick-start example using the `GPT2Tokenizer` and `GPT2LMHeadModel` classes with OpenAI's pre-trained model to predict the next token from a text prompt.

First, let's prepare a tokenized input from our text string using `GPT2Tokenizer`:
```python
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "Who was Jim Henson ? Jim Henson was a"
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
```
Let's see how to use `GPT2LMHeadModel` to generate the next token following our text:
```python
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word (in our case, the word 'man')
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'
```
Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation).
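All architectures share the same `from_pretrained()` / tokenizer API, so switching models is mostly a matter of swapping class names and pre-trained weight identifiers. As a rough sketch (assuming the `xlnet-base-cased` weights are available for download), encoding a sentence with XLNet looks very much like the BERT example above:

```python
import torch
from pytorch_transformers import XLNetTokenizer, XLNetModel

# Load a pre-trained tokenizer and model (here XLNet, but the pattern is the same for the other architectures)
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')
model.eval()

input_ids = torch.tensor([tokenizer.encode("Who was Jim Henson ? Jim Henson was a puppeteer")])
with torch.no_grad():
    # Models always output tuples; the first element is the last hidden state
    last_hidden_state = model(input_ids)[0]
```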
## Quick tour: Fine-tuning/usage scripts
The library comprises several example scripts with state-of-the-art (SOTA) performance on NLU and NLG tasks:
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
- `run_generation.py`: an example using GPT, GPT-2, Transformer-XL and XLNet for conditional language generation
- other model-specific examples (see the documentation).
Here are three quick usage examples for these scripts:
2019-01-10 08:25:28 +08:00
2019-07-16 19:56:47 +08:00
### `run_glue.py`: Fine-tuning on GLUE tasks for sequence classification
The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```shell
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/
```
where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be written to the text file `eval_results.txt` in the specified `output_dir`. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
#### Fine-tuning XLNet model on the STS-B regression task
This example code fine-tunes XLNet on the STS-B corpus using parallel training on a server with 4 V100 GPUs.
Parallel training is a simple way to use several GPUs (but it is slower and less flexible than distributed training, see below).
```shell
export GLUE_DIR=/path/to/glue

python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --task_name=sts-b \
    --data_dir=${GLUE_DIR}/STS-B \
    --output_dir=./proc_data/sts-b-110 \
    --max_seq_length=128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --max_steps=1200 \
    --overwrite_output_dir \
    --overwrite_cache \
    --warmup_steps=120
```
On this machine we thus have a batch size of 32 (4 GPUs × a per-GPU batch size of 8); please increase `gradient_accumulation_steps` to reach the same effective batch size if you have a smaller machine.
These hyper-parameters give an evaluation Pearson correlation coefficient of `0.918`.
#### Fine-tuning Bert model on the MRPC classification task
This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) using distributed training on 8 V100 GPUs to reach an F1 > 92.
```bash
python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --task_name MRPC \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/ \
    --overwrite_output_dir \
    --overwrite_cache
```
Training with these hyper-parameters gave us the following results:
```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```
### `run_squad.py`: Fine-tuning on SQuAD for question-answering
This example code fine-tunes the Bert Whole Word Masking uncased model on the SQuAD dataset using distributed training on 8 V100 GPUs to reach an F1 > 93 on SQuAD:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_predict \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
    --per_gpu_eval_batch_size=3 \
    --per_gpu_train_batch_size=3
```
Training with these hyper-parameters gave us the following results:
```bash
python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
{"exact_match": 86.91579943235573, "f1": 93.1532499015869}
```
This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.
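If you simply want to use this fine-tuned checkpoint rather than reproduce the training run above, a minimal sketch is shown below (assuming the tokenizer files are also published under the same shortcut name; the exact answer you get depends on the question/context pair you feed in):

```python
import torch
from pytorch_transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model.eval()

# Same "[CLS] question [SEP] context [SEP]" format as the BERT example above
question, context = "Who was Jim Henson ?", "Jim Henson was a puppeteer who created The Muppets"
tokens = tokenizer.tokenize("[CLS] " + question + " [SEP] " + context + " [SEP]")
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# Segment ids: 0 for the question (up to and including the first [SEP]), 1 for the context
first_sep = tokens.index('[SEP]')
segment_ids = [0] * (first_sep + 1) + [1] * (len(tokens) - first_sep - 1)

with torch.no_grad():
    outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
start_logits, end_logits = outputs[:2]

# Pick the most likely start/end positions and read the answer span back as tokens
answer_start = torch.argmax(start_logits).item()
answer_end = torch.argmax(end_logits).item()
print(tokens[answer_start:answer_end + 1])
```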
### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
A conditional generation script is also included to generate text from a prompt.
The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (a predefined text is prepended to make short inputs longer).

Here is how to run the script with the small version of the OpenAI GPT-2 model:
```shell
python ./examples/run_generation.py \
    --model_type=gpt2 \
    --length=20 \
    --model_name_or_path=gpt2
```
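At its core, what the script does for GPT-2 amounts to a simple sampling loop. Here is a rough, hedged sketch of that idea using only the `GPT2Tokenizer`/`GPT2LMHeadModel` classes from the quick tour and greedy decoding (the actual script also supports temperature, top-k/top-p sampling, the other model types and the memory tricks mentioned above; the prompt below is arbitrary):

```python
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

generated = tokenizer.encode("The Manhattan bridge")  # any prompt works here
with torch.no_grad():
    for _ in range(20):  # generate 20 tokens, mirroring --length=20 above
        outputs = model(torch.tensor([generated]))
        next_token_logits = outputs[0][0, -1, :]
        next_token = torch.argmax(next_token_logits).item()  # greedy decoding for simplicity
        generated.append(next_token)

print(tokenizer.decode(generated))
```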
## Migrating from pytorch-pretrained-bert to pytorch-transformers

Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `pytorch-transformers`.

### Models always output `tuples`

The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the models' forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.

The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).

In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.

Here is a `pytorch-pretrained-bert` to `pytorch-transformers` conversion example for a `BertForSequenceClassification` classification model:
```python
# Let's load our model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)
# Now just use this line in pytorch-transformers to extract the loss from the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]
# In pytorch-transformers you can also have access to the logits:
loss, logits = outputs[:2]
# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
```
### Serialization

While not a breaking change, the serialization methods have been standardized and you should probably switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.

Here is an example:
```python
### Let's load a model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
### Do some stuff to our model and tokenizer
# Ex: add new tokens to the vocabulary and embeddings of our model
tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))
# Train our model
train(model)
### Now let's save our model and tokenizer to a directory
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')
### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
```
### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer, which matches the PyTorch `Adam` optimizer API.

The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and are not part of the optimizer anymore.

Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` with the same schedule:
```python
# Parameters:
lr = 1e-3
num_total_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1

### Previously, the BertAdam optimizer was instantiated like this:
optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_total_steps)
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()

### In PyTorch-Transformers, optimizer and schedules are split and instantiated like this:
from pytorch_transformers import AdamW, WarmupLinearSchedule
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps)  # PyTorch scheduler
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()
    scheduler.step()  # Update the learning rate schedule (called after optimizer.step())
```
## Citation
At the moment, there is no paper associated with PyTorch-Transformers, but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.