Llama2
Overview
The Llama2 model was proposed in Llama 2: Open Foundation and Fine-Tuned Chat Models by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. It is a collection of foundation language models ranging from 7B to 70B parameters, with checkpoints fine-tuned for chat applications.
The abstract from the paper is the following:
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
Check out all Llama2 model checkpoints here. This model was contributed by Arthur Zucker with contributions from Lysandre Debut. The code of the implementation in Hugging Face is based on GPT-NeoX here. The original code of the authors can be found here.
Usage tips
The Llama2 models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.

The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online), then cast to the default `dtype` of `torch` (`torch.float32`), and finally, if a `torch_dtype` is provided in the config, that dtype will be used.

Training the model in `float16` is not recommended and is known to produce `nan` values; as such, the model should be trained in `bfloat16`.
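As a sketch of the difference (the checkpoint name is just an example):

```python
from transformers import AutoModelForCausalLM

# Default: weights are downloaded in their stored dtype (float16 for
# the Llama2 checkpoints) and then cast to torch's default float32.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
print(model.dtype)  # torch.float32

# torch_dtype="auto" instead keeps the torch_dtype recorded in the
# checkpoint's config, so no upcast to float32 happens.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype="auto"
)
print(model.dtype)  # torch.float16
```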
Tips:
- Weights for the Llama2 models can be obtained by filling out this form.
- The architecture is very similar to the first Llama, with the addition of Grouped Query Attention (GQA) following this paper.
- Setting `config.pretraining_tp` to a value other than 1 activates a more accurate but slower computation of the linear layers, which should better match the original logits (see the first sketch below).
- The original model uses `pad_id = -1`, which means that there is no padding token. We can't use the same logic here; make sure to add a padding token with `tokenizer.add_special_tokens({"pad_token": "<pad>"})` and resize the token embeddings accordingly. You should also set `model.config.pad_token_id`. The `embed_tokens` layer of the model is initialized with `self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)`, which makes sure that encoding the padding token outputs zeros, so passing it at initialization is recommended (see the second sketch below).
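A minimal sketch of the `pretraining_tp` tip (the checkpoint name is just an example; config overrides like this are forwarded by `from_pretrained`):

```python
from transformers import AutoModelForCausalLM

# pretraining_tp > 1 replays the tensor-parallel split of the linear
# layers used during pretraining: slower, but numerically closer to
# the original logits.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", pretraining_tp=2
)
```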
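And a sketch of the padding-token setup (the checkpoint name again is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Register a padding token (the checkpoints ship without one)...
tokenizer.add_special_tokens({"pad_token": "<pad>"})
# ...resize the token embeddings to make room for it...
model.resize_token_embeddings(len(tokenizer))
# ...and record its id on the model config.
model.config.pad_token_id = tokenizer.pad_token_id
```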
- After filling out the form and gaining access to the model checkpoints, you should be able to use the already converted checkpoints. Otherwise, if you are converting your own model, feel free to use the conversion script. The script can be called with the following (example) command:
```bash
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```
- After conversion, the model and tokenizer can be loaded via:
```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
```
Note that executing the script requires enough CPU RAM to host the whole model in float16 precision. Even though the biggest versions come in several checkpoints, each of them contains a part of every weight of the model, so they all need to be loaded in RAM at once. For the 70B model, that means 140GB of RAM (70B parameters × 2 bytes per float16 weight).
- The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of a word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string, as demonstrated in the sketch below.
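A quick demonstration of this behavior (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# "Banana" is encoded with a word-start marker ("▁Banana"); on decode,
# sentencepiece drops the marker of a sequence-initial token instead of
# turning it into a leading space.
ids = tokenizer("Banana", add_special_tokens=False).input_ids
print(tokenizer.decode(ids))  # "Banana", not " Banana"
```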
- When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, simply set either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because Flash Attention only supports the `fp16` and `bf16` data types; a sketch follows this list.
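A minimal sketch of that Flash Attention 2 setup, assuming a CUDA device (the checkpoint name and prompt are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# No torch_dtype here: casting to fp16/bf16 is left to autocast below
# (or to fp16=True / bf16=True in TrainingArguments when using Trainer).
model = AutoModelForCausalLM.from_pretrained(
    model_id, attn_implementation="flash_attention_2"
).to("cuda")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
# Flash Attention only supports fp16 and bf16, so run under autocast.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(**inputs)
```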
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaMA2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- Llama 2 is here - get it on Hugging Face, a blog post about Llama 2 and how to use it with 🤗 Transformers and 🤗 PEFT.
- LLaMA 2 - Every Resource you need, a compilation of relevant resources to learn about LLaMA 2 and how to get started quickly.
- A notebook on how to fine-tune Llama 2 in Google Colab using QLoRA and 4-bit precision. 🌎
- A notebook on how to fine-tune the "Llama-v2-7b-guanaco" model with 4-bit QLoRA and generate Q&A datasets from PDFs. 🌎
- A notebook on how to fine-tune the Llama 2 model with QLoRA, TRL, and a Korean text classification dataset. 🌎🇰🇷
⚗️ Optimization
- Fine-tune Llama 2 with DPO, a guide to using the TRL library's DPO method to fine-tune Llama 2 on a specific dataset.
- Extended Guide: Instruction-tune Llama 2, a guide to training Llama 2 to generate instructions from inputs, transforming the model from instruction-following to instruction-giving.
- A notebook on how to fine-tune the Llama 2 model on a personal computer using QLoRA and TRL. 🌎
⚡️ Inference
- A notebook on how to quantize the Llama 2 model using GPTQ from the AutoGPTQ library. 🌎
- A notebook on how to run the Llama 2 Chat Model with 4-bit quantization on a local computer or Google Colab. 🌎
🚀 Deploy
- Fine-tune LLaMA 2 (7-70B) on Amazon SageMaker, a complete guide from setup to QLoRA fine-tuning and deployment on Amazon SageMaker.
- Deploy Llama 2 7B/13B/70B on Amazon SageMaker, a guide on using Hugging Face's LLM DLC container for secure and scalable deployment.
LlamaConfig
autodoc LlamaConfig
LlamaTokenizer
autodoc LlamaTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary
LlamaTokenizerFast
autodoc LlamaTokenizerFast - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - update_post_processor - save_vocabulary
LlamaModel
autodoc LlamaModel - forward
LlamaForCausalLM
autodoc LlamaForCausalLM - forward
LlamaForSequenceClassification
autodoc LlamaForSequenceClassification - forward