Introduction

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating remarkable capabilities across a wide range of tasks. However, fine-tuning these massive models for specific applications presents significant challenges, particularly in terms of computational resources and memory requirements. Enter QLoRA (Quantized Low-Rank Adaptation), a technique that combines quantization and low-rank adaptation to enable fast, memory-efficient fine-tuning of LLMs when hardware resources are limited.

In this blog post, we'll explore the quantization and low-rank adaptation techniques behind QLoRA, walk through a practical implementation, and discuss its impact on training LLMs.


Background - Quantisation and LoRA

Before we delve into QLoRA, let's establish a foundational understanding of its two key components: quantization and Low-Rank Adaptation (LoRA).

Quantization

Quantization is a technique for reducing the precision of a model's parameters and activations. By representing these values with fewer bits, quantization significantly reduces memory usage and computational requirements, often with minimal impact on model performance.

Precision-based Quantisation

Here are the numeric precision formats commonly used when training and serving models:

  1. FP32 (32-bit floating-point): the standard full-precision format, with 1 sign bit, 8 exponent bits, and 23 mantissa bits. Accurate, but the most memory-hungry option.

  2. FP16 (16-bit floating-point): half precision with 5 exponent bits and 10 mantissa bits. It halves memory usage but has a narrow dynamic range, which can cause overflow or underflow during training.

  3. BF16 (Brain Floating Point): a 16-bit format that keeps FP32's 8 exponent bits but only 7 mantissa bits, preserving FP32's dynamic range at reduced precision. Widely used for mixed-precision training.

  4. INT8 (8-bit integer): values are stored as 8-bit integers together with a scale (and optionally a zero-point). Commonly used for inference and 8-bit optimizers.

  5. INT4 (4-bit integer): only 16 representable values per weight, so it is extremely memory-efficient but requires careful quantization schemes (such as NF4, discussed below) to preserve model quality.

Quantization Schemes

A quantization scheme is a method for mapping a large set of input values to a smaller set of output values, and it is typically used to reduce the precision of data representation. For example, a quantization scheme might be used to convert data from FP32 to BF16 or INT8.

Some important quantization factors to be considered are:

  1. Scale: the factor that maps real values onto the quantized grid.

  2. Zero-point: the quantized value that represents the real number 0 (relevant for asymmetric schemes).

  3. Range (clipping): the minimum and maximum real values the quantized grid must cover.

  4. Granularity: whether quantization constants are computed per tensor, per channel, or per block.

Based on these factors, some important quantization schemes include the following:

  1. Linear Quantization: maps real values to integers with a uniform step size, defined by a scale factor (and optionally a zero-point). It is simple and fast but can waste precision when values are not uniformly distributed. A minimal sketch is shown after this list.

  2. Non-linear Quantization: places quantization levels non-uniformly (for example, at quantiles of the value distribution) so that more levels are spent where values are dense. NF4, used by QLoRA, is a non-linear scheme.

  3. Symmetric vs. Asymmetric Quantization: a symmetric scheme centers the quantized range on zero (zero-point of 0), while an asymmetric scheme uses a non-zero zero-point to better cover skewed value ranges.
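To make the linear case concrete, here is a minimal sketch of symmetric INT8 absmax quantization in PyTorch. This is purely illustrative; libraries such as bitsandbytes use more refined block-wise variants.

import torch

def quantize_int8_symmetric(x: torch.Tensor):
    """Symmetric linear quantization: map floats to int8 with a single scale."""
    scale = x.abs().max() / 127.0            # absmax scaling, zero-point fixed at 0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original floats."""
    return q.float() * scale

x = torch.randn(6)
q, scale = quantize_int8_symmetric(x)
x_hat = dequantize_int8(q, scale)            # close to x, up to quantization error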

Now, let’s understand the Low-Rank Adaptation in detail.

Low-Rank Adaptation (LoRA) in Depth

LoRA, introduced by Hu et al. (2021), is a parameter-efficient fine-tuning (PEFT) method that freezes the pre-trained model weights and injects trainable low-rank matrices into each layer of the transformer architecture.

The fundamental idea behind LoRA is to represent the weight updates during fine-tuning as the product of two low-rank matrices rather than updating the entire model. This approach significantly reduces the number of trainable parameters while allowing for effective model adaptation to new tasks.

Mathematical Formulation of LoRA

LoRA builds on the observation that pre-trained LLMs have a low intrinsic dimension: they can still learn effectively even when their weight updates are restricted to a much smaller subspace.

Let W₀ ∈ ℝᵈˣᵏ be the pre-trained weight matrix of a layer in the original model. During fine-tuning with LoRA, instead of directly updating W₀, we introduce a low-rank update:

W = W₀ + BA

Where:

  1. B ∈ ℝᵈˣʳ and A ∈ ℝʳˣᵏ are the trainable low-rank matrices.

  2. r ≪ min(d, k) is the rank of the update, a small hyperparameter (typically between 4 and 64).

The product BA represents the weight update ΔW, and only A and B are trained during fine-tuning; W₀ stays frozen.

In the original LoRA paper, this decomposition is applied to the weight matrices of the transformer's self-attention modules (typically the query and value projections). Only these low-rank adapters are trained, while the remaining pre-trained model weights are frozen.
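As an illustration, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch. It is a simplified stand-in for library implementations such as PEFT, not the exact code they use.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: y = W₀x + (α/r)·BAx."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze pre-trained weights
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
y = layer(torch.randn(2, 512))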

Key Components of LoRA

  1. Rank (r): the rank of the update matrices. Smaller values mean fewer trainable parameters; common choices are 4, 8, 16, or 32.

  2. Scaling Factor (α): the LoRA update is scaled by α/r before being added to the frozen weights, controlling how strongly the adapters influence the output.

  3. Target Modules: the layers to which adapters are attached, typically the attention projection matrices (for example, the query and value projections), though other linear layers can also be targeted.

LoRA in Practice

During fine-tuning, LoRA is implemented as follows:

  1. Initialization: A is initialized with small random values and B is initialized to zero, so the update BA is zero at the start and training begins from the unmodified pre-trained model.

  2. Training Process: the pre-trained weights stay frozen; gradients are computed and applied only to A and B, drastically reducing the number of trainable parameters and the optimizer state.

  3. Inference: the learned update can be merged into the base weights (W = W₀ + (α/r)BA) for zero added latency, or kept separate so that several task-specific adapters can share one base model. A small merging sketch follows this list.
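For example, merging could be sketched as follows, continuing the hypothetical LoRALinear module above (PEFT offers an equivalent merge_and_unload() helper):

import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> None:
    """Fold the low-rank update into the frozen base weights for inference."""
    delta_w = layer.scaling * (layer.B @ layer.A)    # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    layer.base.weight += delta_w                     # W = W₀ + (α/r)·BA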

Now, let's understand the big picture of how QLoRA is implemented as a combination of LoRA and quantization.

QLoRA: Combining Quantisation and LoRA

QLoRA, introduced by Dettmers et al. (2023), integrates advanced quantization techniques with LoRA to create a highly efficient fine-tuning method for large language models.

Before we delve into the specifics of the QLoRA workflow, we should note some key innovations introduced in the QLoRA paper.

Key Innovations

  1. 4-bit NormalFloat Quantization

NormalFloat (NF4) is a novel 4-bit quantization data type designed for normally distributed values, such as neural network weights, allowing the base model to be compressed aggressively while maintaining performance.

The NormalFloat quantization process involves:

  1. Estimating the parameters (μ, σ) of the normal distribution that best fits the weight tensor.

  2. Defining non-linear quantization boundaries based on the cumulative distribution function (CDF) of the normal distribution.

  3. Mapping weights to 4-bit integers based on these boundaries.

4-bit NormalFloat (NF4) is the data type used to quantize and store the frozen base model weights during QLoRA training. A rough sketch of the quantile-based construction follows.
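The sketch below illustrates the idea of placing the 16 quantization levels at quantiles of a standard normal distribution. It is a simplification of the actual NF4 construction in bitsandbytes, which additionally guarantees an exact zero level and uses block-wise absmax scaling.

import torch

# Place 16 levels at quantiles of N(0, 1) so each level covers roughly equal
# probability mass of normally distributed weights, then normalize to [-1, 1].
normal = torch.distributions.Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)          # avoid the infinite tails
levels = normal.icdf(probs)
levels = levels / levels.abs().max()

def quantize_nf4_like(w: torch.Tensor):
    """Map each weight to the index of its nearest quantization level."""
    scale = w.abs().max()                        # absmax scaling (per tensor here, per block in practice)
    w_norm = w / scale
    idx = (w_norm.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx, scale

w = torch.randn(8)
idx, scale = quantize_nf4_like(w)
w_dequant = levels[idx] * scale                  # approximate reconstruction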

  2. Double Quantization

Double quantization is a technique that further reduces memory usage:

  1. First, it quantizes model weights to 4-bit precision using NormalFloat.

  2. Then, it quantizes the resulting quantization constants (scaling factors and zero-points) to 8-bit precision.

This two-step process significantly reduces the memory footprint of quantization constants, which can be substantial in large models.
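Here is a toy sketch of the idea; the block size of 64 and the plain INT8 second-level quantization are simplifying assumptions (the paper uses 8-bit floats with their own block structure), but the principle is the same.

import torch

weights = torch.randn(4096)
block_size = 64
blocks = weights.view(-1, block_size)

# First-level constants: one FP32 absmax scale per block of weights,
# as produced by block-wise 4-bit quantization.
absmax = blocks.abs().max(dim=1).values          # shape: (num_blocks,)

# Second-level quantization: compress those constants themselves to INT8,
# keeping only a single FP32 scale for the whole group of constants.
c2_scale = absmax.max() / 127.0
absmax_int8 = torch.round(absmax / c2_scale).to(torch.int8)

# When dequantizing weights, the constants are first recovered from INT8.
absmax_recovered = absmax_int8.float() * c2_scale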

  3. Paged Optimizers

Paged optimizers efficiently manage memory by allocating optimizer states in CUDA unified memory, which can be paged automatically between GPU and CPU RAM. When a memory spike occurs (for example, while processing an unusually long sequence), optimizer states are evicted to CPU memory and paged back to the GPU when needed, avoiding out-of-memory errors during training.

Integration of Quantization and LoRA

QLoRA combines quantization and LoRA in a synergistic manner:

  1. Quantized Base Model: the pre-trained weights are quantized to 4-bit NF4 and kept frozen, drastically shrinking the memory required just to hold the model.

  2. Full-precision LoRA Updates: the LoRA adapter matrices are kept in 16-bit precision (BF16) and are the only trainable parameters.

  3. Quantization-aware Training: during the forward and backward passes, the 4-bit weights are dequantized to BF16 on the fly for computation, so gradients flow through the adapters in higher precision while the base model stays stored in 4 bits.

  4. Memory-efficient Optimization: paged optimizers and gradient accumulation keep optimizer state and activation memory manageable on a single GPU.

End-to-End QLoRA Workflow

  1. Load the pre-trained model and quantize it to 4-bit precision using NormalFloat.

  2. Add LoRA adapters to the quantized model, initializing them in 16-bit precision (BF16).

  3. Use paged optimizers and gradient accumulation for memory-efficient training.

  4. During training, perform quantization-aware forward and backward passes.

  5. Update only the LoRA parameters, keeping the quantized base model fixed.

  6. For inference, merge the LoRA updates with the quantized base model or keep them separate for task-specific adaptation.

This combination of techniques allows QLoRA to fine-tune models with billions of parameters on consumer-grade hardware, democratizing access to state-of-the-art language models.

Now that we have understood the theory of how QLoRA works in detail, let's implement it in code and fine-tune a large language model.

Implementation of QLoRA

Let's walk through a practical implementation of QLoRA using the Hugging Face Transformers and PEFT libraries.

  1. Install the necessary libraries:

pip install transformers peft bitsandbytes accelerate

  2. Load and Quantize the model.

Here, we load a 6.7-billion-parameter model in the 4-bit NormalFloat data type with double quantization enabled, and set the compute dtype to BF16 so that the forward and backward passes (and the LoRA adapters added later) run in BF16.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-6.7b"  # Example large model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
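Optionally, PEFT's prepare_model_for_kbit_training helper is commonly applied to the quantized model before adding adapters; it freezes the base weights, upcasts a few numerically sensitive layers, and enables gradient checkpointing for extra memory savings. This step is recommended but not strictly required.

from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)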

  3. Configure LoRA

Wrap the quantized model with LoRA adapters, specifying parameters such as the rank (r), alpha, and the target modules to adapt.

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, peft_config)
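As a quick sanity check, PEFT can report how few parameters are actually trainable after wrapping the model:

# Prints the count of trainable (LoRA) parameters vs. total parameters;
# with r=8 on the attention projections this is a tiny fraction of the model.
model.print_trainable_parameters()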

  4. Load and Prepare a Dataset

For this example, let’s use the IMDb dataset from the datasets library to fine-tune our LLM.

from datasets import load_dataset

dataset = load_dataset("imdb")  # Example dataset

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# For causal LM fine-tuning we only need the token ids, so drop the raw text
# and the classification label column.
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text", "label"])

  5. Configure Training Arguments

Let's configure the training arguments before we start training the model with the transformers library. Note that the optimizer is set to paged_adamw_8bit, which uses 8-bit optimizer states and pages them to CPU memory when GPU memory runs short.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,  # match the BF16 compute dtype set in BitsAndBytesConfig
    save_total_limit=3,
    logging_steps=100,
    optim="paged_adamw_8bit"
)

  6. Train the model

from transformers import Trainer, DataCollatorForLanguageModeling

# Causal LM training: the collator pads each batch and sets labels = input_ids.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

After training, the LoRA adapter weights are saved for later inference.
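For example, you might save just the adapter weights (a few megabytes) rather than the full model, and later reattach them to a freshly quantized base model using the PEFT API; the paths below are illustrative.

# Save only the trained LoRA adapter weights and config
model.save_pretrained("./qlora-opt-6.7b-adapters")

# Later: reload the quantized base model and attach the adapters for inference
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
inference_model = PeftModel.from_pretrained(base_model, "./qlora-opt-6.7b-adapters")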

Conclusion

QLoRA (Quantized Low-Rank Adaptation) represents a significant leap forward in model fine-tuning. By combining quantization techniques with low-rank adaptation, QLoRA dramatically reduces the memory footprint required for training while maintaining model quality. This breakthrough allows for faster, more efficient fine-tuning of large language models on consumer-grade hardware, opening up new possibilities for customization and specialization of AI models.

The advent of QLoRA, alongside other parameter-efficient fine-tuning (PEFT) methods like LoRA and complementary approaches such as instruction fine-tuning, is democratizing access to powerful language models. These techniques make it possible for researchers, developers, and organizations of all sizes to work with and adapt state-of-the-art LLMs for specific applications. As these methods continue to evolve, we're moving closer to a future where advanced AI capabilities are not limited to tech giants but are accessible to a global community of innovators.

References

  1. Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314.

  2. Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685.

  3. https://huggingface.co/blog/4bit-transformers-bitsandbytes


If you enjoyed this blog, please click the ❤️ button, share it with your peers, and subscribe for more content. Your support helps spread the knowledge and grow our community. Thank you!
