Introduction

Large language models (LLMs) increasingly exhibit human-like abilities and, in certain aspects, even surpass human performance, thanks to their remarkable text generation capabilities. A key factor behind this success is our ability to control how these models generate text so that the output is coherent and creative. By adjusting the mechanisms behind text generation, we can steer LLMs to produce text in desired ways. Continuing our blog series, we will delve into additional sampling methods, such as top-k sampling, nucleus (top-p) sampling, and custom sampling that combines the two, to further enhance the quality of generated text.

Exploring Text Generation Techniques

1. Top-k Sampling

Top-k sampling is a widely used variation of random sampling that limits the pool of tokens we can sample from at each timestep. The core idea is to eliminate low-probability options by considering only the k tokens with the highest probability. This technique often produces more natural-sounding text than unrestricted random sampling.

How Top-k Sampling Works
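
At each generation step, the model assigns a probability to every token in the vocabulary. Top-k sampling keeps only the k most likely tokens, renormalizes their probabilities, and samples the next token from that reduced pool. The snippet below is a minimal sketch of this idea on raw logits (the function name and the standalone logits tensor are illustrative, not the actual filtering code inside transformers):

import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int, temperature: float = 1.0) -> int:
    """Sample one token id from the k highest-probability tokens (illustrative sketch)."""
    logits = logits / temperature                     # temperature scaling
    top_values, top_indices = torch.topk(logits, k)   # keep the k largest logits
    probs = F.softmax(top_values, dim=-1)             # renormalize over those k tokens
    choice = torch.multinomial(probs, num_samples=1)  # sample within the reduced pool
    return top_indices[choice].item()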

Implementation

Top-k sampling can be activated easily in the transformers generate() function by setting the do_sample parameter to True, using the temperature parameter to control creativity, and using top_k to define the number of tokens to sample from. The value of k has to be chosen manually, based on the vocabulary size and the probability distribution.

Below is an example of how to implement top-k sampling with the GPT-2 XL model using the transformers library:

Case I: With temperature=2.0 (high) and top_k=70

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Setup the model and tokenizer
checkpoint = "gpt2-xl"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Encode the inputs
input_text = "Once upon a time, there was a man who"
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to(device)

# Top-k sampling
output = model.generate(input_ids, max_new_tokens=70, do_sample=True, temperature=2.0, top_k=70)
generated_text = tokenizer.decode(output[0])
print(generated_text)

Code Output:

Once upon a time, there was a man who wanted an autographeter a lot in honor "something his old band, Bon Iver's indie icon brothers Scott and Jared had made for him", thus started "This Old Radio Show" in July 2016 by Aaron Schock's son Kyle as guest cohost. [1] As per [note2].""<|endoftext|>

The output generated above is unusually diverse and lacks clarity about what it is conveying. Setting a high temperature value (i.e. >1.0) is responsible for the high variation and diversity in the generated content.
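
Concretely, temperature divides the logits before the softmax: values above 1.0 flatten the distribution so that unlikely tokens become more probable, while values below 1.0 sharpen it. A quick toy example (with made-up logits for three candidate tokens) illustrates the effect:

import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 2.0, 1.0])  # toy logits for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / t, dim=-1)
    print(f"temperature={t}: {probs.tolist()}")
# Higher temperature -> flatter distribution -> more diverse (and riskier) samples.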

Case II: With temperature=0.9 and top_k=70

output = model.generate(input_ids, max_new_tokens=70, do_sample=True, temperature=0.9, top_k=70)
generated_text = tokenizer.decode(output[0])
print(generated_text)

Code Output:

Once upon a time, there was a man who made a beautiful thing, a beautiful thing that would bring happiness, joy, beauty, and hope to everyone - one that had a heart and soul that would not be broken.

The man, who made this, called it "I am the Sun".

His name was, and is, J.C. Sutter.

We can observe that the above text has better clarity and coherence than the previous output.

Now, let’s explore another sampling method that also restricts the output distribution, but with a dynamic cutoff rather than the fixed number of tokens used in top-k.

2. Nucleus Sampling

Nucleus sampling, or top-p sampling, is another important technique for controlling the randomness and creativity of the generated text. Instead of choosing a fixed number of tokens, top-p sampling sets a condition for when to cut off: we stop once a certain cumulative probability mass has been captured by the selected tokens.

While top-k sampling focuses on the total number of considered words, top-p focuses on the total probability or cumulative probability captured.

How Nucleus Sampling Works

This technique can be more flexible than top-k because the number of tokens considered changes dynamically with the shape of the probability distribution.
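
In practice, the tokens are sorted by probability, the smallest prefix whose cumulative probability reaches p is kept (the "nucleus"), and the next token is sampled from that renormalized subset. The sketch below illustrates the idea (the function name and standalone logits tensor are illustrative, not the actual filtering code inside transformers):

import torch
import torch.nn.functional as F

def top_p_sample(logits: torch.Tensor, p: float, temperature: float = 1.0) -> int:
    """Sample one token id from the smallest set of tokens whose cumulative probability reaches p."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1                # smallest prefix reaching mass p
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize the nucleus
    choice = torch.multinomial(nucleus, num_samples=1)             # sample within the nucleus
    return sorted_indices[choice].item()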

Implementation

Nucleus sampling can be activated in the transformers generate() function by setting the probability threshold with the top_p parameter. We can then control randomness and diversity using the temperature parameter.

Below is an example of how to implement top-p sampling with the GPT-2 XL model using the transformers library:

output = model.generate(input_ids, max_new_tokens=70, do_sample=True, temperature=0.9, top_p=0.9)
generated_text = tokenizer.decode(output[0])
print(generated_text)

Code Output:

Once upon a time, there was a man who came to the King of the Fairies and asked, "what is the best way to gain eternal life?" He heard that the Fairy Queen had a secret name, and would answer nothing, only that she was "the most beautiful lady in all the world." He asked her, "Whence comes that name?" And she answered that she had it

Now, let’s see how custom sampling, which employs both the top_k and top_p parameters together, performs for text generation.

3. Custom Sampling

Custom sampling combines top_k and top_p sampling techniques to achieve the best of both worlds. If we set top_k=50 and top_p=0.9, this corresponds to the rule of choosing tokens with a probability mass of 90% from a pool of at most 50 tokens.

This method provides a finer degree of control over the generated text, helps generate grammatically correct text, and is potentially more creative than individual sampling techniques.
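
One way to picture the combination is to apply the top-k restriction first and then the top-p cutoff within the surviving pool, which matches how the transformers library typically chains these filters. The sketch below (with an illustrative function name) combines the two ideas:

import torch
import torch.nn.functional as F

def top_k_top_p_sample(logits: torch.Tensor, k: int, p: float, temperature: float = 1.0) -> int:
    """Restrict to the k most probable tokens, then sample from the smallest
    subset of those whose cumulative probability reaches p (illustrative sketch)."""
    top_values, top_indices = torch.topk(logits / temperature, k)  # top-k filter (values sorted descending)
    probs = F.softmax(top_values, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1                # top-p filter inside the pool
    kept = probs[:cutoff] / probs[:cutoff].sum()                   # renormalize
    choice = torch.multinomial(kept, num_samples=1)
    return top_indices[choice].item()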

Implementation

We can implement this by defining both top_k and top_p parameters. Consider the example below.

output = model.generate(input_ids, max_new_tokens=70, do_sample=True, temperature=0.9, top_p=0.9, top_k=50)
generated_text = tokenizer.decode(output[0])
print(generated_text)

Code Output:

Once upon a time ive been the sole proprietor of a small business, and since its inception i have been very much influenced by the people i have met through the website, and i will continue to do so.

My business goal is to grow a small business and take it to the next level, and i want to do that through a forum like this, 

Conclusion

The potential and prowess of large language models (LLMs) in generating text, images, and multimedia are advancing rapidly and show no signs of slowing down as research and development efforts persist. As such, we must grasp and master the various text generation techniques, which play a pivotal role in achieving desired outcomes from LLMs through appropriate decoding settings. In conclusion, our exploration of these techniques is not just an academic exercise but a practical necessity for harnessing the full capabilities of LLMs in the ever-evolving landscape of artificial intelligence.

Happy Learning!!


Thanks for reading NeuraForge: AI Unleashed! Please subscribe to the newsletter for a deeper and more nuanced understanding of advanced AI concepts! 🚀🤖