Mastering The Fundamental Principle of Transformers: Attention ⚡ Quick Start LLMs

Peter Chung
6 min read · Dec 15, 2023


This is Part 3 of the series Quick Start Large Language Models: An Accessible Code-First Approach To Working with LLMs.

Reading time: ~6 minutes

AI development is accelerating fast. New models are being open-sourced and made available daily.

It’s been a challenge to keep up.

So to help broaden understanding and encourage greater development with LLMs, I’ve started this Quick Start series.

Building on the earlier parts of this series, where we covered tokenization and embeddings, we will now dive into the workings of the Transformer model, specifically the attention mechanism, to see how large language models process data internally.

Transformers are a neural network architecture first introduced by a research team at Google in the seminal 2017 paper “Attention Is All You Need.”

The team took an existing approach to “attention” and applied it in a layered way to produce greater context throughout a given sentence and structure. In essence, the structure allows the model to weigh different parts of sequenced data (such as words in a sentence) in relation to each other, providing a context-aware representation of the entire sequence being examined.

Attention Is All You Need [2017]

It was originally built for machine translation, but the architecture was quickly adopted by AI/ML practitioners and went on to achieve state-of-the-art (SOTA) results across many other tasks, such as text generation and question-answering.

To work with the current implementation of LLMs, I think it is helpful to build an intuition of this mechanism. So let’s take a look at how attention is calculated and used in the operations of a neural network.

First, we’ll break down some components. Attention in transformers, or more precisely self-attention, since each token attends to the other tokens in its own sequence, takes the embeddings of the tokens and produces query, key, and value vectors from them. These projections are typically initialized as randomly generated weights and biases applied to the embeddings, and they are trained along with the rest of the model.
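
As a quick aside, here is a minimal sketch of what those projections typically look like in code. The layer names (q_proj, k_proj, v_proj) and the dimensions below are illustrative assumptions, not values taken from any particular model.

import torch
import torch.nn as nn

# Minimal sketch: query, key, and value as learned linear projections
# of the token embeddings (names and sizes here are illustrative only)
embed_dim = 20
q_proj = nn.Linear(embed_dim, embed_dim)  # projection that produces queries
k_proj = nn.Linear(embed_dim, embed_dim)  # projection that produces keys
v_proj = nn.Linear(embed_dim, embed_dim)  # projection that produces values

# Stand-in embeddings: a batch of 1 sequence with 10 tokens
embeddings = torch.rand(1, 10, embed_dim)

query = q_proj(embeddings)
key = k_proj(embeddings)
value = v_proj(embeddings)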

For a live example, we’ll create queries, keys, and values from completely randomly generated values to simulate the process.

import torch

# Seed to replicate results across runs
torch.manual_seed(1337)

# (batch_size, seq_length, dim)
query = torch.rand(1, 10, 20)
key = torch.rand(1, 10, 20)
value = torch.rand(1, 10, 20)

With our starting values, we’ll now step through the calculations. Some matrix math is involved, so take your time with it if this is new to you.

  1. First, an attention ‘score’ is calculated by taking the dot product of each query vector against the key vectors of every other token in the sequence we’re examining. To help with training, these scores are scaled down by the square root of the key dimension, which keeps the gradients more stable. In the code below, we use the torch.matmul function to multiply the query matrix by the transposed (flipped across its diagonal) key matrix.
  2. Next, a softmax is applied, which turns our attention scores into percentages and ranks the results of the score calculation. For each query, the percentages across all key positions total 100%. The F.softmax function applies this over the last dimension of the scores matrix (in our instance the columns), so that each row sums to 100%.
  3. Then, we take these percentages and multiply them by the value matrix. This is the power of the attention mechanism: a higher attention weight tells the model that this particular pair of tokens should have a greater influence on the resulting prediction. Throughout training, the model optimizes the embeddings and the projections that produce the queries, keys, and values, in addition to the weights of the rest of the network, so that its sequential token predictions better fit the training data we’ve put together.
import torch
import torch.nn.functional as F

def attention_mech(query, key, value):
    """
    Simulated calculation of the attention mechanism.
    """
    # Step 1: compute attention scores with matrix multiplication,
    # scaled by the square root of the key dimension
    scores = torch.matmul(query, key.transpose(-2, -1)) / key.size(-1)**0.5

    # Step 2: turn raw attention scores into percentages using softmax
    weights = F.softmax(scores, dim=-1)

    # Step 3: multiply the percentage scores by the values assigned to each token
    output = torch.matmul(weights, value)

    return output, weights

output, attention_weights = attention_mech(query, key, value)
print("Attention Output:", output)
print("Attention Weights:", attention_weights)

This is a simplified example. Production LLMs will have much denser, more complex layers and configurations to achieve their results.

If we take a look at the BERT family of models, for example, we can call a few lines of code from the Hugging Face platform and see what specifications the researchers used.

from transformers import BertModel, BertConfig, BertTokenizer

# Instantiate a BERT model.
config = BertConfig()
model = BertModel(config)

# Print out the configuration used with the BERT model
print("BERT Configuration:")
print(config)

BERT Configuration from the ‘transformers’ library of Hugging Face.
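
If you only want the head and layer counts rather than the full printout, the configuration object exposes them as attributes; here is a small optional check:

# Pull a few fields directly from the configuration object
print("Attention heads per layer:", config.num_attention_heads)
print("Hidden layers:", config.num_hidden_layers)
print("Hidden size:", config.hidden_size)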

A couple of notable things stand out. In the BERT configuration, we see that the model uses 12 separate attention heads. A head is an independent attention computation: for this model, 12 parallel attention processes run at the same time during training.

Each of these attention heads starts from a different set of initial values. As the model trains, this allows each head to capture different elements of nuance from the input text.

One attention head might home in on the positioning of certain words within a sentence, while another might capture the persistent use of certain tokens together across the corpus.

So instead of relying on a single attention pattern to capture the semantics of a token, a multi-headed self-attention process allows a model to store a diverse set of contexts about how a token is used in the training data.
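
To make that concrete, here is a minimal multi-head sketch that splits the embedding dimension across heads and reuses the attention_mech function from earlier. In a real model, each head has its own learned projections; feeding the same random tensor in as queries, keys, and values here is purely an illustrative shortcut, not how BERT is actually implemented.

import torch

# Split a (batch, seq, dim) tensor into per-head slices of shape
# (batch, heads, seq, head_dim), run attention per head, then recombine
num_heads = 4
batch_size, seq_length, dim = 1, 10, 20
head_dim = dim // num_heads  # 5 dimensions per head

x = torch.rand(batch_size, seq_length, dim)

def split_heads(t):
    return t.view(batch_size, seq_length, num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(x), split_heads(x), split_heads(x)

# attention_mech broadcasts over the extra head dimension
head_output, head_weights = attention_mech(q, k, v)
print(head_weights.shape)  # (1, 4, 10, 10): one attention map per head

# Concatenate the heads back into a single (batch, seq, dim) tensor
combined = head_output.transpose(1, 2).reshape(batch_size, seq_length, dim)
print(combined.shape)  # (1, 10, 20)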

Let’s take our instantiated model forward and try a simple sample text.

import torch
from transformers import BertModel, BertConfig, BertTokenizer

# Sample text for our example.
text = "Transformers, robots in disguise."

# Instantiate the tokenizer for the model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the sample text and convert to a PyTorch tensor.
inputs = tokenizer(text, return_tensors="pt")

# Forward pass, output hidden states and attention
with torch.no_grad():  # Using no gradients for our example case
    outputs = model(**inputs, output_attentions=True)
    attention = outputs.attentions

# Print the PyTorch tensors containing attention scores
print("Attention scores from the first layer:")
print(attention[0])

From here we see that the first layer alone contains 12 distinct attention maps, one per head, and the model returns one such tensor for each of its layers, providing a potentially rich understanding of how even our very small sample can be interpreted.
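
A quick sanity check on the shapes, assuming the snippet above has already run:

# attention is a tuple with one tensor per layer; each tensor has shape
# (batch_size, num_heads, seq_length, seq_length)
print("Number of layers:", len(attention))
print("First layer attention shape:", attention[0].shape)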

And that’s it.

The notebook with all code examples can be found here: QuickStart_pt3_Attention_Transformers.ipynb

Now that we’ve built a solid foundational understanding of language modeling, we’ll move on to applications. Next, we’ll work through the process of prototyping solutions to real-world use cases.

Thanks for reading!

References & Resources



Written by Peter Chung

Founder and Principal at Aberrest.com. Applied AI research and development. Building intelligent automation, reasoning systems, and agents.