LLM from scratch ("de zero") is yet another build-your-own-LLM tutorial.
How does it differ from the others? Everything is done from "scratch", with the help of only a few libraries...
The easiest and most painful way to learn LLMs... but arguably the best way.
As this tutorial deals with text, I went with Python, the easiest language for handling text (or is it Perl?).
Amazing other tutorials:
- Building LLMs from scratch: Series introduction - the one I am following and transcribing here - arguably the best tutorial out there!
- Andrej Karpathy
- Peter Bloem's Transformers from scratch
- Build a Large Language Model (From Scratch)
- LLMs are Neural Networks designed to "understand", generate and "respond" to input text - they are "large" because the number of parameters is high: 117 million for GPT-1, 1.5 billion for GPT-2 and 175 billion for GPT-3, or check this for Gemini
- Neural Networks train on massive amounts of data - and they are really good at capturing statistical relationships
- LLMs have become so good thanks to the Transformer Architecture. See also this wiki entry
- Pre-training trains on a large data set (internet data....) (Wikipedia has 3 billion tokens, Common Crawl filtered has 410 billion tokens...)
- Finetuning is about refining the model by training on a narrower dataset - usually human-defined and domain-specific (via labelled data)!
- It is a Deep Neural Network Architecture introduced by the paper "Attention Is All You Need", originally developed for translation tasks
- Steps in a transformer are:
- TEXT: Input text
- PRE-PROCESS: Pre-processing the text for the encoder (convert to numerical representation)
- ENCODER: Produce text encodings used by the decoder
- EMBEDDINGS: The encoder returns embedding vectors as inputs to the decoder - vectors allow conceptual groupings of words - semantic meaning is captured
- ONE-WORD: The model completes one word at a time
- PRE-PROCESS: Input text prepared for decoder
- DECODER: Generates text one word at a time
- OUTPUT: Final text
In the paper above, it is defined as "a mechanism to draw global dependencies between input and output", or a way to weigh the importance of different words / tokens relative to each other. It allows the model to capture long-range dependencies.
- Split text into words and sub-words tokens
- Convert tokens into token ids
- Encode ids into vectors
We will use Tiny Shakespeare.
See SimpleTokenizer:
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {v: k for k, v in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s([,.:;?_!"()\']|--)', r'\1', text)
        return text
Example:
Text to encode: YOU ARE ALL RESOLVED RATHER TO DIE THAN TO FAMISH?
Encoded text: [11984, 516, 303, 8560, 8290, 10616, 2889, 10417, 10616, 3771, 9]
Decoded text: YOU ARE ALL RESOLVED RATHER TO DIE THAN TO FAMISH?
If we try to encode an unknown word, it will throw an error, unless we add "special tokens" to take care of missing words, but also "end of text" tokens.
In the Scratch Book, reading the vocab, including the special tokens, is done as follows:
import re

with open("./data/TinyShakespeare.txt", "r") as f:
    raw_text = f.read()
raw_text = raw_text.upper()

preprocessed_text = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed_text = [item.strip() for item in preprocessed_text if item.strip()]

all_words = sorted(set(preprocessed_text))  # 12002 tokens
all_words.extend(["<|EOT|>", "<|MIS|>"])  # Adding End-Of-Text and MISsing token markers, 12004 tokens
vocab_size = len(all_words)
vocab = {token: integer for integer, token in enumerate(all_words)}
GPT uses Byte Pair Encoding...
Is BPE better than basic encoding? Tokenizers exist in 3 different flavors, word based, sub-word based and character based:
- word based: "I love golf" -> "I", "love", "golf"
- sub-word based: "I love golf" -> "I", "lov", "e", "gol", "f"
- character based: "I love golf" -> "I", " ", "l", "o", "v", "e", " ", "g", "o", "l", "f"
In a nutshell:
- word based tokenizers have issues with missing vocabulary, and closely related words get unrelated ids, such as "club" and "clubs"
- character based tokenizers have a very small vocabulary size - usually equivalent to the number of characters in a given language. Their biggest issue is that meaning is lost, because words are broken down into characters.
- sub-word is supposed to be the best of both worlds - BPE is one sub-word algorithm
Sub-word tokenization follows the following rules:
- Do not split frequently used words into smaller sub-words
- Split the rare words into smaller, meaningful sub-words
"clubs" becomes "club" and "s"
BPE ensures that most common words are represented as a single token.
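To get an intuition for how BPE builds its vocabulary, here is a minimal toy sketch of the merge loop (made-up mini corpus and frequencies, not the actual GPT-2 tokenizer, and simplified so it ignores symbol-boundary subtleties): count the frequency of adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat.

from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus vocabulary
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge the chosen pair into a single symbol everywhere it appears
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words represented as space-separated symbols, with toy corpus frequencies
vocab = {"c l u b": 5, "c l u b s": 2, "g o l f": 3}
for _ in range(3):
    pair_counts = get_pair_counts(vocab)
    best = max(pair_counts, key=pair_counts.get)
    vocab = merge_pair(best, vocab)
    print("merged", best, "->", list(vocab.keys()))

After three merges, "club" is a single token while "clubs" is represented as "club" + "s", exactly the behaviour described above.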
Although the algorithm is fairly straightforward, we will use a library called tiktoken for testing BPE.
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

bpe_tokenizer = tiktoken.get_encoding("gpt2")
text_2_encode = "YOU ARE ALL RESOLVED RATHER TO DIE THAN TO FAMISH?"  # same example as above
bpe_encoded_text = bpe_tokenizer.encode(text_2_encode)
print("BPE encoded text:", bpe_encoded_text)
print("BPE decoded text:", bpe_tokenizer.decode(bpe_encoded_text))
Before we create vector embeddings, we need to create the input-target pairs.
Let's take the phrase: "I love golf and paddling". The input-target sequence is as follows:
- Input = "I", target = "love"
- Input = "I love", target = "golf"
- Input = "I love golf", target = "and"
- Input = "I love golf and", target = "paddling"
"Context length" is how many maximum input words you decide to work with.
context_length = 4
raw_text_encoded = bpe_tokenizer.encode(raw_text)  # BPE-encode the raw text loaded earlier

x = raw_text_encoded[:context_length]
y = raw_text_encoded[1:context_length + 1]
print("Input (x):", x)
print("Target (y):", y)

for i in range(1, context_length + 1):
    context = raw_text_encoded[:i]
    target = raw_text_encoded[i]
    print(context, "-->", target)
    print(bpe_tokenizer.decode(context), "-->", bpe_tokenizer.decode([target]))
We will use PyTorch to efficiently create input-target pairs, using tensors. A tensor is nothing more than an n-dimensional array or vector. If the context length is 4, the x tensor will have shape n x 4 (n rows of 4 tokens each), and so will y:
Let's take the phrase: "I love golf and paddling":
x = tensor([["I", "love", "golf", "and"],
["paddling", ...]])
y = tensor([["love", "golf", "and", "paddling"],
...])
x's "I" target is y's "love" - same index. x's "I", "love" will have "golf" has target in y.
Each row represents one Input Context.
That's the idea.
This is the data set using the above concept:
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, txt, tokenizer, context_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        for i in range(0, len(token_ids) - context_length, stride):
            input_chunk = token_ids[i : i + context_length]
            target_chunk = token_ids[i + 1 : i + context_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
We then need to create a data loader.
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, txt, tokenizer, context_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        for i in range(0, len(token_ids) - context_length, stride):
            input_chunk = token_ids[i : i + context_length]
            target_chunk = token_ids[i + 1 : i + context_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
# Create a DataLoader for the TextDataset
# Parameters:
# - raw_txt: the raw text to be tokenized
# - batch_size: size of each batch
# - context_length: maximum length of the input sequence
# - stride: step size to move the input data by
# - shuffle: whether to shuffle the dataset
# - drop_last: whether to drop the last incomplete batch
# - num_workers: number of subprocesses to use for data loading
def create_dataloader(raw_txt, batch_size=4, context_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    from tiktoken import get_encoding
    tokenizer = get_encoding("gpt2")
    dataset = TextDataset(raw_txt, tokenizer, context_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader
Test:
from data_loader import create_dataloader
data_loader = create_dataloader(raw_text, batch_size=1, context_length=4, stride=128, shuffle=True, drop_last=True, num_workers=0)
data_iter = iter(data_loader)
first_batch = next(data_iter)
print("First batch input IDs:", first_batch)
Output:
- The first tensor is the input, the second tensor is the target
First batch input IDs: [tensor([[ 6, 50, 33478, 7283]]), tensor([[ 50, 33478, 7283, 6226]])]
To remember:
- "stride" moves the input field by x, if stride = 1, the next token is moved by 1, etc. Might prevent over-fitting if > 1 by reducing overlaps
We went from raw text, to tokenized data, to token ids, and the creation of tensors (vectors) with inputs and targets. Embeddings are what are fed into the neural network.
What are embeddings? Computers do not understand words, they understand numbers. Assigning random numbers (token ids) to words does not work, because it does not capture how words are related to each other.
Similar words need to have similar "vectors" that live in the same "space" (i.e. point in roughly the same direction). Vectors are multi-dimensional: the more features, the higher the vector's dimension.
How to create such vectors? Via Neural Networks...
Companies like Google have open-sourced pre-trained vectors. For example, Google's word2vec vectors are available for anyone to use.
At this stage, you realize that "build an LLM from scratch" is not really from scratch due to the pre-processing required...
word2vec-google-news-300 contains the number 300 which is the dimension of the vectors....
import gensim.downloader as api

model = api.load("word2vec-google-news-300")
word_vectors = model
print(f"Vector for 'computer': {word_vectors['computer']}")
print(f"Similarity between 'man' and 'woman': {word_vectors.similarity('man', 'woman')}")
Outputs:
Vector for 'computer': [ 1.07421875e-01 -2.01171875e-01 1.23046875e-01 2.11914062e-01
-9.13085938e-02 2.16796875e-01 -1.31835938e-01 8.30078125e-02 ..... 300 times
Similarity between 'man' and 'woman': 0.7664012312889099 .... the higher, the more similar
We need to construct an embedding weight matrix of size (vocabulary size by vector dimension).
So if the vocabulary size is m and the vector dimension is n, the matrix has m * n elements.
For example, GPT-2 used vectors of dimension 768 with a vocabulary size of 50,257 (token ids); there are therefore 50,257 * 768 = 38,597,376 elements in the matrix.
How is this matrix created? First by initialising the matrix elements with random values, then by training the model so the embedding weights are learned.
Using backpropagation, the weights of the matrix are updated.
Here is an example of creating such a matrix, initialised with random weights:
# Token Embeddings
print("\nCreating Token Embeddings")
import torch

# "I love golf a lot" -- raw text, vocab size is 5
# token ids are in the range 0..4 for a vocab of size 5
input_ids = torch.tensor([2, 3, 4])  # "I love golf"
vocab_size = 5
feature_size = 3  # for example - GPT-2 has 768 features

torch.manual_seed(42)  # For reproducibility
embedding_layer = torch.nn.Embedding(vocab_size, feature_size)
print(f"Embedding Layer Weights: {embedding_layer.weight}")
Outputs:
Creating Token Embeddings
Embedding Layer Weights: Parameter containing:
tensor([[ 0.3367, 0.1288, 0.2345],
[ 0.2303, -1.1229, -0.1863],
[ 2.2082, -0.6380, 0.4617],
[ 0.2674, 0.5349, 0.8094],
[ 1.1103, -1.6898, -0.9890]], requires_grad=True)
Notice its size is 5 x 3.
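Looking up the input_ids [2, 3, 4] simply selects rows 2, 3 and 4 of that weight matrix (continuing the snippet above):

token_embeddings = embedding_layer(input_ids)
print(token_embeddings)        # rows 2, 3 and 4 of the weight matrix
print(token_embeddings.shape)  # torch.Size([3, 3])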
Previously, we discussed the concept of embeddings, a way to capture similarities between words in a high-dimensional space.
Consider the following two phrases:
- I played golf on this fantastic course
- On this fantastic course I played golf
"golf" appears twice (same token id), but at different locations. How to capture the location? If we do not capture it, the resulting embedded vector representation will be identical.
There are two types of positional embeddings: absolute and relative.
Basically, a number (position) is added to the original embedding vector.
Absolute positioning is used when a fixed number of tokens is important such as sequence generation.
Relative positioning is better when dealing with long sequences where the same phrase appears in different parts of the sequence.
In the paper "Attention Is All You Need, section 3.5 gives some formulas about how to compute positional embeddings.
Input embeddings are the sum of token embeddings and positional embeddings. In code:
# Positional Embeddings
print("\nCreating Positional Embeddings")
vocab_size = 50257
feature_size = 256
max_length = 4
embedding_layer = torch.nn.Embedding(vocab_size, feature_size)

from data_loader import create_dataloader

with open("./data/TinyShakespeare.txt", "r") as f:
    raw_text = f.read()
raw_text = raw_text.upper()

data_loader = create_dataloader(raw_text, batch_size=8, context_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(data_loader)
inputs, targets = next(data_iter)
print("Token IDs:", inputs)
print("Inputs shape:", inputs.shape)

token_embeddings = embedding_layer(inputs)
print("Token Embeddings shape:", token_embeddings.shape)

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, feature_size)
pos_embedding = pos_embedding_layer(torch.arange(max_length))
print("Positional Embeddings shape:", pos_embedding.shape)

input_embeddings = token_embeddings + pos_embedding
print("Input Embeddings shape:", input_embeddings.shape)
At this stage we went through data processing from tokenization (BPE), token embeddings, positional embeddings to input embeddings.
Input embeddings are then fed to the neural network.
But before that, we need to look at the "Attention mechanism".
Attention is the most important concept to understand why GPT or Gemini function so well.
Let's get a quick understanding of Attention.
Consider the phrase: "The cat sitting on the mat, next to the dog, jumped".
The LLM needs to understand that it is the cat who jumped. The LLM needs to capture long term dependencies in sentences.
There are 4 types of attention mechanisms:
- Simplified Self-Attention
- Self-Attention
- Causal Attention
- Multi-Head Attention
The paper "Attention Is All You Need" (2017) is based on the research paper Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau in 2014.
For a great visual representation, watch Attention in transformers, step-by-step | DL6.
The idea is to convert a word (token) vector embedding into a context vector - an enriched vector, with semantic meaning about that word but also how that word relates to other words in the phrase.
We will use this phrase from Paul Verlaine ("the long sobs of the violins"):
"Les sanglots longs des violons"
Each word (token) starts from its own embedding vector. The idea is to compute a context vector for each word: an attention score is computed between every pair of words as the dot product of their embedding vectors, each row of scores is normalized with a softmax so that it sums to 1 (the attention weights), and the context vector is then the weighted sum of all the input embeddings. For example, for a phrase with n words, each of the n context vectors is a weighted combination of all n embedding vectors, as sketched in the code below.
The main issue with the Simplified Self-Attention approach is that the dot product carries a prior bias: two vectors that are already aligned get a high score, which might not actually reflect the attention the sentence requires.
For a great visual representation, check out this video.
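Here is a minimal sketch of Simplified Self-Attention, using made-up 3-dimensional embeddings for the 5 words (the values are illustrative only):

import torch

# Made-up embeddings for "les", "sanglots", "longs", "des", "violons" (3 features each)
inputs = torch.tensor([
    [0.43, 0.15, 0.89],   # les
    [0.55, 0.87, 0.66],   # sanglots
    [0.57, 0.85, 0.64],   # longs
    [0.22, 0.58, 0.33],   # des
    [0.77, 0.25, 0.10],   # violons
])

attention_scores = inputs @ inputs.T                          # (5, 5) dot products between all word pairs
attention_weights = torch.softmax(attention_scores, dim=-1)   # each row sums to 1
context_vectors = attention_weights @ inputs                  # (5, 3) weighted sums of the inputs
print(context_vectors)

Each row of context_vectors is the enriched representation of one word.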
Self-Attention with Trainable Weights uses Weight Matrices that are updated during model training.
There are 3 weight matrices in this chapter: W_query, W_key and W_value.
The idea is to convert the Input Embeddings into Query, Key and Value vectors.
Let's get back to the original phrase, Inputs = "Les sanglots longs des violons", assuming 3 features per word, each row representing a word: les, sanglots, longs, des, violons.
Assume 3 trained weight matrices, W_query, W_key and W_value, each of size 3x2 (they project the 3 input features down to 2 output features).
Now we need to compute the attention score for each input.
Each Input ("les, sanglots, longs, des, violons") has now a (Q, K, V).
Assume we use the word "sanglots" as an example: we just need to multiply (dot product) the second row from the Queries matrix by the Keys Matrix transposed.
This gives a (1x5) row vector:
Each number in the matrix represents the attention score for "sanglots" related to "les, sanglots, longs, des, violons"
To generalise this to all inputs, just multiply the full Queries matrix by the Keys matrix transposed: Attention Scores = Q · Kᵀ, a (5x5) matrix.
We also need to normalize these Attention Scores by dividing by the square root of 2 (2 is the column dimension of the Key and Query matrices), then applying a softmax.
But why divide by the square root of the dimension (√2 here, √d_k in general)? Because the dot product of Q and K increases the variance of the scores, and that variance grows proportionally to the dimension; dividing by the square root brings the variance back to roughly 1, which keeps the softmax from saturating and stabilises training.
Finally, we need to compute the Context Vectors.
For this, just multiply the Attention Weights by the Values Matrix:
In python:
# Self-Attention V1 using Q, K, V
import torch
import torch.nn as nn

class SelfAttentionV1(nn.Module):
    def __init__(self, input_embedding_dimension, output_matrices_dimension):
        super().__init__()
        self.W_query = nn.Parameter(torch.randn(input_embedding_dimension, output_matrices_dimension))
        self.W_key = nn.Parameter(torch.randn(input_embedding_dimension, output_matrices_dimension))
        self.W_value = nn.Parameter(torch.randn(input_embedding_dimension, output_matrices_dimension))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attention_scores = queries @ keys.T
        attention_weights = torch.softmax(attention_scores / keys.shape[-1]**0.5, dim=-1)
        context_vectors = attention_weights @ values
        return context_vectors
- Think of Query as the Current Token
- Think of Key as the Inputs
- Think of Value as the Content representing the Inputs
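A quick usage sketch, reusing the toy 5 x 3 inputs tensor from the simplified example above (the weights are random, so the output values are only illustrative):

torch.manual_seed(123)
self_attention_v1 = SelfAttentionV1(input_embedding_dimension=3, output_matrices_dimension=2)
context_vectors = self_attention_v1(inputs)
print(context_vectors)  # shape (5, 2): one context vector per word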
Causal Self-Attention is also known as Masked Attention. Whereas Self-Attention considers all the tokens in the input sequence, Causal Self-Attention restricts the model to only consider previous and current inputs when processing any given token: this is done by masking out future tokens in the sequence.
The attention weight matrix becomes a triangular matrix.
In Self-Attention, the attention weight matrix is fully populated; in Causal Self-Attention everything above the diagonal is masked out, leaving a lower-triangular matrix - for example a 5x5 lower-triangular matrix if the context size is 5.
In code:
class CausalSelfAttention(nn.Module):
    def __init__(self, input_embedding_dimension, output_matrices_dimension):
        super().__init__()
        self.W_query = nn.Linear(input_embedding_dimension, output_matrices_dimension, bias=False)
        self.W_key = nn.Linear(input_embedding_dimension, output_matrices_dimension, bias=False)
        self.W_value = nn.Linear(input_embedding_dimension, output_matrices_dimension, bias=False)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attention_scores = queries @ keys.T
        ctx_length = attention_scores.shape[0]
        mask = torch.triu(torch.ones(ctx_length, ctx_length), diagonal=1)
        masked = attention_scores.masked_fill(mask.bool(), float('-inf'))
        attention_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
        context_vectors = attention_weights @ values
        return context_vectors
Multi-Head Attention extends Causal Attention to work on multiple heads (multiple attention mechanisms running in parallel), combining their outputs.
Combining here means concatenating the context matrices of the heads, going for example from two 5x2 matrices to one matrix of size 5x4.
In code, implementing Multi-Head Attention involves the creation of multiple instances of the Self-Attention Mechanism, combining their outputs.
class MultiheadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList([
            CausalSelfAttentionWithDropouts(d_in, d_out, context_length, dropout, qkv_bias)
            for _ in range(num_heads)
        ])

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)
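The wrapper above uses a CausalSelfAttentionWithDropouts class that is not shown in these notes; a minimal sketch of what it might look like (an assumption: batched input of shape (b, num_tokens, d_in), a registered mask buffer, and dropout applied to the attention weights):

class CausalSelfAttentionWithDropouts(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  # dropout on the attention weights
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attention_scores = queries @ keys.transpose(1, 2)  # (b, num_tokens, num_tokens)
        attention_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attention_weights = torch.softmax(attention_scores / keys.shape[-1] ** 0.5, dim=-1)
        attention_weights = self.dropout(attention_weights)
        return attention_weights @ values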
GPT-3 uses 96 Attention Heads... this is a lot of matrix multiplications!
In the previous implementation, we perform separate Q, K and V matrix multiplications for every head... for 96 heads like GPT-3, that is not very efficient.
Before: each head had its own W_query, W_key and W_value matrices, and therefore its own projections.
What about creating one larger W_query, W_key and W_value, doing each projection once, and then splitting the resulting Q, K and V tensors into heads? That is what the efficient implementation below does.
Let's start with an example input tensor:
Input_X = torch.tensor(
    [[[1.7623, 1.4337, 1.2000, 1.2000, 1.5703, 1.2],     # Representation vector for word 1
      [1.4337, 1.4337, 0.8493, 0.8493, 1.5010, 1.32],    # Representation vector for word 2
      [1.2000, 0.8493, 1.2436, 1.2436, 1.0863, 1.45]]])  # Representation vector for word 3
- b = 1
- num_tokens = 3
- input_dimension = 6
What are the output dimension d_out and the number of heads num_heads? We decide that the output dimension (context vector) is the same as the input dimension and that the number of heads is 2 (GPT-3 uses 96!).
- d_out = 6
- num_heads = 2
- head_dim = d_out / num_heads = 6 / 2 = 3
In __init__, include num_heads and head_dim:
- head_dim = d_out / num_heads = 6 / 2 = 3
So that (b, num_tokens, d_out) becomes (b, num_tokens, num_heads, head_dim) or (1, 3, 6) becomes (1, 3, 2, 3).
So that (b, num_tokens, num_heads, head_dim) becomes (b, num_heads, num_tokens, head_dim) or (1, 2, 3, 3).
This is done by transposing the matrix.
It is basically a reshape so that each head can process its tokens independently, in parallel.
MultiHeadAttention forward method, step 8: mask the attention scores to implement Causal Attention.
This is about replacing the upper triangle of the matrix with -infinity and dividing by sqrt(head_dim) (3 in our case). Then we apply softmax. Finally, we apply dropout.
We need to (re)transpose again to go from (b, num_heads, num_tokens, head_dim) to (b, num_tokens, num_heads, head_dim).
This is basically done by flattening each token's head outputs into one row, so that the resulting dimension is (b, num_tokens, d_out), or (1, 3, 6) in our case.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec
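A quick sanity check with the Input_X tensor from above (1 batch, 3 tokens, 6 features), d_out=6 and 2 heads:

torch.manual_seed(123)
mha = MultiHeadAttention(d_in=6, d_out=6, context_length=3, dropout=0.0, num_heads=2)
context_vectors = mha(Input_X)
print(context_vectors.shape)  # torch.Size([1, 3, 6])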
We went from raw text, to tokenized text, to embeddings, to multi-head attention, to create context vectors. (Masked) multi-head attention is part of what is called the "Transformer Block". We will use GPT-2's architecture from now on. Read this great Medium article.
See LLM Architecture and Scratch Book
Specifically, we are looking at the various layers in the transformer architecture (from this source):
Starting with the normalisation layer, which improves the stability and the efficiency of the network training by adjusting the outputs to have a mean of zero and a variance of one: compute the mean and the variance of the layer outputs, then normalize as (x - mean) / sqrt(variance + epsilon), as implemented below.
In python:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # to avoid division by 0
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return norm_x * self.scale + self.shift
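A quick check that the normalisation works as described (random example batch):

torch.manual_seed(123)
example_batch = torch.randn(2, 5)  # 2 samples, 5 features
layer_norm = LayerNorm(emb_dim=5)
normalized = layer_norm(example_batch)
print("Mean:", normalized.mean(dim=-1))                     # ~0 for each sample
print("Variance:", normalized.var(dim=-1, unbiased=False))  # ~1 for each sample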
Please read this.
Maths-wise, GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard Gaussian.
An approximation used in GPT-2: GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
In python:
class GELU(nn.Module):
    def forward(self, x):
        return x * 0.5 * (1.0 + torch.tanh(
            (2.0 / torch.pi) ** 0.5 * (x + 0.044715 * x.pow(3))
        ))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"])
        )

    def forward(self, x):
        return self.layers(x)

    def __repr__(self):
        return "FeedForward()"
In the diagram above, it is the "+" sign: the input of a block is added back to its output, bypassing the attention layer or the feed-forward layer. Shortcut connections are also known as "skip" or "residual" connections.
In the backward pass, when the gradients become very small (vanishing gradients), it is very difficult to make learning progress.
A shortcut therefore gives the gradients a direct path around a layer, so learning can still progress.
Also read Visualizing the Loss Landscape of Neural Nets.
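A minimal sketch (not from the original notes) that illustrates the point: a small 5-layer network with and without shortcut connections, comparing the mean absolute gradient of each layer after one backward pass.

class DeepNet(nn.Module):
    def __init__(self, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(3, 3), GELU()) for _ in range(5)]
        )

    def forward(self, x):
        for layer in self.layers:
            out = layer(x)
            x = x + out if self.use_shortcut else out  # add the shortcut when enabled
        return x

def print_gradients(model, x):
    # One forward/backward pass against a dummy all-zero target
    loss = nn.MSELoss()(model(x), torch.zeros(x.shape[0], 3))
    loss.backward()
    for name, param in model.named_parameters():
        if "weight" in name:
            print(f"{name}: mean absolute gradient {param.grad.abs().mean().item():.6f}")

torch.manual_seed(123)
sample = torch.rand(2, 3)
print_gradients(DeepNet(use_shortcut=False), sample)  # without shortcuts
print_gradients(DeepNet(use_shortcut=True), sample)   # with shortcuts

Typically, the version without shortcuts shows much smaller gradients in the earliest layers (vanishing gradients), while the shortcut version keeps them noticeably larger.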
The transformer block is the blue architecture below:
The "N x" stands for times-N. In the GPT architecture, the transformer block is applied 12 times.
Remember the process inside the block is:
- Layer normalisation, then,
- Multi-head attention, then,
- Dropout plus the shortcut connection, then,
- Layer normalisation again, then,
- Feed-forward layers (Linear, GELU activation, Linear), then,
- Dropout plus the shortcut connection.
In code:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x
The LLM Chat code:
class YourChatModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
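A quick sanity check (assuming the GPT_CONFIG_124M dictionary defined in llm_arch and repeated in the training script further down): instantiate the model and count its parameters.

torch.manual_seed(123)
model = YourChatModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

The count comes out higher than 124 million because this architecture does not tie the output head weights to the token embedding weights.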
The last part is to generate text from inputs. For a great summary of what we have been doing so far, watch this
Check this code:
import torch

print("\nLLM Demo")

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx
from scratch_book_1 import bpe_tokenizer
start_context = "All the contagion"
print("start_context:", start_context)
encoded = bpe_tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add a batch dimension
print("encoded_tensor.shape:", encoded_tensor.shape)
from llm_arch import GPT_CONFIG_124M
from llm_arch import YourChatModel
torch.manual_seed(123)
model = YourChatModel(GPT_CONFIG_124M)
model.eval()
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))
to_decode = out.squeeze(0).tolist()
print("to_decode:", to_decode)
decoded_text = bpe_tokenizer.decode(to_decode)
print(decoded_text)
What is generated now is gibberish because the model used has not been trained on data.
With the input context "All the contagion", the result comes out as "All the contagion davidiman Byeswick unlockedometer".
The 10 Most Common Regression and Classification Loss Functions can be seen here, but do they work for LLMs?
We will use Cross-Entropy Loss (cross-entropy between two probability distributions).
This is the code.
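The training script below imports two small helpers, text_to_token_ids and token_ids_to_text, from llm_loss; a minimal sketch of what they might look like (assuming the GPT-2 tiktoken tokenizer):

import torch

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)  # add a batch dimension

def token_ids_to_text(token_ids, tokenizer):
    return tokenizer.decode(token_ids.squeeze(0).tolist())  # drop the batch dimension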
Putting all of this together, check out the training here
import tiktoken
import torch
with open("./data/SuperTinyShakespeare.txt", "r") as f:
text_data = f.read()
# First 100 characters
print(text_data[:99])
# Last 100 characters
print(text_data[-99:])
total_characters = len(text_data)
tokenizer = tiktoken.get_encoding("gpt2")
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters) # 1,115,394
print("Tokens:", total_tokens) # 338,025 Very short for training data, but good for demo
# Train/validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
torch.manual_seed(123)
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 256,   # Shortened context length (orig: 1024)
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}
from data_loader import create_dataloader
train_loader = create_dataloader(
    train_data,
    batch_size=2,
    context_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader(
    val_data,
    batch_size=2,
    context_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
# Sanity check
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1 - train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")
# print("Train loader:")
# for x, y in train_loader:
# print(x.shape, y.shape)
# print("\nValidation loader:")
# for x, y in val_loader:
# print(x.shape, y.shape)
train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()
print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from llm_arch import YourChatModel
model = YourChatModel(GPT_CONFIG_124M)
model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader
print("Training model on device:", device)
# with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
# train_loss = calc_loss_loader(train_loader, model, device)
# val_loss = calc_loss_loader(val_loader, model, device)
# print("Training loss:", train_loss)
# print("Validation loss:", val_loss)
# About 10 minutes on my laptop with 16GB RAM and no GPU
# Training loss: 10.967576834868089
# Validation loss: 10.96940314429147
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    print(" Evaluating model...")
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss
from llm_demo import generate_text_simple
from llm_loss import text_to_token_ids, token_ids_to_text
def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        print(f"Epoch {epoch}")
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()  # numel() returns the total number of elements (or tokens) in the input_batch
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen
# Note: the code below also measures the execution time of the training run
import time
start_time = time.time()
torch.manual_seed(123)
model = YourChatModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
print("Starting training...")
train_losses, val_losses, tokens_seen = train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer, device,
    num_epochs=10,
    eval_freq=5,
    eval_iter=5,
    start_context="All the contagion of",
    tokenizer=tokenizer
)
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
All in one file
If you'd like an in-depth deep-dive into LLMs, please watch all videos by Raj Abhijit Dandekar from Vizuara.
The learning does not stop at model training; more is required to understand fine-tuning, etc.
- The computing power required is huge
- It is all about data, quality data
- But still, conceptually, LLMs are all about predicting the probability of the next "word" based on the previous "words" (referred to as the context window or memory)
- If the number of words in the French language is 100k, and if we assume a context window of 2048 words (for a small model), essentially a triangular co-occurrence table (hence the division by 2), the number of weights is (100k * 100k) * 2048 / 2 = 10,240,000,000,000 = 10,240 billion weights (this can be reduced by eliminating weights close to 0)
- The breakthrough of LLMs was to stop trying to build a database at all, and instead build a function that could generate language from first principles it learned from the data. That is a fundamentally more powerful and efficient paradigm.
- Learn, share, enjoy, have fun.
[TJ]