
Introduction to Large Language Models

Understanding the fundamentals of LLMs and their revolutionary impact on AI

What are Large Language Models?

Large Language Models (LLMs) are neural networks trained on massive text datasets to predict, understand, and generate human language. They represent one of the most significant breakthroughs in artificial intelligence, powering applications from ChatGPT and Claude to code completion tools and research assistants.

Unlike traditional software that follows explicit rules, LLMs learn statistical patterns from data. By training on trillions of words from books, websites, scientific papers, and code repositories, they develop an implicit understanding of language structure, world knowledge, and reasoning patterns.

Key Characteristics

  • Scale: From 7 billion to over 1 trillion parameters (weights) that encode knowledge and language patterns
  • Training Data: Hundreds of billions to trillions of tokens from diverse text sources
  • Architecture: Primarily based on the Transformer architecture (2017) with attention mechanisms
  • Training Cost: Millions to tens of millions of dollars in computational resources
  • Capabilities: Text generation, question answering, translation, summarization, code generation, mathematical reasoning, and more
  • Context Window: Ability to process from 2K to 200K+ tokens (approximately 1.5K to 150K+ words) at once

The Evolution of Language Models

Language modeling has progressed through several distinct eras:

1950s-1990s: Statistical Language Models

N-gram models predicted each word from the previous N-1 words. Simple but limited by memory and context constraints. Used in early speech recognition and machine translation.

2003-2013: Neural Language Models

Introduction of neural networks for language modeling. Bengio et al. (2003) showed neural networks could learn word embeddings. Word2Vec (2013) popularized dense word representations.

2013-2017: Sequence Models

LSTMs and GRUs enabled longer context understanding. Seq2Seq models (2014) revolutionized machine translation. But training was slow and sequential.

2017: The Transformer Revolution

"Attention Is All You Need" paper introduced the Transformer architecture. Parallel processing and attention mechanisms solved the long-range dependency problem.

2018-2019: Pre-training Era

BERT (110M-340M params), GPT-2 (1.5B params) demonstrated that pre-training on massive unlabeled data, then fine-tuning on specific tasks, dramatically improved performance.

2020-2022: Scaling Up

GPT-3 (175B params) showed emergent abilities at scale: few-shot learning, in-context learning. Models became general-purpose tools. This era popularized the idea that scale alone unlocks new capabilities.

2022-Present: Alignment & Specialization

InstructGPT, ChatGPT showed that aligning models with human preferences (RLHF) makes them far more useful. Multimodal models (GPT-4, Gemini) integrate vision. Open-source models (LLaMA, Mistral) democratize access.

Scaling Laws: Bigger is Better

Research by OpenAI, DeepMind, and others revealed predictable scaling laws: model performance improves as a power law with respect to three factors:

# Scaling Law (Kaplan et al., 2020)
# Loss scales as a power law with compute budget

L(C) = (C_c / C)^α_c

Where:
- L(C) is the cross-entropy loss
- C is the compute budget (FLOPs)
- C_c and α_c are constants
- Typical α_c ≈ 0.05-0.10

Key findings:
1. Model size (N)  - More parameters → better performance
2. Dataset size (D) - More data → better performance
3. Compute (C)      - More training → better performance

Optimal allocation (Kaplan et al.):
N_opt ∝ C^0.73,  D_opt ∝ C^0.27
(Scale model size faster than dataset size)

With α_c in the 0.05-0.10 range, this means doubling the compute budget reduces loss by only a few percent (roughly 3-7%), but it does so predictably. This predictability has driven the race to scale up models.
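As a quick sanity check on that figure, here is the arithmetic implied by the power law above (a minimal sketch; the exponent values are just the typical α_c range quoted earlier):

for alpha_c in (0.05, 0.10):
    ratio = 2 ** (-alpha_c)  # L(2C) / L(C) under L(C) ∝ C^(-alpha_c)
    print(f"alpha_c={alpha_c}: doubling compute cuts loss by ~{(1 - ratio) * 100:.1f}%")
# alpha_c=0.05 → ~3.4% per doubling, alpha_c=0.10 → ~6.7% per doubling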

Try It: Understanding Scale

GPT-3 has 175 billion parameters. If each parameter were a grain of rice (roughly 25 mg each), you'd need about 4,400 metric tons of rice - well over a hundred shipping containers' worth.

How Do LLMs Work?

At their core, LLMs are trained with a simple objective: predict the next token given all previous tokens. This is called autoregressive language modeling or causal language modeling.

# Training Objective (simplified)

Given text: "The quick brown fox jumps over the lazy dog"

Training samples:
Input: "The"             → Predict: "quick"
Input: "The quick"       → Predict: "brown"
Input: "The quick brown" → Predict: "fox"
...and so on

The model learns: P(word_t | word_1, word_2, ..., word_{t-1})

After training on trillions of such examples, the model learns
language patterns, facts, reasoning, and more.

This simple task, when performed at massive scale with billions of parameters, leads to emergent abilities - capabilities that appear suddenly at certain scale thresholds.

What LLMs Learn

Through next-token prediction on diverse data, LLMs develop:

Grammar & Syntax

Correct sentence structure, verb conjugation, agreement

World Knowledge

Facts, entities, events, relationships, common sense

Reasoning

Logical inference, mathematical reasoning, analogies

Context Understanding

Discourse, pronouns, implicit meaning, pragmatics

Task Patterns

Translation, summarization, question answering formats

Code & Formulas

Programming syntax, algorithms, mathematical notation

Emergent Abilities at Scale

Larger models develop abilities that smaller ones don't have. These capabilities "emerge" suddenly at certain parameter counts:

Ability | Emerges At | Description
Few-shot Learning | ~13B params | Learn from just a few examples in the prompt
In-context Learning | ~13B params | Adapt behavior based on context without fine-tuning
Chain-of-Thought | ~100B params | Break down complex reasoning into steps
Multi-step Reasoning | ~100B params | Solve problems requiring multiple inference steps
Code Generation | ~10B params | Write functional code from natural language

Key Limitations

Despite their impressive capabilities, LLMs have fundamental limitations:

  • No True Understanding: LLMs are statistical pattern matchers, not conscious entities. They don't "understand" in a human sense.
  • Hallucinations: Can generate plausible-sounding but factually incorrect information with high confidence.
  • Knowledge Cutoff: Only know what was in their training data, with a specific cutoff date.
  • Reasoning Limitations: Struggle with tasks requiring precise logic, counting, or mathematical operations.
  • Context Length: Limited by maximum context window (though improving rapidly).
  • Bias & Toxicity: Can reflect and amplify biases present in training data.
  • No Memory: Each conversation starts fresh (unless explicitly provided context).

The Transformer Architecture

Deep dive into the revolutionary architecture that changed NLP

Understanding Transformers

The Transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., revolutionized natural language processing by replacing recurrence with pure attention mechanisms. It's the foundation of all modern LLMs including GPT, BERT, Claude, and LLaMA.

Key Innovation: Parallelization Through Attention

Previous architectures (RNNs, LSTMs) processed sequences sequentially: word 1 → word 2 → word 3. This was slow and struggled with long-range dependencies.

Transformers process entire sequences in parallel using attention, enabling:

  • 100x faster training on modern GPUs
  • Better long-range dependency modeling
  • Scalability to billions of parameters

Paper: arxiv.org/abs/1706.03762

Three Types of Transformer Architectures

Type | Structure | Use Cases | Examples
Encoder-Only | Bidirectional attention (sees entire input) | Classification, NER, Q&A | BERT, RoBERTa, DeBERTa
Decoder-Only | Causal attention (only sees past) | Text generation, chat, code | GPT-3/4, LLaMA, Claude
Encoder-Decoder | Encoder (bidirectional) + Decoder (causal) | Translation, summarization | T5, BART, mT5

Modern LLMs are almost all decoder-only (GPT-style) because they're more flexible and scale better.

Decoder-Only Transformer: Component Breakdown

Let's examine a GPT-style decoder-only model, the most common architecture for modern LLMs.

1. Token + Positional Embeddings

Token Embedding: Maps each token ID to a dense vector

# Vocabulary size: 50,000 tokens, embedding dim: 512
token_embedding = nn.Embedding(50000, 512)

# Input: "The cat sat" → token IDs: [2, 145, 892]
token_ids = torch.tensor([[2, 145, 892]])
token_embeds = token_embedding(token_ids)  # [1, 3, 512]

Positional Encoding: Adds position information (attention is permutation-invariant)

# Two approaches:

# 1. Learned Positional Embeddings (GPT, BERT)
pos_embedding = nn.Embedding(max_seq_len, d_model)
pos_ids = torch.arange(seq_len)
pos_embeds = pos_embedding(pos_ids)  # [seq_len, d_model]

# 2. Sinusoidal Positional Encoding (Original Transformer)
def sinusoidal_encoding(seq_len, d_model):
    """Fixed positional encoding using sin/cos functions"""
    position = torch.arange(seq_len).unsqueeze(1)  # [seq_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    encoding = torch.zeros(seq_len, d_model)
    encoding[:, 0::2] = torch.sin(position * div_term)  # even indices
    encoding[:, 1::2] = torch.cos(position * div_term)  # odd indices
    return encoding

# Combine token + position embeddings
final_embedding = token_embeds + pos_embeds  # [batch, seq_len, d_model]

Why positional encoding matters: Without it, "cat sat" and "sat cat" would be identical to the model. Position info is crucial for understanding order, syntax, and semantics.

2. Transformer Block (Repeated N times)

Each block contains two sub-layers: multi-head attention + feed-forward network, each with residual connections and layer normalization.

class TransformerBlock(nn.Module):
    """Single Transformer decoder block"""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-layer 1: Multi-head self-attention
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # Expand: 512 → 2048
            nn.GELU(),                  # Activation (ReLU in original)
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)    # Project back: 2048 → 512
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: Self-attention with residual connection
        # Pre-norm (modern): norm before attention
        attn_out, _ = self.attention(
            self.norm1(x), self.norm1(x), self.norm1(x), mask
        )
        x = x + self.dropout1(attn_out)  # Residual connection

        # Sub-layer 2: FFN with residual connection
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout2(ffn_out)   # Residual connection

        return x

Key components:

  • Residual connections (x + f(x)): Enable gradient flow in deep networks, prevent degradation
  • Layer normalization: Stabilize training, allow higher learning rates
  • FFN: Processes each position independently, adds non-linearity and capacity
  • Pre-norm vs Post-norm: Modern models use pre-norm (norm before sublayer) for training stability

3. Output Head (Language Modeling)

Final layer projects hidden states to vocabulary logits for next-token prediction.

# After N transformer blocks, we have: [batch, seq_len, d_model]

# Project to vocabulary size
lm_head = nn.Linear(d_model, vocab_size)   # [512 → 50000]
logits = lm_head(final_hidden_states)      # [batch, seq_len, 50000]

# Compute probability distribution over tokens
probs = F.softmax(logits, dim=-1)          # [batch, seq_len, 50000]

# During training: compute cross-entropy loss with target tokens
# During inference: sample from distribution (greedy, top-k, nucleus)

Weight tying: Often, the output projection weights are tied to input embedding weights (same parameters). This reduces parameters and improves performance.
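A minimal sketch of how weight tying is typically wired up in PyTorch (the shapes follow the GPT-2-small numbers used elsewhere on this page; the variable names are illustrative):

import torch.nn as nn

vocab_size, d_model = 50257, 768
token_embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie weights: the output projection reuses the embedding matrix
lm_head.weight = token_embedding.weight

# One shared [vocab_size, d_model] matrix instead of two separate ones
assert lm_head.weight.data_ptr() == token_embedding.weight.data_ptr()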

Complete Decoder-Only Transformer Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class GPTModel(nn.Module):
    """
    Decoder-only Transformer (GPT-style)

    Args:
        vocab_size: Size of vocabulary
        d_model: Model dimension (embedding size)
        num_layers: Number of transformer blocks
        num_heads: Number of attention heads
        d_ff: Feed-forward hidden dimension (typically 4*d_model)
        max_seq_len: Maximum sequence length
        dropout: Dropout probability
    """
    def __init__(
        self,
        vocab_size=50257,
        d_model=768,
        num_layers=12,
        num_heads=12,
        d_ff=3072,
        max_seq_len=1024,
        dropout=0.1
    ):
        super().__init__()
        # Input embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)

        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Final layer norm
        self.ln_f = nn.LayerNorm(d_model)

        # Output head (language modeling)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying: share weights between embedding and output
        self.lm_head.weight = self.token_embedding.weight

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """Initialize weights with normal distribution"""
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, input_ids, targets=None):
        """
        Args:
            input_ids: [batch, seq_len] token indices
            targets: [batch, seq_len] target tokens (for training)
        Returns:
            logits: [batch, seq_len, vocab_size]
            loss: scalar (if targets provided)
        """
        batch_size, seq_len = input_ids.shape

        # Create position IDs
        pos_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)

        # Embed tokens and positions
        token_embeds = self.token_embedding(input_ids)  # [batch, seq, d_model]
        pos_embeds = self.pos_embedding(pos_ids)        # [1, seq, d_model]
        x = self.dropout(token_embeds + pos_embeds)

        # Create causal mask (lower triangular)
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
        mask = mask.view(1, 1, seq_len, seq_len)  # [1, 1, seq, seq]

        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask)

        # Final layer norm
        x = self.ln_f(x)

        # Project to vocabulary
        logits = self.lm_head(x)  # [batch, seq, vocab_size]

        # Compute loss if targets provided
        loss = None
        if targets is not None:
            # Flatten for cross-entropy
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),  # [batch*seq, vocab]
                targets.view(-1),                  # [batch*seq]
                ignore_index=-1                    # Ignore padding tokens
            )

        return logits, loss


# Usage example
model = GPTModel(
    vocab_size=50257,   # GPT-2 tokenizer
    d_model=768,        # GPT-2 small
    num_layers=12,
    num_heads=12,
    d_ff=3072,
    max_seq_len=1024
)

print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
# Output: ~117M parameters (GPT-2 small)

# Forward pass
input_ids = torch.randint(0, 50257, (4, 128))  # Batch of 4, seq len 128
logits, loss = model(input_ids, targets=input_ids)
print(f"Logits shape: {logits.shape}")  # [4, 128, 50257]

Model Sizes and Configurations

GPT-2 Variants

Model | Layers | d_model | Heads | Parameters
GPT-2 Small | 12 | 768 | 12 | 117M
GPT-2 Medium | 24 | 1024 | 16 | 345M
GPT-2 Large | 36 | 1280 | 20 | 774M
GPT-2 XL | 48 | 1600 | 25 | 1.5B

Modern LLMs (2023-2024)

Model | Layers | d_model | Heads | Parameters | Context
LLaMA-2 7B | 32 | 4096 | 32 | 7B | 4K
LLaMA-2 13B | 40 | 5120 | 40 | 13B | 4K
LLaMA-2 70B | 80 | 8192 | 64 | 70B | 4K
GPT-3 | 96 | 12288 | 96 | 175B | 2K-4K
GPT-4 (rumored) | 120+ | ? | ? | ~1.7T (MoE) | 8K / 32K / 128K

Note: GPT-4's architecture details (parameter count, MoE structure) are unconfirmed rumors. OpenAI has not publicly disclosed these specifications.

Scaling pattern: Larger models generally have more layers, wider hidden dimensions, and more attention heads. The ratio d_model/num_heads is typically kept constant (64-128 per head).
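A quick check of that ratio using the configurations from the tables above (all values are taken directly from those rows):

configs = {
    "GPT-2 Small": (768, 12),
    "GPT-2 XL":    (1600, 25),
    "LLaMA-2 7B":  (4096, 32),
    "LLaMA-2 70B": (8192, 64),
    "GPT-3":       (12288, 96),
}
for name, (d_model, num_heads) in configs.items():
    print(f"{name}: d_model / num_heads = {d_model // num_heads}")
# Per-head dimension stays at 64 or 128 across these models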

Key Architectural Innovations Over Time

Innovation | Description | Used In
Pre-LayerNorm | LayerNorm before sublayers (not after) | GPT-2, modern LLMs
GELU Activation | Smooth activation (instead of ReLU) | BERT, GPT-2+
RoPE (Rotary Position Embedding) | Encodes relative positions via rotation | LLaMA, GPT-NeoX
ALiBi (Attention with Linear Biases) | Adds linear bias to attention scores | BLOOM, MPT
SwiGLU | Gated FFN activation | PaLM, LLaMA
Grouped Query Attention (GQA) | Shares K/V across groups of query heads | LLaMA-2 70B, Mistral
RMSNorm | Simplified LayerNorm (no bias/mean) | LLaMA, T5

Key Papers and Resources

Foundational Papers

Architectural Improvements

Interactive Resources

Recommended Learning Path

  1. Read "The Illustrated Transformer" for visual understanding
  2. Study original "Attention Is All You Need" paper (focus on architecture)
  3. Work through "The Annotated Transformer" tutorial
  4. Implement a tiny Transformer (2-4 layers) from scratch in PyTorch
  5. Train on a small dataset (character-level language modeling)
  6. Study nanoGPT to see production-quality implementation
  7. Experiment with architectural variations (RoPE, SwiGLU, etc.)

The Attention Mechanism

Deep dive into the core innovation that powers Transformers

What is Attention?

Attention is a mechanism that allows models to dynamically focus on relevant parts of the input when processing each position. Unlike traditional models that process sequences word-by-word, attention allows the model to look at the entire sequence at once and decide what's important.

Intuition: When you read "The animal didn't cross the street because it was too tired", you know "it" refers to "animal" not "street". Attention allows models to make these connections by computing relevance scores between all word pairs.

Key Innovation from "Attention Is All You Need" (Vaswani et al., 2017)

The original Transformer paper introduced Scaled Dot-Product Attention, which replaced recurrence entirely with a parallelizable attention mechanism. This breakthrough enabled training on much longer sequences and led to the modern LLM revolution.

Paper: arxiv.org/abs/1706.03762

Scaled Dot-Product Attention: The Mathematics

The Complete Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
- Q (Query): [seq_len × d_k]   "What am I looking for?"
- K (Key):   [seq_len × d_k]   "What do I contain?"
- V (Value): [seq_len × d_v]   "What information do I have?"
- d_k: dimension of queries/keys (typically 64)
- √d_k: scaling factor (prevents vanishing gradients)

Step-by-Step Breakdown with Concrete Example

Let's walk through attention for the sentence: "The cat sat" (3 tokens)

Step 1: Create Q, K, V from Input Embeddings

# Input: word embeddings (assume d_model = 512)
X = [x_The, x_cat, x_sat]  # shape: [3, 512]

# Linear projections (learned weight matrices)
Q = X @ W_Q  # [3, 512] @ [512, 64] = [3, 64]
K = X @ W_K  # [3, 512] @ [512, 64] = [3, 64]
V = X @ W_V  # [3, 512] @ [512, 64] = [3, 64]

# Now each token has 64-dim query, key, and value vectors
# Example values (simplified):
Q[0] = [0.2, 0.5, ..., 0.1]  # Query for "The"
K[1] = [0.8, 0.3, ..., 0.7]  # Key for "cat"
V[2] = [0.1, 0.9, ..., 0.4]  # Value for "sat"

Step 2: Compute Attention Scores (QK^T)

# Dot product between queries and keys
# Each query attends to all keys
scores = Q @ K.T  # [3, 64] @ [64, 3] = [3, 3]

# Resulting score matrix (before scaling):
#           The_k   cat_k   sat_k
# The_q  [  45.2    12.3     8.7 ]
# cat_q  [  10.1    52.8    31.4 ]
# sat_q  [   5.2    28.9    49.1 ]

# Interpretation:
# - "cat" (row 2) attends strongly to itself (52.8) and "sat" (31.4)
# - "The" (row 1) attends mostly to itself (45.2)

Step 3: Scale by √d_k

import math

d_k = 64
scaled_scores = scores / math.sqrt(d_k)  # Division by √64 = 8

# Scaled scores (prevents saturation in softmax):
#           The_k   cat_k   sat_k
# The_q  [  5.65    1.54    1.09 ]
# cat_q  [  1.26    6.60    3.93 ]
# sat_q  [  0.65    3.61    6.14 ]

Why scale? Without scaling, large dot products → large softmax inputs → near-zero gradients. Scaling by √d_k keeps values in a reasonable range.

Step 4: Apply Softmax (Get Attention Weights)

import torch.nn.functional as F

attention_weights = F.softmax(scaled_scores, dim=-1)
# Softmax over each row (each query attends to all keys)

# Attention weights (probabilities sum to 1 per row):
#           The_k   cat_k   sat_k
# The_q  [  0.823   0.131   0.046 ]  ← "The" mostly attends to itself
# cat_q  [  0.041   0.752   0.207 ]  ← "cat" attends to itself and "sat"
# sat_q  [  0.018   0.337   0.645 ]  ← "sat" attends to itself and "cat"

# These weights are interpretable! Higher = more relevant

Step 5: Weighted Sum of Values (Get Output)

output = attention_weights @ V  # [3, 3] @ [3, 64] = [3, 64]

# For "cat" (row 2):
output[1] = 0.041 * V[0] + 0.752 * V[1] + 0.207 * V[2]
#           ^small weight   ^large weight   ^medium weight
#           from "The"      from "cat"      from "sat"

# The output is a context-aware representation that mixes
# information from all tokens, weighted by relevance

Key Insight: Attention is a differentiable lookup mechanism. Instead of hard-indexing (like a dictionary), we do a soft weighted average where the weights are learned based on content similarity (Q·K).

Complete PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class ScaledDotProductAttention(nn.Module):
    """
    Scaled Dot-Product Attention from 'Attention Is All You Need'

    Args:
        d_k: dimension of keys/queries (typically 64)
        dropout: dropout probability (typically 0.1)
    """
    def __init__(self, d_k, dropout=0.1):
        super().__init__()
        self.d_k = d_k
        self.dropout = nn.Dropout(dropout)

    def forward(self, Q, K, V, mask=None):
        """
        Args:
            Q: Queries [batch, seq_len, d_k]
            K: Keys    [batch, seq_len, d_k]
            V: Values  [batch, seq_len, d_v]
            mask: Optional mask [batch, seq_len, seq_len]
        Returns:
            output: [batch, seq_len, d_v]
            attention_weights: [batch, seq_len, seq_len]
        """
        # Step 1: Compute Q·K^T
        scores = torch.matmul(Q, K.transpose(-2, -1))  # [batch, seq, seq]

        # Step 2: Scale by √d_k
        scores = scores / math.sqrt(self.d_k)

        # Step 3: Apply mask (for causal/padding masks)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Step 4: Softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)  # [batch, seq, seq]
        attention_weights = self.dropout(attention_weights)

        # Step 5: Weighted sum of values
        output = torch.matmul(attention_weights, V)  # [batch, seq, d_v]

        return output, attention_weights


# Usage example
batch_size, seq_len, d_model = 32, 50, 512
d_k = 64

# Create random Q, K, V (in practice, these come from linear projections)
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)

attention = ScaledDotProductAttention(d_k)
output, weights = attention(Q, K, V)

print(f"Output shape: {output.shape}")              # [32, 50, 64]
print(f"Attention weights shape: {weights.shape}")  # [32, 50, 50]

Multi-Head Attention: Learning Multiple Relationships

Single attention head can only learn one type of relationship. Multi-Head Attention runs multiple attention mechanisms in parallel, each learning different patterns.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Parameters:
- h: number of heads (typically 8, 12, or 16)
- d_model: model dimension (512 for base, 1024 for large)
- d_k = d_v = d_model / h (e.g., 512/8 = 64 per head)

What Different Heads Learn

Head | Pattern Learned | Example
Head 1 | Syntactic relationships | Verbs attend to subjects
Head 2 | Positional (next word) | Each word attends to the next word
Head 3 | Rare words | Technical terms attend to definitions
Head 4 | Delimiter tokens | Commas, periods, quotes
Heads 5-8 | Semantic relationships | Pronouns to antecedents, entities to attributes

Multi-Head Attention Implementation

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention from 'Attention Is All You Need'

    Splits d_model into h heads, each of dimension d_k = d_model/h
    """
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 if d_model=512, h=8

        # Linear projections for Q, K, V (for all heads at once)
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)

        # Output projection
        self.W_O = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x):
        """Split last dimension into (num_heads, d_k)"""
        batch_size, seq_len, d_model = x.size()
        # Reshape and transpose to [batch, num_heads, seq_len, d_k]
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # 1. Linear projections
        Q = self.W_Q(Q)  # [batch, seq_len, d_model]
        K = self.W_K(K)
        V = self.W_V(V)

        # 2. Split into multiple heads
        Q = self.split_heads(Q)  # [batch, num_heads, seq_len, d_k]
        K = self.split_heads(K)
        V = self.split_heads(V)

        # 3. Apply scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        context = torch.matmul(attention_weights, V)  # [batch, h, seq, d_k]

        # 4. Concatenate heads
        context = context.transpose(1, 2).contiguous()        # [batch, seq, h, d_k]
        context = context.view(batch_size, -1, self.d_model)  # [batch, seq, d_model]

        # 5. Final linear projection
        output = self.W_O(context)

        return output, attention_weights


# Usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 50, 512)  # [batch, seq_len, d_model]

output, attn_weights = mha(x, x, x)  # Self-attention: Q=K=V=x
print(f"Output: {output.shape}")           # [32, 50, 512]
print(f"Attention: {attn_weights.shape}")  # [32, 8, 50, 50]

Causal (Masked) Attention for Autoregressive Models

In decoder-only models (GPT, LLaMA), we need causal masking to prevent positions from attending to future positions during training.

# Create causal mask (lower triangular matrix)
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len))

# Visualization:
#      0  1  2  3  4   (positions)
# 0  [ 1  0  0  0  0 ]  ← position 0 can only see itself
# 1  [ 1  1  0  0  0 ]  ← position 1 sees 0,1
# 2  [ 1  1  1  0  0 ]  ← position 2 sees 0,1,2
# 3  [ 1  1  1  1  0 ]  ← position 3 sees 0,1,2,3
# 4  [ 1  1  1  1  1 ]  ← position 4 sees all

# Apply during attention computation
scores = scores.masked_fill(mask == 0, -1e9)
# -1e9 becomes ~0 after softmax, effectively blocking attention

Why Causal Masking?

  • Training: When predicting token at position i, model can only use tokens 0 to i-1
  • Inference: Ensures model behavior during training matches generation (no future information)
  • Efficiency: Allows parallel training (all positions trained simultaneously, but with appropriate masking)

Complexity Analysis

Time and Space Complexity

Operation | Time Complexity | Space Complexity
Linear projections (Q, K, V) | O(n·d²) | O(d²)
Q·K^T (attention scores) | O(n²·d) | O(n²)
Softmax | O(n²) | O(n²)
Attn·V (weighted sum) | O(n²·d) | O(n²)
Total | O(n²·d + n·d²) | O(n² + d²)

Where: n = sequence length, d = model dimension

The O(n²) Bottleneck

The quadratic complexity in sequence length is the main limitation of standard attention:

  • 1K tokens: 1M attention scores to compute
  • 10K tokens: 100M attention scores (100x more!)
  • 100K tokens: 10B attention scores → OOM on most GPUs

This is why context length is expensive to scale!
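To make those numbers concrete, here is a rough back-of-the-envelope estimate of the memory needed just for one attention-score matrix in fp32 (per head and per layer; real models multiply this by the number of heads and layers):

def attn_score_memory_gb(seq_len, bytes_per_element=4):
    # One [seq_len, seq_len] score matrix in fp32
    return seq_len ** 2 * bytes_per_element / 1e9

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {attn_score_memory_gb(n):,.3f} GB per head per layer")
# 1K tokens ≈ 0.004 GB, 10K ≈ 0.4 GB, 100K ≈ 40 GB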

Efficient Attention Variants

Method | Complexity | Used In
Standard Attention | O(n²) | GPT-3, GPT-4, Claude
Flash Attention | O(n²) but memory-efficient | GPT-4, LLaMA-2, modern LLMs
Sparse Attention | O(n√n) or O(n log n) | Longformer, BigBird
Linear Attention | O(n) | Linformer, Performer
State Space Models | O(n) | Mamba, RWKV

Types of Attention Mechanisms

1. Self-Attention (Intra-Attention)

Each position attends to all positions in the same sequence. Used in encoder and decoder.

output = SelfAttention(X, X, X)  # Q=K=V=X

# Example: "The cat sat" → each word attends to all 3 words

2. Cross-Attention (Encoder-Decoder Attention)

Decoder positions attend to encoder outputs. Used in seq2seq models (translation, summarization).

# Encoder output: source language representation
encoder_output = Encoder("Hello world")  # English

# Decoder uses cross-attention to the source
decoder_output = CrossAttention(
    Q=decoder_hidden,   # "Bonjour ___" (French, being generated)
    K=encoder_output,   # Keys from "Hello world"
    V=encoder_output    # Values from "Hello world"
)

# Decoder can attend to relevant English words when generating French

3. Masked Self-Attention (Causal Attention)

Prevents attention to future positions. Used in decoder-only models (GPT).

# Training on "The cat sat on the mat"
# When predicting "sat", the model can only see "The cat"
# When predicting "mat", the model can see "The cat sat on the"

Common Pitfalls and Best Practices

Implementation Details That Matter

Issue | Solution | Why It Matters
Numerical instability in softmax | Use scaling (√d_k) and a stable softmax | Prevents vanishing/exploding gradients
Memory explosion for long sequences | Use Flash Attention or gradient checkpointing | O(n²) memory can OOM quickly
Forgetting positional information | Add positional encodings to the input | Attention is permutation-invariant
Attention collapse (all weights uniform) | Use LayerNorm, dropout, proper initialization | Model fails to learn useful patterns
Information leakage in causal models | Strictly enforce causal masking | Training/inference mismatch

Key Papers and Resources

Foundational Papers

Analysis and Interpretability

Interactive Resources

Video Tutorials

  • Andrej Karpathy - "Let's build GPT: from scratch, in code, spelled out"
  • Stanford CS224N - Lecture on Attention Mechanisms
  • 3Blue1Brown - "But what is a GPT?" (visual intuition)

Recommended Learning Path

  1. Read "The Illustrated Transformer" for visual intuition
  2. Read sections 3.2-3.3 of "Attention Is All You Need" paper
  3. Implement scaled dot-product attention from scratch in NumPy
  4. Implement multi-head attention in PyTorch
  5. Use BertViz to visualize attention patterns in real models
  6. Read "What Does BERT Look At?" to understand what attention learns
  7. Experiment with Flash Attention for efficient implementation

Pre-training: Foundation Learning

Deep dive into how LLMs learn language from massive text corpora

What is Pre-training?

Pre-training is a self-supervised learning process where models learn patterns, structures, and knowledge from vast amounts of raw text without explicit human labels. This phase teaches the model fundamental language understanding before any task-specific fine-tuning.

Key Characteristics

  • Self-supervised: Labels are created from the data itself (next token)
  • Massive scale: Trillions of tokens, petabytes of data
  • Expensive: Millions of dollars, months of GPU time
  • One-time cost: Pre-trained models can be reused and fine-tuned

The Pre-training Objective: Next-Token Prediction

Modern decoder-only LLMs use causal language modeling: predict the next token given all previous tokens.

Mathematical Formulation

Given sequence: x = [x_1, x_2, ..., x_T]

Objective: Maximize the log-likelihood

L(θ) = Σ_{t=1..T} log P(x_t | x_1, ..., x_{t-1}; θ)

Where:
- θ: model parameters
- x_t: token at position t
- P(x_t | context): probability distribution over the vocabulary

In practice, we minimize the negative log-likelihood (cross-entropy loss):

Loss = -(1/T) Σ_{t=1..T} log P(x_t | x_1, ..., x_{t-1}; θ)

Concrete Example Walkthrough

Training on: "The cat sat on the mat"

# Tokenized: [2, 145, 892, 319, 2, 2000]
# Vocabulary size: 50,000 tokens

Training examples created:
1. []                                  → predict "The" (token 2)
2. ["The"]                             → predict "cat" (token 145)
3. ["The", "cat"]                      → predict "sat" (token 892)
4. ["The", "cat", "sat"]               → predict "on" (token 319)
5. ["The", "cat", "sat", "on"]         → predict "the" (token 2)
6. ["The", "cat", "sat", "on", "the"]  → predict "mat" (token 2000)

# For each example:
# 1. Model outputs logits: [batch, seq_len, vocab_size]
# 2. Apply softmax to get probabilities
# 3. Compute cross-entropy with the target token
# 4. Backpropagate to update weights

Complete Training Loop Implementation

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

def pretrain_language_model(
    model,
    train_dataset,
    num_epochs=1,
    batch_size=32,
    learning_rate=3e-4,
    grad_accum_steps=1,
    max_grad_norm=1.0
):
    """
    Pre-training loop for a causal language model

    Args:
        model: GPT-style decoder-only transformer
        train_dataset: Dataset yielding token sequences
        num_epochs: Number of passes through the data
        batch_size: Batch size
        learning_rate: Peak learning rate (with warmup/decay)
        grad_accum_steps: Gradient accumulation for a larger effective batch
        max_grad_norm: Clip gradients to prevent explosion
    """
    # Setup
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        betas=(0.9, 0.95),   # Common for LLMs
        weight_decay=0.1
    )

    # Learning rate schedule
    total_steps = len(train_dataset) // batch_size * num_epochs
    warmup_steps = total_steps // 10  # 10% warmup

    def lr_lambda(step):
        """Warmup + cosine decay, returned as a multiplier of the peak LR"""
        if step < warmup_steps:
            # Linear warmup
            return step / max(1, warmup_steps)
        # Cosine decay
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # Training loop
    model.train()
    global_step = 0

    for epoch in range(num_epochs):
        dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        for batch_idx, batch in enumerate(dataloader):
            # batch: [batch_size, seq_len] token indices
            input_ids = batch.to(device)

            # Create targets (shifted by 1)
            # Input:  [The, cat, sat, on]
            # Target: [cat, sat, on, the]
            targets = input_ids[:, 1:]     # Shift left
            input_ids = input_ids[:, :-1]  # Remove last token

            # Forward pass
            logits, loss = model(input_ids, targets=targets)

            # Scale loss for gradient accumulation
            loss = loss / grad_accum_steps
            loss.backward()

            # Update weights every grad_accum_steps
            if (batch_idx + 1) % grad_accum_steps == 0:
                # Clip gradients
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

                # Update parameters
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

                global_step += 1

                # Logging
                if global_step % 100 == 0:
                    perplexity = torch.exp(loss * grad_accum_steps)
                    lr = scheduler.get_last_lr()[0]
                    print(f"Step {global_step} | "
                          f"Loss: {loss.item():.4f} | "
                          f"Perplexity: {perplexity:.2f} | "
                          f"LR: {lr:.6f}")

    return model


# Usage
# model = GPTModel(vocab_size=50257, d_model=768, num_layers=12)
# trained_model = pretrain_language_model(model, dataset, num_epochs=1)

Scaling Laws: How Performance Scales

Research shows that LLM performance follows predictable power laws with respect to model size, dataset size, and compute.

The Scaling Law Equations (Kaplan et al., 2020)

Loss L scales with:

1. Model Parameters (N):
   L(N) ∝ (N_c / N)^α_N     where N_c ≈ 8.8 × 10^13, α_N ≈ 0.076

2. Dataset Size (D):
   L(D) ∝ (D_c / D)^α_D     where D_c ≈ 5.4 × 10^13, α_D ≈ 0.095

3. Compute Budget (C):
   L(C) ∝ (C_c / C)^α_C     where C_c ≈ 3.1 × 10^8, α_C ≈ 0.050

Key insight: Performance improves smoothly and predictably!
Doubling compute yields a small but consistent loss reduction (a few percent per doubling).

Important Note: These are the original Kaplan et al. (2020) scaling laws. The Chinchilla paper (Hoffmann et al., 2022) later revised these findings, showing that training should balance model size and data more equally than originally thought.

Optimal Allocation (Chinchilla Scaling)

Chinchilla paper finding (Hoffmann et al., 2022): For optimal performance, model size and training tokens should scale equally.

Chinchilla Optimal Rule:

For compute budget C:
- Model parameters: N_opt ∝ C^0.5
- Training tokens:  D_opt ∝ C^0.5

Example:
- 10x more compute → √10 ≈ 3x larger model, 3x more data
- Not: 10x larger model on the same data (the GPT-3 approach)

Chinchilla: 70B params, 1.4T tokens (compute-optimal)
GPT-3:      175B params, 300B tokens (over-parameterized, under-trained)

Reference: Training Compute-Optimal Large Language Models (arxiv.org/abs/2203.15556)

Compute Budget | Optimal Model Size | Optimal Training Tokens
1e20 FLOPs | 400M params | 8B tokens
1e21 FLOPs | 1B params | 20B tokens
1e22 FLOPs | 3B params | 60B tokens
1e23 FLOPs | 10B params | 200B tokens
1e24 FLOPs | 30B params | 600B tokens
1e25 FLOPs | 100B params | 2T tokens
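One pattern worth noticing: every row of the table keeps roughly 20 training tokens per parameter, the rule of thumb usually quoted from Chinchilla. A quick check using the table's own numbers:

rows = [  # (params, training tokens), taken from the table above
    (400e6, 8e9), (1e9, 20e9), (3e9, 60e9),
    (10e9, 200e9), (30e9, 600e9), (100e9, 2e12),
]
for n_params, n_tokens in rows:
    print(f"{n_params:.0e} params -> {n_tokens / n_params:.0f} tokens per parameter")
# Every row works out to ~20 tokens per parameter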

Training Data: Sources and Processing

Data Composition (Common Patterns)

Source | Proportion | Purpose | Quality
Web Crawl (Common Crawl) | 60-70% | Diverse language, broad knowledge | Variable (needs filtering)
Books | 10-15% | Long-form reasoning, narrative | High
Wikipedia | 3-5% | Factual knowledge, encyclopedic | Very high
Code (GitHub) | 5-10% | Programming, structured reasoning | High
ArXiv (scientific papers) | 2-3% | Technical knowledge, math | Very high
Reddit (filtered) | 5-10% | Conversational, Q&A format | Medium (filtered)

Data Processing Pipeline

# 1. Collection
raw_text = download_common_crawl(shard_id)

# 2. Filtering (remove low-quality documents)
def quality_filter(text):
    """Filter out low-quality documents"""
    # Language detection
    if detect_language(text) != 'en':
        return False
    # Remove too short/long documents
    if len(text) < 100 or len(text) > 1_000_000:
        return False
    # Content quality heuristics
    if perplexity(text, model=kenlm) > 500:   # Too random
        return False
    if repetition_ratio(text) > 0.3:          # Too repetitive
        return False
    if symbol_to_word_ratio(text) > 0.5:      # Too much noise
        return False
    return True

filtered = [doc for doc in raw_text if quality_filter(doc)]

# 3. Deduplication (critical for performance!)
def deduplicate(documents):
    """Remove exact and near-duplicate documents"""
    # Exact dedup using hashes
    seen_hashes = set()
    exact_deduped = []
    for doc in documents:
        doc_hash = hash(doc)
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            exact_deduped.append(doc)

    # Near-dedup using MinHash + LSH
    from datasketch import MinHash, MinHashLSH
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    near_deduped = []
    for idx, doc in enumerate(exact_deduped):
        m = MinHash(num_perm=128)
        for word in doc.split():
            m.update(word.encode('utf8'))
        # Check for near-duplicates
        if not lsh.query(m):  # No similar docs found
            lsh.insert(f"doc_{idx}", m)
            near_deduped.append(doc)

    return near_deduped

clean_data = deduplicate(filtered)

# 4. Tokenization
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenized = [tokenizer.encode(doc) for doc in clean_data]

# 5. Shuffling (critical for training diversity!)
import random
random.shuffle(tokenized)

# 6. Pack into sequences of fixed length (e.g., 2048 tokens)
def pack_sequences(tokenized_docs, seq_len=2048):
    """Pack variable-length documents into fixed-length sequences"""
    buffer = []
    sequences = []
    for doc_tokens in tokenized_docs:
        buffer.extend(doc_tokens)
        buffer.append(tokenizer.eos_token_id)  # Separate documents
        # Extract full sequences
        while len(buffer) >= seq_len:
            sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return sequences

training_data = pack_sequences(tokenized)

Computational Cost Breakdown

Training Cost Estimates

Model | Parameters | Training Tokens | Compute (FLOPs) | GPU-Hours (A100) | Est. Cost
GPT-2 | 1.5B | 10B | ~3e19 | ~1K | ~$2K
GPT-3 | 175B | 300B | ~3.1e23 | ~1M | ~$4-5M
LLaMA-2 7B | 7B | 2T | ~2.8e22 | ~180K | ~$300K
LLaMA-2 70B | 70B | 2T | ~2.8e23 | ~1.7M | ~$3M

FLOP calculation: ~6ND FLOPs, where N = parameters, D = tokens (forward + backward pass)
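Plugging that rule into the GPT-3 row above as a sanity check (a rough estimate that ignores attention FLOPs and other overheads):

def training_flops(n_params, n_tokens):
    # Rule of thumb: ~6 FLOPs per parameter per token (forward + backward)
    return 6 * n_params * n_tokens

print(f"GPT-3: {training_flops(175e9, 300e9):.2e} FLOPs")  # ~3.15e+23, matching the table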

Why So Expensive?

  • Matrix multiplications: Billions of parameters × billions of tokens
  • Parallel training: Requires thousands of GPUs for reasonable time
  • Memory bandwidth: Moving parameters and activations is costly
  • Long runs: Must train to (near) convergence; stopping early wastes the compute already invested

Key Papers and Resources

Foundational Papers

Data Processing

Practical Resources

Recommended Learning Path

  1. Read GPT-3 paper to understand scale and capabilities
  2. Study Scaling Laws paper (Kaplan) - fundamental insights
  3. Read Chinchilla paper for compute-optimal training
  4. Implement character-level LM on tiny dataset (Shakespeare)
  5. Study nanoGPT training code
  6. Experiment with data processing (filtering, deduplication)
  7. Train GPT-2 scale model using Hugging Face

Post-training: Alignment & Specialization

Deep dive into SFT, RLHF, and alignment techniques

What is Post-training?

After pre-training, models undergo post-training (also called alignment) to become helpful assistants, specialized tools, or domain experts. This phase transforms raw language models into usable, safe AI systems aligned with human values and preferences.

Why Post-training is Critical

  • Pre-trained models complete text (next-token prediction) but don't follow instructions
  • No safety guardrails: Can generate harmful, biased, or toxic content
  • Poor task performance: Not optimized for specific applications
  • Lack of alignment: Don't understand human preferences or values

Post-training fixes all of these issues, transforming completion models into assistants.

The Three-Stage Pipeline

Stage | Goal | Data Needed | Cost
1. SFT | Teach instruction following | 10K-100K prompt-response pairs | Low (hours on a single GPU)
2. Reward Modeling | Learn human preferences | 10K-100K comparison pairs | Low (hours on a single GPU)
3. RLHF | Optimize for preferences | Prompts only (RL generates responses) | Medium (days on multiple GPUs)

Stage 1: Supervised Fine-Tuning (SFT)

Train the model on high-quality instruction-response pairs to teach it how to follow instructions and behave as an assistant.

SFT Data Format

# Training example format (chat/instruction format)
{
    "prompt": "Explain quantum computing to a 10-year-old.",
    "response": "Imagine a regular computer is like flipping one coin at a time - heads or tails. A quantum computer is like flipping many magic coins at once, where each coin can be heads AND tails at the same time until you look at it! This lets quantum computers solve certain problems much faster than regular computers."
}

# Can also use chat format with roles:
{
    "messages": [
        {"role": "user", "content": "What causes rainbows?"},
        {"role": "assistant", "content": "Rainbows are caused by..."}
    ]
}

SFT Training Objective

Loss = -Σ_t log P(y_t | x, y_<t; θ)

Where:
- x: instruction/prompt
- y: desired response
- y_t: token t in the response
- θ: model parameters

Key differences from pre-training:
- Loss is computed ONLY on response tokens (not the prompt)
- Teaches the model to generate specific responses to instructions
- Much smaller dataset (10K-100K examples vs trillions of tokens)

Complete SFT Implementation

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import Dataset, DataLoader

class InstructionDataset(Dataset):
    """Dataset for instruction fine-tuning"""
    def __init__(self, data, tokenizer, max_length=2048):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Format as a chat/instruction template
        text = f"### Instruction:\n{item['prompt']}\n\n### Response:\n{item['response']}"

        # Tokenize
        tokens = self.tokenizer.encode(text, max_length=self.max_length, truncation=True)

        # Create labels: -100 for prompt tokens (ignored in loss)
        prompt_text = f"### Instruction:\n{item['prompt']}\n\n### Response:\n"
        prompt_tokens = self.tokenizer.encode(prompt_text)
        prompt_len = len(prompt_tokens)

        labels = [-100] * prompt_len + tokens[prompt_len:]

        return {
            'input_ids': torch.tensor(tokens),
            'labels': torch.tensor(labels)
        }

def supervised_fine_tune(
    model_name='gpt2',
    train_data=None,
    num_epochs=3,
    batch_size=4,
    learning_rate=2e-5,
    output_dir='./sft_model'
):
    """
    Supervised fine-tuning on instruction data

    Args:
        model_name: Base model to fine-tune
        train_data: List of {"prompt": ..., "response": ...} dicts
        num_epochs: Training epochs
        batch_size: Batch size
        learning_rate: Learning rate (lower than pre-training!)
        output_dir: Where to save the fine-tuned model
    """
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Create dataset
    # Note: for batch_size > 1, sequences must be padded to the same length
    # (e.g., with a data collator); omitted here for brevity
    dataset = InstructionDataset(train_data, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Optimizer (lower LR than pre-training!)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    # Training loop
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model.train()

    for epoch in range(num_epochs):
        total_loss = 0
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model(input_ids, labels=labels)
            loss = outputs.loss

            # Backward pass
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f}")

    # Save model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Model saved to {output_dir}")

    return model


# Usage
# train_data = [
#     {"prompt": "What is AI?", "response": "AI is..."},
#     {"prompt": "Explain photosynthesis", "response": "Photosynthesis is..."},
#     ...
# ]
# model = supervised_fine_tune(model_name='gpt2', train_data=train_data)

Stage 2: Reward Modeling

Train a separate model to predict human preferences between different responses. This reward model will guide RLHF training.

Preference Data Collection

How Comparison Data is Created

  1. Sample prompts: Collect diverse user prompts
  2. Generate responses: SFT model generates 2-4 responses per prompt
  3. Human ranking: Raters rank responses (best to worst)
  4. Create pairs: All pairwise comparisons extracted
# Example comparison data
{
    "prompt": "Explain black holes simply.",
    "chosen": "Black holes are regions where gravity is so strong that not even light can escape...",  # Ranked #1
    "rejected": "Black holes are big space things."  # Ranked #2 (worse)
}

Reward Model Training (Bradley-Terry Model)

# Probability that response_a is preferred over response_b
P(a ≻ b) = σ(r(x, a) - r(x, b))
         = 1 / (1 + exp(-(r(x, a) - r(x, b))))

Where:
- r(x, y): reward score for response y to prompt x
- σ: sigmoid function
- ≻: preference relation

Loss (cross-entropy):
L = -log σ(r(x, y_chosen) - r(x, y_rejected))

This teaches the reward model to assign higher scores to preferred responses.

Reward Model Implementation

import torch
import torch.nn as nn
from transformers import AutoTokenizer

class RewardModel(nn.Module):
    """
    Reward model for predicting human preferences

    Uses the same architecture as a language model but outputs a scalar reward
    """
    def __init__(self, base_model):
        super().__init__()
        self.model = base_model
        # Replace LM head with reward head (scalar output)
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids):
        """
        Args:
            input_ids: [batch, seq_len]
        Returns:
            rewards: [batch] scalar reward for each sequence
        """
        # Get hidden states
        outputs = self.model(input_ids, output_hidden_states=True)
        hidden_states = outputs.hidden_states[-1]  # Last layer

        # Use the last token's hidden state (end of sequence)
        last_hidden = hidden_states[:, -1, :]  # [batch, hidden_size]

        # Project to scalar reward
        rewards = self.reward_head(last_hidden).squeeze(-1)  # [batch]
        return rewards

def train_reward_model(
    base_model,
    comparison_data,
    num_epochs=3,
    batch_size=4,
    learning_rate=1e-5
):
    """
    Train a reward model on comparison data

    Args:
        base_model: Pre-trained or SFT model
        comparison_data: List of {prompt, chosen, rejected} dicts
        num_epochs: Training epochs
        batch_size: Batch size
        learning_rate: Learning rate
    """
    reward_model = RewardModel(base_model)
    tokenizer = AutoTokenizer.from_pretrained('gpt2')

    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=learning_rate)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    reward_model = reward_model.to(device)

    for epoch in range(num_epochs):
        total_loss = 0
        correct = 0
        total = 0

        for item in comparison_data:
            # Tokenize chosen and rejected responses
            chosen_text = f"{item['prompt']}\n{item['chosen']}"
            rejected_text = f"{item['prompt']}\n{item['rejected']}"

            chosen_ids = tokenizer.encode(chosen_text, return_tensors='pt').to(device)
            rejected_ids = tokenizer.encode(rejected_text, return_tensors='pt').to(device)

            # Get rewards
            reward_chosen = reward_model(chosen_ids)
            reward_rejected = reward_model(rejected_ids)

            # Bradley-Terry loss
            loss = -torch.log(torch.sigmoid(reward_chosen - reward_rejected))

            # Backward pass
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            total_loss += loss.item()

            # Accuracy: chosen should have a higher reward
            if reward_chosen > reward_rejected:
                correct += 1
            total += 1

        accuracy = correct / total
        avg_loss = total_loss / total
        print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f} | Accuracy: {accuracy:.2%}")

    return reward_model


# Usage
# comparison_data = [
#     {"prompt": "...", "chosen": "...", "rejected": "..."},
#     ...
# ]
# reward_model = train_reward_model(base_model, comparison_data)

Stage 3: RLHF with PPO

Use reinforcement learning (specifically PPO - Proximal Policy Optimization) to optimize the model against the reward model while preventing it from deviating too far from the original.

RLHF Objective Function

Objective (PPO-clip + KL penalty):

L_RLHF = E[ min(ratio * A, clip(ratio, 1-ε, 1+ε) * A) ] - β * KL(π_θ || π_ref)

Where:
- ratio = π_θ(y|x) / π_old(y|x)   (probability ratio)
- A: advantage (how much better than expected)
- ε: clip parameter (typically 0.2)
- β: KL penalty coefficient (typically 0.01-0.1)
- π_θ: current policy (model being trained)
- π_ref: reference policy (SFT model, frozen)

Two terms balance:
1. Maximize reward (via the advantage A)
2. Don't drift too far from the reference (KL penalty)

The Complete RLHF Loop

Four Models Involved

  1. Policy (Actor): The model being trained (generates responses)
  2. Value Model (Critic): Estimates expected reward (helps compute advantages)
  3. Reward Model: Scores responses (frozen, from Stage 2)
  4. Reference Model: Original SFT model (frozen, prevents drift)

Simplified RLHF Training Loop

def rlhf_training_step(
    policy_model,      # Model being trained
    value_model,       # Estimates expected reward
    reward_model,      # Frozen reward model
    reference_model,   # Frozen SFT model
    prompts,           # Batch of prompts
    tokenizer,
    beta=0.1           # KL penalty coefficient
):
    """
    Single RLHF training step using PPO

    This is simplified for clarity - real implementations use more
    sophisticated PPO with clipping, advantage estimation, etc.
    """
    device = policy_model.device

    # 1. Generate responses using the current policy
    responses = []
    log_probs_old = []

    with torch.no_grad():
        for prompt in prompts:
            prompt_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

            # Generate response
            output = policy_model.generate(
                prompt_ids,
                max_length=512,
                do_sample=True,
                temperature=1.0,
                return_dict_in_generate=True,
                output_scores=True
            )
            response_ids = output.sequences

            # Compute log probabilities under the current policy
            logits = torch.stack(output.scores, dim=1)  # [batch, seq, vocab]
            log_probs = F.log_softmax(logits, dim=-1)

            responses.append(response_ids)
            log_probs_old.append(log_probs)

    # 2. Compute rewards from the reward model
    rewards = []
    for response_ids in responses:
        reward = reward_model(response_ids)
        rewards.append(reward)
    rewards = torch.tensor(rewards).to(device)

    # 3. Compute KL divergence from the reference model
    kl_penalties = []
    for response_ids in responses:
        # Policy log probs
        policy_output = policy_model(response_ids, labels=response_ids)
        policy_log_probs = policy_output.logits.log_softmax(dim=-1)

        # Reference log probs
        with torch.no_grad():
            ref_output = reference_model(response_ids, labels=response_ids)
            ref_log_probs = ref_output.logits.log_softmax(dim=-1)

        # KL divergence: E[log(p) - log(q)]
        kl = (policy_log_probs - ref_log_probs).sum(dim=-1).mean()
        kl_penalties.append(kl)
    kl_penalty = torch.stack(kl_penalties).mean()

    # 4. Compute advantages using the value model
    values = []
    for response_ids in responses:
        value = value_model(response_ids)
        values.append(value)
    values = torch.tensor(values).to(device)

    advantages = rewards - values  # How much better than expected

    # 5. PPO policy update
    # (log_probs_new is the response log-probability recomputed under the
    #  current policy after each optimizer step; its computation is omitted
    #  here for brevity)
    ratio = torch.exp(log_probs_new - log_probs_old)  # π_new / π_old
    clipped_ratio = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)

    policy_loss = -torch.min(
        ratio * advantages,
        clipped_ratio * advantages
    ).mean()

    # 6. Total loss with KL penalty
    total_loss = policy_loss + beta * kl_penalty

    return total_loss, {
        'reward': rewards.mean().item(),
        'kl': kl_penalty.item(),
        'advantage': advantages.mean().item()
    }


# Full RLHF training would run this in a loop:
# for epoch in range(num_epochs):
#     for batch_prompts in dataloader:
#         loss, metrics = rlhf_training_step(...)
#         loss.backward()
#         optimizer.step()

Modern Alternative: Direct Preference Optimization (DPO)

DPO is a simpler alternative to RLHF that optimizes preferences directly without training a separate reward model or using RL.

DPO Objective

L_DPO = -E[ log σ( β * log(π_θ(y_w|x) / π_ref(y_w|x))
                 - β * log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

Where:
- y_w: preferred (won) response
- y_l: dis-preferred (lost) response
- π_θ: policy being trained
- π_ref: reference policy (frozen)
- β: temperature parameter
- σ: sigmoid

Key insight: Preferences can be optimized directly, without a reward model!
Simpler and more stable than PPO-based RLHF.

Reference: Direct Preference Optimization (arxiv.org/abs/2305.18290)
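A minimal sketch of the DPO loss in PyTorch, assuming you have already summed the per-token log-probabilities of each chosen and rejected response under both the policy and the frozen reference model (the function and variable names are illustrative):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss given summed log-probs of chosen/rejected responses.

    Each argument is a [batch] tensor of log P(response | prompt).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin) == softplus(-margin), which is numerically stable
    return F.softplus(-(chosen_rewards - rejected_rewards)).mean()

# Toy usage with made-up log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # lower when the policy prefers the chosen response more than the reference does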

Parameter-Efficient Fine-Tuning: LoRA

LoRA (Low-Rank Adaptation) allows fine-tuning large models efficiently by training small adapter matrices instead of all parameters.

LoRA Mathematical Formulation

Instead of updating the full weight matrix W ∈ R^(d×k):

    W_new = W + ΔW

LoRA decomposes the update as a low-rank product:

    W_new = W + BA

Where:
- W: frozen pre-trained weights (d×k)
- A ∈ R^(r×k): trainable (down-projection to rank r)
- B ∈ R^(d×r): trainable (up-projection back to d)
- r << min(d, k): rank (typically 8, 16, 32)

Parameters to train: d*r + r*k << d*k

Example: d=4096, k=4096, r=16
- Full:  ~16.8M params
- LoRA:  ~131K params (~100x reduction!)

LoRA Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALayer(nn.Module):
    """
    LoRA adapter for a linear layer

    Adds a trainable low-rank decomposition to frozen weights
    """
    def __init__(self, in_features, out_features, rank=16, alpha=16):
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # Frozen pre-trained weight (not trained)
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False

        # Trainable LoRA matrices
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x):
        """
        Args:
            x: [batch, ..., in_features]
        Returns:
            output: [batch, ..., out_features]
        """
        # Original forward pass (frozen)
        result = F.linear(x, self.weight)

        # Add LoRA adaptation: x @ A^T @ B^T
        lora_out = F.linear(F.linear(x, self.lora_A), self.lora_B)
        result = result + lora_out * self.scaling

        return result

# Apply LoRA to a transformer model
def apply_lora_to_model(model, rank=16, target_modules=['q_proj', 'v_proj']):
    """
    Replace specified linear layers with LoRA versions

    Args:
        model: Transformer model
        rank: LoRA rank
        target_modules: Which layers to adapt (commonly Q, V projections)
    """
    for name, module in model.named_modules():
        if any(target in name for target in target_modules):
            if isinstance(module, nn.Linear):
                # Replace with a LoRA layer
                lora_layer = LoRALayer(
                    module.in_features,
                    module.out_features,
                    rank=rank
                )
                # Copy pre-trained weights
                lora_layer.weight.data = module.weight.data.clone()

                # Replace in the model
                parent_name = '.'.join(name.split('.')[:-1])
                child_name = name.split('.')[-1]
                parent = dict(model.named_modules())[parent_name]
                setattr(parent, child_name, lora_layer)

    # Freeze all non-LoRA parameters
    for name, param in model.named_parameters():
        if 'lora' not in name:
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

    return model


# Usage
# model = AutoModelForCausalLM.from_pretrained('gpt2')
# model = apply_lora_to_model(model, rank=16)
# # Now train only LoRA parameters (much faster, less memory!)

Key Papers and Resources

Foundational RLHF Papers

DPO and Alternatives

Parameter-Efficient Fine-Tuning

Practical Libraries

Recommended Learning Path

  1. Read InstructGPT paper (Ouyang et al.) - foundational RLHF
  2. Implement simple SFT on small instruction dataset
  3. Study LoRA paper and implement low-rank adaptation
  4. Read DPO paper - simpler alternative to RLHF
  5. Experiment with TRL library for RLHF
  6. Read Constitutional AI for AI feedback approaches
  7. Implement full RLHF pipeline on toy problem

Essential Concepts

Deep dive into tokens, embeddings, sampling, and more

Tokens & Tokenization

Tokens are the atomic units that LLMs process. Modern LLMs don't operate on characters or words directly - they use subword tokenization to balance vocabulary size and coverage.

Why Subword Tokenization?

  • Character-level: Tiny vocabulary (~256), but very long sequences → expensive
  • Word-level: Huge vocabulary (millions), can't handle unseen words (OOV problem)
  • Subword-level: Balanced! ~50K vocab, handles rare words via subword splitting

Byte Pair Encoding (BPE)

The most common tokenization algorithm used by GPT, LLaMA, and most modern LLMs.

# BPE Algorithm (simplified)
def train_bpe(corpus, num_merges=10000):
    """
    Train a BPE tokenizer on a corpus.

    Args:
        corpus: List of words
        num_merges: Number of merge operations (controls vocabulary size)
    Returns:
        merge_rules: List of (pair, merged_token) rules
    """
    # Start with a character-level vocabulary
    # (count_words, count_pairs, merge_vocab are helper functions, omitted here)
    vocab = {word: freq for word, freq in count_words(corpus).items()}
    merge_rules = []

    for i in range(num_merges):
        # Find the most frequent adjacent pair
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)

        # Merge the pair
        vocab = merge_vocab(vocab, best_pair)
        merge_rules.append(best_pair)
        print(f"Merge {i}: {best_pair[0]} + {best_pair[1]} → {best_pair[0]}{best_pair[1]}")

    return merge_rules

# Example tokenization:
# "understanding" with BPE:
#   1. Start: ['u', 'n', 'd', 'e', 'r', 's', 't', 'a', 'n', 'd', 'i', 'n', 'g']
#   2. After merges: ['under', 'stand', 'ing']
# "unexpected" (less common):
#   1. Start: ['u', 'n', 'e', 'x', 'p', 'e', 'c', 't', 'e', 'd']
#   2. After merges: ['un', 'exp', 'ect', 'ed']   # More fragmented

Tokenization Examples

Text Tokens (GPT-2 BPE) Token Count
"Hello, world!" ["Hello", ",", " world", "!"] 4
"ChatGPT" ["Chat", "G", "PT"] 3
"antidisestablishmentarianism" ["ant", "idis", "establish", "ment", "arian", "ism"] 6
" strawberry" (with space) [" straw", "berry"] 2
"🤖🚀" ["�", "�", "�", "�", "�", "�", "�", "�"] 8 (UTF-8 bytes)

Key insight: Spaces are part of tokens! " world" ≠ "world". This affects prompting and token counting.
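
You can check this yourself with OpenAI's tiktoken library. A minimal sketch, assuming tiktoken is installed and using the GPT-2 encoding from the table above:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

print(enc.encode("world"))     # one token ID
print(enc.encode(" world"))    # a different ID: the leading space is part of the token
print(len(enc.encode("Hello, world!")))         # 4 tokens, matching the table above
print(enc.decode(enc.encode("Hello, world!")))  # round-trips back to the original text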

Vocabulary Sizes

  • GPT-2: 50,257 tokens
  • GPT-3: 50,257 tokens (same BPE vocabulary as GPT-2); GPT-3.5/GPT-4: ~100,000 tokens (cl100k_base)
  • LLaMA: 32,000 tokens (SentencePiece)
  • LLaMA-2: 32,000 tokens

Embeddings: From Tokens to Vectors

Embeddings convert discrete tokens into continuous vector representations that capture semantic meaning.

Embedding Layer

import torch
import torch.nn as nn

# Embedding layer: a lookup table
vocab_size = 50000
embedding_dim = 768  # GPT-2 small

embedding = nn.Embedding(vocab_size, embedding_dim)

# Token IDs → vectors
token_ids = torch.tensor([145, 892, 319])  # [cat, sat, on]
embeddings = embedding(token_ids)          # [3, 768]

# Each token is now a 768-dimensional vector
print(embeddings.shape)  # torch.Size([3, 768])
print(embeddings[0])     # Vector for "cat": [0.234, -0.512, 0.891, ...]

Embedding Properties

Semantic Similarity

Tokens with similar meanings have similar vectors (high cosine similarity).

import torch
import torch.nn.functional as F

# Cosine similarity
def cosine_sim(vec1, vec2):
    return F.cosine_similarity(vec1, vec2, dim=-1)

# Example (learned embeddings; token IDs are illustrative):
emb_king  = embedding(torch.tensor([1234]))
emb_queen = embedding(torch.tensor([5678]))
emb_man   = embedding(torch.tensor([789]))
emb_car   = embedding(torch.tensor([456]))

sim_king_queen = cosine_sim(emb_king, emb_queen)  # High (~0.8)
sim_king_car   = cosine_sim(emb_king, emb_car)    # Low (~0.2)

Famous Word Analogies

# Vector arithmetic captures relationships!
king - man + woman ≈ queen
# vec("king") - vec("man") + vec("woman") ≈ vec("queen")

paris - france + italy ≈ rome   # Geographic relationships
walked - walk + run ≈ ran       # Grammatical transformations

# This works because embeddings learn semantic structure!

Positional Embeddings

Since attention is permutation-invariant, we add positional information to embeddings.

# Learned positional embeddings (GPT-2, BERT)
max_position = 1024
pos_embedding = nn.Embedding(max_position, embedding_dim)

# For input tokens at positions [0, 1, 2]
positions = torch.arange(3)
pos_embeds = pos_embedding(positions)     # [3, 768]

# Combined embedding
token_embeds = embedding(token_ids)       # [3, 768]
final_embeds = token_embeds + pos_embeds  # [3, 768]

# Now each token knows its position in the sequence!

Reference: "Attention Is All You Need" introduced sinusoidal positional encodings. Modern models (GPT, BERT) use learned positional embeddings. LLaMA uses RoPE (Rotary Position Embedding).

Context Window: The Attention Bottleneck

The context window (or context length) is the maximum number of tokens the model can process at once. This is a hard architectural limit.

Why Context Matters

  • Hard limit: Model can only "see" last N tokens
  • Affects memory: Can't reference information beyond context
  • Quadratic cost: O(n²) attention computation
  • Determines use cases: Short contexts → chat; Long contexts → document analysis

Context Lengths of Modern LLMs

Model Context Length ~Word Count ~Pages (250 words/page)
GPT-3 2,048 tokens ~1,500 words ~6 pages
GPT-4 (8K) 8,192 tokens ~6,000 words ~24 pages
GPT-4 (32K) 32,768 tokens ~24,000 words ~96 pages
GPT-4 Turbo 128,000 tokens ~96,000 words ~384 pages
Claude 2 100,000 tokens ~75,000 words ~300 pages
Claude 3 200,000 tokens ~150,000 words ~600 pages
LLaMA-2 4,096 tokens ~3,000 words ~12 pages

Context Window Trade-offs

Larger context = more memory and compute

  • Memory: Storing attention matrix: O(n²) memory
  • Compute: Computing attention: O(n²·d) FLOPs
  • Latency: Longer sequences → slower generation
# Example: GPT-4 Turbo's 128K context
# If each attention score is 2 bytes (FP16):
#   memory = 128_000^2 * 2 bytes ≈ 32.8 GB of attention scores
#   (per head, per layer, if the full matrix is materialized naively)
# This is why long context is expensive and requires
# techniques like FlashAttention for efficiency

Sampling & Decoding Strategies

How models generate text by sampling from probability distributions over tokens.

Temperature Scaling

Temperature controls the randomness of predictions by scaling logits before softmax.

import torch
import torch.nn.functional as F

# Model outputs logits (unnormalized scores)
logits = torch.tensor([2.0, 1.0, 0.1])  # Three possible next tokens

# Apply temperature
def sample_with_temperature(logits, temperature=1.0):
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    return probs

# Low temperature (T=0.5): more confident, peaked distribution
probs_low = sample_with_temperature(logits, temperature=0.5)
# [0.864, 0.117, 0.019]  ← First token dominates

# Medium temperature (T=1.0): balanced
probs_medium = sample_with_temperature(logits, temperature=1.0)
# [0.659, 0.242, 0.099]  ← Still peaked but less extreme

# High temperature (T=2.0): more uniform, diverse
probs_high = sample_with_temperature(logits, temperature=2.0)
# [0.502, 0.304, 0.194]  ← More spread out

# Very low (T=0.1): nearly deterministic
probs_very_low = sample_with_temperature(logits, temperature=0.1)
# [1.000, 0.000, 0.000]  ← Almost always the first token

Sampling Strategies Comparison

Strategy How It Works Use Cases
Greedy Always pick highest probability token Factual Q&A, translation (deterministic)
Beam Search Keep top K sequences at each step Machine translation, summarization
Top-K Sampling Sample from top K most likely tokens Creative writing (K=40-100)
Top-P (Nucleus) Sample from smallest set with cumulative prob ≥ P Most versatile (P=0.9-0.95)
Temperature + Top-P Combine both for fine control Production systems (T=0.7, P=0.9)

Top-K vs Top-P (Nucleus) Sampling

def top_k_sampling(logits, k=50):
    """Keep only the top K tokens."""
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs = F.softmax(top_k_logits, dim=-1)
    # Sample from the top K
    next_token = torch.multinomial(probs, 1)
    return top_k_indices[next_token]

def top_p_sampling(logits, p=0.9):
    """Nucleus sampling: keep the smallest set with cumulative prob ≥ p."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(probs, dim=-1)

    # Remove tokens once the cumulative prob exceeds p
    remove_mask = cumulative_probs > p
    remove_mask[..., 1:] = remove_mask[..., :-1].clone()
    remove_mask[..., 0] = False

    # Zero out removed tokens
    sorted_logits[remove_mask] = float('-inf')

    # Sample from the remaining tokens
    probs = F.softmax(sorted_logits, dim=-1)
    next_token_idx = torch.multinomial(probs, 1)
    return sorted_indices[next_token_idx]

# Example probability distribution:
# Token 0: 0.40 ←
# Token 1: 0.30 ← Top-P (p=0.9) includes these (cum=0.7)
# Token 2: 0.20 ← and this one (cum=0.9)
# Token 3: 0.05 ← Excluded (would exceed p)
# Token 4: 0.03
# Token 5: 0.02
# Top-K (k=3) would include tokens 0, 1, 2 regardless of their probabilities
# Top-P (p=0.9) includes 0, 1, 2 because they sum to ≥ 0.9

Recommended Settings

Task Temperature Top-P Notes
Factual Q&A 0.0-0.3 1.0 Deterministic, accurate
Code generation 0.2-0.5 0.95 Mostly deterministic
Chat assistant 0.7-0.9 0.9 Balanced
Creative writing 0.9-1.2 0.95 Diverse, creative
Brainstorming 1.0-1.5 1.0 Very diverse
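
Since the table above recommends combining temperature with top-p, here is a minimal sketch of how the two are typically chained when sampling a single token; the function name is illustrative, and production code would normally rely on a library's generate() method instead.

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.7, top_p=0.9):
    """Apply temperature scaling, then nucleus filtering, then sample one token ID."""
    # 1. Temperature scaling
    logits = logits / temperature

    # 2. Nucleus (top-p) filtering
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cumulative > top_p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    sorted_logits[remove] = float('-inf')

    # 3. Sample from the filtered distribution
    probs = F.softmax(sorted_logits, dim=-1)
    idx = torch.multinomial(probs, 1)
    return sorted_indices[idx]

# next_id = sample_next_token(model_logits, temperature=0.7, top_p=0.9)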

Key Papers and Resources

Tokenization

Embeddings

Sampling Strategies

Long Context

Practical Tools

Recommended Learning Path

  1. Experiment with tiktoken to see how text is tokenized
  2. Read BPE paper (Sennrich et al.) to understand the algorithm
  3. Implement simple embedding layer in PyTorch
  4. Read Word2Vec paper for embedding intuition
  5. Experiment with temperature and top-p sampling
  6. Read Nucleus sampling paper (Holtzman et al.)
  7. Implement all sampling strategies from scratch

Popular LLM Families

Overview of major language models and their characteristics

Major LLM Families

GPT Series

OpenAI's Generative Pre-trained Transformer models (GPT-3, GPT-4)

Claude

Anthropic's constitutional AI assistant with strong reasoning

LLaMA

Meta's open-source models enabling research and development

BERT

Google's bidirectional encoder for understanding tasks

PaLM/Gemini

Google's latest multimodal AI models

Mistral

Efficient open-source models from Mistral AI

Model Comparison

Model Parameters Context Notable Feature
GPT-4 Unknown 128K tokens Multimodal capabilities
Claude 3 Unknown 200K tokens Long context, strong reasoning
LLaMA 2 7B-70B 4K tokens Open source
Mixtral 8x7B 47B 32K tokens Mixture of Experts
Choosing a Model: Consider factors like cost, latency, capabilities, context length, and whether you need self-hosting (open source) or API access (proprietary).

Test Your Knowledge

Assess your understanding of LLMs and transformers

Knowledge Quiz

1. What is the key innovation that makes Transformers powerful?
A) Recurrent connections
B) Attention mechanism
C) Convolutional layers
D) Decision trees
2. What does RLHF stand for?
A) Recursive Learning from Hidden Features
B) Reinforcement Learning from Human Feedback
C) Random Learning for High Frequency
D) Regulated Learning for Helpful Functions
3. What are tokens in the context of LLMs?
A) Physical hardware components
B) Basic units of text the model processes
C) Security credentials
D) Neural network layers
4. During pre-training, what is the primary objective?
A) Follow human instructions
B) Predict the next token in a sequence
C) Generate creative stories
D) Classify images
5. What is the purpose of temperature in text generation?
A) Control the model's training speed
B) Measure the model's accuracy
C) Control randomness and creativity in outputs
D) Determine the context window size

Learning Resources

Continue your journey into LLMs and AI

Essential Papers

  • "Attention Is All You Need" - The original Transformer paper (Vaswani et al., 2017)
  • "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2018)
  • "Language Models are Few-Shot Learners" - GPT-3 paper (Brown et al., 2020)
  • "Training language models to follow instructions with human feedback" - InstructGPT/RLHF (Ouyang et al., 2022)
  • "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022)

Online Courses & Tutorials

  • Stanford CS224N: Natural Language Processing with Deep Learning
  • Hugging Face NLP Course: Free comprehensive course on transformers
  • Fast.ai: Practical Deep Learning for Coders
  • DeepLearning.AI: Courses on Transformers and LLMs
  • Andrej Karpathy: YouTube tutorials on building GPT from scratch

Tools to Experiment

  • Hugging Face: Pre-trained models, datasets, and the Transformers library
  • OpenAI Playground: Experiment with GPT models
  • Google Colab: Free GPU/TPU for training and experimenting
  • PyTorch / TensorFlow: Deep learning frameworks
  • LangChain: Framework for building LLM applications

Communities & Blogs

  • r/MachineLearning: Reddit community for ML research
  • Papers with Code: Latest research with implementations
  • The Gradient: Online magazine about AI research
  • Distill.pub: Clear explanations of ML concepts
  • AI Alignment Forum: Discussions on AI safety and alignment

Next Steps

  1. Read the foundational Transformer paper
  2. Experiment with models on Hugging Face
  3. Build a simple project using LLM APIs
  4. Follow AI research communities
  5. Consider specializing in an area (safety, applications, research)

Sources & References

This guide draws from foundational papers and research in the field of large language models. Below are key citations and resources:

Foundational Papers

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need"
The original Transformer architecture paper
https://arxiv.org/abs/1706.03762

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
Introduced bidirectional pre-training and masked language modeling
https://arxiv.org/abs/1810.04805

Radford, A., Wu, J., Child, R., et al. (2019). "Language Models are Unsupervised Multitask Learners"
GPT-2: Demonstrated zero-shot task transfer capabilities
OpenAI GPT-2 Paper

Scaling & Training

Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models"
Original scaling laws for LLMs
https://arxiv.org/abs/2001.08361

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models"
Chinchilla paper: Revised scaling laws showing optimal compute allocation
https://arxiv.org/abs/2203.15556

Optimization & Efficiency

Dao, T., Fu, D. Y., Ermon, S., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
Efficient attention mechanism using tiling and kernel fusion
https://arxiv.org/abs/2205.14135

Alignment & Safety

Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback"
Self-improvement and alignment through AI-generated feedback
https://arxiv.org/abs/2212.08073

Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
DPO: Simplified alignment method without explicit reward modeling
https://arxiv.org/abs/2305.18290

Emergent Capabilities

Wei, J., Tay, Y., Bommasani, R., et al. (2022). "Emergent Abilities of Large Language Models"
Documents qualitative improvements that emerge at scale
https://arxiv.org/abs/2206.07682

Note: This field moves rapidly. For the latest research, check arXiv.org, Papers with Code, and conference proceedings (NeurIPS, ICML, ACL, EMNLP).

How LLMs Understand and Generate Code

Deep dive into code comprehension, generation, and the patterns LLMs recognize

Code as a Language

LLMs treat programming code as just another language. During pre-training, models see billions of lines of code from GitHub, Stack Overflow, documentation, and tutorials. They learn:

  • Syntax patterns: Language-specific grammar, keywords, operators
  • Semantic relationships: How variables, functions, and classes relate
  • Common idioms: Standard patterns like loops, recursion, error handling
  • API usage: How to use libraries and frameworks correctly
  • Code structure: File organization, imports, module patterns
  • Documentation patterns: Docstrings, comments, type hints

What Makes Code Different from Natural Language

Precise Syntax

One missing semicolon breaks everything. No room for ambiguity.

Logical Structure

Code follows strict logical flow: control structures, state changes, algorithms.

Compositionality

Complex programs built from simple, reusable components.

Execution Semantics

Code must be executable. The computer is the ultimate judge.

Code Training Data

Modern code-capable LLMs are trained on massive code repositories:

Source Size Content
GitHub 100M+ repositories Open source code in all languages
Stack Overflow 20M+ questions Q&A with code examples
Documentation Billions of tokens Official docs, tutorials, guides
LeetCode/HackerRank 100K+ problems Algorithmic problems with solutions

Common Programming Patterns LLMs Learn

Through exposure to millions of code examples, LLMs develop strong intuitions about common patterns:

1. Data Structure Patterns

# Array/List Patterns
two_pointers   = lambda arr: ...  # Two pointer technique
sliding_window = lambda arr: ...  # Sliding window
prefix_sum     = lambda arr: [sum(arr[:i+1]) for i in range(len(arr))]

# HashMap Patterns
frequency_map = {}  # Count occurrences
seen = set()        # Track visited elements
memo = {}           # Memoization for DP

# Tree Patterns
def dfs(node):
    if not node:
        return
    # Process node
    dfs(node.left)
    dfs(node.right)

# Graph Patterns
adj_list = {i: [] for i in range(n)}  # Adjacency list
visited = set()                       # Track visited nodes

2. Algorithm Patterns

# Binary Search Pattern
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# Dynamic Programming Pattern
def dp_problem(n):
    dp = [0] * (n + 1)
    dp[0] = base_case
    for i in range(1, n + 1):
        dp[i] = transition_function(dp, i)
    return dp[n]

# Backtracking Pattern
def backtrack(path, choices):
    if is_solution(path):
        result.append(path[:])
        return
    for choice in choices:
        path.append(choice)
        backtrack(path, remaining_choices)
        path.pop()  # Undo choice

3. Edge Case Handling

# LLMs learn these common edge cases:
if not nums or len(nums) == 0:             # Empty input
    return []
if len(nums) == 1:                         # Single element
    return nums[0]
if target < nums[0] or target > nums[-1]:  # Out of range
    return -1
if left > right:                           # Invalid range
    return None

# Handling None/null
if root is None:
    return 0

How LLMs Generate Code

When generating code, LLMs follow a probabilistic process:

1. Understand the Problem

Parse the natural language description, identify key requirements, constraints, and expected inputs/outputs. Extract data structure needs and algorithmic requirements.

2. Recall Similar Problems

Attention mechanism activates memories of similar problems seen during training. Pattern matching against thousands of known solutions.

3. Select High-Level Approach

Choose algorithm family (greedy, DP, divide-and-conquer, etc.) based on problem characteristics. Select appropriate data structures.

4. Generate Structure

Start with function signature, parameter types. Add main logic scaffold (loops, conditionals, base cases). Structure follows learned templates.

5. Fill in Details

Token by token generation of the actual logic. Each token is predicted based on all previous context. Variable names follow conventions.

6. Add Error Handling

Insert edge case checks, validation, and error handling based on learned patterns from high-quality code.

Strengths of LLMs in Coding

  • Pattern Recognition: Excellent at recognizing which algorithm pattern fits a problem
  • Boilerplate: Great at generating standard code structures and setup
  • Syntax: Strong knowledge of syntax across many programming languages
  • Common Algorithms: Can implement standard algorithms (sorting, searching, graph traversal)
  • Code Translation: Can convert code between programming languages
  • Explanation: Can explain what code does line by line
  • Debugging: Can spot common bugs and suggest fixes

Limitations in Coding

  • Complex Logic: Struggles with highly nested or intricate logical conditions
  • Novel Algorithms: Can't invent truly new algorithmic approaches
  • Mathematical Proofs: Can't rigorously prove correctness
  • Optimization: May not generate the most efficient solution
  • Edge Cases: Might miss rare or subtle edge cases
  • Testing: Generated code needs human verification
  • Large Codebases: Context window limits understanding of large projects

Algorithmic Reasoning in LLMs

How LLMs solve algorithmic problems and approach LeetCode-style challenges

Understanding Algorithmic Reasoning

Algorithmic reasoning requires step-by-step logical thinking, understanding of complexity, and systematic problem-solving. LLMs develop these capabilities through exposure to millions of algorithm implementations and explanations.

How LLMs Learn Algorithms

During training, LLMs see algorithms presented in multiple forms:

Code Implementations

Working code in Python, Java, C++, etc.

Pseudocode

High-level algorithmic descriptions

Natural Language

Written explanations and tutorials

Examples & Traces

Step-by-step execution examples

Core Algorithm Categories LLMs Understand

1. Searching & Sorting (Mastery: High)

# Binary Search - LLMs excel at this
def binary_search(nums, target):
    """
    Time: O(log n), Space: O(1)
    LLMs understand: logarithmic complexity, divide-and-conquer
    """
    left, right = 0, len(nums) - 1
    while left <= right:
        mid = left + (right - left) // 2  # Avoid overflow
        if nums[mid] == target:
            return mid
        elif nums[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# Merge Sort - Recursive pattern well-learned
def merge_sort(arr):
    """
    Time: O(n log n), Space: O(n)
    LLMs understand: divide-and-conquer, recursion
    """
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    return merge(left, right)  # merge() combines two sorted lists (definition omitted)

2. Dynamic Programming (Mastery: Medium-High)

# Fibonacci - Classic DP pattern
def fib_dp(n):
    """
    LLMs recognize: overlapping subproblems, memoization
    Bottom-up vs top-down approaches
    """
    if n <= 1:
        return n
    dp = [0] * (n + 1)
    dp[1] = 1
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    return dp[n]

# Longest Common Subsequence
def lcs(s1, s2):
    """
    LLMs understand: 2D DP table, recurrence relations
    Building the solution from smaller subproblems
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

3. Graph Algorithms (Mastery: Medium)

# DFS - Depth First Search
def dfs(graph, start, visited=None):
    """
    LLMs understand: recursion, backtracking
    Visited set pattern, adjacency lists
    """
    if visited is None:
        visited = set()
    visited.add(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited

# BFS - Breadth First Search
from collections import deque

def bfs(graph, start):
    """
    LLMs recognize: queue usage, level-by-level traversal
    Shortest path in unweighted graphs
    """
    visited = set([start])
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

# Dijkstra's Algorithm
import heapq

def dijkstra(graph, start):
    """
    LLMs understand: priority queue, greedy approach
    Distance tracking, relaxation step
    """
    distances = {node: float('inf') for node in graph}
    distances[start] = 0
    pq = [(0, start)]
    while pq:
        current_dist, current_node = heapq.heappop(pq)
        if current_dist > distances[current_node]:
            continue
        for neighbor, weight in graph[current_node].items():
            distance = current_dist + weight
            if distance < distances[neighbor]:
                distances[neighbor] = distance
                heapq.heappush(pq, (distance, neighbor))
    return distances

4. Tree Algorithms (Mastery: High)

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

# Inorder Traversal - LLMs excel at this
def inorder(root):
    """
    Pattern: left -> root -> right
    LLMs understand: recursive tree traversal
    """
    if not root:
        return []
    return inorder(root.left) + [root.val] + inorder(root.right)

# Lowest Common Ancestor
def lca(root, p, q):
    """
    LLMs recognize: recursive structure
    Base cases and divide-and-conquer
    """
    if not root or root == p or root == q:
        return root
    left = lca(root.left, p, q)
    right = lca(root.right, p, q)
    if left and right:
        return root
    return left if left else right

# Validate BST
def isValidBST(root, min_val=float('-inf'), max_val=float('inf')):
    """
    LLMs understand: constraints propagation
    Range validation pattern
    """
    if not root:
        return True
    if root.val <= min_val or root.val >= max_val:
        return False
    return (isValidBST(root.left, min_val, root.val) and
            isValidBST(root.right, root.val, max_val))

Problem-Solving Strategies LLMs Learn

Strategy When to Use Example Problems
Two Pointers Sorted arrays, strings, linked lists Two Sum II, Container With Most Water
Sliding Window Subarray/substring problems Longest Substring, Max Sum Subarray
Fast & Slow Pointers Cycle detection, middle element Linked List Cycle, Happy Number
Merge Intervals Overlapping intervals Merge Intervals, Meeting Rooms
Top K Elements Finding largest/smallest K items Kth Largest, Top K Frequent
Binary Search Sorted data, search space reduction Search in Rotated Array, Find Min
Backtracking All combinations, permutations Subsets, N-Queens, Sudoku
Dynamic Programming Optimal substructure, overlapping Coin Change, Longest Palindrome

Time Complexity Recognition

LLMs develop intuition about complexity by seeing code annotated with Big-O notation:

# Common Complexity Classes LLMs Recognize

# O(1) - Constant
def get_first(arr):
    return arr[0] if arr else None

# O(log n) - Logarithmic (binary search pattern)
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        # ... halving the search space each time

# O(n) - Linear (single pass)
def find_max(arr):
    return max(arr)

# O(n log n) - Linearithmic (efficient sorting)
def sort(arr):
    return sorted(arr)  # Timsort

# O(n²) - Quadratic (nested loops)
def bubble_sort(arr):
    for i in range(len(arr)):
        for j in range(len(arr) - 1 - i):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]

# O(2^n) - Exponential (recursive branching)
def fibonacci_naive(n):
    if n <= 1:
        return n
    return fibonacci_naive(n-1) + fibonacci_naive(n-2)

LLM Reasoning Process for LeetCode Problems

Step 1: Pattern Recognition

Identify which algorithmic pattern the problem fits. Keywords trigger pattern matching: "sorted array" → binary search, "subarray" → sliding window, "paths" → DFS/BFS.

Step 2: Data Structure Selection

Choose appropriate data structures based on operations needed. HashMap for fast lookups, heap for priority, stack for LIFO, queue for BFS.

Step 3: Algorithm Template

Apply learned template for the identified pattern. Start with the scaffolding that fits the problem type.

Step 4: Implementation

Fill in problem-specific logic within the template structure. Handle indices, boundaries, update rules.

Step 5: Edge Cases

Add checks for empty inputs, single elements, negative numbers, duplicates, etc.
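
To illustrate steps 1-5, here is the kind of template-driven solution this process typically produces for a prompt such as "longest substring without repeating characters" (a classic sliding-window problem); the code below is a standard reference solution used as an example, not actual model output.

def length_of_longest_substring(s: str) -> int:
    """Sliding-window pattern: expand right, shrink left on duplicates. O(n) time."""
    seen = set()  # characters currently inside the window
    left = 0
    best = 0
    for right, ch in enumerate(s):
        # Shrink the window until ch is no longer duplicated
        while ch in seen:
            seen.remove(s[left])
            left += 1
        seen.add(ch)
        best = max(best, right - left + 1)
    return best

# length_of_longest_substring("abcabcbb") == 3  ("abc")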

What LLMs Struggle With

  • Novel Problem Types: Completely new problems without similar training examples
  • Multi-step Mathematical Reasoning: Problems requiring careful mathematical derivation
  • Subtle Optimizations: Finding the most space-efficient or time-efficient solution
  • Complex State Management: Problems with many interacting state variables
  • Proof of Correctness: Formally proving an algorithm is correct
  • Rare Edge Cases: Unusual boundary conditions not well-represented in training

Prompting LLMs for Coding Problems

Master techniques for getting the best solutions from LLMs for LeetCode-style problems

Effective Prompting Strategies

How you ask an LLM to solve a coding problem dramatically affects the quality of the solution. Well-structured prompts lead to better, more reliable code.

The Anatomy of a Good Coding Prompt

# Poor Prompt:
"Solve two sum problem"

# Good Prompt:
"Given an array of integers nums and an integer target, return indices of
the two numbers that add up to target.

Constraints:
- 2 <= nums.length <= 10^4
- -10^9 <= nums[i] <= 10^9
- Only one valid answer exists

Example:
Input: nums = [2,7,11,15], target = 9
Output: [0,1]
Explanation: nums[0] + nums[1] == 9

Please provide:
1. Optimal solution with explanation
2. Time and space complexity analysis
3. Edge cases to consider"

Essential Prompting Techniques

1. Chain-of-Thought Prompting

Ask the LLM to think step-by-step before coding:

Prompt: "Solve this problem step by step: 1. First, analyze the problem and identify the pattern 2. Explain your high-level approach 3. Discuss time and space complexity 4. Then provide the implementation 5. Finally, test with edge cases Problem: [Your problem here]" # This leads to better reasoning and more reliable solutions

2. Few-Shot Learning

Provide examples of similar solved problems:

Prompt: "Here's a similar problem I solved: Problem: Two Sum Solution: Use HashMap for O(n) time [code example] Now solve this similar problem: Problem: Three Sum [problem description]" # LLM learns from the pattern in your example

3. Specify Constraints and Requirements

Prompt: "Solve this problem with these requirements: - Time complexity: Must be O(n log n) or better - Space complexity: Must be O(1) auxiliary space - Language: Python 3 - Style: Use type hints and docstrings - Handle edge cases: empty array, single element, duplicates - Add comments explaining the algorithm Problem: [description]"

4. Ask for Multiple Approaches

Prompt: "Provide three different solutions: 1. Brute force approach (for understanding) 2. Optimized approach with trade-offs explained 3. Most optimal approach For each, include: - Code implementation - Time and space complexity - When to use this approach Problem: [description]"

5. Iterative Refinement

Build on previous responses:

First prompt:
"Solve this problem: [description]"
[Get initial solution]

Second prompt:
"The solution works but fails for edge case: [specific case]. Please fix this bug."
[Get improved solution]

Third prompt:
"Can you optimize the space complexity? Currently using O(n) space."
[Get optimized solution]

Fourth prompt:
"Add comprehensive unit tests"

Prompting Patterns for Common Problem Types

For Array Problems:

"Solve this array problem: [problem] Consider these approaches: - Two pointers - Sliding window - Prefix sum - HashMap for tracking Choose the most appropriate and explain why."

For Tree Problems:

"Solve this binary tree problem: [problem] Specify whether to use: - DFS (preorder/inorder/postorder) - BFS (level-order) - Recursive vs iterative - Any special tree properties (BST, balanced, etc.) Include the TreeNode class definition."

For Dynamic Programming:

"Solve this DP problem: [problem] Please: 1. Define the state (what does dp[i] represent?) 2. Write the recurrence relation 3. Identify base cases 4. Implement bottom-up solution 5. Discuss if space optimization is possible"

For Graph Problems:

"Solve this graph problem: [problem] Clarify: - Graph representation (adjacency list/matrix) - Directed or undirected - Weighted or unweighted - Algorithm choice (DFS/BFS/Dijkstra/etc.) Include graph construction from input."

Advanced Prompting Techniques

Self-Consistency Prompting

"Generate 3 different solutions to this problem. Then analyze which one is best and why. Finally, provide the optimal solution. Problem: [description]" # This makes the LLM reason about different approaches

Socratic Prompting

"Let's solve this problem together: Problem: [description] Question 1: What data structure would best suit this problem? [Wait for response] Question 2: What would be the time complexity of using that? [Wait for response] Question 3: Can we do better? If so, how? [Wait for response] Now implement the optimal solution."

Explain-Then-Code Pattern

"First, explain the algorithm in plain English without code. Walk through an example step by step. Only after the explanation is clear, provide the implementation. Problem: [description]" # Forces logical thinking before coding

Debugging with LLMs

When Code Fails:

"This code fails for test case: [test case] Expected: [expected output] Got: [actual output] Here's the code: [your code] Please: 1. Identify the bug 2. Explain why it fails 3. Provide the corrected code 4. Suggest similar edge cases to test"

Code Review Prompt:

"Review this solution: [your code] Check for: 1. Correctness 2. Edge cases 3. Time/space complexity 4. Code quality and readability 5. Potential optimizations Provide detailed feedback."

Anti-Patterns to Avoid

Don't:

  • Give vague prompts like "help with this problem"
  • Omit problem constraints and examples
  • Accept first solution without testing
  • Ask for code without understanding the approach first
  • Trust LLM blindly for complex mathematical proofs
  • Skip edge case discussion
  • Forget to specify language and style preferences

Testing LLM-Generated Solutions

"After providing the solution, generate: 1. Unit tests covering: - Happy path cases - Edge cases (empty, single element, max size) - Boundary conditions - Invalid inputs 2. Performance tests for large inputs 3. Comparison with expected output format"

Best Practices for LeetCode with LLMs

Practice Why
Start with understanding Ask LLM to explain problem before coding
Request multiple solutions Learn different approaches and trade-offs
Verify complexity claims LLMs can be wrong about Big-O analysis
Test edge cases LLM might miss rare boundary conditions
Understand before copying Learn the pattern for future problems
Iterate and refine First solution may not be optimal

Example: Complete LeetCode Session

# Step 1: Understanding
Prompt: "Explain the 'Container With Most Water' problem.
What's the intuition? What patterns does it use?"

# Step 2: Approach
Prompt: "What approaches can solve this? Compare brute force vs optimal.
Include complexity for each."

# Step 3: Implementation
Prompt: "Implement the optimal solution in Python with:
- Type hints
- Detailed comments
- Clear variable names"

# Step 4: Testing
Prompt: "Generate 5 test cases including edge cases.
Show the execution trace for one example."

# Step 5: Optimization
Prompt: "Can we reduce space complexity further?
Are there any micro-optimizations?"

# Step 6: Similar Problems
Prompt: "What are 3 similar problems that use the same two-pointer technique?"

Remember: LLMs are powerful tools for learning and problem-solving, but they work best when you:

  • Provide clear, detailed prompts
  • Think critically about responses
  • Test thoroughly
  • Understand rather than blindly copy
  • Use them as learning aids, not crutches

Projects & Practice Ideas

Hands-on projects to practice pre-training, post-training, and RLHF techniques

Pre-training Projects

Learn how to train language models from scratch or continue pre-training existing models.

Beginner: Train a Character-Level Language Model

Goal: Understand the fundamentals of autoregressive language modeling

Technical Requirements:

  • Dataset: Small text corpus (Shakespeare, Wikipedia subset, ~1-10MB)
  • Model: Small transformer (2-4 layers, 128-256 dim, ~1M parameters)
  • Hardware: CPU or single GPU (Google Colab free tier)
  • Framework: PyTorch or TensorFlow, Hugging Face Transformers
  • Time: 1-2 weeks

Key Learning Objectives:

  • Implement tokenization (character or BPE)
  • Build transformer architecture from scratch or adapt existing
  • Implement cross-entropy loss and next-token prediction
  • Track training metrics (loss, perplexity)
  • Generate text samples during training
  • Experiment with hyperparameters (learning rate, batch size, context length)

Deliverables: Trained model that generates coherent text in training domain style

Intermediate: Continue Pre-training a Small LLM

Goal: Adapt an existing pre-trained model to a specific domain

Technical Requirements:

  • Base Model: GPT-2 (124M), DistilGPT-2, or similar (~100M-350M params)
  • Dataset: Domain-specific corpus (medical papers, code, legal docs, 100MB-1GB)
  • Hardware: Single GPU (V100, A100, or T4 with gradient accumulation)
  • Framework: Hugging Face Transformers, DeepSpeed for optimization
  • Time: 2-4 weeks

Key Learning Objectives:

  • Load and freeze selective layers (optional)
  • Prepare domain-specific training data pipeline
  • Implement learning rate warmup and decay schedules
  • Use gradient accumulation for larger effective batch sizes
  • Evaluate on domain-specific benchmarks
  • Compare performance before/after domain adaptation
  • Implement checkpointing and resume training

Deliverables: Specialized model with measurably better domain performance

Advanced: Train a Small LLM from Scratch with Custom Tokenizer

Goal: Complete pre-training pipeline including data preprocessing and tokenizer training

Technical Requirements:

  • Dataset: Large diverse corpus (The Pile subset, C4, Common Crawl, 10GB+)
  • Model: Custom architecture (6-12 layers, 512-1024 dim, 100M-1B params)
  • Hardware: Multi-GPU setup or TPU (4-8 GPUs, distributed training)
  • Framework: PyTorch + DeepSpeed/FSDP, or JAX + Mesh TensorFlow
  • Time: 2-3 months

Key Learning Objectives:

  • Train custom BPE/WordPiece tokenizer on your corpus
  • Implement distributed data loading and shuffling
  • Use mixed precision training (FP16/BF16)
  • Implement gradient checkpointing for memory efficiency
  • Track scaling laws (loss vs compute)
  • Implement evaluation on diverse benchmarks (LAMBADA, HellaSwag, etc.)
  • Monitor training stability and implement interventions
  • Optimize throughput (tokens/sec, MFU - Model FLOPS Utilization)

Deliverables: Fully pre-trained model with documented training run, loss curves, and benchmark results

Post-training Projects (SFT, Instruction Tuning)

Take pre-trained models and fine-tune them to follow instructions and perform specific tasks.

Beginner: Fine-tune for Single Task (Classification/Summarization)

Goal: Understand supervised fine-tuning on a specific downstream task

Technical Requirements:

  • Base Model: FLAN-T5-small, GPT-2, or DistilGPT-2
  • Dataset: Single-task dataset (IMDB sentiment, CNN/DailyMail summarization, ~10K examples)
  • Hardware: Single GPU or Google Colab
  • Framework: Hugging Face Transformers + PEFT (LoRA)
  • Time: 1-2 weeks

Key Learning Objectives:

  • Format data as input-output pairs
  • Implement LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
  • Use appropriate loss masking (only compute loss on outputs; see the sketch after this project description)
  • Evaluate with task-specific metrics (accuracy, ROUGE, BLEU)
  • Compare full fine-tuning vs LoRA
  • Prevent overfitting with early stopping

Deliverables: Fine-tuned model that performs well on held-out test set
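
A minimal sketch of the loss-masking idea mentioned above: prompt tokens are given the label -100, which PyTorch's cross-entropy (and Hugging Face's causal-LM loss) ignores, so the loss is computed only on response tokens. The function and the example token IDs are illustrative.

def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response; mask the prompt so loss covers only the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Example with illustrative token IDs:
# prompt_ids   = [101, 2054, 2003]
# response_ids = [3437, 1012]
# labels       = [-100, -100, -100, 3437, 1012]
# → nn.CrossEntropyLoss(ignore_index=-100) skips the prompt positions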

Intermediate: Multi-Task Instruction Tuning

Goal: Create a model that can follow diverse instructions across multiple task types

Technical Requirements:

  • Base Model: LLaMA-2-7B, Mistral-7B, or Falcon-7B
  • Dataset: Multi-task instruction dataset (FLAN, Dolly, OpenOrca, ~50K-500K examples)
  • Hardware: Single A100/H100 or multiple smaller GPUs
  • Framework: Axolotl, LLaMA-Factory, or custom training loop
  • Time: 3-6 weeks

Key Learning Objectives:

  • Curate and mix instruction datasets from multiple sources
  • Implement proper instruction templates (chat format)
  • Use QLoRA for memory-efficient training of larger models
  • Balance dataset mixing ratios
  • Evaluate across multiple tasks (reasoning, knowledge, coding)
  • Implement generation sampling strategies (temperature, top-p, top-k)
  • Measure instruction-following capability qualitatively

Deliverables: Instruction-tuned model that generalizes to unseen task types

Advanced: Build a Domain Expert with Specialized SFT

Goal: Create a highly capable domain-specific assistant (e.g., medical, legal, coding)

Technical Requirements:

  • Base Model: LLaMA-2-13B/70B, Mistral-7B, or Code Llama
  • Dataset: Custom domain dataset + synthetic data generation (50K-1M examples)
  • Hardware: Multi-GPU setup (4-8 A100s) or cloud compute
  • Framework: DeepSpeed, FSDP, or Megatron-LM for large models
  • Time: 2-3 months

Key Learning Objectives:

  • Use strong models (GPT-4) to generate synthetic training data
  • Implement data quality filtering and deduplication
  • Create domain-specific evaluation benchmarks
  • Fine-tune with curriculum learning (easy → hard examples)
  • Implement safety guardrails and refusal behavior
  • Test for hallucinations and factual accuracy
  • Compare against general-purpose baselines

Deliverables: Production-ready domain expert with comprehensive evaluation report

RLHF (Reinforcement Learning from Human Feedback) Projects

Implement the complete RLHF pipeline to align models with human preferences.

Beginner: Implement Reward Modeling

Goal: Build a reward model that predicts human preferences between responses

Technical Requirements:

  • Base Model: Same architecture as your SFT model (smaller variant)
  • Dataset: Comparison dataset (Anthropic HH-RLHF, OpenAssistant, ~10K-50K pairs)
  • Hardware: Single GPU (T4/V100)
  • Framework: Hugging Face Transformers + TRL (Transformer Reinforcement Learning)
  • Time: 2-3 weeks

Key Learning Objectives:

  • Understand pairwise comparison loss (Bradley-Terry model; see the sketch after this project description)
  • Modify model head to output scalar reward score
  • Implement training loop with paired examples
  • Evaluate reward model accuracy on preference prediction
  • Analyze what the reward model has learned (correlations with length, style, etc.)
  • Test on out-of-distribution prompts

Deliverables: Trained reward model that accurately predicts human preferences (>60% accuracy)
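
A minimal sketch of the pairwise comparison loss referenced above (Bradley-Terry style, as used in InstructGPT-style reward modeling): the reward model scores the chosen and rejected responses, and the loss pushes the chosen score above the rejected one. The reward_model call is an illustrative stand-in for a model with a scalar head.

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """
    Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected).
    Both inputs are [batch] tensors of scalar rewards.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# chosen_rewards   = reward_model(chosen_input_ids)    # hypothetical scalar-head model
# rejected_rewards = reward_model(rejected_input_ids)
# loss = reward_model_loss(chosen_rewards, rejected_rewards)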

Intermediate: Full RLHF Pipeline (PPO)

Goal: Implement complete RLHF with Proximal Policy Optimization

Technical Requirements:

  • Base Model: Instruction-tuned model (GPT-2, LLaMA-7B)
  • Reward Model: From previous step or pre-trained
  • Hardware: 2-4 GPUs (need separate GPUs for policy, value, and reward models)
  • Framework: TRL (trl library), or DeepSpeed-Chat
  • Time: 4-8 weeks

Key Learning Objectives:

  • Implement PPO algorithm for language models
  • Manage four models: policy (actor), value (critic), reward, reference
  • Compute advantages using Generalized Advantage Estimation (GAE)
  • Add KL divergence penalty to prevent drift from reference model
  • Sample generations and compute rewards in batches
  • Monitor training stability (KL divergence, reward statistics, policy entropy)
  • Evaluate qualitative improvements in response quality

Deliverables: RLHF-trained model with measurably improved preference ratings vs SFT baseline

Advanced: Constitutional AI & DPO (Direct Preference Optimization)

Goal: Implement modern RLHF alternatives without explicit reward models

Technical Requirements:

  • Base Model: Large instruction-tuned model (LLaMA-2-13B+, Mistral-7B)
  • Dataset: Generate synthetic preferences using AI feedback (self-critique)
  • Hardware: 4-8 GPUs for larger models
  • Framework: Custom implementation or TRL with DPO support
  • Time: 2-3 months

Key Learning Objectives:

  • Implement Constitutional AI: use model to critique and revise its own outputs
  • Generate preference pairs through self-evaluation against principles
  • Implement DPO algorithm (simpler than PPO, no reward model needed)
  • Compare DPO vs PPO on same dataset
  • Design constitution (set of principles for model behavior)
  • Measure alignment on safety benchmarks (TruthfulQA, BBQ bias)
  • Test robustness to adversarial prompts

Deliverables: Aligned model using modern techniques with comprehensive safety evaluation

End-to-End Application Projects

Build complete systems that combine pre-training, post-training, and RLHF concepts.

Project Idea: Personal Code Assistant

Pipeline:

  • Pre-training: Continue pre-train Code Llama on your company's codebase
  • SFT: Fine-tune on code completion and debugging examples
  • RLHF: Use developer feedback (thumbs up/down) to refine suggestions

Technologies: VS Code extension, local model serving (vLLM), feedback collection UI

Project Idea: Custom Chatbot with Domain Expertise

Pipeline:

  • Pre-training: Continue training on domain-specific documents (medical/legal/financial)
  • SFT: Train on QA pairs and instruction examples from domain
  • RLHF: Collect expert feedback on response quality and factuality

Technologies: RAG (Retrieval-Augmented Generation), vector database (Pinecone/Weaviate), web interface

Project Idea: Creative Writing Assistant

Pipeline:

  • Pre-training: Train on large corpus of books, stories, creative writing
  • SFT: Fine-tune on story completion, character development prompts
  • RLHF: Use writer feedback to align on style, creativity, coherence

Technologies: React web app, streaming responses, style transfer controls

Learning Resources for Projects

Essential Libraries & Tools

Tool Purpose Best For
Hugging Face Transformers Pre-trained models, tokenizers, training utilities All projects, especially SFT
TRL (Transformer RL) RLHF, PPO, DPO implementations RLHF projects
PEFT (LoRA, QLoRA) Parameter-efficient fine-tuning Limited compute scenarios
DeepSpeed Distributed training, memory optimization Large model training
Axolotl Simplified training configs for LLMs Quick experiments
vLLM / TGI Fast inference serving Production deployment
Weights & Biases Experiment tracking, visualization All projects

Datasets for Practice

  • Pre-training: The Pile, C4, RedPajama, mC4 (multilingual)
  • Instruction Tuning: FLAN, Dolly-15k, OpenOrca, WizardLM, Alpaca
  • Coding: The Stack, CodeSearchNet, APPS, HumanEval
  • RLHF: Anthropic HH-RLHF, OpenAssistant Conversations, SHP (StackOverflow preferences)
  • Alignment: TruthfulQA, BBQ (bias), RealToxicityPrompts

Compute Requirements

Model Size Training (Full FT) Training (LoRA/QLoRA) Inference
1M params CPU CPU CPU
100M-350M 1x T4/V100 (16GB) 1x T4 (16GB) CPU or small GPU
1B-3B 1-2x A100 (40GB) 1x V100/T4 1x T4 or CPU (slow)
7B 4-8x A100 (80GB) 1x A100 (40GB) 1x A100 or 2x T4
13B-30B 8-16x A100 2x A100 (80GB) 2-4x A100
70B+ 16-32x A100/H100 4-8x A100 8x A100 or quantization
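
For the LoRA/QLoRA column above, a typical setup with Hugging Face Transformers + PEFT + bitsandbytes looks roughly like this (a sketch; the base model name is an example, and argument names should be checked against each library's current documentation, since they change between versions).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # example base model

# Load the base model in 4-bit (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters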

Getting Started Checklist

  1. Choose a project that matches your compute budget and timeline
  2. Set up development environment (PyTorch, Transformers, Weights & Biases)
  3. Start with smallest viable dataset to test pipeline end-to-end
  4. Establish baseline metrics before training
  5. Implement logging and checkpointing from day one
  6. Run small-scale experiments first (hyperparameter search on tiny model)
  7. Scale up only after validating pipeline works
  8. Document everything (model cards, training logs, evaluation results)
  9. Share your work (blog post, GitHub repo, Hugging Face model)