
Introduction to Large Language Models

Understanding the fundamentals of LLMs and their revolutionary impact on AI

What are Large Language Models?

Large Language Models (LLMs) are neural networks trained on massive text datasets to predict, understand, and generate human language. They represent one of the most significant breakthroughs in artificial intelligence, powering applications from ChatGPT and Claude to code completion tools and research assistants.

Unlike traditional software that follows explicit rules, LLMs learn statistical patterns from data. By training on trillions of words from books, websites, scientific papers, and code repositories, they develop an implicit understanding of language structure, world knowledge, and reasoning patterns.

Key Characteristics

  • Scale: From 7 billion to over 1 trillion parameters (weights) that encode knowledge and language patterns
  • Training Data: Hundreds of billions to trillions of tokens from diverse text sources
  • Architecture: Primarily based on the Transformer architecture (2017) with attention mechanisms
  • Training Cost: Millions to tens of millions of dollars in computational resources
  • Capabilities: Text generation, question answering, translation, summarization, code generation, mathematical reasoning, and more
  • Context Window: Ability to process from 2K to 200K+ tokens (approximately 1.5K to 150K+ words) at once

The Evolution of Language Models

Language modeling has progressed through several distinct eras:

1950s-1990s: Statistical Language Models

N-gram models predicted each word from the previous N-1 words. Simple but limited by memory and context constraints. Used in early speech recognition and machine translation.

2003-2013: Neural Language Models

Introduction of neural networks for language modeling. Bengio et al. (2003) showed neural networks could learn word embeddings. Word2Vec (2013) popularized dense word representations.

2013-2017: Sequence Models

LSTMs and GRUs enabled longer context understanding. Seq2Seq models (2014) revolutionized machine translation. But training was slow and sequential.

2017: The Transformer Revolution

"Attention Is All You Need" paper introduced the Transformer architecture. Parallel processing and attention mechanisms solved the long-range dependency problem.

2018-2019: Pre-training Era

BERT (110M-340M params), GPT-2 (1.5B params) demonstrated that pre-training on massive unlabeled data, then fine-tuning on specific tasks, dramatically improved performance.

2020-2022: Scaling Up

GPT-3 (175B params) showed emergent abilities at scale: few-shot learning, in-context learning. Models became general-purpose tools. This era popularized the idea that scale alone unlocks new capabilities.

2022-Present: Alignment & Specialization

InstructGPT, ChatGPT showed that aligning models with human preferences (RLHF) makes them far more useful. Multimodal models (GPT-4, Gemini) integrate vision. Open-source models (LLaMA, Mistral) democratize access.

Scaling Laws: Bigger is Better

Research by OpenAI, DeepMind, and others revealed predictable scaling laws: model performance improves as a power law with respect to three factors:

# Scaling Law (Kaplan et al., 2020)
# Loss scales as a power law with compute budget

L(C) = (C_c / C)^α_c

Where:
- L(C) is the cross-entropy loss
- C is the compute budget (FLOPs)
- C_c and α_c are constants
- Typical α_c ≈ 0.05-0.10

Key findings:
1. Model size (N)  - More parameters → better performance
2. Dataset size (D) - More data → better performance
3. Compute (C)      - More training → better performance

Optimal allocation (Kaplan et al.):
N_opt ∝ C^0.73,  D_opt ∝ C^0.27
(Scale model size faster than dataset size)

With α_c in the 0.05-0.10 range, this means doubling the compute budget reduces loss by only a few percent (roughly 3-7%), but it does so predictably. This predictability has driven the race to scale up models.
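As a quick sanity check on that figure, here is the arithmetic implied by the power law above (a minimal sketch; the exponent values are just the typical α_c range quoted earlier):

for alpha_c in (0.05, 0.10):
    ratio = 2 ** (-alpha_c)  # L(2C) / L(C) under L(C) ∝ C^(-alpha_c)
    print(f"alpha_c={alpha_c}: doubling compute cuts loss by ~{(1 - ratio) * 100:.1f}%")
# alpha_c=0.05 → ~3.4% per doubling, alpha_c=0.10 → ~6.7% per doubling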

Try It: Understanding Scale

GPT-3 has 175 billion parameters. If each parameter were a grain of rice (roughly 25 mg each), you'd need about 4,400 metric tons of rice - well over a hundred shipping containers' worth.

How Do LLMs Work?

At their core, LLMs are trained with a simple objective: predict the next token given all previous tokens. This is called autoregressive language modeling or causal language modeling.

# Training Objective (simplified)

Given text: "The quick brown fox jumps over the lazy dog"

Training samples:
Input: "The"             → Predict: "quick"
Input: "The quick"       → Predict: "brown"
Input: "The quick brown" → Predict: "fox"
...and so on

The model learns: P(word_t | word_1, word_2, ..., word_{t-1})

After training on trillions of such examples, the model learns
language patterns, facts, reasoning, and more.

This simple task, when performed at massive scale with billions of parameters, leads to emergent abilities - capabilities that appear suddenly at certain scale thresholds.

What LLMs Learn

Through next-token prediction on diverse data, LLMs develop:

Grammar & Syntax

Correct sentence structure, verb conjugation, agreement

World Knowledge

Facts, entities, events, relationships, common sense

Reasoning

Logical inference, mathematical reasoning, analogies

Context Understanding

Discourse, pronouns, implicit meaning, pragmatics

Task Patterns

Translation, summarization, question answering formats

Code & Formulas

Programming syntax, algorithms, mathematical notation

Emergent Abilities at Scale

Larger models develop abilities that smaller ones don't have. These capabilities "emerge" suddenly at certain parameter counts:

Ability | Emerges At | Description
Few-shot Learning | ~13B params | Learn from just a few examples in the prompt
In-context Learning | ~13B params | Adapt behavior based on context without fine-tuning
Chain-of-Thought | ~100B params | Break down complex reasoning into steps
Multi-step Reasoning | ~100B params | Solve problems requiring multiple inference steps
Code Generation | ~10B params | Write functional code from natural language

Key Limitations

Despite their impressive capabilities, LLMs have fundamental limitations:

  • No True Understanding: LLMs are statistical pattern matchers, not conscious entities. They don't "understand" in a human sense.
  • Hallucinations: Can generate plausible-sounding but factually incorrect information with high confidence.
  • Knowledge Cutoff: Only know what was in their training data, with a specific cutoff date.
  • Reasoning Limitations: Struggle with tasks requiring precise logic, counting, or mathematical operations.
  • Context Length: Limited by maximum context window (though improving rapidly).
  • Bias & Toxicity: Can reflect and amplify biases present in training data.
  • No Memory: Each conversation starts fresh (unless explicitly provided context).

The Transformer Architecture

Deep dive into the revolutionary architecture that changed NLP

Understanding Transformers

The Transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., revolutionized natural language processing by replacing recurrence with pure attention mechanisms. It's the foundation of all modern LLMs including GPT, BERT, Claude, and LLaMA.

Key Innovation: Parallelization Through Attention

Previous architectures (RNNs, LSTMs) processed sequences sequentially: word 1 → word 2 → word 3. This was slow and struggled with long-range dependencies.

Transformers process entire sequences in parallel using attention, enabling:

  • 100x faster training on modern GPUs
  • Better long-range dependency modeling
  • Scalability to billions of parameters

Paper: arxiv.org/abs/1706.03762

Three Types of Transformer Architectures

Type | Structure | Use Cases | Examples
Encoder-Only | Bidirectional attention (sees entire input) | Classification, NER, Q&A | BERT, RoBERTa, DeBERTa
Decoder-Only | Causal attention (only sees past) | Text generation, chat, code | GPT-3/4, LLaMA, Claude
Encoder-Decoder | Encoder (bidirectional) + Decoder (causal) | Translation, summarization | T5, BART, mT5

Modern LLMs are almost all decoder-only (GPT-style) because they're more flexible and scale better.

Decoder-Only Transformer: Component Breakdown

Let's examine a GPT-style decoder-only model, the most common architecture for modern LLMs.

1. Token + Positional Embeddings

Token Embedding: Maps each token ID to a dense vector

# Vocabulary size: 50,000 tokens, embedding dim: 512
token_embedding = nn.Embedding(50000, 512)

# Input: "The cat sat" → token IDs: [2, 145, 892]
token_ids = torch.tensor([[2, 145, 892]])
token_embeds = token_embedding(token_ids)  # [1, 3, 512]

Positional Encoding: Adds position information (attention is permutation-invariant)

# Two approaches:

# 1. Learned Positional Embeddings (GPT, BERT)
pos_embedding = nn.Embedding(max_seq_len, d_model)
pos_ids = torch.arange(seq_len)
pos_embeds = pos_embedding(pos_ids)  # [seq_len, d_model]

# 2. Sinusoidal Positional Encoding (Original Transformer)
def sinusoidal_encoding(seq_len, d_model):
    """Fixed positional encoding using sin/cos functions"""
    position = torch.arange(seq_len).unsqueeze(1)  # [seq_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    encoding = torch.zeros(seq_len, d_model)
    encoding[:, 0::2] = torch.sin(position * div_term)  # even indices
    encoding[:, 1::2] = torch.cos(position * div_term)  # odd indices
    return encoding

# Combine token + position embeddings
final_embedding = token_embeds + pos_embeds  # [batch, seq_len, d_model]

Why positional encoding matters: Without it, "cat sat" and "sat cat" would be identical to the model. Position info is crucial for understanding order, syntax, and semantics.

2. Transformer Block (Repeated N times)

Each block contains two sub-layers: multi-head attention + feed-forward network, each with residual connections and layer normalization.

class TransformerBlock(nn.Module):
    """Single Transformer decoder block"""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-layer 1: Multi-head self-attention
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # Expand: 512 → 2048
            nn.GELU(),                  # Activation (ReLU in original)
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)    # Project back: 2048 → 512
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: Self-attention with residual connection
        # Pre-norm (modern): norm before attention
        attn_out, _ = self.attention(
            self.norm1(x), self.norm1(x), self.norm1(x), mask
        )
        x = x + self.dropout1(attn_out)  # Residual connection

        # Sub-layer 2: FFN with residual connection
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout2(ffn_out)   # Residual connection

        return x

Key components:

  • Residual connections (x + f(x)): Enable gradient flow in deep networks, prevent degradation
  • Layer normalization: Stabilize training, allow higher learning rates
  • FFN: Processes each position independently, adds non-linearity and capacity
  • Pre-norm vs Post-norm: Modern models use pre-norm (norm before sublayer) for training stability

3. Output Head (Language Modeling)

Final layer projects hidden states to vocabulary logits for next-token prediction.

# After N transformer blocks, we have: [batch, seq_len, d_model]

# Project to vocabulary size
lm_head = nn.Linear(d_model, vocab_size)   # [512 → 50000]
logits = lm_head(final_hidden_states)      # [batch, seq_len, 50000]

# Compute probability distribution over tokens
probs = F.softmax(logits, dim=-1)          # [batch, seq_len, 50000]

# During training: compute cross-entropy loss with target tokens
# During inference: sample from distribution (greedy, top-k, nucleus)

Weight tying: Often, the output projection weights are tied to input embedding weights (same parameters). This reduces parameters and improves performance.
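A minimal sketch of how weight tying is typically wired up in PyTorch (the shapes follow the GPT-2-small numbers used elsewhere on this page; the variable names are illustrative):

import torch.nn as nn

vocab_size, d_model = 50257, 768
token_embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie weights: the output projection reuses the embedding matrix
lm_head.weight = token_embedding.weight

# One shared [vocab_size, d_model] matrix instead of two separate ones
assert lm_head.weight.data_ptr() == token_embedding.weight.data_ptr()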

Complete Decoder-Only Transformer Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class GPTModel(nn.Module):
    """
    Decoder-only Transformer (GPT-style)

    Args:
        vocab_size: Size of vocabulary
        d_model: Model dimension (embedding size)
        num_layers: Number of transformer blocks
        num_heads: Number of attention heads
        d_ff: Feed-forward hidden dimension (typically 4*d_model)
        max_seq_len: Maximum sequence length
        dropout: Dropout probability
    """
    def __init__(
        self,
        vocab_size=50257,
        d_model=768,
        num_layers=12,
        num_heads=12,
        d_ff=3072,
        max_seq_len=1024,
        dropout=0.1
    ):
        super().__init__()
        # Input embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)

        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Final layer norm
        self.ln_f = nn.LayerNorm(d_model)

        # Output head (language modeling)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying: share weights between embedding and output
        self.lm_head.weight = self.token_embedding.weight

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """Initialize weights with normal distribution"""
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, input_ids, targets=None):
        """
        Args:
            input_ids: [batch, seq_len] token indices
            targets: [batch, seq_len] target tokens (for training)
        Returns:
            logits: [batch, seq_len, vocab_size]
            loss: scalar (if targets provided)
        """
        batch_size, seq_len = input_ids.shape

        # Create position IDs
        pos_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)

        # Embed tokens and positions
        token_embeds = self.token_embedding(input_ids)  # [batch, seq, d_model]
        pos_embeds = self.pos_embedding(pos_ids)        # [1, seq, d_model]
        x = self.dropout(token_embeds + pos_embeds)

        # Create causal mask (lower triangular)
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
        mask = mask.view(1, 1, seq_len, seq_len)  # [1, 1, seq, seq]

        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask)

        # Final layer norm
        x = self.ln_f(x)

        # Project to vocabulary
        logits = self.lm_head(x)  # [batch, seq, vocab_size]

        # Compute loss if targets provided
        loss = None
        if targets is not None:
            # Flatten for cross-entropy
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),  # [batch*seq, vocab]
                targets.view(-1),                  # [batch*seq]
                ignore_index=-1                    # Ignore padding tokens
            )

        return logits, loss


# Usage example
model = GPTModel(
    vocab_size=50257,   # GPT-2 tokenizer
    d_model=768,        # GPT-2 small
    num_layers=12,
    num_heads=12,
    d_ff=3072,
    max_seq_len=1024
)

print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
# Output: ~117M parameters (GPT-2 small)

# Forward pass
input_ids = torch.randint(0, 50257, (4, 128))  # Batch of 4, seq len 128
logits, loss = model(input_ids, targets=input_ids)
print(f"Logits shape: {logits.shape}")  # [4, 128, 50257]

Model Sizes and Configurations

GPT-2 Variants

Model | Layers | d_model | Heads | Parameters
GPT-2 Small | 12 | 768 | 12 | 117M
GPT-2 Medium | 24 | 1024 | 16 | 345M
GPT-2 Large | 36 | 1280 | 20 | 774M
GPT-2 XL | 48 | 1600 | 25 | 1.5B

Modern LLMs (2023-2024)

Model | Layers | d_model | Heads | Parameters | Context
LLaMA-2 7B | 32 | 4096 | 32 | 7B | 4K
LLaMA-2 13B | 40 | 5120 | 40 | 13B | 4K
LLaMA-2 70B | 80 | 8192 | 64 | 70B | 4K
GPT-3 | 96 | 12288 | 96 | 175B | 2K-4K
GPT-4 (rumored) | 120+ | ? | ? | ~1.7T (MoE) | 8K / 32K / 128K

Note: GPT-4's architecture details (parameter count, MoE structure) are unconfirmed rumors. OpenAI has not publicly disclosed these specifications.

Scaling pattern: Larger models generally have more layers, wider hidden dimensions, and more attention heads. The ratio d_model/num_heads is typically kept constant (64-128 per head).
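A quick check of that ratio using the configurations from the tables above (all values are taken directly from those rows):

configs = {
    "GPT-2 Small": (768, 12),
    "GPT-2 XL":    (1600, 25),
    "LLaMA-2 7B":  (4096, 32),
    "LLaMA-2 70B": (8192, 64),
    "GPT-3":       (12288, 96),
}
for name, (d_model, num_heads) in configs.items():
    print(f"{name}: d_model / num_heads = {d_model // num_heads}")
# Per-head dimension stays at 64 or 128 across these models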

Key Architectural Innovations Over Time

Innovation | Description | Used In
Pre-LayerNorm | LayerNorm before sublayers (not after) | GPT-2, modern LLMs
GELU Activation | Smooth activation (instead of ReLU) | BERT, GPT-2+
RoPE (Rotary Position Embedding) | Encodes relative positions via rotation | LLaMA, GPT-NeoX
ALiBi (Attention with Linear Biases) | Adds linear bias to attention scores | BLOOM, MPT
SwiGLU | Gated FFN activation | PaLM, LLaMA
Grouped Query Attention (GQA) | Shares K/V across groups of query heads | LLaMA-2 70B, Mistral
RMSNorm | Simplified LayerNorm (no bias/mean) | LLaMA, T5

Key Papers and Resources

Foundational Papers

Architectural Improvements

Interactive Resources

Recommended Learning Path

  1. Read "The Illustrated Transformer" for visual understanding
  2. Study original "Attention Is All You Need" paper (focus on architecture)
  3. Work through "The Annotated Transformer" tutorial
  4. Implement a tiny Transformer (2-4 layers) from scratch in PyTorch
  5. Train on a small dataset (character-level language modeling)
  6. Study nanoGPT to see production-quality implementation
  7. Experiment with architectural variations (RoPE, SwiGLU, etc.)

The Attention Mechanism

Deep dive into the core innovation that powers Transformers

What is Attention?

Attention is a mechanism that allows models to dynamically focus on relevant parts of the input when processing each position. Unlike traditional models that process sequences word-by-word, attention allows the model to look at the entire sequence at once and decide what's important.

Intuition: When you read "The animal didn't cross the street because it was too tired", you know "it" refers to "animal" not "street". Attention allows models to make these connections by computing relevance scores between all word pairs.

Key Innovation from "Attention Is All You Need" (Vaswani et al., 2017)

The original Transformer paper introduced Scaled Dot-Product Attention, which replaced recurrence entirely with a parallelizable attention mechanism. This breakthrough enabled training on much longer sequences and led to the modern LLM revolution.

Paper: arxiv.org/abs/1706.03762

Scaled Dot-Product Attention: The Mathematics

The Complete Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
- Q (Query): [seq_len × d_k]   "What am I looking for?"
- K (Key):   [seq_len × d_k]   "What do I contain?"
- V (Value): [seq_len × d_v]   "What information do I have?"
- d_k: dimension of queries/keys (typically 64)
- √d_k: scaling factor (prevents vanishing gradients)

Step-by-Step Breakdown with Concrete Example

Let's walk through attention for the sentence: "The cat sat" (3 tokens)

Step 1: Create Q, K, V from Input Embeddings

# Input: word embeddings (assume d_model = 512)
X = [x_The, x_cat, x_sat]  # shape: [3, 512]

# Linear projections (learned weight matrices)
Q = X @ W_Q  # [3, 512] @ [512, 64] = [3, 64]
K = X @ W_K  # [3, 512] @ [512, 64] = [3, 64]
V = X @ W_V  # [3, 512] @ [512, 64] = [3, 64]

# Now each token has 64-dim query, key, and value vectors
# Example values (simplified):
Q[0] = [0.2, 0.5, ..., 0.1]  # Query for "The"
K[1] = [0.8, 0.3, ..., 0.7]  # Key for "cat"
V[2] = [0.1, 0.9, ..., 0.4]  # Value for "sat"

Step 2: Compute Attention Scores (QK^T)

# Dot product between queries and keys
# Each query attends to all keys
scores = Q @ K.T  # [3, 64] @ [64, 3] = [3, 3]

# Resulting score matrix (before scaling):
#           The_k   cat_k   sat_k
# The_q  [  45.2    12.3     8.7 ]
# cat_q  [  10.1    52.8    31.4 ]
# sat_q  [   5.2    28.9    49.1 ]

# Interpretation:
# - "cat" (row 2) attends strongly to itself (52.8) and "sat" (31.4)
# - "The" (row 1) attends mostly to itself (45.2)

Step 3: Scale by √d_k

import math

d_k = 64
scaled_scores = scores / math.sqrt(d_k)  # Division by √64 = 8

# Scaled scores (prevents saturation in softmax):
#           The_k   cat_k   sat_k
# The_q  [  5.65    1.54    1.09 ]
# cat_q  [  1.26    6.60    3.93 ]
# sat_q  [  0.65    3.61    6.14 ]

Why scale? Without scaling, large dot products → large softmax inputs → near-zero gradients. Scaling by √d_k keeps values in a reasonable range.

Step 4: Apply Softmax (Get Attention Weights)

import torch.nn.functional as F

attention_weights = F.softmax(scaled_scores, dim=-1)
# Softmax over each row (each query attends to all keys)

# Attention weights (probabilities sum to 1 per row):
#           The_k   cat_k   sat_k
# The_q  [  0.823   0.131   0.046 ]  ← "The" mostly attends to itself
# cat_q  [  0.041   0.752   0.207 ]  ← "cat" attends to itself and "sat"
# sat_q  [  0.018   0.337   0.645 ]  ← "sat" attends to itself and "cat"

# These weights are interpretable! Higher = more relevant

Step 5: Weighted Sum of Values (Get Output)

output = attention_weights @ V  # [3, 3] @ [3, 64] = [3, 64]

# For "cat" (row 2):
output[1] = 0.041 * V[0] + 0.752 * V[1] + 0.207 * V[2]
#           ^small weight   ^large weight   ^medium weight
#           from "The"      from "cat"      from "sat"

# The output is a context-aware representation that mixes
# information from all tokens, weighted by relevance

Key Insight: Attention is a differentiable lookup mechanism. Instead of hard-indexing (like a dictionary), we do a soft weighted average where the weights are learned based on content similarity (Q·K).

Complete PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class ScaledDotProductAttention(nn.Module):
    """
    Scaled Dot-Product Attention from 'Attention Is All You Need'

    Args:
        d_k: dimension of keys/queries (typically 64)
        dropout: dropout probability (typically 0.1)
    """
    def __init__(self, d_k, dropout=0.1):
        super().__init__()
        self.d_k = d_k
        self.dropout = nn.Dropout(dropout)

    def forward(self, Q, K, V, mask=None):
        """
        Args:
            Q: Queries [batch, seq_len, d_k]
            K: Keys    [batch, seq_len, d_k]
            V: Values  [batch, seq_len, d_v]
            mask: Optional mask [batch, seq_len, seq_len]
        Returns:
            output: [batch, seq_len, d_v]
            attention_weights: [batch, seq_len, seq_len]
        """
        # Step 1: Compute Q·K^T
        scores = torch.matmul(Q, K.transpose(-2, -1))  # [batch, seq, seq]

        # Step 2: Scale by √d_k
        scores = scores / math.sqrt(self.d_k)

        # Step 3: Apply mask (for causal/padding masks)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Step 4: Softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)  # [batch, seq, seq]
        attention_weights = self.dropout(attention_weights)

        # Step 5: Weighted sum of values
        output = torch.matmul(attention_weights, V)  # [batch, seq, d_v]

        return output, attention_weights


# Usage example
batch_size, seq_len, d_model = 32, 50, 512
d_k = 64

# Create random Q, K, V (in practice, these come from linear projections)
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)

attention = ScaledDotProductAttention(d_k)
output, weights = attention(Q, K, V)

print(f"Output shape: {output.shape}")              # [32, 50, 64]
print(f"Attention weights shape: {weights.shape}")  # [32, 50, 50]

Multi-Head Attention: Learning Multiple Relationships

Single attention head can only learn one type of relationship. Multi-Head Attention runs multiple attention mechanisms in parallel, each learning different patterns.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Parameters:
- h: number of heads (typically 8, 12, or 16)
- d_model: model dimension (512 for base, 1024 for large)
- d_k = d_v = d_model / h (e.g., 512/8 = 64 per head)

What Different Heads Learn

Head | Pattern Learned | Example
Head 1 | Syntactic relationships | Verbs attend to subjects
Head 2 | Positional (next word) | Each word attends to the next word
Head 3 | Rare words | Technical terms attend to definitions
Head 4 | Delimiter tokens | Commas, periods, quotes
Heads 5-8 | Semantic relationships | Pronouns to antecedents, entities to attributes

Multi-Head Attention Implementation

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention from 'Attention Is All You Need'

    Splits d_model into h heads, each of dimension d_k = d_model/h
    """
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 if d_model=512, h=8

        # Linear projections for Q, K, V (for all heads at once)
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)

        # Output projection
        self.W_O = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x):
        """Split last dimension into (num_heads, d_k)"""
        batch_size, seq_len, d_model = x.size()
        # Reshape and transpose to [batch, num_heads, seq_len, d_k]
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # 1. Linear projections
        Q = self.W_Q(Q)  # [batch, seq_len, d_model]
        K = self.W_K(K)
        V = self.W_V(V)

        # 2. Split into multiple heads
        Q = self.split_heads(Q)  # [batch, num_heads, seq_len, d_k]
        K = self.split_heads(K)
        V = self.split_heads(V)

        # 3. Apply scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        context = torch.matmul(attention_weights, V)  # [batch, h, seq, d_k]

        # 4. Concatenate heads
        context = context.transpose(1, 2).contiguous()        # [batch, seq, h, d_k]
        context = context.view(batch_size, -1, self.d_model)  # [batch, seq, d_model]

        # 5. Final linear projection
        output = self.W_O(context)

        return output, attention_weights


# Usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 50, 512)  # [batch, seq_len, d_model]

output, attn_weights = mha(x, x, x)  # Self-attention: Q=K=V=x
print(f"Output: {output.shape}")           # [32, 50, 512]
print(f"Attention: {attn_weights.shape}")  # [32, 8, 50, 50]

Causal (Masked) Attention for Autoregressive Models

In decoder-only models (GPT, LLaMA), we need causal masking to prevent positions from attending to future positions during training.

# Create causal mask (lower triangular matrix)
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len))

# Visualization:
#      0  1  2  3  4   (positions)
# 0  [ 1  0  0  0  0 ]  ← position 0 can only see itself
# 1  [ 1  1  0  0  0 ]  ← position 1 sees 0,1
# 2  [ 1  1  1  0  0 ]  ← position 2 sees 0,1,2
# 3  [ 1  1  1  1  0 ]  ← position 3 sees 0,1,2,3
# 4  [ 1  1  1  1  1 ]  ← position 4 sees all

# Apply during attention computation
scores = scores.masked_fill(mask == 0, -1e9)
# -1e9 becomes ~0 after softmax, effectively blocking attention

Why Causal Masking?

  • Training: When predicting token at position i, model can only use tokens 0 to i-1
  • Inference: Ensures model behavior during training matches generation (no future information)
  • Efficiency: Allows parallel training (all positions trained simultaneously, but with appropriate masking)

Complexity Analysis

Time and Space Complexity

Operation | Time Complexity | Space Complexity
Linear projections (Q, K, V) | O(n·d²) | O(d²)
Q·K^T (attention scores) | O(n²·d) | O(n²)
Softmax | O(n²) | O(n²)
Attn·V (weighted sum) | O(n²·d) | O(n²)
Total | O(n²·d + n·d²) | O(n² + d²)

Where: n = sequence length, d = model dimension

The O(n²) Bottleneck

The quadratic complexity in sequence length is the main limitation of standard attention:

  • 1K tokens: 1M attention scores to compute
  • 10K tokens: 100M attention scores (100x more!)
  • 100K tokens: 10B attention scores → OOM on most GPUs

This is why context length is expensive to scale!
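To make those numbers concrete, here is a rough back-of-the-envelope estimate of the memory needed just for one attention-score matrix in fp32 (per head and per layer; real models multiply this by the number of heads and layers):

def attn_score_memory_gb(seq_len, bytes_per_element=4):
    # One [seq_len, seq_len] score matrix in fp32
    return seq_len ** 2 * bytes_per_element / 1e9

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {attn_score_memory_gb(n):,.3f} GB per head per layer")
# 1K tokens ≈ 0.004 GB, 10K ≈ 0.4 GB, 100K ≈ 40 GB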

Efficient Attention Variants

Method | Complexity | Used In
Standard Attention | O(n²) | GPT-3, GPT-4, Claude
Flash Attention | O(n²) but memory-efficient | GPT-4, LLaMA-2, modern LLMs
Sparse Attention | O(n√n) or O(n log n) | Longformer, BigBird
Linear Attention | O(n) | Linformer, Performer
State Space Models | O(n) | Mamba, RWKV

Types of Attention Mechanisms

1. Self-Attention (Intra-Attention)

Each position attends to all positions in the same sequence. Used in encoder and decoder.

output = SelfAttention(X, X, X)  # Q=K=V=X

# Example: "The cat sat" → each word attends to all 3 words

2. Cross-Attention (Encoder-Decoder Attention)

Decoder positions attend to encoder outputs. Used in seq2seq models (translation, summarization).

# Encoder output: source language representation
encoder_output = Encoder("Hello world")  # English

# Decoder uses cross-attention to the source
decoder_output = CrossAttention(
    Q=decoder_hidden,   # "Bonjour ___" (French, being generated)
    K=encoder_output,   # Keys from "Hello world"
    V=encoder_output    # Values from "Hello world"
)

# Decoder can attend to relevant English words when generating French

3. Masked Self-Attention (Causal Attention)

Prevents attention to future positions. Used in decoder-only models (GPT).

# Training on "The cat sat on the mat"
# When predicting "sat", the model can only see "The cat"
# When predicting "mat", the model can see "The cat sat on the"

Common Pitfalls and Best Practices

Implementation Details That Matter

Issue | Solution | Why It Matters
Numerical instability in softmax | Use scaling (√d_k) and a stable softmax | Prevents vanishing/exploding gradients
Memory explosion for long sequences | Use Flash Attention or gradient checkpointing | O(n²) memory can OOM quickly
Forgetting positional information | Add positional encodings to the input | Attention is permutation-invariant
Attention collapse (all weights uniform) | Use LayerNorm, dropout, proper initialization | Model fails to learn useful patterns
Information leakage in causal models | Strictly enforce causal masking | Training/inference mismatch

Key Papers and Resources

Foundational Papers

Analysis and Interpretability

Interactive Resources

Video Tutorials

  • Andrej Karpathy - "Let's build GPT: from scratch, in code, spelled out"
  • Stanford CS224N - Lecture on Attention Mechanisms
  • 3Blue1Brown - "But what is a GPT?" (visual intuition)

Recommended Learning Path

  1. Read "The Illustrated Transformer" for visual intuition
  2. Read sections 3.2-3.3 of "Attention Is All You Need" paper
  3. Implement scaled dot-product attention from scratch in NumPy
  4. Implement multi-head attention in PyTorch
  5. Use BertViz to visualize attention patterns in real models
  6. Read "What Does BERT Look At?" to understand what attention learns
  7. Experiment with Flash Attention for efficient implementation

Pre-training: Foundation Learning

Deep dive into how LLMs learn language from massive text corpora

What is Pre-training?

Pre-training is a self-supervised learning process where models learn patterns, structures, and knowledge from vast amounts of raw text without explicit human labels. This phase teaches the model fundamental language understanding before any task-specific fine-tuning.

Key Characteristics

  • Self-supervised: Labels are created from the data itself (next token)
  • Massive scale: Trillions of tokens, petabytes of data
  • Expensive: Millions of dollars, months of GPU time
  • One-time cost: Pre-trained models can be reused and fine-tuned

The Pre-training Objective: Next-Token Prediction

Modern decoder-only LLMs use causal language modeling: predict the next token given all previous tokens.

Mathematical Formulation

Given sequence: x = [x_1, x_2, ..., x_T]

Objective: Maximize the log-likelihood

L(θ) = Σ_{t=1..T} log P(x_t | x_1, ..., x_{t-1}; θ)

Where:
- θ: model parameters
- x_t: token at position t
- P(x_t | context): probability distribution over the vocabulary

In practice, we minimize the negative log-likelihood (cross-entropy loss):

Loss = -(1/T) Σ_{t=1..T} log P(x_t | x_1, ..., x_{t-1}; θ)

Concrete Example Walkthrough

Training on: "The cat sat on the mat"

# Tokenized: [2, 145, 892, 319, 2, 2000]
# Vocabulary size: 50,000 tokens

Training examples created:
1. []                                  → predict "The" (token 2)
2. ["The"]                             → predict "cat" (token 145)
3. ["The", "cat"]                      → predict "sat" (token 892)
4. ["The", "cat", "sat"]               → predict "on" (token 319)
5. ["The", "cat", "sat", "on"]         → predict "the" (token 2)
6. ["The", "cat", "sat", "on", "the"]  → predict "mat" (token 2000)

# For each example:
# 1. Model outputs logits: [batch, seq_len, vocab_size]
# 2. Apply softmax to get probabilities
# 3. Compute cross-entropy with the target token
# 4. Backpropagate to update weights

Complete Training Loop Implementation

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

def pretrain_language_model(
    model,
    train_dataset,
    num_epochs=1,
    batch_size=32,
    learning_rate=3e-4,
    grad_accum_steps=1,
    max_grad_norm=1.0
):
    """
    Pre-training loop for a causal language model

    Args:
        model: GPT-style decoder-only transformer
        train_dataset: Dataset yielding token sequences
        num_epochs: Number of passes through the data
        batch_size: Batch size
        learning_rate: Peak learning rate (with warmup/decay)
        grad_accum_steps: Gradient accumulation for a larger effective batch
        max_grad_norm: Clip gradients to prevent explosion
    """
    # Setup
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        betas=(0.9, 0.95),   # Common for LLMs
        weight_decay=0.1
    )

    # Learning rate schedule
    total_steps = len(train_dataset) // batch_size * num_epochs
    warmup_steps = total_steps // 10  # 10% warmup

    def lr_lambda(step):
        """Warmup + cosine decay, returned as a multiplier of the peak LR"""
        if step < warmup_steps:
            # Linear warmup
            return step / max(1, warmup_steps)
        # Cosine decay
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # Training loop
    model.train()
    global_step = 0

    for epoch in range(num_epochs):
        dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        for batch_idx, batch in enumerate(dataloader):
            # batch: [batch_size, seq_len] token indices
            input_ids = batch.to(device)

            # Create targets (shifted by 1)
            # Input:  [The, cat, sat, on]
            # Target: [cat, sat, on, the]
            targets = input_ids[:, 1:]     # Shift left
            input_ids = input_ids[:, :-1]  # Remove last token

            # Forward pass
            logits, loss = model(input_ids, targets=targets)

            # Scale loss for gradient accumulation
            loss = loss / grad_accum_steps
            loss.backward()

            # Update weights every grad_accum_steps
            if (batch_idx + 1) % grad_accum_steps == 0:
                # Clip gradients
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

                # Update parameters
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

                global_step += 1

                # Logging
                if global_step % 100 == 0:
                    perplexity = torch.exp(loss * grad_accum_steps)
                    lr = scheduler.get_last_lr()[0]
                    print(f"Step {global_step} | "
                          f"Loss: {loss.item():.4f} | "
                          f"Perplexity: {perplexity:.2f} | "
                          f"LR: {lr:.6f}")

    return model


# Usage
# model = GPTModel(vocab_size=50257, d_model=768, num_layers=12)
# trained_model = pretrain_language_model(model, dataset, num_epochs=1)

Scaling Laws: How Performance Scales

Research shows that LLM performance follows predictable power laws with respect to model size, dataset size, and compute.

The Scaling Law Equations (Kaplan et al., 2020)

Loss L scales with:

1. Model Parameters (N):
   L(N) ∝ (N_c / N)^α_N     where N_c ≈ 8.8 × 10^13, α_N ≈ 0.076

2. Dataset Size (D):
   L(D) ∝ (D_c / D)^α_D     where D_c ≈ 5.4 × 10^13, α_D ≈ 0.095

3. Compute Budget (C):
   L(C) ∝ (C_c / C)^α_C     where C_c ≈ 3.1 × 10^8, α_C ≈ 0.050

Key insight: Performance improves smoothly and predictably!
Doubling compute yields a small but consistent loss reduction (a few percent per doubling).

Important Note: These are the original Kaplan et al. (2020) scaling laws. The Chinchilla paper (Hoffmann et al., 2022) later revised these findings, showing that training should balance model size and data more equally than originally thought.

Optimal Allocation (Chinchilla Scaling)

Chinchilla paper finding (Hoffmann et al., 2022): For optimal performance, model size and training tokens should scale equally.

Chinchilla Optimal Rule:

For compute budget C:
- Model parameters: N_opt ∝ C^0.5
- Training tokens:  D_opt ∝ C^0.5

Example:
- 10x more compute → √10 ≈ 3x larger model, 3x more data
- Not: 10x larger model on the same data (the GPT-3 approach)

Chinchilla: 70B params, 1.4T tokens (compute-optimal)
GPT-3:      175B params, 300B tokens (over-parameterized, under-trained)

Reference: Training Compute-Optimal Large Language Models (arxiv.org/abs/2203.15556)

Compute Budget | Optimal Model Size | Optimal Training Tokens
1e20 FLOPs | 400M params | 8B tokens
1e21 FLOPs | 1B params | 20B tokens
1e22 FLOPs | 3B params | 60B tokens
1e23 FLOPs | 10B params | 200B tokens
1e24 FLOPs | 30B params | 600B tokens
1e25 FLOPs | 100B params | 2T tokens
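One pattern worth noticing: every row of the table keeps roughly 20 training tokens per parameter, the rule of thumb usually quoted from Chinchilla. A quick check using the table's own numbers:

rows = [  # (params, training tokens), taken from the table above
    (400e6, 8e9), (1e9, 20e9), (3e9, 60e9),
    (10e9, 200e9), (30e9, 600e9), (100e9, 2e12),
]
for n_params, n_tokens in rows:
    print(f"{n_params:.0e} params -> {n_tokens / n_params:.0f} tokens per parameter")
# Every row works out to ~20 tokens per parameter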

Training Data: Sources and Processing

Data Composition (Common Patterns)

Source | Proportion | Purpose | Quality
Web Crawl (Common Crawl) | 60-70% | Diverse language, broad knowledge | Variable (needs filtering)
Books | 10-15% | Long-form reasoning, narrative | High
Wikipedia | 3-5% | Factual knowledge, encyclopedic | Very high
Code (GitHub) | 5-10% | Programming, structured reasoning | High
ArXiv (scientific papers) | 2-3% | Technical knowledge, math | Very high
Reddit (filtered) | 5-10% | Conversational, Q&A format | Medium (filtered)

Data Processing Pipeline

# 1. Collection
raw_text = download_common_crawl(shard_id)

# 2. Filtering (remove low-quality documents)
def quality_filter(text):
    """Filter out low-quality documents"""
    # Language detection
    if detect_language(text) != 'en':
        return False
    # Remove too short/long documents
    if len(text) < 100 or len(text) > 1_000_000:
        return False
    # Content quality heuristics
    if perplexity(text, model=kenlm) > 500:   # Too random
        return False
    if repetition_ratio(text) > 0.3:          # Too repetitive
        return False
    if symbol_to_word_ratio(text) > 0.5:      # Too much noise
        return False
    return True

filtered = [doc for doc in raw_text if quality_filter(doc)]

# 3. Deduplication (critical for performance!)
def deduplicate(documents):
    """Remove exact and near-duplicate documents"""
    # Exact dedup using hashes
    seen_hashes = set()
    exact_deduped = []
    for doc in documents:
        doc_hash = hash(doc)
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            exact_deduped.append(doc)

    # Near-dedup using MinHash + LSH
    from datasketch import MinHash, MinHashLSH
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    near_deduped = []
    for idx, doc in enumerate(exact_deduped):
        m = MinHash(num_perm=128)
        for word in doc.split():
            m.update(word.encode('utf8'))
        # Check for near-duplicates
        if not lsh.query(m):  # No similar docs found
            lsh.insert(f"doc_{idx}", m)
            near_deduped.append(doc)

    return near_deduped

clean_data = deduplicate(filtered)

# 4. Tokenization
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenized = [tokenizer.encode(doc) for doc in clean_data]

# 5. Shuffling (critical for training diversity!)
import random
random.shuffle(tokenized)

# 6. Pack into sequences of fixed length (e.g., 2048 tokens)
def pack_sequences(tokenized_docs, seq_len=2048):
    """Pack variable-length documents into fixed-length sequences"""
    buffer = []
    sequences = []
    for doc_tokens in tokenized_docs:
        buffer.extend(doc_tokens)
        buffer.append(tokenizer.eos_token_id)  # Separate documents
        # Extract full sequences
        while len(buffer) >= seq_len:
            sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return sequences

training_data = pack_sequences(tokenized)

Computational Cost Breakdown

Training Cost Estimates

Model | Parameters | Training Tokens | Compute (FLOPs) | GPU-Hours (A100) | Est. Cost
GPT-2 | 1.5B | 10B | ~3e19 | ~1K | ~$2K
GPT-3 | 175B | 300B | ~3.1e23 | ~1M | ~$4-5M
LLaMA-2 7B | 7B | 2T | ~2.8e22 | ~180K | ~$300K
LLaMA-2 70B | 70B | 2T | ~2.8e23 | ~1.7M | ~$3M

FLOP calculation: ~6ND FLOPs, where N = parameters, D = tokens (forward + backward pass)
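Plugging that rule into the GPT-3 row above as a sanity check (a rough estimate that ignores attention FLOPs and other overheads):

def training_flops(n_params, n_tokens):
    # Rule of thumb: ~6 FLOPs per parameter per token (forward + backward)
    return 6 * n_params * n_tokens

print(f"GPT-3: {training_flops(175e9, 300e9):.2e} FLOPs")  # ~3.15e+23, matching the table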

Why So Expensive?

  • Matrix multiplications: Billions of parameters × billions of tokens
  • Parallel training: Requires thousands of GPUs for reasonable time
  • Memory bandwidth: Moving parameters and activations is costly
  • Long runs: Must train to (near) convergence; stopping early wastes the compute already invested

Key Papers and Resources

Foundational Papers

Data Processing

Practical Resources

Recommended Learning Path

  1. Read GPT-3 paper to understand scale and capabilities
  2. Study Scaling Laws paper (Kaplan) - fundamental insights
  3. Read Chinchilla paper for compute-optimal training
  4. Implement character-level LM on tiny dataset (Shakespeare)
  5. Study nanoGPT training code
  6. Experiment with data processing (filtering, deduplication)
  7. Train GPT-2 scale model using Hugging Face

Post-training: Alignment & Specialization

Deep dive into SFT, RLHF, and alignment techniques

What is Post-training?

After pre-training, models undergo post-training (also called alignment) to become helpful assistants, specialized tools, or domain experts. This phase transforms raw language models into usable, safe AI systems aligned with human values and preferences.

Why Post-training is Critical

  • Pre-trained models complete text (next-token prediction) but don't follow instructions
  • No safety guardrails: Can generate harmful, biased, or toxic content
  • Poor task performance: Not optimized for specific applications
  • Lack of alignment: Don't understand human preferences or values

Post-training fixes all of these issues, transforming completion models into assistants.

The Three-Stage Pipeline

Stage | Goal | Data Needed | Cost
1. SFT | Teach instruction following | 10K-100K prompt-response pairs | Low (hours on a single GPU)
2. Reward Modeling | Learn human preferences | 10K-100K comparison pairs | Low (hours on a single GPU)
3. RLHF | Optimize for preferences | Prompts only (RL generates responses) | Medium (days on multiple GPUs)

Stage 1: Supervised Fine-Tuning (SFT)

Train the model on high-quality instruction-response pairs to teach it how to follow instructions and behave as an assistant.

SFT Data Format

# Training example format (chat/instruction format)
{
    "prompt": "Explain quantum computing to a 10-year-old.",
    "response": "Imagine a regular computer is like flipping one coin at a time - heads or tails. A quantum computer is like flipping many magic coins at once, where each coin can be heads AND tails at the same time until you look at it! This lets quantum computers solve certain problems much faster than regular computers."
}

# Can also use chat format with roles:
{
    "messages": [
        {"role": "user", "content": "What causes rainbows?"},
        {"role": "assistant", "content": "Rainbows are caused by..."}
    ]
}

SFT Training Objective

Loss = -Σ_t log P(y_t | x, y_<t; θ)

Where:
- x: instruction/prompt
- y: desired response
- y_t: token t in the response
- θ: model parameters

Key differences from pre-training:
- Loss is computed ONLY on response tokens (not the prompt)
- Teaches the model to generate specific responses to instructions
- Much smaller dataset (10K-100K examples vs trillions of tokens)

Complete SFT Implementation

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import Dataset, DataLoader

class InstructionDataset(Dataset):
    """Dataset for instruction fine-tuning"""
    def __init__(self, data, tokenizer, max_length=2048):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Format as a chat/instruction template
        text = f"### Instruction:\n{item['prompt']}\n\n### Response:\n{item['response']}"

        # Tokenize
        tokens = self.tokenizer.encode(text, max_length=self.max_length, truncation=True)

        # Create labels: -100 for prompt tokens (ignored in loss)
        prompt_text = f"### Instruction:\n{item['prompt']}\n\n### Response:\n"
        prompt_tokens = self.tokenizer.encode(prompt_text)
        prompt_len = len(prompt_tokens)

        labels = [-100] * prompt_len + tokens[prompt_len:]

        return {
            'input_ids': torch.tensor(tokens),
            'labels': torch.tensor(labels)
        }

def supervised_fine_tune(
    model_name='gpt2',
    train_data=None,
    num_epochs=3,
    batch_size=4,
    learning_rate=2e-5,
    output_dir='./sft_model'
):
    """
    Supervised fine-tuning on instruction data

    Args:
        model_name: Base model to fine-tune
        train_data: List of {"prompt": ..., "response": ...} dicts
        num_epochs: Training epochs
        batch_size: Batch size
        learning_rate: Learning rate (lower than pre-training!)
        output_dir: Where to save the fine-tuned model
    """
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Create dataset
    # Note: for batch_size > 1, sequences must be padded to the same length
    # (e.g., with a data collator); omitted here for brevity
    dataset = InstructionDataset(train_data, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Optimizer (lower LR than pre-training!)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    # Training loop
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model.train()

    for epoch in range(num_epochs):
        total_loss = 0
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model(input_ids, labels=labels)
            loss = outputs.loss

            # Backward pass
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f}")

    # Save model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Model saved to {output_dir}")

    return model


# Usage
# train_data = [
#     {"prompt": "What is AI?", "response": "AI is..."},
#     {"prompt": "Explain photosynthesis", "response": "Photosynthesis is..."},
#     ...
# ]
# model = supervised_fine_tune(model_name='gpt2', train_data=train_data)

Stage 2: Reward Modeling

Train a separate model to predict human preferences between different responses. This reward model will guide RLHF training.

Preference Data Collection

How Comparison Data is Created

  1. Sample prompts: Collect diverse user prompts
  2. Generate responses: SFT model generates 2-4 responses per prompt
  3. Human ranking: Raters rank responses (best to worst)
  4. Create pairs: All pairwise comparisons extracted
# Example comparison data
{
    "prompt": "Explain black holes simply.",
    "chosen": "Black holes are regions where gravity is so strong that not even light can escape...",  # Ranked #1
    "rejected": "Black holes are big space things."  # Ranked #2 (worse)
}

Reward Model Training (Bradley-Terry Model)

# Probability that response_a is preferred over response_b
P(a ≻ b) = σ(r(x, a) - r(x, b))
         = 1 / (1 + exp(-(r(x, a) - r(x, b))))

Where:
- r(x, y): reward score for response y to prompt x
- σ: sigmoid function
- ≻: preference relation

Loss (cross-entropy):
L = -log σ(r(x, y_chosen) - r(x, y_rejected))

This teaches the reward model to assign higher scores to preferred responses.

Reward Model Implementation

import torch
import torch.nn as nn
from transformers import AutoTokenizer

class RewardModel(nn.Module):
    """
    Reward model for predicting human preferences

    Uses the same architecture as a language model but outputs a scalar reward
    """
    def __init__(self, base_model):
        super().__init__()
        self.model = base_model
        # Replace LM head with reward head (scalar output)
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids):
        """
        Args:
            input_ids: [batch, seq_len]
        Returns:
            rewards: [batch] scalar reward for each sequence
        """
        # Get hidden states
        outputs = self.model(input_ids, output_hidden_states=True)
        hidden_states = outputs.hidden_states[-1]  # Last layer

        # Use the last token's hidden state (end of sequence)
        last_hidden = hidden_states[:, -1, :]  # [batch, hidden_size]

        # Project to scalar reward
        rewards = self.reward_head(last_hidden).squeeze(-1)  # [batch]
        return rewards

def train_reward_model(
    base_model,
    comparison_data,
    num_epochs=3,
    batch_size=4,
    learning_rate=1e-5
):
    """
    Train a reward model on comparison data

    Args:
        base_model: Pre-trained or SFT model
        comparison_data: List of {prompt, chosen, rejected} dicts
        num_epochs: Training epochs
        batch_size: Batch size
        learning_rate: Learning rate
    """
    reward_model = RewardModel(base_model)
    tokenizer = AutoTokenizer.from_pretrained('gpt2')

    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=learning_rate)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    reward_model = reward_model.to(device)

    for epoch in range(num_epochs):
        total_loss = 0
        correct = 0
        total = 0

        for item in comparison_data:
            # Tokenize chosen and rejected responses
            chosen_text = f"{item['prompt']}\n{item['chosen']}"
            rejected_text = f"{item['prompt']}\n{item['rejected']}"

            chosen_ids = tokenizer.encode(chosen_text, return_tensors='pt').to(device)
            rejected_ids = tokenizer.encode(rejected_text, return_tensors='pt').to(device)

            # Get rewards
            reward_chosen = reward_model(chosen_ids)
            reward_rejected = reward_model(rejected_ids)

            # Bradley-Terry loss
            loss = -torch.log(torch.sigmoid(reward_chosen - reward_rejected))

            # Backward pass
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            total_loss += loss.item()

            # Accuracy: chosen should have a higher reward
            if reward_chosen > reward_rejected:
                correct += 1
            total += 1

        accuracy = correct / total
        avg_loss = total_loss / total
        print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f} | Accuracy: {accuracy:.2%}")

    return reward_model


# Usage
# comparison_data = [
#     {"prompt": "...", "chosen": "...", "rejected": "..."},
#     ...
# ]
# reward_model = train_reward_model(base_model, comparison_data)

Stage 3: RLHF with PPO

Use reinforcement learning (specifically PPO - Proximal Policy Optimization) to optimize the model against the reward model while preventing it from deviating too far from the original.

RLHF Objective Function

Objective (PPO-clip + KL penalty):

L_RLHF = E[ min(ratio * A, clip(ratio, 1-ε, 1+ε) * A) ] - β * KL(π_θ || π_ref)

Where:
- ratio = π_θ(y|x) / π_old(y|x)   (probability ratio)
- A: advantage (how much better than expected)
- ε: clip parameter (typically 0.2)
- β: KL penalty coefficient (typically 0.01-0.1)
- π_θ: current policy (model being trained)
- π_ref: reference policy (SFT model, frozen)

Two terms balance:
1. Maximize reward (via the advantage A)
2. Don't drift too far from the reference (KL penalty)

The Complete RLHF Loop

Four Models Involved

  1. Policy (Actor): The model being trained (generates responses)
  2. Value Model (Critic): Estimates expected reward (helps compute advantages)
  3. Reward Model: Scores responses (frozen, from Stage 2)
  4. Reference Model: Original SFT model (frozen, prevents drift)

Simplified RLHF Training Loop

def rlhf_training_step(
    policy_model,      # Model being trained
    value_model,       # Estimates expected reward
    reward_model,      # Frozen reward model
    reference_model,   # Frozen SFT model
    prompts,           # Batch of prompts
    tokenizer,
    beta=0.1           # KL penalty coefficient
):
    """
    Single RLHF training step using PPO

    This is simplified for clarity - real implementations use more
    sophisticated PPO with clipping, advantage estimation, etc.
    """
    device = policy_model.device

    # 1. Generate responses using the current policy
    responses = []
    log_probs_old = []

    with torch.no_grad():
        for prompt in prompts:
            prompt_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

            # Generate response
            output = policy_model.generate(
                prompt_ids,
                max_length=512,
                do_sample=True,
                temperature=1.0,
                return_dict_in_generate=True,
                output_scores=True
            )
            response_ids = output.sequences

            # Compute log probabilities under the current policy
            logits = torch.stack(output.scores, dim=1)  # [batch, seq, vocab]
            log_probs = F.log_softmax(logits, dim=-1)

            responses.append(response_ids)
            log_probs_old.append(log_probs)

    # 2. Compute rewards from the reward model
    rewards = []
    for response_ids in responses:
        reward = reward_model(response_ids)
        rewards.append(reward)
    rewards = torch.tensor(rewards).to(device)

    # 3. Compute KL divergence from the reference model
    kl_penalties = []
    for response_ids in responses:
        # Policy log probs
        policy_output = policy_model(response_ids, labels=response_ids)
        policy_log_probs = policy_output.logits.log_softmax(dim=-1)

        # Reference log probs
        with torch.no_grad():
            ref_output = reference_model(response_ids, labels=response_ids)
            ref_log_probs = ref_output.logits.log_softmax(dim=-1)

        # KL divergence: E[log(p) - log(q)]
        kl = (policy_log_probs - ref_log_probs).sum(dim=-1).mean()
        kl_penalties.append(kl)
    kl_penalty = torch.stack(kl_penalties).mean()

    # 4. Compute advantages using the value model
    values = []
    for response_ids in responses:
        value = value_model(response_ids)
        values.append(value)
    values = torch.tensor(values).to(device)

    advantages = rewards - values  # How much better than expected

    # 5. PPO policy update
    # (log_probs_new is the response log-probability recomputed under the
    #  current policy after each optimizer step; its computation is omitted
    #  here for brevity)
    ratio = torch.exp(log_probs_new - log_probs_old)  # π_new / π_old
    clipped_ratio = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)

    policy_loss = -torch.min(
        ratio * advantages,
        clipped_ratio * advantages
    ).mean()

    # 6. Total loss with KL penalty
    total_loss = policy_loss + beta * kl_penalty

    return total_loss, {
        'reward': rewards.mean().item(),
        'kl': kl_penalty.item(),
        'advantage': advantages.mean().item()
    }


# Full RLHF training would run this in a loop:
# for epoch in range(num_epochs):
#     for batch_prompts in dataloader:
#         loss, metrics = rlhf_training_step(...)
#         loss.backward()
#         optimizer.step()

Modern Alternative: Direct Preference Optimization (DPO)

DPO is a simpler alternative to RLHF that optimizes preferences directly without training a separate reward model or using RL.

DPO Objective

L_DPO = -E[ log σ( β * log(π_θ(y_w|x) / π_ref(y_w|x))
                 - β * log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

Where:
- y_w: preferred (won) response
- y_l: dis-preferred (lost) response
- π_θ: policy being trained
- π_ref: reference policy (frozen)
- β: temperature parameter
- σ: sigmoid

Key insight: Preferences can be optimized directly, without a reward model!
Simpler and more stable than PPO-based RLHF.

Reference: Direct Preference Optimization (arxiv.org/abs/2305.18290)
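A minimal sketch of the DPO loss in PyTorch, assuming you have already summed the per-token log-probabilities of each chosen and rejected response under both the policy and the frozen reference model (the function and variable names are illustrative):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss given summed log-probs of chosen/rejected responses.

    Each argument is a [batch] tensor of log P(response | prompt).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin) == softplus(-margin), which is numerically stable
    return F.softplus(-(chosen_rewards - rejected_rewards)).mean()

# Toy usage with made-up log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # lower when the policy prefers the chosen response more than the reference does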

Parameter-Efficient Fine-Tuning: LoRA

LoRA (Low-Rank Adaptation) allows fine-tuning large models efficiently by training small adapter matrices instead of all parameters.

LoRA Mathematical Formulation

Instead of updating the full weight matrix W ∈ R^(d×k):

    W_new = W + ΔW

LoRA decomposes the update as a low-rank product:

    W_new = W + BA

Where:
- W: frozen pre-trained weights (d×k)
- A ∈ R^(r×k): trainable (down-projection to rank r)
- B ∈ R^(d×r): trainable (up-projection back to d)
- r << min(d, k): rank (typically 8, 16, 32)

Parameters to train: d*r + r*k << d*k

Example: d=4096, k=4096, r=16
- Full:  ~16.8M params
- LoRA:  ~131K params (~100x reduction!)

LoRA Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALayer(nn.Module):
    """
    LoRA adapter for a linear layer

    Adds a trainable low-rank decomposition to frozen weights
    """
    def __init__(self, in_features, out_features, rank=16, alpha=16):
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # Frozen pre-trained weight (not trained)
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False

        # Trainable LoRA matrices
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x):
        """
        Args:
            x: [batch, ..., in_features]
        Returns:
            output: [batch, ..., out_features]
        """
        # Original forward pass (frozen)
        result = F.linear(x, self.weight)

        # Add LoRA adaptation: x @ A^T @ B^T
        lora_out = F.linear(F.linear(x, self.lora_A), self.lora_B)
        result = result + lora_out * self.scaling

        return result

# Apply LoRA to a transformer model
def apply_lora_to_model(model, rank=16, target_modules=['q_proj', 'v_proj']):
    """
    Replace specified linear layers with LoRA versions

    Args:
        model: Transformer model
        rank: LoRA rank
        target_modules: Which layers to adapt (commonly Q, V projections)
    """
    for name, module in model.named_modules():
        if any(target in name for target in target_modules):
            if isinstance(module, nn.Linear):
                # Replace with a LoRA layer
                lora_layer = LoRALayer(
                    module.in_features,
                    module.out_features,
                    rank=rank
                )
                # Copy pre-trained weights
                lora_layer.weight.data = module.weight.data.clone()

                # Replace in the model
                parent_name = '.'.join(name.split('.')[:-1])
                child_name = name.split('.')[-1]
                parent = dict(model.named_modules())[parent_name]
                setattr(parent, child_name, lora_layer)

    # Freeze all non-LoRA parameters
    for name, param in model.named_parameters():
        if 'lora' not in name:
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

    return model


# Usage
# model = AutoModelForCausalLM.from_pretrained('gpt2')
# model = apply_lora_to_model(model, rank=16)
# # Now train only LoRA parameters (much faster, less memory!)

Key Papers and Resources

Foundational RLHF Papers

DPO and Alternatives

Parameter-Efficient Fine-Tuning

Practical Libraries

Recommended Learning Path

  1. Read InstructGPT paper (Ouyang et al.) - foundational RLHF
  2. Implement simple SFT on small instruction dataset
  3. Study LoRA paper and implement low-rank adaptation
  4. Read DPO paper - simpler alternative to RLHF
  5. Experiment with TRL library for RLHF
  6. Read Constitutional AI for AI feedback approaches
  7. Implement full RLHF pipeline on toy problem

Essential Concepts

Deep dive into tokens, embeddings, sampling, and more

Tokens & Tokenization

Tokens are the atomic units that LLMs process. Modern LLMs don't operate on characters or words directly - they use subword tokenization to balance vocabulary size and coverage.

Why Subword Tokenization?

  • Character-level: Tiny vocabulary (~256), but very long sequences → expensive
  • Word-level: Huge vocabulary (millions), can't handle unseen words (OOV problem)
  • Subword-level: Balanced! ~50K vocab, handles rare words via subword splitting

Byte Pair Encoding (BPE)

The most common tokenization algorithm used by GPT, LLaMA, and most modern LLMs.

# BPE Algorithm (simplified)
def train_bpe(corpus, num_merges=10000):
    """
    Train a BPE tokenizer on a corpus.

    Args:
        corpus: List of words
        num_merges: Number of merge operations (controls vocabulary size)
    Returns:
        merge_rules: List of (pair, merged_token) rules
    """
    # Start with a character-level vocabulary
    # (count_words, count_pairs, merge_vocab are helper functions, omitted here)
    vocab = {word: freq for word, freq in count_words(corpus).items()}
    merge_rules = []

    for i in range(num_merges):
        # Find the most frequent adjacent pair
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)

        # Merge the pair
        vocab = merge_vocab(vocab, best_pair)
        merge_rules.append(best_pair)
        print(f"Merge {i}: {best_pair[0]} + {best_pair[1]} → {best_pair[0]}{best_pair[1]}")

    return merge_rules

# Example tokenization:
# "understanding" with BPE:
#   1. Start: ['u', 'n', 'd', 'e', 'r', 's', 't', 'a', 'n', 'd', 'i', 'n', 'g']
#   2. After merges: ['under', 'stand', 'ing']
# "unexpected" (less common):
#   1. Start: ['u', 'n', 'e', 'x', 'p', 'e', 'c', 't', 'e', 'd']
#   2. After merges: ['un', 'exp', 'ect', 'ed']   # More fragmented

Tokenization Examples

Text Tokens (GPT-2 BPE) Token Count
"Hello, world!" ["Hello", ",", " world", "!"] 4
"ChatGPT" ["Chat", "G", "PT"] 3
"antidisestablishmentarianism" ["ant", "idis", "establish", "ment", "arian", "ism"] 6
" strawberry" (with space) [" straw", "berry"] 2
"🤖🚀" ["�", "�", "�", "�", "�", "�", "�", "�"] 8 (UTF-8 bytes)

Key insight: Spaces are part of tokens! " world" ≠ "world". This affects prompting and token counting.
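
You can check this yourself with OpenAI's tiktoken library. A minimal sketch, assuming tiktoken is installed and using the GPT-2 encoding from the table above:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

print(enc.encode("world"))     # one token ID
print(enc.encode(" world"))    # a different ID: the leading space is part of the token
print(len(enc.encode("Hello, world!")))         # 4 tokens, matching the table above
print(enc.decode(enc.encode("Hello, world!")))  # round-trips back to the original text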

Vocabulary Sizes

  • GPT-2: 50,257 tokens
  • GPT-3: 50,257 tokens (same BPE vocabulary as GPT-2); GPT-3.5/GPT-4: ~100,000 tokens (cl100k_base)
  • LLaMA: 32,000 tokens (SentencePiece)
  • LLaMA-2: 32,000 tokens

Embeddings: From Tokens to Vectors

Embeddings convert discrete tokens into continuous vector representations that capture semantic meaning.

Embedding Layer

import torch
import torch.nn as nn

# Embedding layer: a lookup table
vocab_size = 50000
embedding_dim = 768  # GPT-2 small

embedding = nn.Embedding(vocab_size, embedding_dim)

# Token IDs → vectors
token_ids = torch.tensor([145, 892, 319])  # [cat, sat, on]
embeddings = embedding(token_ids)          # [3, 768]

# Each token is now a 768-dimensional vector
print(embeddings.shape)  # torch.Size([3, 768])
print(embeddings[0])     # Vector for "cat": [0.234, -0.512, 0.891, ...]

Embedding Properties

Semantic Similarity

Tokens with similar meanings have similar vectors (high cosine similarity).

import torch
import torch.nn.functional as F

# Cosine similarity
def cosine_sim(vec1, vec2):
    return F.cosine_similarity(vec1, vec2, dim=-1)

# Example (learned embeddings; token IDs are illustrative):
emb_king  = embedding(torch.tensor([1234]))
emb_queen = embedding(torch.tensor([5678]))
emb_man   = embedding(torch.tensor([789]))
emb_car   = embedding(torch.tensor([456]))

sim_king_queen = cosine_sim(emb_king, emb_queen)  # High (~0.8)
sim_king_car   = cosine_sim(emb_king, emb_car)    # Low (~0.2)

Famous Word Analogies

# Vector arithmetic captures relationships!
king - man + woman ≈ queen
# vec("king") - vec("man") + vec("woman") ≈ vec("queen")

paris - france + italy ≈ rome   # Geographic relationships
walked - walk + run ≈ ran       # Grammatical transformations

# This works because embeddings learn semantic structure!

Positional Embeddings

Since attention is permutation-invariant, we add positional information to embeddings.

# Learned positional embeddings (GPT-2, BERT)
max_position = 1024
pos_embedding = nn.Embedding(max_position, embedding_dim)

# For input tokens at positions [0, 1, 2]
positions = torch.arange(3)
pos_embeds = pos_embedding(positions)     # [3, 768]

# Combined embedding
token_embeds = embedding(token_ids)       # [3, 768]
final_embeds = token_embeds + pos_embeds  # [3, 768]

# Now each token knows its position in the sequence!

Reference: "Attention Is All You Need" introduced sinusoidal positional encodings. Modern models (GPT, BERT) use learned positional embeddings. LLaMA uses RoPE (Rotary Position Embedding).

Context Window: The Attention Bottleneck

The context window (or context length) is the maximum number of tokens the model can process at once. This is a hard architectural limit.

Why Context Matters

  • Hard limit: Model can only "see" last N tokens
  • Affects memory: Can't reference information beyond context
  • Quadratic cost: O(n²) attention computation
  • Determines use cases: Short contexts → chat; Long contexts → document analysis

Context Lengths of Modern LLMs

Model Context Length ~Word Count ~Pages (250 words/page)
GPT-3 2,048 tokens ~1,500 words ~6 pages
GPT-4 (8K) 8,192 tokens ~6,000 words ~24 pages
GPT-4 (32K) 32,768 tokens ~24,000 words ~96 pages
GPT-4 Turbo 128,000 tokens ~96,000 words ~384 pages
Claude 2 100,000 tokens ~75,000 words ~300 pages
Claude 3 200,000 tokens ~150,000 words ~600 pages
LLaMA-2 4,096 tokens ~3,000 words ~12 pages

Context Window Trade-offs

Larger context = more memory and compute

  • Memory: Storing attention matrix: O(n²) memory
  • Compute: Computing attention: O(n²·d) FLOPs
  • Latency: Longer sequences → slower generation
# Example: GPT-4 Turbo's 128K context
# If each attention score is 2 bytes (FP16):
#   memory = 128_000^2 * 2 bytes ≈ 32.8 GB of attention scores
#   (per head, per layer, if the full matrix is materialized naively)
# This is why long context is expensive and requires
# techniques like FlashAttention for efficiency

Sampling & Decoding Strategies

How models generate text by sampling from probability distributions over tokens.

Temperature Scaling

Temperature controls the randomness of predictions by scaling logits before softmax.

import torch
import torch.nn.functional as F

# Model outputs logits (unnormalized scores)
logits = torch.tensor([2.0, 1.0, 0.1])  # Three possible next tokens

# Apply temperature
def sample_with_temperature(logits, temperature=1.0):
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    return probs

# Low temperature (T=0.5): more confident, peaked distribution
probs_low = sample_with_temperature(logits, temperature=0.5)
# [0.864, 0.117, 0.019]  ← First token dominates

# Medium temperature (T=1.0): balanced
probs_medium = sample_with_temperature(logits, temperature=1.0)
# [0.659, 0.242, 0.099]  ← Still peaked but less extreme

# High temperature (T=2.0): more uniform, diverse
probs_high = sample_with_temperature(logits, temperature=2.0)
# [0.502, 0.304, 0.194]  ← More spread out

# Very low (T=0.1): nearly deterministic
probs_very_low = sample_with_temperature(logits, temperature=0.1)
# [1.000, 0.000, 0.000]  ← Almost always the first token

Sampling Strategies Comparison

Strategy How It Works Use Cases
Greedy Always pick highest probability token Factual Q&A, translation (deterministic)
Beam Search Keep top K sequences at each step Machine translation, summarization
Top-K Sampling Sample from top K most likely tokens Creative writing (K=40-100)
Top-P (Nucleus) Sample from smallest set with cumulative prob ≥ P Most versatile (P=0.9-0.95)
Temperature + Top-P Combine both for fine control Production systems (T=0.7, P=0.9)

Top-K vs Top-P (Nucleus) Sampling

def top_k_sampling(logits, k=50):
    """Keep only the top K tokens."""
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs = F.softmax(top_k_logits, dim=-1)
    # Sample from the top K
    next_token = torch.multinomial(probs, 1)
    return top_k_indices[next_token]

def top_p_sampling(logits, p=0.9):
    """Nucleus sampling: keep the smallest set with cumulative prob ≥ p."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(probs, dim=-1)

    # Remove tokens once the cumulative prob exceeds p
    remove_mask = cumulative_probs > p
    remove_mask[..., 1:] = remove_mask[..., :-1].clone()
    remove_mask[..., 0] = False

    # Zero out removed tokens
    sorted_logits[remove_mask] = float('-inf')

    # Sample from the remaining tokens
    probs = F.softmax(sorted_logits, dim=-1)
    next_token_idx = torch.multinomial(probs, 1)
    return sorted_indices[next_token_idx]

# Example probability distribution:
# Token 0: 0.40 ←
# Token 1: 0.30 ← Top-P (p=0.9) includes these (cum=0.7)
# Token 2: 0.20 ← and this one (cum=0.9)
# Token 3: 0.05 ← Excluded (would exceed p)
# Token 4: 0.03
# Token 5: 0.02
# Top-K (k=3) would include tokens 0, 1, 2 regardless of their probabilities
# Top-P (p=0.9) includes 0, 1, 2 because they sum to ≥ 0.9

Recommended Settings

Task Temperature Top-P Notes
Factual Q&A 0.0-0.3 1.0 Deterministic, accurate
Code generation 0.2-0.5 0.95 Mostly deterministic
Chat assistant 0.7-0.9 0.9 Balanced
Creative writing 0.9-1.2 0.95 Diverse, creative
Brainstorming 1.0-1.5 1.0 Very diverse
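
Since the table above recommends combining temperature with top-p, here is a minimal sketch of how the two are typically chained when sampling a single token; the function name is illustrative, and production code would normally rely on a library's generate() method instead.

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.7, top_p=0.9):
    """Apply temperature scaling, then nucleus filtering, then sample one token ID."""
    # 1. Temperature scaling
    logits = logits / temperature

    # 2. Nucleus (top-p) filtering
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cumulative > top_p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    sorted_logits[remove] = float('-inf')

    # 3. Sample from the filtered distribution
    probs = F.softmax(sorted_logits, dim=-1)
    idx = torch.multinomial(probs, 1)
    return sorted_indices[idx]

# next_id = sample_next_token(model_logits, temperature=0.7, top_p=0.9)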

Key Papers and Resources

Tokenization

Embeddings

Sampling Strategies

Long Context

Practical Tools

Recommended Learning Path

  1. Experiment with tiktoken to see how text is tokenized
  2. Read BPE paper (Sennrich et al.) to understand the algorithm
  3. Implement simple embedding layer in PyTorch
  4. Read Word2Vec paper for embedding intuition
  5. Experiment with temperature and top-p sampling
  6. Read Nucleus sampling paper (Holtzman et al.)
  7. Implement all sampling strategies from scratch

Popular LLM Families

Overview of major language models and their characteristics

Major LLM Families

GPT Series

OpenAI's Generative Pre-trained Transformer models (GPT-3, GPT-4)

Claude

Anthropic's constitutional AI assistant with strong reasoning

LLaMA

Meta's open-source models enabling research and development

BERT

Google's bidirectional encoder for understanding tasks

PaLM/Gemini

Google's latest multimodal AI models

Mistral

Efficient open-source models from Mistral AI

Model Comparison

Model Parameters Context Notable Feature
GPT-4 Unknown 128K tokens Multimodal capabilities
Claude 3 Unknown 200K tokens Long context, strong reasoning
LLaMA 2 7B-70B 4K tokens Open source
Mixtral 8x7B 47B 32K tokens Mixture of Experts
Choosing a Model: Consider factors like cost, latency, capabilities, context length, and whether you need self-hosting (open source) or API access (proprietary).

Test Your Knowledge

Assess your understanding of LLMs and transformers

Knowledge Quiz

1. What is the key innovation that makes Transformers powerful?
A) Recurrent connections
B) Attention mechanism
C) Convolutional layers
D) Decision trees
2. What does RLHF stand for?
A) Recursive Learning from Hidden Features
B) Reinforcement Learning from Human Feedback
C) Random Learning for High Frequency
D) Regulated Learning for Helpful Functions
3. What are tokens in the context of LLMs?
A) Physical hardware components
B) Basic units of text the model processes
C) Security credentials
D) Neural network layers
4. During pre-training, what is the primary objective?
A) Follow human instructions
B) Predict the next token in a sequence
C) Generate creative stories
D) Classify images
5. What is the purpose of temperature in text generation?
A) Control the model's training speed
B) Measure the model's accuracy
C) Control randomness and creativity in outputs
D) Determine the context window size

Learning Resources

Continue your journey into LLMs and AI

Essential Papers

  • "Attention Is All You Need" - The original Transformer paper (Vaswani et al., 2017)
  • "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2018)
  • "Language Models are Few-Shot Learners" - GPT-3 paper (Brown et al., 2020)
  • "Training language models to follow instructions with human feedback" - InstructGPT/RLHF (Ouyang et al., 2022)
  • "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022)

Online Courses & Tutorials

  • Stanford CS224N: Natural Language Processing with Deep Learning
  • Hugging Face NLP Course: Free comprehensive course on transformers
  • Fast.ai: Practical Deep Learning for Coders
  • DeepLearning.AI: Courses on Transformers and LLMs
  • Andrej Karpathy: YouTube tutorials on building GPT from scratch

Tools to Experiment

  • Hugging Face: Pre-trained models, datasets, and the Transformers library
  • OpenAI Playground: Experiment with GPT models
  • Google Colab: Free GPU/TPU for training and experimenting
  • PyTorch / TensorFlow: Deep learning frameworks
  • LangChain: Framework for building LLM applications

Communities & Blogs

  • r/MachineLearning: Reddit community for ML research
  • Papers with Code: Latest research with implementations
  • The Gradient: Online magazine about AI research
  • Distill.pub: Clear explanations of ML concepts
  • AI Alignment Forum: Discussions on AI safety and alignment

Next Steps

  1. Read the foundational Transformer paper
  2. Experiment with models on Hugging Face
  3. Build a simple project using LLM APIs
  4. Follow AI research communities
  5. Consider specializing in an area (safety, applications, research)

Sources & References

This guide draws from foundational papers and research in the field of large language models. Below are key citations and resources:

Foundational Papers

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need"
The original Transformer architecture paper
https://arxiv.org/abs/1706.03762

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
Introduced bidirectional pre-training and masked language modeling
https://arxiv.org/abs/1810.04805

Radford, A., Wu, J., Child, R., et al. (2019). "Language Models are Unsupervised Multitask Learners"
GPT-2: Demonstrated zero-shot task transfer capabilities
OpenAI GPT-2 Paper

Scaling & Training

Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models"
Original scaling laws for LLMs
https://arxiv.org/abs/2001.08361

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models"
Chinchilla paper: Revised scaling laws showing optimal compute allocation
https://arxiv.org/abs/2203.15556

Optimization & Efficiency

Dao, T., Fu, D. Y., Ermon, S., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
Efficient attention mechanism using tiling and kernel fusion
https://arxiv.org/abs/2205.14135

Alignment & Safety

Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback"
Self-improvement and alignment through AI-generated feedback
https://arxiv.org/abs/2212.08073

Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
DPO: Simplified alignment method without explicit reward modeling
https://arxiv.org/abs/2305.18290

Emergent Capabilities

Wei, J., Tay, Y., Bommasani, R., et al. (2022). "Emergent Abilities of Large Language Models"
Documents qualitative improvements that emerge at scale
https://arxiv.org/abs/2206.07682

Note: This field moves rapidly. For the latest research, check arXiv.org, Papers with Code, and conference proceedings (NeurIPS, ICML, ACL, EMNLP).

How LLMs Understand and Generate Code

Deep dive into code comprehension, generation, and the patterns LLMs recognize

Code as a Language

LLMs treat programming code as just another language. During pre-training, models see billions of lines of code from GitHub, Stack Overflow, documentation, and tutorials. They learn:

  • Syntax patterns: Language-specific grammar, keywords, operators
  • Semantic relationships: How variables, functions, and classes relate
  • Common idioms: Standard patterns like loops, recursion, error handling
  • API usage: How to use libraries and frameworks correctly
  • Code structure: File organization, imports, module patterns
  • Documentation patterns: Docstrings, comments, type hints

What Makes Code Different from Natural Language

Precise Syntax

One missing semicolon breaks everything. No room for ambiguity.

Logical Structure

Code follows strict logical flow: control structures, state changes, algorithms.

Compositionality

Complex programs built from simple, reusable components.

Execution Semantics

Code must be executable. The computer is the ultimate judge.

Code Training Data

Modern code-capable LLMs are trained on massive code repositories:

Source Size Content
GitHub 100M+ repositories Open source code in all languages
Stack Overflow 20M+ questions Q&A with code examples
Documentation Billions of tokens Official docs, tutorials, guides
LeetCode/HackerRank 100K+ problems Algorithmic problems with solutions

Common Programming Patterns LLMs Learn

Through exposure to millions of code examples, LLMs develop strong intuitions about common patterns:

1. Data Structure Patterns

# Array/List Patterns
two_pointers   = lambda arr: ...  # Two pointer technique
sliding_window = lambda arr: ...  # Sliding window
prefix_sum     = lambda arr: [sum(arr[:i+1]) for i in range(len(arr))]

# HashMap Patterns
frequency_map = {}  # Count occurrences
seen = set()        # Track visited elements
memo = {}           # Memoization for DP

# Tree Patterns
def dfs(node):
    if not node:
        return
    # Process node
    dfs(node.left)
    dfs(node.right)

# Graph Patterns
adj_list = {i: [] for i in range(n)}  # Adjacency list
visited = set()                       # Track visited nodes

2. Algorithm Patterns

# Binary Search Pattern
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# Dynamic Programming Pattern
def dp_problem(n):
    dp = [0] * (n + 1)
    dp[0] = base_case
    for i in range(1, n + 1):
        dp[i] = transition_function(dp, i)
    return dp[n]

# Backtracking Pattern
def backtrack(path, choices):
    if is_solution(path):
        result.append(path[:])
        return
    for choice in choices:
        path.append(choice)
        backtrack(path, remaining_choices)
        path.pop()  # Undo choice

3. Edge Case Handling

# LLMs learn these common edge cases:
if not nums or len(nums) == 0:             # Empty input
    return []
if len(nums) == 1:                         # Single element
    return nums[0]
if target < nums[0] or target > nums[-1]:  # Out of range
    return -1
if left > right:                           # Invalid range
    return None

# Handling None/null
if root is None:
    return 0

How LLMs Generate Code

When generating code, LLMs follow a probabilistic process:

1. Understand the Problem

Parse the natural language description, identify key requirements, constraints, and expected inputs/outputs. Extract data structure needs and algorithmic requirements.

2. Recall Similar Problems

Attention mechanism activates memories of similar problems seen during training. Pattern matching against thousands of known solutions.

3. Select High-Level Approach

Choose algorithm family (greedy, DP, divide-and-conquer, etc.) based on problem characteristics. Select appropriate data structures.

4. Generate Structure

Start with function signature, parameter types. Add main logic scaffold (loops, conditionals, base cases). Structure follows learned templates.

5. Fill in Details

Token by token generation of the actual logic. Each token is predicted based on all previous context. Variable names follow conventions.

6. Add Error Handling

Insert edge case checks, validation, and error handling based on learned patterns from high-quality code.

Strengths of LLMs in Coding

  • Pattern Recognition: Excellent at recognizing which algorithm pattern fits a problem
  • Boilerplate: Great at generating standard code structures and setup
  • Syntax: Strong knowledge of syntax across many programming languages
  • Common Algorithms: Can implement standard algorithms (sorting, searching, graph traversal)
  • Code Translation: Can convert code between programming languages
  • Explanation: Can explain what code does line by line
  • Debugging: Can spot common bugs and suggest fixes

Limitations in Coding

  • Complex Logic: Struggles with highly nested or intricate logical conditions
  • Novel Algorithms: Can't invent truly new algorithmic approaches
  • Mathematical Proofs: Can't rigorously prove correctness
  • Optimization: May not generate the most efficient solution
  • Edge Cases: Might miss rare or subtle edge cases
  • Testing: Generated code needs human verification
  • Large Codebases: Context window limits understanding of large projects

Algorithmic Reasoning in LLMs

How LLMs solve algorithmic problems and approach LeetCode-style challenges

Understanding Algorithmic Reasoning

Algorithmic reasoning requires step-by-step logical thinking, understanding of complexity, and systematic problem-solving. LLMs develop these capabilities through exposure to millions of algorithm implementations and explanations.

How LLMs Learn Algorithms

During training, LLMs see algorithms presented in multiple forms:

Code Implementations

Working code in Python, Java, C++, etc.

Pseudocode

High-level algorithmic descriptions

Natural Language

Written explanations and tutorials

Examples & Traces

Step-by-step execution examples

Core Algorithm Categories LLMs Understand

1. Searching & Sorting (Mastery: High)

# Binary Search - LLMs excel at this
def binary_search(nums, target):
    """
    Time: O(log n), Space: O(1)
    LLMs understand: logarithmic complexity, divide-and-conquer
    """
    left, right = 0, len(nums) - 1
    while left <= right:
        mid = left + (right - left) // 2  # Avoid overflow
        if nums[mid] == target:
            return mid
        elif nums[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# Merge Sort - Recursive pattern well-learned
def merge_sort(arr):
    """
    Time: O(n log n), Space: O(n)
    LLMs understand: divide-and-conquer, recursion
    """
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    return merge(left, right)  # merge() combines two sorted lists (definition omitted)

2. Dynamic Programming (Mastery: Medium-High)

# Fibonacci - Classic DP pattern
def fib_dp(n):
    """
    LLMs recognize: overlapping subproblems, memoization
    Bottom-up vs top-down approaches
    """
    if n <= 1:
        return n
    dp = [0] * (n + 1)
    dp[1] = 1
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    return dp[n]

# Longest Common Subsequence
def lcs(s1, s2):
    """
    LLMs understand: 2D DP table, recurrence relations
    Building the solution from smaller subproblems
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

3. Graph Algorithms (Mastery: Medium)

# DFS - Depth First Search
def dfs(graph, start, visited=None):
    """
    LLMs understand: recursion, backtracking
    Visited set pattern, adjacency lists
    """
    if visited is None:
        visited = set()
    visited.add(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited

# BFS - Breadth First Search
from collections import deque

def bfs(graph, start):
    """
    LLMs recognize: queue usage, level-by-level traversal
    Shortest path in unweighted graphs
    """
    visited = set([start])
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

# Dijkstra's Algorithm
import heapq

def dijkstra(graph, start):
    """
    LLMs understand: priority queue, greedy approach
    Distance tracking, relaxation step
    """
    distances = {node: float('inf') for node in graph}
    distances[start] = 0
    pq = [(0, start)]
    while pq:
        current_dist, current_node = heapq.heappop(pq)
        if current_dist > distances[current_node]:
            continue
        for neighbor, weight in graph[current_node].items():
            distance = current_dist + weight
            if distance < distances[neighbor]:
                distances[neighbor] = distance
                heapq.heappush(pq, (distance, neighbor))
    return distances

4. Tree Algorithms (Mastery: High)

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

# Inorder Traversal - LLMs excel at this
def inorder(root):
    """
    Pattern: left -> root -> right
    LLMs understand: recursive tree traversal
    """
    if not root:
        return []
    return inorder(root.left) + [root.val] + inorder(root.right)

# Lowest Common Ancestor
def lca(root, p, q):
    """
    LLMs recognize: recursive structure
    Base cases and divide-and-conquer
    """
    if not root or root == p or root == q:
        return root
    left = lca(root.left, p, q)
    right = lca(root.right, p, q)
    if left and right:
        return root
    return left if left else right

# Validate BST
def isValidBST(root, min_val=float('-inf'), max_val=float('inf')):
    """
    LLMs understand: constraints propagation
    Range validation pattern
    """
    if not root:
        return True
    if root.val <= min_val or root.val >= max_val:
        return False
    return (isValidBST(root.left, min_val, root.val) and
            isValidBST(root.right, root.val, max_val))

Problem-Solving Strategies LLMs Learn

Strategy When to Use Example Problems
Two Pointers Sorted arrays, strings, linked lists Two Sum II, Container With Most Water
Sliding Window Subarray/substring problems Longest Substring, Max Sum Subarray
Fast & Slow Pointers Cycle detection, middle element Linked List Cycle, Happy Number
Merge Intervals Overlapping intervals Merge Intervals, Meeting Rooms
Top K Elements Finding largest/smallest K items Kth Largest, Top K Frequent
Binary Search Sorted data, search space reduction Search in Rotated Array, Find Min
Backtracking All combinations, permutations Subsets, N-Queens, Sudoku
Dynamic Programming Optimal substructure, overlapping Coin Change, Longest Palindrome

Time Complexity Recognition

LLMs develop intuition about complexity by seeing code annotated with Big-O notation:

# Common Complexity Classes LLMs Recognize

# O(1) - Constant
def get_first(arr):
    return arr[0] if arr else None

# O(log n) - Logarithmic (binary search pattern)
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        # ... halving the search space each time

# O(n) - Linear (single pass)
def find_max(arr):
    return max(arr)

# O(n log n) - Linearithmic (efficient sorting)
def sort(arr):
    return sorted(arr)  # Timsort

# O(n²) - Quadratic (nested loops)
def bubble_sort(arr):
    for i in range(len(arr)):
        for j in range(len(arr) - 1 - i):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]

# O(2^n) - Exponential (recursive branching)
def fibonacci_naive(n):
    if n <= 1:
        return n
    return fibonacci_naive(n-1) + fibonacci_naive(n-2)

LLM Reasoning Process for LeetCode Problems

Step 1: Pattern Recognition

Identify which algorithmic pattern the problem fits. Keywords trigger pattern matching: "sorted array" → binary search, "subarray" → sliding window, "paths" → DFS/BFS.

Step 2: Data Structure Selection

Choose appropriate data structures based on operations needed. HashMap for fast lookups, heap for priority, stack for LIFO, queue for BFS.

Step 3: Algorithm Template

Apply learned template for the identified pattern. Start with the scaffolding that fits the problem type.

Step 4: Implementation

Fill in problem-specific logic within the template structure. Handle indices, boundaries, update rules.

Step 5: Edge Cases

Add checks for empty inputs, single elements, negative numbers, duplicates, etc.
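
To illustrate steps 1-5, here is the kind of template-driven solution this process typically produces for a prompt such as "longest substring without repeating characters" (a classic sliding-window problem); the code below is a standard reference solution used as an example, not actual model output.

def length_of_longest_substring(s: str) -> int:
    """Sliding-window pattern: expand right, shrink left on duplicates. O(n) time."""
    seen = set()  # characters currently inside the window
    left = 0
    best = 0
    for right, ch in enumerate(s):
        # Shrink the window until ch is no longer duplicated
        while ch in seen:
            seen.remove(s[left])
            left += 1
        seen.add(ch)
        best = max(best, right - left + 1)
    return best

# length_of_longest_substring("abcabcbb") == 3  ("abc")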

What LLMs Struggle With

  • Novel Problem Types: Completely new problems without similar training examples
  • Multi-step Mathematical Reasoning: Problems requiring careful mathematical derivation
  • Subtle Optimizations: Finding the most space-efficient or time-efficient solution
  • Complex State Management: Problems with many interacting state variables
  • Proof of Correctness: Formally proving an algorithm is correct
  • Rare Edge Cases: Unusual boundary conditions not well-represented in training

Prompting LLMs for Coding Problems

Master techniques for getting the best solutions from LLMs for LeetCode-style problems

Effective Prompting Strategies

How you ask an LLM to solve a coding problem dramatically affects the quality of the solution. Well-structured prompts lead to better, more reliable code.

The Anatomy of a Good Coding Prompt

# Poor Prompt:
"Solve two sum problem"

# Good Prompt:
"Given an array of integers nums and an integer target, return indices of
the two numbers that add up to target.

Constraints:
- 2 <= nums.length <= 10^4
- -10^9 <= nums[i] <= 10^9
- Only one valid answer exists

Example:
Input: nums = [2,7,11,15], target = 9
Output: [0,1]
Explanation: nums[0] + nums[1] == 9

Please provide:
1. Optimal solution with explanation
2. Time and space complexity analysis
3. Edge cases to consider"

Essential Prompting Techniques

1. Chain-of-Thought Prompting

Ask the LLM to think step-by-step before coding:

Prompt: "Solve this problem step by step: 1. First, analyze the problem and identify the pattern 2. Explain your high-level approach 3. Discuss time and space complexity 4. Then provide the implementation 5. Finally, test with edge cases Problem: [Your problem here]" # This leads to better reasoning and more reliable solutions

2. Few-Shot Learning

Provide examples of similar solved problems:

Prompt: "Here's a similar problem I solved: Problem: Two Sum Solution: Use HashMap for O(n) time [code example] Now solve this similar problem: Problem: Three Sum [problem description]" # LLM learns from the pattern in your example

3. Specify Constraints and Requirements

Prompt: "Solve this problem with these requirements: - Time complexity: Must be O(n log n) or better - Space complexity: Must be O(1) auxiliary space - Language: Python 3 - Style: Use type hints and docstrings - Handle edge cases: empty array, single element, duplicates - Add comments explaining the algorithm Problem: [description]"

4. Ask for Multiple Approaches

Prompt: "Provide three different solutions: 1. Brute force approach (for understanding) 2. Optimized approach with trade-offs explained 3. Most optimal approach For each, include: - Code implementation - Time and space complexity - When to use this approach Problem: [description]"

5. Iterative Refinement

Build on previous responses:

First prompt:
"Solve this problem: [description]"
[Get initial solution]

Second prompt:
"The solution works but fails for edge case: [specific case]. Please fix this bug."
[Get improved solution]

Third prompt:
"Can you optimize the space complexity? Currently using O(n) space."
[Get optimized solution]

Fourth prompt:
"Add comprehensive unit tests"

Prompting Patterns for Common Problem Types

For Array Problems:

"Solve this array problem: [problem] Consider these approaches: - Two pointers - Sliding window - Prefix sum - HashMap for tracking Choose the most appropriate and explain why."

For Tree Problems:

"Solve this binary tree problem: [problem] Specify whether to use: - DFS (preorder/inorder/postorder) - BFS (level-order) - Recursive vs iterative - Any special tree properties (BST, balanced, etc.) Include the TreeNode class definition."

For Dynamic Programming:

"Solve this DP problem: [problem] Please: 1. Define the state (what does dp[i] represent?) 2. Write the recurrence relation 3. Identify base cases 4. Implement bottom-up solution 5. Discuss if space optimization is possible"

For Graph Problems:

"Solve this graph problem: [problem] Clarify: - Graph representation (adjacency list/matrix) - Directed or undirected - Weighted or unweighted - Algorithm choice (DFS/BFS/Dijkstra/etc.) Include graph construction from input."

Advanced Prompting Techniques

Self-Consistency Prompting

"Generate 3 different solutions to this problem. Then analyze which one is best and why. Finally, provide the optimal solution. Problem: [description]" # This makes the LLM reason about different approaches

Socratic Prompting

"Let's solve this problem together: Problem: [description] Question 1: What data structure would best suit this problem? [Wait for response] Question 2: What would be the time complexity of using that? [Wait for response] Question 3: Can we do better? If so, how? [Wait for response] Now implement the optimal solution."

Explain-Then-Code Pattern

"First, explain the algorithm in plain English without code. Walk through an example step by step. Only after the explanation is clear, provide the implementation. Problem: [description]" # Forces logical thinking before coding

Debugging with LLMs

When Code Fails:

"This code fails for test case: [test case] Expected: [expected output] Got: [actual output] Here's the code: [your code] Please: 1. Identify the bug 2. Explain why it fails 3. Provide the corrected code 4. Suggest similar edge cases to test"

Code Review Prompt:

"Review this solution: [your code] Check for: 1. Correctness 2. Edge cases 3. Time/space complexity 4. Code quality and readability 5. Potential optimizations Provide detailed feedback."

Anti-Patterns to Avoid

Don't:

  • Give vague prompts like "help with this problem"
  • Omit problem constraints and examples
  • Accept first solution without testing
  • Ask for code without understanding the approach first
  • Trust LLM blindly for complex mathematical proofs
  • Skip edge case discussion
  • Forget to specify language and style preferences

Testing LLM-Generated Solutions

"After providing the solution, generate: 1. Unit tests covering: - Happy path cases - Edge cases (empty, single element, max size) - Boundary conditions - Invalid inputs 2. Performance tests for large inputs 3. Comparison with expected output format"

Best Practices for LeetCode with LLMs

Practice Why
Start with understanding Ask LLM to explain problem before coding
Request multiple solutions Learn different approaches and trade-offs
Verify complexity claims LLMs can be wrong about Big-O analysis
Test edge cases LLM might miss rare boundary conditions
Understand before copying Learn the pattern for future problems
Iterate and refine First solution may not be optimal

Example: Complete LeetCode Session

# Step 1: Understanding
Prompt: "Explain the 'Container With Most Water' problem.
What's the intuition? What patterns does it use?"

# Step 2: Approach
Prompt: "What approaches can solve this? Compare brute force vs optimal.
Include complexity for each."

# Step 3: Implementation
Prompt: "Implement the optimal solution in Python with:
- Type hints
- Detailed comments
- Clear variable names"

# Step 4: Testing
Prompt: "Generate 5 test cases including edge cases.
Show the execution trace for one example."

# Step 5: Optimization
Prompt: "Can we reduce space complexity further?
Are there any micro-optimizations?"

# Step 6: Similar Problems
Prompt: "What are 3 similar problems that use the same two-pointer technique?"

Remember: LLMs are powerful tools for learning and problem-solving, but they work best when you:

  • Provide clear, detailed prompts
  • Think critically about responses
  • Test thoroughly
  • Understand rather than blindly copy
  • Use them as learning aids, not crutches

Projects & Practice Ideas

Hands-on projects to practice pre-training, post-training, and RLHF techniques

Pre-training Projects

Learn how to train language models from scratch or continue pre-training existing models.

Beginner: Train a Character-Level Language Model

Goal: Understand the fundamentals of autoregressive language modeling

Technical Requirements:

  • Dataset: Small text corpus (Shakespeare, Wikipedia subset, ~1-10MB)
  • Model: Small transformer (2-4 layers, 128-256 dim, ~1M parameters)
  • Hardware: CPU or single GPU (Google Colab free tier)
  • Framework: PyTorch or TensorFlow, Hugging Face Transformers
  • Time: 1-2 weeks

Key Learning Objectives:

  • Implement tokenization (character or BPE)
  • Build transformer architecture from scratch or adapt existing
  • Implement cross-entropy loss and next-token prediction
  • Track training metrics (loss, perplexity)
  • Generate text samples during training
  • Experiment with hyperparameters (learning rate, batch size, context length)

Deliverables: Trained model that generates coherent text in training domain style

Intermediate: Continue Pre-training a Small LLM

Goal: Adapt an existing pre-trained model to a specific domain

Technical Requirements:

  • Base Model: GPT-2 (124M), DistilGPT-2, or similar (~100M-350M params)
  • Dataset: Domain-specific corpus (medical papers, code, legal docs, 100MB-1GB)
  • Hardware: Single GPU (V100, A100, or T4 with gradient accumulation)
  • Framework: Hugging Face Transformers, DeepSpeed for optimization
  • Time: 2-4 weeks

Key Learning Objectives:

  • Load and freeze selective layers (optional)
  • Prepare domain-specific training data pipeline
  • Implement learning rate warmup and decay schedules
  • Use gradient accumulation for larger effective batch sizes
  • Evaluate on domain-specific benchmarks
  • Compare performance before/after domain adaptation
  • Implement checkpointing and resume training

Deliverables: Specialized model with measurably better domain performance

Advanced: Train a Small LLM from Scratch with Custom Tokenizer

Goal: Complete pre-training pipeline including data preprocessing and tokenizer training

Technical Requirements:

  • Dataset: Large diverse corpus (The Pile subset, C4, Common Crawl, 10GB+)
  • Model: Custom architecture (6-12 layers, 512-1024 dim, 100M-1B params)
  • Hardware: Multi-GPU setup or TPU (4-8 GPUs, distributed training)
  • Framework: PyTorch + DeepSpeed/FSDP, or JAX + Mesh TensorFlow
  • Time: 2-3 months

Key Learning Objectives:

  • Train custom BPE/WordPiece tokenizer on your corpus
  • Implement distributed data loading and shuffling
  • Use mixed precision training (FP16/BF16)
  • Implement gradient checkpointing for memory efficiency
  • Track scaling laws (loss vs compute)
  • Implement evaluation on diverse benchmarks (LAMBADA, HellaSwag, etc.)
  • Monitor training stability and implement interventions
  • Optimize throughput (tokens/sec, MFU - Model FLOPS Utilization)

Deliverables: Fully pre-trained model with documented training run, loss curves, and benchmark results

Post-training Projects (SFT, Instruction Tuning)

Take pre-trained models and fine-tune them to follow instructions and perform specific tasks.

Beginner: Fine-tune for Single Task (Classification/Summarization)

Goal: Understand supervised fine-tuning on a specific downstream task

Technical Requirements:

  • Base Model: FLAN-T5-small, GPT-2, or DistilGPT-2
  • Dataset: Single-task dataset (IMDB sentiment, CNN/DailyMail summarization, ~10K examples)
  • Hardware: Single GPU or Google Colab
  • Framework: Hugging Face Transformers + PEFT (LoRA)
  • Time: 1-2 weeks

Key Learning Objectives:

  • Format data as input-output pairs
  • Implement LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
  • Use appropriate loss masking (only compute loss on outputs; see the sketch after this project description)
  • Evaluate with task-specific metrics (accuracy, ROUGE, BLEU)
  • Compare full fine-tuning vs LoRA
  • Prevent overfitting with early stopping

Deliverables: Fine-tuned model that performs well on held-out test set
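
A minimal sketch of the loss-masking idea mentioned above: prompt tokens are given the label -100, which PyTorch's cross-entropy (and Hugging Face's causal-LM loss) ignores, so the loss is computed only on response tokens. The function and the example token IDs are illustrative.

def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response; mask the prompt so loss covers only the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Example with illustrative token IDs:
# prompt_ids   = [101, 2054, 2003]
# response_ids = [3437, 1012]
# labels       = [-100, -100, -100, 3437, 1012]
# → nn.CrossEntropyLoss(ignore_index=-100) skips the prompt positions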

Intermediate: Multi-Task Instruction Tuning

Goal: Create a model that can follow diverse instructions across multiple task types

Technical Requirements:

  • Base Model: LLaMA-2-7B, Mistral-7B, or Falcon-7B
  • Dataset: Multi-task instruction dataset (FLAN, Dolly, OpenOrca, ~50K-500K examples)
  • Hardware: Single A100/H100 or multiple smaller GPUs
  • Framework: Axolotl, LLaMA-Factory, or custom training loop
  • Time: 3-6 weeks

Key Learning Objectives:

  • Curate and mix instruction datasets from multiple sources
  • Implement proper instruction templates (chat format)
  • Use QLoRA for memory-efficient training of larger models
  • Balance dataset mixing ratios
  • Evaluate across multiple tasks (reasoning, knowledge, coding)
  • Implement generation sampling strategies (temperature, top-p, top-k)
  • Measure instruction-following capability qualitatively

Deliverables: Instruction-tuned model that generalizes to unseen task types

Advanced: Build a Domain Expert with Specialized SFT

Goal: Create a highly capable domain-specific assistant (e.g., medical, legal, coding)

Technical Requirements:

  • Base Model: LLaMA-2-13B/70B, Mistral-7B, or Code Llama
  • Dataset: Custom domain dataset + synthetic data generation (50K-1M examples)
  • Hardware: Multi-GPU setup (4-8 A100s) or cloud compute
  • Framework: DeepSpeed, FSDP, or Megatron-LM for large models
  • Time: 2-3 months

Key Learning Objectives:

  • Use strong models (GPT-4) to generate synthetic training data
  • Implement data quality filtering and deduplication
  • Create domain-specific evaluation benchmarks
  • Fine-tune with curriculum learning (easy → hard examples)
  • Implement safety guardrails and refusal behavior
  • Test for hallucinations and factual accuracy
  • Compare against general-purpose baselines

Deliverables: Production-ready domain expert with comprehensive evaluation report

RLHF (Reinforcement Learning from Human Feedback) Projects

Implement the complete RLHF pipeline to align models with human preferences.

Beginner: Implement Reward Modeling

Goal: Build a reward model that predicts human preferences between responses

Technical Requirements:

  • Base Model: Same architecture as your SFT model (smaller variant)
  • Dataset: Comparison dataset (Anthropic HH-RLHF, OpenAssistant, ~10K-50K pairs)
  • Hardware: Single GPU (T4/V100)
  • Framework: Hugging Face Transformers + TRL (Transformer Reinforcement Learning)
  • Time: 2-3 weeks

Key Learning Objectives:

  • Understand pairwise comparison loss (Bradley-Terry model; see the sketch after this project description)
  • Modify model head to output scalar reward score
  • Implement training loop with paired examples
  • Evaluate reward model accuracy on preference prediction
  • Analyze what the reward model has learned (correlations with length, style, etc.)
  • Test on out-of-distribution prompts

Deliverables: Trained reward model that accurately predicts human preferences (>60% accuracy)
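
A minimal sketch of the pairwise comparison loss referenced above (Bradley-Terry style, as used in InstructGPT-style reward modeling): the reward model scores the chosen and rejected responses, and the loss pushes the chosen score above the rejected one. The reward_model call is an illustrative stand-in for a model with a scalar head.

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """
    Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected).
    Both inputs are [batch] tensors of scalar rewards.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# chosen_rewards   = reward_model(chosen_input_ids)    # hypothetical scalar-head model
# rejected_rewards = reward_model(rejected_input_ids)
# loss = reward_model_loss(chosen_rewards, rejected_rewards)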

Intermediate: Full RLHF Pipeline (PPO)

Goal: Implement complete RLHF with Proximal Policy Optimization

Technical Requirements:

  • Base Model: Instruction-tuned model (GPT-2, LLaMA-7B)
  • Reward Model: From previous step or pre-trained
  • Hardware: 2-4 GPUs (need separate GPUs for policy, value, and reward models)
  • Framework: TRL (trl library), or DeepSpeed-Chat
  • Time: 4-8 weeks

Key Learning Objectives:

  • Implement PPO algorithm for language models
  • Manage four models: policy (actor), value (critic), reward, reference
  • Compute advantages using Generalized Advantage Estimation (GAE)
  • Add KL divergence penalty to prevent drift from reference model
  • Sample generations and compute rewards in batches
  • Monitor training stability (KL divergence, reward statistics, policy entropy)
  • Evaluate qualitative improvements in response quality

Deliverables: RLHF-trained model with measurably improved preference ratings vs SFT baseline

Advanced: Constitutional AI & DPO (Direct Preference Optimization)

Goal: Implement modern RLHF alternatives without explicit reward models

Technical Requirements:

  • Base Model: Large instruction-tuned model (LLaMA-2-13B+, Mistral-7B)
  • Dataset: Generate synthetic preferences using AI feedback (self-critique)
  • Hardware: 4-8 GPUs for larger models
  • Framework: Custom implementation or TRL with DPO support
  • Time: 2-3 months

Key Learning Objectives:

  • Implement Constitutional AI: use model to critique and revise its own outputs
  • Generate preference pairs through self-evaluation against principles
  • Implement DPO algorithm (simpler than PPO, no reward model needed)
  • Compare DPO vs PPO on same dataset
  • Design constitution (set of principles for model behavior)
  • Measure alignment on safety benchmarks (TruthfulQA, BBQ bias)
  • Test robustness to adversarial prompts

Deliverables: Aligned model using modern techniques with comprehensive safety evaluation

End-to-End Application Projects

Build complete systems that combine pre-training, post-training, and RLHF concepts.

Project Idea: Personal Code Assistant

Pipeline:

  • Pre-training: Continue pre-train Code Llama on your company's codebase
  • SFT: Fine-tune on code completion and debugging examples
  • RLHF: Use developer feedback (thumbs up/down) to refine suggestions

Technologies: VS Code extension, local model serving (vLLM), feedback collection UI

Project Idea: Custom Chatbot with Domain Expertise

Pipeline:

  • Pre-training: Continue training on domain-specific documents (medical/legal/financial)
  • SFT: Train on QA pairs and instruction examples from domain
  • RLHF: Collect expert feedback on response quality and factuality

Technologies: RAG (Retrieval-Augmented Generation), vector database (Pinecone/Weaviate), web interface

Project Idea: Creative Writing Assistant

Pipeline:

  • Pre-training: Train on large corpus of books, stories, creative writing
  • SFT: Fine-tune on story completion, character development prompts
  • RLHF: Use writer feedback to align on style, creativity, coherence

Technologies: React web app, streaming responses, style transfer controls

Learning Resources for Projects

Essential Libraries & Tools

Tool Purpose Best For
Hugging Face Transformers Pre-trained models, tokenizers, training utilities All projects, especially SFT
TRL (Transformer RL) RLHF, PPO, DPO implementations RLHF projects
PEFT (LoRA, QLoRA) Parameter-efficient fine-tuning Limited compute scenarios
DeepSpeed Distributed training, memory optimization Large model training
Axolotl Simplified training configs for LLMs Quick experiments
vLLM / TGI Fast inference serving Production deployment
Weights & Biases Experiment tracking, visualization All projects

Datasets for Practice

  • Pre-training: The Pile, C4, RedPajama, mC4 (multilingual)
  • Instruction Tuning: FLAN, Dolly-15k, OpenOrca, WizardLM, Alpaca
  • Coding: The Stack, CodeSearchNet, APPS, HumanEval
  • RLHF: Anthropic HH-RLHF, OpenAssistant Conversations, SHP (StackOverflow preferences)
  • Alignment: TruthfulQA, BBQ (bias), RealToxicityPrompts

Compute Requirements

Model Size Training (Full FT) Training (LoRA/QLoRA) Inference
1M params CPU CPU CPU
100M-350M 1x T4/V100 (16GB) 1x T4 (16GB) CPU or small GPU
1B-3B 1-2x A100 (40GB) 1x V100/T4 1x T4 or CPU (slow)
7B 4-8x A100 (80GB) 1x A100 (40GB) 1x A100 or 2x T4
13B-30B 8-16x A100 2x A100 (80GB) 2-4x A100
70B+ 16-32x A100/H100 4-8x A100 8x A100 or quantization
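
For the LoRA/QLoRA column above, a typical setup with Hugging Face Transformers + PEFT + bitsandbytes looks roughly like this (a sketch; the base model name is an example, and argument names should be checked against each library's current documentation, since they change between versions).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # example base model

# Load the base model in 4-bit (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters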

Getting Started Checklist

  1. Choose a project that matches your compute budget and timeline
  2. Set up development environment (PyTorch, Transformers, Weights & Biases)
  3. Start with smallest viable dataset to test pipeline end-to-end
  4. Establish baseline metrics before training
  5. Implement logging and checkpointing from day one
  6. Run small-scale experiments first (hyperparameter search on tiny model)
  7. Scale up only after validating pipeline works
  8. Document everything (model cards, training logs, evaluation results)
  9. Share your work (blog post, GitHub repo, Hugging Face model)