Introduction to Large Language Models
Understanding the fundamentals of LLMs and their revolutionary impact on AI
What are Large Language Models?
Large Language Models (LLMs) are neural networks trained on massive text datasets to predict, understand, and generate human language. They represent one of the most significant breakthroughs in artificial intelligence, powering applications from ChatGPT and Claude to code completion tools and research assistants.
Unlike traditional software that follows explicit rules, LLMs learn statistical patterns from data. By training on trillions of words from books, websites, scientific papers, and code repositories, they develop an implicit understanding of language structure, world knowledge, and reasoning patterns.
Key Characteristics
- Scale: From 7 billion to over 1 trillion parameters (weights) that encode knowledge and language patterns
- Training Data: Hundreds of billions to trillions of tokens from diverse text sources
- Architecture: Primarily based on the Transformer architecture (2017) with attention mechanisms
- Training Cost: Millions to tens of millions of dollars in computational resources
- Capabilities: Text generation, question answering, translation, summarization, code generation, mathematical reasoning, and more
- Context Window: Ability to process from 2K to 200K+ tokens (approximately 1.5K to 150K+ words) at once
The Evolution of Language Models
Language modeling has progressed through several distinct eras:
1950s-1990s: Statistical Language Models
N-gram models that predicted words based on previous N words. Simple but limited by memory and context constraints. Used in early speech recognition and machine translation.
2003-2013: Neural Language Models
Introduction of neural networks for language modeling. Bengio et al. (2003) showed neural networks could learn word embeddings. Word2Vec (2013) popularized dense word representations.
2013-2017: Sequence Models
LSTMs and GRUs enabled longer context understanding. Seq2Seq models (2014) revolutionized machine translation. But training was slow and sequential.
2017: The Transformer Revolution
"Attention Is All You Need" paper introduced the Transformer architecture. Parallel processing and attention mechanisms solved the long-range dependency problem.
2018-2019: Pre-training Era
BERT (110M-340M params), GPT-2 (1.5B params) demonstrated that pre-training on massive unlabeled data, then fine-tuning on specific tasks, dramatically improved performance.
2020-2022: Scaling Up
GPT-3 (175B params) showed emergent abilities at scale: few-shot learning, in-context learning. Models became general-purpose tools. This era popularized the idea that scale alone unlocks qualitatively new capabilities.
2022-Present: Alignment & Specialization
InstructGPT, ChatGPT showed that aligning models with human preferences (RLHF) makes them far more useful. Multimodal models (GPT-4, Gemini) integrate vision. Open-source models (LLaMA, Mistral) democratize access.
Scaling Laws: Bigger is Better
Research by OpenAI, DeepMind, and others revealed predictable scaling laws: model performance improves as a power law with respect to three factors: model size (parameters), dataset size (training tokens), and training compute.
This means loss decreases smoothly and predictably as any of these factors is scaled up. This predictability has driven the race to scale up models.
Try It: Understanding Scale
GPT-3 has 175 billion parameters. If each parameter were a grain of rice, you'd need 175 billion grains — at roughly 20-25 mg per grain, on the order of 3,500-4,500 metric tons of rice.
How Do LLMs Work?
At their core, LLMs are trained with a simple objective: predict the next token given all previous tokens. This is called autoregressive language modeling or causal language modeling.
This simple task, when performed at massive scale with billions of parameters, leads to emergent abilities - capabilities that appear suddenly at certain scale thresholds.
What LLMs Learn
Through next-token prediction on diverse data, LLMs develop:
Grammar & Syntax
Correct sentence structure, verb conjugation, agreement
World Knowledge
Facts, entities, events, relationships, common sense
Reasoning
Logical inference, mathematical reasoning, analogies
Context Understanding
Discourse, pronouns, implicit meaning, pragmatics
Task Patterns
Translation, summarization, question answering formats
Code & Formulas
Programming syntax, algorithms, mathematical notation
Emergent Abilities at Scale
Larger models develop abilities that smaller ones don't have. These capabilities "emerge" suddenly at certain parameter counts:
| Ability | Emerges At | Description |
|---|---|---|
| Few-shot Learning | ~13B params | Learn from just a few examples in the prompt |
| In-context Learning | ~13B params | Adapt behavior based on context without fine-tuning |
| Chain-of-Thought | ~100B params | Break down complex reasoning into steps |
| Multi-step Reasoning | ~100B params | Solve problems requiring multiple inference steps |
| Code Generation | ~10B params | Write functional code from natural language |
Key Limitations
Despite their impressive capabilities, LLMs have fundamental limitations:
- No True Understanding: LLMs are statistical pattern matchers, not conscious entities. They don't "understand" in a human sense.
- Hallucinations: Can generate plausible-sounding but factually incorrect information with high confidence.
- Knowledge Cutoff: Only know what was in their training data, with a specific cutoff date.
- Reasoning Limitations: Struggle with tasks requiring precise logic, counting, or mathematical operations.
- Context Length: Limited by maximum context window (though improving rapidly).
- Bias & Toxicity: Can reflect and amplify biases present in training data.
- No Memory: Each conversation starts fresh (unless explicitly provided context).
The Transformer Architecture
Deep dive into the revolutionary architecture that changed NLP
Understanding Transformers
The Transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., revolutionized natural language processing by replacing recurrence with pure attention mechanisms. It's the foundation of all modern LLMs including GPT, BERT, Claude, and LLaMA.
Key Innovation: Parallelization Through Attention
Previous architectures (RNNs, LSTMs) processed sequences sequentially: word 1 → word 2 → word 3. This was slow and struggled with long-range dependencies.
Transformers process entire sequences in parallel using attention, enabling:
- 100x faster training on modern GPUs
- Better long-range dependency modeling
- Scalability to billions of parameters
Paper: arxiv.org/abs/1706.03762
Three Types of Transformer Architectures
| Type | Structure | Use Cases | Examples |
|---|---|---|---|
| Encoder-Only | Bidirectional attention (sees entire input) | Classification, NER, Q&A | BERT, RoBERTa, DeBERTa |
| Decoder-Only | Causal attention (only sees past) | Text generation, chat, code | GPT-3/4, LLaMA, Claude |
| Encoder-Decoder | Encoder (bidirectional) + Decoder (causal) | Translation, summarization | T5, BART, mT5 |
Modern LLMs are almost all decoder-only (GPT-style) because they're more flexible and scale better.
Decoder-Only Transformer: Component Breakdown
Let's examine a GPT-style decoder-only model, the most common architecture for modern LLMs.
1. Token + Positional Embeddings
Token Embedding: Maps each token ID to a dense vector
Positional Encoding: Adds position information (attention is permutation-invariant)
Why positional encoding matters: Without it, "cat sat" and "sat cat" would be identical to the model. Position info is crucial for understanding order, syntax, and semantics.
2. Transformer Block (Repeated N times)
Each block contains two sub-layers: multi-head attention + feed-forward network, each with residual connections and layer normalization.
Key components:
- Residual connections (x + f(x)): Enable gradient flow in deep networks, prevent degradation
- Layer normalization: Stabilize training, allow higher learning rates
- FFN: Processes each position independently, adds non-linearity and capacity
- Pre-norm vs Post-norm: Modern models use pre-norm (norm before sublayer) for training stability
3. Output Head (Language Modeling)
Final layer projects hidden states to vocabulary logits for next-token prediction.
Weight tying: Often, the output projection weights are tied to input embedding weights (same parameters). This reduces parameters and improves performance.
Complete Decoder-Only Transformer Implementation
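Below is a minimal PyTorch sketch of a GPT-style decoder-only model: token and positional embeddings, pre-norm blocks with causal self-attention and a feed-forward network, and a weight-tied output head. The class names and hyperparameters are illustrative, not taken from any particular codebase.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)
        # Causal mask: position i may only attend to positions <= i
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)    # scaled dot product
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

class Block(nn.Module):
    """Pre-norm Transformer block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, max_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)      # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)         # learned positional embeddings
        self.blocks = nn.ModuleList(
            [Block(d_model, n_heads, max_len) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight                # weight tying

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)             # (B, T, d_model)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                        # (B, T, vocab_size) logits

model = TinyGPT(vocab_size=100)
logits = model(torch.randint(0, 100, (2, 32)))                # batch of 2 sequences, 32 tokens
print(logits.shape)                                           # torch.Size([2, 32, 100])
```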
Model Sizes and Configurations
GPT-2 Variants
| Model | Layers | d_model | Heads | Parameters |
|---|---|---|---|---|
| GPT-2 Small | 12 | 768 | 12 | 117M |
| GPT-2 Medium | 24 | 1024 | 16 | 345M |
| GPT-2 Large | 36 | 1280 | 20 | 774M |
| GPT-2 XL | 48 | 1600 | 25 | 1.5B |
Modern LLMs (2023-2024)
| Model | Layers | d_model | Heads | Parameters | Context |
|---|---|---|---|---|---|
| LLaMA-2 7B | 32 | 4096 | 32 | 7B | 4K |
| LLaMA-2 13B | 40 | 5120 | 40 | 13B | 4K |
| LLaMA-2 70B | 80 | 8192 | 64 | 70B | 4K |
| GPT-3 | 96 | 12288 | 96 | 175B | 2K-4K |
| GPT-4 (rumored) | 120+ | ? | ? | ~1.7T (MoE) | 8K-32K-128K |
Note: GPT-4's architecture details (parameter count, MoE structure) are unconfirmed rumors. OpenAI has not publicly disclosed these specifications.
Scaling pattern: Larger models generally have more layers, wider hidden dimensions, and more attention heads. The ratio d_model/num_heads is typically kept constant (64-128 per head).
Key Architectural Innovations Over Time
| Innovation | Description | Used In |
|---|---|---|
| Pre-LayerNorm | LayerNorm before sublayers (not after) | GPT-2, modern LLMs |
| GELU Activation | Smooth activation (instead of ReLU) | BERT, GPT-2+ |
| RoPE (Rotary Position Embedding) | Encodes relative positions via rotation | LLaMA, GPT-NeoX, PaLM |
| ALiBi (Attention with Linear Biases) | Adds linear bias to attention scores | BLOOM, MPT |
| SwiGLU | Gated FFN activation | PaLM, LLaMA |
| Grouped Query Attention (GQA) | Share K/V across query heads | LLaMA-2, Mistral |
| RMSNorm | Simplified LayerNorm (no bias/mean) | LLaMA, T5 |
Key Papers and Resources
Foundational Papers
- "Attention Is All You Need" (Vaswani et al., 2017) - The original Transformer
arxiv.org/abs/1706.03762 - "Improving Language Understanding by Generative Pre-Training" (Radford et al., 2018) - GPT-1
OpenAI GPT-1 Paper - "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019) - GPT-2
OpenAI GPT-2 Paper - "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2018)
arxiv.org/abs/1810.04805 - "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)
arxiv.org/abs/2302.13971
Architectural Improvements
- "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)
arxiv.org/abs/2104.09864 - "GLU Variants Improve Transformer" (Shazeer, 2020) - SwiGLU
arxiv.org/abs/2002.05202 - "Train Short, Test Long: Attention with Linear Biases" (Press et al., 2021) - ALiBi
arxiv.org/abs/2108.12409
Interactive Resources
- The Illustrated Transformer by Jay Alammar
  jalammar.github.io/illustrated-transformer
- The Annotated Transformer by Harvard NLP - Line-by-line implementation
  nlp.seas.harvard.edu/annotated-transformer
- nanoGPT by Andrej Karpathy - Minimal GPT implementation
  github.com/karpathy/nanoGPT
- minGPT by Andrej Karpathy - Educational GPT implementation
  github.com/karpathy/minGPT
Recommended Learning Path
- Read "The Illustrated Transformer" for visual understanding
- Study original "Attention Is All You Need" paper (focus on architecture)
- Work through "The Annotated Transformer" tutorial
- Implement a tiny Transformer (2-4 layers) from scratch in PyTorch
- Train on a small dataset (character-level language modeling)
- Study nanoGPT to see production-quality implementation
- Experiment with architectural variations (RoPE, SwiGLU, etc.)
The Attention Mechanism
Deep dive into the core innovation that powers Transformers
What is Attention?
Attention is a mechanism that allows models to dynamically focus on relevant parts of the input when processing each position. Unlike traditional models that process sequences word-by-word, attention allows the model to look at the entire sequence at once and decide what's important.
Intuition: When you read "The animal didn't cross the street because it was too tired", you know "it" refers to "animal" not "street". Attention allows models to make these connections by computing relevance scores between all word pairs.
Key Innovation from "Attention Is All You Need" (Vaswani et al., 2017)
The original Transformer paper introduced Scaled Dot-Product Attention, which replaced recurrence entirely with a parallelizable attention mechanism. This breakthrough enabled training on much longer sequences and led to the modern LLM revolution.
Paper: arxiv.org/abs/1706.03762
Scaled Dot-Product Attention: The Mathematics
The Complete Formula
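Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

where Q, K, and V are the query, key, and value matrices and d_k is the dimensionality of the key vectors.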
Step-by-Step Breakdown with Concrete Example
Let's walk through attention for the sentence: "The cat sat" (3 tokens)
Step 1: Create Q, K, V from Input Embeddings
Step 2: Compute Attention Scores (QK^T)
Step 3: Scale by √d_k
Why scale? Without scaling, large dot products → large softmax inputs → near-zero gradients. Scaling by √d_k keeps values in a reasonable range.
Step 4: Apply Softmax (Get Attention Weights)
Step 5: Weighted Sum of Values (Get Output)
Key Insight: Attention is a differentiable lookup mechanism. Instead of hard-indexing (like a dictionary), we do a soft weighted average where the weights are learned based on content similarity (Q·K).
Complete PyTorch Implementation
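A minimal sketch of scaled dot-product attention in PyTorch (the function name and shapes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k). Returns (output, attention weights)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (batch, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # each row sums to 1
    return weights @ v, weights

# Example: 3 tokens ("The cat sat"), d_k = 4
q = k = v = torch.randn(1, 3, 4)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)   # torch.Size([1, 3, 4]) torch.Size([1, 3, 3])
```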
Multi-Head Attention: Learning Multiple Relationships
Single attention head can only learn one type of relationship. Multi-Head Attention runs multiple attention mechanisms in parallel, each learning different patterns.
What Different Heads Learn
| Head | Pattern Learned | Example |
|---|---|---|
| Head 1 | Syntactic relationships | Verbs attend to subjects |
| Head 2 | Positional (next word) | Each word attends to next word |
| Head 3 | Rare words | Technical terms attend to definitions |
| Head 4 | Delimiter tokens | Commas, periods, quotes |
| Head 5-8 | Semantic relationships | Pronouns to antecedents, entities to attributes |
Multi-Head Attention Implementation
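A compact sketch of multi-head attention in PyTorch (dimensions and names are illustrative; production implementations usually fuse the Q/K/V projections and use optimized kernels such as Flash Attention):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # output projection after concatenating heads

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project, split d_model into (n_heads, head_dim), and move heads to a batch-like dim
        q = self.w_q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)   # (B, heads, T, T)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v                           # (B, heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)          # concatenate heads
        return self.w_o(out)

mha = MultiHeadAttention(d_model=64, n_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```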
Causal (Masked) Attention for Autoregressive Models
In decoder-only models (GPT, LLaMA), we need causal masking to prevent positions from attending to future positions during training.
Why Causal Masking?
- Training: When predicting token at position i, model can only use tokens 0 to i-1
- Inference: Ensures model behavior during training matches generation (no future information)
- Efficiency: Allows parallel training (all positions trained simultaneously, but with appropriate masking)
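A minimal illustration of how the mask is built; the lower-triangular matrix marks the allowed positions, and everything above the diagonal is blocked before the softmax:

```python
import torch

T = 5
# Row i has 1s in columns 0..i (allowed) and 0s above the diagonal (future positions)
causal_mask = torch.tril(torch.ones(T, T))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# Positions where the mask is 0 are set to -inf before softmax,
# so each token can only attend to itself and earlier tokens.
```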
Complexity Analysis
Time and Space Complexity
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Linear projections (Q, K, V) | O(n·d²) | O(d²) |
| Q·K^T (attention scores) | O(n²·d) | O(n²) |
| Softmax | O(n²) | O(n²) |
| Attn·V (weighted sum) | O(n²·d) | O(n²) |
| Total | O(n²·d + n·d²) | O(n² + d²) |
Where: n = sequence length, d = model dimension
The O(n²) Bottleneck
The quadratic complexity in sequence length is the main limitation of standard attention:
- 1K tokens: 1M attention scores to compute
- 10K tokens: 100M attention scores (100x more!)
- 100K tokens: 10B attention scores → OOM on most GPUs
This is why context length is expensive to scale!
Efficient Attention Variants
| Method | Complexity | Used In |
|---|---|---|
| Standard Attention | O(n²) | GPT-3, GPT-4, Claude |
| Flash Attention | O(n²) but memory-efficient | GPT-4, LLaMA-2, Modern LLMs |
| Sparse Attention | O(n√n) or O(n log n) | Longformer, BigBird |
| Linear Attention | O(n) | Linformer, Performer |
| State Space Models | O(n) | Mamba, RWKV |
Types of Attention Mechanisms
1. Self-Attention (Intra-Attention)
Each position attends to all positions in the same sequence. Used in encoder and decoder.
2. Cross-Attention (Encoder-Decoder Attention)
Decoder positions attend to encoder outputs. Used in seq2seq models (translation, summarization).
3. Masked Self-Attention (Causal Attention)
Prevents attention to future positions. Used in decoder-only models (GPT).
Common Pitfalls and Best Practices
Implementation Details That Matter
| Issue | Solution | Why It Matters |
|---|---|---|
| Numerical instability in softmax | Use scaling (√d_k) and stable softmax | Prevents vanishing/exploding gradients |
| Memory explosion for long sequences | Use Flash Attention or gradient checkpointing | O(n²) memory can OOM quickly |
| Forgetting positional information | Add positional encodings to input | Attention is permutation-invariant |
| Attention collapse (all weights uniform) | Use LayerNorm, dropout, proper init | Model fails to learn useful patterns |
| Information leakage in causal models | Strictly enforce causal masking | Training/inference mismatch |
Key Papers and Resources
Foundational Papers
- "Attention Is All You Need" (Vaswani et al., 2017) - The original Transformer
arxiv.org/abs/1706.03762 - "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014) - First attention mechanism
arxiv.org/abs/1409.0473 - "Flash Attention: Fast and Memory-Efficient Exact Attention" (Dao et al., 2022)
arxiv.org/abs/2205.14135 - "Longformer: The Long-Document Transformer" (Beltagy et al., 2020) - Sparse attention
arxiv.org/abs/2004.05150
Analysis and Interpretability
- "What Does BERT Look At?" (Clark et al., 2019) - Analyzing attention patterns
arxiv.org/abs/1906.04341 - "Analyzing Multi-Head Self-Attention" (Voita et al., 2019)
arxiv.org/abs/1905.09418
Interactive Resources
- The Illustrated Transformer by Jay Alammar
  jalammar.github.io/illustrated-transformer
- The Annotated Transformer by Harvard NLP
  nlp.seas.harvard.edu/annotated-transformer
- Attention? Attention! by Lil'Log
  lilianweng.github.io/posts/2018-06-24-attention
- BertViz - Visualize attention in BERT, GPT-2, etc.
  github.com/jessevig/bertviz
Video Tutorials
- Andrej Karpathy - "Let's build GPT: from scratch, in code, spelled out"
- Stanford CS224N - Lecture on Attention Mechanisms
- 3Blue1Brown - "But what is a GPT?" (visual intuition)
Recommended Learning Path
- Read "The Illustrated Transformer" for visual intuition
- Read sections 3.2-3.3 of "Attention Is All You Need" paper
- Implement scaled dot-product attention from scratch in NumPy
- Implement multi-head attention in PyTorch
- Use BertViz to visualize attention patterns in real models
- Read "What Does BERT Look At?" to understand what attention learns
- Experiment with Flash Attention for efficient implementation
Pre-training: Foundation Learning
Deep dive into how LLMs learn language from massive text corpora
What is Pre-training?
Pre-training is a self-supervised learning process where models learn patterns, structures, and knowledge from vast amounts of raw text without explicit human labels. This phase teaches the model fundamental language understanding before any task-specific fine-tuning.
Key Characteristics
- Self-supervised: Labels are created from the data itself (next token)
- Massive scale: Trillions of tokens, petabytes of data
- Expensive: Millions of dollars, months of GPU time
- One-time cost: Pre-trained models can be reused and fine-tuned
The Pre-training Objective: Next-Token Prediction
Modern decoder-only LLMs use causal language modeling: predict the next token given all previous tokens.
Mathematical Formulation
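In standard notation, causal language modeling minimizes the negative log-likelihood of each token given its prefix:

L(θ) = − Σ_t log P_θ(x_t | x_1, …, x_{t−1})

which is simply the cross-entropy between the model's predicted next-token distribution and the actual next token, averaged over all positions in the sequence.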
Concrete Example Walkthrough
Training on: "The cat sat on the mat"
Complete Training Loop Implementation
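A minimal sketch of the core training step rather than a full production loop (which would add learning-rate scheduling, evaluation, and checkpointing). Here `model` can be the TinyGPT sketch from the architecture section or any causal LM mapping token IDs to logits:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    # batch: (B, T+1) token IDs; inputs are positions 0..T-1, targets are positions 1..T
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)                                    # (B, T, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                  # (B*T, vocab_size)
        targets.reshape(-1),                                  # (B*T,)
    )
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping for stability
    optimizer.step()
    return loss.item()

# Usage sketch:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
# for step, batch in enumerate(dataloader):
#     loss = train_step(model, optimizer, batch)
#     if step % 100 == 0:
#         print(step, loss)
```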
Scaling Laws: How Performance Scales
Research shows that LLM performance follows predictable power laws with respect to model size, dataset size, and compute.
The Scaling Law Equations (Kaplan et al., 2020)
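In the form reported by Kaplan et al., test loss follows an approximate power law in each factor when the other two are not the bottleneck (the exponents below are empirical fits and vary somewhat across setups):

- L(N) ≈ (N_c / N)^0.076 — loss vs. parameter count N
- L(D) ≈ (D_c / D)^0.095 — loss vs. dataset size D (in tokens)
- L(C) ≈ (C_c / C)^0.050 — loss vs. training compute C

where N_c, D_c, and C_c are fitted constants.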
Important Note: These are the original Kaplan et al. (2020) scaling laws. The Chinchilla paper (Hoffmann et al., 2022) later revised these findings, showing that training should balance model size and data more equally than originally thought.
Optimal Allocation (Chinchilla Scaling)
Chinchilla paper finding (Hoffmann et al., 2022): For optimal performance, model size and training tokens should scale equally.
Chinchilla Optimal Rule:
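Training tokens ≈ 20 × parameters. Parameters and tokens should grow together under a fixed compute budget C ≈ 6·N·D; for example, a compute-optimal 10B-parameter model should see roughly 200B training tokens.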
Reference: Training Compute-Optimal Large Language Models (arxiv.org/abs/2203.15556)
| Compute Budget | Optimal Model Size | Optimal Training Tokens |
|---|---|---|
| 1e20 FLOPs | 400M params | 8B tokens |
| 1e21 FLOPs | 1B params | 20B tokens |
| 1e22 FLOPs | 3B params | 60B tokens |
| 1e23 FLOPs | 10B params | 200B tokens |
| 1e24 FLOPs | 30B params | 600B tokens |
| 1e25 FLOPs | 100B params | 2T tokens |
Training Data: Sources and Processing
Data Composition (Common Patterns)
| Source | Proportion | Purpose | Quality |
|---|---|---|---|
| Web Crawl (Common Crawl) | 60-70% | Diverse language, broad knowledge | Variable (needs filtering) |
| Books | 10-15% | Long-form reasoning, narrative | High |
| Wikipedia | 3-5% | Factual knowledge, encyclopedic | Very High |
| Code (GitHub) | 5-10% | Programming, structured reasoning | High |
| ArXiv (Scientific Papers) | 2-3% | Technical knowledge, math | Very High |
| Reddit (filtered) | 5-10% | Conversational, Q&A format | Medium (filtered) |
Data Processing Pipeline
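A typical pipeline runs language identification, quality filtering, deduplication, and PII/toxicity removal before tokenization. The sketch below illustrates the filtering and exact-deduplication steps in simplified form; the heuristics and thresholds are illustrative placeholders, not any specific production pipeline.

```python
import hashlib

def quality_filter(doc: str) -> bool:
    """Toy heuristics standing in for real quality filters (length, symbol ratio, etc.)."""
    words = doc.split()
    if len(words) < 50:                          # too short to be useful
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_ratio > 0.7                     # drop documents that are mostly symbols/markup

def dedup_key(doc: str) -> str:
    """Exact-duplicate detection via hashing (real pipelines also use MinHash for near-duplicates)."""
    return hashlib.md5(doc.strip().lower().encode("utf-8")).hexdigest()

def process(raw_docs):
    seen, kept = set(), []
    for doc in raw_docs:
        if not quality_filter(doc):
            continue
        key = dedup_key(doc)
        if key in seen:                          # skip exact duplicates
            continue
        seen.add(key)
        kept.append(doc)
    return kept
```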
Computational Cost Breakdown
Training Cost Estimates
| Model | Parameters | Training Tokens | Compute (FLOPs) | GPU-Hours (A100) | Est. Cost |
|---|---|---|---|---|---|
| GPT-2 | 1.5B | 10B | ~9e19 | ~1K | ~$2K |
| GPT-3 | 175B | 300B | ~3.1e23 | ~1M | ~$4-5M |
| LLaMA-2 7B | 7B | 2T | ~8.4e22 | ~180K | ~$300K |
| LLaMA-2 70B | 70B | 2T | ~8.4e23 | ~1.7M | ~$3M |
FLOP calculation: ~6ND FLOPs, where N = parameters, D = tokens (forward + backward pass)
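As a sanity check: for GPT-3, 6 × 175×10⁹ × 300×10⁹ ≈ 3.1×10²³ FLOPs, matching the figure in the table above.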
Why So Expensive?
- Matrix multiplications: Billions of parameters × billions of tokens
- Parallel training: Requires thousands of GPUs for reasonable time
- Memory bandwidth: Moving parameters and activations is costly
- Long training runs: Models must be trained to (near) convergence; stopping early wastes the compute already spent
Key Papers and Resources
Foundational Papers
- "Language Models are Few-Shot Learners" (Brown et al., 2020) - GPT-3
arxiv.org/abs/2005.14165 - "Scaling Laws for Neural Language Models" (Kaplan et al., 2020)
arxiv.org/abs/2001.08361 - "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) - Chinchilla
arxiv.org/abs/2203.15556 - "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)
arxiv.org/abs/2302.13971 - "The Pile: An 800GB Dataset of Diverse Text" (Gao et al., 2020)
arxiv.org/abs/2101.00027
Data Processing
- "Deduplicating Training Data Makes Language Models Better" (Lee et al., 2021)
arxiv.org/abs/2107.06499 - "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" (Kreutzer et al., 2021)
arxiv.org/abs/2103.12028
Practical Resources
- nanoGPT by Andrej Karpathy - Minimal training code
  github.com/karpathy/nanoGPT
- Scaling Laws Calculator - Estimate compute needs
  github.com/Alignment-Lab-AI/scaling-laws
- RedPajama - Open dataset for LLM training
  github.com/togethercomputer/RedPajama-Data
Recommended Learning Path
- Read GPT-3 paper to understand scale and capabilities
- Study Scaling Laws paper (Kaplan) - fundamental insights
- Read Chinchilla paper for compute-optimal training
- Implement character-level LM on tiny dataset (Shakespeare)
- Study nanoGPT training code
- Experiment with data processing (filtering, deduplication)
- Train GPT-2 scale model using Hugging Face
Post-training: Alignment & Specialization
Deep dive into SFT, RLHF, and alignment techniques
What is Post-training?
After pre-training, models undergo post-training (also called alignment) to become helpful assistants, specialized tools, or domain experts. This phase transforms raw language models into usable, safe AI systems aligned with human values and preferences.
Why Post-training is Critical
- Pre-trained models complete text (next-token prediction) but don't follow instructions
- No safety guardrails: Can generate harmful, biased, or toxic content
- Poor task performance: Not optimized for specific applications
- Lack of alignment: Don't understand human preferences or values
Post-training addresses these issues, transforming completion models into assistants.
The Three-Stage Pipeline
| Stage | Goal | Data Needed | Cost |
|---|---|---|---|
| 1. SFT | Teach instruction following | 10K-100K prompt-response pairs | Low (hours on single GPU) |
| 2. Reward Modeling | Learn human preferences | 10K-100K comparison pairs | Low (hours on single GPU) |
| 3. RLHF | Optimize for preferences | Prompts only (RL generates responses) | Medium (days on multiple GPUs) |
Stage 1: Supervised Fine-Tuning (SFT)
Train the model on high-quality instruction-response pairs to teach it how to follow instructions and behave as an assistant.
SFT Data Format
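A typical instruction-tuning record looks something like the following; the field names mirror the Alpaca-style format, and the exact schema and chat template vary by dataset and model family:

```python
# One SFT training example (field names are illustrative)
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are trained on massive text corpora ...",
    "output": "LLMs learn language patterns from large-scale text data.",
}
# At training time this is rendered into a single prompt + response string, e.g.:
# "### Instruction:\n...\n### Input:\n...\n### Response:\n<output>"
```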
SFT Training Objective
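The objective is the same next-token cross-entropy used in pre-training, but computed only on the response tokens: prompt tokens are masked out of the loss, so the model is optimized to produce the assistant's reply rather than to reproduce the instruction.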
Complete SFT Implementation
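A minimal sketch of the core SFT loss with prompt masking, assuming a Hugging Face-style causal LM and tokenizer. For clarity it tokenizes the prompt and the concatenated text as strings; real pipelines concatenate token IDs so the prompt/response boundary is exact.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str, device="cpu"):
    """Cross-entropy on response tokens only; prompt tokens are masked out of the loss."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)

    labels = full_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100        # -100 = ignore index for cross-entropy

    logits = model(full_ids).logits               # (1, T, vocab)
    # Shift so that position t predicts token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Usage sketch with Hugging Face (assumes a causal LM checkpoint such as "gpt2"):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# loss = sft_loss(model, tok, "### Instruction:\nSay hi.\n### Response:\n", "Hello!")
# loss.backward()
```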
Stage 2: Reward Modeling
Train a separate model to predict human preferences between different responses. This reward model will guide RLHF training.
Preference Data Collection
How Comparison Data is Created
- Sample prompts: Collect diverse user prompts
- Generate responses: SFT model generates 2-4 responses per prompt
- Human ranking: Raters rank responses (best to worst)
- Create pairs: All pairwise comparisons extracted
Reward Model Training (Bradley-Terry Model)
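Under the Bradley-Terry model, the reward model r_φ is trained so that, for the same prompt x, the chosen response y_w scores higher than the rejected response y_l:

loss = −log σ( r_φ(x, y_w) − r_φ(x, y_l) )

where σ is the logistic sigmoid. The loss is small when the chosen response already receives a clearly higher scalar reward.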
Reward Model Implementation
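A minimal sketch of a reward model and its pairwise loss, assuming a transformer backbone that returns hidden states of shape (batch, seq, d_model) — for example a Hugging Face model's last_hidden_state — and unpadded sequences for simplicity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Transformer backbone + scalar head; reads the final position's hidden state."""
    def __init__(self, backbone, d_model):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(d_model, 1)                # hidden state -> scalar reward

    def forward(self, input_ids):
        hidden = self.backbone(input_ids).last_hidden_state    # (B, T, d_model)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)   # (B,) one reward per sequence

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss: chosen should score higher than rejected."""
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```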
Stage 3: RLHF with PPO
Use reinforcement learning (specifically PPO - Proximal Policy Optimization) to optimize the model against the reward model while preventing it from deviating too far from the original.
RLHF Objective Function
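In the form used by InstructGPT, the policy π_θ is trained to maximize the reward-model score while staying close to the frozen reference (SFT) policy π_ref:

maximize_θ  E_{x∼D, y∼π_θ(·|x)} [ r_φ(x, y) ] − β · KL( π_θ(·|x) ‖ π_ref(·|x) )

The KL term, weighted by β, penalizes responses that drift too far from the SFT model, which guards against reward hacking and preserves fluency.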
The Complete RLHF Loop
Four Models Involved
- Policy (Actor): The model being trained (generates responses)
- Value Model (Critic): Estimates expected reward (helps compute advantages)
- Reward Model: Scores responses (frozen, from Stage 2)
- Reference Model: Original SFT model (frozen, prevents drift)
Simplified RLHF Training Loop
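The sketch below is a heavily simplified, REINFORCE-style update with a KL penalty rather than full PPO (which adds a value model, GAE advantages, ratio clipping, and multiple minibatch epochs). It assumes Hugging Face-style policy/reference models and the reward model from the previous section; names and hyperparameters are illustrative.

```python
import torch

def rlhf_step(policy, ref_model, reward_model, tokenizer, prompts, optimizer, beta=0.1):
    """One simplified RLHF update; not production PPO."""
    batch = tokenizer(prompts, return_tensors="pt", padding=True)

    # 1. Sample responses from the current policy
    responses = policy.generate(**batch, max_new_tokens=64, do_sample=True)

    # 2. Score full sequences with the frozen reward model
    with torch.no_grad():
        rewards = reward_model(responses)                      # (B,) scalar rewards

    # 3. Sequence log-probs under the policy and the frozen reference model
    #    (for clarity this sums over all positions; real code scores only generated tokens)
    def seq_logprob(model, ids):
        logits = model(ids).logits[:, :-1, :]
        logp = torch.log_softmax(logits, dim=-1)
        return logp.gather(-1, ids[:, 1:, None]).squeeze(-1).sum(dim=1)   # (B,)

    logp_policy = seq_logprob(policy, responses)
    with torch.no_grad():
        logp_ref = seq_logprob(ref_model, responses)

    # 4. KL-penalized reward, then a policy-gradient update
    kl = logp_policy - logp_ref                                # per-sequence log-ratio (KL estimate)
    shaped_reward = rewards - beta * kl.detach()
    loss = -(shaped_reward * logp_policy).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```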
Modern Alternative: Direct Preference Optimization (DPO)
DPO is a simpler alternative to RLHF that optimizes preferences directly without training a separate reward model or using RL.
DPO Objective
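For a prompt x with preferred response y_w and rejected response y_l, DPO minimizes

L_DPO = −log σ( β · [ log π_θ(y_w|x) − log π_ref(y_w|x) − log π_θ(y_l|x) + log π_ref(y_l|x) ] )

pushing the policy to widen the likelihood margin of the preferred response relative to the frozen reference model, with β controlling how strongly.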
Reference: Direct Preference Optimization (arxiv.org/abs/2305.18290)
Parameter-Efficient Fine-Tuning: LoRA
LoRA (Low-Rank Adaptation) allows fine-tuning large models efficiently by training small adapter matrices instead of all parameters.
LoRA Mathematical Formulation
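LoRA freezes the pre-trained weight matrix W ∈ R^{d×k} and learns a low-rank update ΔW = B·A, with B ∈ R^{d×r}, A ∈ R^{r×k}, and rank r ≪ min(d, k):

h = W·x + (α / r) · B·A·x

Only A and B are trained (r is typically 4-64), so the trainable parameter count drops by orders of magnitude; α is a scaling hyperparameter, and the learned update can be merged back into W for inference.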
LoRA Implementation
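A minimal sketch of a LoRA-wrapped linear layer (in practice the PEFT library handles this, including choosing which modules to wrap and merging the update back into the base weights):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # freeze pre-trained weights
        d_out, d_in = base.weight.shape
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # down-projection
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))         # up-projection, init to zero
        self.scaling = alpha / r                                  # update starts as a no-op

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable params vs ~16.8M in the frozen base layer
```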
Key Papers and Resources
Foundational RLHF Papers
- "Training language models to follow instructions with human feedback" (Ouyang et al., 2022) - InstructGPT
arxiv.org/abs/2203.02155 - "Learning to summarize from human feedback" (Stiennon et al., 2020) - Early RLHF
arxiv.org/abs/2009.01325 - "Fine-Tuning Language Models from Human Preferences" (Ziegler et al., 2019)
arxiv.org/abs/1909.08593
DPO and Alternatives
- "Direct Preference Optimization" (Rafailov et al., 2023)
arxiv.org/abs/2305.18290 - "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022) - Anthropic
arxiv.org/abs/2212.08073 - "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" (Lee et al., 2023)
arxiv.org/abs/2309.00267
Parameter-Efficient Fine-Tuning
- "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
arxiv.org/abs/2106.09685 - "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
arxiv.org/abs/2305.14314
Practical Libraries
- TRL (Transformer Reinforcement Learning) - Hugging Face RLHF library
  github.com/huggingface/trl
- PEFT - Parameter-Efficient Fine-Tuning (LoRA, etc.)
  github.com/huggingface/peft
- DeepSpeed-Chat - Microsoft's RLHF implementation
  github.com/microsoft/DeepSpeed
Recommended Learning Path
- Read InstructGPT paper (Ouyang et al.) - foundational RLHF
- Implement simple SFT on small instruction dataset
- Study LoRA paper and implement low-rank adaptation
- Read DPO paper - simpler alternative to RLHF
- Experiment with TRL library for RLHF
- Read Constitutional AI for AI feedback approaches
- Implement full RLHF pipeline on toy problem
Essential Concepts
Deep dive into tokens, embeddings, sampling, and more
Tokens & Tokenization
Tokens are the atomic units that LLMs process. Modern LLMs don't operate on characters or words directly - they use subword tokenization to balance vocabulary size and coverage.
Why Subword Tokenization?
- Character-level: Tiny vocabulary (~256), but very long sequences → expensive
- Word-level: Huge vocabulary (millions), can't handle unseen words (OOV problem)
- Subword-level: Balanced! ~50K vocab, handles rare words via subword splitting
Byte Pair Encoding (BPE)
The most common tokenization algorithm used by GPT, LLaMA, and most modern LLMs.
Tokenization Examples
| Text | Tokens (GPT-2 BPE) | Token Count |
|---|---|---|
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "ChatGPT" | ["Chat", "G", "PT"] | 3 |
| "antidisestablishmentarianism" | ["ant", "idis", "establish", "ment", "arian", "ism"] | 6 |
| " strawberry" (with space) | [" straw", "berry"] | 2 |
| "🤖🚀" | ["�", "�", "�", "�", "�", "�", "�", "�"] | 8 (UTF-8 bytes) |
Key insight: Spaces are part of tokens! " world" ≠ "world". This affects prompting and token counting.
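You can check token counts yourself with OpenAI's tiktoken library; a small example using the GPT-2 encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")           # GPT-2 BPE (50,257 tokens)
tokens = enc.encode("Hello, world!")
print(tokens)                                  # token IDs
print([enc.decode([t]) for t in tokens])       # ['Hello', ',', ' world', '!']
print(len(enc.encode(" strawberry")))          # leading space is part of the first token
```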
Vocabulary Sizes
- GPT-2: 50,257 tokens
- GPT-3: ~50,000 tokens (same BPE family as GPT-2)
- GPT-3.5/GPT-4: ~100,000 tokens (cl100k_base)
- LLaMA: 32,000 tokens (SentencePiece)
- LLaMA-2: 32,000 tokens
Embeddings: From Tokens to Vectors
Embeddings convert discrete tokens into continuous vector representations that capture semantic meaning.
Embedding Layer
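In PyTorch this is just nn.Embedding, a learnable lookup table with one d_model-dimensional vector per vocabulary entry; the sizes below match GPT-2 small, and the token IDs are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_257, 768              # GPT-2-small-sized embedding table
embedding = nn.Embedding(vocab_size, d_model)  # one learnable vector per token ID

token_ids = torch.tensor([[15496, 11, 995]])   # a batch of token IDs (values are illustrative)
vectors = embedding(token_ids)
print(vectors.shape)                           # torch.Size([1, 3, 768])
```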
Embedding Properties
Semantic Similarity
Tokens with similar meanings have similar vectors (high cosine similarity).
Famous Word Analogies
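The classic Word2Vec result: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). Similar arithmetic captures other relations, such as country-capital ("Paris" − "France" + "Italy" ≈ "Rome").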
Positional Embeddings
Since attention is permutation-invariant, we add positional information to embeddings.
Reference: "Attention Is All You Need" introduced sinusoidal positional encodings. Modern models (GPT, BERT) use learned positional embeddings. LLaMA uses RoPE (Rotary Position Embedding).
Context Window: The Attention Bottleneck
The context window (or context length) is the maximum number of tokens the model can process at once. This is a hard architectural limit.
Why Context Matters
- Hard limit: Model can only "see" last N tokens
- Affects memory: Can't reference information beyond context
- Quadratic cost: O(n²) attention computation
- Determines use cases: Short contexts → chat; Long contexts → document analysis
Context Lengths of Modern LLMs
| Model | Context Length | ~Word Count | ~Pages (250 words/page) |
|---|---|---|---|
| GPT-3 | 2,048 tokens | ~1,500 words | ~6 pages |
| GPT-4 (8K) | 8,192 tokens | ~6,000 words | ~24 pages |
| GPT-4 (32K) | 32,768 tokens | ~24,000 words | ~96 pages |
| GPT-4 Turbo | 128,000 tokens | ~96,000 words | ~384 pages |
| Claude 2 | 100,000 tokens | ~75,000 words | ~300 pages |
| Claude 3 | 200,000 tokens | ~150,000 words | ~600 pages |
| LLaMA-2 | 4,096 tokens | ~3,000 words | ~12 pages |
Context Window Trade-offs
Larger context = more memory and compute
- Memory: Storing attention matrix: O(n²) memory
- Compute: Computing attention: O(n²·d) FLOPs
- Latency: Longer sequences → slower generation
Sampling & Decoding Strategies
How models generate text by sampling from probability distributions over tokens.
Temperature Scaling
Temperature controls the randomness of predictions by scaling logits before softmax.
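A small example showing the effect; the logits are made up for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])       # raw next-token scores (illustrative)

for T in [0.5, 1.0, 2.0]:
    probs = F.softmax(logits / T, dim=-1)           # temperature scales logits before softmax
    print(T, [round(p, 3) for p in probs.tolist()])
# Low T (<1) sharpens the distribution toward the top token; high T (>1) flattens it.
```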
Sampling Strategies Comparison
| Strategy | How It Works | Use Cases |
|---|---|---|
| Greedy | Always pick highest probability token | Factual Q&A, translation (deterministic) |
| Beam Search | Keep top K sequences at each step | Machine translation, summarization |
| Top-K Sampling | Sample from top K most likely tokens | Creative writing (K=40-100) |
| Top-P (Nucleus) | Sample from smallest set with cumulative prob ≥ P | Most versatile (P=0.9-0.95) |
| Temperature + Top-P | Combine both for fine control | Production systems (T=0.7, P=0.9) |
Top-K vs Top-P (Nucleus) Sampling
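Minimal sketches of both strategies, operating on a single logits vector; production code works on batches and usually combines these with temperature:

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits, k=50):
    """Keep only the k highest-probability tokens, renormalize, then sample."""
    topk = torch.topk(logits, k)
    probs = F.softmax(topk.values, dim=-1)
    return topk.indices[torch.multinomial(probs, 1)]

def top_p_sample(logits, p=0.9):
    """Nucleus sampling: keep the smallest set of tokens with cumulative probability >= p."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum()) + 1            # include the token that crosses p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return sorted_idx[torch.multinomial(kept, 1)]

logits = torch.randn(50_257)                            # fake next-token logits
print(top_k_sample(logits, k=40), top_p_sample(logits, p=0.95))
```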
Recommended Settings
| Task | Temperature | Top-P | Notes |
|---|---|---|---|
| Factual Q&A | 0.0-0.3 | 1.0 | Deterministic, accurate |
| Code generation | 0.2-0.5 | 0.95 | Mostly deterministic |
| Chat assistant | 0.7-0.9 | 0.9 | Balanced |
| Creative writing | 0.9-1.2 | 0.95 | Diverse, creative |
| Brainstorming | 1.0-1.5 | 1.0 | Very diverse |
Key Papers and Resources
Tokenization
- "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016) - BPE
arxiv.org/abs/1508.07909 - "SentencePiece: A simple and language independent approach" (Kudo & Richardson, 2018)
arxiv.org/abs/1808.06226
Embeddings
- "Efficient Estimation of Word Representations" (Mikolov et al., 2013) - Word2Vec
arxiv.org/abs/1301.3781 - "GloVe: Global Vectors for Word Representation" (Pennington et al., 2014)
nlp.stanford.edu/pubs/glove.pdf
Sampling Strategies
- "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2019) - Nucleus sampling
arxiv.org/abs/1904.09751
Long Context
- "Flash Attention: Fast and Memory-Efficient Exact Attention" (Dao et al., 2022)
arxiv.org/abs/2205.14135 - "Extending Context Window of Large Language Models via RoPE" (Chen et al., 2023)
arxiv.org/abs/2309.16039
Practical Tools
- tiktoken - OpenAI's fast BPE tokenizer
  github.com/openai/tiktoken
- Hugging Face Tokenizers - Fast tokenizer library
  github.com/huggingface/tokenizers
Recommended Learning Path
- Experiment with tiktoken to see how text is tokenized
- Read BPE paper (Sennrich et al.) to understand the algorithm
- Implement simple embedding layer in PyTorch
- Read Word2Vec paper for embedding intuition
- Experiment with temperature and top-p sampling
- Read Nucleus sampling paper (Holtzman et al.)
- Implement all sampling strategies from scratch
Popular LLM Families
Overview of major language models and their characteristics
Major LLM Families
GPT Series
OpenAI's Generative Pre-trained Transformer models (GPT-3, GPT-4)
Claude
Anthropic's constitutional AI assistant with strong reasoning
LLaMA
Meta's open-source models enabling research and development
BERT
Google's bidirectional encoder for understanding tasks
PaLM/Gemini
Google's latest multimodal AI models
Mistral
Efficient open-source models from Mistral AI
Model Comparison
| Model | Parameters | Context | Notable Feature |
|---|---|---|---|
| GPT-4 | Unknown | 8K-128K tokens | Multimodal capabilities |
| Claude 3 | Unknown | 200K tokens | Long context, strong reasoning |
| LLaMA 2 | 7B-70B | 4K tokens | Open source |
| Mixtral 8x7B | 47B | 32K tokens | Mixture of Experts |
Learning Resources
Continue your journey into LLMs and AI
Essential Papers
- "Attention Is All You Need" - The original Transformer paper (Vaswani et al., 2017)
- "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2018)
- "Language Models are Few-Shot Learners" - GPT-3 paper (Brown et al., 2020)
- "Training language models to follow instructions with human feedback" - InstructGPT/RLHF (Ouyang et al., 2022)
- "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022)
Online Courses & Tutorials
- Stanford CS224N: Natural Language Processing with Deep Learning
- Hugging Face NLP Course: Free comprehensive course on transformers
- Fast.ai: Practical Deep Learning for Coders
- DeepLearning.AI: Courses on Transformers and LLMs
- Andrej Karpathy: YouTube tutorials on building GPT from scratch
Tools to Experiment
- Hugging Face: Pre-trained models, datasets, and the Transformers library
- OpenAI Playground: Experiment with GPT models
- Google Colab: Free GPU/TPU for training and experimenting
- PyTorch / TensorFlow: Deep learning frameworks
- LangChain: Framework for building LLM applications
Communities & Blogs
- r/MachineLearning: Reddit community for ML research
- Papers with Code: Latest research with implementations
- The Gradient: Online magazine about AI research
- Distill.pub: Clear explanations of ML concepts
- AI Alignment Forum: Discussions on AI safety and alignment
Next Steps
- Read the foundational Transformer paper
- Experiment with models on Hugging Face
- Build a simple project using LLM APIs
- Follow AI research communities
- Consider specializing in an area (safety, applications, research)
Sources & References
This guide draws from foundational papers and research in the field of large language models. Below are key citations and resources:
Foundational Papers
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need"
The original Transformer architecture paper
https://arxiv.org/abs/1706.03762
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
Introduced bidirectional pre-training and masked language modeling
https://arxiv.org/abs/1810.04805
Radford, A., Wu, J., Child, R., et al. (2019). "Language Models are Unsupervised Multitask Learners"
GPT-2: Demonstrated zero-shot task transfer capabilities
OpenAI GPT-2 Paper
Scaling & Training
Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models"
Original scaling laws for LLMs
https://arxiv.org/abs/2001.08361
Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models"
Chinchilla paper: Revised scaling laws showing optimal compute allocation
https://arxiv.org/abs/2203.15556
Optimization & Efficiency
Dao, T., Fu, D. Y., Ermon, S., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
Efficient attention mechanism using tiling and kernel fusion
https://arxiv.org/abs/2205.14135
Alignment & Safety
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback"
Self-improvement and alignment through AI-generated feedback
https://arxiv.org/abs/2212.08073
Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
DPO: Simplified alignment method without explicit reward modeling
https://arxiv.org/abs/2305.18290
Emergent Capabilities
Wei, J., Tay, Y., Bommasani, R., et al. (2022). "Emergent Abilities of Large Language Models"
Documents qualitative improvements that emerge at scale
https://arxiv.org/abs/2206.07682
Note: This field moves rapidly. For the latest research, check arXiv.org, Papers with Code, and conference proceedings (NeurIPS, ICML, ACL, EMNLP).
How LLMs Understand and Generate Code
Deep dive into code comprehension, generation, and the patterns LLMs recognize
Code as a Language
LLMs treat programming code as just another language. During pre-training, models see billions of lines of code from GitHub, Stack Overflow, documentation, and tutorials. They learn:
- Syntax patterns: Language-specific grammar, keywords, operators
- Semantic relationships: How variables, functions, and classes relate
- Common idioms: Standard patterns like loops, recursion, error handling
- API usage: How to use libraries and frameworks correctly
- Code structure: File organization, imports, module patterns
- Documentation patterns: Docstrings, comments, type hints
What Makes Code Different from Natural Language
Precise Syntax
One missing semicolon breaks everything. No room for ambiguity.
Logical Structure
Code follows strict logical flow: control structures, state changes, algorithms.
Compositionality
Complex programs built from simple, reusable components.
Execution Semantics
Code must be executable. The computer is the ultimate judge.
Code Training Data
Modern code-capable LLMs are trained on massive code repositories:
| Source | Size | Content |
|---|---|---|
| GitHub | 100M+ repositories | Open source code in all languages |
| Stack Overflow | 20M+ questions | Q&A with code examples |
| Documentation | Billions of tokens | Official docs, tutorials, guides |
| LeetCode/HackerRank | 100K+ problems | Algorithmic problems with solutions |
Common Programming Patterns LLMs Learn
Through exposure to millions of code examples, LLMs develop strong intuitions about common patterns:
1. Data Structure Patterns
2. Algorithm Patterns
3. Edge Case Handling
How LLMs Generate Code
When generating code, LLMs follow a probabilistic process:
1. Understand the Problem
Parse the natural language description, identify key requirements, constraints, and expected inputs/outputs. Extract data structure needs and algorithmic requirements.
2. Recall Similar Problems
Attention mechanism activates memories of similar problems seen during training. Pattern matching against thousands of known solutions.
3. Select High-Level Approach
Choose algorithm family (greedy, DP, divide-and-conquer, etc.) based on problem characteristics. Select appropriate data structures.
4. Generate Structure
Start with function signature, parameter types. Add main logic scaffold (loops, conditionals, base cases). Structure follows learned templates.
5. Fill in Details
Token by token generation of the actual logic. Each token is predicted based on all previous context. Variable names follow conventions.
6. Add Error Handling
Insert edge case checks, validation, and error handling based on learned patterns from high-quality code.
Strengths of LLMs in Coding
- Pattern Recognition: Excellent at recognizing which algorithm pattern fits a problem
- Boilerplate: Great at generating standard code structures and setup
- Syntax: Strong knowledge of syntax across many programming languages
- Common Algorithms: Can implement standard algorithms (sorting, searching, graph traversal)
- Code Translation: Can convert code between programming languages
- Explanation: Can explain what code does line by line
- Debugging: Can spot common bugs and suggest fixes
Limitations in Coding
- Complex Logic: Struggles with highly nested or intricate logical conditions
- Novel Algorithms: Can't invent truly new algorithmic approaches
- Mathematical Proofs: Can't rigorously prove correctness
- Optimization: May not generate the most efficient solution
- Edge Cases: Might miss rare or subtle edge cases
- Testing: Generated code needs human verification
- Large Codebases: Context window limits understanding of large projects
Algorithmic Reasoning in LLMs
How LLMs solve algorithmic problems and approach LeetCode-style challenges
Understanding Algorithmic Reasoning
Algorithmic reasoning requires step-by-step logical thinking, understanding of complexity, and systematic problem-solving. LLMs develop these capabilities through exposure to millions of algorithm implementations and explanations.
How LLMs Learn Algorithms
During training, LLMs see algorithms presented in multiple forms:
Code Implementations
Working code in Python, Java, C++, etc.
Pseudocode
High-level algorithmic descriptions
Natural Language
Written explanations and tutorials
Examples & Traces
Step-by-step execution examples
Core Algorithm Categories LLMs Understand
1. Searching & Sorting (Mastery: High)
2. Dynamic Programming (Mastery: Medium-High)
3. Graph Algorithms (Mastery: Medium)
4. Tree Algorithms (Mastery: High)
Problem-Solving Strategies LLMs Learn
| Strategy | When to Use | Example Problems |
|---|---|---|
| Two Pointers | Sorted arrays, strings, linked lists | Two Sum II, Container With Most Water |
| Sliding Window | Subarray/substring problems | Longest Substring, Max Sum Subarray |
| Fast & Slow Pointers | Cycle detection, middle element | Linked List Cycle, Happy Number |
| Merge Intervals | Overlapping intervals | Merge Intervals, Meeting Rooms |
| Top K Elements | Finding largest/smallest K items | Kth Largest, Top K Frequent |
| Binary Search | Sorted data, search space reduction | Search in Rotated Array, Find Min |
| Backtracking | All combinations, permutations | Subsets, N-Queens, Sudoku |
| Dynamic Programming | Optimal substructure, overlapping | Coin Change, Longest Palindrome |
Time Complexity Recognition
LLMs develop intuition about complexity by seeing code annotated with Big-O notation:
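A small illustration of the kind of annotated code this refers to — three versions of a duplicate check, each labeled with its complexity (a standard textbook example, not drawn from any particular dataset):

```python
def contains_duplicate_sort(nums):       # O(n log n) time, O(n) space: sort a copy, then scan
    nums = sorted(nums)
    return any(nums[i] == nums[i + 1] for i in range(len(nums) - 1))

def contains_duplicate_hash(nums):       # O(n) time, O(n) space: trade memory for speed
    seen = set()
    for x in nums:
        if x in seen:
            return True
        seen.add(x)
    return False

def contains_duplicate_brute(nums):      # O(n^2) time, O(1) space: compare every pair
    return any(nums[i] == nums[j]
               for i in range(len(nums))
               for j in range(i + 1, len(nums)))
```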
LLM Reasoning Process for LeetCode Problems
Step 1: Pattern Recognition
Identify which algorithmic pattern the problem fits. Keywords trigger pattern matching: "sorted array" → binary search, "subarray" → sliding window, "paths" → DFS/BFS.
Step 2: Data Structure Selection
Choose appropriate data structures based on operations needed. HashMap for fast lookups, heap for priority, stack for LIFO, queue for BFS.
Step 3: Algorithm Template
Apply learned template for the identified pattern. Start with the scaffolding that fits the problem type.
Step 4: Implementation
Fill in problem-specific logic within the template structure. Handle indices, boundaries, update rules.
Step 5: Edge Cases
Add checks for empty inputs, single elements, negative numbers, duplicates, etc.
What LLMs Struggle With
- Novel Problem Types: Completely new problems without similar training examples
- Multi-step Mathematical Reasoning: Problems requiring careful mathematical derivation
- Subtle Optimizations: Finding the most space-efficient or time-efficient solution
- Complex State Management: Problems with many interacting state variables
- Proof of Correctness: Formally proving an algorithm is correct
- Rare Edge Cases: Unusual boundary conditions not well-represented in training
Prompting LLMs for Coding Problems
Master techniques for getting the best solutions from LLMs for LeetCode-style problems
Effective Prompting Strategies
How you ask an LLM to solve a coding problem dramatically affects the quality of the solution. Well-structured prompts lead to better, more reliable code.
The Anatomy of a Good Coding Prompt
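A solid coding prompt usually spells out four things (the wording below is illustrative):

- The full problem statement, copied verbatim, including constraints and the provided examples
- Language and style requirements (e.g., "Python 3, standard library only, include type hints")
- What you want first (e.g., "explain your approach and its time/space complexity before writing code")
- The expected output format (e.g., "one function with a docstring, followed by three test cases")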
Essential Prompting Techniques
1. Chain-of-Thought Prompting
Ask the LLM to think step-by-step before coding:
2. Few-Shot Learning
Provide examples of similar solved problems:
3. Specify Constraints and Requirements
4. Ask for Multiple Approaches
5. Iterative Refinement
Build on previous responses:
Prompting Patterns for Common Problem Types
For Array Problems:
For Tree Problems:
For Dynamic Programming:
For Graph Problems:
Advanced Prompting Techniques
Self-Consistency Prompting
Socratic Prompting
Explain-Then-Code Pattern
Debugging with LLMs
When Code Fails:
Code Review Prompt:
Anti-Patterns to Avoid
Don't:
- Give vague prompts like "help with this problem"
- Omit problem constraints and examples
- Accept first solution without testing
- Ask for code without understanding the approach first
- Trust LLM blindly for complex mathematical proofs
- Skip edge case discussion
- Forget to specify language and style preferences
Testing LLM-Generated Solutions
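A minimal harness for exercising a generated solution before trusting it; `two_sum` below stands in for whatever function the model produced, and the cases cover a normal input, duplicates, no-solution, and empty input:

```python
def two_sum(nums, target):
    """Stand-in for an LLM-generated solution: return indices of two numbers summing to target."""
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

tests = [
    (([2, 7, 11, 15], 9), [0, 1]),     # normal case
    (([3, 3], 6), [0, 1]),             # duplicates
    (([1], 5), []),                    # no solution / single element
    (([], 0), []),                     # empty input
]
for args, expected in tests:
    result = two_sum(*args)
    assert result == expected, (args, result, expected)
print("all tests passed")
```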
Best Practices for LeetCode with LLMs
| Practice | Why |
|---|---|
| Start with understanding | Ask LLM to explain problem before coding |
| Request multiple solutions | Learn different approaches and trade-offs |
| Verify complexity claims | LLMs can be wrong about Big-O analysis |
| Test edge cases | LLM might miss rare boundary conditions |
| Understand before copying | Learn the pattern for future problems |
| Iterate and refine | First solution may not be optimal |
Example: Complete LeetCode Session
Remember: LLMs are powerful tools for learning and problem-solving, but they work best when you:
- Provide clear, detailed prompts
- Think critically about responses
- Test thoroughly
- Understand rather than blindly copy
- Use them as learning aids, not crutches
Projects & Practice Ideas
Hands-on projects to practice pre-training, post-training, and RLHF techniques
Pre-training Projects
Learn how to train language models from scratch or continue pre-training existing models.
Beginner: Train a Character-Level Language Model
Goal: Understand the fundamentals of autoregressive language modeling
Technical Requirements:
- Dataset: Small text corpus (Shakespeare, Wikipedia subset, ~1-10MB)
- Model: Small transformer (2-4 layers, 128-256 dim, ~1M parameters)
- Hardware: CPU or single GPU (Google Colab free tier)
- Framework: PyTorch or TensorFlow, Hugging Face Transformers
- Time: 1-2 weeks
Key Learning Objectives:
- Implement tokenization (character or BPE)
- Build transformer architecture from scratch or adapt existing
- Implement cross-entropy loss and next-token prediction
- Track training metrics (loss, perplexity)
- Generate text samples during training
- Experiment with hyperparameters (learning rate, batch size, context length)
Deliverables: Trained model that generates coherent text in training domain style
Intermediate: Continue Pre-training a Small LLM
Goal: Adapt an existing pre-trained model to a specific domain
Technical Requirements:
- Base Model: GPT-2 (124M), DistilGPT-2, or similar (~100M-350M params)
- Dataset: Domain-specific corpus (medical papers, code, legal docs, 100MB-1GB)
- Hardware: Single GPU (V100, A100, or T4 with gradient accumulation)
- Framework: Hugging Face Transformers, DeepSpeed for optimization
- Time: 2-4 weeks
Key Learning Objectives:
- Load and freeze selective layers (optional)
- Prepare domain-specific training data pipeline
- Implement learning rate warmup and decay schedules
- Use gradient accumulation for larger effective batch sizes
- Evaluate on domain-specific benchmarks
- Compare performance before/after domain adaptation
- Implement checkpointing and resume training
Deliverables: Specialized model with measurably better domain performance
Advanced: Train a Small LLM from Scratch with Custom Tokenizer
Goal: Complete pre-training pipeline including data preprocessing and tokenizer training
Technical Requirements:
- Dataset: Large diverse corpus (The Pile subset, C4, Common Crawl, 10GB+)
- Model: Custom architecture (6-12 layers, 512-1024 dim, 100M-1B params)
- Hardware: Multi-GPU setup or TPU (4-8 GPUs, distributed training)
- Framework: PyTorch + DeepSpeed/FSDP, or JAX/Flax with model parallelism
- Time: 2-3 months
Key Learning Objectives:
- Train custom BPE/WordPiece tokenizer on your corpus
- Implement distributed data loading and shuffling
- Use mixed precision training (FP16/BF16)
- Implement gradient checkpointing for memory efficiency
- Track scaling laws (loss vs compute)
- Implement evaluation on diverse benchmarks (LAMBADA, HellaSwag, etc.)
- Monitor training stability and implement interventions
- Optimize throughput (tokens/sec, MFU - Model FLOPS Utilization)
Deliverables: Fully pre-trained model with documented training run, loss curves, and benchmark results
Post-training Projects (SFT, Instruction Tuning)
Take pre-trained models and fine-tune them to follow instructions and perform specific tasks.
Beginner: Fine-tune for Single Task (Classification/Summarization)
Goal: Understand supervised fine-tuning on a specific downstream task
Technical Requirements:
- Base Model: FLAN-T5-small, GPT-2, or DistilGPT-2
- Dataset: Single-task dataset (IMDB sentiment, CNN/DailyMail summarization, ~10K examples)
- Hardware: Single GPU or Google Colab
- Framework: Hugging Face Transformers + PEFT (LoRA)
- Time: 1-2 weeks
Key Learning Objectives:
- Format data as input-output pairs
- Implement LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
- Use appropriate loss masking (only compute loss on outputs)
- Evaluate with task-specific metrics (accuracy, ROUGE, BLEU)
- Compare full fine-tuning vs LoRA
- Prevent overfitting with early stopping
Deliverables: Fine-tuned model that performs well on held-out test set
Intermediate: Multi-Task Instruction Tuning
Goal: Create a model that can follow diverse instructions across multiple task types
Technical Requirements:
- Base Model: LLaMA-2-7B, Mistral-7B, or Falcon-7B
- Dataset: Multi-task instruction dataset (FLAN, Dolly, OpenOrca, ~50K-500K examples)
- Hardware: Single A100/H100 or multiple smaller GPUs
- Framework: Axolotl, LLaMA-Factory, or custom training loop
- Time: 3-6 weeks
Key Learning Objectives:
- Curate and mix instruction datasets from multiple sources
- Implement proper instruction templates (chat format)
- Use QLoRA for memory-efficient training of larger models
- Balance dataset mixing ratios
- Evaluate across multiple tasks (reasoning, knowledge, coding)
- Implement generation sampling strategies (temperature, top-p, top-k)
- Measure instruction-following capability qualitatively
Deliverables: Instruction-tuned model that generalizes to unseen task types
Advanced: Build a Domain Expert with Specialized SFT
Goal: Create a highly capable domain-specific assistant (e.g., medical, legal, coding)
Technical Requirements:
- Base Model: LLaMA-2-13B/70B, Mistral-7B, or Code Llama
- Dataset: Custom domain dataset + synthetic data generation (50K-1M examples)
- Hardware: Multi-GPU setup (4-8 A100s) or cloud compute
- Framework: DeepSpeed, FSDP, or Megatron-LM for large models
- Time: 2-3 months
Key Learning Objectives:
- Use strong models (GPT-4) to generate synthetic training data
- Implement data quality filtering and deduplication
- Create domain-specific evaluation benchmarks
- Fine-tune with curriculum learning (easy → hard examples)
- Implement safety guardrails and refusal behavior
- Test for hallucinations and factual accuracy
- Compare against general-purpose baselines
Deliverables: Production-ready domain expert with comprehensive evaluation report
RLHF (Reinforcement Learning from Human Feedback) Projects
Implement the complete RLHF pipeline to align models with human preferences.
Beginner: Implement Reward Modeling
Goal: Build a reward model that predicts human preferences between responses
Technical Requirements:
- Base Model: Same architecture as your SFT model (smaller variant)
- Dataset: Comparison dataset (Anthropic HH-RLHF, OpenAssistant, ~10K-50K pairs)
- Hardware: Single GPU (T4/V100)
- Framework: Hugging Face Transformers + TRL (Transformer Reinforcement Learning)
- Time: 2-3 weeks
Key Learning Objectives:
- Understand pairwise comparison loss (Bradley-Terry model)
- Modify model head to output scalar reward score
- Implement training loop with paired examples
- Evaluate reward model accuracy on preference prediction
- Analyze what the reward model has learned (correlations with length, style, etc.)
- Test on out-of-distribution prompts
Deliverables: Trained reward model that accurately predicts human preferences (>60% accuracy)
Intermediate: Full RLHF Pipeline (PPO)
Goal: Implement complete RLHF with Proximal Policy Optimization
Technical Requirements:
- Base Model: Instruction-tuned model (GPT-2, LLaMA-7B)
- Reward Model: From previous step or pre-trained
- Hardware: 2-4 GPUs (need separate GPUs for policy, value, and reward models)
- Framework: TRL (trl library), or DeepSpeed-Chat
- Time: 4-8 weeks
Key Learning Objectives:
- Implement PPO algorithm for language models
- Manage four models: policy (actor), value (critic), reward, reference
- Compute advantages using Generalized Advantage Estimation (GAE)
- Add KL divergence penalty to prevent drift from reference model
- Sample generations and compute rewards in batches
- Monitor training stability (KL divergence, reward statistics, policy entropy)
- Evaluate qualitative improvements in response quality
Deliverables: RLHF-trained model with measurably improved preference ratings vs SFT baseline
Advanced: Constitutional AI & DPO (Direct Preference Optimization)
Goal: Implement modern RLHF alternatives without explicit reward models
Technical Requirements:
- Base Model: Large instruction-tuned model (LLaMA-2-13B+, Mistral-7B)
- Dataset: Generate synthetic preferences using AI feedback (self-critique)
- Hardware: 4-8 GPUs for larger models
- Framework: Custom implementation or TRL with DPO support
- Time: 2-3 months
Key Learning Objectives:
- Implement Constitutional AI: use model to critique and revise its own outputs
- Generate preference pairs through self-evaluation against principles
- Implement DPO algorithm (simpler than PPO, no reward model needed)
- Compare DPO vs PPO on same dataset
- Design constitution (set of principles for model behavior)
- Measure alignment on safety benchmarks (TruthfulQA, BBQ bias)
- Test robustness to adversarial prompts
Deliverables: Aligned model using modern techniques with comprehensive safety evaluation
End-to-End Application Projects
Build complete systems that combine pre-training, post-training, and RLHF concepts.
Project Idea: Personal Code Assistant
Pipeline:
- Pre-training: Continue pre-train Code Llama on your company's codebase
- SFT: Fine-tune on code completion and debugging examples
- RLHF: Use developer feedback (thumbs up/down) to refine suggestions
Technologies: VS Code extension, local model serving (vLLM), feedback collection UI
Project Idea: Custom Chatbot with Domain Expertise
Pipeline:
- Pre-training: Continue training on domain-specific documents (medical/legal/financial)
- SFT: Train on QA pairs and instruction examples from domain
- RLHF: Collect expert feedback on response quality and factuality
Technologies: RAG (Retrieval-Augmented Generation), vector database (Pinecone/Weaviate), web interface
Project Idea: Creative Writing Assistant
Pipeline:
- Pre-training: Train on large corpus of books, stories, creative writing
- SFT: Fine-tune on story completion, character development prompts
- RLHF: Use writer feedback to align on style, creativity, coherence
Technologies: React web app, streaming responses, style transfer controls
Learning Resources for Projects
Essential Libraries & Tools
| Tool | Purpose | Best For |
|---|---|---|
| Hugging Face Transformers | Pre-trained models, tokenizers, training utilities | All projects, especially SFT |
| TRL (Transformer RL) | RLHF, PPO, DPO implementations | RLHF projects |
| PEFT (LoRA, QLoRA) | Parameter-efficient fine-tuning | Limited compute scenarios |
| DeepSpeed | Distributed training, memory optimization | Large model training |
| Axolotl | Simplified training configs for LLMs | Quick experiments |
| vLLM / TGI | Fast inference serving | Production deployment |
| Weights & Biases | Experiment tracking, visualization | All projects |
Datasets for Practice
- Pre-training: The Pile, C4, RedPajama, mC4 (multilingual)
- Instruction Tuning: FLAN, Dolly-15k, OpenOrca, WizardLM, Alpaca
- Coding: The Stack, CodeSearchNet, APPS, HumanEval
- RLHF: Anthropic HH-RLHF, OpenAssistant Conversations, SHP (StackOverflow preferences)
- Alignment: TruthfulQA, BBQ (bias), RealToxicityPrompts
Compute Requirements
| Model Size | Training (Full FT) | Training (LoRA/QLoRA) | Inference |
|---|---|---|---|
| 1M params | CPU | CPU | CPU |
| 100M-350M | 1x T4/V100 (16GB) | 1x T4 (16GB) | CPU or small GPU |
| 1B-3B | 1-2x A100 (40GB) | 1x V100/T4 | 1x T4 or CPU (slow) |
| 7B | 4-8x A100 (80GB) | 1x A100 (40GB) | 1x A100 or 2x T4 |
| 13B-30B | 8-16x A100 | 2x A100 (80GB) | 2-4x A100 |
| 70B+ | 16-32x A100/H100 | 4-8x A100 | 8x A100 or quantization |
Getting Started Checklist
- Choose a project that matches your compute budget and timeline
- Set up development environment (PyTorch, Transformers, Weights & Biases)
- Start with smallest viable dataset to test pipeline end-to-end
- Establish baseline metrics before training
- Implement logging and checkpointing from day one
- Run small-scale experiments first (hyperparameter search on tiny model)
- Scale up only after validating pipeline works
- Document everything (model cards, training logs, evaluation results)
- Share your work (blog post, GitHub repo, Hugging Face model)