Stanford CS229 — Building LLMs

Core concepts from the actual lecture

Section 1

What are LLMs and why do they matter?

Dubois opens by grounding the term. LLMs — Large Language Models — are the engines behind ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), and similar systems. The lecture is about how they actually work, not just what they can do.

He identifies five components that matter when training an LLM:

Architecture — the structure of the neural network (Transformers, in every current case).
Training loss and algorithm — how the model updates its weights.
Data — what you train it on.
Evaluation — how you know whether it's improving.
Systems — how you make it run on actual hardware at scale.

Dubois is direct: academia obsesses over architecture. Industry focuses on data, evaluation, and systems. He thinks academia has its priorities backwards.

He skips architecture almost entirely — not because it doesn't matter, but because there's already abundant material online about Transformers. The other four are where the real engineering happens and where there's far less clear guidance.

Section 2

Pre-training vs. Post-training

The lecture is structured around two phases of LLM development:

Pre-training is the classical phase. You train a language model on essentially all of the internet, teaching it to predict the next word in any text. GPT-2, GPT-3 — these are pre-training outputs. The model learns language, facts, and structure, but it has no conversational behavior.

Post-training is a newer paradigm. You take a pre-trained model and fine-tune it to behave like a helpful assistant — to follow instructions, to be safe, to give useful answers. ChatGPT is the product of post-training. This phase is what makes models useful to ordinary people.

The key distinction: pre-training teaches the model to model language. Post-training teaches the model to be an assistant.

Section 3

Language modeling: the core task

At its mathematical core, a language model is a probability distribution over sequences of tokens (words or sub-words). Given a sentence, it assigns a probability to that sentence appearing in human language.

Example: "The mouse ate the cheese" gets a high probability. "The cheese ate the mouse" gets a lower probability — the model encodes semantic knowledge about what eats what. "The the mouse at cheese" gets a very low probability — the model also encodes syntactic knowledge about grammar.

The specific flavor used in all current systems is the autoregressive language model. Instead of modeling the full sequence at once, it decomposes probability using the chain rule:

P(x₁, x₂, ..., xₙ) = P(x₁) × P(x₂|x₁) × P(x₃|x₁,x₂) × ...

This means: predict the first word, then predict the second word given the first, then predict the third given the first two, and so on. No approximation — this is just the chain rule of probability applied sequentially.
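The factorization can be made concrete with a small sketch. It assumes a hypothetical hand-written table of conditional probabilities that looks only at the previous token; a real LLM conditions on the entire prefix, so the table and its numbers are illustrative, not anything shown in the lecture.

```python
import math

# Hypothetical toy table: P(next token | previous token).
# A real LLM conditions on the full prefix; this bigram table is a simplification.
BIGRAM = {
    "<s>": {"the": 1.0},
    "the": {"mouse": 0.5, "cheese": 0.5},
    "mouse": {"ate": 0.9, "</s>": 0.1},
    "cheese": {"ate": 0.1, "</s>": 0.9},
    "ate": {"the": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule: add up log P(x_i | previous token), from <s> to </s>."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens + ["</s>"]:
        p = BIGRAM.get(prev, {}).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")  # the toy table gives this sequence zero probability
        log_p += math.log(p)
        prev = tok
    return log_p

likely = sequence_log_prob("the mouse ate the cheese".split())
unlikely = sequence_log_prob("the cheese ate the mouse".split())
```

Working in log space turns the product of conditionals into a sum, which avoids numerical underflow on long sequences.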

The downside: generation is inherently sequential. Each new token requires another forward pass conditioned on everything generated so far, so longer outputs take more compute time. This is a fundamental constraint of autoregressive LLMs.

In training, the task simplifies: given all the words so far, predict the next one, then compare your prediction to reality. In inference (actual use), you sample from the distribution and use the sampled token as input for the next step.
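The inference loop can be sketched as follows. The next-token table `NEXT` is a hypothetical stand-in for a trained model, and the function name `generate` is an assumption for illustration; the point is only the shape of the loop, where each sampled token becomes the input for the next step.

```python
import random

# Hypothetical stand-in for a trained model: next-token distribution
# keyed by the previous token only.
NEXT = {
    "<s>": {"the": 1.0},
    "the": {"mouse": 0.5, "cheese": 0.5},
    "mouse": {"ate": 0.9, "</s>": 0.1},
    "cheese": {"ate": 0.1, "</s>": 0.9},
    "ate": {"the": 1.0},
}

def generate(max_tokens=10, seed=0):
    """Sample one token at a time, feeding each sample back in as context."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):          # one model call per generated token
        dist = NEXT[tokens[-1]]
        choices, weights = zip(*dist.items())
        tok = rng.choices(choices, weights=weights, k=1)[0]
        if tok == "</s>":                # end-of-sequence token stops generation
            break
        tokens.append(tok)
    return tokens[1:]                    # drop the start marker

sample = generate(max_tokens=10, seed=0)
```

The loop cannot be parallelized across output positions, because step t needs the token sampled at step t-1.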

Section 4

Tokenization: the underrated foundation

Dubois flags this as something people rarely discuss but that matters enormously. Before text can go into a model, it must be converted into tokens — discrete numerical IDs that the model can process.

Why not just use words? Typos and misspellings produce words that have no entry in your vocabulary. Thai and other languages don't use spaces to separate words, so word-boundary tokenization fails entirely.

Why not use characters? You could, and it would technically work. The problem is sequence length: character-level tokenization produces very long sequences, and the cost of self-attention in a Transformer grows quadratically with sequence length. A 1,000-character sentence becomes a 1,000-step sequence, several times longer than the same sentence split into sub-word tokens, which makes training and inference prohibitively expensive.

The solution is a learned vocabulary of sub-word units. The dominant algorithm is Byte Pair Encoding (BPE):

1. Start with a large text corpus and split everything into individual characters.
2. Count all adjacent token pairs across the corpus (on the first pass, these are character pairs).
3. Merge the most frequent pair into a single new token.
4. Repeat until you reach your target vocabulary size.

Result: common words become single tokens, rare words get split into common sub-word components. "Tokenizer" becomes ["token", "izer"] — two tokens instead of nine characters.
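The merge loop described above can be sketched in a few lines. This is a simplified illustration, assuming the corpus is already split into words and merges never cross word boundaries; production tokenizers typically work on bytes and add pre-tokenization rules and special tokens. The function names are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)   # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break                               # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe(["abab", "abab", "ab"], num_merges=5)
```

Here `num_merges` plays the role of the target vocabulary size: each merge adds exactly one new token to the vocabulary.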

Dubois gives a concrete reason tokenization matters for math: 327 might be a single token, meaning the model doesn't "see" the individual digits. It can't compose numbers the way humans do. This is a core reason LLMs historically struggled with arithmetic.

GPT-4 changed how it tokenizes code — specifically Python's leading spaces — which meaningfully improved code generation. Tokenizer design has direct downstream effects on model capability.

The model's output layer has one score (logit) per vocabulary entry, so its output dimension equals the vocabulary size. At each generation step, every token in the vocabulary receives a probability.

Section 5

Training loss: cross-entropy and log-likelihood

How does the model learn? The training objective is simple in structure: given all preceding tokens, predict the next one. This is framed as a classification problem — classify the next token from a vocabulary of tens of thousands of options.

The loss function is cross-entropy loss. The target is a one-hot vector (the actual next token is 1, everything else is 0). The model outputs a probability distribution across the entire vocabulary. Cross-entropy loss measures the gap between those two distributions.

Loss = -log P(actual next token | context)

Minimizing this loss is mathematically equivalent to maximizing the log-likelihood of the training text. The model is being pushed, with every gradient update, to assign higher probability to text that actually appears in the real world.
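A minimal sketch of the loss for a single prediction, assuming the model has already produced raw scores (logits) over a tiny made-up vocabulary. It also checks that the full one-hot cross-entropy sum collapses to the single term -log P(actual next token).

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Loss = -log P(actual next token | context)."""
    return -math.log(softmax(logits)[target_index])

def cross_entropy_one_hot(logits, one_hot):
    """Full definition: -sum_i target_i * log(prob_i)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0, 0.0]           # made-up scores for a 4-token vocabulary
loss = cross_entropy(logits, target_index=0)
```

Because the target is one-hot, every term except the true token's vanishes, which is why the loss depends only on the probability assigned to the token that actually came next.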

The pipeline in full: embed each token as a vector → pass through Transformer layers → apply a linear projection to vocabulary size → softmax to get probabilities → compute cross-entropy against ground truth → backprop and update weights.
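The pipeline can be traced end to end in a toy sketch. One loud assumption: the Transformer layers are omitted entirely, so the "model" is just an embedding table followed by a linear projection, with gradients written out by hand. All names (`make_model`, `train_step`) and sizes are hypothetical; a real implementation would use an autodiff framework.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_model(vocab_size, dim, seed=0):
    """Random embedding table E (vocab x dim) and projection W (dim x vocab)."""
    rng = random.Random(seed)
    E = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(dim)]
    return E, W

def train_step(E, W, context_id, target_id, lr=0.1):
    """One step: embed -> project -> softmax -> cross-entropy -> SGD update."""
    h = E[context_id]                    # embed (Transformer layers omitted here)
    d, V = len(W), len(W[0])
    logits = [sum(h[j] * W[j][v] for j in range(d)) for v in range(V)]
    probs = softmax(logits)
    loss = -math.log(probs[target_id])
    # Gradient of the loss w.r.t. the logits is (probs - one_hot).
    dlogits = probs[:]
    dlogits[target_id] -= 1.0
    # Backprop into the embedding before W is modified.
    dh = [sum(W[j][v] * dlogits[v] for v in range(V)) for j in range(d)]
    for j in range(d):
        for v in range(V):
            W[j][v] -= lr * h[j] * dlogits[v]
    E[context_id] = [h[j] - lr * dh[j] for j in range(d)]
    return loss

E, W = make_model(vocab_size=5, dim=4)
losses = [train_step(E, W, context_id=0, target_id=2) for _ in range(100)]
```

Repeating the step on the same (context, target) pair drives the loss down, which is the "assign higher probability to text that actually appears" pressure in miniature.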

Section 6

Evaluation: perplexity and benchmarks

Perplexity is the primary metric during pre-training. It's essentially your validation loss, exponentiated to be more interpretable. The formula:

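Perplexity = 2^(average per-token loss), with the loss measured in bits (log base 2)

If the loss uses the natural log instead, the equivalent form is exp(average per-token loss).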

The intuition is concrete: perplexity tells you how many tokens your model is "hesitating between" at each step. A perplexity of 10 means the model is effectively choosing between 10 equally plausible options at every word. Perplexity of 1 means perfect prediction every time.
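A short sketch of the computation, assuming you already have the probability the model assigned to each correct token (the input lists below are made up). It shows the natural-log and base-2 forms side by side.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood (natural log)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def perplexity_bits(token_probs):
    """Same quantity computed as 2 ** (average loss in bits)."""
    avg_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_bits

# A model that always gives the correct token probability 1/10 behaves like
# a uniform choice among 10 options at every step.
ten_way = perplexity([0.1] * 7)
```

The two functions agree as long as the base of the logarithm matches the base of the exponent; mixing them is a common source of incomparable numbers.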

From 2017 to 2023, perplexity on standard datasets dropped from ~70 to below 10. That's a massive improvement — models went from being genuinely uncertain across dozens of words to being confident within a handful.

Perplexity is no longer used in academic benchmarking because it's tokenizer-dependent. A model with a 10,000-token vocabulary and a model with a 100,000-token vocabulary split the same text into different numbers of tokens, so their per-token averages aren't directly comparable. It remains the go-to metric for internal development.

Academic benchmarks now dominate external comparison. MMLU (Massive Multitask Language Understanding) is the most common — a collection of multiple-choice questions across medicine, physics, astronomy, law, and dozens of other domains. Evaluation is constrained: you look at which answer (A/B/C/D) gets the highest probability from the model. No free-form generation required.
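The constrained scoring rule can be sketched as below. It assumes the log-probabilities for each answer letter have already been obtained from a model; the numbers are invented for illustration, and the function names are hypothetical rather than part of any benchmark harness.

```python
def pick_answer(option_log_probs):
    """Return the answer letter with the highest log-probability."""
    return max(option_log_probs, key=option_log_probs.get)

def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Made-up log-probabilities for one question's four options.
scores = {"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.9}
prediction = pick_answer(scores)
```

Details such as how the prompt is formatted, or whether the letter or the full answer text is scored, are exactly the protocol choices that can change reported numbers.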

Critical problem Dubois raises: even on a well-established benchmark like MMLU, different organizations were using different evaluation protocols and getting dramatically different numbers. Llama 65B scored 63.7% under one protocol and 48.8% under another, with no change to the model itself.

He also flags train-test contamination: if benchmark questions appeared in the training data, scores are inflated and meaningless. One method to detect this: since internet datasets aren't randomized, check whether the model predicts benchmark questions more confidently in their original order than in shuffled order. If yes, it probably memorized them.
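The ordering idea can be sketched as a comparison of scores. This is an illustration of the intuition only, not the exact statistical test from any paper: `score_sequence` is a hypothetical callable standing in for a model's log-likelihood of a list of examples, and the toy scorer below simply pretends to have memorized one ordering.

```python
import random

def order_gap(examples, score_sequence, num_shuffles=20, seed=0):
    """Score of the original order minus the mean score over shuffled orders."""
    rng = random.Random(seed)
    original = score_sequence(examples)
    shuffled_scores = []
    for _ in range(num_shuffles):
        perm = list(examples)
        rng.shuffle(perm)
        shuffled_scores.append(score_sequence(perm))
    return original - sum(shuffled_scores) / len(shuffled_scores)

def toy_memorizing_scorer(memorized):
    """Hypothetical scorer that penalizes any adjacent pair it has not 'seen'."""
    follows = set(zip(memorized, memorized[1:]))
    def score(seq):
        return sum(0.0 if pair in follows else -1.0 for pair in zip(seq, seq[1:]))
    return score

examples = ["q1", "q2", "q3", "q4", "q5", "q6"]
gap_memorized = order_gap(examples, toy_memorizing_scorer(examples))
gap_clean = order_gap(examples, lambda seq: -float(len(seq)))  # order-independent
```

A clearly positive gap suggests the model prefers the published order; a gap near zero is what an uncontaminated model should show.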

Section 7

Data: what "training on the internet" actually means

The common phrase "trained on the internet" obscures an enormous amount of engineering. Dubois walks through what it actually involves.

Step 1 — Download the internet. Web crawlers (Common Crawl being one major source) continuously index the public web. Current scale: approximately 250 billion pages, roughly one petabyte of data.

Step 2 — Extract text from HTML. Raw crawl data is HTML — full of tags, scripts, ads, and boilerplate. Extracting clean text from HTML is non-trivial, especially for math notation and structured data.
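A minimal sketch of text extraction using only Python's standard `html.parser` module. It drops script and style content and keeps visible text; it makes no attempt at the hard cases mentioned above (math notation, tables, boilerplate detection), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script/style tags."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>"
)
text = extract_text(page)
```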

Step 3 — Filter undesirable content. Companies maintain blacklists of sites to exclude. Classifiers detect and remove NSFW content, hate speech, and personally identifiable information (PII). Every point in this pipeline represents significant engineering work.
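Two of the simpler filters can be sketched directly. The domain list is hypothetical, and the regular expression is a deliberately crude stand-in for PII detection: it only catches email-like strings, whereas the lecture describes trained classifiers for this kind of content.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist; real lists are far larger and curated.
BLOCKED_DOMAINS = {"spam.example", "adult.example"}

# Crude email pattern used only to illustrate PII redaction.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_blocked(url):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

def redact_emails(text):
    """Replace email-like strings with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)

cleaned = redact_emails("Contact jane.doe@example.com now")
```

Matching on the host with a leading dot avoids accidentally blocking unrelated domains that merely end with the same characters.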

Step 4 — Deduplication. Forums repeat headers and footers across thousands of pages. Books appear verbatim on hundreds of different sites. Many different URLs resolve to the same content. Deduplication at petabyte scale is a substantial systems problem.
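Exact-match deduplication, the simplest form, can be sketched with content hashing. This assumes duplicates are identical after light normalization (lowercasing and whitespace collapsing); near-duplicate detection and line-level deduplication of repeated headers need more machinery than shown here, and the function names are hypothetical.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Keep the first occurrence of each document, judged by normalized hash."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello  World", "hello world", "Other"]
kept = deduplicate(docs)
```

Storing fixed-size digests rather than full documents is what keeps the `seen` set manageable as the corpus grows.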

Step 5 — Heuristic filtering. Low-quality documents are removed using rule-based signals: unusual token distributions, extremely short or extremely long documents, unusual character ratios, and similar signals.
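A rule-based filter of this kind might look like the sketch below. The specific thresholds and the function name are assumptions chosen for illustration; real pipelines tune such rules empirically and use many more signals.

```python
def passes_heuristics(text, min_words=5, max_words=100_000,
                      min_alpha_ratio=0.6, max_mean_word_len=12):
    """Return True if a document survives a few simple quality rules."""
    words = text.split()
    # Rule 1: drop documents that are extremely short or extremely long.
    if not (min_words <= len(words) <= max_words):
        return False
    # Rule 2: drop documents dominated by non-alphabetic characters.
    non_space = [c for c in text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    if alpha_ratio < min_alpha_ratio:
        return False
    # Rule 3: drop documents whose "words" are implausibly long on average.
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len <= max_mean_word_len

ok = passes_heuristics("The mouse ate the cheese in the kitchen.")
```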

Dubois' core point: "clean internet" is not a natural category. It's the result of dozens of filtering and cleaning decisions, each of which shapes what the model learns. Data quality is not downstream of architecture — it often matters more.

Terms as defined in the lecture — not generic AI glossary

Autoregressive Language Model

A model that predicts each token in a sequence conditioned on all previous tokens, using the chain rule of probability. All current production LLMs use this approach. The downside is sequential generation: longer outputs require proportionally more compute.

Byte Pair Encoding (BPE)

A tokenization algorithm that starts with individual characters and iteratively merges the most frequently co-occurring adjacent pairs into single tokens. Produces a vocabulary of sub-word units that balances sequence length against vocabulary coverage.

Cross-Entropy Loss

The training objective for LLMs. Given the model's predicted probability distribution over the next token and the actual next token (represented as a one-hot vector), cross-entropy measures their divergence. Minimizing it is equivalent to maximizing the log-likelihood of the training text.

Perplexity

A measure of model uncertainty: 2^(average per-token loss) when the loss is measured in bits, or exp(average per-token loss) when it uses the natural log. Intuitively, the number of tokens the model is "hesitating between" at each prediction step. Perplexity of 1 = perfect prediction. Perplexity equal to the vocabulary size = uniform guessing over every token. Dropped from ~70 to below 10 between 2017 and 2023. Not used for external benchmarking due to tokenizer dependence.

Pre-training

The phase where a model is trained on raw text (effectively all of the public internet) to predict the next token. Produces a model with broad linguistic and factual knowledge but no instruction-following behavior. GPT-2 and GPT-3 are the output of pre-training.

Post-training

The phase after pre-training that shapes a model into an AI assistant. Involves supervised fine-tuning on human demonstrations and reinforcement learning from human feedback (RLHF). ChatGPT is the most famous post-training result.

MMLU (Massive Multitask Language Understanding)

The most widely used academic benchmark for LLMs. A collection of multiple-choice questions across 57 domains including medicine, law, physics, and history. Evaluation: determine which answer option (A/B/C/D) the model assigns the highest probability to.

Train-Test Contamination

The problem that arises when benchmark questions appear in the training data, inflating benchmark scores artificially. One detection method: check whether the model assigns higher probability to benchmark examples in their original order than in shuffled order. A model that has memorized the benchmark tends to prefer the original order.

Common Crawl

One of the primary sources of web data for LLM training. A continuously updated dataset of the public web — approximately 250 billion pages (~1 petabyte) in current form. Raw Common Crawl data is HTML and requires extensive cleaning before it's usable as training data.

Tokenizer

The component that converts raw text into a sequence of integer token IDs before feeding it to the model. The vocabulary size of the tokenizer determines the output dimensionality of the model. A critical and often overlooked design decision with direct downstream effects on capability.
