Dubois opens by grounding the term. LLMs — Large Language Models — are the engines behind ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), and similar systems. The lecture is about how they actually work, not just what they can do.
He identifies five components that matter when training an LLM:
Architecture — the structure of the neural network (Transformers, in every current case).
Training loss and algorithm — how the model updates its weights.
Data — what you train it on.
Evaluation — how you know whether it's improving.
Systems — how you make it run on actual hardware at scale.
He skips architecture almost entirely — not because it doesn't matter, but because there's already abundant material online about Transformers. The other four are where the real engineering happens and where there's far less clear guidance.
The lecture is structured around two phases of LLM development:
Pre-training is the classical phase. You train a language model on essentially all of the internet, teaching it to predict the next word in any text. GPT-2, GPT-3 — these are pre-training outputs. The model learns language, facts, and structure, but it has no conversational behavior.
Post-training is a newer paradigm. You take a pre-trained model and fine-tune it to behave like a helpful assistant — to follow instructions, to be safe, to give useful answers. ChatGPT is the product of post-training. This phase is what makes models useful to ordinary people.
At its mathematical core, a language model is a probability distribution over sequences of tokens (words or sub-words). Given a sentence, it assigns a probability to that sentence appearing in human language.
Example: "The mouse ate the cheese" gets a high probability. "The cheese ate the mouse" gets a lower probability — the model encodes semantic knowledge about what eats what. "The the mouse at cheese" gets a very low probability — the model also encodes syntactic knowledge about grammar.
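The specific flavor used in all current systems is the autoregressive language model. Instead of modeling the full sequence at once, it decomposes the joint probability of a sequence of L tokens using the chain rule:

$$p(x_1, \dots, x_L) = \prod_{i=1}^{L} p(x_i \mid x_1, \dots, x_{i-1})$$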
This means: predict the first word, then predict the second word given the first, then predict the third given the first two, and so on. No approximation — this is just the chain rule of probability applied sequentially.
In training, the task simplifies: given all the words so far, predict the next one, then compare your prediction to reality. In inference (actual use), you sample from the distribution and use the sampled token as input for the next step.
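The chain-rule scoring and the sampling loop can be sketched with a toy model. This is only an illustration: the `BIGRAM` table and its probabilities are invented, and a bigram model conditions on just the previous token, whereas a real LLM conditions on all preceding tokens.

```python
import math
import random

# Made-up conditional probabilities p(next | previous).
# "<s>" marks the start of a sequence and "</s>" the end.
BIGRAM = {
    "<s>": {"the": 1.0},
    "the": {"mouse": 0.5, "cheese": 0.5},
    "mouse": {"ate": 0.9, "</s>": 0.1},
    "cheese": {"ate": 0.1, "</s>": 0.9},
    "ate": {"the": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule in log space: sum of log p(token | previous token)."""
    log_p, prev = 0.0, "<s>"
    for tok in tokens:
        p = BIGRAM.get(prev, {}).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")  # the toy model considers this impossible
        log_p += math.log(p)
        prev = tok
    return log_p

def sample(max_len=10, seed=0):
    """Inference: sample a token, then feed it back in as the next context."""
    rng = random.Random(seed)
    prev, out = "<s>", []
    for _ in range(max_len):
        choices = BIGRAM[prev]
        tok = rng.choices(list(choices), weights=list(choices.values()))[0]
        if tok == "</s>":
            break
        out.append(tok)
        prev = tok
    return out

good = ["the", "mouse", "ate", "the", "cheese", "</s>"]
bad = ["the", "cheese", "ate", "the", "mouse", "</s>"]
```

With these invented numbers, `sequence_log_prob(good)` is higher than `sequence_log_prob(bad)`, mirroring the mouse-and-cheese example above.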
Dubois flags tokenization as something people rarely discuss but that matters enormously. Before text can go into a model, it must be converted into tokens: discrete numerical IDs that the model can process.
Why not just use words? Typos and misspellings produce words that have no entry in your vocabulary. Thai and other languages don't use spaces to separate words, so word-boundary tokenization fails entirely.
Why not use characters? You could, and it would technically work. The problem is sequence length: character-level tokenization produces very long sequences, and the cost of self-attention in a Transformer grows quadratically with sequence length. A 1,000-character sentence becomes a 1,000-step sequence, which is prohibitively expensive at scale.
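A rough back-of-the-envelope comparison makes the quadratic cost concrete. The figure of about 4 characters per sub-word token is an assumed rule of thumb for English text, not a number from the lecture, and `attention_pairs` is a hypothetical helper that counts only pairwise attention scores.

```python
def attention_pairs(seq_len):
    """Number of token-to-token attention scores in one full attention map."""
    return seq_len * seq_len

num_chars = 1000
chars_per_token = 4  # assumption: rough average for English sub-word tokens
num_subword_tokens = num_chars // chars_per_token

char_cost = attention_pairs(num_chars)
subword_cost = attention_pairs(num_subword_tokens)
ratio = char_cost / subword_cost
```

Under that assumption, shortening the sequence by a factor of 4 cuts the pairwise attention work by a factor of 16.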
The solution is a learned vocabulary of sub-word units. The dominant algorithm is Byte Pair Encoding (BPE):
1. Start with a large text corpus and split everything into individual characters.
2. Count all adjacent character pairs across the corpus.
3. Merge the most frequent pair into a single new token.
4. Repeat until you reach your target vocabulary size.
Result: common words become single tokens, rare words get split into common sub-word components. "Tokenizer" becomes ["token", "izer"] — two tokens instead of nine characters.
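The merge loop described above can be sketched in a few lines. This is a simplified illustration, not a production tokenizer: it splits on whitespace and works on characters, while real BPE tokenizers typically operate on bytes with a pre-tokenization step. The function names are chosen for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a whitespace-split corpus."""
    # Step 1: every word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair (ties broken deterministically).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges

def encode(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = train_bpe("low low low lower lowest", num_merges=2)
```

On this tiny corpus, two merges are enough for `encode("low", merges)` to return a single token, while an unseen word such as "slowly" is split into the learned piece plus leftover characters.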
GPT-4 changed how it tokenizes code — specifically Python's leading spaces — which meaningfully improved code generation. Tokenizer design has direct downstream effects on model capability.
The final output size of the model is equal to the vocabulary size. Every token in the vocabulary gets a probability score at each generation step.
How does the model learn? The training objective is simple in structure: given all preceding tokens, predict the next one. This is framed as a classification problem — classify the next token from a vocabulary of tens of thousands of options.
The loss function is cross-entropy loss. The target is a one-hot vector (the actual next token is 1, everything else is 0). The model outputs a probability distribution across the entire vocabulary. Cross-entropy loss measures the gap between those two distributions.
Minimizing this loss is mathematically equivalent to maximizing the log-likelihood of the training text. The model is being pushed, with every gradient update, to assign higher probability to text that actually appears in the real world.
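The equivalence can be written out in two lines. The notation here is introduced for illustration: $V$ is the vocabulary, $y_i$ is the one-hot target at position $i$, and $p_\theta$ is the model's predicted distribution.

```latex
\begin{aligned}
\mathcal{L}_i &= -\sum_{v \in V} y_{i,v} \log p_\theta(v \mid x_1, \dots, x_{i-1})
              = -\log p_\theta(x_i \mid x_1, \dots, x_{i-1}) \\
\sum_{i=1}^{L} \mathcal{L}_i &= -\log \prod_{i=1}^{L} p_\theta(x_i \mid x_1, \dots, x_{i-1})
              = -\log p_\theta(x_1, \dots, x_L)
\end{aligned}
```

The first line uses the fact that the one-hot target zeroes out every term except the true next token; the second applies the chain rule, so the summed loss is exactly the negative log-likelihood of the sequence.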
The pipeline in full: embed each token as a vector → pass through Transformer layers → apply a linear projection to vocabulary size → softmax to get probabilities → compute cross-entropy against ground truth → backprop and update weights.
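The last three stages of that pipeline (softmax, cross-entropy, and the gradient that backprop starts from) can be sketched in plain Python. The logits below are made-up numbers standing in for the output of the linear projection, and the 4-token vocabulary is purely illustrative.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Loss against a one-hot target: -log of the true token's probability."""
    return -math.log(softmax(logits)[target_id])

def grad_wrt_logits(logits, target_id):
    """Gradient of the loss with respect to the logits: probs - one_hot."""
    grad = softmax(logits)
    grad[target_id] -= 1.0
    return grad

logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical scores over a 4-token vocabulary
loss_if_right = cross_entropy(logits, target_id=0)
loss_if_wrong = cross_entropy(logits, target_id=2)
```

The loss is small when the true next token already has a high score (`target_id=0` here) and large when it has a low one, which is the signal the weight update uses.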
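Perplexity is the primary metric during pre-training. It's essentially your validation loss, exponentiated to be more interpretable. For a sequence of L tokens, with the loss averaged per token, the formula is:

$$\mathrm{PPL}(x_1, \dots, x_L) = \exp\left(-\frac{1}{L} \sum_{i=1}^{L} \log p(x_i \mid x_1, \dots, x_{i-1})\right)$$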
The intuition is concrete: perplexity tells you how many tokens your model is "hesitating between" at each step. A perplexity of 10 means the model is effectively choosing between 10 equally plausible options at every token. A perplexity of 1 means perfect prediction every time.
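A minimal sketch of the computation, assuming you already have the probability the model assigned to each true token (the `token_probs` lists below are invented inputs):

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-probability of the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads its belief evenly over 10 options at every step.
uniform_over_ten = perplexity([0.1] * 5)

# A model that always puts probability 1 on the correct token.
perfect = perplexity([1.0] * 5)
```

The first case comes out at 10 (up to floating-point error) and the second at 1, matching the "hesitating between N options" reading.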
From 2017 to 2023, perplexity on standard datasets dropped from ~70 to below 10. That's a massive improvement — models went from being genuinely uncertain across dozens of words to being confident within a handful.
Perplexity is no longer used in academic benchmarking because it's tokenizer-dependent: it is averaged per token, and a larger vocabulary covers the same text in fewer tokens. A model with a 10,000-token vocabulary and a model with a 100,000-token vocabulary therefore aren't directly comparable on perplexity. It remains the go-to metric for internal development.
Academic benchmarks now dominate external comparison. MMLU (Massive Multitask Language Understanding) is the most common — a collection of multiple-choice questions across medicine, physics, astronomy, law, and dozens of other domains. Evaluation is constrained: you look at which answer (A/B/C/D) gets the highest probability from the model. No free-form generation required.
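The constrained evaluation can be sketched as follows. The log-probabilities are invented; in practice they would come from scoring each answer letter with the model, and the dictionary layout is an assumption made for this example.

```python
def pick_answer(choice_log_probs):
    """Return the answer letter the model assigns the highest log-probability."""
    return max(choice_log_probs, key=choice_log_probs.get)

def accuracy(examples):
    """Fraction of questions where the top-scoring letter is the gold answer."""
    correct = sum(pick_answer(ex["log_probs"]) == ex["answer"] for ex in examples)
    return correct / len(examples)

# Hypothetical model scores for two multiple-choice questions.
examples = [
    {"answer": "B", "log_probs": {"A": -2.3, "B": -0.4, "C": -3.1, "D": -2.9}},
    {"answer": "D", "log_probs": {"A": -1.0, "B": -1.6, "C": -2.2, "D": -1.2}},
]
score = accuracy(examples)
```

Because only four scores are compared per question, nothing has to be generated or parsed, which keeps the evaluation cheap and unambiguous.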
He also flags train-test contamination: if benchmark questions appeared in the training data, scores are inflated and meaningless. One method to detect this: since internet datasets aren't randomized, check whether the model predicts benchmark questions more confidently in their original order than in shuffled order. If yes, it probably memorized them.
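One way to turn the ordering idea into a check is a permutation-style test. This is a sketch under assumptions: `score_fn` is a hypothetical stand-in for "log-likelihood the model assigns to the examples concatenated in this order", and the two toy scorers below only imitate a memorizing and a non-memorizing model.

```python
import random

def order_test(examples, score_fn, num_shuffles=100, seed=0):
    """Estimate how often a shuffled order scores at least as well as the
    original order. A small value suggests the original order is special
    to the model, which hints at memorization."""
    rng = random.Random(seed)
    original = score_fn(examples)
    at_least = 0
    for _ in range(num_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= original:
            at_least += 1
    return (at_least + 1) / (num_shuffles + 1)

def memorizing_score(examples):
    """Toy scorer that rewards neighbours appearing in their canonical order."""
    return sum(1 for a, b in zip(examples, examples[1:]) if b == a + 1)

def order_blind_score(examples):
    """Toy scorer that ignores order entirely."""
    return float(sum(examples))

benchmark = list(range(10))  # stand-in IDs for benchmark questions
suspicious = order_test(benchmark, memorizing_score)
clean = order_test(benchmark, order_blind_score)
```

For the order-blind scorer every shuffle ties with the original, so the value is 1; for the memorizing scorer almost no shuffle matches the original order, so the value is close to 0.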
The common phrase "trained on the internet" obscures an enormous amount of engineering. Dubois walks through what it actually involves.
Step 1 — Download the internet. Web crawlers (Common Crawl being one major source) continuously index the public web. Current scale: approximately 250 billion pages, roughly one petabyte of data.
Step 2 — Extract text from HTML. Raw crawl data is HTML — full of tags, scripts, ads, and boilerplate. Extracting clean text from HTML is non-trivial, especially for math notation and structured data.
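A minimal sketch of text extraction using Python's standard-library `html.parser`. Production pipelines use far more careful extractors (boilerplate removal, math handling, tables); the `TextExtractor` class here only drops tags plus the contents of script and style elements.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = (
    "<html><head><script>var x = 1;</script><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
text = extract_text(page)
```

Even this small example shows why the step is non-trivial: deciding what counts as content versus boilerplate is a judgment the parser cannot make for you.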
Step 3 — Filter undesirable content. Companies maintain blacklists of sites to exclude. Classifiers detect and remove NSFW content, hate speech, and personally identifiable information (PII). Every point in this pipeline represents significant engineering work.
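Two of the simpler filters can be sketched with rules alone. The lecture describes learned classifiers; the domain list and the email regex below are hypothetical stand-ins that only illustrate where such filters sit in the pipeline.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist; real lists are much larger and maintained over time.
BLOCKED_DOMAINS = {"spam.example", "adult.example"}

# Deliberately simple pattern; real PII detection covers far more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_blocked(url):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)

cleaned = redact_emails("Contact bob@example.com for details.")
```

Rule-based filters like these are cheap to run over billions of pages, which is why they usually run before any heavier classifier.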
Step 4 — Deduplication. Forums repeat headers and footers across thousands of pages. Books appear verbatim on hundreds of different sites. The same URL resolves to different content at different times. Deduplication at petabyte scale is a substantial systems problem.
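Exact deduplication can be sketched by hashing a normalized copy of each document. This is the simplest variant and an assumption of this example; catching near-duplicates, such as pages that differ only in a header, needs fuzzier techniques that are not shown here.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "The mouse ate the cheese.",
    "the  mouse ate the   cheese.",  # same text, different case and spacing
    "The cheese ate the mouse.",
]
unique_docs = deduplicate(docs)
```

Storing fixed-size digests instead of full documents is what keeps the `seen` set manageable, although at petabyte scale even that set has to be sharded across machines.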
Step 5 — Heuristic filtering. Low-quality documents are removed using rule-based signals: unusual token distributions, extremely short or extremely long documents, unusual character ratios, and similar signals.
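A rule-based quality filter might look like the sketch below. The thresholds (`min_words`, `max_words`, `min_alpha_ratio`) are invented for illustration; real pipelines tune many more rules against samples of their own data.

```python
def passes_heuristics(text, min_words=5, max_words=100_000, min_alpha_ratio=0.6):
    """Reject documents that are too short, too long, or mostly non-letters."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    non_space = [c for c in text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    return alpha_ratio >= min_alpha_ratio

candidates = [
    "The mouse ate the cheese in the kitchen.",
    "too short",
    "$$$ ### 123 456 789 000 !!!",
]
kept = [doc for doc in candidates if passes_heuristics(doc)]
```

Only the first candidate survives: the second fails the length rule and the third fails the character-ratio rule.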
ChatGPT, Claude, Gemini — what's inside all of them. Establishes the five components: architecture, training loss, data, evaluation, systems.
Pre-training = learning language from raw internet text. Post-training = turning that into an assistant. GPT-2 is pre-training. ChatGPT is post-training.
LLMs assign probabilities to sequences of text. Autoregressive decomposition using the chain rule. Predicting the next token is the entire training task.
Why words and characters both fail. Byte Pair Encoding as the dominant solution. Why tokenization affects math ability and code generation.
Cross-entropy loss. Minimizing loss = maximizing log-likelihood of real text. The mechanics: embed → Transformer → linear → softmax → loss → backprop.
Perplexity as internal training metric. MMLU and benchmark aggregators for academic comparison. Contamination detection. Why same model, different eval = wildly different scores.
What "trained on the internet" actually requires: crawling, HTML extraction, filtering, deduplication, heuristic quality filtering. 1 petabyte of raw data → curated training corpus.
Supervised fine-tuning on human-written demonstrations. Reinforcement learning from human feedback (RLHF). Reward modeling. How ChatGPT's behavior was shaped.