
Metadata#
- Author: Chip Huyen
- Number of pages: 532
- Year published: 2025
- Year read: 2026
Review#
Done! OK, that took a while. Overall, distilling it to a singular purpose - did I learn from it? - the answer is, “ya.” But did I learn a sufficiently satisfying amount from it? No, not at all. This was only a starter!
Quick overview#
The problem with the book - as Jake said - is that it’s trying to cast a very wide net. From the intro, it seems the author, Chip Huyen, is trying to write an accessible “intro” to “AI” for software engineers, data scientists, and would-be founders. She assumes basically no special background in the technology, but - at the same time - is not aiming to write a proper undergraduate-level textbook on the subject (which is kinda what the subject requires!). As a result, it all feels very quick and surface-level.
So what’s covered? Lemme jog my memory and see how much I retained (lol):
- “AI is cool and useful” (soooo many highlights and spicy marginalia here)
- AI mostly means large language models (LLMs), but it can also incorporate computer vision, speech-to-text, text-to-speech, and so on.
- Brief dive into LLMs and natural language processing: like, embeddings are cool, LLMs are big neural nets, the attention mechanism was a big step up, and the transformer architecture built around it was another. Even briefer mention of how all the cutting-edge LLMs are basically closed-source, profit-driven moats that don’t share the neural net architecture, the weights, the training data, or the post-training steps. But I’m not bitter…
- Long stretch about “evaluating AI models”, and how difficult that is. Why is it difficult? Because human knowledge is sooooo expensive and often inconveniently locked between the ears of actual humans. As a result, Huyen advises (and maybe the industry is coalescing around?) “AI judges” - aka, use an LLM to evaluate the answers of another LLM. Also, defining “better” is hard when you’re subjectively grading generated slop.
- Some cool stuff about “post-training”, fine-tuning, RAG (retrieval augmented… something… generation?). Aka, building on top of the “foundation models” (aka, the big closed models from OpenAI, Anthropic, Google) by either, in order of difficulty, (a) prompt “engineering” (aka, ask it nicely, ask it a lot), (b) RAG (tell it to look things up), or (c) post-training (actually train the last few layers of the neural net again). Also some intriguing stuff about smushing models together (mixture of experts). No statistical concerns here about correlated training corpuses (corpi! yay Latin of meatspace) and orthogonality and, like, what we learned with ensemble methods in traditional, pre-LLM machine learning (e.g. a Random Forest vs. a bootstrap aggregated model with correlated trees)?
- Data engineering issues - again, an alarming section about using LLMs to generate synthetic data so that we can make more LLMs. Insufficient statistical concern here!!
- Full-stack, UX research-type discussion about how to integrate your “AI” into an app with actual, human users. Some very interesting, but very brief stuff about human-computer interaction, basically?
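The “AI judge” idea above is easy to sketch, so here’s a minimal, hypothetical version. The prompt wording, the 1-to-5 scale, and the `call_llm` stand-in are my own assumptions for illustration, not anything from the book:

```python
# Minimal sketch of the "LLM-as-judge" pattern: one LLM grades another
# LLM's answer. `call_llm` is a hypothetical stand-in for any
# chat-completion API call.

def build_judge_prompt(question: str, answer: str) -> str:
    """Ask the judge model to grade an answer on a 1-5 scale."""
    return (
        "You are grading an AI assistant's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with only a score from 1 (bad) to 5 (excellent)."
    )

def parse_score(raw_reply: str) -> int:
    """Pull the first digit 1-5 out of the judge's (free-text) reply."""
    for ch in raw_reply:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"no score found in {raw_reply!r}")

def judge(question: str, answer: str, call_llm) -> int:
    return parse_score(call_llm(build_judge_prompt(question, answer)))

# With a canned judge (a real one would be an API call):
score = judge("What is 2+2?", "4", call_llm=lambda prompt: "Score: 4")
# score == 4
```

In practice the parsing is the fragile part - the judge model is under no obligation to answer with just a number.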
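And RAG in miniature: retrieve the most relevant document, then stuff it into the prompt. The documents are made up, and the toy word-overlap scoring stands in for a real embedding search:

```python
# Toy RAG loop: retrieve the best-matching document, then build an
# augmented prompt for the model. Real systems rank by embedding
# similarity; plain word overlap stands in for that here.

DOCS = [
    "The attention mechanism lets a model weigh all input tokens at once.",
    "RLHF fine-tunes a model using human preference rankings.",
    "A barn owl's face is a feathered satellite dish for sound.",
]

def overlap(query: str, doc: str) -> int:
    """Count lowercase words the query and document share."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs=DOCS) -> str:
    """Return the document with the most words in common with the query."""
    return max(docs, key=lambda d: overlap(query, d))

def augment_prompt(query: str) -> str:
    """The 'G' step: hand the retrieved context to the model."""
    return f"Context: {retrieve(query)}\nQuestion: {query}\nAnswer:"
```

So `retrieve("how does the attention mechanism work?")` picks the attention document, and the generation step just sees a bigger, better-informed prompt.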
OK, so the good stuff: What did I learn?#
This was definitely a helpful “overview” book: it gave me a lot of structure for the otherwise disparate ideas I have had about LLMs. For example, I found it very helpful to disambiguate the post-training step into specific workflows like prompt “engineering”, RLHF (reinforcement learning from human feedback), RAG, teacher-student models, and so on.
I liked the very brief history of the field, how the “Attention is All You Need” paper/attention mechanism is situated within that field (I thought it was THE thing that unlocked this strange alien intelligence, but it is apparently only A thing). I was VERY tantalized by the discussion of embeddings.
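Since the embeddings teaser stuck with me, here’s the toy version of why they’re cool: words become vectors, and cosine similarity stands in for “similar meaning.” These 3-d vectors are made up for illustration; real embeddings run to hundreds or thousands of dimensions:

```python
import math

# Made-up 3-d "embeddings": geometric closeness stands in for
# semantic closeness.
EMB = {
    "owl":    [0.9, 0.8, 0.1],
    "falcon": [0.8, 0.9, 0.2],
    "spoon":  [0.1, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# cosine(EMB["owl"], EMB["falcon"]) comes out near 1 (both birds);
# cosine(EMB["owl"], EMB["spoon"]) comes out near 0.
```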
I found the evaluation section super useful to, again, structure this otherwise ambiguous space.
Oh yeah - and I love that this book has a repo with tons of additional resources.
The not-good stuff#
I think I would have felt very differently about this book had I read it last year, pre-AI Con, pre-AI bubble. Living in this bog of cynicism now, I couldn’t help but roll my eyes at the relatively uncritical discussions around, e.g., the fact that all the cutting-edge models are closed-weight. That’s not science!
Relatedly, the “how to evaluate your model for your needs” section was frustratingly vague and unprincipled - and never called out the elephant in the room: If I can’t do any descriptive stats of the text corpuses in the training dataset, how can I, well, understand or trust anything? (I did like Huyen’s table on languages which are most disproportionately underrepresented in the main training corpuses out there: apparently Punjabi is like CRAZY underrepresented!) But the evaluation section felt like it boiled down to Vicki Boykis’s joke, that all AI evaluation is a “vibe check”.
Similarly, the discussion of using AI judges, for example, or using synthetic training data, without a discussion of the statistical implications, felt like a real gap. For example: iirc Huyen mentions how training data biases will be amplified in synthetic data creation. That’s true - and it’s important to acknowledge. But it’s only one aspect of the wider problem: when your data is highly correlated (even duplicated), then how can you extract more information from it? For example, if you use Claude 4.5 as your LLM base model - and then use Claude 4.5 (or another Claude, for that matter) prompted to be your “AI judge” for the original model - isn’t that, by definition, uninformative? Like how bootstrap aggregating (bagging, the machine learning method) can lead to correlated trees?
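That bagging analogy actually has a clean formula behind it - the standard variance of an average of n estimators with pairwise correlation rho, the same algebra that motivates random forests. At rho = 0, averaging n judges cuts variance by n; at rho = 1 (the same Claude judging itself), averaging buys you exactly nothing:

```python
def ensemble_variance(sigma2: float, n: int, rho: float) -> float:
    """Variance of the average of n estimators, each with variance
    sigma2 and pairwise correlation rho:

        Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n

    The rho * sigma2 term is the floor that no amount of averaging
    correlated judges can get you below.
    """
    return rho * sigma2 + (1 - rho) * sigma2 / n

# Ten independent judges: variance drops from 1.0 to 0.1.
print(ensemble_variance(1.0, 10, 0.0))  # 0.1
# Ten perfectly correlated judges: variance stays at 1.0.
print(ensemble_variance(1.0, 10, 1.0))  # 1.0
```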
Basically, I would have appreciated more foundational statistical thinking throughout. (And, in Huyen’s defense, this might be something to do with the field. For example: this paper, which is basically like, “Hey Anthropic/OpenAI, you should probably put error bars on your evals…” Basic stuff!!!)
One more thing#
Having a gorgeous barn owl on the cover is apropos as well - owls are majestic, almost mystical creatures. Like LLMs, they’re hypnotic and affecting. But let’s plz always keep in mind: when plucked of their feathers, they do look very, very silly.