Lenny's Podcast: Product | Growth | Career · September 25, 2025

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Highlights from the Episode

Hamel Husain · Teaches AI evals course
00:00:49 - 00:01:07
Benevolent dictator for open coding
When performing open coding, many teams get bogged down by having a committee handle it. In most situations, this is entirely unnecessary. You don't want to make the process so expensive that it becomes unfeasible. Instead, you can appoint one person whose judgment you trust. This individual should possess domain expertise, and often, it's the product manager.
Hamel Husain · Teaches AI evals course
00:06:36 - 00:07:44
Evals provide confidence in AI application improvement
Imagine you have a real estate assistant application that isn't performing as expected. Perhaps it's not writing customer emails correctly or isn't calling the right tools, leading to various errors. Before evals, you'd be left guessing: you might tweak a prompt and hope it doesn't break other functionality. Initial "vibe checks" are useful, but they quickly become unmanageable as your application scales, and you end up feeling lost. Evals help you establish metrics to measure your application's performance, giving you a confident way to improve it because you have a clear feedback signal for iteration.
Shreya Shankar · Teaches AI evals course
00:08:35 - 00:09:55
Evals are a spectrum beyond unit tests
I agree with your initial point: we had a very broad definition. Evals encompass a wide spectrum of methods to measure application quality. Unit tests are one such method. For instance, if you have non-negotiable functionalities for your AI assistant, unit tests can verify them. However, since AI assistants perform open-ended tasks, you also need to measure their performance on vague or ambiguous things, like responding to new user requests or adapting to new data distributions. For example, new users might emerge whom you hadn't anticipated, requiring you to accommodate this new group. Evals can help identify these new cohorts by regularly analyzing your data. Additionally, evals can track metrics over time, such as positive user feedback. These basic, non-AI-specific metrics can feed back into the product improvement cycle. Ultimately, unit tests are a small piece of this much larger puzzle.
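For the non-negotiable cases, a code-based unit test is often enough. Here is a minimal sketch using pytest, assuming a hypothetical draft_customer_email wrapper around the assistant; the required greeting and the "no virtual tour promises" rule are illustrative, not from the course.

# Code-based unit test for a non-negotiable behavior (runs under pytest).
def draft_customer_email(listing_id: str, recipient_name: str) -> str:
    # Stand-in for the real assistant call; replace with your own pipeline.
    return f"Hi {recipient_name}, 123 Main St is available for a showing this week."

def test_email_meets_non_negotiables():
    email = draft_customer_email("123-main-st", "Dana")
    assert "Dana" in email                      # must greet the recipient by name
    assert "virtual tour" not in email.lower()  # feature the product does not offer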
Shreya Shankar · Teaches AI evals course
00:24:04 - 00:24:54
LLMs lack context for nuanced error analysis
I loved Hamel's recent example. When we ask an LLM to perform error analysis, it often states the trace looks good. This is because it lacks the context to understand if something indicates a bad product smell, like the hallucination about scheduling a tour. I can guarantee that if I put that into ChatGPT and asked if there was an error, it would say no, that it did a great job. However, Hamel knew we don't actually have virtual tour functionality. Therefore, in these cases, it's crucial to manually perform this analysis yourself. We can discuss when to use LLMs in the process later, but a major pitfall is people immediately trying to automate this with an LLM.
Hamel Husain · Teaches AI evals course
00:44:40 - 00:48:31
Prioritizing errors and choosing eval types
We found 17 conversational flow issues in the categorized traces. Pivot tables are useful for exploring these, allowing you to double-click into and examine specific issues. This gives us a clear, initial understanding of our problems, moving us from chaos to a structured approach. We can now identify our biggest challenges, such as conversational issues or human handoff problems. The count isn't the only factor; some critical issues might demand immediate attention regardless of frequency. Now that we have a way to view the problem, we can consider whether evals are necessary. Some issues are simple engineering errors that don't require an eval because the fix is obvious. For instance, a formatting error might just mean you forgot to specify the desired format in the LLM prompt, and fixing the prompt might be enough. You could still write an eval for this, since it might be testable with code, checking the string format without running an LLM. There's a cost-benefit trade-off with evals; don't get carried away, but ground yourself in actual errors. Don't skip this step; it's where many people get lost by jumping straight to evals. If, for example, we want to tackle a human handoff issue but are unsure how to fix it because it involves subjective judgment, an LLM as a judge might be useful. There are two types of evals: code-based evals, which are cheaper and preferred when possible, and LLM as a judge. LLM as a judge requires a meta-eval: you need to evaluate the judging LLM itself to ensure it's performing correctly. We'll discuss this shortly. So, how do you build an LLM as a judge?
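Before getting to the judge, here is a minimal sketch of the cheaper, code-based style of eval: a deterministic format check run over saved traces with no LLM call. The trace structure and the "no loose M/D/YY dates" rule are illustrative assumptions, not the course's actual check.

import re

# Hypothetical saved traces; in practice these come from your logging store.
traces = [
    {"id": "t1", "output": "Your showing is confirmed for 2025-09-25."},
    {"id": "t2", "output": "See you on 9/25/25!"},
]

def passes_date_format(output: str) -> bool:
    # Pass if the output never uses loose M/D/YY-style dates.
    return not re.search(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", output)

failures = [t["id"] for t in traces if not passes_date_format(t["output"])]
print(f"{len(failures)} of {len(traces)} traces fail the format check: {failures}")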
Shreya Shankar · Teaches AI evals course
00:50:47 - 00:51:17
LLM judges are reliable for tightly scoped, binary evaluations
Absolutely, you've nailed it. People often think that this task is as difficult as creating the original agent. However, it's not. You're asking the judge to do one specific thing: evaluate a single failure mode. The scope of the problem is very small, and the LLM judge's output is simply "pass" or "fail." This makes it a very tightly scoped task that LLM judges can perform reliably.
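To make this concrete, here is a minimal sketch of such a judge for a single failure mode, the invented virtual-tour promise mentioned earlier, returning a binary pass/fail. It assumes the OpenAI Python SDK; the model name and prompt wording are illustrative, and the judge itself still needs to be validated against human labels.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking one thing only: does the assistant's reply
promise a virtual tour? The product has no virtual tour feature, so any such
promise is a failure. Answer with exactly one word: PASS or FAIL.

Assistant reply:
{reply}
"""

def judge_virtual_tour(reply_text: str) -> bool:
    # True means the trace passes (does not exhibit this failure mode).
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply_text)}],
        temperature=0,
    )
    return result.choices[0].message.content.strip().upper().startswith("PASS")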
Shreya Shankar · Teaches AI evals course
01:03:34 - 01:04:49
Rubrics and failure modes evolve with LLM outputs
We conducted a fascinating study with users who were trying to write LLM judges or validate their own LLM outputs. This was before evals became extremely popular online; we started the project in late 2023. As a researcher, I kept wondering why this problem was so difficult: machine learning and AI have existed for a long time, so none of this is new, yet suddenly everything felt challenging. We ran a user study with many developers and realized that the novelty lies in the inability to define rubrics upfront. People's opinions of good and bad evolve as they review more outputs; they only identify failure modes after seeing ten or so outputs, modes they never would have anticipated at the start. These were experts who had built numerous LLM pipelines and agents before. You simply cannot foresee every possibility from the outset, and I believe that's crucial in today's AI development.
Shreya Shankar · Teaches AI evals course
01:25:54 - 01:26:27
No single correct way to do evals
There's no single correct way to conduct evaluations. While many incorrect approaches exist, there are also numerous valid methods. You must consider your product's current stage, available resources, and then devise the most suitable plan. This will always involve some form of error analysis, as demonstrated today. However, how you operationalize those metrics will vary depending on your specific circumstances.
