Simon P. Couch - Posit PBC
[evaluating your LLM product]
Like software engineering, success with AI hinges on how fast you can iterate. - Hamel Husain
https://hamel.dev/blog/posts/evals/
In SWE:
(coding)
(unit testing)
In AI:
(prompt engineering, RAG, …)
AI 🤝 SWE tooling
Evaluations of ellmer-based apps are plug-and-play:
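A minimal sketch of what plug-and-play means here, assuming the conf chat app is built around an ellmer chat client (the provider and system prompt below are hypothetical stand-ins for the app's real ones):

```r
library(ellmer)

# Hypothetical chat client for the conf chat app; the real app's
# provider, model, and system prompt may differ.
client <- chat_anthropic(
  system_prompt = "You answer questions about the posit::conf schedule."
)
```

The same client object that powers the app can be handed straight to the eval, with no adapter code in between.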
Evaluations have three pieces: a dataset, a solver, and a scorer.
A few terms:
- inputs: Prompts that users might provide
- targets: Corresponding grading guidance
e.g.
- input: “Are there any talks about evals?”
- target: “Yes, Simon Couch will be giving a talk called ‘Is that LLM feature any good?’”
The conf chat app looks like this:
Your solver is the client
✅
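As a sketch, assuming the vitals package: the dataset is a data frame with input and target columns (the row below echoes the example earlier), and the solver wraps the same ellmer client that powers the app:

```r
library(tibble)
library(vitals)

# One eval dataset row: a user-style prompt plus grading guidance.
dataset <- tibble(
  input = "Are there any talks about evals?",
  target = paste(
    "Yes, Simon Couch will be giving a talk called",
    "'Is that LLM feature any good?'"
  )
)

# generate() turns the app's ellmer client into a solver that runs
# each input through the chat client and collects its responses.
solver <- generate(client)
```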
Wait, we’re using an LLM to grade an LLM’s output?
Yes.
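A model-graded scorer such as vitals' model_graded_qa() asks a grading model to judge each response against its target. A sketch, continuing from the objects above:

```r
# Bundle dataset, solver, and scorer into an evaluation task.
tsk <- Task$new(
  dataset = dataset,
  solver = solver,
  scorer = model_graded_qa()
)

# Run the eval: solve every input, then score every response.
tsk$eval()
```

Running tsk$eval() prints progress like: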
✅ Solving [7s]
✅ Scoring [5.5s]
github.com/simonpcouch/conf-25