Is that LLM feature any good?






Simon P. Couch - Posit PBC




When it comes to running evals on your LLM apps…

You really should be!

You can do it!

You really should be!

[evaluating your LLM product]





Like software engineering, success with AI hinges on how fast you can iterate. - Hamel Husain

In SWE:

  1. Make changes (coding)
  2. Evaluate quality (unit testing)
  3. Debug issues

In AI:

  1. Make changes (prompt engineering, RAG, …)
  2. Have a looksie
  3. Whack-a-mole

AI 🤝 SWE tooling

  1. Make changes (prompt engineering, RAG, …)
  2. Evaluate quality
  3. Debug issues

You can do it!

Evaluations of ellmer-based apps are plug-and-play:



Evaluations have three pieces:

  • Dataset
  • Solver
  • Scorer





Your dataset is a few:

  • inputs: Prompts that users might provide
  • targets: Corresponding grading guidance

e.g.

  • input: “Are there any talks about evals?”
  • target: “Yes, Simon Couch will be giving a talk called ‘Is that LLM feature any good?’”
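
A dataset like this can be written down as a small table with one row per sample. The sketch below assumes the tibble package and the column names `input` and `target`; the single row is the example above, and a real dataset would add more rows in the same shape.

```r
library(tibble)

# One row per sample: what a user might ask, and guidance for grading the answer.
dataset <- tibble(
  input = "Are there any talks about evals?",
  target = paste(
    "Yes, Simon Couch will be giving a talk called",
    "'Is that LLM feature any good?'"
  )
)

dataset
```

The target is grading guidance rather than an exact string to match, so it can describe what a good answer must contain.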








The conf chat app looks like this:

library(ellmer)

client <- chat_openai()

# add prompting, a RAG tool, etc...

live_browser(client)

Your solver is the client
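
Conceptually, a solver takes each input and returns the app's response. The sketch below is a simplified illustration of that idea, not how vitals implements it: `solve_inputs()` is a hypothetical helper, and it assumes the client exposes `$clone()` and `$chat()` the way an ellmer chat object does.

```r
# Hypothetical helper, not part of ellmer or vitals: run each input through a
# fresh copy of the client so earlier questions don't carry over to later ones.
solve_inputs <- function(inputs, client) {
  vapply(
    inputs,
    function(input) {
      chat <- client$clone()          # start each sample from the same state
      as.character(chat$chat(input))  # the response text for this input
    },
    character(1),
    USE.NAMES = FALSE
  )
}

# Usage, assuming `client` is the chat object defined above:
# responses <- solve_inputs("Are there any talks about evals?", client)
```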







# {input}, {solver_response}, and {target} are filled in for each sample
client$chat(glue::glue("
  You are assessing a submitted answer on a given task
  based on a criterion.

  [Task]: {input}

  [Submission]: {solver_response}

  [Criterion]: {target}

  Does the submission meet the criterion?
"))

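To make the placeholders in that prompt concrete, here is a minimal sketch that fills the template for one sample without calling a model. `grading_prompt()` is a hypothetical helper, and the solver response is made up for illustration.

```r
library(glue)

# Hypothetical helper: fill the grading template for a single sample.
grading_prompt <- function(input, solver_response, target) {
  glue("
    You are assessing a submitted answer on a given task
    based on a criterion.

    [Task]: {input}

    [Submission]: {solver_response}

    [Criterion]: {target}

    Does the submission meet the criterion?
  ")
}

prompt <- grading_prompt(
  input = "Are there any talks about evals?",
  solver_response = "Yes, there is one by Simon Couch.",  # made-up response
  target = "Yes, Simon Couch will be giving a talk called 'Is that LLM feature any good?'"
)

cat(prompt)
```

Passing `prompt` to a chat object's `$chat()` method, as in the snippet above, asks the grader model for its verdict.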


Wait, we’re using an LLM to grade an LLM’s output?











Yes.



library(vitals)

spreadsheet <- googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/1Pm0_itdB61H539zf8Ksm8l_6r-8tqspuvZUdFVoLwKg/edit?usp=sharing")

tsk <- Task$new(
  dataset = spreadsheet,
  solver = generate(client),
  scorer = model_graded_qa()
)

tsk$eval()
✅ Solving [7s]                                                    
✅ Scoring [5.5s]

When it comes to running evals on your LLM apps…

You really should be!

You can do it!