Is that LLM feature any good?

Simon P. Couch - Posit PBC

When it comes to running evals on your LLM apps…

You really should be!

You can do it!

You really should be!

[evaluating your LLM product]

Like software engineering, success with AI hinges on how fast you can iterate. - Hamel Husain

In SWE:

Make changes (coding)
Evaluate quality (unit testing)
Debug issues

In AI:

Make changes (prompt engineering, RAG, …)
Have a looksie
Whack-a-mole

AI 🤝 SWE tooling

Make changes (prompt engineering, RAG, …)
Evaluate quality
Debug issues

You can do it!

Evaluations of ellmer-based apps are plug-and-play:

Evaluations have three pieces:

Dataset
Solver
Scorer
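
As a rough sketch of how the three pieces fit together, the toy loop below uses a hard-coded solver and an exact-match scorer in base R. The names toy_solver and toy_scorer and the arithmetic questions are hypothetical stand-ins, not part of ellmer or vitals; a real solver would call the chat client, and a real scorer would usually be model-graded.

```r
# Dataset: inputs paired with grading targets (made-up toy data).
dataset <- data.frame(
  input = c("What is 2 + 2?", "What is 3 * 3?"),
  target = c("4", "9"),
  stringsAsFactors = FALSE
)

# Solver: stands in for the LLM app. It only knows one answer.
toy_solver <- function(input) {
  if (grepl("2 + 2", input, fixed = TRUE)) "4" else "I am not sure."
}

# Scorer: stands in for the grader. Exact match instead of a model.
toy_scorer <- function(response, target) {
  identical(trimws(response), target)
}

# Solve every input, then score every response against its target.
responses <- vapply(dataset$input, toy_solver, character(1), USE.NAMES = FALSE)
scores <- mapply(toy_scorer, responses, dataset$target, USE.NAMES = FALSE)

# Fraction of responses that met their target.
accuracy <- mean(scores)
```

Here the solver gets the first question right and the second wrong, so accuracy comes out to 0.5.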

The dataset is a small set of:

inputs: Prompts that users might provide
targets: Corresponding grading guidance

e.g.

input: “Are there any talks about evals?”
target: “Yes, Simon Couch will be giving a talk called ‘Is that LLM feature any good?’”
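
Rows like this one can be collected in a data frame. A minimal sketch, assuming the tibble package is installed and that the columns are named input and target (the names vitals documents for a Task dataset); the object name conf_dataset is made up for illustration:

```r
library(tibble)

# One row per test case: what a user might ask, and how to grade the answer.
conf_dataset <- tibble(
  input = "Are there any talks about evals?",
  target = paste(
    "Yes, Simon Couch will be giving a talk called",
    "'Is that LLM feature any good?'"
  )
)
```

More rows can be appended with rbind() or maintained in a spreadsheet and read in, as long as the two columns are present.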

The conf chat app looks like this:

library(ellmer)

client <- chat_openai()

# add prompting, a RAG tool, etc...

live_browser(client)

Your solver is the client ✅

The scorer sends a grading prompt to a model:

client$chat(glue::glue("
  You are assessing a submitted answer on a given task 
  based on a criterion.

  [Task]: {input}

  [Submission]: {solver_response}

  [Criterion]: {target}

  Does the submission meet the criterion?
"))
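
To see what the grader actually receives, the prompt can be built without calling a model. A small sketch, assuming the glue package is installed; the helper grading_prompt() and the example submission are hypothetical, and the real scorer may word its prompt differently:

```r
library(glue)

# Hypothetical helper: fills the grading template for one sample.
grading_prompt <- function(input, solver_response, target) {
  glue("
    You are assessing a submitted answer on a given task
    based on a criterion.

    [Task]: {input}

    [Submission]: {solver_response}

    [Criterion]: {target}

    Does the submission meet the criterion?
  ")
}

prompt <- grading_prompt(
  input = "Are there any talks about evals?",
  solver_response = "Yes, there is one talk about evals.", # made-up answer
  target = "Yes, Simon Couch will be giving a talk called 'Is that LLM feature any good?'"
)

cat(prompt)
```

Printing the filled-in prompt for a few samples is a cheap way to check that the criterion gives the grader enough to go on.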

Wait, we’re using an LLM to grade an LLM’s output?

Yes.

library(vitals)

spreadsheet <- googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/1Pm0_itdB61H539zf8Ksm8l_6r-8tqspuvZUdFVoLwKg/edit?usp=sharing")

tsk <- Task$new(
  dataset = spreadsheet,
  solver = generate(client),
  scorer = model_graded_qa()
)

tsk$eval()

✅ Solving [7s]                                                    
✅ Scoring [5.5s]