evaluate() and evaluate_active_file() are roughly analogous to devtools::test() and devtools::test_active_file(), though note that evaluate() can take either a directory of files or a single file.

Usage

evaluate(path = ".", across = tibble(), repeats = 1L, ...)

evaluate_active_file(
  path = active_eval_file(),
  across = tibble(),
  repeats = 1L,
  ...
)

Arguments

path

Path to the directory or file containing the evaluation code.

across

A data frame where each column represents an option to be set when evaluating the file at path and each row represents a pass through that file.
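For instance, a two-row across grid causes two passes through the eval file(s), one per row. A minimal sketch, assuming a hypothetical option named solver (the column name and values here are illustrative, not part of the evaluate() API):

```r
library(tibble)

# Hypothetical grid: two passes over the eval files, each with a
# different value of an (illustrative) `solver` option.
across_grid <- tibble(
  solver = c("model-a", "model-b")
)

nrow(across_grid)  # number of passes through the file(s): 2
```

The Examples section below shows the same idea with a column of ellmer chat objects.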

repeats

A single positive integer specifying the number of evaluation repeats, or runs over the same test files. Assuming that the models you're evaluating provide non-deterministic output, running the same test files multiple times by setting repeats > 1 will help you quantify the variability of your evaluations.

...

Additional arguments passed to internal functions.

Value

Results of the evaluation, invisibly. Evaluation results contain eval metadata as well as the number of failures and passes, inputs and outputs, and a description of each failure.

The function also has side-effects:

  • An interactive progress interface that tracks results in real time.

  • Result files are written to dirname(path)/_results. These files contain persistent, fine-grained evaluation results and can be read with results_read() and friends.
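As a sketch of how stored results can be loaded back for analysis (the exact signature of results_read() is assumed here, and the path follows the Examples below):

```r
library(evalthat)

# After an evaluate() run on tests/evalthat/, per-run result files
# land in tests/evalthat/_results. results_read() (signature assumed)
# loads them as a data frame for downstream analysis.
results <- results_read("tests/evalthat/_results")
```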

Examples

if (FALSE) {
library(ellmer)

# evaluate with the default model twice
evaluate("tests/evalthat/test-ggplot2.R", repeats = 2)

# evaluate a directory of evals across several models,
# repeating each eval twice
eval <- evaluate(
  "tests/evalthat/",
  across = tibble(chat = list(
    chat_openai(model = "gpt-4o-mini", echo = FALSE),
    chat_claude(model = "claude-3-5-sonnet-latest", echo = FALSE)
  )),
  repeats = 2
)
}