evaluate() and evaluate_active_file() are roughly analogous to devtools::test() and devtools::test_active_file(), though note that evaluate() can take either a directory of files or a single file.
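For instance, both of the following calls are supported (the paths shown are illustrative):

evaluate("tests/evalthat/")                  # a directory of eval files
evaluate("tests/evalthat/test-ggplot2.R")    # a single eval file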
Usage
evaluate(path = ".", across = tibble(), repeats = 1L, ...)
evaluate_active_file(
  path = active_eval_file(),
  across = tibble(),
  repeats = 1L,
  ...
)
Arguments
- path
Path to the directory or file containing the evaluation code.
- across
A data frame where each column represents an option to be set when evaluating the file at path and each row represents a pass through that file (see the sketch after this list).
- repeats
A single positive integer specifying the number of evaluation repeats, or runs over the same test files. Assuming that the models you're evaluating provide non-deterministic output, running the same test files multiple times by setting repeats > 1 will help you quantify the variability of your evaluations.
- ...
Additional arguments passed to internal functions.
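For instance, the across grid below (borrowing the chat column from the Examples) defines two passes over the eval file; with repeats = 2, each of those passes would run twice, for four runs in total. This is a sketch only:

library(tibble)
library(ellmer)

# one column (`chat`) varied across passes; each row is one pass
across <- tibble(chat = c(
  chat_openai(model = "gpt-4o-mini", echo = FALSE),
  chat_claude(model = "claude-3-5-sonnet-latest", echo = FALSE)
))

nrow(across) * 2  # rows in `across` times `repeats`: 4 runs in total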
Value
Results of the evaluation, invisibly. Evaluation results contain information on the eval metadata as well as numbers of failures and passes, input and output, and descriptions of each failure.
The function also has side effects:
- An interactive progress interface tracking results in real time.
- Result files are stored in dirname(path)/_results. Result files contain persistent, fine-grained evaluation results and can be interfaced with via results_read() and friends.
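For example, after an evaluation completes, the persisted results can be read back in. This is a minimal sketch that assumes results_read() can locate the _results directory with its default arguments:

results <- results_read()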
Examples
if (FALSE) {
library(ellmer)
library(tibble)
# evaluate with the default model twice
evaluate("tests/evalthat/test-ggplot2.R", repeats = 2)
# evaluate the same eval file across several models,
# repeating each eval twice
eval <- evaluate(
  "tests/evalthat/test-ggplot2.R",
  across = tibble(chat = c(
    chat_openai(model = "gpt-4o-mini", echo = FALSE),
    chat_claude(model = "claude-3-5-sonnet-latest", echo = FALSE)
  )),
  repeats = 2
)
}