Evaluation results — chores • choreseval

The chores data contains "confirmed" evaluation results generated by running the following:

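library(choreseval)

tsk <- chores_task()
tsk$eval(
  # The original placeholder was ellmer::chat_*(); substitute any ellmer
  # chat constructor. chat_anthropic() is shown as one example.
  solver_chat = ellmer::chat_anthropic(model = "some-model")
)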

See chores_task() for more on how the evaluation task works. Notably:

The solver carries out 34 refactorings using the cli chore helper, each repeated 3 times.
Each refactoring is then graded according to a rubric using Claude 4 Sonnet. The grading results in a score between 0 and 1 and incorporates measures of code quality as well as execution time. The score on the eval is the mean of the per-sample scores multiplied by 100.
Grading costs something like $2.50; the cost of solving depends on the model pricing.
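
As a rough illustration of the scoring arithmetic described above, the sketch below simulates per-sample scores and aggregates them. The `sample_scores` vector is made-up data, not output from the eval; only the 34 × 3 layout and the mean-times-100 rule come from the description above.

```r
# Hypothetical per-sample scores: 34 refactorings x 3 repeats = 102 samples.
# These values are simulated for illustration only.
set.seed(1)
n_refactorings <- 34
n_repeats <- 3
n_samples <- n_refactorings * n_repeats

# Each graded sample receives a score between 0 and 1.
sample_scores <- runif(n_samples, min = 0, max = 1)

# The eval score is the mean of the per-sample scores, multiplied by 100.
eval_score <- mean(sample_scores) * 100

n_samples
#> [1] 102
```

Because every per-sample score lies between 0 and 1, `eval_score` always falls between 0 and 100.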

Usage

chores

Format

An object of class tbl_df (inherits from tbl, data.frame) with 8 rows and 8 columns.

Columns

name: An identifier for the experiment.
provider: The ellmer provider name.
model: The model name.
score: The score on the eval, from 0 to 100. Scores above 80 are great, indicating a model is a good fit for use with chores. For reference, Claude 4 Sonnet scores about 94 (stored as 0.939, on a 0 to 1 scale, in the output under Examples).
cost: The total cost to run the solving across the 102 samples (estimated by ellmer).
metadata: The full evaluation samples.

Examples

library(tibble)

chores
#> # A tibble: 8 × 8
#>   name                provider model score price tokens_per_s local n_parameters
#>   <chr>               <chr>    <chr> <dbl> <chr>        <dbl> <chr> <chr>       
#> 1 claude_opus_4       Anthrop… clau… 0.917 $7.52        91.5  No    NA          
#> 2 claude_sonnet_4     Anthrop… clau… 0.939 $1.51       183.   No    NA          
#> 3 gemini_2_5_flash_n… Google/… gemi… 0.874 $0.11       240.   No    NA          
#> 4 gemini_2_5_flash_t… Google/… gemi… 0.897 $0.10        45.2  No    NA          
#> 5 gemini_2_5_pro      Google/… gemi… 0.918 $0.44         8.20 No    NA          
#> 6 gpt_4_1_mini        OpenAI   gpt-… 0.855 $0.06        40.9  No    NA          
#> 7 gpt_4_1_nano        OpenAI   gpt-… 0.774 $0.01        46.6  No    NA          
#> 8 gpt_4_1             OpenAI   gpt-… 0.904 $0.29         4.22 No    NA          
dplyr::glimpse(chores)
#> Rows: 8
#> Columns: 8
#> $ name         <chr> "claude_opus_4", "claude_sonnet_4", "gemini_2_5_flash_non…
#> $ provider     <chr> "Anthropic", "Anthropic", "Google/Gemini", "Google/Gemini…
#> $ model        <chr> "claude-opus-4-20250514", "claude-sonnet-4-20250514", "ge…
#> $ score        <dbl> 0.9173005, 0.9389540, 0.8741558, 0.8972495, 0.9183629, 0.…
#> $ price        <chr> "$7.52", "$1.51", "$0.11", "$0.10", "$0.44", "$0.06", "$0…
#> $ tokens_per_s <dbl> 91.494264, 183.210184, 239.727754, 45.185286, 8.203524, 4…
#> $ local        <chr> "No", "No", "No", "No", "No", "No", "No", "No"
#> $ n_parameters <chr> NA, NA, NA, NA, NA, NA, NA, NA
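
The printed scores are proportions between 0 and 1, while the column description uses a 0 to 100 scale with 80 as the threshold for a good fit. A minimal sketch of applying that threshold follows. It assumes a hand-entered copy of the rounded values printed above rather than the packaged `chores` object, so it runs without the package installed; the helper columns `score_100` and `price_usd` are names introduced here for illustration.

```r
# A subset of the rows printed above, re-entered by hand (rounded values).
# Rows whose names are truncated in the printout are omitted.
results <- data.frame(
  name  = c("claude_opus_4", "claude_sonnet_4", "gemini_2_5_pro",
            "gpt_4_1_mini", "gpt_4_1_nano", "gpt_4_1"),
  score = c(0.917, 0.939, 0.918, 0.855, 0.774, 0.904),
  price = c("$7.52", "$1.51", "$0.44", "$0.06", "$0.01", "$0.29"),
  stringsAsFactors = FALSE
)

# Convert the 0-1 proportion to the 0-100 scale used in the column docs.
results$score_100 <- results$score * 100

# Parse the price strings into numbers by dropping the leading "$".
results$price_usd <- as.numeric(sub("$", "", results$price, fixed = TRUE))

# Keep models above the threshold of 80, best score first.
great <- results[results$score_100 > 80, ]
great <- great[order(-great$score_100), ]
great$name
```

With these values, every model except gpt_4_1_nano clears the threshold, and claude_sonnet_4 ranks first.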