The chores data contains "confirmed" evaluation results generated by running the following:
See chores_task() for more on how the evaluation task works. Notably:
The solver carries out 34 refactorings using the cli chore helper, each repeated 3 times.
Each refactoring is then graded according to a rubric using Claude 4 Sonnet. The grading results in a score between 0 and 1 and incorporates measures of code quality as well as execution time. The score on the eval is the mean of the per-sample scores multiplied by 100.
Grading costs roughly $2.50; the cost of solving depends on the model's pricing.
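The score arithmetic described above can be sketched in a few lines of R. The per-sample scores here are simulated placeholders, not real grading output, and `sample_scores` is a hypothetical name; only the 34 × 3 sample layout and the mean-times-100 aggregation come from the description.

```r
# Hypothetical per-sample scores: 34 refactorings x 3 repeats = 102 samples.
# The values are simulated for illustration only; real scores come from the grader.
set.seed(1)
n_refactorings <- 34
n_repeats <- 3
sample_scores <- runif(n_refactorings * n_repeats, min = 0, max = 1)

# Sanity check: one score per sample, 102 in total.
stopifnot(length(sample_scores) == 102)

# The eval score is the mean of the per-sample scores, multiplied by 100.
eval_score <- mean(sample_scores) * 100
eval_score
```

Because every per-sample score lies between 0 and 1, the aggregated score always falls between 0 and 100.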
Columns
name
: An identifier for the experiment.
provider
: The ellmer provider name.
model
: The model name.
score
: The score on the eval, from 0 to 100. Scores above 80 are great, indicating a model is a good fit for use with chores. For reference, Claude 4 Sonnet scores
cost
: The total cost to run the solving across the 102 samples (estimated by ellmer).
metadata
: The full evaluation samples.
Examples
library(tibble)
chores
#> # A tibble: 8 × 8
#>   name                provider model score price tokens_per_s local n_parameters
#>   <chr>               <chr>    <chr> <dbl> <chr>        <dbl> <chr> <chr>
#> 1 claude_opus_4       Anthrop… clau… 0.917 $7.52        91.5  No    NA
#> 2 claude_sonnet_4     Anthrop… clau… 0.939 $1.51       183.   No    NA
#> 3 gemini_2_5_flash_n… Google/… gemi… 0.874 $0.11       240.   No    NA
#> 4 gemini_2_5_flash_t… Google/… gemi… 0.897 $0.10        45.2  No    NA
#> 5 gemini_2_5_pro      Google/… gemi… 0.918 $0.44         8.20 No    NA
#> 6 gpt_4_1_mini        OpenAI   gpt-… 0.855 $0.06        40.9  No    NA
#> 7 gpt_4_1_nano        OpenAI   gpt-… 0.774 $0.01        46.6  No    NA
#> 8 gpt_4_1             OpenAI   gpt-… 0.904 $0.29         4.22 No    NA
dplyr::glimpse(chores)
#> Rows: 8
#> Columns: 8
#> $ name         <chr> "claude_opus_4", "claude_sonnet_4", "gemini_2_5_flash_non…
#> $ provider     <chr> "Anthropic", "Anthropic", "Google/Gemini", "Google/Gemini…
#> $ model        <chr> "claude-opus-4-20250514", "claude-sonnet-4-20250514", "ge…
#> $ score        <dbl> 0.9173005, 0.9389540, 0.8741558, 0.8972495, 0.9183629, 0.…
#> $ price        <chr> "$7.52", "$1.51", "$0.11", "$0.10", "$0.44", "$0.06", "$0…
#> $ tokens_per_s <dbl> 91.494264, 183.210184, 239.727754, 45.185286, 8.203524, 4…
#> $ local        <chr> "No", "No", "No", "No", "No", "No", "No", "No"
#> $ n_parameters <chr> NA, NA, NA, NA, NA, NA, NA, NA
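One way to use these columns is to shortlist models that clear the "good fit" guideline and then compare them by price. The following is a minimal base R sketch, not part of the package. It rebuilds a small data frame from the rounded values printed above (rows whose names are truncated in the printout are left out), and it assumes that the printed scores are on a 0 to 1 scale, so that the "above 80" guideline corresponds to 0.80. The names `printed`, `price_usd`, and `good_fit` are hypothetical.

```r
# Values copied from the printed tibble above (rounded as shown there).
# Rows whose names are truncated in the printout are omitted.
printed <- data.frame(
  name = c("claude_opus_4", "claude_sonnet_4", "gemini_2_5_pro",
           "gpt_4_1_mini", "gpt_4_1_nano", "gpt_4_1"),
  score = c(0.917, 0.939, 0.918, 0.855, 0.774, 0.904),
  price = c("$7.52", "$1.51", "$0.44", "$0.06", "$0.01", "$0.29"),
  stringsAsFactors = FALSE
)

# price is a character column such as "$7.52"; strip the "$" to get a number.
printed$price_usd <- as.numeric(sub("$", "", printed$price, fixed = TRUE))

# Assumption: printed scores are on a 0-1 scale, so "above 80" maps to 0.80.
good_fit <- printed[printed$score > 0.80, ]

# Order the shortlisted models from cheapest to most expensive.
good_fit[order(good_fit$price_usd), c("name", "score", "price_usd")]
```

With the real dataset loaded, the same steps would apply to `chores` directly in place of the hand-built `printed` data frame.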