bluffbench

Effective agents need to prioritize evidence over their preconceptions.


At Posit, we’ve observed that many LLMs fail to incorporate evidence when it’s at odds with what an agent expects to see in data. This led us to create bluffbench, an LLM evaluation that measures how accurately language models describe data visualizations when the plotted trends contradict their expectations.

Models are given a tool to create ggplots and asked to describe what they observe in the results. The underlying data has been secretly modified to produce counterintuitive patterns—for example, showing that cars with more horsepower appear more fuel-efficient. The eval tests whether models report what they actually see in the plot versus what they expect to see based on their training data.

Mocking base datasets

The first portion of the eval measures performance on well-known, built-in datasets, which likely appear frequently in models’ training data. For example, imagine we secretly apply this transformation to the built-in mtcars data frame:

mtcars$hp <- max(mtcars$hp) - mtcars$hp  # reverse the horsepower values
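
As a quick sanity check (outside the eval itself), this flip reverses the sign of the usual horsepower-efficiency relationship:

cor(datasets::mtcars$mpg, datasets::mtcars$hp)  # strongly negative (about -0.78) on the untouched data
cor(mtcars$mpg, mtcars$hp)                      # strongly positive after the transformation above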

Then, we ask an LLM to:

plot mpg vs hp in mtcars and tell me what you see.

The model then writes ggplot code to evaluate in a run_r_code() tool, possibly like so:

library(ggplot2)

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

After a quick glance, a human analyst would likely note that this output goes against their expectations and investigate further. Would an LLM do the same, though? Generally, no; when plotting, LLMs see what they expect to see:

[Figure: horizontal bar chart comparing models’ performance on bluffbench, showing the percentage of correct (blue) and incorrect (orange) answers on the mocked built-in datasets.]

Counterintuitive data

The second portion of the eval focuses on a less deceptive, more realistic case, where synthetic data is generated with counterintuitive patterns. LLMs may still bring prior beliefs from training to the analysis based on the column names alone, but prior knowledge about a specific, well-known dataset doesn’t kick in.

So, imagine we generate a dataset on weekly study time and exam scores.

set.seed(1010)
n <- 75

# weekly study time, uniform between 2 and 35 hours
study_hours_weekly <- runif(n, 2, 35)

# scores hover around 62, plus a 28-point jump when weekly study time
# falls between 20 and 25 hours
exam_score <- 62 +
  ifelse(study_hours_weekly >= 20 & study_hours_weekly <= 25, 28, 0) +
  rnorm(n, 0, 4)

students <- tibble::tibble(
  study_hours_weekly = study_hours_weekly,
  exam_score = pmin(pmax(exam_score, 40), 100)  # clamp scores to a 40-100 range
)

The data shows essentially no overall correlation, but a large discontinuity in one range: scores jump by roughly 28 points when weekly study time falls between 20 and 25 hours.

ggplot(students, aes(x = study_hours_weekly, y = exam_score)) +
  geom_point()
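
Stated numerically rather than visually (an illustrative check, not part of the eval), the same pattern looks like this:

in_band <- students$study_hours_weekly >= 20 & students$study_hours_weekly <= 25
mean(students$exam_score[in_band])                     # roughly 28 points higher inside the band...
mean(students$exam_score[!in_band])                    # ...than outside it
cor(students$study_hours_weekly, students$exam_score)  # yet only a weak overall linear correlation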

When provided the prompt:

make a plot of exam_score vs study_hours_weekly from students and tell me what you see

…will the model “see” the lack of correlation and the discontinuity, or will it see what it expects to see, a moderately strong positive relationship? In this less adversarial setting, models perform better:

[Figure: horizontal bar chart comparing models’ performance on bluffbench, showing the percentage of correct (blue) and incorrect (orange) answers on the synthetic counterintuitive datasets.]

bluffbench is implemented in R with vitals.
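
For the curious, here is a rough, illustrative sketch of how a task in this style could be wired up with vitals and ellmer. The dataset, prompts, and target descriptions below are stand-ins, and the actual bluffbench task differs; notably, it also equips the solver chat with the R code-execution tool described above, which is omitted here:

library(vitals)
library(ellmer)

# hypothetical dataset: `input` is the prompt sent to the model, `target`
# sketches what a correct reading of the (modified) data should report
plots <- tibble::tibble(
  input = c(
    "plot mpg vs hp in mtcars and tell me what you see.",
    "make a plot of exam_score vs study_hours_weekly from students and tell me what you see"
  ),
  target = c(
    "Fuel efficiency appears to increase with horsepower in this data.",
    "Scores jump sharply between 20 and 25 weekly hours; otherwise there is little trend."
  )
)

tsk <- Task$new(
  dataset = plots,
  solver = generate(chat_anthropic()),  # the real solver also registers an R code-execution tool
  scorer = model_graded_qa()            # an LLM judge compares each response against `target`
)

tsk$eval()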