Title slide, reading "Fairness in machine learning with tidymodels," my name, Simon P. Couch, and my affiliation, Posit PBC. To the right of the text are six hexagonal stickers showing packages from the tidymodels ecosystem.

github.com/simonpcouch/slc-rug-23

Overview

  • What is fair machine learning?
  • Applied example
    • Problem context
    • Exploratory analysis
    • Fairness assessment
  • Model selection
  • Resources

Fairness in machine learning

A screenshot of a webpage from ProPublica. The hero image is composed of a picture of two men, side by side, one black and one white, with the title "Machine bias."

A screenshot of a web article with the title "Predictive policing algorithms are racist. They need to be dismantled."

A screenshot of a web article with the title "Amazon scraps secret AI recruiting engine that showed biases against women."

ggplot2 line plot with the title 'Predicted versus Actual Vase Weight.' The subtitle reads 'Machine learning model predicts lighter vases are too heavy, and vice versa.'

Is this fair?

The exact same line plot, but with the title and subtitle swapped out. They now read 'Predicted vs. Actual Home Value.' and 'Tax assessment model overvalues cheaper homes, and vice versa.'

Is this fair?

Fairness is morally defined

Corollary: machine learning fairness is not simply a mathematical optimization problem

Applied example: ChatGPT detectors🕵️

ChatGPT detectors

A screenshot of a web article with title "Cheaters  beware: CHATGPT maker releases AI detection tool."

A screenshot of a web article with title "The AI detection arms race is on."

A screenshot of a web article with title "ChatGPT detector catches AI-generated papers with unprecedented accuracy."

ChatGPT detectors😄

ChatGPT detectors🤨

A screenshot of a web article with title "OpenAI abruptly shuts down ChatGPT plagiarism detector, and educators are worried."

A screenshot of a web article with title "Professors are using ChatGPT detector tools to accuse students of cheating. But what if the software is wrong?"

A screenshot of a web article with title "AI detectors biased against non-native English writers."

Study design

  • Collect many human-written essays
    • Some written by “native English writers”
    • Others by writers who do not write English “natively”
  • Generate many essays based on the same prompts
  • Pass all of the essays to marketed GPT detectors

Getting set up

library(tidymodels)

Getting set up

library(detectors)

str(detectors)
#> tibble [6,185 × 9] (S3: tbl_df/tbl/data.frame)
#>  $ kind       : Factor w/ 2 levels "AI","Human": 2 2 2 1 1 2 1 1 2 2 ...
#>  $ .pred_AI   : num [1:6185] 0.999994 0.828145 0.000214 0 0.001784 ...
#>  $ .pred_class: Factor w/ 2 levels "AI","Human": 1 1 2 2 2 2 1 2 2 1 ...
#>  $ detector   : chr [1:6185] "Sapling" "Crossplag" "Crossplag" "ZeroGPT" ...
#>  $ native     : chr [1:6185] "No" "No" "Yes" NA ...
#>  $ name       : chr [1:6185] "Real TOEFL" "Real TOEFL" "Real College Essays" "Fake CS224N - GPT3" ...
#>  $ model      : chr [1:6185] "Human" "Human" "Human" "GPT3" ...
#>  $ document_id: num [1:6185] 497 278 294 671 717 855 533 484 781 460 ...
#>  $ prompt     : chr [1:6185] NA NA NA "Plain" ...
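
Not shown in the slides, but one quick way to get oriented is to count detector-essay pairs by essay source and writer background:

# Counts of detector-essay pairs by essay source and, for human-written
# essays, whether the author writes English natively (`native` is NA for
# AI-generated essays)
detectors %>%
  count(kind, native)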

Exploratory analysis

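The exploratory plots themselves aren't reproduced here. As a rough sketch of the kind of view they support, one could compare each detector's predicted probabilities on human-written essays across native and non-native English writers:

# For human-written essays only, compare the distribution of each
# detector's Pr(AI-generated) across native and non-native English writers
detectors %>%
  filter(kind == "Human", !is.na(native)) %>%
  ggplot(aes(x = .pred_AI, fill = native)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.6) +
  facet_wrap(vars(detector)) +
  labs(x = "Predicted probability that the essay is AI-generated",
       fill = "Native English writer")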

Fairness assessment with tidymodels

Fairness assessment with tidymodels

How does a GPT detector behave fairly?

Three perspectives:

  • Effective detection, group-blind
  • Fair prediction on human-written essays
  • Balancing both notions of fairness

Effective detection, group-blind

Position: it is unfair to pass off an essay written by a GPT as one’s own work.

Stakeholders:

  • A detector author
  • A student
  • An instructor

Effective detection, group-blind

# Rank detectors by how well they distinguish AI-generated from
# human-written essays overall
detectors %>%
  group_by(detector) %>%
  roc_auc(truth = kind, .pred_AI) %>%
  arrange(desc(.estimate)) %>%
  head(3)
#> # A tibble: 3 × 4
#>   detector      .metric .estimator .estimate
#>   <chr>         <chr>   <chr>          <dbl>
#> 1 GPTZero       roc_auc binary         0.750
#> 2 OriginalityAI roc_auc binary         0.682
#> 3 HFOpenAI      roc_auc binary         0.614

Note

This code makes no mention of the native variable.

Fair prediction on human-written essays

Position: it is unfair to disproportionately classify human-written text as AI-generated

Stakeholders:

  • Another student
  • Another instructor

Fair prediction on human-written essays

The fairness metric equal opportunity quantifies this definition of fairness.

equal_opportunity_by_native <- equal_opportunity(by = native)

Note

equal_opportunity() is one of several fairness metrics in the development version of yardstick.
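
If your installed yardstick doesn't yet include equal_opportunity(), one way to get the development version is to install it from GitHub (assuming the pak package is available):

# Install the development version of yardstick, which includes the
# fairness metrics used in this example
pak::pak("tidymodels/yardstick")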

Fair prediction on human-written essays

detectors %>%
  filter(kind == "Human") %>%
  group_by(detector) %>%
  equal_opportunity_by_native(
    truth = kind, estimate = .pred_class, event_level = "second"
  ) %>%
  arrange(.estimate) %>%
  head(3)
#> # A tibble: 3 × 5
#>   detector  .metric           .by    .estimator .estimate
#>   <chr>     <chr>             <chr>  <chr>          <dbl>
#> 1 Crossplag equal_opportunity native binary         0.464
#> 2 ZeroGPT   equal_opportunity native binary         0.477
#> 3 GPTZero   equal_opportunity native binary         0.510

The detectors with estimates closest to zero are most fair, by this definition of fairness.
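
To unpack what that estimate measures: with the data filtered to human-written essays and event_level = "second", equal opportunity here amounts to the gap in true positive rates, i.e., how often each detector correctly labels human writing as "Human", between the two native groups. A manual sketch of that interpretation (my reading of the metric, not code from the talk):

# For each detector: the proportion of human-written essays classified
# as "Human", by native English writer status, and the absolute gap
# between the two groups
detectors %>%
  filter(kind == "Human", !is.na(native)) %>%
  group_by(detector, native) %>%
  summarize(prop_correct = mean(.pred_class == "Human"), .groups = "drop") %>%
  pivot_wider(names_from = native, values_from = prop_correct) %>%
  mutate(gap = abs(Yes - No)) %>%
  arrange(gap)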

Balancing two notions of fairness

Position: it is unfair to pass off an essay written by a GPT as one’s own work, and it is unfair to disproportionately classify human-written text as AI-generated.

Stakeholders:

  • Another instructor

Balancing two notions of fairness

Workflow:

  1. Ensure that a model detects GPT-generated work with some threshold of performance, and then
  2. Choose the model among that set that predicts most fairly on human-written essays

Question

By this workflow, which of the first two definitions of fairness is encoded as more important?

Balancing two notions of fairness

Find the most performant detectors:

performant_detectors <- 
  detectors %>%
  group_by(detector) %>%
  roc_auc(truth = kind, .pred_AI) %>%
  arrange(desc(.estimate)) %>%
  head(3)

Balancing two notions of fairness

Among the most performant detectors, choose the model that predicts most fairly on human-written essays:

detectors %>%
  filter(kind == "Human", detector %in% performant_detectors$detector) %>%
  group_by(detector) %>%
  equal_opportunity_by_native(
    truth = kind, 
    estimate = .pred_class, 
    event_level = "second"
  ) %>%
  arrange(.estimate)
#> # A tibble: 3 × 5
#>   detector      .metric           .by    .estimator .estimate
#>   <chr>         <chr>             <chr>  <chr>          <dbl>
#> 1 GPTZero       equal_opportunity native binary         0.510
#> 2 HFOpenAI      equal_opportunity native binary         0.549
#> 3 OriginalityAI equal_opportunity native binary         0.709

Balancing two notions of fairness

Take-home📝

Switch the order of these steps. Does this result in a different set of recommended models?
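
One possible starting point for the take-home, flipping the two steps: first keep the detectors that predict most fairly on human-written essays, then rank that set by overall detection performance. (Keeping three detectors mirrors the code above and is otherwise arbitrary.)

# Step 1 (flipped): the three detectors with the smallest equal
# opportunity estimates on human-written essays
fairest_detectors <-
  detectors %>%
  filter(kind == "Human") %>%
  group_by(detector) %>%
  equal_opportunity_by_native(
    truth = kind, estimate = .pred_class, event_level = "second"
  ) %>%
  arrange(.estimate) %>%
  head(3)

# Step 2 (flipped): among those, rank by overall ability to detect
# GPT-generated essays
detectors %>%
  filter(detector %in% fairest_detectors$detector) %>%
  group_by(detector) %>%
  roc_auc(truth = kind, .pred_AI) %>%
  arrange(desc(.estimate))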

Model selection: choosing a detector

How do I choose a detector?




What do you value?

Resources

Resources

  • tidyverse: r4ds.hadley.nz

The book cover for "R for Data Science."

Resources

  • tidyverse: r4ds.hadley.nz
  • tidymodels: tmwr.org

The book cover for "Tidy Modeling with R."

Resources

  • tidyverse: r4ds.hadley.nz
  • tidymodels: tmwr.org
  • Slides and example notebooks:
github.com/simonpcouch/slc-rug-23