End-to-End Machine Learning with Posit Team

What is a machine learning model?

  • Netflix movie recommendations
  • Zillow property value estimates
  • E-mail spam filters

Estimate the value of an outcome using predictors

Machine learning with Posit Team

  • Understand and clean data with tidyverse
  • Train and evaluate models with tidymodels and Workbench
  • Share the final model with vetiver and Connect

Understand and clean data

  • Mutagenicity refers to a drug’s tendency to increase the rate of mutations
  • Mutagenicity can be evaluated with a lab test, though those tests are costly and time-intensive

What if a machine learning model could predict mutagenicity using known drug information?

mutagen_tbl
#> # A tibble: 4,335 × 1,580
#>    outcome    MW   AMW    Sv    Se    Sp    Ss    Mv    Me    Mp    Ms
#>    <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 mutagen  326.  7.59  29.3  42.6  30.6  50.7  0.68  0.99  0.71  2.03
#>  2 mutagen  174.  9.17  13.2  19.6  13.4  38    0.7   1.03  0.71  2.92
#>  3 nonmut…  300.  9.39  20.0  33.6  21.0  61.2  0.63  1.05  0.66  3.06
#>  4 nonmut…  143.  6.23  12.6  23.1  13.5  26.2  0.55  1     0.59  2.62
#>  5 nonmut…  216. 18.0   10.6  13.0  11.7  27.1  0.88  1.08  0.98  2.71
#>  6 mutagen  190.  7.93  15.4  24.4  16.0  36    0.64  1.02  0.67  2.57
#>  7 mutagen  328. 12.6   18.8  27.1  20.0  49.4  0.72  1.04  0.77  2.75
#>  8 nonmut…  324.  8.11  26.3  40.7  27.4  59.2  0.66  1.02  0.68  2.47
#>  9 mutagen  136.  7.56  11.3  18.2  11.8  25.7  0.63  1.01  0.65  2.57
#> 10 mutagen  323.  7.89  26.8  41.5  27.9  54.9  0.65  1.01  0.68  2.29
#> # ℹ 4,325 more rows
#> # ℹ 1,569 more variables: nAT <int>, nSK <int>, nBT <int>, nBO <int>,
#> #   nBM <int>, SCBO <dbl>, ARR <dbl>, nCIC <int>, nCIR <int>,
#> #   RBN <int>, RBF <dbl>, nDB <int>, nTB <int>, nAB <int>, nH <int>,
#> #   nC <int>, nN <int>, nO <int>, nP <int>, nS <int>, nF <int>,
#> #   nCL <int>, nBR <int>, nI <int>, nX <int>, nR03 <int>, nR04 <int>,
#> #   nR05 <int>, nR06 <int>, nR07 <int>, nR08 <int>, nR09 <int>, …
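Under stated assumptions, a modeling workflow for this data might begin like the following sketch: hold out a test set from `mutagen_tbl` before any modeling, then fit a classifier. The model type and engine here are illustrative, not necessarily the ones used in the talk.

```r
library(tidymodels)

# Illustrative sketch, not the talk's exact code: hold out a test
# set before any modeling so evaluation stays honest.
set.seed(2024)
mutagen_split <- initial_split(mutagen_tbl, strata = outcome)
mutagen_train <- training(mutagen_split)
mutagen_test  <- testing(mutagen_split)

# Fit a boosted tree classifier on the training set.
mutagen_fit <-
  boost_tree(mode = "classification") %>%
  fit(outcome ~ ., data = mutagen_train)
```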

Goal: separate the blue and yellow

A ggplot2 dot-plot, with the predictors molecular weight and partition coefficient on the x and y axes. Points are colored by outcome (mutagen vs. nonmutagen). The two clouds of points are largely intermixed, showing that these two predictors do not separate the classes well on their own.
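A plot like the one described could be drawn with ggplot2 along these lines; `partition_coef` is a placeholder for the dataset's actual partition-coefficient column name:

```r
library(ggplot2)

# `partition_coef` is a hypothetical column name standing in for
# the dataset's partition coefficient descriptor.
ggplot(mutagen_tbl, aes(x = MW, y = partition_coef, color = outcome)) +
  geom_point(alpha = 0.5) +
  labs(x = "Molecular weight", y = "Partition coefficient")
```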

Goal: separate the blue and yellow

A similar ggplot2 dot-plot, with predictors heavy atoms and average valence connectivity on the x and y axes.

Goal: separate the blue and yellow 😱😱😱

A similar ggplot2 dot-plot, with predictors heavy atoms and average valence connectivity on the x and y axes.

Train and evaluate models

Why tidymodels?

Why tidymodels?  Consistency

With lm():

model <- 
  lm(mpg ~ ., mtcars)

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With glmnet:

model <- 
  glmnet(
    as.matrix(mtcars[2:11]),
    mtcars$mpg
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With h2o:

h2o::h2o.init()
h2o::as.h2o(mtcars, "mtcars")

model <- 
  h2o::h2o.glm(
    x = colnames(mtcars[2:11]), 
    y = "mpg",
    training_frame = "mtcars"
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("h2o") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Safety

  • A 2023 review found data leakage to be “a widespread failure mode in machine-learning (ML)-based science.”
  • Overfitting leads to analysts believing models are more performant than they actually are.
  • Different implementations of the same machine learning model can give differing results, making modeling results hard to reproduce.
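One way tidymodels guards against leakage, sketched here under assumptions about the data: preprocessing is estimated inside the workflow, so statistics like means and standard deviations come from training data only.

```r
library(tidymodels)

# Sketch: the recipe's normalization statistics are estimated on
# training data only, each time the workflow is fit.
set.seed(2024)
mutagen_split <- initial_split(mutagen_tbl, strata = outcome)

mutagen_wf <-
  workflow() %>%
  add_recipe(
    recipe(outcome ~ ., data = training(mutagen_split)) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg())

mutagen_fit <- fit(mutagen_wf, data = training(mutagen_split))
```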

Why tidymodels? Communicability

tidymodels objects can easily be visualized

autoplot(roc_curve)
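An `roc_curve` object like the one plotted above could come from yardstick; this sketch assumes a `preds` tibble of test-set predictions with the true class in `outcome` and the mutagen class probability in `.pred_mutagen`:

```r
library(yardstick)

# `preds` is assumed to hold test-set predictions.
roc_curve_result <- roc_curve(preds, truth = outcome, .pred_mutagen)
autoplot(roc_curve_result)
```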

Why tidymodels? Communicability

tidymodels objects can easily be visualized

autoplot(random_forest_tuning_result)

Why tidymodels? Communicability

tidymodels objects can easily be visualized

autoplot(iterative_search_result)

Why tidymodels?  Completeness

Built-in support for 99 machine learning models!

#> # A tibble: 99 × 2
#>    name       engine   
#>    <chr>      <chr>    
#>  1 boost_tree C5.0     
#>  2 boost_tree h2o      
#>  3 boost_tree h2o_gbm  
#>  4 boost_tree lightgbm 
#>  5 boost_tree mboost   
#>  6 boost_tree spark    
#>  7 boost_tree xgboost  
#>  8 null_model parsnip  
#>  9 svm_linear LiblineaR
#> 10 svm_linear kernlab  
#> # ℹ 89 more rows
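To browse these interactively, parsnip provides `show_engines()`, which lists the available engines and modes for a given model type:

```r
library(parsnip)

# List the engines (and modes) available for boosted trees:
show_engines("boost_tree")
```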

Why tidymodels?  Completeness

Built-in support for 102 data pre-processing techniques!

#> # A tibble: 102 × 1
#>    name               
#>    <chr>              
#>  1 step_rename_at     
#>  2 step_scale         
#>  3 step_kpca          
#>  4 step_percentile    
#>  5 step_depth         
#>  6 step_poly_bernstein
#>  7 step_impute_linear 
#>  8 step_novel         
#>  9 step_nnmf_sparse   
#> 10 step_slice         
#> # ℹ 92 more rows
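Pre-processing steps compose into a recipe; here is a small sketch combining two of the steps listed above:

```r
library(recipes)

# Scale numeric predictors, then project them with kernel PCA.
mutagen_rec <-
  recipe(outcome ~ ., data = mutagen_tbl) %>%
  step_scale(all_numeric_predictors()) %>%
  step_kpca(all_numeric_predictors(), num_comp = 5)
```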

Why tidymodels?  Extensibility

Can’t find the technique you need?

Why tidymodels? Deployability

Tightly integrated with vetiver and Posit Team.
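Under assumptions about the final fitted workflow (here called `mutagen_fit`, a placeholder) and access to a Connect board, deployment with vetiver might look like:

```r
library(vetiver)
library(pins)

# `mutagen_fit` is a placeholder for the final fitted workflow.
v <- vetiver_model(mutagen_fit, "mutagen")

# Pin the versioned model to Posit Connect...
board <- board_connect()
vetiver_pin_write(board, v)

# ...and deploy it as a Plumber API on Connect.
vetiver_deploy_rsconnect(board, "user/mutagen")
```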

Let’s go see how! 😎

Resources

  • tidyverse: r4ds.hadley.nz
  • tidymodels: tmwr.org
  • Posit Team: posit.co/team
  • Slides and source code:
github.com/simonpcouch/mutagen

Thank you!