End-to-End Machine Learning with Posit Team

What is a machine learning model?

  • Netflix movie recommendations
  • Zillow property value estimates
  • E-mail spam filters

Estimate the value of an outcome using predictors

Machine learning with Posit Team

  • Understand and clean data with tidyverse
  • Train and evaluate models with tidymodels and Workbench
  • Share the final model with vetiver and Connect

Understand and clean data

  • Mutagenicity refers to a drug’s tendency to increase the rate of mutations
  • Mutagenicity can be evaluated with a lab test, though those tests are costly and time-intensive

What if a machine learning model could predict mutagenicity using known drug information?

mutagen_tbl
#> # A tibble: 4,335 × 1,580
#>    outcome    MW   AMW    Sv    Se    Sp    Ss    Mv    Me    Mp    Ms
#>    <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 mutagen  326.  7.59  29.3  42.6  30.6  50.7  0.68  0.99  0.71  2.03
#>  2 mutagen  174.  9.17  13.2  19.6  13.4  38    0.7   1.03  0.71  2.92
#>  3 nonmut…  300.  9.39  20.0  33.6  21.0  61.2  0.63  1.05  0.66  3.06
#>  4 nonmut…  143.  6.23  12.6  23.1  13.5  26.2  0.55  1     0.59  2.62
#>  5 nonmut…  216. 18.0   10.6  13.0  11.7  27.1  0.88  1.08  0.98  2.71
#>  6 mutagen  190.  7.93  15.4  24.4  16.0  36    0.64  1.02  0.67  2.57
#>  7 mutagen  328. 12.6   18.8  27.1  20.0  49.4  0.72  1.04  0.77  2.75
#>  8 nonmut…  324.  8.11  26.3  40.7  27.4  59.2  0.66  1.02  0.68  2.47
#>  9 mutagen  136.  7.56  11.3  18.2  11.8  25.7  0.63  1.01  0.65  2.57
#> 10 mutagen  323.  7.89  26.8  41.5  27.9  54.9  0.65  1.01  0.68  2.29
#> # ℹ 4,325 more rows
#> # ℹ 1,569 more variables: nAT <int>, nSK <int>, nBT <int>, nBO <int>,
#> #   nBM <int>, SCBO <dbl>, ARR <dbl>, nCIC <int>, nCIR <int>,
#> #   RBN <int>, RBF <dbl>, nDB <int>, nTB <int>, nAB <int>, nH <int>,
#> #   nC <int>, nN <int>, nO <int>, nP <int>, nS <int>, nF <int>,
#> #   nCL <int>, nBR <int>, nI <int>, nX <int>, nR03 <int>, nR04 <int>,
#> #   nR05 <int>, nR06 <int>, nR07 <int>, nR08 <int>, nR09 <int>, …
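Under stated assumptions, a modeling workflow for this data might begin like the following sketch: hold out a test set from `mutagen_tbl` before any modeling, then fit a classifier. The model type and engine here are illustrative, not necessarily the ones used in the talk.

```r
library(tidymodels)

# Illustrative sketch, not the talk's exact code: hold out a test
# set before any modeling so evaluation stays honest.
set.seed(2024)
mutagen_split <- initial_split(mutagen_tbl, strata = outcome)
mutagen_train <- training(mutagen_split)
mutagen_test  <- testing(mutagen_split)

# Fit a boosted tree classifier on the training set.
mutagen_fit <-
  boost_tree(mode = "classification") %>%
  fit(outcome ~ ., data = mutagen_train)
```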

Goal: separate the blue and yellow

A ggplot2 dot-plot, with the predictors molecular weight and partition coefficient on the x and y axes. Points are colored by outcome (mutagen vs. nonmutagen). The two clouds of points are largely intermixed, showing that these two predictors do not separate the classes well on their own.
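A plot like the one described could be drawn with ggplot2 along these lines; `partition_coef` is a placeholder for the dataset's actual partition-coefficient column name:

```r
library(ggplot2)

# `partition_coef` is a hypothetical column name standing in for
# the dataset's partition coefficient descriptor.
ggplot(mutagen_tbl, aes(x = MW, y = partition_coef, color = outcome)) +
  geom_point(alpha = 0.5) +
  labs(x = "Molecular weight", y = "Partition coefficient")
```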

Goal: separate the blue and yellow

A similar ggplot2 dot-plot, with predictors heavy atoms and average valence connectivity on the x and y axes.

Goal: separate the blue and yellow 😱😱😱

A similar ggplot2 dot-plot, with predictors heavy atoms and average valence connectivity on the x and y axes.

Train and evaluate models

Why tidymodels?

Why tidymodels?  Consistency

With lm():

model <- 
  lm(mpg ~ ., mtcars)

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With glmnet:

model <- 
  glmnet(
    as.matrix(mtcars[2:11]),
    mtcars$mpg
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With h2o:

h2o::h2o.init()
h2o::as.h2o(mtcars, "mtcars")

model <- 
  h2o::h2o.glm(
    x = colnames(mtcars[2:11]), 
    y = "mpg",
    training_frame = "mtcars"
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("h2o") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Safety

  • A 2023 review found data leakage to be “a widespread failure mode in machine-learning (ML)-based science.”
  • Overfitting leads to analysts believing models are more performant than they actually are.
  • Different implementations of the same machine learning model can give differing results, making modeling results hard to reproduce.
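One way tidymodels guards against leakage, sketched here under assumptions about the data: preprocessing is estimated inside the workflow, so statistics like means and standard deviations come from training data only.

```r
library(tidymodels)

# Sketch: the recipe's normalization statistics are estimated on
# training data only, each time the workflow is fit.
set.seed(2024)
mutagen_split <- initial_split(mutagen_tbl, strata = outcome)

mutagen_wf <-
  workflow() %>%
  add_recipe(
    recipe(outcome ~ ., data = training(mutagen_split)) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg())

mutagen_fit <- fit(mutagen_wf, data = training(mutagen_split))
```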

Why tidymodels? Communicability

tidymodels objects can easily be visualized

autoplot(roc_curve)
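An `roc_curve` object like the one plotted above could come from yardstick; this sketch assumes a `preds` tibble of test-set predictions with the true class in `outcome` and the mutagen class probability in `.pred_mutagen`:

```r
library(yardstick)

# `preds` is assumed to hold test-set predictions.
roc_curve_result <- roc_curve(preds, truth = outcome, .pred_mutagen)
autoplot(roc_curve_result)
```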

Why tidymodels? Communicability

tidymodels objects can easily be visualized

autoplot(random_forest_tuning_result)

Why tidymodels? Communicability

tidymodels objects can easily be visualized

autoplot(iterative_search_result)

Why tidymodels?  Completeness

Built-in support for 99 machine learning models!

#> # A tibble: 99 × 2
#>    name       engine   
#>    <chr>      <chr>    
#>  1 boost_tree C5.0     
#>  2 boost_tree h2o      
#>  3 boost_tree h2o_gbm  
#>  4 boost_tree lightgbm 
#>  5 boost_tree mboost   
#>  6 boost_tree spark    
#>  7 boost_tree xgboost  
#>  8 null_model parsnip  
#>  9 svm_linear LiblineaR
#> 10 svm_linear kernlab  
#> # ℹ 89 more rows
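To browse these interactively, parsnip provides `show_engines()`, which lists the available engines and modes for a given model type:

```r
library(parsnip)

# List the engines (and modes) available for boosted trees:
show_engines("boost_tree")
```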

Why tidymodels?  Completeness

Built-in support for 102 data pre-processing techniques!

#> # A tibble: 102 × 1
#>    name               
#>    <chr>              
#>  1 step_rename_at     
#>  2 step_scale         
#>  3 step_kpca          
#>  4 step_percentile    
#>  5 step_depth         
#>  6 step_poly_bernstein
#>  7 step_impute_linear 
#>  8 step_novel         
#>  9 step_nnmf_sparse   
#> 10 step_slice         
#> # ℹ 92 more rows
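Pre-processing steps compose into a recipe; here is a small sketch combining two of the steps listed above:

```r
library(recipes)

# Scale numeric predictors, then project them with kernel PCA.
mutagen_rec <-
  recipe(outcome ~ ., data = mutagen_tbl) %>%
  step_scale(all_numeric_predictors()) %>%
  step_kpca(all_numeric_predictors(), num_comp = 5)
```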

Why tidymodels?  Extensibility

Can’t find the technique you need?

Why tidymodels? Deployability

Tightly integrated with vetiver and Posit Team.
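Under assumptions about the final fitted workflow (here called `mutagen_fit`, a placeholder) and access to a Connect board, deployment with vetiver might look like:

```r
library(vetiver)
library(pins)

# `mutagen_fit` is a placeholder for the final fitted workflow.
v <- vetiver_model(mutagen_fit, "mutagen")

# Pin the versioned model to Posit Connect...
board <- board_connect()
vetiver_pin_write(board, v)

# ...and deploy it as a Plumber API on Connect.
vetiver_deploy_rsconnect(board, "user/mutagen")
```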

Let’s go see how! 😎

Resources

  • tidyverse: r4ds.hadley.nz
  • tidymodels: tmwr.org
  • Posit Team: posit.co/team
  • Slides and source code:
github.com/simonpcouch/mutagen

Thank you!