
The rinfa parsnip extension implements bindings to linfa, a machine learning library written entirely in Rust. While linfa is somewhat less flexible than other modeling engines that tidymodels supports, it can often fit models a good bit faster than the alternatives. We’ll use this article to back that claim up.

While this article benchmarks fit times, it doesn’t measure predictive performance, nor does it vary the kinds of features or the correlation structure among them; in practice, choosing a modeling engine means weighing both computational and predictive performance in context.

First, loading tidymodels and the rinfa extension package:
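
library(tidymodels)
library(rinfa)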

rinfa supports a number of model types:

  • Linear regression
  • Logistic regression
  • Multinomial regression
  • Decision trees
  • Naive Bayes
  • Support vector machines
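
For reference, here’s a rough sketch of the corresponding model specifications with the engine set to "linfa". naive_Bayes() comes from the discrim extension package, and the classification modes shown for decision trees and SVMs are just for illustration:

linear_reg(engine = "linfa")
logistic_reg(engine = "linfa")
multinom_reg(engine = "linfa")
decision_tree(mode = "classification", engine = "linfa")
naive_Bayes(engine = "linfa")
svm_linear(mode = "classification", engine = "linfa")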

For each of those model types, we’ll benchmark the time to fit the model to datasets of varying sizes and compare the time-to-fit across modeling engines.

Linear regression

For this first model type, I’ll write out the code that I’m using to benchmark the model fits. After that, I’ll just plot the results without showing the code.

First, setting parameters for the experiment: each modeling engine will fit models to datasets with each of the following numbers of rows:

n_rows <- round(10^seq(2, 6, .5))

n_rows
#> [1]     100     316    1000    3162   10000   31623  100000  316228 1000000

Now, we’ll grab the possible engines from parsnip’s model environment. parsnip supports fitting linear regressions with the following engines:

engines <- unique(get_model_env()[["linear_reg_fit"]]$engine)
engines
#> [1] "lm"     "glm"    "glmnet" "stan"   "spark"  "keras"  "brulee" "linfa"

Now, we set up a data.frame() to collect the experimental results and load the packages needed to fit models with each of these engines.

linear_reg_timings <- 
  data.frame(
    engine = character(), 
    n_row = integer(), 
    timing = numeric()
  )

x <- lapply(engines, function(engine) {
  pkgs <- required_pkgs(linear_reg(engine = engine))
  lapply(pkgs, require, character.only = TRUE)
})

For each engine, we put together a model specification and then iterate through the training set sizes. After generating a dataset with that many rows, we measure how long the model fit takes and store the elapsed time in the linear_reg_timings data frame.

for (engine in engines) {
  spec <- linear_reg(engine = engine)
  if (engine == "glmnet") {
    spec <- spec %>% set_args(penalty = 0)
  }
  if (engine %in% c("spark", "keras")) {
    next
  }
  for (n_row in n_rows) {
    # first, check whether a previous fit with this engine took too long;
    # if it did, skip the remaining (larger) dataset sizes
    longest_fit <- linear_reg_timings[linear_reg_timings$engine == engine, ]
    longest_fit <- if (nrow(longest_fit) == 0) 0 else max(longest_fit$timing)
    
    if (longest_fit > 600) next

    set.seed(1)
    d <- sim_regression(n_row)
    
    # The following blurb is atypical in a tidymodels workflow, but
    # feels worth including to give each engine a fair shot. We extract the
    # "encoding," i.e. whether the model takes (matrix) x and y arguments or
    # a data.frame and a formula. The elapsed time to convert from data.frame
    # to matrix or vice versa is non-negligible for quicker model fits.
    #
    # Normally, a user would just call whichever fit interface feels more
    # comfortable for them and tidymodels would take care of the conversion.
    fit_encoding <- get_fit("linear_reg")
    fit_encoding <- fit_encoding[fit_encoding$engine == engine, "value"]
    fit_encoding <- fit_encoding$value[[1]]$interface
    
    if (!identical(fit_encoding, "matrix")) {
      timing <- system.time(fit(spec, outcome ~ ., d))
    } else {
      x <- as.matrix(d[-1])
      timing <- system.time(fit_xy(spec, x = x, y = d$outcome))
    }
    
    linear_reg_timings <-
      bind_rows(
        linear_reg_timings,
        data.frame(engine = engine, n_row = n_row, timing = timing[["elapsed"]])
      )
  }
}

linear_reg_timings %>%
  ggplot() +
  aes(x = n_row, y = timing, colour = engine, group = engine) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10() +
  labs(x = "Log(# rows)", y = "Log(Fit time, seconds)")

Logistic regression
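
The timings for the remaining model types were collected with the same looping approach as above, swapping in the matching model specification and simulated data; per the note earlier, that code isn’t shown. As a rough illustration, a single logistic regression fit might look like the following. sim_classification() and its binary class outcome are assumptions here (the analogue of sim_regression() above), and the formula interface is used for simplicity rather than matching each engine’s preferred encoding.

# a sketch of one timed fit; sim_classification() comes with modeldata,
# which is loaded by tidymodels
spec <- logistic_reg(engine = "linfa")
set.seed(1)
d <- sim_classification(1000)
system.time(fit(spec, class ~ ., d))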

logistic_reg_timings %>%
  ggplot() +
  aes(x = n_row, y = timing, colour = engine, group = engine) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10() +
  labs(x = "# rows", y = "Log(Fit time, seconds)")

Multinomial regression

multinom_reg_timings %>%
  ggplot() +
  aes(x = n_row, y = timing, colour = engine, group = engine) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10() +
  labs(x = "# rows", y = "Log(Fit time, seconds)")

Decision trees

decision_tree_timings %>%
  ggplot() +
  aes(x = n_row, y = timing, colour = engine, group = engine) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10() +
  labs(x = "# rows", y = "Log(Fit time, seconds)")

Naive Bayes

naive_Bayes_timings %>%
  ggplot() +
  aes(x = n_row, y = timing, colour = engine, group = engine) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10() +
  labs(x = "# rows", y = "Log(Fit time, seconds)")

Support vector machines

svm_linear_timings %>%
  ggplot() +
  aes(x = n_row, y = timing, colour = engine, group = engine) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10() +
  labs(x = "# rows", y = "Log(Fit time, seconds)")