Warm tip: This article is reproduced from serverfault.com, please click

Recommended way of creating reusable objects within an R function

发布于 2020-12-01 18:16:27

Suppose we have the following data:

# simulate data to fit
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))

Let's also suppose the user has fit a model:

# user fits a lm
mod = lm(y~x)

Now suppose I have an R package designed to perform several operations on the object mod. Just for simplicify, suppose we have two functions, one that plots the data, and one that computes the coefficients. However, as an intermediary, suppose we want to perform some operation on the data (in this example, add ten).

Example:

# function that adds ten to all scores
add_ten = function(model) {
  data = model$model
  data = data + 10
  return(data)
}

# functions I defined that do something to the "add_ten" dataset
plot_ten = function(model) {
  new_data = data.frame(add_ten(model))
  x = all.vars(formula(model))[2]
  y = all.vars(formula(model))[1]
  ggplot2::ggplot(new_data, aes_string(x=x, y=y)) + geom_point() + geom_smooth()
}

coefs_ten = function(model) {
  new_data = data.frame(add_ten(model))
  coef(lm(formula(model), new_data))
}

(Obviously, this is pretty silly to do. In actuality, the operation I want to perform is multiple imputation, which is computationally intensive).

Notice in the above example I have to call the add_ten function twice, once for plot_ten and once for coefs_ten. This is inefficient.

So, now to my question, what is the best way to create a reusable object within a function?

I could, of course, create an object to be placed in the user's global environment:

add_ten = function(model) {
  # check for add_ten_data in the global environment
  if (exists("add_ten_data", where = .GlobalEnv)) return(get("add_ten_data", envir = .GlobalEnv))
  data = model$model
  data = data + 10
  # assign add_ten_data to the global environment
  assign('add_ten_data', data, envir = .GlobalEnv)
  return(data)
}

I'm happy to do so, but worry about the "netiquette" of putting something in the user's environment. There's also a potential problem if users happen to have an object called "add_ten_data" in their environment.

So, what is the best way of accomplishing this?

Thanks in advance!

Questioner
dfife
Viewed
0
Allan Cameron 2020-12-02 02:45:02

You should certainly avoid writing an object to the global environment. If you find that you have to repeat the same computationally expensive task at the top of a number of different functions, it means you are carrying out the computationally expensive task too late.

For example, you could create an S3 class that holds the necessary components to produce a "cheap" plot and a "cheap" extraction of the coefficients. It even has the benefits of generic dispatch:

add_ten <- function(model) model$model + 10

lm_tens <- function(formula, data)
{
  model <- if(missing(data)) lm(formula) else lm(formula, data = data)
  
  structure(list(data = data.frame(add_ten(model)), model = model),
            class = "tens")
}

plot.tens <- function(tens) {
  x = all.vars(formula(tens$data))[2]
  y = all.vars(formula(tens$data))[1]
  ggplot2::ggplot(tens$data, ggplot2::aes(x = x, y = y)) + 
    ggplot2::geom_point() + 
    ggplot2::geom_smooth()
}

coef.tens = function(tens) {
  coef(lm(formula(tens$model), data = tens$data))
}

So now we just need to do:

set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))

mod <- lm_tens(y ~ x)
coef(mod)
#> (Intercept)           x 
#>   4.3269914   0.5775404
plot(mod)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note that we only need to call add_ten once here.