Suppose we have the following data:
# simulate data to fit
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))
Let's also suppose the user has fit a model:
# user fits a lm
mod = lm(y~x)
Now suppose I have an R package designed to perform several operations on the object mod. Just for simplicity, suppose we have two functions: one that plots the data, and one that computes the coefficients. However, as an intermediate step, suppose we want to perform some operation on the data (in this example, add ten).
Example:
# function that adds ten to all scores
add_ten = function(model) {
data = model$model
data = data + 10
return(data)
}
# functions I defined that do something to the "add_ten" dataset
plot_ten = function(model) {
new_data = data.frame(add_ten(model))
x = all.vars(formula(model))[2]
y = all.vars(formula(model))[1]
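# namespace every ggplot2 call so this works without library(ggplot2)
ggplot2::ggplot(new_data, ggplot2::aes_string(x = x, y = y)) + ggplot2::geom_point() + ggplot2::geom_smooth()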
}
coefs_ten = function(model) {
new_data = data.frame(add_ten(model))
coef(lm(formula(model), new_data))
}
(Obviously, this is pretty silly to do. In actuality, the operation I want to perform is multiple imputation, which is computationally intensive).
Notice in the above example I have to call the add_ten function twice, once for plot_ten and once for coefs_ten. This is inefficient.
So, now to my question, what is the best way to create a reusable object within a function?
I could, of course, create an object to be placed in the user's global environment:
add_ten = function(model) {
# check for add_ten_data in the global environment
if (exists("add_ten_data", where = .GlobalEnv)) return(get("add_ten_data", envir = .GlobalEnv))
data = model$model
data = data + 10
# assign add_ten_data to the global environment
assign('add_ten_data', data, envir = .GlobalEnv)
return(data)
}
I'm happy to do so, but worry about the "netiquette" of putting something in the user's environment. There's also a potential problem if users happen to have an object called "add_ten_data" in their environment.
So, what is the best way of accomplishing this?
Thanks in advance!
You should certainly avoid writing an object to the global environment. If you find that you have to repeat the same computationally expensive task at the top of a number of different functions, it means you are carrying out the computationally expensive task too late.
For example, you could create an S3 class that holds the necessary components to produce a "cheap" plot and a "cheap" extraction of the coefficients. It even has the benefits of generic dispatch:
add_ten <- function(model) model$model + 10
lm_tens <- function(formula, data)
{
model <- if(missing(data)) lm(formula) else lm(formula, data = data)
structure(list(data = data.frame(add_ten(model)), model = model),
class = "tens")
}
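plot.tens <- function(x, ...) {
# take the variable names from the stored model's formula
vars <- all.vars(formula(x$model))
# .data[[...]] maps the columns named by these strings, whatever they are called
ggplot2::ggplot(x$data, ggplot2::aes(x = .data[[vars[2]]], y = .data[[vars[1]]])) +
ggplot2::geom_point() +
ggplot2::geom_smooth() +
ggplot2::labs(x = vars[2], y = vars[1])
}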
coef.tens = function(tens) {
coef(lm(formula(tens$model), data = tens$data))
}
So now we just need to do:
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))
mod <- lm_tens(y ~ x)
coef(mod)
#> (Intercept) x
#> 4.3269914 0.5775404
plot(mod)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Note that we only need to call add_ten once here.
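One way to check that claim is to count the calls. The following is a minimal sketch, not part of the answer's code: the counter environment and the summary_ten function are hypothetical names added purely for illustration, and the plot method is left out so that only base R is needed.

```r
# Hypothetical call counter, for illustration only
counter <- new.env()
counter$n <- 0

# Same add_ten as in the answer, instrumented to count its calls
add_ten <- function(model) {
  counter$n <- counter$n + 1
  model$model + 10
}

# Constructor from the answer: the expensive step runs here, once
lm_tens <- function(formula, data) {
  model <- if (missing(data)) lm(formula) else lm(formula, data = data)
  structure(list(data = data.frame(add_ten(model)), model = model),
            class = "tens")
}

# Two cheap consumers that reuse the stored data
coef.tens <- function(object, ...) {
  coef(lm(formula(object$model), data = object$data))
}
summary_ten <- function(object) summary(object$data)  # hypothetical second consumer

set.seed(21)
y <- rnorm(100)
x <- .5 * y + rnorm(100, 0, sqrt(.75))

mod <- lm_tens(y ~ x)
coefs <- coef(mod)
summ <- summary_ten(mod)

counter$n
#> [1] 1
```

Both consumers read tens$data rather than recomputing it, so the counter stays at one however many of them are called.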
I've seen functions do something like this (e.g., the mice package and the norm package) and always found the two-stage process a little frustrating. But I think a good alternative is similar to what you propose: do not require the lm_tens function, but use its result if the user has called it (otherwise, repeat add_ten).
@dfife it depends what possible uses you have for the object. I came across an example today from the eulerr package. The euler class contains a few different lightweight fields that made it convenient to bundle them up into a class, but most of the work is done later; the plot.euler function is expensive, so the structures needed to draw the plot are only generated when plot is called. On the other hand, most regression functions do the computationally expensive part at the outset and you can pass the model around knowing that it's going to be cheap to do any work on it later.
Let me give a bit more detail (just in case I'm overlooking some obvious solution). My package accepts models fitted with the lavaan package. (lavaan is NOT my package.) Some of my functions require computing standard errors, which are estimated through multiple imputation, which is computationally intensive. I can't attach these standard errors to the already-estimated lavaan model, so I have been computing them for each function. But the same standard error computations might happen multiple times for different functions. Hence the question about placing them in the global environment :)
@dfife so why not create a class that wraps the lavaan object and holds it as a member, but also holds the standard errors? Say your class is called "dfife_class" and contains a member "model", which is a lavaan object, and a member "SE", which has the computed standard errors. At the head of each function, check whether it has been passed a lavaan object or a "dfife_class" object. If it's a lavaan object, calculate the SE and turn the lavaan object into a "dfife_class" object. Then write your function to handle "dfife_class" objects.
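The wrapper idea in the comment above might look roughly like the sketch below. To keep it runnable without lavaan, it assumes lm as a stand-in for the lavaan fit and add_ten as a stand-in for the expensive standard-error computation; the helper name as_tens is hypothetical and reuses the "tens" class from the answer rather than "dfife_class".

```r
# Stand-in for the expensive step (multiple imputation in the real package)
add_ten <- function(model) model$model + 10

# Hypothetical helper: wrap a fitted model once, or pass an existing wrapper through
as_tens <- function(object) {
  if (inherits(object, "tens")) return(object)  # already computed, reuse it
  stopifnot(inherits(object, "lm"))
  structure(list(model = object, data = data.frame(add_ten(object))),
            class = "tens")
}

# Each user-facing function normalises its input first
coefs_ten <- function(object) {
  tens <- as_tens(object)
  coef(lm(formula(tens$model), data = tens$data))
}

set.seed(21)
y <- rnorm(100)
x <- .5 * y + rnorm(100, 0, sqrt(.75))
mod <- lm(y ~ x)

from_lm <- coefs_ten(mod)           # plain lm: add_ten() runs inside this call
wrapped <- as_tens(mod)             # or pay the cost once up front...
from_wrapped <- coefs_ten(wrapped)  # ...and reuse the stored result here
```

Users who pass the plain model still get a correct answer, just with the expensive step repeated per call; users who wrap first pay for it once.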
I see what you're saying. Yes, that's a good idea and will save an extra step (assuming users fit with lm_tens instead of lm). Thanks!