Lab 10: Multiple Regression

Learning Objectives

Today, we will explore methods for building and selecting models containing multiple predictor variables using two different data sets:

Mole-rate energy expenditure data:

We will fit models containing 1 continuous and 1 categorical covariate.
We will consider an interaction between the continuous and categorical variable.

Longnose dace abundance data

We will fit a model with multiple predictor variables and use stepAIC for model selection
We will explore methods for averaging across models

Lab Exercises

This first example comes from W&S (example 18.4).

Mole rats are the only known mammals with distinct social casts. A single queen and a small number of males are the only reproducing individuals in a colony. Remaining individuals, called workers, gather food, defend the colony, care for the young, and maintain the burrows. Recently, it was discovered that there might be two worker castes in the Damaraland mole rat (Crptomys damarensis). “Frequent workers” do almost all of the work in the colony, whereas infrequent workers” (lazy) do little work except on rare occasions after rains, when they extend the burrow system. To assess the physiological differences between the two types of workers, Scantlebury et al. (2006) compared daily energy expenditures of wild mole rats during a dry season. Energy expenditure appears to vary with body mass in both groups, but infrequent workers are heavier than frequent workers. How different is mean daily energy expenditure between the two groups when adjusted for differences in body mass?

Scatterplot of log energy expenditure versus log mass, with separate regression lines overlaid.

Scatterplot of log energy expenditure versus log mass, with regression lines with the same slope overlaid.

We will fit and compare the two models represented by these two figures.

Fit a model containing ln.mass and caste using: lm(ln.energy ~ ln.mass + caste, data=MoleRats). Traditionally, this model was referred to as an ANCOVA model for “Analysis of Covariance”, and is represented by the figure on the right. Examine the fitted model using the summary function. Write out the equation for the model and interpret the 3 parameters in the model. Is ln.mass significant? Also, do the castes appear to expend different amounts of energy?
The model you just fit assumes the effect of ln.mass is the same for both castes (lazy and workers). Alternatively, we could fit a model where the effect of ln.mass is different for the two groups (figure on the left). We can do this by including another predictor: ln.mass:caste, representing an interaction between ln.mass and caste. In this model, the effect of ln.mass depends on the level of caste (i.e., whether the individual is lazy or a worker). Fit a model with this additional variable (you can just include ln.mass:caste inside of lm). Write out the equation for the model and interpret the 4 parameters in the model. What is the slope of the line relating changes in ln.mass and changes in ln.energy for lazy individuals? What is the slope for workers? Note: this model could also be fit using: lm(ln.energy ~ ln.mass*caste, data=MoleRats)
Compare the two fitted models using AIC (AIC(model1name, model2name)). Also, comment on the significance of the interaction term as judged by the t-test (using the summary function on the model containing the interaction term). Which model is better? Justify your answer.
Creating the first plot (associated with the model that includes an interaction) is easy using the ggformula package. To create the second plot, we have to get the model predicted values and add them to our MoleRat data set. Then, we can use gf_point and gf_line to create the plot. I’ve added code to produce both plots in your template.
Check whether the assumptions of linear regression are reasonable for the two models. I have provided code for the first model to help you out here. You will need to adapt this code to also create plots for the second model.

This example uses data from the Maryland Biological Stream Survey, downloaded from: http://udel.edu/~mcdonald/statmultreg.html.

stream: name of each sampled stream (i.e., the name associated with each case).
longnosedace: number of longnose dace (Rhinichthys cataractae) per 75-meter section of stream.
acreage: area (in acres) drained by the stream.
do2: the dissolved oxygen (in mg/liter).
maxdepth: the maximum depth (in cm) of the 75-meter segment of stream.
no3: nitrate concentration (mg/liter).
so4: sulfate concentration (mg/liter).
temp: the water temperature on the sampling date (in degrees C).

Suppose we are interested in predicting the abundance of longnose dace using chemical and biological characteristics of a stream.

Lets begin by fitting a full model = a model containing all of the predictor variables in the data set (other than stream): fullmod<-lm(longnosedace~acreage+do2+maxdepth+no3+so4+temp, data=dace). Look at the fitted model using summary(fullmod). Which variables are statistically significant?
Are the assumptions of linear regression met for the full model?
Try to find the “best” model using stepAIC in the MASS library. This function will perform what is called backwards stepwise selection using AIC (remember, smaller values of AIC indicate “better” fit). At each “step”, the current “best” model is compared (using AIC) to all models that have 1 fewer variable included [for the first step, the full model is considered “best”]. If all of the simpler models have larger values of AIC than the current “best” model, then the algorithm stops. If not, then we update the current “best” model (i.e., pick the one with the smallest AIC). We then repeat the process (considering models with an additional variable deleted). We continue until we can no longer improve model fit (as judged by AIC) or we we run out of explanatory variables to drop.
What model was selected as “best” by stepAIC? Fit this model and inspect the coefficients using the summary function.
Are the assumptions of linear regression met for the “best” model?
Compare the two different measures of \(R^2\) for both the full and “best” model.

If we were to go out and collect another data set with all the same variables and use stepAIC to choose the best model, we might end up with a very different model. One way to think about the problem is that we have uncertainty regarding the most appropriate model for inference and also uncertainty associated with the coefficients in our fitted models.

The t-tests and confidence intervals that we have seen in class assume that our model was pre-specified - i.e., chosen before collecting and analyzing the data. When we fit several possible models, select the best according to some criteria (e.g., AIC), and then act as though our model was pre-specified, we are ignoring the uncertainty associated with choosing the best model. This leads to what Breiman (1992) labeled the ``quiet scandal’’ in the statistical community: data-driven model selection leading to overly optimistic results, namely, confidence intervals that are too narrow, p-values that are too small, and models unlikely to fare well when applied to new data.

Model averaging can, in some cases, ameliorate these problems. In addition, averaging among many “good” models can often result in smaller prediction errors than using a single, “best” model. This method has gained a lot of popularity in the wildlife literature following the publication of Burnham and Anderson’s book, “Model selection and multimodel inference: a practical information-theoretic approach” (Burnham and Anderson 2002). There are at least 2 R packages that make model averaging “easy” to do (though, I would caution that it is not always appropriate - as we will see when we discuss “causal networks”). We will explore functions in the MuMIn package (similar functionality is provided by the AICcmodavg package). For this section, run the code in the template, noting:

The dredge function fits models with all possible combinations of predictor variables (a total of 64 models in this case).
The plot summarizes the models, first by ranking them by AIC, and then by indicating which variables are in each model (represented by the different rows). The “best” model is one with acreage, maxdepth, and no3 (top row). The second “best” model adds temp, etc.
model.avg(ms1, subset = delta < 4) averages coefficients across all models that are within 4 AIC units from the best model. Models are given weights that are proportional to \(\exp(-AIC)\) (models with smaller AIC’s are given larger weights).
confset.95p <- get.models(ms1, cumsum(weight) <= .95) finds the set of models such that the combined weight of the models is \(\le\) 0.95.
avgmod.95p <- model.avg(confset.95p) averages the coefficients across models in the “confidence set”, and summary(avgmod.95p) provides a variety of summaries for the model-averaged quantities. In the output, I would argue that the coefficient table “(with shrinkage)” is most appropriate. Also, I wouldn’t put too much emphasis on the section “Relative variable importance.” Its best to use the coefficients and their standard errors to judge importance.

Rough model-averaged confidence intervals can be obtained by taking \(\hat{\beta} \pm\) \(z^*\)SE (although the development and testing of methods for quantifying uncertainy associated with model-averaged quantities is still a very active area of statstical research).

Related to the problem of model-selection uncertainty is the problem of overfitting. From Fieberg and Johnson (2015):

Important to this discussion is that too much analysis can actually be a bad thing— ``If you torture your data long enough, it will confess to anything’’ (a quote usually attributed to economist Ronald Coase). When models are too finely tuned to specific aspects of the data in hand, we say the model is overfit, in the sense that the model is accommodating, not only the actual effects of the explanatory variables, but also the noise inherent in any real data set (Mundry and Nunn 2009). Especially prone to overfitting are 1) overly complex models, those with too many explanatory variables relative to the available sample size; and 2) models that allow for very flexible relationships between explanatory and response variables (Harrell 2001). An extreme example involves fitting a polynomial of degree \(n-1\) to a set of \(n\) data points—the model will fit the data exactly, but predictions of future data likely will be poor. One guideline for avoiding overfitting is to limit model degrees of freedom to 5% to 10% of the effective sample size (Harrell 2001, Burnham and Anderson 2002, Babyak 2004, Giudice et al. 2012).

Said another way, we might choose to limit the number of predictor variables in the set of models that we consider to between \(< n/10\) and \(<n/20\).

In the end, it may be a useful modeling strategy to avoid model selection altogether - we may choose to fit a full model and then stop (provided the assumptions of the model are met). Statistical significance and biological signifance of each predictor can then be judged using the estimated coefficient and its SE.

Literature Cited

Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York

Fieberg, J., & Johnson, D. H. (2015). MMI: Multimodel inference or models with management implications?. The Journal of Wildlife Management, 79(5), 708-718.

Scantlebury, M. J. R. Speakman, M.K. Oosthuizen, T. J. Roper, and N.C. Bennet. (2006). Energetics reveals physiologically distinct castes in a eusocial mammal. Nature 440:795-797.