library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
In scientific research and business practice, there are usually multiple covariates \(x_1,x_2,...,x_p\), \(p>1\), affecting the response \(y\). Obviously, simple linear regression is not sufficient if we want to model the relationship between \(y\) and \(x_1,x_2,...,x_p\). A natural generalisation of the simple linear model is \[ \text{mean}[y]=a+b_1x_1+b_2x_2+...+b_px_p. \] Such a model is called multiple linear regression or a multivariable linear model. In essence, the additional predictors are expected to explain the variation in the response that is not explained by a simple linear regression fit.
Adding more covariates means that we have to estimate more coefficients, namely \(b_1\), \(b_2\),…,\(b_p\). You may feel that multiple linear regression looks much more complicated than simple linear regression. The amazing fact is that we can still handle such a model with lm(). More importantly, most techniques we have learnt for simple linear regression remain valid for the multiple-variable case.
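Before working with real data, here is a minimal sketch with a small simulated toy data set (the names toy, x1, x2 and y are purely illustrative and are not used elsewhere in this workshop): covariates are simply joined by + in the lm() formula.
set.seed(1)
toy <- tibble(x1=rnorm(50), x2=rnorm(50)) |> mutate(y = 1 + 2*x1 - 0.5*x2 + rnorm(50))
lm(y ~ x1 + x2, data=toy)   # estimates of the intercept a and the slopes b1, b2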
In this exercise we’ll look at modelling sales using the advertising budgets on all three platforms. Recall that sales is correlated with youtube, facebook and newspaper.
Load the package datarium and the data marketing. We then convert the data set to a tidy tibble.
library(datarium)
data(marketing)
marketing <- marketing |> as_tibble()
marketing
## # A tibble: 200 × 4
## youtube facebook newspaper sales
## <dbl> <dbl> <dbl> <dbl>
## 1 276. 45.4 83.0 26.5
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2
## 4 182. 49.6 70.2 22.2
## 5 217. 13.0 70.1 15.5
## 6 10.4 58.7 90 8.64
## 7 69 39.4 28.2 14.2
## 8 144. 23.5 13.9 15.8
## 9 10.3 2.52 1.2 5.76
## 10 240. 3.12 25.4 12.7
## # … with 190 more rows
A quick scan of the data set above reveals a critical fact: we have overlooked the multivariable nature of the data set. Observations in sales are obtained with different combinations of the advertising budgets on the three platforms. Therefore, all three variables youtube, facebook and newspaper may contribute to the corresponding sales, and there will be an overlap between the audiences from the different platforms. How can we quantify the contributions of these three variables? In addition, if we built three separate simple linear models, one for each covariate, we’d end up with three different predictions for sales. Which one should we choose?
To address the above issues, we have to model the relationship between sales and multiple covariates in one step. In this exercise we’ll extend our simple linear model to take into account some of the other variables in the marketing data.
We’ll start by adding facebook to our first simple linear model (lm.youtube) and producing the model summary of this extended model as follows.
lm.youbook <- lm(sales ~ youtube + facebook, data=marketing)
summary(lm.youbook)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5572 -1.0502 0.2906 1.4049 3.3994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.50532 0.35339 9.919 <2e-16 ***
## youtube 0.04575 0.00139 32.909 <2e-16 ***
## facebook 0.18799 0.00804 23.382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 197 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
## F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
Compare it to the model summary of lm.youtube below; you’ll notice a few things: the Estimate for youtube has changed, and there is now an additional row for facebook.
lm.youtube <- lm(sales ~ youtube, data=marketing)
summary(lm.youtube)
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0632 -2.3454 -0.2295 2.4805 8.6548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.439112 0.549412 15.36 <2e-16 ***
## youtube 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
Why do you think the Estimate (and thus Std. Error, t value, etc.) for youtube changed? What is the P-value for facebook testing here? Add some notes to your notebook about this.
Answer: The Estimate (and thus Std. Error, t value, etc.) for youtube changes because new information from facebook has been added to the linear model; a certain portion of the variability in sales is now explained by facebook. The P-value for facebook here tests whether the coefficient of facebook is equal to zero in this bivariable linear model. In effect, it tests whether the covariate facebook makes a sufficient contribution to explaining the variation in sales, given that the covariate youtube has already explained a significant portion of the variability in sales.
What is your conclusion about the relationship between
sales, youtube and
facebook?
Answer: Both youtube and facebook contribute to explaining the variability in sales through a bivariable linear function. The regression equation can be written as
sales = 3.50532 + 0.04575 × youtube + 0.18799 × facebook.
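The numbers in this equation are simply the fitted coefficients, which can be extracted directly from the model object:
coef(lm.youbook)   # the Estimate column of summary(lm.youbook)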
Now add in newspaper to the linear model with both
youtube and facebook, and check the model
summary as follows
lm.youboper <- lm(sales ~ youtube + facebook + newspaper, data=marketing)
summary(lm.youboper)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5932 -1.0690 0.2902 1.4272 3.3951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667 0.374290 9.422 <2e-16 ***
## youtube 0.045765 0.001395 32.809 <2e-16 ***
## facebook 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
What is your conclusion about the relationship between
sales and newspaper?
Answer: After accounting for the contributions of youtube and facebook to modelling sales, newspaper does not offer any further improvement, as the P-value of newspaper is not significant at the 0.05 level.
Fit a simple linear model relating sales to only newspaper. What do you find in the R summary about the significance of newspaper? Why?
Answer: newspaper becomes significant in the simple linear model. Without knowing anything about youtube and facebook, we can still use newspaper to explain some of the variation in sales.
lm.newspaper <- lm(sales ~ newspaper, data=marketing)
summary(lm.newspaper)
##
## Call:
## lm(formula = sales ~ newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.473 -4.065 -1.007 4.207 15.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.82169 0.74570 19.88 < 2e-16 ***
## newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.111 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
The Multiple R-squareds of the three models,
lm.youtube, lm.youbook, and
lm.youboper are detailed as follows.
summary(lm.youtube)$r.squared
## [1] 0.6118751
summary(lm.youbook)$r.squared
## [1] 0.8971943
summary(lm.youboper)$r.squared
## [1] 0.8972106
Compare the Multiple R-squareds of the above three
models. What do you find?
Answer: Multiple R-squared gets larger as we add more covariates to the linear model, which reflects a better goodness of fit.
Let’s take a look at visualising your model with all three
variables using visreg. You’ll notice that it will produce
3 plots, one for each of the variables.
library(visreg)
visreg(lm.youboper,gg=TRUE)
(The three partial residual plots, one each for youtube, facebook and newspaper, are displayed here.)
These are called partial residual plots. What they
do is plot the relationship between the response and a single variable
after holding all other variables constant (usually at their median).
This allows you to see the effect of one variable after accounting for
the others. Notice the relationship with newspaper isn’t
very strong. You can choose a particular variable by using the xvar argument, e.g. visreg(lm.youboper, xvar="newspaper").
Compare these plots with the visualisation of the simple linear
models lm.youtube, lm.facebook,
lm.newspaper in Workshop C1. What do you find?
Answer: The slope of each plot has changed, especially for newspaper. This is not surprising, since the presence of new covariates updates the relationship between sales and the remaining covariates.
The diagnostics of a multivariable linear model follow the same principles as those of a simple linear model. Let’s take a look at
the model diagnostics for the model with all three covariates. Produce
4-in-1 plots using plot(). Add some notes to your R
notebook as to whether the model assumptions are satisfied.
Answer: We use par(mfrow=c(2,2)) to show all four plots together. It only works for base R plot(), so don’t use it for anything produced with ggplot(). The residuals vs fitted plot shows a bowl/bathtub trend, while the scale-location plot looks spindle shaped. Judging from the smoothed curves in them, both linearity and equal variance are highly questionable. The Q-Q plot also does not look very good, with some obvious outliers. The residuals vs leverage plot further confirms the existence of outliers.
par(mfrow=c(2,2))
plot(lm.youboper)
We can take a log of sales and refit a multivariable
linear model as follows
lm.youboper.log <- lm(log(sales) ~ youtube + facebook + newspaper, data=marketing)
Produce 4-in-1 diagnostic plots using plot(). Add
some notes to your R notebook as to whether the model assumptions are
satisfied.
Answer: All the plots suggest that there are a few outliers distorting the diagnostic plots. If we ignore those outliers, the residuals vs fitted and scale-location plots support the linearity and equal variance of the transformed model. You can try this yourself by removing the outliers from the data, re-fitting the model and producing the diagnostic plots (see the sketch after the plots below).
par(mfrow=c(2,2))
plot(lm.youboper.log)
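As a rough sketch of the outlier-removal idea mentioned above (the cutoff of 2 on the standardised residuals and the name marketing.clean are arbitrary illustrative choices, not a rule):
keep <- abs(rstandard(lm.youboper.log)) <= 2   # keep observations without large standardised residuals
marketing.clean <- marketing[keep, ]
lm.youboper.log.clean <- lm(log(sales) ~ youtube + facebook + newspaper, data=marketing.clean)
par(mfrow=c(2,2))
plot(lm.youboper.log.clean)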
We can still use predict() for predicting with a multivariable linear model. The only difference is that we need to specify each covariate in the data frame for newdata as
newbudget <- data.frame(youtube=0,facebook=0)
predict(lm.youbook,newdata=newbudget, interval='confidence')
## fit lwr upr
## 1 3.50532 2.808412 4.202228
One can modify interval and level to get
prediction or confidence intervals at different confidence levels.
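For example, a 90% prediction interval at the same zero budgets can be obtained as follows (a sketch).
predict(lm.youbook, newdata=newbudget, interval='prediction', level=0.90)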
Compare the above confidence interval with the confidence
intervals of two simple linear models lm.youtube and
lm.facebook at zero budgets. Discuss your
findings.
Answer: Both the confidence and prediction intervals become narrower in the multivariable linear model. Adding more covariates to our linear model explains more of the variability in sales, which yields better uncertainty quantification in prediction.
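A sketch of this comparison (lm.facebook is assumed to be the simple linear model from Workshop C1 and is refitted here):
lm.facebook <- lm(sales ~ facebook, data=marketing)   # assumed from Workshop C1
predict(lm.youtube, newdata=data.frame(youtube=0), interval='confidence')
predict(lm.facebook, newdata=data.frame(facebook=0), interval='confidence')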
In Ex1, we have seen that, by including more covariates in lm(), the goodness of fit of our linear models, i.e. \(R^2\), can be improved. Even though the new covariate is insignificant in the original model, the \(R^2\) of lm.youboper is slightly higher than that of lm.youbook. It is not hard to conclude that lm.youbook is much better than lm.youtube. But how can we choose between lm.youbook and lm.youboper? These two models seem to be in a dead heat with each other.
A tricky fact is that \(R^2\) always improves, no matter what covariate is added to a linear model. This can be demonstrated by the following simulation study.
Let’s add in an additional variable to marketing as
follows
set.seed(2020)
marketing.sim <- marketing |> mutate(noise=rnorm(200))
This additional variable (noise) is simulated from a standard normal distribution. Of course, noise does not contribute any information about sales. But let us add it to lm() and produce a model summary as follows
lm.youboper.sim <- lm(sales ~ youtube + facebook + newspaper + noise, data=marketing.sim)
summary(lm.youboper.sim)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper + noise,
## data = marketing.sim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6014 -0.9853 0.2846 1.4330 3.2370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.550460 0.375408 9.458 <2e-16 ***
## youtube 0.045699 0.001397 32.701 <2e-16 ***
## facebook 0.188528 0.008615 21.883 <2e-16 ***
## newspaper -0.001381 0.005886 -0.235 0.815
## noise -0.114642 0.127544 -0.899 0.370
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.024 on 195 degrees of freedom
## Multiple R-squared: 0.8976, Adjusted R-squared: 0.8955
## F-statistic: 427.5 on 4 and 195 DF, p-value: < 2.2e-16
From the R summary, it is not surprising to find that noise is insignificant.
Compare the Multiple R-squared of
lm.youboper.sim with those of lm.youtube,
lm.youbook, and lm.youboper. What do you
find?
Answer: The Multiple R-squared of lm.youboper.sim is larger than those of all the above models, which suggests a (slightly) better goodness of fit.
The insignificance of a covariate does not mean that it is certainly unrelated to the response. In our data set marketing.sim, noise is a genuinely redundant variable containing no information, but newspaper, though also insignificant in the model, is still correlated with sales. The correct interpretation is that, after extracting the information about sales from youtube and facebook, newspaper becomes insignificant in explaining the remaining variation in sales.
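This can be checked directly: the marginal correlation below should equal the square root of the Multiple R-squared of lm.newspaper reported earlier (roughly 0.23).
cor(marketing$newspaper, marketing$sales)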
The guaranteed improvement in \(R^2\) from adding more covariates can be dangerous, as it may lead to over-complicated models. In statistical modelling, a very important practical guideline is Occam’s razor, or the law of parsimony, which is the problem-solving principle that “entities should not be multiplied without necessity”. If two models provide similar fits to the real data set, we tend to keep the more parsimonious one, i.e. the model with fewer covariates.
For practitioners, a simple but effective idea is to remove the insignificant covariates from our linear model. In addition, we have the Adjusted R-squared in the R summary to help us find the most concise model with a sufficient goodness of fit. Adjusted R-squared is modified from Multiple R-squared by taking the complexity of the linear model (the number of covariates) into consideration. These numerical indicators can be extracted directly as follows.
summary(lm.youtube)$adj.r.squared
## [1] 0.6099148
summary(lm.youbook)$adj.r.squared
## [1] 0.8961505
summary(lm.youboper)$adj.r.squared
## [1] 0.8956373
summary(lm.youboper.sim)$adj.r.squared
## [1] 0.895535
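Adjusted \(R^2\) penalises \(R^2\) for the number of covariates \(p\) via \[ \bar{R}^2=1-(1-R^2)\frac{n-1}{n-p-1}. \] A quick sanity check for lm.youbook, where \(n=200\) and \(p=2\):
r2 <- summary(lm.youbook)$r.squared
1 - (1 - r2)*(200 - 1)/(200 - 2 - 1)   # should match summary(lm.youbook)$adj.r.squared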
Find the best model which balances the complexity and the
goodness of fit by using Adjusted R-squared.
Answer: The best model is lm.youbook. Not surprisingly, noise does not contribute any information, and the information in newspaper relevant to sales has already been captured by youtube and facebook.
Another tool for examining the necessity of including one or more covariates in our model is the ANalysis Of VAriance. Note that lm.youtube is a model reduced from lm.youbook by setting the coefficient of facebook to zero. Therefore, just like comparing the linear trend model and the quadratic trend model in Workshop C4, anova() can test whether the improvement in fit (the reduction in the residual sum of squares) from adding one covariate is significant, as follows
anova(lm.youtube,lm.youbook)
## Analysis of Variance Table
##
## Model 1: sales ~ youtube
## Model 2: sales ~ youtube + facebook
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 3027.64
## 2 197 801.96 1 2225.7 546.74 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above ANOVA table suggests we shall keep the extended model
lm.youbook.
Similarly, lm.youbook is a model reduced from lm.youboper.sim by setting the coefficients of newspaper and noise to zero. The advantage of anova() is that it can check the pros and cons of two or more covariates as a group simultaneously, as follows,
anova(lm.youbook,lm.youboper.sim)
## Analysis of Variance Table
##
## Model 1: sales ~ youtube + facebook
## Model 2: sales ~ youtube + facebook + newspaper + noise
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 197 801.96
## 2 195 798.52 2 3.4361 0.4196 0.6579
Df in the second row of the ANOVA table is the number of
additional coefficients being tested. It is not hard to decide that we
shall keep the reduced model lm.youbook.
lm.youbook is involved in both ANOVA tables. But its
roles in the two tables are different. It is the extended model in the
first table but becomes the reduced one in the second. Just be careful
when using anova() and make sure that you are putting a
model at its correct position.
Re-run anova() by swapping the reduced model with
the extended model. Discuss your findings.
Answer: Negative Df and Sum of Sq; the same F and P-value. The conclusion is the same, i.e. keep the simpler model.
anova(lm.youboper.sim,lm.youbook)
## Analysis of Variance Table
##
## Model 1: sales ~ youtube + facebook + newspaper + noise
## Model 2: sales ~ youtube + facebook
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 195 798.52
## 2 197 801.96 -2 -3.4361 0.4196 0.6579
We can directly call anova() on a fitted linear model, without specifying a pair of reduced and extended models, as
anova(lm.youboper.sim)
## Analysis of Variance Table
##
## Response: sales
## Df Sum Sq Mean Sq F value Pr(>F)
## youtube 1 4773.1 4773.1 1165.5873 <2e-16 ***
## facebook 1 2225.7 2225.7 543.5169 <2e-16 ***
## newspaper 1 0.1 0.1 0.0312 0.8600
## noise 1 3.3 3.3 0.8079 0.3698
## Residuals 195 798.5 4.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above ANOVA table tests the necessity of including each covariate, one by one, after the previous covariate(s) have been added to the linear model sequentially.
The first row, corresponding to youtube, tells us that adding youtube makes more sense than including no covariate at all in modelling sales (such a naive model can be fitted by lm(sales~1, data=marketing)). The second row, corresponding to facebook, tells us that adding facebook still makes sense even after youtube has been added to the model. The third (fourth) row suggests that, after accounting for the previous two (three) covariates, the covariate newspaper (noise) contributes little information to modelling sales.
We can permute the order of covariates in lm() and
generate the corresponding ANOVA table as follows
lm.youboper.sim.2 <- lm(sales ~ youtube + noise + newspaper + facebook,data=marketing.sim)
anova(lm.youboper.sim.2)
## Analysis of Variance Table
##
## Response: sales
## Df Sum Sq Mean Sq F value Pr(>F)
## youtube 1 4773.1 4773.1 1165.5873 < 2.2e-16 ***
## noise 1 8.7 8.7 2.1346 0.1456
## newspaper 1 259.5 259.5 63.3759 1.377e-13 ***
## facebook 1 1960.9 1960.9 478.8456 < 2.2e-16 ***
## Residuals 195 798.5 4.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpret each row of this ANOVA table and compare it with the previous ANOVA table. Discuss your findings.
Answer: newspaper becomes significant. This means that, after accounting for youtube and noise, newspaper still contains useful information for predicting sales. But facebook has not yet been considered at that point. The summary table can be further used to check the significance of the coefficients.
The ANOVA table relies heavily on the \(F\)-test, as we mentioned in Workshop C4. From the R summaries of all four models above, we can access a row called F-statistic. The F-statistics of all four models are detailed as follows.
summary(lm.youtube)$fstatistic
## value numdf dendf
## 312.145 1.000 198.000
summary(lm.youbook)$fstatistic
## value numdf dendf
## 859.6177 2.0000 197.0000
summary(lm.youboper)$fstatistic
## value numdf dendf
## 570.2707 3.0000 196.0000
summary(lm.youboper.sim)$fstatistic
## value numdf dendf
## 427.4858 4.0000 195.0000
Check the value and numdf of
F-statistic. Discuss your findings.
Answer: The best model has the highest value of F-statistic. numdf is exactly the number of coefficients (excluding the intercept).
The F-statistic reported in the R summary also comes from an ANOVA table, which tests a null model with all coefficients being zero (the reduced model) against the fitted linear model (the extended model).
The \(F\)-test is called an omnibus test as it tests whether all coefficients in a linear model are equal to zero as a group. In mathematical terms, the null hypothesis can be written as \[ H_0: b_1=b_2=...=b_p=0. \] If any \(b_i\) is significantly non-zero, the null hypothesis will be rejected, leading to the conclusion that there exists at least one covariate in our data set explaining the variation in the response \(y\).
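As a quick check, comparing the intercept-only (null) model against lm.youbook with anova() should reproduce the F-statistic of 859.6 on 2 and 197 DF reported by summary(lm.youbook). (lm.null is just an illustrative name.)
lm.null <- lm(sales ~ 1, data=marketing)   # the naive model with no covariates
anova(lm.null, lm.youbook)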
If the F-statistic in your R summary is insignificant, you need to double check your data set to make sure that the data set itself makes sense.
The solution is not provided for the optional exercise.
In a multivariate data set, we usually have a response \(y\) and multiple covariates \(x_1\), \(x_2\),…, \(x_p\). A multivariable linear model aims to model the relationship between \(y\) and the \(x\)’s. We are expecting that the variations in \(y\) can be well explained by including a suitable number of covariates as discussed in the previous exercise.
A side effect of having multiple covariates is that there are correlations between the covariates themselves. Even if those correlations are weak, specific combinations of correlations between several covariates can lead to ill-posed results in our linear model.
In this exercise, we will study this critical issue arising from many real data sets, i.e. collinearity or multicollinearity.
Let’s simulate a dataset as follows.
set.seed(2021)
n <- 20
demo <- tibble(x1=1:n,x2=sample(1:n),e=rnorm(n)) |> mutate(y=0.5*x1+0.5*x2+e)
demo
## # A tibble: 20 × 4
## x1 x2 e y
## <int> <int> <dbl> <dbl>
## 1 1 7 0.182 4.18
## 2 2 6 1.51 5.51
## 3 3 14 1.60 10.1
## 4 4 20 -1.84 10.2
## 5 5 12 1.62 10.1
## 6 6 4 0.131 5.13
## 7 7 19 1.48 14.5
## 8 8 18 1.51 14.5
## 9 9 11 -0.942 9.06
## 10 10 13 -0.186 11.3
## 11 11 5 -1.10 6.90
## 12 12 17 1.21 15.7
## 13 13 1 -1.62 5.38
## 14 14 9 0.105 11.6
## 15 15 15 -1.46 13.5
## 16 16 3 -0.354 9.15
## 17 17 16 -0.0937 16.4
## 18 18 2 1.10 11.1
## 19 19 8 -1.96 11.5
## 20 20 10 -1.45 13.6
It is easy to fit a linear model based on the simulated data set
demo as
lm.demo <- lm(y ~ x1 + x2, data=demo)
summary(lm.demo)
##
## Call:
## lm(formula = y ~ x1 + x2, data = demo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60938 -0.95633 0.09897 0.89777 1.99849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.75665 0.82349 0.919 0.371
## x1 0.40593 0.04757 8.533 1.50e-07 ***
## x2 0.51938 0.04757 10.917 4.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.198 on 17 degrees of freedom
## Multiple R-squared: 0.9036, Adjusted R-squared: 0.8922
## F-statistic: 79.65 on 2 and 17 DF, p-value: 2.321e-09
Now let’s add another predictor, x3, which is the sum of the other two predictors, to the tibble
demo.e.collin <- demo |> mutate(x3=x1+x2)
demo.e.collin
## # A tibble: 20 × 5
## x1 x2 e y x3
## <int> <int> <dbl> <dbl> <int>
## 1 1 7 0.182 4.18 8
## 2 2 6 1.51 5.51 8
## 3 3 14 1.60 10.1 17
## 4 4 20 -1.84 10.2 24
## 5 5 12 1.62 10.1 17
## 6 6 4 0.131 5.13 10
## 7 7 19 1.48 14.5 26
## 8 8 18 1.51 14.5 26
## 9 9 11 -0.942 9.06 20
## 10 10 13 -0.186 11.3 23
## 11 11 5 -1.10 6.90 16
## 12 12 17 1.21 15.7 29
## 13 13 1 -1.62 5.38 14
## 14 14 9 0.105 11.6 23
## 15 15 15 -1.46 13.5 30
## 16 16 3 -0.354 9.15 19
## 17 17 16 -0.0937 16.4 33
## 18 18 2 1.10 11.1 20
## 19 19 8 -1.96 11.5 27
## 20 20 10 -1.45 13.6 30
Notice that, in the way we are generating this data, the response y only really depends on x1 and x2. What happens when we attempt to fit a regression model in R using all three predictors?
lm.demo.e.collin <- lm(y ~ x1 + x2 + x3, data=demo.e.collin)
summary(lm.demo.e.collin)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = demo.e.collin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60938 -0.95633 0.09897 0.89777 1.99849
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.75665 0.82349 0.919 0.371
## x1 0.40593 0.04757 8.533 1.50e-07 ***
## x2 0.51938 0.04757 10.917 4.21e-09 ***
## x3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.198 on 17 degrees of freedom
## Multiple R-squared: 0.9036, Adjusted R-squared: 0.8922
## F-statistic: 79.65 on 2 and 17 DF, p-value: 2.321e-09
We see that R simply decides to exclude the variable x3.
Try to add another variable x4=x2-x1 and re-fit a
linear model y~x1+x2+x4. Discuss your findings.
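A sketch of the code for this exercise (demo.x4 and lm.demo.x4 are illustrative names; the discussion is left to you):
demo.x4 <- demo |> mutate(x4 = x2 - x1)   # x4 is an exact linear combination of x1 and x2
lm.demo.x4 <- lm(y ~ x1 + x2 + x4, data=demo.x4)
summary(lm.demo.x4)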
What if we do not remove x3? This creates big trouble for R, as a bit of arithmetic will show that y=0.5x1+0.5x2 is equivalent to y=0.5x3. More crazily, we also have y=408.5x3-408x2-408x1. There are infinitely many combinations of coefficients for our underlying linear model.
Why is this happening? It is simply because x3 can be predicted perfectly from x1 and x2 with the linear formula x3=x2+x1. The information contained in x3 is redundant given the information from x1 and x2.
When this happens, we say there is exact or
perfect collinearity in the dataset. As a result of
this issue, R essentially chose to fit the model
y ~ x1 + x2 which agrees with the true underlying data
generation mechanism.
However, notice that other orderings of the covariates generate different R summaries
lm.demo.e.collin.2 <- lm(y ~ x1 + x3 + x2, data=demo.e.collin)
summary(lm.demo.e.collin.2)
##
## Call:
## lm(formula = y ~ x1 + x3 + x2, data = demo.e.collin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60938 -0.95633 0.09897 0.89777 1.99849
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.75665 0.82349 0.919 0.3710
## x1 -0.11344 0.05961 -1.903 0.0741 .
## x3 0.51938 0.04757 10.917 4.21e-09 ***
## x2 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.198 on 17 degrees of freedom
## Multiple R-squared: 0.9036, Adjusted R-squared: 0.8922
## F-statistic: 79.65 on 2 and 17 DF, p-value: 2.321e-09
lm.demo.e.collin.3 <- lm(y ~ x2 + x3 + x1, data=demo.e.collin)
summary(lm.demo.e.collin.3)
##
## Call:
## lm(formula = y ~ x2 + x3 + x1, data = demo.e.collin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60938 -0.95633 0.09897 0.89777 1.99849
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.75665 0.82349 0.919 0.3710
## x2 0.11344 0.05961 1.903 0.0741 .
## x3 0.40593 0.04757 8.533 1.5e-07 ***
## x1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.198 on 17 degrees of freedom
## Multiple R-squared: 0.9036, Adjusted R-squared: 0.8922
## F-statistic: 79.65 on 2 and 17 DF, p-value: 2.321e-09
The order of covariates in lm() matters, just as it does in anova(). R fits the models y ~ x1 + x3 and y ~ x2 + x3, respectively.
Given the fact that x3=x1+x2, a bit of arithmetic will reveal that the above three model fits, lm.demo.e.collin, lm.demo.e.collin.2, and lm.demo.e.collin.3, are essentially equivalent.
This can be further confirmed by the fitted values and residuals of the three models. Extract the fitted values and residuals of the above three models and compare them.
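For example (a sketch; both comparisons should return TRUE):
all.equal(fitted(lm.demo.e.collin), fitted(lm.demo.e.collin.2))
all.equal(residuals(lm.demo.e.collin), residuals(lm.demo.e.collin.3))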
This is a result of all of the information contained in x3 being derived from x1 and x2. As long as one of x1 or x2 is included in the model, x3 can be used to recover the information from the variable not included.
While their fitted values (and residuals) are all the same, their estimated coefficients are quite different; the coefficient of x1 even switches sign in one of the models! So only lm.demo.e.collin properly explains the relationship between the variables. lm.demo.e.collin.2 and lm.demo.e.collin.3 still predict as well as lm.demo.e.collin, despite their coefficients having little to no meaning, a concept we will return to later.
Exact collinearity is an extreme example of collinearity, which occurs in multiple regression when predictor variables are highly correlated. From the above two steps, it seems that exact collinearity is not a big deal, since lm() can handle it automatically. Indeed, exact collinearity can be resolved easily, but let us add a small random perturbation to x3 as follows
set.seed(2022)
demo.collin <- demo |> mutate(x3.r=x1+x2+rnorm(n,sd=0.01))
demo.collin
## # A tibble: 20 × 5
## x1 x2 e y x3.r
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 7 0.182 4.18 8.01
## 2 2 6 1.51 5.51 7.99
## 3 3 14 1.60 10.1 17.0
## 4 4 20 -1.84 10.2 24.0
## 5 5 12 1.62 10.1 17.0
## 6 6 4 0.131 5.13 9.97
## 7 7 19 1.48 14.5 26.0
## 8 8 18 1.51 14.5 26.0
## 9 9 11 -0.942 9.06 20.0
## 10 10 13 -0.186 11.3 23.0
## 11 11 5 -1.10 6.90 16.0
## 12 12 17 1.21 15.7 29.0
## 13 13 1 -1.62 5.38 14.0
## 14 14 9 0.105 11.6 23.0
## 15 15 15 -1.46 13.5 30.0
## 16 16 3 -0.354 9.15 19.0
## 17 17 16 -0.0937 16.4 33.0
## 18 18 2 1.10 11.1 20.0
## 19 19 8 -1.96 11.5 27.0
## 20 20 10 -1.45 13.6 30.0
Now x3.r is no longer exactly the sum of x1 and x2. Without knowing the random perturbation generated by rnorm(n, sd=0.01), we won’t be able to recover x2 from x1 and x3.r, or vice versa. A tri-variable linear model fitted to this data set is given as follows
lm.demo.collin <- lm(y~x1+x2+x3.r, data=demo.collin)
summary(lm.demo.collin)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3.r, data = demo.collin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81821 -0.68708 0.05492 0.81924 1.77835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5182 0.8915 0.581 0.569
## x1 22.8368 29.6306 0.771 0.452
## x2 22.9405 29.6178 0.775 0.450
## x3.r -22.4176 29.6131 -0.757 0.460
##
## Residual standard error: 1.213 on 16 degrees of freedom
## Multiple R-squared: 0.9069, Adjusted R-squared: 0.8895
## F-statistic: 51.96 on 3 and 16 DF, p-value: 1.804e-08
Unlike the exact collinearity case, here we can still fit a model with all of the predictors, but what effect does this have? All three coefficients become insignificant!
One of the first things we should notice is that the \(F\)-test for the regression tells us that the regression is significant, yet no individual predictor is. Another interesting result is the opposite signs of the coefficients for x1 and x3.r. This should seem rather counter-intuitive: increasing x1 increases y, but increasing x3.r decreases y? This happens because one or more predictors can be modelled by the other predictors with a linear model. For example, the x1 variable explains a large amount of the variation in x3.r. When they are both in the model, their individual effects on the response are lessened, but together they still explain a large portion of the variation of y.
Actually, the least-squares Estimates remain unbiased under collinearity, but their Std. Errors become far too large, which results in small t values and large P-values. Including a variable like x3.r in our linear model is very dangerous for our statistical inference: it distorts the model summary by inflating the Std. Error, which in turn leads to misleading P-values.
In some cases, we can identify the collinearity in our data set
before fitting a linear model. This can be done via the pairs
plot produced by ggpairs() from the R package
GGally as
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
demo.collin |> select(-e) |> ggpairs()
The diagonal of the pairs plot depicts the density of each variable in the data set. The lower triangle collects the scatter plots of different pairs of variables, and the upper triangle summarises the corresponding correlation coefficients. The pairs plot provides an efficient way to visualise a multivariable data set.
From the above pairs plot, we can see that x3.r is highly linearly correlated with y. If two predictors are highly correlated, meaning one can be predicted from the other through a straight line, this can also be identified from the pairs plot.
Produce the pairs plot for the data set marketing.
Can you identify any issues in this data set?
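A sketch for this exercise (GGally is already loaded above):
marketing |> ggpairs()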
However, it may not be easy to identify collinearity in a data set with just a pairs plot. For example, in the last step we found that x3.r is highly correlated with y, but this does not really reveal the true collinearity between the covariates. We need a better tool to spot those duplicates hidden in the data set.
Notice that the Std.Errors in the previous summary of lm.demo.collin are abnormally large. We use the so-called Variance Inflation Factor (VIF) to detect possible collinearities in a multivariable data set. The variance inflation factor quantifies the effect of collinearity on the variance of our regression estimates. The VIF for each of the predictors in a linear model can be calculated by vif() from the R package faraway as follows.
library(faraway)
##
## Attaching package: 'faraway'
## The following object is masked from 'package:GGally':
##
## happy
vif(lm.demo.collin)
## x1 x2 x3.r
## 396517.8 396174.1 622232.7
vif(lm.youboper)
## youtube facebook newspaper
## 1.004611 1.144952 1.145187
In practice, it is common to say that any VIF greater than 5 is cause for concern. So in this example we see there is a huge multicollinearity issue, as all three predictors in lm.demo.collin have VIFs far greater than 5.
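As a sanity check on this definition, the VIF of x1 in lm.demo.collin can be reproduced by hand: regress x1 on the other two predictors and compute \(1/(1-R^2)\).
r2.x1 <- summary(lm(x1 ~ x2 + x3.r, data=demo.collin))$r.squared
1/(1 - r2.x1)   # should match vif(lm.demo.collin)["x1"]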
Check the VIFs of lm.youboper. Can you identify any
issues in this data set?