This is Lab 9 on Data Mining, a brief introduction to the practicalities of linear regression.
First we load the datamining package and the data for the lab.
devtools::load_all()
## i Loading datamining
load('lab9.rda')
The dataset is 196 weeks of grocery sales for a store. The variables are:
| Variable | Description |
|---|---|
| Price.1 | DOLE PINEAPPLE ORANG 64 OZ |
| Price.2 | FIVE ALIVE CTRUS BEV 64 OZ |
| Price.3 | HH FRUIT PUNCH 64 OZ |
| Price.4 | HH ORANGE JUICE 64 OZ |
| Price.5 | MIN MAID O J CALCIUM 64 OZ |
| Price.6 | MIN MAID O J PLASTIC 96 OZ |
| Price.7 | MM PULP FREE OJ 64 OZ |
| Price.8 | SUNNY DELIGHT FLA CI 64 OZ |
| Price.9 | TREE FRESH O J REG 64 OZ |
| Price.10 | TROP PURE PRM HOMEST 64 OZ |
| Price.11 | TROP SB HOMESTYLE OJ 64 OZ |
| Sold.4 | Number of units sold for HH ORANGE JUICE 64 OZ |
It is stored in \(x\). Look at the first few rows with head(x).
head(x)
First we make a copy of x and put the response first in the data.frame. Then we log-transform Sold.4 (unit sales are strongly right-skewed, as the first histogram shows) and standardize all variables to have zero mean and unit variance.
xx <- as.data.frame(x[, c(ncol(x), 1:(ncol(x) - 1))])
hist(xx$Sold.4)
xx$Sold.4 <- log(xx$Sold.4)
xx <- as.data.frame(scale(xx))
hist(xx$Sold.4)
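As a sanity check on what these two steps do, here is a minimal sketch on synthetic data (the variable names are invented, not the lab's x): log shrinks the right tail of a skewed positive variable, and scale leaves every column with mean zero and standard deviation one.

```r
# Minimal sketch on synthetic data (not the lab's x): log-transform a
# right-skewed positive variable, then standardize every column.
set.seed(1)
d <- data.frame(Sold = rlnorm(100, meanlog = 3), Price = runif(100, 1, 3))
d$Sold <- log(d$Sold)          # symmetrize the right-skewed response
d <- as.data.frame(scale(d))   # each column: mean 0, sd 1
round(colMeans(d), 10)         # both columns effectively 0
apply(d, 2, sd)                # both columns exactly 1
```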
We construct a linear model to predict Sold.4 as a function of all eleven prices.
fit <- lm(Sold.4 ~ ., data=xx)
summary(fit)
##
## Call:
## lm(formula = Sold.4 ~ ., data = xx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64931 -0.52982 -0.06812 0.56385 1.64732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.320e-15 5.144e-02 0.000 1.00000
## Price.1 -6.389e-02 5.506e-02 -1.160 0.24738
## Price.2 -1.500e-03 6.880e-02 -0.022 0.98263
## Price.3 2.130e-02 5.783e-02 0.368 0.71304
## Price.4 -6.531e-01 5.761e-02 -11.337 < 2e-16 ***
## Price.5 8.692e-01 2.913e-01 2.984 0.00323 **
## Price.6 -6.336e-02 8.810e-02 -0.719 0.47298
## Price.7 -6.891e-01 2.916e-01 -2.363 0.01918 *
## Price.8 -4.441e-02 5.535e-02 -0.802 0.42344
## Price.9 1.607e-03 5.883e-02 0.027 0.97824
## Price.10 4.824e-02 6.371e-02 0.757 0.44992
## Price.11 2.911e-01 6.090e-02 4.780 3.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7201 on 184 degrees of freedom
## Multiple R-squared: 0.5107, Adjusted R-squared: 0.4815
## F-statistic: 17.46 on 11 and 184 DF, p-value: < 2.2e-16
Then we plot the residuals of this model against every variable.
predict_plot(fit, xx)
## plotting residuals
The important predictors appear to be Price.4, Price.5, Price.7 and Price.11.
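A complementary way to shortlist "important" predictors is to rank them by absolute t value from the coefficient table. A sketch on synthetic data (the names and coefficients here are invented, merely mimicking the lab's variables):

```r
# Rank the predictors of a fitted lm by |t value| (synthetic data).
set.seed(2)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 3), n, 3,
                          dimnames = list(NULL, paste0("Price.", 1:3))))
X$Sold <- -0.7 * X$Price.1 + 0.3 * X$Price.3 + rnorm(n)
fit_demo <- lm(Sold ~ ., data = X)
tvals <- coef(summary(fit_demo))[-1, "t value"]  # drop the intercept row
sort(abs(tvals), decreasing = TRUE)              # strongest predictor first
```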
We pick one of the important predictors from the last step and include it in a new model.
fit_4 <- lm(Sold.4 ~ Price.4, data=xx)
summary(fit_4)
##
## Call:
## lm(formula = Sold.4 ~ Price.4, data = xx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73482 -0.56929 -0.08913 0.65868 1.60808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.403e-16 5.643e-02 0.00 1
## Price.4 -6.157e-01 5.657e-02 -10.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.79 on 194 degrees of freedom
## Multiple R-squared: 0.3791, Adjusted R-squared: 0.3759
## F-statistic: 118.5 on 1 and 194 DF, p-value: < 2.2e-16
predict_plot(fit_4, xx)
## plotting residuals
The same predictors as before still appear important, with Price.11 now standing out the most.
Next we add Price.11 to the model.
fit_4_11 <- lm(Sold.4 ~ Price.4 + Price.11, data=xx)
summary(fit_4_11)
##
## Call:
## lm(formula = Sold.4 ~ Price.4 + Price.11, data = xx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6298 -0.5697 -0.0398 0.6173 1.6679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.655e-16 5.305e-02 0.000 1
## Price.4 -6.599e-01 5.388e-02 -12.249 < 2e-16 ***
## Price.11 2.772e-01 5.388e-02 5.144 6.57e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7427 on 193 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4483
## F-statistic: 80.24 on 2 and 193 DF, p-value: < 2.2e-16
predict_plot(fit_4_11, xx)
## plotting residuals
We keep adding predictors until none of the remaining ones seem useful; here Price.5 and Price.7 both still look helpful, so we add them as well.
fit_4_11_5_7 <- lm(Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7, data=xx)
summary(fit_4_11_5_7)
##
## Call:
## lm(formula = Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7,
## data = xx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77543 -0.45978 -0.07869 0.55778 1.66823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.380e-15 5.101e-02 0.000 1.00000
## Price.4 -6.782e-01 5.256e-02 -12.901 < 2e-16 ***
## Price.11 2.584e-01 5.217e-02 4.954 1.6e-06 ***
## Price.5 8.756e-01 2.820e-01 3.105 0.00219 **
## Price.7 -7.128e-01 2.825e-01 -2.523 0.01246 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7142 on 191 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.49
## F-statistic: 47.83 on 4 and 191 DF, p-value: < 2.2e-16
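Whether newly added predictors genuinely help can also be tested formally with an F test, by passing nested models to anova(). A sketch on synthetic data (variable names invented):

```r
# Compare nested lm fits with an F test (synthetic data).
set.seed(3)
n <- 200
z <- data.frame(a = rnorm(n), b = rnorm(n))
z$y <- -0.6 * z$a + 0.5 * z$b + rnorm(n)
small <- lm(y ~ a,     data = z)
big   <- lm(y ~ a + b, data = z)
anova(small, big)  # a small p-value means b adds explanatory power
```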
Starting again from the model containing only Price.4, we use step to add (or drop) predictors automatically, guided by AIC.
fit_step <- step(fit_4, scope = formula(xx))
## Start: AIC=-90.42
## Sold.4 ~ Price.4
##
## Df Sum of Sq RSS AIC
## + Price.11 1 14.599 106.47 -113.607
## + Price.5 1 8.066 113.00 -101.934
## + Price.7 1 6.094 114.98 -98.544
## + Price.6 1 4.879 116.19 -96.484
## + Price.10 1 2.691 118.38 -92.827
## <none> 121.07 -90.422
## + Price.2 1 1.010 120.06 -90.063
## + Price.3 1 0.593 120.48 -89.384
## + Price.1 1 0.378 120.69 -89.035
## + Price.9 1 0.171 120.90 -88.699
## + Price.8 1 0.072 121.00 -88.539
## - Price.4 1 73.930 195.00 0.997
##
## Step: AIC=-113.61
## Sold.4 ~ Price.4 + Price.11
##
## Df Sum of Sq RSS AIC
## + Price.5 1 5.811 100.66 -122.607
## + Price.7 1 4.140 102.33 -119.380
## + Price.1 1 1.715 104.76 -114.791
## <none> 106.47 -113.607
## + Price.10 1 0.904 105.57 -113.278
## + Price.6 1 0.381 106.09 -112.310
## + Price.8 1 0.170 106.30 -111.921
## + Price.2 1 0.161 106.31 -111.903
## + Price.3 1 0.012 106.46 -111.629
## + Price.9 1 0.003 106.47 -111.613
## - Price.11 1 14.599 121.07 -90.422
## - Price.4 1 82.766 189.24 -2.882
##
## Step: AIC=-122.61
## Sold.4 ~ Price.4 + Price.11 + Price.5
##
## Df Sum of Sq RSS AIC
## + Price.7 1 3.246 97.414 -127.031
## + Price.1 1 1.321 99.340 -123.195
## <none> 100.661 -122.607
## + Price.6 1 0.562 100.098 -121.704
## + Price.8 1 0.528 100.133 -121.638
## + Price.2 1 0.304 100.357 -121.199
## + Price.9 1 0.172 100.489 -120.942
## + Price.3 1 0.066 100.594 -120.736
## + Price.10 1 0.054 100.607 -120.711
## - Price.5 1 5.811 106.471 -113.607
## - Price.11 1 12.344 113.005 -101.934
## - Price.4 1 87.631 188.292 -1.864
##
## Step: AIC=-127.03
## Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7
##
## Df Sum of Sq RSS AIC
## <none> 97.414 -127.031
## + Price.1 1 0.871 96.543 -126.793
## + Price.8 1 0.729 96.685 -126.504
## + Price.6 1 0.445 96.970 -125.928
## + Price.10 1 0.140 97.274 -125.314
## + Price.2 1 0.104 97.310 -125.242
## + Price.9 1 0.070 97.345 -125.172
## + Price.3 1 0.030 97.385 -125.091
## - Price.7 1 3.246 100.661 -122.607
## - Price.5 1 4.917 102.332 -119.380
## - Price.11 1 12.517 109.931 -105.339
## - Price.4 1 84.891 182.305 -6.197
The model selected automatically by step is the same as the one we built manually.
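For lm fits, the AIC column that step prints comes from extractAIC, which uses n·log(RSS/n) + 2·edf (this differs from AIC() by an additive constant, which does not affect model comparisons). A sketch verifying the formula on synthetic data:

```r
# Reproduce the AIC that step()/extractAIC() report for an lm fit.
set.seed(4)
n <- 100
w <- data.frame(u = rnorm(n))
w$y <- 2 * w$u + rnorm(n)
f <- lm(y ~ u, data = w)
rss <- sum(resid(f)^2)
edf <- length(coef(f))  # equivalent degrees of freedom: 2 (intercept, slope)
c(manual = n * log(rss / n) + 2 * edf, extractAIC = extractAIC(f)[2])
```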
Finally we make a partial residual plot for the model from step.
predict_plot(fit_step, xx, partial=TRUE)
## plotting partial residuals
Residual and partial residual plots are also available from the function termplot in the stats package.
library(stats)  # stats is attached by default; the call is shown for completeness
par(mfrow = c(2, 2))
termplot(fit_step, partial.resid = TRUE, ask = FALSE)
par(mfrow = c(1, 1))
Note that predict_plot plots the residuals of Sold.4 against each predictor, whereas termplot puts the partial residual for each term on the vertical axis.
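Concretely, the partial residual that termplot draws for a term is the ordinary residual plus that term's mean-centered fitted contribution, which residuals(fit, type = "partial") returns directly. A sketch on synthetic data:

```r
# Partial residuals for term "a": resid + beta_a * (a - mean(a)),
# i.e. what termplot puts on its vertical axis (synthetic data).
set.seed(5)
p <- data.frame(a = rnorm(50), b = rnorm(50))
p$y <- p$a - 0.5 * p$b + rnorm(50)
f2 <- lm(y ~ a + b, data = p)
pr <- residuals(f2, type = "partial")
all.equal(unname(pr[, "a"]),
          unname(resid(f2) + coef(f2)[["a"]] * (p$a - mean(p$a))))
```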