Introduction

This is Lab 9 on Data Mining. It is a simple introduction to the practicalities of linear regression.

First we load the datamining package and the data for the lab.

devtools::load_all()
## i Loading datamining
load('lab9.rda')

The dataset

The dataset is 196 weeks of grocery sales for a store. The variables are:

Variable Description
Price.1 DOLE PINEAPPLE ORANG 64 OZ
Price.2 FIVE ALIVE CTRUS BEV 64 OZ
Price.3 HH FRUIT PUNCH 64 OZ
Price.4 HH ORANGE JUICE 64 OZ
Price.5 MIN MAID O J CALCIUM 64 OZ
Price.6 MIN MAID O J PLASTIC 96 OZ
Price.7 MM PULP FREE OJ 64 OZ
Price.8 SUNNY DELIGHT FLA CI 64 OZ
Price.9 TREE FRESH O J REG 64 OZ
Price.10 TROP PURE PRM HOMEST 64 OZ
Price.l1 TROP SB HOMESTYLE OJ 64 OZ
Sold.4 Number of units sold for HH ORANGE JUICE 64 OZ

It is stored in a matrix \(x\). Look at the first few rows via x[1:3,].

head(x)

Standardizing

First we make a copy of x and put the response first in the data.frame. Then we transform Sold.4 appropriately and standardize all variables to have zero mean and unit variance.

xx <- x[, c(ncol(x), 1:ncol(x)-1)]
hist(xx$Sold.4)

xx$Sold.4 <- log(xx$Sold.4)
xx <- as.data.frame(scale(xx))
hist(xx$Sold.4)

Adding predictors

We construct a linear model to predict Sold.4 as a function of Price.

fit <- lm(Sold.4 ~ ., data=xx)
summary(fit)
## 
## Call:
## lm(formula = Sold.4 ~ ., data = xx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64931 -0.52982 -0.06812  0.56385  1.64732 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.320e-15  5.144e-02   0.000  1.00000    
## Price.1     -6.389e-02  5.506e-02  -1.160  0.24738    
## Price.2     -1.500e-03  6.880e-02  -0.022  0.98263    
## Price.3      2.130e-02  5.783e-02   0.368  0.71304    
## Price.4     -6.531e-01  5.761e-02 -11.337  < 2e-16 ***
## Price.5      8.692e-01  2.913e-01   2.984  0.00323 ** 
## Price.6     -6.336e-02  8.810e-02  -0.719  0.47298    
## Price.7     -6.891e-01  2.916e-01  -2.363  0.01918 *  
## Price.8     -4.441e-02  5.535e-02  -0.802  0.42344    
## Price.9      1.607e-03  5.883e-02   0.027  0.97824    
## Price.10     4.824e-02  6.371e-02   0.757  0.44992    
## Price.11     2.911e-01  6.090e-02   4.780 3.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7201 on 184 degrees of freedom
## Multiple R-squared:  0.5107, Adjusted R-squared:  0.4815 
## F-statistic: 17.46 on 11 and 184 DF,  p-value: < 2.2e-16

Then we plot all variables versus the residuals of this model.

predict_plot(fit, xx)
## plotting residuals

The important predictors appear to be Price.4, Price.5, Price.7 and Price.11.

We pick one of the important predictors from the last step and include it in a new model.

fit_4 <- lm(Sold.4 ~ Price.4, data=xx)
summary(fit_4)
## 
## Call:
## lm(formula = Sold.4 ~ Price.4, data = xx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73482 -0.56929 -0.08913  0.65868  1.60808 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.403e-16  5.643e-02    0.00        1    
## Price.4     -6.157e-01  5.657e-02  -10.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.79 on 194 degrees of freedom
## Multiple R-squared:  0.3791, Adjusted R-squared:  0.3759 
## F-statistic: 118.5 on 1 and 194 DF,  p-value: < 2.2e-16
predict_plot(fit_4, xx)
## plotting residuals

The same predictors as before seem important with Price.11 the most important.

Next we add Price.11 to the model.

fit_4_11 <- lm(Sold.4 ~ Price.4 + Price.11, data=xx)
summary(fit_4_11)
## 
## Call:
## lm(formula = Sold.4 ~ Price.4 + Price.11, data = xx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6298 -0.5697 -0.0398  0.6173  1.6679 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.655e-16  5.305e-02   0.000        1    
## Price.4     -6.599e-01  5.388e-02 -12.249  < 2e-16 ***
## Price.11     2.772e-01  5.388e-02   5.144 6.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7427 on 193 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4483 
## F-statistic: 80.24 on 2 and 193 DF,  p-value: < 2.2e-16
predict_plot(fit_4_11, xx)
## plotting residuals

We keep adding predictors until no more predictors seem useful.

fit_4_11_5_7 <- lm(Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7, data=xx)
summary(fit_4_11_5_7)
## 
## Call:
## lm(formula = Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7, 
##     data = xx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77543 -0.45978 -0.07869  0.55778  1.66823 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.380e-15  5.101e-02   0.000  1.00000    
## Price.4     -6.782e-01  5.256e-02 -12.901  < 2e-16 ***
## Price.11     2.584e-01  5.217e-02   4.954  1.6e-06 ***
## Price.5      8.756e-01  2.820e-01   3.105  0.00219 ** 
## Price.7     -7.128e-01  2.825e-01  -2.523  0.01246 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7142 on 191 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:   0.49 
## F-statistic: 47.83 on 4 and 191 DF,  p-value: < 2.2e-16

Automatic selection

Starting again from a model with only Price.4, we use step to add predictors.

fit_step <- step(fit_4, formula(xx))
## Start:  AIC=-90.42
## Sold.4 ~ Price.4
## 
##            Df Sum of Sq    RSS      AIC
## + Price.11  1    14.599 106.47 -113.607
## + Price.5   1     8.066 113.00 -101.934
## + Price.7   1     6.094 114.98  -98.544
## + Price.6   1     4.879 116.19  -96.484
## + Price.10  1     2.691 118.38  -92.827
## <none>                  121.07  -90.422
## + Price.2   1     1.010 120.06  -90.063
## + Price.3   1     0.593 120.48  -89.384
## + Price.1   1     0.378 120.69  -89.035
## + Price.9   1     0.171 120.90  -88.699
## + Price.8   1     0.072 121.00  -88.539
## - Price.4   1    73.930 195.00    0.997
## 
## Step:  AIC=-113.61
## Sold.4 ~ Price.4 + Price.11
## 
##            Df Sum of Sq    RSS      AIC
## + Price.5   1     5.811 100.66 -122.607
## + Price.7   1     4.140 102.33 -119.380
## + Price.1   1     1.715 104.76 -114.791
## <none>                  106.47 -113.607
## + Price.10  1     0.904 105.57 -113.278
## + Price.6   1     0.381 106.09 -112.310
## + Price.8   1     0.170 106.30 -111.921
## + Price.2   1     0.161 106.31 -111.903
## + Price.3   1     0.012 106.46 -111.629
## + Price.9   1     0.003 106.47 -111.613
## - Price.11  1    14.599 121.07  -90.422
## - Price.4   1    82.766 189.24   -2.882
## 
## Step:  AIC=-122.61
## Sold.4 ~ Price.4 + Price.11 + Price.5
## 
##            Df Sum of Sq     RSS      AIC
## + Price.7   1     3.246  97.414 -127.031
## + Price.1   1     1.321  99.340 -123.195
## <none>                  100.661 -122.607
## + Price.6   1     0.562 100.098 -121.704
## + Price.8   1     0.528 100.133 -121.638
## + Price.2   1     0.304 100.357 -121.199
## + Price.9   1     0.172 100.489 -120.942
## + Price.3   1     0.066 100.594 -120.736
## + Price.10  1     0.054 100.607 -120.711
## - Price.5   1     5.811 106.471 -113.607
## - Price.11  1    12.344 113.005 -101.934
## - Price.4   1    87.631 188.292   -1.864
## 
## Step:  AIC=-127.03
## Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7
## 
##            Df Sum of Sq     RSS      AIC
## <none>                   97.414 -127.031
## + Price.1   1     0.871  96.543 -126.793
## + Price.8   1     0.729  96.685 -126.504
## + Price.6   1     0.445  96.970 -125.928
## + Price.10  1     0.140  97.274 -125.314
## + Price.2   1     0.104  97.310 -125.242
## + Price.9   1     0.070  97.345 -125.172
## + Price.3   1     0.030  97.385 -125.091
## - Price.7   1     3.246 100.661 -122.607
## - Price.5   1     4.917 102.332 -119.380
## - Price.11  1    12.517 109.931 -105.339
## - Price.4   1    84.891 182.305   -6.197

The automatically generated model is the same as that generated manually.

Finally we make a partial residual plot for the model from step.

predict_plot(fit_step, partial=TRUE)
## plotting partial residuals

Residual and partial residual plots are also available in the function termplot in the stats package.

library(stats)
par(mfrow = c(2, 2))
termplot(fit_step, partial.resid = TRUE, ask = FALSE)

par(mfrow = c(1, 1))

Note that the partial residuals for Sold.4 are shown using predict_plot whereas termplot uses the partial residuals for each predictor on the vertical axis.