You will need the core library for the course textbook (ISLR) plus a few modelling packages, so let's load them all upfront. NOTE: we use message=FALSE in the code chunks to suppress package start-up messages.
library(ISLR)
library(MASS)
library(glmnet)
library(leaps)
library(pls)
In the lab session for this topic (Sections 5.3.2 and 5.3.3 in James et al.), we saw that the cv.glm() function can be used to compute the LOOCV test error estimate. Alternatively, one could compute the same quantity using just the glm() and predict.glm() functions and a for loop. You will now take this approach to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that, for classification problems, the LOOCV error is given by Equation (5.4) in Section 5.1.5 (page 184).
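For reference, that formula is
\[
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}_i, \qquad \mathrm{Err}_i = I(y_i \neq \hat{y}_i),
\]
where \(\hat{y}_i\) is the predicted class for observation \(i\) from the model fitted on the remaining \(n - 1\) observations.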
logit_model <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(logit_model)
Call:
glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.623  -1.261   1.001   1.083   1.506  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.22122    0.06147   3.599 0.000319 ***
Lag1        -0.03872    0.02622  -1.477 0.139672    
Lag2         0.06025    0.02655   2.270 0.023232 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1488.2  on 1086  degrees of freedom
AIC: 1494.2

Number of Fisher Scoring iterations: 4
logit_model <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)
summary(logit_model)
Call:
glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly[-1, ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6258  -1.2617   0.9999   1.0819   1.5071  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.22324    0.06150   3.630 0.000283 ***
Lag1        -0.03843    0.02622  -1.466 0.142683    
Lag2         0.06085    0.02656   2.291 0.021971 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1494.6  on 1087  degrees of freedom
Residual deviance: 1486.5  on 1085  degrees of freedom
AIC: 1492.5

Number of Fisher Scoring iterations: 4
Instead of repeating the code that converts predicted probabilities into Up/Down labels, we'll wrap it in a helper function, similar to what we did last week.
# Get the prediction as an Up/Down direction - same helper as last week.
predict_glm_direction <- function(model, newdata = NULL) {
  predictions <- predict(model, newdata, type = "response")
  return(ifelse(predictions < 0.5, "Down", "Up"))
}

predicted_direction <- predict_glm_direction(logit_model, Weekly[1, ])
The prediction was Up, while the true direction was Down.
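We can confirm this directly (a quick check; the original text simply states the outcome):

```r
# Compare the leave-first-out prediction with the true direction of observation 1.
predicted_direction               # "Up"
as.character(Weekly$Direction[1]) # "Down"
```

Now repeat this for every observation to obtain the LOOCV error.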
prediction_errors <- rep(NA, nrow(Weekly))
for (i in 1:nrow(Weekly)) {
  logit_model <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  predicted_direction <- predict_glm_direction(logit_model, Weekly[i, ])
  prediction_errors[i] <- ifelse(predicted_direction != Weekly$Direction[i], 1, 0)
}
mean(prediction_errors)
[1] 0.4499541
LOOCV estimates a test error rate of 0.4499541, so the model does only slightly better than random guessing.
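As a cross-check (not required by the exercise), the same estimate can be obtained with cv.glm() from the boot package. This is only a sketch: it assumes boot is installed, recodes Direction as a 0/1 variable, and uses a hypothetical cost_fn helper for the misclassification cost.

```r
library(boot)

# cv.glm() works with a numeric response, so recode Direction as 0/1 first.
weekly01 <- Weekly
weekly01$Up <- as.numeric(weekly01$Direction == "Up")

cv_fit <- glm(Up ~ Lag1 + Lag2, data = weekly01, family = binomial)

# Misclassification cost: 1 when the fitted probability is on the wrong side of 0.5.
cost_fn <- function(y, prob) mean(abs(y - prob) > 0.5)

# K = n gives leave-one-out CV; delta[1] should match the for-loop estimate above.
cv.glm(weekly01, cv_fit, cost = cost_fn, K = nrow(weekly01))$delta[1]
```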
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Load and split the College data.
# Split the dataset into a training and a test set according to split_ratio.
# Optionally set the random seed before splitting.
# Returns a list of two elements:
#   - train: the training subset
#   - test:  the test subset
sample_split <- function(dataset, split_ratio, seed = NULL) {
  set.seed(seed)
  train_subset <- sample(nrow(dataset), size = floor(nrow(dataset) * split_ratio))
  return(list(
    train = dataset[train_subset, ],
    test = dataset[-train_subset, ]
  ))
}

college_split <- sample_split(College, 0.7, seed = 12345)
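As a quick sanity check (not part of the original write-up), the two subsets should add up to the 777 rows of College, split roughly 70/30:

```r
# Roughly 70% of the colleges should land in the training set.
nrow(college_split$train)
nrow(college_split$test)
```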
ols_model <- lm(Apps ~ ., data = college_split$train)
ols_predictions <- predict(ols_model, college_split$test)
Since we need to calculate the test RMSE throughout this exercise, let's write a function to do that.
calc_rmse <- function(y, y_hat) {
  return(sqrt(mean((y - y_hat)^2)))
}
Now we can calculate the test RMSE for the OLS model.
ols_rmse <- calc_rmse(college_split$test$Apps, ols_predictions)
Test RMSE for OLS is: 1188.832
Ridge regression: pick \(\lambda\) by cross-validation on the training subset and report the error on the test subset.
train_matrix <- model.matrix(Apps ~ ., data = college_split$train)
test_matrix <- model.matrix(Apps ~ ., data = college_split$test)
grid <- 10^seq(4, -2, length = 100)

ridge_model <- cv.glmnet(train_matrix, college_split$train$Apps,
                         alpha = 0, lambda = grid, thresh = 1e-12)
ridge_predictions <- predict(ridge_model, test_matrix, s = ridge_model$lambda.min)
ridge_rmse <- calc_rmse(college_split$test$Apps, ridge_predictions)
Test RMSE for ridge model is: 1188.8145824
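If you want to see which \(\lambda\) cross-validation picked (not shown above), it is stored in the cv.glmnet object:

```r
# Value of lambda that minimised the cross-validated error on the training set.
ridge_model$lambda.min
```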
The lasso: pick \(\lambda\) by cross-validation on the training subset and report the error on the test subset.
lasso_model <- cv.glmnet(train_matrix, college_split$train$Apps,
                         alpha = 1, lambda = grid, thresh = 1e-12)
lasso_predictions <- predict(lasso_model, test_matrix, s = lasso_model$lambda.min)
lasso_rmse <- calc_rmse(college_split$test$Apps, lasso_predictions)
Test RMSE for lasso model is: 1188.7994726
predict(lasso_model, s = lasso_model$lambda.min, type = "coefficients")
19 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept) -601.38112596
(Intercept)     .        
PrivateYes  -321.41076249
Accept         1.70685797
Enroll        -1.35810665
Top10perc     45.45111099
Top25perc    -15.88692306
F.Undergrad    0.09983028
P.Undergrad    0.01567394
Outstate      -0.09222852
Room.Board     0.11816525
Books         -0.04056782
Personal       0.05792203
PhD           -5.56419852
Terminal      -5.43103094
S.F.Ratio     21.59623616
perc.alumni    1.87814752
Expend         0.10806715
Grad.Rate      8.13596816
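A quick way to check how sparse this fit actually is (a small addition to the original output) is to count the zero entries of the coefficient vector; judging from the table above, only the redundant intercept column created by model.matrix is zero, so no predictor is dropped at lambda.min.

```r
lasso_coefs <- predict(lasso_model, s = lasso_model$lambda.min, type = "coefficients")
# Number of entries shrunk exactly to zero (here just the duplicate intercept row).
sum(lasso_coefs == 0)
```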
pcr_model <- pcr(Apps ~ ., data = college_split$train, scale = TRUE, validation = "CV")
validationplot(pcr_model, val.type = "MSEP")
pcr_predictions <- predict(pcr_model, college_split$test, ncomp = 10)
pcr_rmse <- calc_rmse(college_split$test$Apps, pcr_predictions)
Test RMSE for PCR model is: 1556.8207088
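The choice of ncomp = 10 appears to be read off the validation plot. If you prefer a programmatic choice, the pls package can suggest one; this is a sketch assuming a recent version of pls that provides selectNcomp():

```r
# One-sigma rule: smallest number of components whose CV error is within one
# standard error of the minimum.
selectNcomp(pcr_model, method = "onesigma", plot = FALSE)

# Alternatively, the component count with the lowest cross-validated RMSEP
# (dropping the first entry, which is the intercept-only model).
which.min(RMSEP(pcr_model, estimate = "CV")$val[1, 1, -1])
```

The same calls work for the plsr fit below.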
pls_model <- plsr(Apps ~ ., data = college_split$train, scale = TRUE, validation = "CV")
validationplot(pls_model, val.type = "MSEP")
pls_predictions <- predict(pls_model, college_split$test, ncomp = 10)
pls_rmse <- calc_rmse(college_split$test$Apps, pls_predictions)
Test RMSE for PLS model is: 1178.3609154
Let's write a function to compute the test \(R^2\) so we can compare the fit of the various models. Recall that the \(R^2\) is given by:
\[
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]
calc_r2 <- function(y, y_hat) {
  y_bar <- mean(y)
  rss <- sum((y - y_hat)^2)
  tss <- sum((y - y_bar)^2)
  return(1 - (rss / tss))
}
Test RMSEs for OLS, ridge, and the lasso are very close. At the selected \(\lambda\), the lasso only shrinks the coefficients slightly and does not set any predictor exactly to zero (see the coefficient table above). Here are the test \(R^2\) values for all models.
ols_r2 <- calc_r2(college_split$test$Apps, ols_predictions)
ridge_r2 <- calc_r2(college_split$test$Apps, ridge_predictions)
lasso_r2 <- calc_r2(college_split$test$Apps, lasso_predictions)
pcr_r2 <- calc_r2(college_split$test$Apps, pcr_predictions)
pls_r2 <- calc_r2(college_split$test$Apps, pls_predictions)

barplot(c(ols_r2, ridge_r2, lasso_r2, pcr_r2, pls_r2),
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
The plot shows that the test \(R^2\) is around 0.9 for all models except PCR, with PLS slightly ahead of the others. PCR has a lower test \(R^2\), below 0.8. All models except PCR therefore predict the number of college applications with reasonably high accuracy.
We will now try to predict per capita crime rate in the Boston data set.
Best subset selection
We need a function for getting predictions from a regsubsets model.
predict.regsubsets <- function(object, newdata, id, ...) {
  formula <- as.formula(object$call[[2]])
  model_matrix <- model.matrix(formula, newdata)
  coefi <- coef(object, id = id)
  model_matrix[, names(coefi)] %*% coefi
}
Now run K-fold cross-validation
set.seed(12345)
k <- 10
p <- ncol(Boston) - 1
folds <- sample(rep(1:k, length = nrow(Boston)))
cv_errors <- matrix(NA, k, p)

for (i in 1:k) {
  best_fit <- regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    prediction <- predict(best_fit, Boston[folds == i, ], id = j)
    cv_errors[i, j] <- mean((Boston$crim[folds == i] - prediction)^2)
  }
}

cv_rmse <- sqrt(apply(cv_errors, 2, mean))
plot(cv_rmse, type = "b")
which.min(cv_rmse)
[1] 12
min(cv_rmse)
[1] 6.55376
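To see which predictors the 12-variable model actually uses (not shown in the original write-up), one option is to rerun best subset selection on the full data set and extract the coefficients for that model size; best_subset_full is just an illustrative name.

```r
# Refit on all of Boston and pull out the coefficients of the 12-variable model.
best_subset_full <- regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best_subset_full, id = 12)
```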
Lasso
x <- model.matrix(crim ~ . - 1, data = Boston)
cv_lasso <- cv.glmnet(x, Boston$crim, type.measure = "mse", alpha = 1)
plot(cv_lasso)
coef(cv_lasso)
14 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 1.4186414
zn          .        
indus       .        
chas        .        
nox         .        
rm          .        
age         .        
dis         .        
rad         0.2298449
tax         .        
ptratio     .        
black       .        
lstat       .        
medv        .        
sqrt(cv_lasso$cvm[cv_lasso$lambda == cv_lasso$lambda.1se])
[1] 7.568515
Ridge regression
x <- model.matrix(crim ~ . - 1, data = Boston)
cv_ridge <- cv.glmnet(x, Boston$crim, type.measure = "mse", alpha = 0)
plot(cv_ridge)
coef(cv_ridge)
14 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)  1.761587688
zn          -0.002865786
indus        0.026514040
chas        -0.142145708
nox          1.684835359
rm          -0.130992458
age          0.005612849
dis         -0.084630196
rad          0.039946869
tax          0.001830673
ptratio      0.063756646
black       -0.002278831
lstat        0.031682619
medv        -0.020839190
sqrt(cv_ridge$cvm[cv_ridge$lambda == cv_ridge$lambda.1se])
[1] 7.760891
PCR
pcr_model <- pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr_model)
Data:   X dimension: 506 13
        Y dimension: 506 1
Fit method: svdpc
Number of components considered: 13

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV            8.61    7.270    7.278    6.832    6.824    6.855    6.908
adjCV         8.61    7.264    7.272    6.824    6.813    6.846    6.895
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       6.901    6.770    6.800     6.795     6.811     6.784     6.705
adjCV    6.887    6.756    6.785     6.780     6.795     6.765     6.686

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X       47.70    60.36    69.67    76.45    82.99    88.00    91.14
crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14
      8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
X       93.45    95.40     97.04     98.46     99.52     100.0
crim    42.47    42.55     42.78     43.04     44.13      45.4
The 13-component PCR fit has the lowest CV/adjCV RMSEP.
See above for the cross-validated (root) mean squared errors of the selected models.
I would choose the 12-predictor best subset model: it has the lowest cross-validated RMSE (about 6.55, slightly better than the 13-component PCR fit) and is a simpler model than PCR.