Tag Archives: grid search

Grid Search, Cross Validation and XGBoost in R

This post is to help take the chapter on tree-based methods of the highly reputed Introduction to Statistical Learning text to the next level with grid search, cross validation and XGBoost.

I assume that you have read the chapter on Tree-Based Methods (chapter 8 in the copy I got) and executed the code in the lab section. Here we look into regression trees only. It should be straight forward to apply the ideas here to a classification problem by changing the objective function.

The last statement of the Boosting sub-section (of the lab) states that

using lambda of 0.2 leads to a slightly lower test MSE than lambda of 0.001.

It is a pity that the authors do not teach how to select the optimal lambda via grid search and cross-validation. Here is an attempt to do that.

Grid Search with Cross Validation

Let us first split the Boston dataset into 80% training and 20% test (unlike the 50-50 split the authors do).

set.seed(1234)
trainInd <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
dfTrain <- Boston[ trainInd,]
dfTest  <- Boston[-trainInd,]
dim(dfTrain) # 404 14
dim(dfTest) # 102 14

For selecting the best parameters for gbm, we will need the caret package. Install it using this R command:

install.packages("caret")

Here is the code to do a grid search over all the hyper-parameters of gbm:

require(caret)

# set a grid of parameters to optimize upon;
# all these are gbm params and must be specified
gbmGrid <- expand.grid(
  n.trees = c(250, 500, 1000),
  shrinkage = c(0.001, 0.01, 0.1), # lambda
  interaction.depth = c(1, 2, 4), 
  n.minobsinnode = c(10) # left at default
)

# 5-fold CV; set method to "repeatedcv" 
# and set repeats for repeated CV
gbmTrControl <- trainControl(
  method = "cv",
  number = 5,
  #repeats = 4,
  verboseIter = FALSE,
  allowParallel = TRUE
)

# do the training
gbmTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) 
                              %in% c("medv")]), 
  y = dfTrain$medv,
  trControl = gbmTrControl,
  tuneGrid = gbmGrid,
  method = "gbm"
)

# get the top 5 models
head(gbmTrain$results[with(gbmTrain$results, 
                           order(RMSE)), ], 5)
# MSE for best model = 3.42 ^ 2 = 11.7

# get the best model's parameters
gbmTrain$bestTune
#   n.trees interaction.depth shrinkage n.minobsinnode
#       500                 4       0.1             10

yhatGbm <- predict(gbmTrain, newdata = dfTest)
mean((yhatGbm - dfTest$medv) ^ 2) # 7.28

# plot the preds in red and actual in blue
plot(dfTest$medv, yhatGbm, col = "red")
abline(0, 1, col = "blue")

The plot looks like this:

gbm

In this case, we are actually getting much lower test MSE than training MSE.

Feel free to expand the grid search to larger ranges and select better parameters.

XGBoost

Next, we see how to do the same optimal parameter selection for XGBoost.

Install the XGBoost package using

install.packages("xgboost")

The following code then searches and gives you the best hyper-parameters for XGBoost.

library(xgboost)

xgbGrid <- expand.grid(
  nrounds = c(250, 500, 1000),
  max_depth = c(1, 2, 4),
  eta = c(0.001, 0.003, 0.01),
  gamma = c(0, 1, 2),
  colsample_bytree = c(1, 0.5, 0.25),
  min_child_weight = c(1, 2)
)

xgbTrControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  verboseIter = FALSE,
  returnData = FALSE,
  allowParallel = TRUE
)

xgbTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) %in% c("medv")]), 
  y = dfTrain$medv,
  objective = "reg:linear",
  trControl = xgbTrControl,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)

# get the top model and its results
head(xgbTrain$results[with(xgbTrain$results, 
                             order(RMSE)), ], 1)
# MSE 3.12 ^ 2 = 9.74

yhatXgb <- predict(xgbTrain, newdata = dfTest)
mean((yhatXgb - dfTest$medv) ^ 2) # 10.49

plot(dfTest$medv, yhatXgb, col = "red")
abline(0, 1, col = "blue")

# Variable Importance
names <- names(dfTrain)[! names(dfTrain) %in% c("medv")]
importanceMatrix <- xgb.importance(names, 
                                   model = xgbTrain$finalModel)
xgb.plot.importance(importanceMatrix[1:10,])

The variable importance plot looks like this:

varImp

Again, we can see that lstat and rm are the two main features.

The test MSE for the two models are lower compared to the ones presented in the text. This could be because our training set size is larger and also because of the optimizations we did for the models. For fair comparison, you should test the code in the text with 80% training data like we did here.

PS: I could not figure out easily how to plot the variable importance for the GBM model.