
Beta Binomial Model in R

I have been reading Kevin Murphy’s book Machine Learning: a Probabilistic Perspective to study machine learning in more depth. I love the book and it has been an exciting experience so far, but I wish it could have been better.

When I reached the first topic that was new to me, Sec. 3.3 The Beta-Binomial Model, I noticed that the notation in the formulae is inconsistent, especially for the shape parameters of the Beta distribution. I am not going to get into the details here, but if you read that section you will understand the sloppiness.

Here I thought I could share the R code for reproducing the plot in Figure 3.7. It is quite unfortunate that the Matlab code for the book (pmtk3) also uses inconsistent variable names, and Fig. 3.7 uses M = 10, which is not even mentioned in the text! For the plug-in approximation you need the MAP estimate given in Eq. (3.21) plugged into a binomial distribution.

I hope the code below maps the variable names correctly to the ones defined in the text and clarifies how the plots are produced.
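For reference, these are the two quantities the code computes, written in terms of the variables used below (a and b are the prior shape parameters, N1 the number of heads, N0 the number of tails, N = N0 + N1, and B the Beta function; this is my reading of Eqs. (3.34) and (3.21)):

Posterior predictive: p(k | D) = C(M, k) * B(k + a + N1, M - k + b + N0) / B(a + N1, b + N0), for k = 0, ..., M

Plug-in approximation: thetaHat = (a + N1 - 1) / (a + b + N - 2), and p(k | D) ≈ C(M, k) * thetaHat^k * (1 - thetaHat)^(M - k)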

##########################################
# beta binomial model
##########################################

library(ggplot2)

# prior is beta(a, b) with a = 2, b = 2
a <- 2
b <- 2

# likelihood; 17 tails and 3 heads
N0 <- 17
N1 <- 3
N <- N0 + N1

# predict number of heads in M = 10 more tosses
M <- 10

# posterior predictive distribution (ppd)
ppd <- rep(NA, M + 1) # one entry for each k in 0..M
for (k in 0:M) {
  # Eq (3.34) - (a, b) replaced by (a + N1, b + N0)
  ppd[k + 1] <- choose(M, k) * 
    beta(a + k + N1, M - k + N0 + b) / 
    beta(a + N1, b + N0)
}
# store results in a data frame for plotting with ggplot
# store the title for faceting
pltDf <- data.frame(x = 0:M, d = ppd, 
                    title = rep("Posterior Pred", M + 1))

# plug-in approx.
# first get MAP estimate = mode of posterior. Eq (3.21)
thetaHat <- (a + N1 - 1) / (a + b + N - 2)
pluginApprox <- rep(NA, M + 1)
for (k in 0:M) {
  pluginApprox[k + 1] <- choose(M, k) * 
    thetaHat ^ k * 
    (1 - thetaHat) ^ (M - k)
}
pluginDf <- data.frame(x = 0:M, d = pluginApprox, 
                       title = rep("Plug-in Approx", M + 1))

pltDf <- rbind(pltDf, pluginDf)

ggplot(pltDf, aes(x, d)) + 
  geom_bar(stat = "identity", fill = "navyblue", 
           alpha = 0.8) + 
  scale_x_continuous(breaks = pltDf$x) + 
  facet_wrap(~title)

That produces the following plot:

[Figure: fig3-7, posterior predictive vs. plug-in approximation]
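As a quick sanity check (my own addition, not part of the book's figure), both vectors are probability mass functions over 0, ..., M, so each should sum to one:

sum(ppd)          # 1, up to floating point error
sum(pluginApprox) # 1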

Grid Search, Cross Validation and XGBoost in R

This post aims to take the chapter on tree-based methods in the highly regarded Introduction to Statistical Learning text a step further with grid search, cross-validation and XGBoost.

I assume that you have read the chapter on Tree-Based Methods (chapter 8 in the copy I have) and executed the code in the lab section. Here we look at regression trees only; it should be straightforward to apply the same ideas to a classification problem by changing the objective function.

The last statement of the Boosting sub-section (of the lab) states that

using lambda of 0.2 leads to a slightly lower test MSE than lambda of 0.001.

It is a pity that the authors do not teach how to select the optimal lambda via grid search and cross-validation. Here is an attempt to do that.

Grid Search with Cross Validation

Let us first split the Boston dataset (from the MASS package) into 80% training and 20% test sets (unlike the 50-50 split the authors use).

library(MASS) # the Boston dataset lives here
set.seed(1234)
trainInd <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
dfTrain <- Boston[ trainInd,]
dfTest  <- Boston[-trainInd,]
dim(dfTrain) # 404 14
dim(dfTest) # 102 14

For selecting the best parameters for gbm, we will need the caret package. Install it using this R command:

install.packages("caret")
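One optional aside that is not in the original post: the trainControl calls below set allowParallel = TRUE, but caret only runs the cross-validation folds in parallel if a backend is registered, for example via the doParallel package:

install.packages("doParallel")
library(doParallel)
# register a parallel backend; caret picks it up through foreach
registerDoParallel(cores = max(1, parallel::detectCores() - 1))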

Here is the code to do a grid search over all the hyper-parameters of gbm:

library(caret)

# set a grid of parameters to optimize upon;
# all these are gbm params and must be specified
gbmGrid <- expand.grid(
  n.trees = c(250, 500, 1000),
  shrinkage = c(0.001, 0.01, 0.1), # lambda
  interaction.depth = c(1, 2, 4), 
  n.minobsinnode = c(10) # left at default
)

# 5-fold CV; set method to "repeatedcv" 
# and set repeats for repeated CV
gbmTrControl <- trainControl(
  method = "cv",
  number = 5,
  #repeats = 4,
  verboseIter = FALSE,
  allowParallel = TRUE
)

# do the training
gbmTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) 
                              %in% c("medv")]), 
  y = dfTrain$medv,
  trControl = gbmTrControl,
  tuneGrid = gbmGrid,
  method = "gbm"
)

# get the top 5 models
head(gbmTrain$results[with(gbmTrain$results, 
                           order(RMSE)), ], 5)
# MSE for best model = 3.42 ^ 2 = 11.7

# get the best model's parameters
gbmTrain$bestTune
#   n.trees interaction.depth shrinkage n.minobsinnode
#       500                 4       0.1             10

yhatGbm <- predict(gbmTrain, newdata = dfTest)
mean((yhatGbm - dfTest$medv) ^ 2) # 7.28

# plot predicted vs. actual values in red, with the y = x line in blue
plot(dfTest$medv, yhatGbm, col = "red")
abline(0, 1, col = "blue")

The plot looks like this:

[Figure: gbm, predicted vs. actual medv on the test set]

In this case the test MSE (7.28) is actually much lower than the cross-validated MSE of the best model (11.7).

Feel free to expand the grid search to larger ranges and select better parameters.
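To see which direction is worth expanding, caret's standard plot method for train objects shows the resampling profile of the fitted grid (a small aside, not in the original post):

# cross-validated RMSE across the tuning grid
plot(gbmTrain)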

XGBoost

Next, we see how to do the same optimal parameter selection for XGBoost.

Install the XGBoost package using

install.packages("xgboost")

The following code then searches and gives you the best hyper-parameters for XGBoost.

library(xgboost)

xgbGrid <- expand.grid(
  nrounds = c(250, 500, 1000),
  max_depth = c(1, 2, 4),
  eta = c(0.001, 0.003, 0.01),
  gamma = c(0, 1, 2),
  colsample_bytree = c(1, 0.5, 0.25),
  min_child_weight = c(1, 2),
  subsample = c(1) # newer caret versions require this column; left at default
)

xgbTrControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  verboseIter = FALSE,
  returnData = FALSE,
  allowParallel = TRUE
)

xgbTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) %in% c("medv")]), 
  y = dfTrain$medv,
  objective = "reg:linear", # called "reg:squarederror" in newer xgboost releases
  trControl = xgbTrControl,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)

# get the top model and its results
head(xgbTrain$results[with(xgbTrain$results, 
                             order(RMSE)), ], 1)
# MSE 3.12 ^ 2 = 9.74
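# (aside, not in the original post) the winning hyper-parameter
# combination can be inspected just like for the gbm model
xgbTrain$bestTune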

yhatXgb <- predict(xgbTrain, newdata = dfTest)
mean((yhatXgb - dfTest$medv) ^ 2) # 10.49

plot(dfTest$medv, yhatXgb, col = "red")
abline(0, 1, col = "blue")

# Variable Importance
featureNames <- names(dfTrain)[! names(dfTrain) %in% c("medv")]
importanceMatrix <- xgb.importance(featureNames, 
                                   model = xgbTrain$finalModel)
xgb.plot.importance(importanceMatrix[1:10,])

The variable importance plot looks like this:

[Figure: varImp, top 10 features by XGBoost importance]

Again, we can see that lstat and rm are the two main features.

The test MSEs of the two models are lower than the ones presented in the text. This could be because our training set is larger and also because of the hyper-parameter tuning we did. For a fair comparison, you should re-run the code in the text with an 80% training split like the one used here.
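As a rough sketch of that comparison (using the boosting settings from the text's lab as I recall them, 5000 trees, interaction depth 4 and shrinkage 0.2; treat these values as an assumption and check them against your copy):

library(gbm)

# refit boosting the way the lab does, but on our 80/20 split
boostFit <- gbm(medv ~ ., data = dfTrain, distribution = "gaussian",
                n.trees = 5000, interaction.depth = 4, shrinkage = 0.2)
yhatBoost <- predict(boostFit, newdata = dfTest, n.trees = 5000)
mean((yhatBoost - dfTest$medv) ^ 2) # compare with the tuned models above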

PS: I could not figure out easily how to plot the variable importance for the GBM model.
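One route I have since found, though I have not verified it end to end: caret exposes a generic varImp() that works directly on the train object, which for gbm is based on the relative influence measure.

# variable importance for the tuned gbm model via caret's generic
gbmImp <- varImp(gbmTrain)
print(gbmImp)
plot(gbmImp, top = 10) # ten most important features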