Tag Archives: R

Enhanced feature importance plot in scikit-learn with R’s ggplot2

A typical plot I want to make after running an ensemble model like random forest or gradient boosting is the feature importance plot. Scikit-learn has support for plotting feature importances as described here, but I found the lack of variable names in that plot quite disappointing! Let’s enhance it!

I also find matplotlib quite tedious compared to R’s ggplot. While there is a Python package for ggplot, it is quite limited. Thanks to rpy2, we can run R code directly from inside IPython Notebook.

This post just shows the two steps involved in doing this:

  1. load the feature importances returned by an ensemble model into a Pandas data frame
  2. pass the data frame to R for plotting with ggplot2

Loading feature importances into a Pandas data frame

I am going to assume that you have a fitted model whose feature importances are available as a NumPy ndarray (for example, the feature_importances_ attribute of a scikit-learn ensemble model). Here is the function to get the data frame:

import pandas as pd

def get_feature_importance_df(feature_importances, 
                              column_names, 
                              top_n=25):
    """Get feature importance data frame.

    Parameters
    ----------
    feature_importances : numpy ndarray
        Feature importances computed by an ensemble 
            model like random forest or boosting
    column_names : array-like
        Names of the columns in the same order as feature 
            importances
    top_n : integer
        Number of top features

    Returns
    -------
    df : a Pandas data frame

    """
    
    imp_dict = dict(zip(column_names, 
                        feature_importances))
    top_features = sorted(imp_dict, 
                          key=imp_dict.get, 
                          reverse=True)[0:top_n]
    top_importances = [imp_dict[feature] for feature 
                          in top_features]
    df = pd.DataFrame(data={'feature': top_features, 
                            'importance': top_importances})
    return df

Then call the function, passing the feature importances and the column names of the training data frame (df here), and store the result:

imp_df = get_feature_importance_df(feature_importances, 
                                   df.columns)

Pass on the data frame to R

For this, I would refer you to Ritesh Agrawal’s post again. Once you have tried the steps in that post, load the rpy2 extension in the IPython notebook (run this once, in its own cell):

%load_ext rpy2.ipython

Then, in a new cell, the %%R cell magic imports imp_df into R and sets the plot size:

%%R -i imp_df -w 1000 -h 650 -u px
library(ggplot2)

# to maintain the descending order
imp_df$feature <- factor(
  imp_df$feature, 
  levels = imp_df$feature[order(imp_df$importance, 
                                decreasing = T)])

ggplot(imp_df, aes(feature, importance)) + 
  geom_bar(stat="identity", fill = "blue") + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

The result

That produces a beautiful feature importance plot with the names of the variables on the X axis. (Sorry, I have truncated the names since this is from my office project.)

Feature importance plot
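If the full feature names are long, a horizontal version of the plot can be easier to read; here is a small variation on the ggplot call above (everything else in the %%R cell stays the same):

# variation: horizontal bars handle long feature names better
ggplot(imp_df, aes(feature, importance)) + 
  geom_bar(stat = "identity", fill = "blue") + 
  coord_flip()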

Beta Binomial Model in R

I have been reading Kevin Murphy’s book Machine Learning: a Probabilistic Perspective to study machine learning in more depth. I love the book and it has been an exciting experience so far, but I wish it could have been better.

When I got to the first topic that was new to me, Sec. 3.3 The Beta-Binomial Model, I noticed that the notation in the formulae is inconsistent, especially for the shape parameters of the Beta distribution. I am not going to get into the details here, but if you read that section, you will see the sloppiness.

Here, I thought I could share the R code for producing the plot in Figure 3.7. It is quite unfortunate that the MATLAB code for the book (pmtk3) also uses inconsistent variable names. Fig. 3.7 uses M = 10 future tosses, which is not even mentioned in the text! For the plug-in approximation you need the MAP estimate given in Eq. (3.21) plugged into a binomial distribution.

I hope the code below will help map the variable names to the ones defined in the text correctly and will clarify how the plots are produced.
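For reference, these are the two formulae the code below implements, written in the code’s variable names (prior Beta(a, b), N1 heads, N0 tails, N = N0 + N1, and M future tosses); the first is the beta-binomial posterior predictive (Eq. 3.34 with the prior parameters replaced by the posterior ones), the second is the plug-in approximation built from the MAP estimate of Eq. (3.21):

p(k \mid D) = \binom{M}{k} \, \frac{B(a + N_1 + k,\; b + N_0 + M - k)}{B(a + N_1,\; b + N_0)}, \qquad k = 0, \dots, M

\hat{\theta}_{\mathrm{MAP}} = \frac{a + N_1 - 1}{a + b + N - 2}, \qquad
p_{\text{plug-in}}(k \mid D) = \binom{M}{k} \, \hat{\theta}_{\mathrm{MAP}}^{\,k} \, \left(1 - \hat{\theta}_{\mathrm{MAP}}\right)^{M - k}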

##########################################
# beta binomial model
##########################################

library(ggplot2)

# prior is beta(a, b) with a = 2, b = 2
a <- 2
b <- 2

# likelihood; 17 tails and 3 heads
N0 <- 17
N1 <- 3
N <- N0 + N1

# predict number of heads in M = 10 more tosses
M <- 10

# posterior predictive distribution (ppd) over k = 0, ..., M heads
ppd <- rep(NA, M + 1)
for (k in 0:M) {
  # Eq (3.34) - (a, b) replaced by (a + N1, b + N0)
  ppd[k + 1] <- choose(M, k) * 
    beta(a + k + N1, M - k + N0 + b) / 
    beta(a + N1, b + N0)
}
# store results in a data frame for plotting with ggplot
# store the title for faceting
pltDf <- data.frame(x = 0:M, d = ppd, 
                    title = rep("Posterior Pred", M + 1))

# plug-in approx.
# first get MAP estimate = mode of posterior. Eq (3.21)
thetaHat <- (a + N1 - 1) / (a + b + N - 2)
pluginApprox <- rep(NA, M + 1)
for (k in 0:M) {
  pluginApprox[k + 1] <- choose(M, k) * 
    thetaHat ^ k * 
    (1 - thetaHat) ^ (M - k)
}
pluginDf <- data.frame(x = 0:M, d = pluginApprox, 
                       title = rep("Plug-in Approx", M + 1))

pltDf <- rbind(pltDf, pluginDf)

ggplot(pltDf, aes(x, d)) + 
  geom_bar(stat = "identity", fill = "navyblue", 
           alpha = 0.8) + 
  scale_x_continuous(breaks = pltDf$x) + 
  facet_wrap(~title)

That produces the following plot:

fig3-7
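As a quick check on the code, both vectors are probability distributions over k = 0, ..., M and should sum to one:

# both should be (numerically) equal to 1
sum(ppd)
sum(pluginApprox)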

Connect to Oracle Database from R on mac

This is a follow-up to the post Connect to Oracle from R by Ankit Agarwal. I tried to add a comment to his blog, but the formatting got messed up, so I am writing it up here, mostly for my own reference.

Just adding how to connect to an Oracle database from R on a Mac. I am on El Capitan (10.11.3).

We will use RJDBC here. The first part is figuring out how to find and set JAVA_HOME.

Run this command from command line:

ls -l $(which java)

If this doesn’t output anything, you need to install Java: go to the Java SDK download page, download the Mac dmg file for the latest SDK, and install it by double-clicking the dmg.

In my case, this outputs:

lrwxr-xr-x  1 root  wheel  74 Jan 13  2016 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java

Please scroll right to see the symbolic link. Note down the part up to JavaVM.framework to use in the R script, i.e. /System/Library/Frameworks/JavaVM.framework in my case. This will be set as JAVA_HOME in our script. (You can also set the JAVA_HOME environment variable in your .bashrc if you prefer.)

Download the ojdbc JAR as mentioned in the main post. I downloaded ojdbc6.jar and stored it in my home directory.

That’s all you need to install.

Then try the following from R, which is just the code given in the main post with slight modifications:

# install RJDBC if not done already
install.packages("RJDBC")

# Set JAVA_HOME to what we found out before
jHome <- '/System/Library/Frameworks/JavaVM.framework'
Sys.setenv(JAVA_HOME=jHome)
options(java.parameters="-Xmx2g")
library(rJava)

# Output Java version
.jinit()
print(.jcall("java/lang/System", "S", "getProperty", 
             "java.version"))

# Load RJDBC library
library(RJDBC)

# Create connection driver and open connection
jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", 
                   classPath="~/ojdbc6.jar")

# db connection details; please fill in yours
dbHost  <- "xx"
dbPort  <- "xx"
srvName <- "xx"
dbUser  <- "xx"
dbPass  <- "xx"

conStr <- paste("jdbc:oracle:thin:@//", dbHost, ":", dbPort, 
                "/", srvName, sep = "")
jdbcConnection <- dbConnect(jdbcDriver, conStr, 
                            dbUser, dbPass)

# Query to get count of a table; replace YOUR_TABLE
dbGetQuery(jdbcConnection, "select count(*) 
           from <YOUR_TABLE>")

# Close connection
dbDisconnect(jdbcConnection)
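Before closing the connection, you can also sanity-check it by listing the tables visible to your user; RJDBC implements the standard DBI methods, so this should work:

# optional: list accessible tables to confirm the connection is alive
dbListTables(jdbcConnection)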

An Alternative to R pairs with reshape2 and ggplot2

Base plot function: “pairs”

You may be aware of the pairs plotting function in R. Here is an example with the Boston data set from the MASS package. We will just use a few variables since there isn’t much room here to show the plots for all of them.

library(MASS) # for the Boston data set
miniBoston <- Boston[, c("crim", "lstat", "rm", "medv")]

Let us inspect this new data frame:

> head(miniBoston)
     crim lstat    rm medv
1 0.00632  4.98 6.575 24.0
2 0.02731  9.14 6.421 21.6
3 0.02729  4.03 7.185 34.7
4 0.03237  2.94 6.998 33.4
5 0.06905  5.33 7.147 36.2
6 0.02985  5.21 6.430 28.7

Let’s plot this with the pairs function now:

pairs(miniBoston)

This produces the following plot:
MiniBostonPairs

This is great for seeing associations between all the variables. But it takes long to generate and wastes a lot of space (each panel appears twice) when you have many variables and a lot of data, especially if all you are interested in is how one specific variable relates to all the others.

Sensible Alternative with reshape2 and ggplot2

If you do not have reshape2 and ggplot2 installed already, install them first:

install.packages("reshape2")
install.packages("ggplot2")

Let us say we are interested in how the variable medv is related to the other variables. We proceed with:

library(reshape2)
meltBoston <- melt(miniBoston, "medv")

If you inspect meltBoston, you see:

> str(meltBoston)
'data.frame':	1518 obs. of  3 variables:
 $ medv    : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ variable: Factor w/ 3 levels "crim","lstat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

> head(meltBoston)
  medv variable   value
1 24.0     crim 0.00632
2 21.6     crim 0.02731
3 34.7     crim 0.02729
4 33.4     crim 0.03237
5 36.2     crim 0.06905
6 28.7     crim 0.02985

Now we can plot like this:

library(ggplot2)
ggplot(data = meltBoston, aes(x = value, y = medv)) + 
  geom_point(size = 0.3, pch = 1) + 
  facet_wrap(~ variable, 
             ncol = 3, 
             scales = "free_x")

You get a plot like this which uses the space more effectively:

BostonGgplot

Grid Search, Cross Validation and XGBoost in R

This post aims to take the chapter on tree-based methods of the highly regarded Introduction to Statistical Learning text to the next level with grid search, cross-validation and XGBoost.

I assume that you have read the chapter on Tree-Based Methods (chapter 8 in the copy I got) and executed the code in the lab section. Here we look at regression trees only. It should be straightforward to apply the ideas here to a classification problem by changing the objective function.

The last statement of the Boosting sub-section (of the lab) states that

using lambda of 0.2 leads to a slightly lower test MSE than lambda of 0.001.

It is a pity that the authors do not teach how to select the optimal lambda via grid search and cross-validation. Here is an attempt to do that.

Grid Search with Cross Validation

Let us first split the Boston dataset into 80% training and 20% test (unlike the 50-50 split the authors do).

library(MASS) # for the Boston data set
set.seed(1234)
trainInd <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
dfTrain <- Boston[ trainInd,]
dfTest  <- Boston[-trainInd,]
dim(dfTrain) # 404 14
dim(dfTest) # 102 14

For selecting the best parameters for gbm, we will need the caret package (and the gbm package itself, which caret calls under the hood). Install them with:

install.packages(c("caret", "gbm"))

Here is the code to do a grid search over all the hyper-parameters of gbm:

library(caret)

# set a grid of parameters to optimize upon;
# all these are gbm params and must be specified
gbmGrid <- expand.grid(
  n.trees = c(250, 500, 1000),
  shrinkage = c(0.001, 0.01, 0.1), # lambda
  interaction.depth = c(1, 2, 4), 
  n.minobsinnode = c(10) # left at default
)

# 5-fold CV; set method to "repeatedcv" 
# and set repeats for repeated CV
gbmTrControl <- trainControl(
  method = "cv",
  number = 5,
  #repeats = 4,
  verboseIter = FALSE,
  allowParallel = TRUE
)

# do the training
gbmTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) 
                              %in% c("medv")]), 
  y = dfTrain$medv,
  trControl = gbmTrControl,
  tuneGrid = gbmGrid,
  method = "gbm"
)

# get the top 5 models
head(gbmTrain$results[with(gbmTrain$results, 
                           order(RMSE)), ], 5)
# MSE for best model = 3.42 ^ 2 = 11.7

# get the best model's parameters
gbmTrain$bestTune
#   n.trees interaction.depth shrinkage n.minobsinnode
#       500                 4       0.1             10

yhatGbm <- predict(gbmTrain, newdata = dfTest)
mean((yhatGbm - dfTest$medv) ^ 2) # 7.28

# plot predicted (y) vs actual (x) in red, with the y = x line in blue
plot(dfTest$medv, yhatGbm, col = "red")
abline(0, 1, col = "blue")

The plot looks like this:

gbm

In this case, we actually get a lower test MSE (7.28) than the cross-validated MSE of the best model (11.7).

Feel free to expand the grid search to larger ranges and select better parameters.
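One more note: allowParallel = TRUE in trainControl only has an effect if a parallel backend has been registered. A minimal sketch using the doParallel package (assuming you have it installed):

# register a parallel backend so caret can run the CV folds in parallel
library(doParallel)
cl <- makeCluster(max(1, detectCores() - 1))
registerDoParallel(cl)

# ... run train() as above ...

# when done, go back to sequential processing
stopCluster(cl)
registerDoSEQ()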

XGBoost

Next, we see how to do the same optimal parameter selection for XGBoost.

Install the XGBoost package using

install.packages("xgboost")

The following code then searches and gives you the best hyper-parameters for XGBoost.

library(xgboost)

xgbGrid <- expand.grid(
  nrounds = c(250, 500, 1000),
  max_depth = c(1, 2, 4),
  eta = c(0.001, 0.003, 0.01),
  gamma = c(0, 1, 2),
  colsample_bytree = c(1, 0.5, 0.25),
  min_child_weight = c(1, 2)
)

xgbTrControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  verboseIter = FALSE,
  returnData = FALSE,
  allowParallel = TRUE
)

xgbTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) %in% c("medv")]), 
  y = dfTrain$medv,
  objective = "reg:linear", # "reg:squarederror" in newer xgboost versions
  trControl = xgbTrControl,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)

# get the top model and its results
head(xgbTrain$results[with(xgbTrain$results, 
                             order(RMSE)), ], 1)
# MSE 3.12 ^ 2 = 9.74

yhatXgb <- predict(xgbTrain, newdata = dfTest)
mean((yhatXgb - dfTest$medv) ^ 2) # 10.49

plot(dfTest$medv, yhatXgb, col = "red")
abline(0, 1, col = "blue")

# Variable importance; use a name that does not mask base::names
featNames <- names(dfTrain)[! names(dfTrain) %in% c("medv")]
importanceMatrix <- xgb.importance(featNames, 
                                   model = xgbTrain$finalModel)
xgb.plot.importance(importanceMatrix[1:10, ])

The variable importance plot looks like this:

varImp

Again, we can see that lstat and rm are the two main features.

The test MSEs for the two models are lower than the ones presented in the text. This could be because our training set is larger, and also because of the hyper-parameter tuning we did. For a fair comparison, you should rerun the code in the text with an 80% training split like we did here.

PS: I could not easily figure out how to plot the variable importance for the GBM model.
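One option that may get you most of the way there is caret’s generic varImp(), which knows about gbm fits; a minimal sketch I have not polished:

# caret's varImp() wraps gbm's relative influence measure
gbmImp <- varImp(gbmTrain)
plot(gbmImp, top = 10)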