Category Archives: Technology

Enhanced feature importance plot in scikit-learn with R’s ggplot2

A typical plot I want to make after running an ensemble model like random forest or gradient boosting is the feature importance plot. Scikit-learn has support for plotting feature importances as described here, but I found the lack of variable names in that plot quite disappointing! Let’s enhance it!

I also find matplotlib quite tedious compared to R’s ggplot. While there is a Python package for ggplot, it is quite limited. Thanks to rpy2, we can run R code directly from inside IPython Notebook.

This post just shows the two steps involved in doing this:

  1. load the feature importances returned by an ensemble model into a Pandas data frame
  2. pass the data frame to R for plotting with ggplot2

Loading feature importances into a Pandas data frame

I am going to assume that you already have a fitted model whose feature importances you can access as a numpy ndarray. Here is the function to build the data frame:

import pandas as pd

def get_feature_importance_df(feature_importances, 
                              column_names, 
                              top_n=25):
    """Get feature importance data frame.

    Parameters
    ----------
    feature_importances : numpy ndarray
        Feature importances computed by an ensemble 
            model like random forest or boosting
    column_names : array-like
        Names of the columns in the same order as feature 
            importances
    top_n : integer
        Number of top features

    Returns
    -------
    df : a Pandas data frame

    """
    
    imp_dict = dict(zip(column_names, 
                        feature_importances))
    top_features = sorted(imp_dict, 
                          key=imp_dict.get, 
                          reverse=True)[0:top_n]
    top_importances = [imp_dict[feature] for feature 
                          in top_features]
    df = pd.DataFrame(data={'feature': top_features, 
                            'importance': top_importances})
    return df

Then call the function and store the results in a variable like this:

imp_df = get_feature_importance_df(feature_importances, 
                                   df.columns)

Pass on the data frame to R

For setting up rpy2, I would refer you to Ritesh Agrawal’s post. Once you have followed the steps there, you can use the following code in an IPython notebook:

%load_ext rpy2.ipython

Then, in a new cell (the %%R cell magic must be the first line of its cell), run:

%%R -i imp_df -w 1000 -h 650 -u px
library(ggplot2)

# to maintain the descending order
imp_df$feature <- factor(
  imp_df$feature, 
  levels = imp_df$feature[order(imp_df$importance, 
                                decreasing = T)])

ggplot(imp_df, aes(feature, importance)) + 
  geom_bar(stat="identity", fill = "blue") + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

The result

That produces a beautiful feature importance plot with the names of the variables on the X axis. (Sorry, I have truncated the names since this is from my office project.)

Feature importance plot
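If the (full) variable names are long, a horizontal version of the same plot can be easier to read; one simple variant is to flip the coordinates:

# same data, drawn as horizontal bars so long names stay readable
ggplot(imp_df, aes(feature, importance)) + 
  geom_bar(stat = "identity", fill = "blue") + 
  coord_flip()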

R bookdown on mac El Capitan

I was excited to see bookdown from Yihui Xie of RStudio, but when I tried to get started, I ran into a problem.

I am using OS X El Capitan 10.11.6, and when I tried to “build the book”, I got the following error:

xelatex: command not found.

I realized I had never installed LaTeX on my Mac, so I tried to download MacTeX 2016 from the main site, but that link did not download the package for me! It just displayed a text response.

Looking at the MacTeX FAQ, I found the mirror sites and could download the package from one of the mirrors close to me. The package was 2.7 GB. With that, I installed LaTeX.

There is one more step required though: setting the PATH variable. I edited my ~/.bashrc and added /Library/TeX/texbin to the PATH. Then I restarted RStudio, and I was able to build the book 🙂
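If you want to double-check the setup before building, Sys.which() should show where R now finds xelatex, and (assuming the default index.Rmd entry file) the book can also be built straight from the R console:

# should return the path to xelatex, e.g. under /Library/TeX/texbin
Sys.which("xelatex")

# build the PDF book from the console instead of the Build pane
bookdown::render_book("index.Rmd", "bookdown::pdf_book")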

Beta Binomial Model in R

I have been reading Kevin Murphy’s book Machine Learning: A Probabilistic Perspective to study machine learning in more depth. I love the book and it has been an exciting read so far, but I wish it could have been better.

When I got to the first topic that was new to me, Sec. 3.3, The Beta-Binomial Model, I noticed that he uses inconsistent notation in the formulae, especially for the shape parameters of the Beta distribution. I am not going to get into the details here, but if you read that section, you will see the sloppiness.

Here, I thought I could share the R code for producing the plot in Figure 3.7. It is quite unfortunate that the Matlab code for the book (pmtk3) also uses inconsistent variable names. Fig. 3.7 is using M = 10, which is not even mentioned in the text! For the plug-in approximation you need to use the MAP estimate given in Eq. (3.21) and a binomial distribution.

I hope the code below helps map the variable names correctly to the ones defined in the text and clarifies how the plots are produced.

##########################################
# beta binomial model
##########################################

library(ggplot2)

# prior is beta(a, b) with a = 2, b = 2
a <- 2
b <- 2

# likelihood; 17 tails and 3 heads
N0 <- 17
N1 <- 3
N <- N0 + N1

# predict number of heads in M = 10 more tosses
M <- 10

# posterior predictive distribution (ppd) over k = 0, ..., M heads
ppd <- rep(NA, M + 1)
for (k in 0:M) {
  # Eq (3.34) - (a, b) replaced by (a + N1, b + N0)
  ppd[k + 1] <- choose(M, k) * 
    beta(a + k + N1, M - k + N0 + b) / 
    beta(a + N1, b + N0)
}
# store results in a data frame for plotting with ggplot
# store the title for faceting
pltDf <- data.frame(x = 0:M, d = ppd, 
                    title = rep("Posterior Pred", M + 1))

# plug-in approx.
# first get MAP estimate = mode of posterior. Eq (3.21)
thetaHat <- (a + N1 - 1) / (a + b + N - 2)
pluginApprox <- rep(NA, M + 1)
for (k in 0:M) {
  pluginApprox[k + 1] <- choose(M, k) * 
    thetaHat ^ k * 
    (1 - thetaHat) ^ (M - k)
}
pluginDf <- data.frame(x = 0:M, d = pluginApprox, 
                       title = rep("Plug-in Approx", M + 1))

pltDf <- rbind(pltDf, pluginDf)

ggplot(pltDf, aes(x, d)) + 
  geom_bar(stat = "identity", fill = "navyblue", 
           alpha = 0.8) + 
  scale_x_continuous(breaks = pltDf$x) + 
  facet_wrap(~title)

That produces the following plot:

fig3-7
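As a quick sanity check on the code above, the posterior predictive probabilities should sum to one, and the plug-in column is just a binomial pmf, so dbinom() should reproduce it:

# posterior predictive probabilities should sum to 1
sum(ppd)

# the plug-in approximation is simply Binomial(M, thetaHat)
all.equal(pluginApprox, dbinom(0:M, size = M, prob = thetaHat))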

Using Variables in PostgreSQL PgAdmin III Client

There was a requirement in my office to be able to use variables for table names and field values in PostgreSQL queries.

If you have used Oracle DB, then you know you can define variables using define and later use those variables in your queries. Here are two variable declarations, one for a table name and another for a field value.

define mytbl = 'sometable';
define maxid = 2500;

Once those variables are defined, we can use them in our queries like this:

select count(*) from &mytbl where id <= &maxid;

This is possible in Postgres using the pgAdmin III client; the feature to use is pgScript.

Here is the code to do this. First, you would enter something like this in the SQL Editor:

declare @mytbl, @maxid; 
set @mytbl = 'sometable'; 
set @maxid = 2500; 
set @res = select count(*) from @mytbl where id <= @maxid;
print @res;

Instead of executing this script as usual, click the "Execute pgScript" icon next to the "Execute query" icon (it shows a P with a GS below it). Once it runs, it shows both the query (with the table name and field value substituted in) and the query results.

So if you ever need to define variables at the top of a PostgreSQL script and use them in more complicated SQL further down, this is one way to do it.

Connect to Oracle Database from R on mac

This is a follow-up to the post Connect to Oracle from R by Ankit Agarwal. I tried to add a comment on his blog, but the formatting got messed up, so I am writing it up here, mostly for my own reference.

This just adds how to connect to an Oracle DB from R on a Mac. I am on El Capitan 10.11.3.

We will use RJDBC here. The first part is to figure out how to find and set JAVA_HOME.

Run this command from the command line:

ls -l $(which java)

If this doesn’t output anything, then you need to install Java. Go to the Java SDK download page, download the Mac .dmg file for the latest Java SDK, and install Java by double-clicking the .dmg file.

In my case, this outputs:

lrwxr-xr-x  1 root  wheel  74 Jan 13  2016 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java

Please scroll right to see the symbolic link. Note down the part up to JavaVM.framework, i.e. /System/Library/Frameworks/JavaVM.framework in my case; this is what we will set as JAVA_HOME in the R script. (You can also set the JAVA_HOME environment variable in your .bashrc if you want.)

Download the ojdbc JAR as mentioned in the main post. I just downloaded ojdbc6.jar and stored it in my home directory.

That’s all you need to install.

Then try the following from R, which is just the code given in the main post with slight modifications:

# install RJDBC if not done already
install.packages("RJDBC")

# Set JAVA_HOME to what we found out before
jHome <- '/System/Library/Frameworks/JavaVM.framework'
Sys.setenv(JAVA_HOME=jHome)
options(java.parameters="-Xmx2g")
library(rJava)

# Output Java version
.jinit()
print(.jcall("java/lang/System", "S", "getProperty", 
             "java.version"))

# Load RJDBC library
library(RJDBC)

# Create connection driver and open connection
jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", 
                   classPath="~/ojdbc6.jar")

# db connection details; please fill in yours
dbHost  <- "xx"
dbPort  <- "xx"
srvName <- "xx"
dbUser  <- "xx"
dbPass  <- "xx"

conStr <- paste("jdbc:oracle:thin:@//", dbHost, ":", dbPort, 
                "/", srvName, sep = "")
jdbcConnection <- dbConnect(jdbcDriver, conStr, 
                            dbUser, dbPass)

# Query to get count of a table; replace YOUR_TABLE
dbGetQuery(jdbcConnection, "select count(*) 
           from <YOUR_TABLE>")
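# results come back as an ordinary data frame, so you can pull a
# small sample and inspect it (again, replace YOUR_TABLE)
sampleDf <- dbGetQuery(jdbcConnection, 
                       "select * from <YOUR_TABLE> where rownum <= 10")
str(sampleDf)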

# Close connection
dbDisconnect(jdbcConnection)

An Alternative to R pairs with reshape2 and ggplot2

Base plot function: “pairs”

You may be aware of the pairs plotting function in R. Here is an example with the Boston data set (from the MASS package). We will just use a few variables, since there isn’t much room here to show the plots for all of them.

library(MASS) # the Boston data set lives in MASS
miniBoston <- Boston[, c("crim", "lstat", "rm", "medv")]

Let us inspect this new data frame:

> head(miniBoston)
     crim lstat    rm medv
1 0.00632  4.98 6.575 24.0
2 0.02731  9.14 6.421 21.6
3 0.02729  4.03 7.185 34.7
4 0.03237  2.94 6.998 33.4
5 0.06905  5.33 7.147 36.2
6 0.02985  5.21 6.430 28.7

Let’s plot this with the pairs function now:

pairs(miniBoston)

This produces the following plot:
MiniBostonPairs

This is great for seeing associations between all the variables. However, when you have many variables and a lot of data, it takes a long time to generate and wastes a lot of space (each plot appears twice), especially if all you are interested in is how one specific variable relates to all the others.

Sensible Alternative with reshape2 and ggplot2

If you do not have reshape2 and ggplot2 installed already, install them first:

install.packages("reshape2")
install.packages("ggplot2")

Let us say we are interested in how the variable medv is related to the other variables. We proceed with:

library(reshape2)
meltBoston <- melt(miniBoston, "medv")

If you inspect meltBoston, you see

> str(meltBoston)
'data.frame':	1518 obs. of  3 variables:
 $ medv    : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ variable: Factor w/ 3 levels "crim","lstat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...

> head(meltBoston)
  medv variable   value
1 24.0     crim 0.00632
2 21.6     crim 0.02731
3 34.7     crim 0.02729
4 33.4     crim 0.03237
5 36.2     crim 0.06905
6 28.7     crim 0.02985

Now we can plot like this:

library(ggplot2)
ggplot(data = meltBoston, aes(x = value, y = medv)) + 
  geom_point(size = 0.3, pch = 1) + 
  facet_wrap(~ variable, 
             ncol = 3, 
             scales = "free_x")

You get a plot like this which uses the space more effectively:

BostonGgplot
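If you also want a sense of the trend in each panel, a smoother can be layered on top of the same plot:

# faceted scatter plot with a loess smoother in each panel
ggplot(data = meltBoston, aes(x = value, y = medv)) + 
  geom_point(size = 0.3, pch = 1) + 
  geom_smooth(method = "loess") + 
  facet_wrap(~ variable, ncol = 3, scales = "free_x")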

Grid Search, Cross Validation and XGBoost in R

This post aims to take the chapter on tree-based methods in the highly regarded Introduction to Statistical Learning text to the next level with grid search, cross-validation and XGBoost.

I assume that you have read the chapter on Tree-Based Methods (chapter 8 in the copy I got) and executed the code in the lab section. Here we look at regression trees only; it should be straightforward to apply the ideas to a classification problem by changing the objective function.

The last statement of the Boosting sub-section (of the lab) states that

using lambda of 0.2 leads to a slightly lower test MSE than lambda of 0.001.

It is a pity that the authors do not teach how to select the optimal lambda via grid search and cross-validation. Here is an attempt to do that.

Grid Search with Cross Validation

Let us first split the Boston dataset into 80% training and 20% test (unlike the 50-50 split the authors do).

library(MASS) # for the Boston data set
set.seed(1234)
trainInd <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
dfTrain <- Boston[ trainInd,]
dfTest  <- Boston[-trainInd,]
dim(dfTrain) # 404 14
dim(dfTest) # 102 14

For selecting the best parameters for gbm, we will need the caret package (plus the gbm package itself, which caret calls under the hood). Install them using this R command:

install.packages(c("caret", "gbm"))

Here is the code to do a grid search over all the hyper-parameters of gbm:

require(caret)

# set a grid of parameters to optimize upon;
# all these are gbm params and must be specified
gbmGrid <- expand.grid(
  n.trees = c(250, 500, 1000),
  shrinkage = c(0.001, 0.01, 0.1), # lambda
  interaction.depth = c(1, 2, 4), 
  n.minobsinnode = c(10) # left at default
)

# 5-fold CV; set method to "repeatedcv" 
# and set repeats for repeated CV
gbmTrControl <- trainControl(
  method = "cv",
  number = 5,
  #repeats = 4,
  verboseIter = FALSE,
  allowParallel = TRUE
)

# do the training
gbmTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) 
                              %in% c("medv")]), 
  y = dfTrain$medv,
  trControl = gbmTrControl,
  tuneGrid = gbmGrid,
  method = "gbm"
)

# get the top 5 models
head(gbmTrain$results[with(gbmTrain$results, 
                           order(RMSE)), ], 5)
# MSE for best model = 3.42 ^ 2 = 11.7

# get the best model's parameters
gbmTrain$bestTune
#   n.trees interaction.depth shrinkage n.minobsinnode
#       500                 4       0.1             10

yhatGbm <- predict(gbmTrain, newdata = dfTest)
mean((yhatGbm - dfTest$medv) ^ 2) # 7.28

# plot predictions against actuals (points in red), with the y = x line in blue
plot(dfTest$medv, yhatGbm, col = "red")
abline(0, 1, col = "blue")

The plot looks like this:

gbm

In this case, the test MSE is actually much lower than the cross-validated MSE estimate.

Feel free to expand the grid search to larger ranges and select better parameters.
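If you do expand the grid, the search can take a while. We set allowParallel = TRUE in trainControl, but caret only runs the cross-validation in parallel if a parallel backend has been registered first, e.g. with the doParallel package; something like this should work:

# register a parallel backend so allowParallel = TRUE takes effect
library(doParallel)
cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# ... run train() as above ...

stopCluster(cl)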

XGBoost

Next, we see how to do the same optimal parameter selection for XGBoost.

Install the XGBoost package using

install.packages("xgboost")

The following code then searches and gives you the best hyper-parameters for XGBoost.

library(xgboost)

xgbGrid <- expand.grid(
  nrounds = c(250, 500, 1000),
  max_depth = c(1, 2, 4),
  eta = c(0.001, 0.003, 0.01),
  gamma = c(0, 1, 2),
  colsample_bytree = c(1, 0.5, 0.25),
  min_child_weight = c(1, 2)
)

xgbTrControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  verboseIter = FALSE,
  returnData = FALSE,
  allowParallel = TRUE
)

xgbTrain <- train(
  x = as.matrix(dfTrain[, ! names(dfTrain) %in% c("medv")]), 
  y = dfTrain$medv,
  objective = "reg:linear",
  trControl = xgbTrControl,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)

# get the top model and its results
head(xgbTrain$results[with(xgbTrain$results, 
                             order(RMSE)), ], 1)
# MSE 3.12 ^ 2 = 9.74

yhatXgb <- predict(xgbTrain, newdata = dfTest)
mean((yhatXgb - dfTest$medv) ^ 2) # 10.49

plot(dfTest$medv, yhatXgb, col = "red")
abline(0, 1, col = "blue")

# Variable importance (featNames avoids shadowing base::names)
featNames <- names(dfTrain)[! names(dfTrain) %in% c("medv")]
importanceMatrix <- xgb.importance(featNames, 
                                   model = xgbTrain$finalModel)
xgb.plot.importance(importanceMatrix[1:10,])

The variable importance plot looks like this:

varImp

Again, we can see that lstat and rm are the two main features.

The test MSEs for the two models are lower than the ones presented in the text. This could be because our training set is larger, and also because of the tuning we did for the models. For a fair comparison, you should rerun the code in the text with an 80% training split like we used here.

PS: I could not easily figure out how to plot the variable importance for the GBM model.
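Two things that should work, though: caret’s varImp() can be called directly on the train object, and gbm’s own summary() plots the relative influence for the underlying model stored in gbmTrain$finalModel:

# caret's generic variable importance for the tuned model
plot(varImp(gbmTrain), top = 10)

# gbm's relative influence plot on the underlying gbm object
summary(gbmTrain$finalModel, cBars = 10)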

Installing numpy on windows

Installing numpy on a Windows machine is not straightforward. For me, it required combining a few different answers from Stack Overflow. Here are the steps:

1. Open the unofficial Windows numpy page in a new tab.

2. Download the correct .whl file for your Python version and whether you have 32- or 64-bit Python. For example, if you are using 64-bit Python 2.7.10, download the file named numpy-1.9.3+vanilla-cp27-none-win_amd64.whl. (cp27 stands for Python 2.7 and amd64 means 64-bit.) Note that it doesn’t matter whether your OS is 32- or 64-bit; all that matters is your Python build. (To figure out which one your Python is, just open the Python interpreter in a terminal; it prints that info when it starts.)

3. Open a Windows command prompt as administrator and cd to the directory where you downloaded the .whl file.

4. Then install it with pip. For the above file, the command is:
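pip install numpy-1.9.3+vanilla-cp27-none-win_amd64.whl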

If this throws a “Fatal error in launcher”, then install it with:
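python -m pip install numpy-1.9.3+vanilla-cp27-none-win_amd64.whl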

HeidiSQL: The Best MySQL GUI Client?

We use MySQL at work. I am disappointed when I see even experienced developers using the MySQL command-line client or even MySQL Workbench to run their queries. There is a nice GUI client, HeidiSQL, which is completely free and provides auto-complete features. The moment you alias a table name in HeidiSQL, it helps you write your queries much faster. Just like the Eclipse IDE, HeidiSQL uses Ctrl+Space for auto-suggest: it can suggest column names, and even table and database names. It saves me several minutes every day when writing SQL queries.

Here is an example, where I pressed Ctrl+Space after the table alias lo for the table list_options:

HeidiSQL Auto Suggest

If you are also a purist like me when writing SQL and want to upper-case all the keywords in your query to make them look pleasing, just highlight your query and press Ctrl+F8. (I have always preferred lower-case for database, table and column names and upper-case for SQL keywords.)

There is another GUI client, SQLyog, which automatically upper-cases SQL keywords as you type, but if you want its auto-complete feature, you need to pay for it. For me, upper-casing is not as important as auto-suggest.

Of the few MySQL clients I have tried, HeidiSQL is the one that has worked the best for me!