Enhanced feature importance plot in scikit-learn with R’s ggplot2

A typical plot I want to make after running an ensemble model like random forest or gradient boosting is the feature importance plot. Scikit-learn has support for plotting feature importances as described here, but I found the lack of variable names in that plot quite disappointing! Let’s enhance it!

I also find matplotlib quite tedious compared to R’s ggplot. While there is a Python package for ggplot, it is quite limited. Thanks to rpy2, we can run R code directly from inside IPython Notebook.

This post just shows the two steps involved in doing this:

  1. load the feature importances returned by an ensemble model into a Pandas data frame
  2. pass the data frame to R for plotting with ggplot2

Loading feature importances into a Pandas data frame

I am going to assume that you have a fitted model that returns its feature importances as a NumPy ndarray. Here is the function to get the data frame:

import pandas as pd

def get_feature_importance_df(feature_importances, 
                              column_names, 
                              top_n=25):
    """Get feature importance data frame.

    Parameters
    ----------
    feature_importances : numpy ndarray
        Feature importances computed by an ensemble 
            model like random forest or boosting
    column_names : array-like
        Names of the columns in the same order as feature 
            importances
    top_n : integer
        Number of top features

    Returns
    -------
    df : a Pandas data frame

    """
    
    imp_dict = dict(zip(column_names, 
                        feature_importances))
    top_features = sorted(imp_dict, 
                          key=imp_dict.get, 
                          reverse=True)[0:top_n]
    top_importances = [imp_dict[feature] for feature 
                          in top_features]
    df = pd.DataFrame(data={'feature': top_features, 
                            'importance': top_importances})
    return df
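
In case you are wondering where the feature_importances array comes from: any fitted scikit-learn ensemble model exposes it as the feature_importances_ attribute. Here is a minimal sketch, assuming a random forest classifier trained on a features data frame df and a target y (these names are placeholders, purely for illustration):

from sklearn.ensemble import RandomForestClassifier

# fit an ensemble model on the training features and target
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(df, y)

# the importances come back as a numpy ndarray, one value per column
feature_importances = model.feature_importances_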

Then call the function and store the results in a variable like this:

imp_df = get_feature_importance_df(feature_importances, 
                                   df.columns)
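
As a quick sanity check with made-up numbers (purely illustrative), the function keeps only the top_n features with the largest importances:

import numpy as np

toy_importances = np.array([0.10, 0.50, 0.40])
toy_columns = ['age', 'income', 'tenure']

# keeps 'income' (0.50) and 'tenure' (0.40); 'age' is dropped
print(get_feature_importance_df(toy_importances, toy_columns, top_n=2))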

Passing the data frame to R

For this, I would refer you to Ritesh Agrawal’s post again. Once you have followed the steps in that post, you can use the rpy2 magics in IPython Notebook. First load the extension in a cell of its own:

%load_ext rpy2.ipython

Then, in a new cell, use the %%R cell magic (it has to be the first line of the cell). The -i imp_df flag passes the Pandas data frame into R, while -w, -h and -u set the figure size in pixels:

%%R -i imp_df -w 1000 -h 650 -u px
library(ggplot2)

# to maintain the descending order
imp_df$feature <- factor(
  imp_df$feature, 
  levels = imp_df$feature[order(imp_df$importance, 
                                decreasing = T)])

ggplot(imp_df, aes(feature, importance)) + 
  geom_bar(stat="identity", fill = "blue") + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

The result

That produces a beautiful feature importance plot with the names of the variables on the X axis. (Sorry, I have truncated the names since this is from my office project.)

Feature importance plot