Tag Archives: python

Enhanced feature importance plot in scikit-learn with R’s ggplot2

A typical plot I want to make after running an ensemble model like random forest or gradient boosting is the feature importance plot. Scikit-learn has support for plotting feature importances as described here, but I found the lack of variable names in that plot quite disappointing! Let’s enhance it!

I also find matplotlib quite tedious compared to R’s ggplot. While there is a Python package for ggplot, it is quite limited. Thanks to rpy2, we can run R code directly from inside IPython Notebook.

This post just shows the two steps involved in doing this:

  1. load the feature importances returned by an ensemble model into a Pandas data frame
  2. pass the data frame to R for plotting with ggplot2

Loading feature importances into a Pandas data frame

I am going to assume that you have a model that returns feature importances, which is just a numpy ndarray. Here is the function to get the data frame:

import pandas as pd

def get_feature_importance_df(feature_importances, 
                              column_names, 
                              top_n=25):
    """Get feature importance data frame.

    Parameters
    ----------
    feature_importances : numpy ndarray
        Feature importances computed by an ensemble 
            model like random forest or boosting
    column_names : array-like
        Names of the columns in the same order as feature 
            importances
    top_n : integer
        Number of top features

    Returns
    -------
    df : a Pandas data frame

    """
    
    imp_dict = dict(zip(column_names, 
                        feature_importances))
    top_features = sorted(imp_dict, 
                          key=imp_dict.get, 
                          reverse=True)[0:top_n]
    top_importances = [imp_dict[feature] for feature 
                          in top_features]
    df = pd.DataFrame(data={'feature': top_features, 
                            'importance': top_importances})
    return df

Then call the function and store the results in a variable like this:

imp_df = get_feature_importance_df(feature_importances, 
                                   df.columns)
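
To make the ranking step inside the function concrete, here is a minimal sketch on made-up numbers. The importances, the column names, and the helper name rank_features are all hypothetical and exist only for illustration; the sketch skips pandas and returns plain (name, importance) pairs.

```python
# Made-up importances and column names, purely for illustration.
feature_importances = [0.10, 0.45, 0.05, 0.40]
column_names = ["age", "income", "zip_code", "tenure"]


def rank_features(feature_importances, column_names, top_n=25):
    """Return (name, importance) pairs sorted by descending importance."""
    pairs = zip(column_names, feature_importances)
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


top = rank_features(feature_importances, column_names, top_n=3)
for name, importance in top:
    print("{:<10} {:.2f}".format(name, importance))
```

Sorting the zipped pairs directly, instead of going through a dictionary, also keeps every column even if two columns happen to share a name.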

Passing the data frame to R

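For this, I would refer you to Ritesh Agrawal’s post again. Once you have tried the steps in that post, you can use the following code in IPython Notebook. Load the extension in one cell and start the R code in a separate cell, because the %%R cell magic must be the first line of its cell:

%load_ext rpy2.ipython

%%R -i imp_df -w 1000 -h 650 -u px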
library(ggplot2)

# to maintain the descending order
imp_df$feature <- factor(
  imp_df$feature, 
  levels = imp_df$feature[order(imp_df$importance, 
                                decreasing = TRUE)])

ggplot(imp_df, aes(feature, importance)) + 
  geom_bar(stat="identity", fill = "blue") + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

The result

That produces a beautiful feature importance plot with the names of the variables on the X axis. (Sorry, I have truncated the names since this is from my office project.)

Feature importance plot

Installing numpy on Windows

Installing numpy on a Windows machine is not straightforward. For me, it required combining a few different answers on Stack Overflow. Here are the steps:

1. Open the unofficial Windows numpy page in a new tab.

2. Download the .whl file that matches your Python version and whether your Python is 32-bit or 64-bit. For example, if you are using 64-bit Python 2.7.10, download the file named numpy-1.9.3+vanilla-cp27-none-win_amd64.whl. (cp27 stands for CPython 2.7 and amd64 means 64-bit.) Note that it doesn’t matter whether your OS is 32-bit or 64-bit; all that matters is the Python build. (To find out which one your Python is, open the Python interpreter in a terminal; it prints that information when it starts.)

3. Open a Windows command prompt as administrator, then use the cd command to change to the directory where you downloaded the .whl file.

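4. Then install it with pip. For the above file example, the command is

pip install numpy-1.9.3+vanilla-cp27-none-win_amd64.whl

If this throws a Fatal error in launcher, then install it with

python -m pip install numpy-1.9.3+vanilla-cp27-none-win_amd64.whl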
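
For step 2, the Python version and bitness can also be checked from inside the interpreter. This is a small sketch, not part of the original steps; it assumes a CPython interpreter (hence the "cp" prefix), and the variable names bits and python_tag are just illustrative.

```python
import platform
import struct
import sys

# The size of a pointer in bytes, times 8, gives the bitness of the
# running interpreter (not of the operating system).
bits = struct.calcsize("P") * 8

# CPython wheel tags use "cp" plus the major and minor version, e.g. cp27.
python_tag = "cp{}{}".format(sys.version_info[0], sys.version_info[1])

print("Python", platform.python_version(), "-", bits, "bit, tag", python_tag)
```

The two values correspond to the cp27 and win_amd64 (or win32) parts of the .whl file name.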

Python vs. Perl vs. PHP vs. Bash

There are a lot of posts about this topic! So why one more? This post just shares my personal experience with the few scripting languages I have used over the past few years. Rather than join the hot debates about which is better, I consider each of the languages I use (Python, Perl, PHP, and Bash or shell) as simply one blade in my Swiss Army knife. When a specific task can be accomplished easily with one of them, I just use it.

Perl

I begin with Perl because that’s the first scripting language I learnt. I have scripted a lot in Perl. Here are the parts I enjoy in Perl:

  1. I owe my love for regular expressions (regex) to Perl. If you want to learn regular expressions, you should learn them in Perl (or Python or PHP), and definitely not in Java :-). In my job (where we mainly use Java) when people need help with regexes, they go to people with scripting experience. I believe Java developers should not try to learn regexes with Java. They should try to learn regexes with a scripting language and then transition to Java regexes.
  2. Perl really shines when you need to parse a large text/log file line by line and do a bunch of regex matches.
  3. Perl makes accessing *nix command outputs very easy with back ticks. For example, you could just put back ticks around a long system command and capture its output in a variable like this:
  4. Perl’s string interpolation of variable values inside double quoted strings is a lovely feature. For example:
  5. Perl has a lot of one liners. It takes some time to get used to them, but once you learn them, they can be very useful.

And here are a few tasks for which I found Perl to be unnatural or difficult to use:

  1. If I need to work with classes and objects, I do not use Perl. Somehow I feel OOP support in Perl is not straightforward to learn and use.
  2. While Perl hashes have a lot of power, the sigils (especially in multi-dimensional hashes) made my head spin.
  3. The main reason I moved away from Perl came when I started working with MongoDB (in 2012). I am glad that I had learnt some Python by then, so I could compare the two languages for working with Mongo. Python’s pymongo was a mature client library, while MongoDB’s creator 10gen (now MongoDB, Inc.) did not seem to care much about Perl drivers. The Perl driver documentation was also unclear, and as of today the Perl driver tutorial page does not even load! Also, MongoDB’s JSON documents map seamlessly to Python dictionaries, which IMO are much easier to work with than Perl’s multi-dimensional hashes.

Python

I was using Perl happily until a few years ago, but then I started seeing a lot of people using Python, and there were hot debates about which language is better. So I started learning Python from online tutorials. Python has some of the best documentation of any programming language I have ever read. Later, I read The Quick Python Book, Second Edition. And today I am an avid user (and learner) of Python! There are a lot of very nice things about Python and I am not going to list them all here. Just the top points:

  1. I love the Python IDLE. It makes development a lot of fun with Python. Usually I test some small function in IDLE and then copy it over to my Python script in Eclipse-PyDev (which IMO is a great Python code editor).
  2. As I mentioned before, if I need to work with MongoDB, I use Python and pymongo.
  3. For any decent sized project which we intend to keep for some time (especially in cron jobs), Python is my language of choice.
  4. No more curly braces and semicolons, which makes the code shorter and cleaner.

While Python is my main scripting language of choice these days, I do miss two of Perl’s features: string interpolation and ease of command line output capture. It is possible to do these things in Python, but just not as easily as in Perl.
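
For reference, here is roughly what those two things look like in Python. This is only a sketch: the variable names are illustrative, and the captured command is simply the current Python interpreter printing a word, chosen so the example runs on any platform.

```python
import subprocess
import sys

# String formatting: the closest everyday equivalent of Perl's
# "$name" interpolation inside double-quoted strings.
name = "world"
greeting = "Hello, {}!".format(name)

# Capturing command output: the rough equivalent of Perl's back ticks.
# check_output returns bytes, so decode before using it as text.
output = subprocess.check_output([sys.executable, "-c", "print('captured')"])
captured = output.decode().strip()

print(greeting)
print(captured)
```

Both work, but each takes an explicit call where Perl needs only quotes or back ticks.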

PHP

PHP is my language of choice when I need to build a quick dynamic web page. On most of our servers, PHP and Apache HTTP server are already installed, so building a web page in PHP is very easy. PHP provides a lot of built-in functions. Here are a few I used recently:

scandir – List files and directories inside the specified path
file_exists – Checks whether a file or directory exists
nl2br – Inserts HTML line breaks before all newlines in a string

Since I don’t usually build UI tools, my overall experience with PHP is limited. But if someone asks me to build a UI tool and provide them a link, I will most likely use PHP.

Bash/Shell Script

I love the *nix command line. I also use pipelines to do quick analysis (especially on log files). But if I need to put my code in a script, I stay away from bash and shell scripts. Somehow I find it difficult to just scan a bash script and understand what it does. The code has to be cluttered with comments to explain what it is doing. And if you have experience maintaining code for a while, you know people tend to forget to update comments when they update code. It is much easier for me to read a Python script and understand what it does quickly (without the need for comments).


So that’s the summary of when I use what :-).



pymongo import problem

I was trying to install the Python MongoDB driver today on my 32-bit Ubuntu 11.10 laptop. (I am using Python 2.7.)

Since I already have easy_install on my system, I simply used:

$ sudo easy_install bson
$ sudo easy_install pymongo
and they installed fine.

But when I tried to import it in python, I got the following error:
>>> import pymongo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pymongo-2.3-py2.7-linux-i686.egg/pymongo/__init__.py", line 61, in <module>
    from pymongo.connection import Connection
  File "/usr/local/lib/python2.7/dist-packages/pymongo-2.3-py2.7-linux-i686.egg/pymongo/connection.py", line 44, in <module>
    from bson.py3compat import b
ImportError: No module named py3compat

The trouble is that pymongo ships its own bson package, which clashes with the separately installed bson package:

>>> import bson
/usr/local/lib/python2.7/dist-packages/pytz-2012f-py2.7.egg/pytz/__init__.py:35: UserWarning: Module bson was already imported from /usr/local/lib/python2.7/dist-packages/bson-0.3.3-py2.7.egg/bson/__init__.pyc, but /usr/local/lib/python2.7/dist-packages/pymongo-2.3-py2.7-linux-i686.egg is being added to sys.path
from pkg_resources import resource_stream

Then I saw that the pymongo installation page recommends using pip. So I first got pip with
$ sudo easy_install pip

And as the first thing, I uninstalled both bson and pymongo using pip:
$ sudo pip uninstall bson
$ sudo pip uninstall pymongo

Then I installed pymongo using pip
$ sudo pip install pymongo

Now I can import pymongo without any problems!
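
To check which copy of a module Python would pick up, without actually importing it, something like the following can help. It is a sketch that assumes Python 3.4 or later (importlib.util.find_spec does not exist in the Python 2.7 used above), and module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would load from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Prints the path for each module, or None when it is not installed.
for name in ("bson", "pymongo"):
    print(name, "->", module_origin(name))
```

If the path printed for bson does not sit inside the pymongo installation, a standalone bson package is shadowing the one that pymongo expects.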