Analyzing Elastic MapReduce data with Python, Pandas and scikit-learn

Published on 04 August 2014, in #data-engineering, #python

What a great time it is nowadays for data geeks!

We have Pandas and scikit-learn, a fantastic Python stack for data analysis. On top of that we have IPython and IPython Notebook, a powerful coding, documentation and visualization layer for experimenting.

Then we have the whole Hadoop stack with the amazingly fast Impala SQL query engine. We don't even have to build the Hadoop cluster in-house: we just choose the size, spin up a cluster on Amazon AWS, and we're done.

And now the folks at Cloudera have made Impyla, a Python connector to Impala. And of course, they didn't forget to pack in an Impala connector for Pandas! How great is that?!

So, if you want to connect Pandas to Impala on Elastic MapReduce (EMR), here is how.

5 steps to connect Pandas to remote Impala

Prerequisites

Install the awesome Pandas, Scikit-learn and IPython stack if you haven't done that already.

Step 1: Install Impyla

$ pip install impyla

Step 2: Create an SSH tunnel to Amazon EMR so you can access Impala from localhost

$ ssh -L 12345:localhost:21050 your_user_name@your_node.compute.amazonaws.com

The Impala query engine runs on port 21050 on your Hadoop master node. For security reasons, this port is not accessible from the outside.

This shell command opens port 12345 on your local machine and forwards it to port 21050 on the Hadoop master node, where the Impala query engine listens. (Of course, you can choose whatever local port you want; it doesn't have to be 12345.)
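Before moving on, it can help to confirm that the local end of the tunnel is actually listening. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is made up for this illustration and is not part of Impyla. Note that it only shows that something accepts TCP connections on the local port. It does not prove that Impala is answering on the remote side.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for this post, not an Impyla function.
    """
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts connections on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 12345 is the local port chosen in the ssh command above.
    print(port_is_open("localhost", 12345))
```

If this prints False, the SSH session has most likely ended or a different local port was used.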

Step 3: Connect Impyla to Impala via the tunnel

>>> from impala.dbapi import connect
>>> conn = connect(host='localhost', port=12345)
>>> cur = conn.cursor()

Notice that we are connecting to localhost:12345 which is (securely) forwarded to Impala on Amazon EMR.

Step 4: Query Impala and convert the result into a Pandas DataFrame

>>> from impala.util import as_pandas
>>> cur.execute('SELECT * FROM customers LIMIT 500')
>>> df = as_pandas(cur)

Note that the query result must be transferred from the remote cluster to your local machine. If the result is large, the download might take a while. You might want to watch the network traffic monitor on your system to see when the download is complete.
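As an alternative to watching a network monitor, the rows can be pulled in batches so that progress is visible from Python. This is a rough sketch that relies only on the standard Python DB-API cursor methods `fetchmany` and `description`, which Impyla cursors are expected to provide since Impyla follows that interface. The function name `fetch_with_progress` and its parameters are invented for this example.

```python
def fetch_with_progress(cursor, chunk_size=10000, report=print):
    """Fetch all rows from an executed DB-API cursor in batches.

    Hypothetical helper for this post: it calls `report` after each batch
    so you can see how many rows have arrived so far.
    Returns (column_names, rows).
    """
    # Each entry of cursor.description is a sequence whose first item
    # is the column name (per the DB-API 2.0 specification).
    columns = [col[0] for col in cursor.description]
    rows = []
    while True:
        batch = cursor.fetchmany(chunk_size)
        if not batch:
            # An empty batch means the result set is exhausted.
            break
        rows.extend(batch)
        report("fetched %d rows so far" % len(rows))
    return columns, rows
```

After `cur.execute(...)`, calling `columns, rows = fetch_with_progress(cur)` and then `pandas.DataFrame.from_records(rows, columns=columns)` should give a DataFrame comparable to the one from `as_pandas`, though column types may need checking.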

Step 5: Enjoy Hadoop data in Pandas

>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf.fit(df[['age', 'gender']], df['lifetime_value'])
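The call above assumes that every feature column is already numeric. If a column such as gender arrives from Impala as text, scikit-learn will refuse it, so it needs encoding first. Here is a small sketch of one way to do that with `pandas.get_dummies`. The data below is synthetic and made up purely for illustration, including the assumption that lifetime_value is a categorical label.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the DataFrame returned by as_pandas(cur).
# The values and the "low"/"high" labels are invented for this example.
df = pd.DataFrame({
    "age": [23, 35, 47, 52, 29, 61, 38, 44],
    "gender": ["f", "m", "f", "m", "m", "f", "f", "m"],
    "lifetime_value": ["low", "low", "high", "high",
                       "low", "high", "low", "high"],
})

# One-hot encode the text column so that every feature is numeric.
features = pd.get_dummies(df[["age", "gender"]], columns=["gender"])

# random_state is fixed only to make the example repeatable.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(features, df["lifetime_value"])

# Predict on the training rows, just to show the fitted model in use.
predictions = clf.predict(features)
```

If lifetime_value is a continuous number in your table, a regressor such as `RandomForestRegressor` is probably the better fit.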

Bonus: Impala with scikit-learn API

By the way, the folks at Cloudera are now busy implementing the scikit-learn API in Impyla. They already have alpha implementations of linear regression, logistic regression and SVM ready. I'm quite excited to see where all this is going...


This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering.