Impala

Impala is a distributed SQL database that runs on a Hadoop cluster. While Impala enables you to run highly scalable queries against massive data sets, Impala provides much the same interface as a conventional SQL database.

Exploring data in Impala

If you’re using Impala, it’s likely that your Hadoop cluster also has Hue installed. Hue is a graphical interface for exploring data and tables in Impala, among other things. It’s a good place to start to find out what data is available in your cluster and to start developing queries.

Get the web address and login credentials from your Hadoop cluster administrator, and open the Impala Query Editor to start exploring tables and testing queries:

../../_images/hue_impala.png

Once you want to start bringing data out of the cluster and into SherlockML to do some deeper analysis, follow the instructions below to pull the results of queries into a notebook with Python.

Connecting to Impala from Python

To connect to Impala, you need to know:

  • its hostname: this is usually a string like customers.mydomain.com;
  • its port: the default port for Impala is 21050;
  • the name of the database that you want to connect to. If this is a new Impala instance, the only database on it is likely to be named default.

To connect to Impala from Python, we recommend using the Python module impyla. Install it on a SherlockML server with:

$ pip install impyla  # note the 'y'

You can then connect to Impala with:

import impala.dbapi

connection = impala.dbapi.connect(
    host='customers.mydomain.com',  # host on which the database is running
    database='database_name'  # name of the database to connect to
)

cursor = connection.cursor()
cursor.execute('SELECT * FROM customers')
customers = list(cursor.fetchall())
print('We have {} customers'.format(len(customers)))  # This is data science!

connection.close()

Note

We close the connection to allow the database server to reclaim resources. This can be critical in a Jupyter notebook, since the kernel remains alive for a long time.

The Impala connection object can also be used in a with statement to ensure it gets closed automatically:

import impala.dbapi

with impala.dbapi.connect(
    host='customers.mydomain.com', database='database_name'
) as connection:
    cursor = connection.cursor()
    cursor.execute('SELECT * FROM customers')
    customers = list(cursor.fetchall())
    print('We have {} customers'.format(len(customers)))

You can also use the read_sql() function in pandas to read the result of a query directly into a DataFrame:

import impala.dbapi
import pandas

with impala.dbapi.connect(
    host='customers.mydomain.com', database='database_name'
) as connection:
    df = pandas.read_sql('SELECT * FROM customers', connection)

print(df)