Hive
====

`Hive `_ is a distributed SQL database that runs on a `Hadoop `_
cluster. While Hive enables you to run highly scalable queries against
massive data sets, it provides much the same interface as a
conventional SQL database.

Exploring data in Hive
----------------------

If you're using Hive, it's likely that your Hadoop cluster also has
`Hue `_ installed. Hue is a graphical interface for exploring data and
tables in Hive, among other things. It's a good place to start finding
out what data is available in your cluster and to begin developing
queries. Get the web address and login credentials from your Hadoop
cluster administrator, and open the Hive Query Editor to start
exploring tables and testing queries:

.. thumbnail:: images/hue_hive.png

Once you want to start bringing data out of the cluster and into
Faculty to do some deeper analysis, follow the instructions below to
pull the results of queries into a notebook with Python.

Connecting to Hive from Python
------------------------------

To connect to Hive, you need to know:

- its `hostname`: this is usually a string like
  ``customers.mydomain.com``;
- its `port`: the default port for Hive is 10000;
- the name of the database that you want to connect to. If this is a
  new Hive instance, the only database on it is likely to be named
  ``default``;
- the username to run queries as on Hive. You can find this from your
  database administrator.

To connect to Hive from Python, we recommend using the Python module
`PyHive `_. Install it on a Faculty server with:

.. code-block:: bash

   $ conda install pyhive --channel anaconda

.. note::

   We recommend installing PyHive from the 'anaconda' conda channel
   rather than from pip or the standard conda repositories to ensure
   you get all the required dependencies.

You can then connect to Hive with:

.. code-block:: python

   from pyhive import hive

   connection = hive.connect(
       host='customers.mydomain.com',  # host on which the database is running
       database='database_name',       # name of the database to connect to
       username='username'             # username to run queries as
   )

   cursor = connection.cursor()
   cursor.execute('SELECT * FROM customers')
   customers = list(cursor.fetchall())
   print('We have {} customers'.format(len(customers)))  # This is data science!

   connection.close()

.. note::

   We close the connection to allow the database server to reclaim
   resources. This can be critical in a Jupyter notebook, since the
   kernel remains alive for a long time.

The Hive connection object can also be used with ``contextlib.closing``
in a ``with`` statement to ensure it gets closed automatically:

.. code-block:: python

   from contextlib import closing
   from pyhive import hive

   connection = hive.connect(
       host='customers.mydomain.com',
       database='database_name',
       username='username'
   )

   with closing(connection):
       cursor = connection.cursor()
       cursor.execute('SELECT * FROM customers')
       customers = list(cursor.fetchall())
       print('We have {} customers'.format(len(customers)))

You can also use the ``read_sql()`` function in pandas to read the
result of a query directly into a DataFrame:

.. code-block:: python

   from contextlib import closing
   from pyhive import hive
   import pandas

   connection = hive.connect(
       host='customers.mydomain.com',
       database='database_name',
       username='username'
   )

   with closing(connection):
       df = pandas.read_sql('SELECT * FROM customers', connection)

   print(df)
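If your queries include values supplied at runtime, avoid splicing them
into the SQL string by hand. PyHive's DB-API cursors support
``pyformat``-style query parameters, which the driver escapes for you.
The sketch below is illustrative: the ``fetch_big_spenders`` helper and
the ``spend`` column are hypothetical, not part of the examples above.

.. code-block:: python

   def fetch_big_spenders(connection, threshold):
       """Return all customers whose spend exceeds ``threshold``.

       ``connection`` is an open DB-API connection, such as the one
       returned by ``pyhive.hive.connect``. The threshold is passed as
       a query parameter (PyHive uses the ``pyformat`` paramstyle)
       rather than being interpolated into the SQL string, so the
       driver escapes it for you.
       """
       cursor = connection.cursor()
       cursor.execute(
           'SELECT * FROM customers WHERE spend > %(threshold)s',
           {'threshold': threshold},
       )
       return cursor.fetchall()

With a real connection you would call this inside a
``closing(connection)`` block, as in the examples above, for instance
``fetch_big_spenders(connection, 1000)``.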