Using Apache Zeppelin

SherlockML provides a server type for Apache Zeppelin, an alternative web-based notebook for data science. Zeppelin is designed for exploring and visualising large datasets using tools such as Apache Spark. One of Zeppelin's distinguishing features compared to the Jupyter notebook is that a single note can contain paragraphs (cells, in Jupyter's terminology) written in different languages and executed by different interpreters. For example, one paragraph might use PySpark to process data stored in the SherlockML File System and create a temporary table, which another paragraph then queries with the SQL interpreter.
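As a concrete sketch of this workflow (the file path and table name are illustrative, and the spark variable assumes a Spark 2.x setup in which Zeppelin binds a SparkSession), the first paragraph below registers a temporary table from PySpark, and the second queries it with the SQL interpreter:

    %spark.pyspark
    # Read a CSV from the project workspace (illustrative path) and register
    # it as a temporary table so that other interpreters can query it.
    df = spark.read.csv("/project/data.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("events")

    %sql
    -- Query the temporary table registered by the PySpark paragraph above
    SELECT COUNT(*) FROM events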


Creating a Zeppelin server

An Apache Zeppelin server can be created by choosing “Apache Zeppelin” in the server type dropdown menu. As Zeppelin is designed for use with big data tools such as Apache Spark, we recommend using the “Extra Large” server size. In SherlockML, Zeppelin is configured optimally for this server size.


Using sml, the command-line interface to SherlockML, an Apache Zeppelin server can be created by passing --type zeppelin as an additional argument to the sml server new command.
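For example, from a shell:

    # Create a new Apache Zeppelin server
    sml server new --type zeppelin

Any other arguments accepted by sml server new can be passed alongside --type zeppelin as usual.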

Collaborating on Zeppelin notebooks

Apache Zeppelin servers in SherlockML are preconfigured to save notebooks in the project workspace, in a hidden directory called .zeppelin. This allows you to share notebooks between servers and to collaborate on them with other members of your project.

Using Conda environments in Zeppelin notebooks

Zeppelin servers on SherlockML use a Python 3 Conda environment by default. To use a Python 2 environment, execute %python.conda activate Python2 in a new paragraph. This change will apply to all Python paragraphs.
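For example, run the following in its own paragraph:

    %python.conda activate Python2

A quick way to confirm the switch in a subsequent Python paragraph:

    %python
    # Print the interpreter version to confirm the active environment
    import sys
    print(sys.version)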


The PySpark interpreter is not affected by changes to the Conda environment. To use Python 2 with PySpark, see the Spark section of this guide.

Using Apache Spark in Zeppelin notebooks

On SherlockML, Zeppelin servers come with Apache Spark preinstalled, configured to use a local master with 8 threads and 24 GB of memory. To use Spark from Scala, start your paragraph with the line %spark. The SparkContext is bound to the variable sc.
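A minimal Scala paragraph, using nothing beyond the bound sc:

    %spark
    // Parallelize a small collection and sum it with the bound SparkContext
    val rdd = sc.parallelize(1 to 100)
    println(rdd.sum())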


To interact with Spark using PySpark, start your paragraph with the line %spark.pyspark. As when using Scala, the SparkContext is available as sc.
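The same computation as the Scala sketch above, written as a PySpark paragraph:

    %spark.pyspark
    # Parallelize a small collection and sum it with the bound SparkContext
    rdd = sc.parallelize(range(1, 101))
    print(rdd.sum())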


By default, PySpark uses the Python interpreter from the Python 3 Conda environment. To use the Python 2 interpreter instead, you will need to change the interpreter settings. Open the dropdown menu labelled anonymous in the top right-hand corner of the note and click Interpreter. Search for the spark interpreter, click the edit button in the Spark interpreter menu, and change the value of the properties named zeppelin.pyspark.python and PYSPARK_PYTHON from /opt/anaconda/envs/Python3/bin/python to /opt/anaconda/envs/Python2/bin/python. Click Save, then OK in the modal that appears. Paragraphs using the spark.pyspark interpreter will now use the Python 2 environment.

Note that this change applies to all Zeppelin servers in your project. To return to the Python 3 environment, change both zeppelin.pyspark.python and PYSPARK_PYTHON back to /opt/anaconda/envs/Python3/bin/python.

[Screenshots: editing the zeppelin.pyspark.python and PYSPARK_PYTHON properties in the Spark interpreter settings.]
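One way to check that the change has taken effect is to inspect the interpreter version from a PySpark paragraph:

    %spark.pyspark
    # After the interpreter change, this should report a Python 2.7 version
    import sys
    print(sys.version)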

Viewing interpreter logs

When debugging your code, you can find the logs of the Zeppelin interpreters in /opt/zeppelin/logs.
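If the shell interpreter is enabled on your server, you can inspect the logs without leaving the notebook (log file names vary by interpreter, so the glob below is illustrative):

    %sh
    # List the interpreter logs, then show the last lines of each
    ls /opt/zeppelin/logs
    tail -n 50 /opt/zeppelin/logs/*.log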