Using Apache Zeppelin¶
SherlockML provides a server type for Apache Zeppelin, an alternative web-based notebook for data science. Zeppelin is designed for exploring and visualising large datasets, using tools such as Apache Spark. One of the distinguishing features of Zeppelin compared to the Jupyter notebook is the ability for a note to contain paragraphs (cells in Jupyter’s terminology) written in different languages and using different interpreters. For example, you might have a paragraph that uses PySpark to process data stored in the SherlockML File System and create a temporary table, which is queried from another paragraph with the SQL interpreter.
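As an illustrative sketch of this workflow (the file path, table name and column layout are invented for the example, and it assumes a Spark 2 interpreter where Zeppelin binds the SparkSession as spark), such a note might contain two paragraphs:

```
%spark.pyspark
# Hypothetical example: read a CSV file and register it as a
# temporary view so that other interpreters can query it.
df = spark.read.csv("/project/data/sales.csv", header=True)
df.createOrReplaceTempView("sales")

%sql
-- Query the temporary view created by the PySpark paragraph above
SELECT COUNT(*) FROM sales
```

Each paragraph is prefixed with the interpreter it should run under; the temporary view is the bridge between the two.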
Creating a Zeppelin server¶
An Apache Zeppelin server can be created by choosing “Apache Zeppelin” in the server type dropdown menu. As Zeppelin is designed for use with big data tools such as Apache Spark, we recommend using the “Extra Large” server size. In SherlockML, Zeppelin is configured optimally for this server size.
Using sml, the command line interface to SherlockML, an Apache Zeppelin server can be created by passing --type zeppelin as an additional argument to the sml server new command.
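For example (only the --type zeppelin flag is taken from the text above; any other arguments you pass follow the usual sml server new usage):

```
# Create a new server, requesting the Apache Zeppelin server type
sml server new --type zeppelin
```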
Collaborating on Zeppelin notebooks¶
Apache Zeppelin servers in SherlockML are preconfigured to save notebooks in a hidden directory in the project workspace. This allows you to share notebooks between servers, and to collaborate on notebooks with other team members in your project.
Using Conda environments in Zeppelin notebooks¶
Zeppelin servers on SherlockML use a Python 3 Conda environment by default.
To use a Python 2 environment, execute
%python.conda activate Python2 in a
new paragraph. This change will apply to all Python paragraphs.
The PySpark interpreter is not affected by changes to the Conda environment. To use Python 2 with PySpark, see the Spark section of this guide.
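For instance, a note switching its Python paragraphs to the Python 2 environment might look like the following sketch (the environment name Python2 comes from the text above; the version check is just a sanity check):

```
%python.conda activate Python2

%python
import sys
print(sys.version)   # should now report a 2.x interpreter
```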
Using Apache Spark in Zeppelin notebooks¶
On SherlockML, Zeppelin servers come with Apache Spark preinstalled and
configured to use a local master with 8 threads and 24 GB of memory. To use
Spark from Scala, start your paragraph with the line %spark. The SparkContext is bound to the variable sc. To interact with Spark using PySpark, start your paragraph with the line %spark.pyspark. As when using Scala, the SparkContext is available as sc.
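Putting this together, a minimal sketch of the two kinds of paragraph (the RDD contents are arbitrary, and sc is the SparkContext that Zeppelin binds by default):

```
%spark
// Scala paragraph: sc is the SparkContext
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())

%spark.pyspark
# PySpark paragraph: the same SparkContext is available as sc
rdd = sc.parallelize(range(1, 101))
print(rdd.sum())
```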
By default, PySpark will use the Python interpreter in the Python 3 Conda
environment. To instead use the Python 2 interpreter, you will need to change
the interpreter settings. Open the dropdown menu in the top right hand corner
of the note, and click Interpreter. Then search for the spark interpreter.
Click the edit button in the spark interpreter menu, and change the value of
the PYSPARK_PYTHON property to point at the Python 2 environment. Click
Save, and then confirm in the modal that appears. Now, paragraphs using the
PySpark interpreter will use the Python 2 environment. This change will apply
to all Zeppelin servers in your project. To use the Python 3 environment
again, set PYSPARK_PYTHON back to its original value.
Viewing interpreter logs¶
For debugging your code, the logs of the Zeppelin interpreters can be found in