Connect to Spark on an external cluster¶
Using the Apache Livy service, you can connect to an external Spark cluster from SherlockML notebooks, apps and APIs.
To enable interaction with the Spark cluster, Apache Livy must be installed. Contact your cluster administrator to arrange installation - documentation on installation is available here.
We also strongly recommend to use Spark 2, which provides a much easier to
use interface for data science than Spark 1. Indeed, MLlib, the Spark
machine learning library has already deprecated their RDD (Spark 1)
interface. You can check your Spark version by running
print(sc.version) inside a Spark session as described below. Contact
your cluster administrator to install Spark 2 and configure Apache Livy to
Any libraries or other dependencies needed by your code must be installed
on the Spark cluster, not on your SherlockML server. Using
sparkmagic/pylivy and Apache Livy, the code you run inside a
cell is run inside the external cluster, not in your notebook.
The sparkmagic package provides Jupyter magics for managing Spark sessions on a external cluster and executing Spark code in them.
To use sparkmagic in your notebooks, install the package with
pip install sparkmagic in a terminal or with a SherlockML Environment, and load the magics in a notebook with:
This makes available the
%spark magic, which is the main entry point for
managing sessions and executing code.
Create a new PySpark session with:
%spark add -s my_session -l python -u https://mycluster.com:8998
--url option sets the URL where Livy is running - you can get
this from your system administrator. You can create R and Scala Spark sessions
by setting the
--language option to
For documentation of all options run
You can also list running sessions with:
and delete sessions with:
%spark delete sessionname
sparkmagic also provides the
%manage_spark command, which returns a widget
for managing Spark sessions on the Livy server, which you may prefer to the
Once you’ve created a Spark session as above, execute a cell on the cluster by
decorating a cell with the
%%spark cell magic (note the two
%%spark print('I am being executed on the external cluster')
It’s important to bear in mind the distinction between code executed in the
external Spark cluster from code that is executed in the notebook in your
SherlockML server. Cells that have the
%%spark magic are executed on
the external cluster, and will only see variables that exist there, and
cells without that magic are executed on your SherlockML server, and will
only see variables that exist there.
If you get errors like NameError: name ‘df’ is not defined, it may be because the variable you meant exists in the other context.
Transfer of data between the external cluster and SherlockML notebook must be done explicity, as described below.
spark SparkSession (Spark 2 only) and
sc SparkContext objects will
be inserted into the session automatically. For example, to create a Spark
DataFrame from a CSV file in the cluster’s HDFS filesystem:
%%spark df = spark.read.csv('hdfs:////data/sample_data.csv')
Variables created in one cell will persist in the session and will be available in other later cells. For example, we can run a second cell that counts the number of rows in the Spark DataFrame created above:
Any output generated by your code in the cluster will be retrieved and displayed as the output of the notebook cell in SherlockML.
Often, you’ll want to retrieve the contents of a Spark DataFrame from the
cluster so you can do additional processing and modelling in your normal
Jupyter notebook. You can do this with the
%spark -o df
This will evaluate and collect the Spark DataFrame
df on the external
Spark cluster, and save its data into a Pandas DataFrame in your SherlockML
notebook, also called
%spark -o will attempt to load all of the values from a Spark
DataFrame into the memory on your SherlockML server. If this is very large,
as is often the case with Spark DataFrames, it may crash your server due to
running out of memory!
You can also use
-o with a
%%spark cell magic. The below code creates a
Spark DataFrame in the external cluster called
top_ten, then collects it
into the SherlockML notebook as the Pandas DataFrame
%%spark -o top_ten top_ten = df.limit(10)
The pylivy package provides tools for managing Spark sessions on an external cluster and executing Spark code in them.
Unlike sparkmagic, pylivy does not depend on being executed from inside a Jupyter notebook, making it suitable for use in scripts, apps and APIs.
To use pylivy, install it with
pip install livy in a terminal or with a
pylivy provides the
LivySession class, which creates a Spark session and
shuts it down automatically when finished. To execute code in the session, pass
it as a string to the
run() method on the session:
from livy import LivySession with LivySession('https://mycluster.com:8998') as session: session.run('print("foo")')
You may also find the
textwrap.dedent() function from the Python standard
library useful for writing multiple-line code snippets inline:
from textwrap import dedent with LivySession('https://mycluster.com:8998') as session: session.run(dedent(""" df = spark.read.csv('hdfs:////data/sample_data.csv') top_ten = df.limit(10) """))
read() method on the session allows you to evaluate and retrieve the
contents of a Spark DataFrame. Pass it the name of the Spark DataFrame you want
to read, and it will return it as a Pandas DataFrame:
with LivySession('https://mycluster.com:8998') as session: session.run(dedent(""" df = spark.read.csv('hdfs:////data/sample_data.csv') top_ten = df.limit(10) """)) top_ten_pandas = session.read('top_ten') # Do something useful with the Pandas dataframe make_plot(top_ten_pandas)