PySpark in local mode

The easiest way to try out Apache Spark from Python on SherlockML is in local mode. All of the processing is done on a single server, so you still benefit from parallelisation across all the cores of that server, but not across several servers.

To use PySpark on SherlockML, create a custom environment that installs it. The environment should include:

  • openjdk-8-jdk in the system section;
  • pyspark in the Python section, under pip.
[Screenshot (../../_images/pyspark-env.png): a custom environment with openjdk-8-jdk in the system section and pyspark under pip]

Start a new Jupyter server with this environment. Unfortunately, PySpark does not play well with Anaconda environments, so you need to set environment variables telling Spark which Python executable to use. Add these lines to the top of your notebook:

import os
import sys

# Use the notebook's Python interpreter for both the Spark driver and the workers
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

You can now import pyspark and create a Spark context:

import pyspark

number_cores = 8  # threads for the local Spark master
memory_gb = 24    # memory available to the driver JVM
conf = (
    pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext(conf=conf)

PySpark does not support restarting the Spark context, so if you need to change these settings, you will have to restart the Jupyter kernel.
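
If you are not sure which configuration the running context actually picked up, you can inspect it before deciding whether a restart is needed; the sketch below only uses pyspark's standard accessors:

# Key/value pairs of the active configuration
print(sc.getConf().getAll())

# Default number of partitions, which in local mode matches the thread count
print(sc.defaultParallelism)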

Now that we have instantiated a Spark context, we can use it to run calculations:

# Distribute a small list of numbers, then square and sum the elements
rdd = sc.parallelize([1, 4, 9])
sum_squares = (
    rdd.map(lambda elem: float(elem)**2)
        .reduce(lambda elem1, elem2: elem1 + elem2)
)
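
For this input, sum_squares evaluates to 98.0 (1 + 16 + 81).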

This example hard-codes the number of threads and the memory. You may want to set these dynamically based on the size of the server. You can use the NUM_CPUS and AVAILABLE_MEMORY_MB environment variables to determine the size of the server the notebook is currently running on:

number_cores = int(os.environ["NUM_CPUS"])
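
AVAILABLE_MEMORY_MB is reported in megabytes, so it needs converting before being passed to spark.driver.memory, which expects a value such as '24g'. A minimal sketch, assuming you want to give essentially all of the server's memory to the driver:

memory_gb = int(os.environ["AVAILABLE_MEMORY_MB"]) // 1024

You can then build the SparkConf exactly as above, substituting these values for the hard-coded ones.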