Using Apache Spark

Apache Spark is a fast, general-purpose framework for distributed computing, capable of processing massive datasets across large clusters of computers. You’re unlikely to need Spark unless you have very large amounts of data (hundreds of gigabytes and above), but at that scale it’s invaluable.

How to run Spark in SherlockML

Below we document two ways of running Spark in SherlockML: in local mode, and on a full Spark cluster.

Both methods provide access to the same Spark API, but they are useful in very different contexts. In local mode, your code runs on a single server, so it can’t scale to the processing capacity of a full Spark cluster. It is, however, convenient for rapid development, especially when you don’t have access to a full cluster. To work with data on a full Spark cluster, follow the cluster guide instead.