[![Documentation Status](https://readthedocs.org/projects/sparkhpc/badge/?version=latest)](http://sparkhpc.readthedocs.io/en/latest/?badge=latest)
[![Build Status](https://travis-ci.org/rokroskar/sparkhpc.svg?branch=master)](https://travis-ci.org/rokroskar/sparkhpc)
[![Test Coverage](https://codeclimate.com/github/rokroskar/sparkhpc/badges/coverage.svg)](https://codeclimate.com/github/rokroskar/sparkhpc/coverage)
# sparkhpc: Spark on HPC clusters made easy
This package aims to greatly simplify deploying and managing [Apache Spark](http://spark.apache.org) clusters on HPC resources.
## Installation
### From [pypi](https://pypi.python.org)
```
$ pip install sparkhpc
```
### From source
```
$ python setup.py install
```
This will install the python package to your default package directory as well as the `sparkcluster` and `hpcnotebook` command-line scripts.
## Usage
There are two options for using this library: from the command line or directly from python code.
### Command line
#### Get usage info
```
$ sparkcluster --help
Usage: sparkcluster [OPTIONS] COMMAND [ARGS]...
Options:
--scheduler [lsf|slurm] Which scheduler to use
--help Show this message and exit.
Commands:
info Get info about currently running clusters
launch Launch the Spark master and workers within a...
start Start the spark cluster as a batch job
stop Kill a currently running cluster ('all' to...
$ sparkcluster start --help
Usage: sparkcluster start [OPTIONS] NCORES
Start the spark cluster as a batch job
Options:
--walltime TEXT Walltime in HH:MM format
--jobname TEXT Name to use for the job
--template TEXT Job template path
--memory-per-executor INTEGER Memory to reserve for each executor (i.e. the
JVM) in MB
--memory-per-core INTEGER Memory per core to request from scheduler in
MB
--cores-per-executor INTEGER Cores per executor
--spark-home TEXT Location of the Spark distribution
--wait Wait until the job starts
--help Show this message and exit.
```
#### Start a cluster
```
$ sparkcluster start 10
```
#### Get information about currently running clusters
```
$ sparkcluster info
----- Cluster 0 -----
Job 31454252 not yet started
$ sparkcluster info
----- Cluster 0 -----
Number of cores: 10
master URL: spark://10.11.12.13:7077
Spark UI: http://10.11.12.13:8080
```
#### Stop running clusters
```
$ sparkcluster stop 0
Job <31463649> is being terminated
```
### Python code
```python
from sparkhpc import sparkjob
import findspark
findspark.init() # this sets up the paths required to find spark libraries
import pyspark
sj = sparkjob.sparkjob(ncores=10) # create a cluster with 10 cores
sj.wait_to_start()                # wait for the batch job to start
sc = sj.start_spark()             # obtain a SparkContext for the running cluster
sc.parallelize(...)
```
### Jupyter notebook
`sparkhpc` gives you nicely formatted info about your jobs and clusters in the jupyter notebook - see the [example notebook](./example.ipynb).
## Dependencies
### Python
* [click](http://click.pocoo.org/5/)
* [findspark](https://github.com/minrk/findspark)
These are installable via `pip install`.
### System configuration
* Spark installation in `~/spark` OR wherever `SPARK_HOME` points to
* java distribution (set `JAVA_HOME`)
* `mpirun` in your path
### Job templates
Simple job templates for the currently supported schedulers are included in the distribution. If you want to use your own template, specify its path with the `--template` flag to `start`. See the [included templates](sparkhpc/templates) for examples. Note that the variables in curly braces, e.g. `{jobname}`, will be used to inject runtime parameters; currently you must specify `walltime`, `ncores`, `memory`, `jobname`, and `spark_home`. If you need to significantly alter the job submission, the best approach is to subclass the relevant scheduler class (e.g. `LSFSparkCluster`) and override the `submit` method.
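For illustration, a custom template for an LSF-like scheduler might look like the sketch below. The `#BSUB` directives and the launch line are assumptions for illustration only; consult the included templates for the exact invocation your scheduler needs. The curly-brace fields are filled in by `sparkhpc` at submission time:

```
#!/bin/bash
#BSUB -J {jobname}                   # job name shown by the scheduler
#BSUB -n {ncores}                    # number of cores to request
#BSUB -W {walltime}                  # walltime in HH:MM
#BSUB -R "rusage[mem={memory}]"      # memory per core in MB

# launch the Spark master and workers (sketch; see the included
# templates for the exact command line)
sparkcluster launch --spark-home {spark_home} ...
```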
## Using other schedulers
The LSF and SLURM schedulers are currently supported. However, adding support for other schedulers is rather straightforward (see the `LSFSparkJob` and `SLURMSparkJob` implementations as examples). Please submit a pull request if you implement a new scheduler or get in touch if you need help!
To implement support for a new scheduler, subclass `SparkCluster`. You must define the following *class* variables:
* `_submit_command` (command used to submit a job to the scheduler)
* `_job_regex` (regex that extracts the job ID from the output of the submit command)
* `_kill_command` (scheduler command used to kill a job)
* `_get_current_jobs` (scheduler command that lists jobid, status, and jobname, one job per line)

and implement the `_peek()` method, which returns the stdout of the current job.
Note that the output of the `_get_current_jobs` command should be formatted like this:
```
JOB_NAME STAT JOBID
sparkcluster PEND 31610738
sparkcluster PEND 31610739
sparkcluster PEND 31610740
```
Depending on the scheduler's behavior, you may need to override some of the other methods as well.
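As a sketch, a backend for a hypothetical PBS/Torque scheduler might fill in the class variables like this. The class name, commands, and regex below are illustrative assumptions, not part of `sparkhpc` (a real implementation would subclass `SparkCluster` rather than stand alone):

```python
import re

# Illustrative sketch only: in real code this would subclass sparkhpc's
# SparkCluster, which supplies the submit/info/stop machinery.
class PBSSparkJob:
    _submit_command = 'qsub {script}'   # submits the rendered job template
    _job_regex = r'(\d+)'               # pulls the numeric job ID out of qsub's reply
    _kill_command = 'qdel'              # kills a job by ID
    # must print one "JOB_NAME STAT JOBID" line per job, as shown above
    _get_current_jobs = 'qstat -u $USER'

    def _peek(self):
        """Return the stdout of the current job, e.g. by reading its output file."""
        raise NotImplementedError

# the regex must recover the job ID from whatever the submit command prints;
# PBS-style submit output looks roughly like '12345.pbsserver'
job_id = re.search(PBSSparkJob._job_regex, '12345.pbsserver').group(1)
```

The key design point is that the base class drives everything through these few hooks, so a new backend mostly amounts to knowing your scheduler's command-line syntax.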
## Jupyter notebook
Running Spark applications, especially with Python, is really nice from the comfort of a [Jupyter notebook](http://jupyter.org/).
This package includes the `hpcnotebook` script, which will set up and launch a secure, password-protected notebook for you.
```
$ hpcnotebook
Usage: hpcnotebook [OPTIONS] COMMAND [ARGS]...
Options:
--port INTEGER Port for the notebook server
--help Show this message and exit.
Commands:
launch Launch the notebook
setup Setup the notebook
```
### Setup
Before launching the notebook, it needs to be configured. The script will first ask for a notebook password and then generate a self-signed SSL certificate; this prevents other users of your cluster from stumbling into your notebook by chance.
### Launching
On a computer cluster, you would normally either obtain an interactive job and issue the command below, or use this as a part of a batch submission script.
```
$ hpcnotebook launch
To access the notebook, inspect the output below for the port number, then point your browser to https://1.2.3.4:<port_number>
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
[I 15:43:12.022 NotebookApp] Serving notebooks from local directory: /cluster/home/roskarr
[I 15:43:12.022 NotebookApp] 0 active kernels
[I 15:43:12.022 NotebookApp] The Jupyter Notebook is running at: https://[all ip addresses on your system]:8889/
[I 15:43:12.022 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```
In this case, you could set up a port forward to host `1.2.3.4` and instruct your browser to connect to `https://1.2.3.4:8889`.
Inside the notebook, it is straightforward to set up the `SparkContext` using the `sparkhpc` package (see above).
## Contributing
Please submit an issue if you discover a bug or have a feature request! Pull requests are also very welcome.