Metadata-Version: 2.0
Name: sparkmanager
Version: 0.3.1
Summary: A pyspark management framework
Home-page: https://github.com/matz-e/sparkmanager
Author: Matthias Wolf
Author-email: matthias.wolf@epfl.ch
License: MIT
Download-URL: https://github.com/matz-e/sparkmanager/archive/0.3.1.tar.gz
Keywords: apache-spark
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Requires-Dist: pyspark
Requires-Dist: six

Spark Management Consolidated
=============================

A small module that loads as a singleton class object to manage
Spark-related things.

Installation
------------

Install directly via ``pip`` on the command line, in a ``virtualenv``:

.. code:: shell

   pip install https://github.com/matz-e/sparkmanager/tarball/master

or for the current user:

.. code:: shell

   pip install --user https://github.com/matz-e/sparkmanager/tarball/master

Usage
-----

The module itself acts as a mediator to Spark:

.. code:: python

   import sparkmanager as sm

   # Create a new application
   sm.create("My fancy name")

   data = sm.spark.range(5)
   # Will show up in the UI with the name "broadcasting some data"
   with sm.jobgroup("broadcasting some data"):
       data = sm.broadcast(data.collect())
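
A context manager of this kind can be built on top of pyspark's
``SparkContext.setJobGroup``. The following is a minimal sketch and not
necessarily how ``sparkmanager`` implements it: the ``jobgroup`` helper shown
here is hypothetical, takes the Spark context explicitly, and assumes that
the description can double as the group id:

.. code:: python

   from contextlib import contextmanager


   @contextmanager
   def jobgroup(sc, description):
       """Label all jobs started inside the ``with`` block."""
       # setJobGroup(groupId, description) is part of the pyspark API;
       # reusing the description as the group id is an assumption
       sc.setJobGroup(description, description)
       try:
           yield
       finally:
           # Overwrite the label so that later jobs are not attributed to it
           sc.setJobGroup("", "")

With a live context this would be used as ``with jobgroup(sm.sc, "some
label"):``.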

The Spark session can be accessed via ``sm.spark``, the Spark context via
``sm.sc``. Both attributes are instantiated once the ``create`` method is
called, with the option to call unambiguous methods from both directly via
the :py:class:`SparkManager` object:

.. code:: python

   # The following two calls are equivalent
   c = sm.parallelize(range(5))
   d = sm.sc.parallelize(range(5))
   assert c.collect() == d.collect()
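
One way such forwarding could be implemented is with ``__getattr__``. The
sketch below is not taken from the ``sparkmanager`` sources: the ``Mediator``
class and the two stand-in classes are hypothetical. It only illustrates how
a name found on exactly one of the two objects can be forwarded, while a name
found on both is rejected as ambiguous:

.. code:: python

   class FakeSession(object):
       """Hypothetical stand-in for a Spark session."""

       def range(self, n):
           return list(range(n))

       def version(self):
           return "session"


   class FakeContext(object):
       """Hypothetical stand-in for a Spark context."""

       def parallelize(self, data):
           return list(data)

       def version(self):
           return "context"


   class Mediator(object):
       """Forward attribute lookups to a session and a context."""

       def __init__(self, session, context):
           self.spark = session
           self.sc = context

       def __getattr__(self, name):
           # Only called when the normal lookup fails, so ``spark`` and
           # ``sc`` themselves are never routed through here
           owners = [o for o in (self.spark, self.sc) if hasattr(o, name)]
           if len(owners) != 1:
               # Either unknown, or present on both and thus ambiguous
               raise AttributeError(name)
           return getattr(owners[0], name)


   m = Mediator(FakeSession(), FakeContext())
   assert m.parallelize(range(3)) == m.sc.parallelize(range(3))
   assert not hasattr(m, "version")  # defined on both, so not forwarded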

Cluster support scripts
-----------------------

.. note::

   Scripts to run on the cluster are still somewhat experimental and should
   be used with caution!

Environment setup
~~~~~~~~~~~~~~~~~

To create a self-contained Spark environment, the script provided in
``examples/env.sh`` can be used. It is currently tuned to the requirements of
the `bbpviz` cluster.  A usage example:

.. code:: shell

   SPARK_ROOT=/path/to/my/spark/installation SM_WORKDIR=/path/to/a/work/directory examples/env.sh

The working directory will contain:

* A Python virtual environment
* A basic Spark configuration pointing to directories within the working
  directory
* An environment script to establish the setup

To use the resulting working environment:

.. code:: shell

   . /path/to/a/work/directory/env.sh

Spark deployment on allocations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Within a cluster allocation, the script ``sm_cluster`` can be used to start
a Spark cluster.  The script is installed automatically by ``pip``.  To
use it, either pass a working directory that contains an environment, or
specify the working directory and the environment script separately:

.. code:: shell

   sm_cluster startup $WORKDIR
   sm_cluster startup $WORKDIR /path/to/some/env.sh

Similarly, to stop a cluster (not necessary with Slurm):

.. code:: shell

   sm_cluster shutdown $WORKDIR
   sm_cluster shutdown $WORKDIR /path/to/some/env.sh

Spark applications can then connect to the master, which is found via:

.. code:: shell

   cat $WORKDIR/spark_master
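
The same lookup can be done from Python before creating a session. The
following is a minimal sketch, assuming that the ``spark_master`` file holds
a single master URL. The helper names ``read_master`` and ``connect`` are
hypothetical and not part of ``sparkmanager``; the session is created with
the plain pyspark builder API:

.. code:: python

   import os


   def read_master(workdir):
       """Return the stripped contents of ``<workdir>/spark_master``."""
       with open(os.path.join(workdir, "spark_master")) as fd:
           return fd.read().strip()


   def connect(workdir, name="example"):
       """Create a pyspark session attached to the stored master."""
       # Imported here so that ``read_master`` works without pyspark
       from pyspark.sql import SparkSession

       return (SparkSession.builder
               .master(read_master(workdir))
               .appName(name)
               .getOrCreate())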

TL;DR on BlueBrain 5
~~~~~~~~~~~~~~~~~~~~

Set up a Spark environment in your current shell, and point ``WORKDIR`` to a
shared directory. ``SPARK_HOME`` needs to be in your environment and point to
your Spark installation.  By default, only a file with the Spark master and
the cluster launch script will be copied to ``WORKDIR``. Then submit a
cluster job with:

.. code:: shell

   sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR


