PySpark and Jupyter

Problem: how to run PySpark in a Jupyter notebook.

Some assumptions before starting:

  • You have Anaconda installed.
  • You have Spark installed. District Data Lab has an exceptional article on how to get started with Spark in Python. It’s long, but detailed.
  • pyspark is in the $PATH variable.

There are 2 solutions:

  1. The first one modifies the environment variables that pyspark reads, so that running pyspark starts a Jupyter/IPython notebook instead of the pyspark console.
  2. The second one installs a separate Spark kernel for Jupyter. This way is more flexible: the spark-kernel from IBM can run code in Scala, Python, Java, and SparkSQL.

1st – the simple solution

Why does it work? [Disclaimer: I can only give my intuition on how the whole thing works.] Check out the code of pyspark on GitHub: it reads some environment variables. The three of interest are:

PYSPARK_PYTHON              # Python executable used by the Spark workers
PYSPARK_DRIVER_PYTHON       # program launched as the driver (e.g. python, ipython, jupyter)
PYSPARK_DRIVER_PYTHON_OPTS  # extra options passed to the driver program
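
bin/pyspark itself is a shell script, but in Python terms its selection logic looks roughly like this (a sketch of my understanding, not the actual code):

# Rough sketch of the logic in bin/pyspark (the real script is written in shell)
import os
import subprocess

worker_python = os.environ.get("PYSPARK_PYTHON", "python")              # Python used by the workers
driver_python = os.environ.get("PYSPARK_DRIVER_PYTHON", worker_python)  # program launched as the driver
driver_opts = os.environ.get("PYSPARK_DRIVER_PYTHON_OPTS", "")          # extra args for the driver

# With the ~/.profile example below, this effectively runs: jupyter notebook
subprocess.call([driver_python] + driver_opts.split())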

You can put those variables in ~/.profile or ~/.bashrc so that they are set every time you start a new shell. Add the lines to one of these two files (and open a new shell, or source the file, so they take effect). Below is one example:

# ~/.profile
export PYSPARK_PYTHON=/home/hduser/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/hduser/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Then anytime you run pyspark, the Jupyter notebook will start. Here is a piece of code to test whether pyspark is working. Suppose you have a text file:

# text_file = file://<file_path>
# Suppose you have a file.txt in your Desktop
# sc is the SparkContext object, it's available when pyspark is executed
text_file = sc.textFile("file:///home/hduser/Desktop/file.txt")
text_file.count()
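
As a further sanity check (a sketch that assumes the same file.txt and the sc object from above), you can run a small word count on the RDD:

# Count word occurrences in the same file
words = text_file.flatMap(lambda line: line.split())
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts.take(5)  # show a few (word, count) pairs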

If you can run that code snippet in Jupyter, congratulations!

NOTE!!!

With this setup, spark-submit no longer works for Python applications, because spark-submit also uses PYSPARK_DRIVER_PYTHON to launch the application. For example, following the tutorial in Spark Quick Start – Self-contained application:

# Before setting those PYSPARK_ variables

$ ./bin/spark-submit examples/src/main/python/pi.py
Pi is roughly 3.146480

# After setting those PYSPARK_ variables

$ ./bin/spark-submit examples/src/main/python/pi.py
jupyter: '/usr/local/spark-1.6.1-bin-without-hadoop/examples/src/main/python/pi.py' is not a Jupyter command

One workaround is to override the variable for a single command, e.g. PYSPARK_DRIVER_PYTHON=python ./bin/spark-submit examples/src/main/python/pi.py, but overall I recommend the 2nd way.

2nd – create a PySpark kernel in Jupyter

This adds a PySpark option to the kernel selection whenever we start a new Jupyter notebook.

Follow the tutorial from limauto.

Read the Jupyter documentation to know where to put kernel.json. I decided to put it in /usr/local/share/jupyter/kernels, which makes it available system-wide.

# I choose to make the kernel available system-wide
# Make the folder (-p also creates the parent directories):
mkdir -p /usr/local/share/jupyter/kernels/pyspark

# Content of /usr/local/share/jupyter/kernels/pyspark/kernel.json
{
    "display_name": "PySpark (Spark 1.6)",
    "language": "python",
    "argv": [
        "python3",
        "-m",
        "IPython.kernel",
        "--profile=pyspark",
        "-f",
        "{connection_file}"
    ]
}
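
Note that --profile=pyspark assumes an IPython profile named pyspark exists (create it with ipython profile create pyspark) and that the profile initializes Spark on startup. Below is a minimal sketch of such a startup file, in the spirit of the District Data Lab article; SPARK_HOME and the py4j zip name are assumptions, so adapt the paths to your install:

# ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py (sketch)
import os
import sys

# Assumes SPARK_HOME points to your Spark installation
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
# The py4j version depends on your Spark release (0.9 ships with Spark 1.6; check python/lib/)
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.9-src.zip"))

# Run pyspark's own shell bootstrap, which creates the SparkContext `sc`
exec(open(os.path.join(spark_home, "python", "pyspark", "shell.py")).read())

With this in place, a notebook opened with the PySpark kernel should have sc available, just like the pyspark console.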

Check whether the kernel is specified correctly

$ jupyter kernelspec list
Available kernels:
python3    /home/hduser/anaconda3/lib/python3.5/site-packages/ipykernel/resources
pyspark    /usr/local/share/jupyter/kernels/pyspark

Run jupyter notebook and check in the browser: now you have a PySpark kernel.

[Screenshots: Jupyter_PySpark, Jupyter_PySpark_2 show the new PySpark kernel in the notebook's kernel list]
