Spark 1.2 vs 1.6

I was reading the fairly old book “Learning Spark” from O'Reilly, which targets Spark 1.1. Many new features have been added since Spark 1.3, the most notable being the DataFrame API.

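# Code to load a CSV file into an RDD
import csv
import io

# Hypothetical column names; replace them with the columns of your file
FIELDNAMES = ["name", "favouriteAnimal"]

def loadRecord(line):
    """Parse one CSV line into a dict keyed by FIELDNAMES"""
    # Without fieldnames, DictReader would consume the line as a header row
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDNAMES)
    return next(reader)

records = sc.textFile(file_path).map(loadRecord)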
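
Because textFile hands each line to the parser separately, csv.DictReader cannot pick up the column names from a header row on its own. The sketch below runs without Spark and shows one way to handle that: read the column names from the first line, then parse every remaining line independently, as map() would on an RDD. The sample lines and the parse_line helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample data standing in for the lines of a CSV file
lines = ["name,favouriteAnimal", "holden,panda", "spark,bear"]

# Assumption: the first line is a header that names the columns
fieldnames = next(csv.reader([lines[0]]))

def parse_line(line, fieldnames):
    """Parse one CSV line into a dict keyed by the given column names."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=fieldnames)
    return next(reader)

# Skip the header, then parse each remaining line on its own
records = [parse_line(line, fieldnames) for line in lines[1:]]
print(records)
```

On a real RDD the header line would still need to be filtered out before mapping, which is one of the chores the DataFrame route takes care of.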

The blog post “DataFrame Spark 1.5 from csv file” on NodalPoint recommends using the spark-csv library from Databricks instead.

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(file_path)
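
The one-liner can be wrapped in a small helper. This is only a sketch: load_csv is a hypothetical name, sqlContext is assumed to be the SQLContext that the PySpark shell creates for you, and inferSchema is a spark-csv option that asks the library to guess column types rather than read every column as a string.

```python
def load_csv(sqlContext, file_path):
    """Load a CSV file with a header row into a DataFrame via spark-csv.

    load_csv is a hypothetical helper; sqlContext is assumed to be a
    pyspark.sql.SQLContext, and the spark-csv package must be on the classpath.
    """
    return (sqlContext.read
            .format('com.databricks.spark.csv')
            .options(header='true', inferSchema='true')
            .load(file_path))
```

In the shell this would be used as df = load_csv(sqlContext, file_path), followed by df.printSchema() to check the inferred column types.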

Automatically load the spark-csv library

You don’t want to type pyspark --packages com.databricks:spark-csv_2.11:1.4.0 every time you start the PySpark shell. Instead, you can do this:

# Add to $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages com.databricks:spark-csv_2.11:1.4.0

A PySpark notebook in Jupyter benefits from this setting as well.
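
If you would rather not edit spark-defaults.conf, one alternative for a notebook is to request the package through the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. A minimal sketch, assuming the same package coordinates as in the config line and Spark 1.4 or later, which expects the trailing pyspark-shell token:

```python
import os

# Same coordinates as in the spark-defaults.conf entry
SPARK_CSV_PACKAGE = "com.databricks:spark-csv_2.11:1.4.0"

# Must be set before the SparkContext is created; the trailing
# "pyspark-shell" token is expected by Spark 1.4 and later
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {} pyspark-shell".format(SPARK_CSV_PACKAGE)

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

This only affects the current process, so it has to run in the first notebook cell, before anything starts Spark.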


