Jay Urbain
Apache Spark is an open-source, distributed processing system commonly used for big data workloads. It utilizes in-memory caching and optimized query execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries.
This document provides an introduction to Spark and a hands-on pySpark programming tutorial delivered through a series of Jupyter notebooks executed on the Databricks hosted computing environment. It covers Spark core, Spark SQL, Spark Streaming, and Spark MLlib.
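To give a feel for what the notebooks contain, here is a minimal pySpark sketch (illustrative only, not taken from the tutorial notebooks): it builds a small DataFrame and queries it through both the DataFrame API and Spark SQL. On Databricks a SparkSession named `spark` is already provided, so the builder call below effectively returns that existing session; the app name and sample data are assumptions for the example.

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` already exists; getOrCreate() returns it.
spark = SparkSession.builder.appName("SparkIntroSketch").getOrCreate()

# Small illustrative dataset (hypothetical values).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)

# DataFrame API: transformations are lazy; show() triggers execution.
df.filter(df.age > 30).select("name").show()

# Spark SQL: register the DataFrame as a temporary view and query it.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```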
Tutorial:
- SparkIntro (Google Doc)
- SparkIntro (pdf)
References:
- pySpark API documentation: http://spark.apache.org/docs/2.1.0/api/python/index.html
- Spark RDD Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html
- Short guide to useful pySpark DataFrame commands: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
- Learning Spark, by Matei Zaharia, Patrick Wendell, Andy Konwinski, and Holden Karau. O'Reilly Media, February 2015. Note: a little dated.
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale, by Sandy Ryza, Uri Laserson, Josh Wills, and Sean Owen. O'Reilly Media, April 2015. Note: RDD focus, a little dated.