SparkIntro

Jay Urbain

Apache Spark is an open-source, distributed processing system commonly used for big data workloads. Apache Spark uses in-memory caching and optimized query execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries.

This document provides an introduction to Spark and a hands-on pySpark programming tutorial through a series of Jupyter Notebooks executed on the Databricks hosted computing environment. It covers Spark core, Spark SQL, Spark Streaming, and Spark MLlib.

Tutorial:

Notebooks

Data

References:

Documentation: http://spark.apache.org/docs/2.1.0/api/python/index.html

Spark Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html

Short guide to useful pySpark dataframe commands: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

Learning Spark, Matei Zaharia, Patrick Wendell, Andy Konwinski, Holden Karau. O'Reilly Media, Inc., February 2015. Note: A little dated.

Advanced Analytics with Spark: Patterns for Learning from Data at Scale, Sandy Ryza, Uri Laserson, Josh Wills, Sean Owen. O'Reilly Media, April 2015. Note: RDD focus, a little dated.
