Lab Assignment 1

ICP Team Id: 5-2

Member 1: Pranoop Mutha, Class ID: 15

Member 2: Geovanni West, Class ID: 23

For this assignment, we implemented the solution in both Python (by Geo) and Scala (by Pranoop), with an equal contribution from both of us.

Objective:

  • Using Spark transformations and actions, find the users who have rated more than 25 items in the MovieLens data set, which consists of 100,000 movie ratings by 943 users on 1,682 items.
  • Create a GitHub account.
  • Create a ZenHub tool account with 3 milestones and at least 5 issues, and show the analytics graph.

Spark Transformations and Actions:

A Spark job may be written in Scala or Python and can make use of the SQL and Hive contexts; its code resides in jar files or Python script files, and the job itself is built from transformations and actions applied to RDDs.

A transformation is a function that produces a new RDD from existing RDDs: it takes an RDD as input and produces one or more new RDDs as output. An action, by contrast, does not change the RDD data; it computes a result from the data (for example a count or a saved file) without modifying it.

The following transformations and actions are used in this lab assignment; a short sketch combining them appears after the list.

  • map(func) : Return a new distributed dataset formed by passing each element of the source through a function func.

  • filter(func) : Return a new dataset formed by selecting those elements of the source on which func returns true.

  • reduceByKey(func,[numTasks]) : When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

  • sortByKey([ascending], [numTasks]) : When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

  • saveAsTextFile(path) : Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
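
The short program below is a minimal sketch, not our submitted code; the object name, sample data and output path are made up for illustration. It shows how these operations fit together on a small in-memory RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TransformationsDemo").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "scala", "spark", "hdfs", "spark"))

    val counts = words
      .map(word => (word, 1))   // map: pair each word with the value 1
      .reduceByKey(_ + _)       // reduceByKey: add up the 1s to count each word
      .filter(_._2 > 1)         // filter: keep only words seen more than once
      .sortByKey()              // sortByKey: order the remaining pairs by word

    counts.saveAsTextFile("output/demo_counts") // saveAsTextFile: write one pair per line

    sc.stop()
  }
}
```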

Input Data

We have taken the MovieLens data set, which consists of 100,000 movie ratings by 943 users on 1,682 items.


Scala Code performing Spark Transformations and Actions

Here we first used the split function to separate the tab-delimited columns. We then mapped the first column (the user id) as the key and paired each key with the value 1. Next, we applied the reduceByKey transformation, which works like a count, giving the number of times each key appears and collapsing the duplicates. We then used the filter transformation to keep only the keys with a count greater than 25, and sorted the results in descending order with the sortBy transformation. Finally, we saved the results to a text file using the saveAsTextFile action.
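
Below is a minimal sketch of that pipeline, assuming the standard MovieLens u.data layout of tab-separated user id, item id, rating and timestamp; the object name and file paths are placeholders rather than the exact ones from our repository:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Lab1ActiveUsers {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Lab1ActiveUsers").setMaster("local[*]"))

    // Each line of u.data is: userId \t itemId \t rating \t timestamp
    val ratings = sc.textFile("data/u.data")

    val activeUsers = ratings
      .map(line => (line.split("\t")(0), 1)) // key each rating by its user id
      .reduceByKey(_ + _)                    // count how many items each user rated
      .filter(_._2 > 25)                     // keep users with more than 25 ratings
      .sortBy(_._2, ascending = false)       // most active users first

    activeUsers.saveAsTextFile("output/users_over_25") // write (userId, count) pairs as text

    sc.stop()
  }
}
```

When run, saveAsTextFile writes the results as part files under the given output directory, one (userId, count) pair per line.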

Python Code performing Spark Transformations and Actions

Output from Python

Output from Scala

GitHub Account

Github Remote Repository:

We created the GitHub repository named Big-Data in Geo's GitHub account.

Cloning the repository to desktop:

I then cloned it to my local machine and made changes to the Scala source code. Similarly, Geo made the changes for Python, and we included our screenshots.

Cloned folder in local:

This is the folder after cloning the repository to our local system

LabAssignments and ICP folder in the main folder:

Lab 1 Folder:

Displaying Lab 1 Folder

Documentation and Source folders in Lab 1:

Created documentation and source folders in the Lab 1 folder

Screenshots in documentation folder:

Included screenshots in the documentation folder

Source code for Python and Scala in Source Code Folder:

Included source code in the src folder

Zenhub:

Issues:

Zenhub Board:

Milestones

Burndown Charts or Graphs:

Source Code Links:

  • Scala: https://github.com/GeoSnipes/Big-Data/tree/master/lab_assignments/Lab%201/src/Scala/CS5542-Lab1-SourceCode/Spark%20WordCount
  • Python: https://github.com/GeoSnipes/Big-Data/tree/master/lab_assignments/Lab%201/src/Python
