PySpark Tutorial 3: Advanced DataFrame Transformations
Complete the given assignment and upload it to
https://forms.gle/Uw9Z7TyoayHzDUiV6
PYSPARK TUTORI
Dr. J Geetha, Associate Professor, Department of CSE, RIT
4. Real-Time Data Streaming
Objective:
Process real-time data streams, such as campus event registrations.
Example: Event Registration Stream
from pyspark.streaming import StreamingContext

# Assumes an existing SparkSession named spark.
ssc = StreamingContext(spark.sparkContext, 5)  # 5-second batch interval
lines = ssc.socketTextStream("localhost", 9999)

# Process streaming data
registrations = lines.map(lambda x: (x.split(",")[0], 1))  # Key each line by student name (first field)
count = registrations.reduceByKey(lambda a, b: a + b)  # Registrations per student in each batch
count.pprint()

ssc.start()
ssc.awaitTermination()
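To see what each 5-second batch computes, the map and reduceByKey steps can be imitated in plain Python on a small list of lines. This is only a sketch and not Spark code: the sample lines, the "name,event" layout, and the helper name count_by_student are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical lines received during one 5-second batch ("name,event").
batch = [
    "Asha,Hackathon",
    "Ravi,Hackathon",
    "Asha,Robotics Workshop",
    "Meena,Hackathon",
]


def count_by_student(lines):
    """Imitate map + reduceByKey for one batch: key on the first field, sum the 1s."""
    pairs = [(line.split(",")[0], 1) for line in lines]  # map step
    totals = Counter()
    for name, one in pairs:  # reduceByKey step
        totals[name] += one
    return dict(totals)


print(count_by_student(batch))
# {'Asha': 2, 'Ravi': 1, 'Meena': 1}
```

In the streaming job the same computation restarts for every batch, so the counts printed by pprint() are per batch and not running totals.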
Exercises:
1. Modify the script to filter out duplicate registrations.
2. Aggregate registration data into hourly windows and count unique registrations.
3. Create an alert system for when the total registrations exceed a threshold.
Additional Exercises:
1. Implement a real-time leaderboard for a college quiz competition using streaming data.
2. Process live streaming data to identify the busiest times for campus facilities, such as libraries or cafeterias.
3. Monitor and detect anomalies in a live student activity data stream, such as sudden spikes in logins or downloads.
5. Optimizing PySpark Jobs
Objective:
Learn techniques like partitioning, caching, and using broadcast variables.
Example: Optimizing Department Workload Analysis
columns = ["DeptID", "DeptName", "Students"]
# Optimize using caching
# Broadcast department information
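The handout gives only the column names and the two step titles for this example, so the following is one possible sketch of those steps, not the original code. The sample rows, the workload records, and the names enrich and run_demo are assumptions. pyspark is imported inside run_demo so that the small enrich helper can be tried without a Spark installation.

```python
def enrich(record, dept_lookup):
    """Attach the department name to a (dept_id, students) record."""
    dept_id, students = record
    return (dept_id, dept_lookup.get(dept_id, "Unknown"), students)


def run_demo():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DeptWorkload").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical sample rows; the handout does not show the data set.
    columns = ["DeptID", "DeptName", "Students"]
    data = [(1, "CSE", 120), (2, "ISE", 90), (3, "ECE", 110)]
    df = spark.createDataFrame(data, columns)

    # Optimize using caching: mark df for in-memory reuse.
    df.cache()
    print(df.count())                    # first action materializes the cache
    df.groupBy().sum("Students").show()  # later actions can read the cached data

    # Broadcast department information: ship the small lookup to each executor once.
    dept_lookup = {row["DeptID"]: row["DeptName"] for row in df.collect()}
    bc = sc.broadcast(dept_lookup)

    # Hypothetical (DeptID, Students) workload records; DeptID 4 has no match.
    workload = sc.parallelize([(1, 40), (2, 35), (1, 25), (4, 10)])
    enriched = workload.map(lambda rec: enrich(rec, bc.value))
    print(enriched.collect())

    df.unpersist()
    spark.stop()
```

Calling run_demo() on a machine with Spark installed runs both steps. Caching pays off only when the same DataFrame is used by more than one action, and broadcasting suits lookups small enough to fit in each executor's memory.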
Exercises:
1. Partition data by department and count the number of records in each partition.
2. Measure the performance improvement of caching on a large dataset.
3. Use broadcast variables to enrich a dataset with additional metadata.
6. Handling Missing and Inconsistent Data
Objective:
Clean messy datasets using PySpark.
Example: Fixing Missing Grades
columns = ["StudentID", "Name", "Subject", "Marks"]
# Fill missing values
# Drop rows with missing names
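Only the column names and step titles survive for this example, so the sketch below shows one way the two steps could be written with the DataFrame na functions. The sample rows, the default of 0 for missing marks, and the names clean_grades and run_demo are assumptions, not part of the handout.

```python
def clean_grades(df, default_marks=0):
    """Drop rows with no Name, then fill missing Marks with default_marks."""
    return df.na.drop(subset=["Name"]).na.fill({"Marks": default_marks})


def run_demo():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FixMissingGrades").getOrCreate()

    columns = ["StudentID", "Name", "Subject", "Marks"]
    # Hypothetical rows; None marks a missing value.
    data = [
        (1, "Asha", "Math", 85),
        (2, "Ravi", "Math", None),
        (3, None, "Physics", 70),
    ]
    df = spark.createDataFrame(data, columns)

    # Row 3 is dropped (no name); Ravi's missing marks become 0.
    clean_grades(df).show()
    spark.stop()
```

Filling with 0 is the simplest choice but it lowers averages; a mean or median fill, as in the exercises that follow, is often a better fit for marks.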
Exercises:
1. Replace missing values with the mean or median for numeric columns.
2. Identify and remove duplicate rows from a dataset.
3. Detect and correct inconsistent data (e.g., invalid scores like -1).
Additional Exercises:
1. Create a script to identify and replace invalid values in a dataset, such as marks greater than 100 or negative scores.
2. Detect missing information in registration forms and categorize incomplete records for follow-up.
3. Implement a cleaning pipeline that normalizes inconsistent formats, such as date fields or department codes.