PySpark Tutorial 3: Advanced DataFrame Transformations

Complete the given assignment and upload it to:

https://forms.gle/Uw9Z7TyoayHzDUiV6

PYSPARK TUTORIAL

Dr. J Geetha, Associate Professor, Department of CSE, RIT


Objective

In this tutorial, you will perform advanced transformations such as joins, window functions, and pivots on large academic datasets.

Prerequisites

Students need Python and PySpark installed so that they can run the code examples.

1. Advanced DataFrame Operations

Objective:

Perform advanced transformations like joins, window functions, and pivots on large academic datasets.

Example: Analyzing Student Performance

A college has a dataset of student grades across subjects:

columns = ["StudentID", "Name", "Subject", "Marks"]

Exercise 1:

1. Calculate the average marks for each student.
2. Rank students based on their average marks.
3. Add a column categorizing students into grades (e.g., A, B, C) based on their marks.
4. Pivot the dataset to show subjects as columns and marks as values.
5. Find the top-scoring student for each subject.

Additional Exercises (optional):

1. Analyze attendance records for each student and compute their attendance percentage for a semester.
2. Use a window function to calculate the cumulative GPA for students across multiple semesters.
3. Identify students who have consistently scored below a certain threshold across all subjects.

2. PySpark SQL for Querying Data

Objective:



Use PySpark SQL for complex queries on college-related data.

Example: Library Data Analysis

columns = ["StudentID", "Name", "Book", "BorrowedDate", "DaysBorrowed"]


Exercises:

1.Query to nd the total books borrowed by each student

2.Query to nd books borrowed for more than 5 days3.Find the student who borrowed the most books.4.Identify the most frequently borrowed book.5.Compute the average borrowing period for each student.

Additional Exercises:

1.Create a query to nd the most popular elective courses based on enrollment numbers.2.Generate a report showing the average marks of students grouped by department and year.3.Identify trends in late submission rates for assignments over dierent semesters.

3. Machine Learning with PySpark MLlib

Objective:

Apply machine learning techniques to analyze college admission trends.

Example: Predicting Admission Chances

columns = ["GRE", "TOEFL", "GPA", "Research", "AdmitChance"]



1. Prepare features and label.
2. Train a linear regression model.
3. Predict admission chances for new data.

Exercises:

1. Use the dataset to classify students into "High", "Medium", and "Low" admission chances.
2. Train a decision tree regressor and compare its accuracy with linear regression.
3. Evaluate the model using RMSE and R² metrics.


1.Create a query to nd the most popular elective courses based on enrollment numbers.2.Generate a report showing the average marks of students grouped by department and year.3.Identify trends in late submission rates for assignments over dierent semesters.


4. Real-Time Data Streaming

Objective:

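Process real-time data streams, such as campus event registrations.

Example: Event Registration Stream

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Use at least two local threads: one for the socket receiver, one for processing
spark = SparkSession.builder.master("local[2]").appName("EventRegistrations").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 5)  # 5-second batch interval
lines = ssc.socketTextStream("localhost", 9999)

# Process streaming data
registrations = lines.map(lambda x: (x.split(",")[0], 1))  # Key on the first comma-separated field (the student name)
count = registrations.reduceByKey(lambda a, b: a + b)  # Registrations per student in each batch
count.pprint()

ssc.start()
ssc.awaitTermination()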

Exercises:

1. Modify the script to filter out duplicate registrations.
2. Aggregate registration data into hourly windows and count unique registrations.
3. Create an alert system for when the total registrations exceed a threshold.


Additional Exercises:

1. Implement a real-time leaderboard for a college quiz competition using streaming data.
2. Process live streaming data to identify the busiest times for campus facilities, such as libraries or cafeterias.
3. Monitor and detect anomalies in a live student activity data stream, such as sudden spikes in logins or downloads.

5. Optimizing PySpark Jobs

Objective:

Learn techniques like partitioning, caching, and using broadcast variables.

Example: Optimizing Department Workload Analysis

columns = ["DeptID", "DeptName", "Students"]

1. Optimize using caching.
2. Broadcast department information.

Exercises:

1. Partition data by department and count the number of records in each partition.
2. Measure the performance improvement of caching on a large dataset.
3. Use broadcast variables to enrich a dataset with additional metadata.

6. Handling Missing and Inconsistent Data

Objective:

Clean messy datasets using PySpark.

Example: Fixing Missing Grades

columns = ["StudentID", "Name", "Subject", "Marks"]

1. Fill missing values.
2. Drop rows with missing names.

Exercises:

1. Replace missing values with the mean or median for numeric columns.
2. Identify and remove duplicate rows from a dataset.
3. Detect and correct inconsistent data (e.g., invalid scores like -1).


Additional Exercises:

1. Create a script to identify and replace invalid values in a dataset, such as marks greater than 100 or negative scores.
2. Detect missing information in registration forms and categorize incomplete records for follow-up.
3. Implement a cleaning pipeline that normalizes inconsistent formats, such as date fields or department codes.
