PySpark Tutorial 3: Advanced DataFrame Transformations
Complete the given assignment and upload it to
https://forms.gle/Uw9Z7TyoayHzDUiV6
 
 
PYSPARK TUTORIAL
Dr. J Geetha, Associate Professor, Department of CSE, RIT
 
Objective
In this tutorial, you will perform advanced transformations such as joins, window functions, and pivots on large academic datasets.
Prerequisites
Students need Python and PySpark installed so that they can run the code examples.
1. Advanced DataFrame Operations
Objective:
Perform advanced transformations like joins, window functions, and pivots on large academic datasets.
Example: Analyzing Student Performance
A college has a dataset of student grades across subjects:

columns = ["StudentID", "Name", "Subject", "Marks"]
Exercise 1:
 
Calculate average marks for each student
 
Rank students based on their average marks
 
Add a column categorizing students into grades (e.g., A, B, C) based on their marks.
 
Pivot the dataset to show subjects as columns and marks as values.
 
Find the top-scoring student for each subject.
Additional Exercises (optional):
1. Analyze attendance records for each student and compute their attendance percentage for a semester.
2. Use a window function to calculate the cumulative GPA for students across multiple semesters.
3. Identify students who have consistently scored below a certain threshold across all subjects.
2. PySpark SQL for Querying Data
Objective:
 
Use PySpark SQL for complex queries on college-related data.
Example: Library Data Analysis

columns = ["StudentID", "Name", "Book", "BorrowedDate", "DaysBorrowed"]
 

 
Exercises:
1.Query to nd the total books borrowed by each student
2.Query to nd books borrowed for more than 5 days3.Find the student who borrowed the most books.4.Identify the most frequently borrowed book.5.Compute the average borrowing period for each student.
Additional Exercises:
1. Create a query to find the most popular elective courses based on enrollment numbers.
2. Generate a report showing the average marks of students grouped by department and year.
3. Identify trends in late submission rates for assignments over different semesters.
3. Machine Learning with PySpark MLlib
Objective:
Apply machine learning techniques to analyze college admission trends.

Example: Predicting Admission Chances

columns = ["GRE", "TOEFL", "GPA", "Research", "AdmitChance"]
 
Prepare features and label
 
 Train a linear regression model
 
Predict admission chances for new data
Exercises:
1. Use the dataset to classify students into "High", "Medium", and "Low" admission chances.
2. Train a decision tree regressor and compare its accuracy with linear regression.
3. Evaluate the model using RMSE and R² metrics.
 

 
4. Real-Time Data Streaming
Objective:
Process real-time data streams, such as campus event registrations.

Example: Event Registration Stream

from pyspark.streaming import StreamingContext

ssc = StreamingContext(spark.sparkContext, 5)  # 5-second batch interval
lines = ssc.socketTextStream("localhost", 9999)

# Process streaming data
registrations = lines.map(lambda x: (x.split(",")[0], 1))  # key by student name
count = registrations.reduceByKey(lambda a, b: a + b)
count.pprint()

ssc.start()
ssc.awaitTermination()
Exercises:
1. Modify the script to filter out duplicate registrations.
2. Aggregate registration data into hourly windows and count unique registrations.
3. Create an alert system for when the total registrations exceed a threshold.
 Additional Exercises:
1. Implement a real-time leaderboard for a college quiz competition using streaming data.
2. Process live streaming data to identify the busiest times for campus facilities, such as libraries or cafeterias.
3. Monitor and detect anomalies in a live student activity data stream, such as sudden spikes in logins or downloads.
5. Optimizing PySpark Jobs
Objective:
Learn techniques like partitioning, caching, and using broadcast variables.
 
 
Example: Optimizing Department Workload Analysis

columns = ["DeptID", "DeptName", "Students"]

Optimize using caching.
Broadcast department information.
Exercises:
1. Partition data by department and count the number of records in each partition.
2. Measure the performance improvement of caching on a large dataset.
3. Use broadcast variables to enrich a dataset with additional metadata.
6. Handling Missing and Inconsistent Data
Objective:
Clean messy datasets using PySpark.
Example: Fixing Missing Grades
columns = ["StudentID", "Name", "Subject", "Marks"]

Fill missing values.
Drop rows with missing names.
Exercises:
1. Replace missing values with the mean or median for numeric columns.
2. Identify and remove duplicate rows from a dataset.
3. Detect and correct inconsistent data (e.g., invalid scores like -1).
 Additional Exercises:
1. Create a script to identify and replace invalid values in a dataset, such as marks greater than 100 or negative scores.
2. Detect missing information in registration forms and categorize incomplete records for follow-up.
3. Implement a cleaning pipeline that normalizes inconsistent formats, such as date fields or department codes.
 