PySpark Interviews: The “Tricky but Simple” Questions
PySpark interviews often include questions that look hard at first, but are easy if you think them through. These questions check how well you can solve problems and use basic PySpark skills. You might get questions about removing duplicate data in unusual ways, dealing with empty data, using special functions for ranking, changing data structures, and making code run faster. The main thing is to show you can think clearly and solve problems step-by-step, not just remember code.
Question 1
Using PySpark, process a user information dataset that contains duplicate user_id entries. The goal is to remove those duplicates, ensuring that for each user_id, only the row with the latest created_date is retained. The final result should contain the most recent entry for every user.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

# Initialize Spark session
spark = SparkSession.builder.appName('Spark Playground').getOrCreate()

# Enter the file path here
file_path = "/datasets/customer.csv"

# Read the file; inferSchema lets Spark parse created_date as a date
# instead of a string, so the "latest" comparison is chronological
user_df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load(file_path)
user_df.show()