Data Engineer Interview Guide: Real-Time & Streaming Questions
Core Concepts
- What’s the difference between batch processing and real-time (stream) processing? When would you use one vs the other?
- Explain event-time vs processing-time. Why is this distinction critical in real-time pipelines?
- What are watermarks and how do they help with late-arriving events?
- Describe stateful vs stateless stream processing with an example.
- Compare at-most-once, at-least-once, and exactly-once delivery semantics.
- What are common bottlenecks in streaming systems and how do you handle them?
- In Kafka, what are the roles of partitions, offsets, and consumer groups?
- What are the differences between Kafka Streams, Flink, and Spark Structured Streaming?
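For the event-time and watermark questions above, a small simulation can help anchor an answer. The following is a minimal sketch in plain Python, not how Flink or Spark implement watermarks: the names `WINDOW_SIZE`, `ALLOWED_LATENESS`, and `process` are hypothetical, and the watermark is assumed to be the maximum event time seen so far minus a fixed lateness bound.

```python
from collections import defaultdict

WINDOW_SIZE = 60        # tumbling window length in seconds (assumed value)
ALLOWED_LATENESS = 30   # how far the watermark trails the max event time (assumed value)


def window_start(event_time):
    """Map an event-time timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SIZE)


def process(events):
    """Count events per event-time window and emit a window once the watermark passes its end.

    `events` is an iterable of (event_time, key) pairs in arrival order.
    Returns (emitted, dropped): `emitted` is a list of (window_start, count),
    `dropped` is a list of events that arrived after their window was closed.
    """
    open_windows = defaultdict(int)   # state: window_start -> running count
    emitted, dropped = [], []
    watermark = float("-inf")

    for event_time, key in events:
        start = window_start(event_time)
        if start + WINDOW_SIZE <= watermark:
            # The window was already finalized, so this event is too late.
            dropped.append((event_time, key))
            continue
        open_windows[start] += 1

        # Advance the watermark: max event time seen so far minus the lateness bound.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)

        # Close every window whose end is at or before the watermark.
        closable = sorted(w for w in open_windows if w + WINDOW_SIZE <= watermark)
        for ws in closable:
            emitted.append((ws, open_windows.pop(ws)))

    return emitted, dropped
```

Windows are assigned by event time, not arrival time, so an event that arrives slightly out of order still lands in the correct window as long as the watermark has not yet passed that window's end. Events that arrive after that point are dropped here; a real pipeline might instead route them to a side output or apply a correction downstream.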
Metrics & Product Scenarios
- How would you define and calculate DAU/WAU/MAU in a streaming system?
- Suppose you need to track real-time order cancellations at DoorDash. What’s your approach?
- Netflix wants “Avg 30-Day Viewing Days” in real-time. How would you model and compute it?
- How do you track drivers at Uber who were active in the last 15 minutes using event streams?
- How would you calculate retention cohorts in near real-time?
- What metrics would you define to monitor fraud detection pipeline latency?
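For the active-drivers question above, one possible approach is to keep the latest event time per driver as state and count only the drivers seen inside the window. This is a minimal in-memory sketch under that assumption; `ActiveDriverTracker`, `observe`, and `active_count` are hypothetical names, and a production job would typically hold this state in a stream processor's keyed state or an external store with TTLs.

```python
WINDOW_SECONDS = 15 * 60  # the 15-minute activity window from the question


class ActiveDriverTracker:
    """Track which drivers produced an event within the last `window_seconds`."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.last_seen = {}  # state: driver_id -> latest event time (seconds)

    def observe(self, driver_id, event_time):
        """Record an event; out-of-order events never move a driver's timestamp backwards."""
        previous = self.last_seen.get(driver_id)
        if previous is None or event_time > previous:
            self.last_seen[driver_id] = event_time

    def active_count(self, now):
        """Evict drivers whose last event is outside the window, then count the rest."""
        cutoff = now - self.window_seconds
        stale = [d for d, t in self.last_seen.items() if t <= cutoff]
        for driver_id in stale:
            del self.last_seen[driver_id]
        return len(self.last_seen)
```

Storing only the last-seen timestamp keeps the state at one entry per driver, which is enough for a distinct count but would not support questions such as "how many events did each driver send".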
Data Modeling
- Design fact and dimension tables for a real-time ride-hailing system (Uber/Lyft).
- How would you model clickstream events in a dimensional schema?
- Explain why we need bridge tables (like multi-genre titles in Netflix) in event modeling.
- How would you store partial viewing sessions for analytics?
SQL & ETL Scenarios
- Write SQL to compute rolling 30-day distinct active users.
- You receive duplicate events in Kafka. How do you deduplicate them in SQL or Spark?
- Write a query to find the top 5 restaurants with the most orders in the last 24 hours (DoorDash).
- Suppose you have trip events (start, end). Write SQL to calculate the average trip duration over the last hour.
- How would you implement incremental ETL for streaming → warehouse (Snowflake/BigQuery)?
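The deduplication and rolling 30-day questions above can be sketched together. The example below runs the SQL through Python's built-in `sqlite3` module as a stand-in for a warehouse or Spark SQL; the `events` table and its columns (`event_id`, `user_id`, `event_date`, `ingested_at`) are hypothetical, and the date arithmetic uses SQLite syntax, which differs in Snowflake, BigQuery, and Spark. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Keep the first-ingested copy of each event_id (dedup), then for every
# activity date count distinct users seen in the 30 days ending on that date.
ROLLING_QUERY = """
WITH ranked AS (
    SELECT event_id, user_id, event_date,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at) AS rn
    FROM events
),
deduped AS (
    SELECT event_id, user_id, event_date FROM ranked WHERE rn = 1
),
days AS (
    SELECT DISTINCT event_date FROM deduped
)
SELECT d.event_date, COUNT(DISTINCT e.user_id) AS active_users_30d
FROM days AS d
JOIN deduped AS e
  ON e.event_date BETWEEN date(d.event_date, '-29 days') AND d.event_date
GROUP BY d.event_date
ORDER BY d.event_date
"""

DEDUP_COUNT_QUERY = """
SELECT COUNT(*) FROM (
    SELECT event_id,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at) AS rn
    FROM events
) WHERE rn = 1
"""


def analyze(rows):
    """Load (event_id, user_id, event_date, ingested_at) rows and run both queries.

    Returns (deduplicated_event_count, [(event_date, active_users_30d), ...]).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events ("
            "event_id TEXT, user_id TEXT, event_date TEXT, ingested_at INTEGER)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
        dedup_count = conn.execute(DEDUP_COUNT_QUERY).fetchone()[0]
        rolling = conn.execute(ROLLING_QUERY).fetchall()
        return dedup_count, rolling
    finally:
        conn.close()
```

`COUNT(DISTINCT user_id)` is already insensitive to duplicate rows, so the dedup step matters mainly for metrics that sum or count events; it is included here because the two questions are often asked together. The self-join is simple to explain but scans each event once per matching day, so on large tables an interviewer may expect a discussion of pre-aggregation or approximate distinct counts.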
System Design & Edge Cases
- Design a real-time dashboard for Uber surge pricing. What components do you need?
- How would you design an alerting system that fires when an order delivery exceeds 45 minutes?
- What happens if events arrive out of order? How do you correct them?
- How do you scale a Flink job when state grows too large?
- How do you design for backpressure handling in Spark Structured Streaming?
- If your Kafka consumer is lagging heavily, how do you debug and fix it?
- Explain how you’d handle schema evolution in streaming data.
Reference Material: