Data Engineer Interview Guide: Real-Time & Streaming Questions

3 min read · Aug 20, 2025
https://medium.com/@goyalarchana17/15-critical-questions-every-data-engineer-should-be-prepared-to-answer-3aa82791653e

Core Concepts

  1. What’s the difference between batch processing and real-time (stream) processing? When would you use one vs. the other?
  2. Explain event-time vs. processing-time. Why is this distinction critical in real-time pipelines?
  3. What are watermarks, and how do they help with late-arriving events?
  4. Describe stateful vs. stateless stream processing with an example.
  5. Compare at-most-once, at-least-once, and exactly-once delivery semantics.
  6. What are common bottlenecks in streaming systems and how do you handle them?
  7. In Kafka, what are the roles of partitions, offsets, and consumer groups?
  8. What are the differences between Kafka Streams, Flink, and Spark Structured Streaming?
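
As a starting point for questions 2 and 3, here is a minimal pure-Python sketch of event-time tumbling windows closed by a watermark. It is not how any particular engine implements watermarks; the class name, the 60-second window, and the 30-second out-of-orderness bound are all assumptions chosen for illustration. The watermark trails the largest event time seen so far, a window is emitted once the watermark passes its end, and events for an already-closed window are set aside as late.

```python
from collections import defaultdict

WINDOW_SECONDS = 60          # tumbling window size (assumed for the example)
MAX_OUT_OF_ORDERNESS = 30    # how far the watermark trails the max event time (assumed)


class TumblingEventTimeCounter:
    """Counts events per event-time window and closes windows by watermark."""

    def __init__(self):
        self.open_windows = defaultdict(int)  # window_start -> event count
        self.max_event_time = 0
        self.closed = []    # (window_start, count) in emission order
        self.dropped = []   # event times that arrived after their window closed

    @property
    def watermark(self):
        # "No event older than this is expected any more."
        return self.max_event_time - MAX_OUT_OF_ORDERNESS

    def on_event(self, event_time):
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= self.watermark:
            # The window was already finalized, so the event is too late.
            self.dropped.append(event_time)
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        self._emit_closed()

    def _emit_closed(self):
        # sorted() copies the keys, so popping inside the loop is safe.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark:
                self.closed.append((start, self.open_windows.pop(start)))


def run_demo():
    counter = TumblingEventTimeCounter()
    # Event times in arrival order; 40 and 20 arrive out of order.
    for event_time in [5, 50, 70, 40, 95, 20, 130]:
        counter.on_event(event_time)
    return counter.closed, counter.dropped
```

In the demo, the event at time 40 arrives out of order but before the watermark passes 60, so it still counts toward the first window. The event at time 20 arrives after that window has closed and lands in `dropped`. A real pipeline would typically route such events to a side output or apply an allowed-lateness policy instead of discarding them.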

Metrics & Product Scenarios

  1. How would you define and calculate DAU/WAU/MAU in a streaming system?
  2. Suppose you need to track real-time order cancellations in DoorDash. What’s your approach?
  3. Netflix wants “Avg 30-Day Viewing Days” in real-time. How would you model and compute it?
  4. How do you track active drivers in Uber in the last 15 minutes using event streams?
  5. How would you calculate retention cohorts in near real-time?
  6. What metrics would you define to monitor fraud detection pipeline latency?
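
For question 4, one possible approach is to keep the latest event time per driver as keyed state and count the drivers whose last event falls inside the window. The sketch below is a single-process illustration under that assumption; `ActiveDriverTracker` and its methods are hypothetical names, and a production job would hold this state in a stream processor or a store with TTL support.

```python
WINDOW_SECONDS = 15 * 60  # "last 15 minutes"


class ActiveDriverTracker:
    """Tracks the last time each driver was seen and counts recent ones."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.last_seen = {}  # driver_id -> latest event time (seconds)

    def on_event(self, driver_id, event_time):
        # Keep the maximum so an out-of-order ping never moves a driver back.
        previous = self.last_seen.get(driver_id)
        if previous is None or event_time > previous:
            self.last_seen[driver_id] = event_time

    def active_count(self, now):
        # Assumes `now` only moves forward between calls.
        cutoff = now - self.window_seconds
        expired = [d for d, t in self.last_seen.items() if t <= cutoff]
        for driver_id in expired:
            del self.last_seen[driver_id]  # evict to keep state bounded
        return len(self.last_seen)


def run_demo():
    tracker = ActiveDriverTracker()
    for driver_id, event_time in [("d1", 0), ("d2", 300), ("d1", 600), ("d3", 900)]:
        tracker.on_event(driver_id, event_time)
    return [tracker.active_count(now) for now in (1000, 1250, 1500)]
```

Evicting expired drivers on each read keeps the state proportional to the number of recently active drivers, not to the number of drivers ever seen.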

Data Modeling

  1. Design fact and dimension tables for a real-time ride-hailing system (Uber/Lyft).
  2. How would you model clickstream events in a dimensional schema?
  3. Explain why we need bridge tables (like multi-genre titles in Netflix) in event modeling.
  4. How would you store partial viewing sessions for analytics?
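
To make the bridge-table question concrete, the sketch below builds a tiny schema in an in-memory SQLite database. The table names, column names, and the `weight` allocation column are assumptions for illustration, not a prescribed model. It shows why joining a fact table directly to a many-to-many genre relationship inflates totals, and how an allocation weight on the bridge keeps genre totals consistent with the true total.

```python
import sqlite3


def minutes_by_genre():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_genre (genre_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE bridge_title_genre (title_id INTEGER, genre_id INTEGER, weight REAL);
        CREATE TABLE fact_view (view_id INTEGER PRIMARY KEY, title_id INTEGER, minutes INTEGER);

        INSERT INTO dim_genre VALUES (1, 'Drama'), (2, 'Thriller');
        -- Title 1 belongs to two genres; title 2 belongs to one.
        INSERT INTO bridge_title_genre VALUES (1, 1, 0.5), (1, 2, 0.5), (2, 1, 1.0);
        INSERT INTO fact_view VALUES (1, 1, 100), (2, 2, 40);
        """
    )
    rows = conn.execute(
        """
        SELECT g.name,
               SUM(f.minutes)            AS raw_minutes,       -- double counts title 1
               SUM(f.minutes * b.weight) AS weighted_minutes   -- sums to the true total
        FROM fact_view f
        JOIN bridge_title_genre b ON b.title_id = f.title_id
        JOIN dim_genre g ON g.genre_id = b.genre_id
        GROUP BY g.name
        ORDER BY g.name
        """
    ).fetchall()
    conn.close()
    return rows
```

The raw minutes add up to 240 across genres even though only 140 minutes were watched, while the weighted minutes add up to 140. Whether to report raw or weighted figures depends on the question being asked; the bridge table makes both possible without duplicating fact rows.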

SQL & ETL Scenarios

  1. Write SQL to compute rolling 30-day distinct active users.
  2. You receive duplicate events in Kafka — how do you deduplicate in SQL/Spark?
  3. Write a query to find the top 5 restaurants with the most real-time orders in the last 24 hours (DoorDash).
  4. Suppose you have trip events (start, end). Write SQL to calculate the average trip duration in the last hour.
  5. How would you implement incremental ETL for streaming → warehouse (Snowflake/BigQuery)?
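
For the deduplication question, a common SQL pattern is to rank rows within each event key and keep only the first. The sketch below runs that pattern against an in-memory SQLite database so it can be tried locally. The `raw_events` table and its columns are assumptions, and window functions require SQLite 3.25 or newer. The same `ROW_NUMBER()` query shape carries over to Spark SQL and most warehouses.

```python
import sqlite3


def deduplicate_events():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_events (event_id TEXT, user_id TEXT, event_time INTEGER, ingested_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?, ?, ?)",
        [
            ("e1", "u1", 100, 101),
            ("e1", "u1", 100, 105),  # redelivery of e1
            ("e2", "u2", 110, 111),
            ("e3", "u1", 120, 121),
            ("e2", "u2", 110, 130),  # redelivery of e2
        ],
    )
    rows = conn.execute(
        """
        SELECT event_id, user_id, event_time
        FROM (
            SELECT e.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY event_id      -- one group per logical event
                       ORDER BY ingested_at       -- keep the earliest delivery
                   ) AS rn
            FROM raw_events AS e
        ) AS ranked
        WHERE rn = 1
        ORDER BY event_id
        """
    ).fetchall()
    conn.close()
    return rows
```

This assumes producers attach a stable `event_id`. Without one, the partition key would have to be a combination of business columns, which is a weaker guarantee.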

System Design & Edge Cases

  1. Design a real-time dashboard for Uber surge pricing. What components do you need?
  2. How would you design an alerting system if order delivery exceeds 45 minutes?
  3. What happens if events arrive out of order? How do you correct them?
  4. How do you scale a Flink job when state grows too large?
  5. How do you design for backpressure handling in Spark Structured Streaming?
  6. If your Kafka consumer is lagging heavily, how do you debug and fix it?
  7. Explain how you’d handle schema evolution in streaming data.
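
For question 2, the core of a late-delivery alert is keyed state plus a timer: remember when each order was placed, clear it on delivery, and alert once the elapsed time passes the threshold. The sketch below shows that logic in plain Python. `LateDeliveryMonitor` and its method names are hypothetical, and a real system would drive `check` from processing-time or event-time timers in the stream processor instead of calling it by hand.

```python
SLA_SECONDS = 45 * 60  # alert when an order is still open after 45 minutes


class LateDeliveryMonitor:
    """Holds open orders and reports each overdue order exactly once."""

    def __init__(self, sla_seconds=SLA_SECONDS):
        self.sla_seconds = sla_seconds
        self.open_orders = {}   # order_id -> placed_at (seconds)
        self.alerted = set()    # order_ids already reported

    def on_placed(self, order_id, placed_at):
        self.open_orders[order_id] = placed_at

    def on_delivered(self, order_id):
        # Clearing state on delivery is what prevents false alerts.
        self.open_orders.pop(order_id, None)
        self.alerted.discard(order_id)

    def check(self, now):
        overdue = sorted(
            order_id
            for order_id, placed_at in self.open_orders.items()
            if now - placed_at > self.sla_seconds and order_id not in self.alerted
        )
        self.alerted.update(overdue)
        return overdue


def run_demo():
    monitor = LateDeliveryMonitor()
    monitor.on_placed("o1", 0)
    monitor.on_placed("o2", 600)
    monitor.on_delivered("o1")
    return [monitor.check(now) for now in (2800, 3400, 3500)]
```

Tracking which orders have already been alerted keeps the alert idempotent, so a periodic check does not page repeatedly for the same order.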

Reference Material:

Archana Goyal