Advanced-level SQL interview questions for Data Engineer positions.
1. How do you handle incremental data loads in an ETL pipeline?
Answer: Use watermarks (last_updated timestamp), change data capture (CDC), or delta tables. Track high-water mark, process only new/changed records. Handle late-arriving data with lookback windows. Use upsert (MERGE) for idempotent loads.
-- Incremental load using watermark WITH watermark AS ( SELECT COALESCE(MAX(last_updated), '1900-01-01') AS hwm FROM etl_metadata WHERE table_name = 'orders' ) INSERT INTO staging_orders SELECT * FROM source_orders WHERE last_updated > (SELECT hwm FROM watermark); -- Upsert to target (PostgreSQL) INSERT INTO dim_orders (order_id, status, amount) SELECT order_id, status, amount FROM staging_orders ON CONFLICT (order_id) DO UPDATE SET status = EXCLUDED.status, amount = EXCLUDED.amount; -- Update watermark UPDATE etl_metadata SET last_value = (SELECT MAX(last_updated) FROM staging_orders) WHERE table_name = 'orders';