Interview Prep
Interview Questions for ETL Testing — SQL Validation, Data Pipelines, and What Testers Actually Get Asked
ETL testing interviews are 80% SQL. The rest is understanding data pipelines, validation techniques, and debugging failed loads. Here is what companies ask at every level.

ETL testing validates that data moves correctly from source systems to the data warehouse — every BI and analytics team depends on it.
Why ETL Testing Interviews Are SQL-Heavy
ETL testing validates that data moves correctly from source systems to the data warehouse. Every BI, analytics, and data engineering team needs ETL testers. The interview tests SQL skills (80% of the job), understanding of ETL processes, data validation techniques, and familiarity with tools like Informatica, Talend, or SSIS.
Companies like Deloitte, Accenture, Cognizant, and every organization with a data warehouse test these skills. This guide covers the actual questions asked — organized by topic, with SQL examples and the depth interviewers expect for different experience levels.
ETL testing is not functional testing with a different name. It is data validation at scale — comparing millions of rows between source and target, verifying transformation logic, and catching the silent data corruption that breaks dashboards downstream.
ETL Fundamentals
These foundational questions appear in every ETL testing interview. Getting them right with concrete examples shows hands-on experience.
Q1: What is ETL? Explain each phase with an example.
Why they ask: Opening question in 90% of ETL interviews. They want a concrete example, not a textbook definition.
-- EXTRACT: Pull data from source systems (databases, flat files, APIs)
SELECT order_id, customer_id, product_id, amount, order_date
FROM sales_orders
WHERE order_date = CURRENT_DATE - 1;
-- TRANSFORM: Apply business rules, cleanse, aggregate
-- Cleansing, business rules (tax = amount * 0.18),
-- aggregation, lookup mapping, type conversion, deduplication
-- LOAD: Insert transformed data into the target warehouse (illustrative)
INSERT INTO warehouse.fact_sales
  (order_id, customer_name, product_name, amount, tax, order_date)
VALUES (transformed_data);
-- Full load: Truncate + reload | Incremental: Only new/changed
-- Tester validates EACH phase:
--   Extract: All source data captured?
--   Transform: Business rules applied correctly?
--   Load: Target data matches expected output?
Q2: What is the difference between ETL and ELT?
Why they ask: Cloud warehouses have made ELT common. They want to know when to use each and the testing implications.
-- ETL: Transform BEFORE loading. Tools: Informatica, Talend, SSIS
--   Use when: Limited target compute, heavy cleansing, compliance
--   Flow: Source → Staging (transform) → Target warehouse
-- ELT: Load raw FIRST, transform in warehouse. Tools: dbt, Glue
--   Use when: Cloud warehouse, raw data needed for auditing
--   Flow: Source → Raw layer → Transform in warehouse → Target
-- Testing: ETL validates staging + target
--          ELT validates raw layer + transformed layer
-- ELT is standard in cloud; ETL dominates on-premise
Q3: What types of testing do you perform in ETL?
Why they ask: Tests whether you understand the full scope beyond row count comparisons.
-- 1. DATA COMPLETENESS: All source records reach target
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM target_table;
-- 2. DATA TRANSFORMATION: Business rules applied correctly
-- 3. DATA QUALITY: NULLs, duplicates, referential integrity
SELECT * FROM fact_sales
WHERE customer_id NOT IN (SELECT customer_id FROM dim_customer);
-- 4. PERFORMANCE: Completes within batch window
-- 5. REGRESSION: After code changes, existing data intact
-- 6. INCREMENTAL LOAD: Only new/modified records loaded
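The referential-integrity check above can be exercised end-to-end against an in-memory SQLite database. This is a minimal sketch; the table names (`dim_customer`, `fact_sales`) mirror the example, but the data is invented for the demo. It uses `NOT EXISTS` rather than `NOT IN`, since `NOT IN` silently returns no rows when the subquery contains a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE fact_sales (order_id INTEGER, customer_id INTEGER)")
cur.executemany("INSERT INTO dim_customer VALUES (?)", [(10,), (11,)])
# Order 3 references customer 99, which has no dimension row
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 10), (2, 11), (3, 99)])

# Referential-integrity check: fact rows whose customer is missing
# from the dimension table (orphan foreign keys)
orphans = cur.execute("""
    SELECT order_id, customer_id FROM fact_sales f
    WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                      WHERE d.customer_id = f.customer_id)
""").fetchall()
print(orphans)
```

In a healthy load the result is empty; any row returned is a defect to raise.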
Data Validation
Data validation is the core of ETL testing. These questions test your ability to write SQL that catches missing records, broken transformations, and data quality problems.
Q1: How do you validate data completeness?
Why they ask: First check in any ETL validation. Interviewers want you to go beyond simple row counts.
-- Row count comparison
SELECT 'source' AS system, COUNT(*) AS row_count FROM source_db.orders
UNION ALL
SELECT 'target', COUNT(*) FROM warehouse.fact_orders;
-- NULL checks on critical columns
SELECT COUNT(*) AS total,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer,
       SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amount
FROM warehouse.fact_orders;
-- Duplicate detection
SELECT order_id, COUNT(*) AS dup_count
FROM warehouse.fact_orders
GROUP BY order_id HAVING COUNT(*) > 1;
-- Count by date to catch partial loads
SELECT order_date, COUNT(*) FROM source_db.orders GROUP BY order_date;
-- Compare with target — mismatch pinpoints where records dropped
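These completeness checks can be run end-to-end with Python's built-in sqlite3 module. The sketch below uses made-up tables (`src_orders`, `tgt_orders`) seeded with a deliberate defect: the target drops one record and duplicates another, so the totals match while the data does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE tgt_orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                [(1, 10, 99.5), (2, 11, 50.0), (3, None, 25.0)])
# Target is missing order 3 and duplicates order 1 — both should be caught
cur.executemany("INSERT INTO tgt_orders VALUES (?, ?, ?)",
                [(1, 10, 99.5), (1, 10, 99.5), (2, 11, 50.0)])

# Row count comparison
src_count = cur.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
tgt_count = cur.execute("SELECT COUNT(*) FROM tgt_orders").fetchone()[0]
print("source:", src_count, "target:", tgt_count)

# Duplicate detection on the primary key
dups = cur.execute("""
    SELECT order_id, COUNT(*) FROM tgt_orders
    GROUP BY order_id HAVING COUNT(*) > 1
""").fetchall()
print("duplicates:", dups)

# NULL check on a critical column
nulls = cur.execute(
    "SELECT COUNT(*) FROM src_orders WHERE customer_id IS NULL"
).fetchone()[0]
print("null customer_ids in source:", nulls)
```

Note that both tables have 3 rows, so the count comparison alone passes; only the duplicate check exposes the defect. This is exactly the offsetting-error trap interviewers probe for.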
Q2: How do you validate data transformations?
Why they ask: Transformation testing is the hardest part. They want specific examples — date conversions, currency calculations, lookup mappings.
-- Date format: VARCHAR '15-03-2024' → DATE 2024-03-15
-- (WHERE keeps only failing rows; an empty result means PASS)
SELECT s.order_id, s.order_date_str, t.order_date,
       STR_TO_DATE(s.order_date_str, '%d-%m-%Y') AS expected_date
FROM source_db.orders s
JOIN warehouse.fact_orders t ON s.order_id = t.order_id
WHERE t.order_date != STR_TO_DATE(s.order_date_str, '%d-%m-%Y');
-- Currency conversion: INR → USD
SELECT s.order_id, s.amount_inr, t.amount_usd,
ROUND(s.amount_inr / r.exchange_rate, 2) AS expected_usd
FROM source_db.orders s
JOIN warehouse.fact_orders t ON s.order_id = t.order_id
JOIN ref_exchange_rates r ON s.currency = r.currency_code
WHERE t.amount_usd != ROUND(s.amount_inr / r.exchange_rate, 2);
-- Lookup mapping: region_code → region_name
SELECT s.order_id, s.region_code, t.region_name, m.expected_region_name
FROM source_db.orders s
JOIN warehouse.fact_orders t ON s.order_id = t.order_id
JOIN mapping_table m ON s.region_code = m.region_code
WHERE t.region_name != m.expected_region_name;
Q3: What is data profiling and why is it important?
Why they ask: Data profiling happens before ETL testing. It reveals data types, NULL percentages, value distributions, and outliers.
-- Column-level profiling
SELECT COUNT(*) AS total_rows,
COUNT(DISTINCT customer_id) AS unique_customers,
ROUND(SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)
* 100.0 / COUNT(*), 2) AS null_pct
FROM source_db.orders;
-- Value distribution
SELECT region, COUNT(*) AS frequency,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct
FROM source_db.orders GROUP BY region ORDER BY frequency DESC;
-- Outlier detection
SELECT * FROM source_db.orders WHERE amount <= 0 OR amount > 1000000;
-- Profiling reveals quality issues before ETL runs, defines
-- NULL thresholds, identifies edge cases for test scenarios
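The profiling queries above translate directly to a runnable sketch with sqlite3. Table and column names are illustrative, and one bad row (NULL key, negative amount) is seeded so both checks have something to find:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (customer_id INTEGER, region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (10, "NORTH", 120.0),
    (11, "NORTH", 80.0),
    (12, "SOUTH", 45.0),
    (None, "SOUTH", -5.0),  # NULL key and negative amount: both should surface
])

# Column-level profiling: totals, distinct keys, NULL percentage
total, distinct, null_pct = cur.execute("""
    SELECT COUNT(*),
           COUNT(DISTINCT customer_id),
           ROUND(SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)
                 * 100.0 / COUNT(*), 2)
    FROM orders
""").fetchone()
print(total, distinct, null_pct)

# Outlier detection on the amount column
outliers = cur.execute(
    "SELECT region, amount FROM orders WHERE amount <= 0 OR amount > 1000000"
).fetchall()
print(outliers)
```

The NULL percentage (25% here) is the kind of number profiling gives you before ETL runs, so you can agree thresholds with the business instead of discovering them in production.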
SQL is the primary tool for ETL testers — every validation query you write catches data issues before they reach production dashboards.
SQL for ETL Testing
SQL is the backbone of ETL testing. Interviewers expect you to write these queries on a whiteboard or live SQL editor.
Q1: Write a query to compare row counts between source and target.
Why they ask: They want more than COUNT(*). Compare counts by key dimensions to pinpoint missing records.
-- Basic comparison
SELECT 'source' AS system, COUNT(*) FROM source_db.sales_orders
UNION ALL
SELECT 'target', COUNT(*) FROM warehouse.fact_sales;
-- Better: Compare by date to find partial load failures
SELECT COALESCE(s.order_date, t.order_date) AS check_date,
       COALESCE(s.cnt, 0) AS source_count,
       COALESCE(t.cnt, 0) AS target_count,
       COALESCE(s.cnt, 0) - COALESCE(t.cnt, 0) AS difference
FROM (SELECT order_date, COUNT(*) AS cnt
      FROM source_db.sales_orders GROUP BY order_date) s
FULL OUTER JOIN (SELECT order_date, COUNT(*) AS cnt
      FROM warehouse.fact_sales GROUP BY order_date) t
  ON s.order_date = t.order_date
WHERE COALESCE(s.cnt, 0) != COALESCE(t.cnt, 0);
-- Note: MySQL has no FULL OUTER JOIN; emulate it with
-- a LEFT JOIN query UNION a RIGHT JOIN query
-- Total count match can hide offsetting errors
Q2: Write a query to find duplicate records in the target table.
Why they ask: Duplicates are the most common ETL defect — they inflate metrics and break reports.
-- Duplicates on primary key
SELECT order_id, COUNT(*) AS occurrence
FROM warehouse.fact_sales
GROUP BY order_id HAVING COUNT(*) > 1;
-- Duplicates on composite business key
SELECT customer_id, product_id, order_date, COUNT(*)
FROM warehouse.fact_sales
GROUP BY customer_id, product_id, order_date HAVING COUNT(*) > 1;
-- Summary: total vs unique vs duplicate
SELECT COUNT(*) AS total,
       COUNT(DISTINCT order_id) AS unique_orders,
       COUNT(*) - COUNT(DISTINCT order_id) AS duplicates
FROM warehouse.fact_sales;
-- Causes: missing dedup step, job retry, source dupes, wrong join
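Interviewers sometimes follow up with "and how would you fix it?" A minimal sketch with sqlite3: detect the duplicates, then keep one physical row per key. The keep-first-rowid remediation is SQLite-specific (MySQL and others have their own idioms), and the table is invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (order_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 10.0), (1, 10.0), (2, 20.0), (3, 30.0), (3, 30.0)])

# Detect duplicates on the primary key
dups = cur.execute("""
    SELECT order_id, COUNT(*) FROM fact_sales
    GROUP BY order_id HAVING COUNT(*) > 1
    ORDER BY order_id
""").fetchall()
print(dups)

# One common remediation: keep the first physical row per key
# (rowid is SQLite's implicit row identifier)
cur.execute("""
    DELETE FROM fact_sales
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM fact_sales GROUP BY order_id)
""")
remaining = cur.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(remaining)
```

In a real pipeline the fix belongs in the ETL job (dedup step or upsert), not a manual DELETE, but knowing the remediation query shows depth.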
Q3: Write a query to validate a transformation rule.
Why they ask: Checks if you can translate a business rule into a SQL validation query.
-- Find records where transformation is incorrect
SELECT s.customer_id, s.first_name, s.last_name,
t.full_name AS actual,
CONCAT(s.first_name, ' ', s.last_name) AS expected
FROM source_db.customers s
JOIN warehouse.dim_customer t ON s.customer_id = t.customer_id
WHERE t.full_name != CONCAT(TRIM(s.first_name), ' ', TRIM(s.last_name))
OR s.first_name IS NULL OR s.last_name IS NULL;
-- Summary: count pass vs fail
SELECT COUNT(*) AS total,
SUM(CASE WHEN t.full_name = CONCAT(TRIM(s.first_name), ' ', TRIM(s.last_name))
THEN 1 ELSE 0 END) AS passed,
SUM(CASE WHEN t.full_name != CONCAT(TRIM(s.first_name), ' ', TRIM(s.last_name))
THEN 1 ELSE 0 END) AS failed
FROM source_db.customers s
JOIN warehouse.dim_customer t ON s.customer_id = t.customer_id;
Q4: How do you validate incremental loads?
Why they ask: Most production ETL jobs are incremental. You must verify existing records are untouched, new records added, and updates applied.
-- New records loaded? (result should be EMPTY)
SELECT s.order_id FROM source_db.orders s
LEFT JOIN warehouse.fact_orders t ON s.order_id = t.order_id
WHERE s.modified_date > '2024-01-15 00:00:00' AND t.order_id IS NULL;
-- Updated records reflect latest values?
SELECT s.order_id, s.amount AS source, t.amount AS target
FROM source_db.orders s
JOIN warehouse.fact_orders t ON s.order_id = t.order_id
WHERE s.modified_date > '2024-01-15 00:00:00' AND s.amount != t.amount;
-- Old records untouched?
SELECT t.order_id FROM warehouse.fact_orders t
WHERE t.load_timestamp > '2024-01-15 00:00:00'
AND t.order_id NOT IN (SELECT order_id FROM source_db.orders
WHERE modified_date > '2024-01-15 00:00:00');
-- Duplicates after load? (common: insert instead of upsert)
SELECT order_id, COUNT(*) FROM warehouse.fact_orders
GROUP BY order_id HAVING COUNT(*) > 1;
Practice SQL Validation Questions Live
Reading SQL queries is not the same as writing them under interview pressure. Practice with timed mock interviews that test ETL validation queries, data pipeline concepts, and debugging transformation failures.
TRY INTERVIEW PRACTICE →
Tools & Frameworks
Tool questions test hands-on experience with real ETL platforms. Interviewers want depth in at least one tool and awareness of the ecosystem.
Q1: What ETL tools have you worked with?
Why they ask: They want hands-on experience. Be specific — version, project type, your role.
-- INFORMATICA POWERCENTER: Enterprise ETL, large organizations
--   Testing: Validate mapping logic, check session logs
-- TALEND: Open-source, drag-and-drop interface
--   Testing: Validate tMap transformations, check reject files
-- SSIS: Microsoft ETL, common in .NET environments
--   Testing: Validate package execution, check error outputs
-- AWS GLUE: Serverless ETL, uses PySpark
-- APACHE SPARK: Big data processing, Spark SQL
-- Pick 1-2 tools, describe a specific project:
-- "I validated 200+ transformation rules in Informatica
--  using automated SQL scripts" beats "I used Informatica"
Q2: How do you automate ETL test cases?
Why they ask: Manual ETL testing does not scale. With hundreds of tables, automation is essential.
-- 1. SQL SCRIPTS: Store queries, run after each load
CREATE TABLE etl_test_results (
test_id INT AUTO_INCREMENT PRIMARY KEY,
test_name VARCHAR(200),
source_count INT, target_count INT,
status VARCHAR(10),
run_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO etl_test_results (test_name, source_count, target_count, status)
SELECT 'fact_sales row count',
(SELECT COUNT(*) FROM source_db.orders),
(SELECT COUNT(*) FROM warehouse.fact_sales),
CASE WHEN (SELECT COUNT(*) FROM source_db.orders)
= (SELECT COUNT(*) FROM warehouse.fact_sales)
THEN 'PASS' ELSE 'FAIL' END;
-- 2. PYTHON: pandas/SQLAlchemy for programmatic validation
-- 3. SCHEDULING: Airflow DAG or cron after ETL completion
-- 4. FRAMEWORKS: Great Expectations, dbt tests, QuerySurge
Q3: What is data lineage and why does it matter for testing?
Why they ask: Data lineage tracks data from source to target through every transformation. Without it, you cannot trace where issues originated.
-- Lineage: Source → Staging → Transform → Target → Dashboard
-- MySQL orders → staging.raw_orders → staging.transformed_orders
--   → warehouse.fact_sales → Tableau/Power BI
-- Why it matters:
--   TRACEABILITY: Wrong numbers → trace to transformation
--   IMPACT ANALYSIS: Source change → find affected tables
--   TEST COVERAGE: Reveals untested transformations
--   DEBUGGING: Failed records traced back to source
-- Tools: Apache Atlas, Collibra, dbt lineage graph
-- Each row in source-to-target mapping = one test case
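The SQL-script automation described in Q2 above (logging PASS/FAIL into an `etl_test_results` table) can be driven from Python. A minimal sketch with sqlite3; table names and the one-row-short defect are invented for the demo:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_orders (order_id INTEGER)")
cur.execute("CREATE TABLE fact_sales (order_id INTEGER)")
cur.execute("""CREATE TABLE etl_test_results (
    test_name TEXT, source_count INTEGER, target_count INTEGER,
    status TEXT, run_date TEXT)""")
cur.executemany("INSERT INTO src_orders VALUES (?)", [(1,), (2,), (3,)])
cur.executemany("INSERT INTO fact_sales VALUES (?)", [(1,), (2,)])  # one row short

def run_count_test(name, src_sql, tgt_sql):
    """Run a source/target count comparison and log PASS/FAIL."""
    src = cur.execute(src_sql).fetchone()[0]
    tgt = cur.execute(tgt_sql).fetchone()[0]
    status = "PASS" if src == tgt else "FAIL"
    cur.execute("INSERT INTO etl_test_results VALUES (?, ?, ?, ?, ?)",
                (name, src, tgt, status, datetime.now(timezone.utc).isoformat()))
    return status

result = run_count_test("fact_sales row count",
                        "SELECT COUNT(*) FROM src_orders",
                        "SELECT COUNT(*) FROM fact_sales")
print(result)
```

In practice the same function runs once per table after each load, scheduled from Airflow or cron, and the results table becomes your audit trail.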
Scenario Questions
Scenario questions test your real-world debugging skills. Interviewers evaluate your systematic approach — there is no single correct answer.
Q1: Data in the target table does not match the source after an ETL run. How do you investigate?
Why they ask: Most common real-world ETL issue. They want a structured debugging approach.
-- STEP 1: Quantify the mismatch
SELECT COUNT(*) AS mismatched
FROM source_db.orders s
JOIN warehouse.fact_orders t ON s.order_id = t.order_id
WHERE s.amount != t.amount;
-- STEP 2: Check ETL job logs for errors, rejected records
-- STEP 3: Check staging — isolate which phase failed
SELECT s.order_id, s.amount AS source, stg.amount AS staging
FROM source_db.orders s
JOIN staging.orders stg ON s.order_id = stg.order_id
WHERE s.amount != stg.amount;
-- Correct in staging but wrong in target = Load issue
-- Wrong in staging = Extract or Transform issue
-- STEP 4: Validate transformation logic on sample records
-- STEP 5: Check timing — source changing during extraction?
-- STEP 6: Check types — FLOAT vs DECIMAL, VARCHAR truncation
Q2: A nightly ETL job failed at 3 AM. What is your troubleshooting approach?
Why they ask: ETL jobs run overnight and failures need quick resolution before business hours.
-- STEP 1: Check scheduler — which task failed? Upstream OK?
-- STEP 2: Read error logs — common categories:
--   Connection: Source DB unreachable, credentials expired
--   Data: Constraint violation, type mismatch
--   Resource: Out of memory, disk full, timeout
--   Permission: User lacks access to source/target
-- STEP 3: Check source — maintenance? Schema change? Network?
-- STEP 4: Check target — table locked? Disk space?
-- STEP 5: Check volume — unexpected spike causes timeouts
-- STEP 6: Fix root cause, rerun, validate data
-- STEP 7: Add monitoring alerts to prevent recurrence
How to Prepare — By Level
The depth of ETL testing knowledge varies by experience level. Here is what each expects:
Junior (0-2 years) — 1 Week Preparation
Focus on ETL concepts (Extract, Transform, Load), basic SQL validation queries (row counts, NULL checks, duplicate detection), and the difference between full and incremental loads. Be able to write a source-to-target row count comparison. Understand what Informatica or SSIS does at a high level.
Mid-Level (2-5 years) — 2 Weeks Preparation
Add complex SQL (transformation validation with JOINs, incremental load testing, data profiling), tool proficiency in at least one ETL platform, and test automation with SQL scripts or Python. Know slowly changing dimensions (SCD Type 1 and 2). Demonstrate experience with automation and scheduling.
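The SCD Type 2 knowledge mentioned here is usually tested with a concrete check. A minimal runnable sketch, assuming a dimension with `is_current`/`effective_from`/`effective_to` columns (names and data are illustrative): every business key must have exactly one current row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Illustrative SCD Type 2 dimension: history rows plus one current row per key
cur.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER, city TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER)""")
cur.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?, ?)", [
    (1, "Pune",   "2023-01-01", "2024-01-01", 0),
    (1, "Mumbai", "2024-01-01", "9999-12-31", 1),
    (2, "Delhi",  "2023-06-01", "9999-12-31", 1),
    (2, "Noida",  "2024-02-01", "9999-12-31", 1),  # defect: two current rows
])

# SCD2 check: each business key has exactly one current row
violations = cur.execute("""
    SELECT customer_id, COUNT(*) FROM dim_customer
    WHERE is_current = 1
    GROUP BY customer_id HAVING COUNT(*) != 1
""").fetchall()
print(violations)
```

A fuller test suite would also verify that effective date ranges neither overlap nor leave gaps, but the one-current-row check catches the most common SCD2 load defect.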
Senior (5+ years) — 1 Week Preparation
Add data architecture (star schema, snowflake schema, data vault), pipeline optimization, performance testing for large-scale jobs, and test framework design. Be ready to discuss ETL vs ELT trade-offs, cloud migration testing, and real-time streaming pipelines alongside batch ETL.
Practice With Real Interview Simulations
Reading ETL testing questions is not the same as answering them under pressure. Practice with timed mock interviews that test SQL validation queries, data pipeline concepts, and debugging scenarios.
TRY INTERVIEW PRACTICE →
ETL testing is about understanding data flow — knowing where data comes from, what happens during transformation, and how to prove it arrived correctly at the target. The tester who can trace a single record from source to dashboard gets the job.
ETL testing interviews are SQL-heavy because the job is SQL-heavy. Master row count comparisons, learn to validate transformation rules with JOINs, and practice debugging data mismatches systematically. Every validation query you write in practice is one less you will struggle with in the interview.
Prepare for Your ETL Testing Interview
Practice with AI-powered mock interviews, get your resume ATS-ready, and walk into your next data testing interview with confidence.
Free · AI-powered · Instant feedback
Related Reading
Interview Prep
Interview Questions for Data Analyst
SQL, Python, statistics, and what analysts actually get asked
12 min read
Interview Prep
Interview Questions on Excel
Formulas, pivot tables, and what analysts actually get asked
14 min read
Resume Guide
Data Analyst Resume — India
Build a resume that highlights your analytical and SQL skills
10 min read