ETL testing is an old, proud tradition that large organizations have practiced for the past two decades. Because traditional ETL processes are highly complex and extremely sensitive to change, ETL testing is hard.
What is ETL Testing?
Extract Transform Load (ETL) is the process of taking large volumes of data from multiple data sources, modifying and restructuring it for reporting and analytics purposes, and loading it into a data warehouse. ETL testing verifies whether the ETL process is working properly. ETL testing is a crucial part of ETL, because ETL is typically performed on mission-critical data.
There are several types of ETL testing:
- Accuracy testing—verifying that data values are correct.
- Completeness testing—checking whether any parts of the data are missing.
- Validation—confirming that the data hasn’t changed in transition and complies with business rules.
- Metadata testing—ensuring metadata hasn’t changed in transit.
- Syntax testing—checking formally-defined data types.
- Reference testing—comparing data against business dictionaries and master data.
- Interface and performance testing—exercising the ETL system itself.
The ETL Testing Process
Just to get us on the same page, we’ll run down the ETL testing process and a few key ETL testing concepts:
- Business requirements—identifying the project scope, expected outcome, data model and relevant business rules.
- Data sources—checking data source locations and formats, performing a data count and verifying that the columns and data types meet requirements, reviewing keys and removing duplicate data.
- Test cases—identifying key ETL mapping scenarios and creating SQL queries that simulate them. Each ETL test case should be able to replicate one or more ETL mappings and compare them against real output.
- Extract data and run tests—running tests on real extracted data, and fixing bugs, before moving forward with transformation.
- Transform data—when transformations are executed, ensuring target data types and values match the required mappings and business rules.
- Load into data warehouse—confirming that all data was loaded, and that invalid data was rejected and loaded to the appropriate location (e.g. a rejects table).
- ETL testing report—producing a report that informs the organization about the results of ETL testing, where problems were encountered and what was done.
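To make the test-case step above concrete, here is a minimal sketch of a completeness test in the spirit the article describes: an SQL query that compares source row counts against the warehouse. All table and column names (`src_orders`, `dw_orders`) are hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3

# Minimal sketch of an ETL completeness test: compare the row count in a
# (hypothetical) source table against the loaded warehouse table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_orders (id INTEGER, amount REAL)")
cur.execute("CREATE TABLE dw_orders (id INTEGER, amount REAL)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])
cur.executemany("INSERT INTO dw_orders VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])

src_count = cur.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
dw_count = cur.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
assert src_count == dw_count, f"completeness check failed: {src_count} != {dw_count}"
print("completeness check passed")
```

In a real project the same query pattern would run against the production source and target databases, with the comparison wired into a test runner.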
Responsibilities of an ETL Tester
What does an ETL tester do?
- Writing ETL test cases—crafting SQL queries that can simulate important parts of the ETL process.
- Verifying source system tables—displaying expertise in data sources, their meaning, and how they contribute to the final data in the data warehouse.
- Applying transformation logic—ETL testers are qualified ETL engineers who can run and design transformation scripts.
- Loading data—loading data into staging environments to enable testing.
- ETL tool testing—checking and troubleshooting the underlying ETL system.
- Testing end user reports and dashboards—ensuring that business analysts and data scientists have a complete and accurate view of the data they need.
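The first responsibility above, writing test cases that simulate ETL mappings, can be sketched as follows. The mapping (`total = qty * price`) and all table names are hypothetical; the test replicates the mapping in SQL and diffs it against what was actually loaded.

```python
import sqlite3

# Sketch of a transformation test: re-derive one (hypothetical) ETL mapping,
# total = qty * price, and find target rows that disagree with it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staging (id INTEGER, qty INTEGER, price REAL)")
cur.execute("CREATE TABLE target (id INTEGER, total REAL)")
cur.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                [(1, 2, 5.0), (2, 3, 4.0)])
cur.executemany("INSERT INTO target VALUES (?, ?)",
                [(1, 10.0), (2, 12.0)])

# Rows where the replicated mapping disagrees with the loaded target.
mismatches = cur.execute("""
    SELECT s.id FROM staging s
    JOIN target t ON t.id = s.id
    WHERE ABS(s.qty * s.price - t.total) > 1e-9
""").fetchall()
assert not mismatches, f"mapping mismatch for ids: {mismatches}"
print("mapping check passed")
```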
Challenges of ETL Testing
The primary challenge of ETL testing is its complexity. ETL processes at large enterprises can comprise hundreds or even thousands of steps, along with a large codebase of custom transformations. This complex environment is difficult to understand and comprehensively test.
Additional difficulties encountered by ETL testers include loss or corruption of data, incorrect or incomplete source data, unstable testing environments, and large volumes of historical data, which make it difficult to predict the results of ETL in the target data warehouse.
Becoming a Great ETL Tester: 8 ETL Best Practices
1. Emphasize Analysis of the Data
It’s extremely important to understand source data and what happens to it during the ETL process. A clear understanding of ETL mappings and transformations, as well as a full analysis of the state of source data, can solve many ETL problems early on.
2. Clean Bad Data in the Source System
As much as possible, don’t let data quality issues enter the ETL pipeline. Diagnose them and work with data custodians to clean data at its source.
3. Create Multiple Tests for Data Validation
Take nothing for granted. At every stage of the data pipeline, create tests that validate the data that goes in, and ensure the data that goes out underwent the correct transformations. Creating a system of “checkpoints” at which data stops for testing allows you to quickly diagnose and solve problems.
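One way to picture such a system of checkpoints: each pipeline stage registers validation checks that data must pass before moving on. This is only an illustrative sketch; the stage names and check functions are invented for the example.

```python
# Sketch of a "checkpoint" system: each stage lists checks that data must
# pass before it continues down the pipeline. Names are illustrative.
def not_null(rows, field):
    return all(r.get(field) is not None for r in rows)

def positive(rows, field):
    return all(r[field] > 0 for r in rows)

CHECKPOINTS = {
    "post-extract": [lambda rows: not_null(rows, "id")],
    "post-transform": [lambda rows: not_null(rows, "id"),
                       lambda rows: positive(rows, "amount")],
}

def run_checkpoint(stage, rows):
    failed = [i for i, check in enumerate(CHECKPOINTS[stage]) if not check(rows)]
    if failed:
        raise ValueError(f"{stage}: checks {failed} failed")
    return True

rows = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}]
print(run_checkpoint("post-extract", rows))
print(run_checkpoint("post-transform", rows))
```

Because each checkpoint names its stage, a failure immediately tells you where in the pipeline the bad data entered.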
4. Find a Great ETL Tool that Supports Your Data Sources
The primary value of ETL tools for testing is that they can automatically generate the SQL scripts that test your transformations, which is a huge time saver. However, each data source has its peculiarities, and some transformations are non-standard or unusually complex. Find a powerful ETL system that supports your data sources, and test them against sample data to see if it can generate the required transformations effectively.
5. Don’t Forget to Monitor ETL Jobs
Just because ETL jobs are running doesn’t mean they’re running correctly. A small change, such as a shift in data syntax, a new set of values or an additional column, can cause ETL batches to break. Catching these problems early, solving them and resuming data flow is critical to maintaining stakeholders’ confidence in the ETL process.
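A simple guard against the "additional column" failure mode mentioned above is a schema-drift check run before each batch. This is a sketch under the assumption that batches arrive as dicts; the expected column names are hypothetical.

```python
# Sketch of a schema-drift guard: before a batch runs, compare incoming
# columns to what the ETL job expects. Column names are hypothetical.
EXPECTED_COLUMNS = {"id", "country", "amount"}

def check_schema(batch):
    """Return (extra, missing) column sets per row that drifted."""
    drift = []
    for row in batch:
        extra = set(row) - EXPECTED_COLUMNS
        missing = EXPECTED_COLUMNS - set(row)
        if extra or missing:
            drift.append((sorted(extra), sorted(missing)))
    return drift

# A batch with an unexpected extra "currency" column is flagged.
batch = [{"id": 1, "country": "DE", "amount": 12.5, "currency": "EUR"}]
print(check_schema(batch))  # [(['currency'], [])]
```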
6. Load Incrementally to Deal with Scale
When dealing with huge volumes of historical data in a data warehouse, the only way to cope and ensure reasonable ETL performance is incremental load. Work with the designers of the ETL process to ensure that new data is loaded incrementally, to make all stages of ETL testing and execution easier to run and troubleshoot.
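A common way to implement incremental load is a high-watermark query: only rows newer than the timestamp recorded at the last successful run are pulled. The sketch below assumes a text timestamp column; table names are hypothetical.

```python
import sqlite3

# Sketch of incremental load via a high-watermark: only rows updated after
# the last successful load are extracted. Table names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (id INTEGER, updated_at TEXT)")
cur.execute("CREATE TABLE dw (id INTEGER, updated_at TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-02-01"), (3, "2024-03-01")])

last_watermark = "2024-01-15"  # persisted from the previous run
new_rows = cur.execute(
    "SELECT id, updated_at FROM src WHERE updated_at > ?",
    (last_watermark,)).fetchall()
cur.executemany("INSERT INTO dw VALUES (?, ?)", new_rows)

loaded = cur.execute("SELECT COUNT(*) FROM dw").fetchone()[0]
print(loaded)  # 2 -- only the rows newer than the watermark
```

After a successful run, the watermark would be advanced to the newest `updated_at` loaded, so the next run starts where this one ended.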
7. ETL Tests are Code: Invest in Error Handling, Logging and Alerting
An ETL test is not something you create and throw away. ETL test cases are an important codebase that is used repeatedly by your organization. Treat your test cases with respect—implement error handling for unexpected conditions, log everything and create informative alerts and notifications.
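In practice, treating test cases as code can mean as little as a wrapper that logs every result and distinguishes an assertion failure from an unexpected crash. A minimal sketch, with illustrative test names:

```python
import logging

# Sketch: wrap each ETL test case with error handling and logging so a
# failure produces an actionable record instead of a silent crash.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("etl-tests")

def run_test(name, fn):
    try:
        fn()
        log.info("PASS %s", name)
        return True
    except AssertionError as exc:
        log.error("FAIL %s: %s", name, exc)  # hook alerting/notifications here
        return False
    except Exception:
        log.exception("ERROR %s: unexpected failure", name)
        return False

def failing_test():
    assert False, "off by 3"

print(run_test("row_count", lambda: None))  # True
print(run_test("totals", failing_test))     # False
```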
8. Performance is Everything
Especially in large ETL deployments, everything rises and falls on performance. An ETL job that takes too long to run disrupts data flows, delays updates, upsets decision makers, and also makes it difficult to run tests on realistic volumes of data. Monitor performance and understand where the bottlenecks are—optimize your scripts or upgrade systems when necessary.
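Bottleneck hunting usually starts with per-stage timing. One lightweight approach is a decorator that reports how long each stage took; the stage below is a toy example.

```python
import time

# Sketch: time each ETL stage to locate bottlenecks. Stage names are
# illustrative; a real pipeline would feed these timings to monitoring.
def timed(stage):
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            print(f"{stage}: {elapsed:.3f}s")
            return result
        return inner
    return wrap

@timed("transform")
def transform(rows):
    return [r * 2 for r in rows]

print(transform([1, 2, 3]))  # prints a timing line, then [2, 4, 6]
```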
ETL Testers in a Post-ETL World
ETL testing is a notoriously difficult job. But it doesn’t have to be. ETL testers have exceptional data analysis, data quality and data manipulation expertise that can have a huge impact on enterprise data projects. Instead of being the “policeman” that regulates traffic down the ETL highway, ETL testers could be doing much more.
Modern data architectures are transitioning towards automated data management. For example, Panoply is a next-generation automated data warehouse based on the Extract Load Transform (ELT) paradigm.
Panoply doesn’t require ETL at all and can load source data as is, directly into the data warehouse. It uses machine learning and natural language processing to understand data schemas and to clean and optimize data for OLAP-style analysis. It then allows analysts and data engineers to perform ad hoc, multi-step transformations on the data, to prepare it for just about any analysis. Expert ETL testers can use this infrastructure to set up data for analysis in minutes instead of weeks or months.
What are data projects like in a world without ETL? Get a free trial of Panoply and find out.