While the traditional ETL process is the predominant data processing flow in many organizations, there are newer, more exciting data processing methods. One is stream processing - agile and able to deal with real time data on the fly. Another is automated data management - bypassing traditional ETL and using an “ELT” paradigm - Extract, Load, then Transform.
In this page, we’ll explain how to build an ETL process the heavy, traditional way - from data extraction to loading clean, structured data into a data warehouse.
But we’ll also take you beyond the old school, showing how to build ETL pipelines based on stream processing using Kafka, and how to fully automate ETL using Panoply’s data management platform.
What is the ETL Process?
Extract, Transform, Load (ETL) is an automated process which takes raw data, extracts the information required for analysis, transforms it into a format that can serve business needs, and loads it to a data warehouse. ETL typically summarizes data to reduce its size and improve performance for specific types of analysis.
When you build an ETL infrastructure, you must integrate data sources, and carefully plan and test to ensure you transform source data correctly. Below we explain three ways to build a data pipeline: two classic ETL approaches - batch processing and stream processing - and one way to build a data pipeline without using ETL at all.
1. Building an ETL Pipeline with Batch Processing
Follow the process below to build a traditional ETL process, in which you transfer and process data in batches from source databases to a data warehouse. It’s challenging to build an enterprise ETL pipeline from scratch - you will typically rely on ETL tools such as Stitch or Blendo, which simplify and automate much of the process.
Building ETL with batch processing, following ETL best practices, involves:
Reference data - create a dataset that defines the permissible values your data may contain. For example, in a country data field, you can define the list of country codes allowed.
Extract from data sources - the basis for the success of subsequent ETL steps is to extract data correctly. Most ETL systems combine data from multiple source systems, each with its own data organization and format - including relational databases, non-relational databases, XML, JSON, CSV files, etc. Successful extraction converts data into a single format for standardized processing.
Data validation - an automated process confirms whether data pulled from sources has the expected values - for example, in a database of financial transactions from the past year, a date field should contain valid dates within the past 12 months. The validation engine rejects data if it fails the validation rules. You analyze rejected records, on an ongoing basis, to identify what went wrong, correct the source data, or modify extraction to resolve the problem in the next batches.
Transform data - removing extraneous or erroneous data (cleaning), applying business rules, checking data integrity (ensuring that the data was not corrupted at the source, or corrupted by ETL, and that no data was dropped in previous stages), and creating aggregates as necessary. For example, if you must analyze revenue, you can summarize the dollar amount of invoices into a daily or monthly total. You will need to program and test a series of rules or functions that can achieve the required transformations, and run them on the extracted data.
Stage - you will not typically load transformed data directly into the target data warehouse. Data should first enter a staging database, making it easier to roll back if something goes wrong. At this point, you can also generate audit reports for regulatory compliance, or diagnose and repair data problems.
Publish to data warehouse - loading the data to the target tables. Some data warehouses overwrite existing information every time the ETL pipeline loads a new batch - this might happen daily, weekly or monthly. In other cases, ETL can add new data without overwriting, with a timestamp indicating it is new. You must do this carefully to prevent the data warehouse from “exploding” due to disk space and performance limitations.
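The validation, transformation, and staging steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the field names (`country`, `date`, `amount`), the sample country list, and the `run_batch` helper are all hypothetical, standing in for your own schema and rules.

```python
from datetime import date, timedelta
from collections import defaultdict

# Reference data: the set of permissible values (sample list, assumed)
VALID_COUNTRIES = {"US", "DE", "IL"}

def validate(record, today=None):
    """Reject records that fail the reference-data and date-range rules."""
    today = today or date.today()
    if record["country"] not in VALID_COUNTRIES:
        return False
    # e.g. a financial transaction date must fall within the past 12 months
    return today - timedelta(days=365) <= record["date"] <= today

def transform(records):
    """Aggregate invoice amounts into daily revenue totals."""
    daily = defaultdict(float)
    for r in records:
        daily[r["date"]] += r["amount"]
    return dict(daily)

def run_batch(records, today=None):
    """Validate, transform, and hand back staged data plus rejects."""
    valid = [r for r in records if validate(r, today)]
    rejected = [r for r in records if not validate(r, today)]
    staged = transform(valid)  # stage before publishing to the warehouse
    return staged, rejected
```

In a real pipeline, the rejected records would be logged for ongoing analysis, and the staged aggregates would be written to a staging database rather than returned in memory.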
2. Building an ETL Pipeline with Stream Processing
Modern data processes often include real time data - for example, web analytics data from a large ecommerce website. In these use cases, you cannot extract and transform data in large batches: the need arises to perform ETL on data streams. This means that as client applications write data to the data source, the data must be processed, transformed, and saved immediately to the target data store.
Many stream processing tools are available today - including Apache Samza, Apache Storm, and Apache Kafka. The diagram below illustrates an ETL pipeline based on Kafka, described by Confluent:
Building streaming ETL based on Kafka involves:
Extracting data into Kafka - the Confluent JDBC connector pulls each row of the source table and writes it as a key/value pair into a Kafka topic (you can implement a database connection yourself without Confluent’s commercial product). Applications interested in the state of this table read from this topic. As client applications add rows to the source table, Kafka automatically writes them as new messages to the Kafka topic, enabling a real time data stream.
Pulling data from Kafka topics - the ETL application extracts messages from the Kafka topic as Avro records, creates an Avro schema file, deserializes the records, and creates KStream objects from the messages.
Transform data in KStream object - using the Kafka Streams API, the stream processor receives one record at a time, processes it, and can produce one or more output records for downstream processors. These can transform messages one at a time, filter them based on conditions, or perform data operations on multiple messages, for example aggregation.
Load data to other systems - the ETL application still holds the enriched data, and now needs to stream it into target systems, such as a data warehouse or data lake. In Confluent’s example, they propose using their S3 Sink Connector to stream the data to Amazon S3. You can implement integration with other systems, for example, stream data to a Redshift data warehouse using Amazon Kinesis.
What is ELT?
In the ELT process, after you extract data from all your data sources, you immediately start moving it into a centralized data repository. Today’s cloud-based data warehouse and data lake infrastructure supports large storage and scalable compute. So it’s no longer necessary to prevent the data warehouse from “exploding” by keeping data small and summarized. Many organizations today maintain massive data pools in the cloud at low cost, leveraging ELT tools to process the data.
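The key difference is the order of operations: raw data lands in the warehouse first, and the transformation runs inside it as SQL. A minimal sketch, using an in-memory SQLite database as a stand-in for a cloud warehouse (table names and sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for a cloud data warehouse

# Load: land the raw, untransformed rows first (the "L" before the "T")
conn.execute("CREATE TABLE raw_invoices (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_invoices VALUES (?, ?)",
    [("2024-05-01", 100.0), ("2024-05-01", 50.0), ("2024-05-02", 75.0)],
)

# Transform: aggregate inside the warehouse with plain SQL
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT day, SUM(amount) AS revenue
    FROM raw_invoices
    GROUP BY day
""")
totals = dict(conn.execute("SELECT day, revenue FROM daily_revenue ORDER BY day"))
```

Because the raw table stays in the warehouse, you can re-run or revise the transformation later without re-extracting anything from the sources.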
3. Building a Pipeline without ETL Using an Automated Cloud-Based Data Warehouse
New cloud-based data warehouse technology makes it possible to achieve the original goal of ETL without building an ETL system at all.
For example, Panoply’s cloud-based automated data warehouse has end-to-end data management built in. It uses a self-optimizing architecture with machine learning and natural language processing (NLP), which automatically extracts and transforms data to match analytics requirements. Panoply comes pre-integrated with dozens of data sources, including analytics systems, BI tools, databases, social and advertising platforms.
Building a data pipeline without ETL in Panoply involves:
Select data sources and import data - select your data sources from a list, enter credentials and define destination tables, click Collect and Panoply automatically pulls the data for you. Panoply automatically takes care of schemas, data preparation, data cleaning, and more.
Run transformation queries - select a table and run a SQL query against the raw data. You can save the query as a transformation, or export the resulting table into your own system. Panoply supports both simple views and materialized transformation views. You can run several transformations, until you achieve a data format that enables analysis. Panoply’s Query Log allows you to easily roll back to previous processing steps. You shouldn’t be concerned about “ruining” the data - Panoply lets you perform any transformation, but keeps your raw data intact.
Data analysis with BI tools - you can now connect any BI tool such as Tableau or Looker to Panoply and explore the transformed data.
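A transformation query of the kind described above can be sketched generically. This is not Panoply’s actual engine - just an in-memory SQLite illustration of the idea that saving a query as a view transforms the data for analysis while leaving the raw table intact (table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, "click"), (1, "buy"), (2, "click")],
)

# Save the query as a transformation: a view leaves the raw table untouched
conn.execute("""
    CREATE VIEW actions_per_user AS
    SELECT user_id, COUNT(*) AS n_actions
    FROM raw_events
    GROUP BY user_id
""")
result = dict(conn.execute("SELECT user_id, n_actions FROM actions_per_user"))
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

Since the view is just a saved query, you can layer several transformations on top of each other, or drop one and start over, without ever touching the underlying raw data.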
The above process is agile and flexible, allowing you to quickly load data, transform it into a useful form, and use it to perform business analysis. For more details see Getting Started with Panoply.
Out with the Old, in with the New
We have shown three ways to build an ETL process, which you can think of as three stages in the evolution of ETL:
Traditional ETL batch processing - meticulously preparing data and transforming it using a rigid, structured process.
ETL with stream processing - using a modern stream processing framework like Kafka, you can pull data in real time from source, manipulate it on the fly using Kafka’s Stream API, and load it to a target system such as Amazon Redshift.
Automated data pipeline without ETL - we showed how to use our automated data warehouse, Panoply, to pull data from multiple sources, automatically prep it without requiring a full ETL process, and immediately begin analyzing it using BI tools.
Traditional ETL works. But it is complex to build, rigid, and unsuitable for many modern requirements. If your organization and skill set are strongly rooted in the “old ETL”, we strongly recommend you learn more about new technologies to achieve the same objectives more quickly and efficiently. As an ETL or data engineer, you can quickly make the shift in mindset. As an early adopter of stream processing, or even fully automated data warehousing, you have the chance to develop expertise in building next-generation data pipelines.