Data Warehouse Infrastructure: Full vs Incremental Loading in ETL
One of the end goals of an effective ETL process and ETL Data Warehouse is the ability to reliably query data, obtain insights, and generate visualizations. An ETL Data Warehouse holds a number of advantages for organizations, allowing them to gather all of their data across the organization (think ERP, CRM, payment information, sales figures), query it, and use it to find patterns and relationships, identify inefficiencies, create reports, and evaluate KPIs.
Traditionally, once data is loaded to the ETL Data Warehouse, it can be viewed using tools such as Tableau and Qlik.
One of the most critical elements of the ETL process is the flow of data into the ETL Data Warehouse.
Transforming and Staging the Source Data
With data collected and stored in many different systems, each with its own way of storing it, the process of collecting and collating this data, and making it useful for end users, is where ETL comes in.
The data is first extracted from these different systems, each with its own format. It is then transformed into a standardized format that is usable across the system. This stage consists primarily of removing extraneous or erroneous data, applying rules, and checking data integrity.
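As a sketch, a transform step along these lines might look as follows; the field names and rules here are purely illustrative assumptions, not a prescribed schema:

```python
from datetime import datetime

def transform(raw_records):
    """Normalize field names, coerce types, and drop rows that fail checks."""
    clean = []
    for row in raw_records:
        try:
            clean.append({
                # one source calls the field "cust_id", another "customer_id"
                "customer_id": int(row.get("cust_id") or row["customer_id"]),
                "amount": round(float(row["amount"]), 2),
                "sold_at": datetime.fromisoformat(row["sold_at"]).isoformat(),
            })
        except (KeyError, ValueError, TypeError):
            continue  # erroneous record: a real pipeline would log it
    return clean

raw = [
    {"cust_id": "42", "amount": "19.99", "sold_at": "2024-01-15T10:30:00"},
    {"customer_id": 7, "amount": "not-a-number", "sold_at": "2024-01-15"},
]
print(transform(raw))  # only the first record survives the integrity checks
```
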
Then there is staging: usually, transformed data is not loaded directly into the target ETL Data Warehouse. Typically, data enters a staging database first, so that if something goes wrong, it is easier to roll back. It is also at this point that audit reports can be generated for regulatory compliance, and problems can be identified and corrected.
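A minimal staging pattern can be sketched with SQLite; the table names and the validation rule are assumptions for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE sales_staging (id INTEGER PRIMARY KEY, amount REAL)")

def publish(conn, rows):
    """Load into staging first; copy to the target only if checks pass."""
    try:
        with conn:  # one transaction: rolls back automatically on error
            conn.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
            bad = conn.execute(
                "SELECT COUNT(*) FROM sales_staging WHERE amount < 0"
            ).fetchone()[0]
            if bad:
                raise ValueError(f"{bad} rows failed validation")
            conn.execute("INSERT INTO sales SELECT * FROM sales_staging")
            conn.execute("DELETE FROM sales_staging")
        return True
    except ValueError:
        return False  # staging caught the problem; target untouched

print(publish(conn, [(1, 9.99), (2, -5.0)]))  # False: bad row, rolled back
print(publish(conn, [(1, 9.99), (2, 5.00)]))  # True: published to target
```

Because validation happens in staging inside a transaction, a bad batch never reaches the target tables.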
Finally, data is published to the ETL Data Warehouse and loaded to the target tables. However, if there has been an error, a corruption of data, a duplication, or missing records, the reliability of the ETL Data Warehouse is compromised. Data loaded will consist of historical data – records that already exist, such as transactions for the previous year – as well as new data that is constantly added.
This is where ETL Data Warehouse testing comes in, to ensure that the data in the ETL Data Warehouse is valid, accurate and complete. With the ETL process, data flows through a pipeline before reaching the ETL Data Warehouse. The entire ETL pipeline must be tested to ensure each type of data is transformed or copied as expected.
But how does data from the ETL process actually flow into the ETL Data Warehouse?
The Difference Between Full and Incremental Loading
There are two primary methods to load data into a warehouse:
Full load: with a full load, the entire dataset in the target is completely replaced (i.e. deleted and reloaded) with the new, updated dataset. No additional information, such as timestamps, is required to determine what has changed.
For example, take a store that uploads all of its sales through the ETL process in data warehouse at the end of each day. Let’s say 5 sales were made on a Monday, so that on Monday night a table of 5 records would be uploaded. Then, on Tuesday, another 3 sales were made which need to be added. So on Tuesday night, assuming a full load, Monday’s 5 records as well as Tuesday’s 3 records are uploaded – an inefficient system, although relatively easy to set up and maintain. While this example is overly simplified, the principle is the same.
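The store example above can be sketched as a truncate-and-reload, using SQLite purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def full_load(conn, all_rows):
    """Full load: wipe the target and reload the entire dataset."""
    with conn:
        conn.execute("DELETE FROM sales")  # everything is replaced
        conn.executemany("INSERT INTO sales VALUES (?, ?)", all_rows)

monday = [(1, 10.0), (2, 12.5), (3, 8.0), (4, 20.0), (5, 15.0)]
tuesday_new = [(6, 9.0), (7, 11.0), (8, 14.0)]

full_load(conn, monday)                # Monday night: 5 rows loaded
full_load(conn, monday + tuesday_new)  # Tuesday night: all 8 rows reloaded
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 8
```

Note that Tuesday's load rewrites Monday's five rows as well as the three new ones, which is exactly the inefficiency described above.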
Incremental load: only the difference between the target and source data is loaded through the ETL process. There are two types of incremental loads, depending on the volume of data you're loading: streaming incremental load and batch incremental load.
Following the previous example, the store that made 3 sales on Tuesday will load only the additional 3 records to the sales table, instead of reloading all records. This has the advantage of saving time and resources, but increases complexity.
Incremental loading is, of course, much faster than a full load. The main drawback to this type of loading is maintainability: unlike a full load, with an incremental load you can't simply re-run the entire load if there's an error. In addition, files need to be loaded in order, so a single error compounds as other data queues up behind it.
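Continuing the same example, an incremental load typically tracks a high-water mark so only new records are loaded. Using the `id` column as the watermark is an assumption for illustration; real pipelines often use a timestamp:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
watermark = 0  # highest id already loaded

def incremental_load(conn, source_rows):
    """Load only rows newer than the watermark, then advance it."""
    global watermark
    delta = [r for r in source_rows if r[0] > watermark]
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?)", delta)
    if delta:
        watermark = max(r[0] for r in delta)
    return len(delta)

source = [(1, 10.0), (2, 12.5), (3, 8.0), (4, 20.0), (5, 15.0)]
print(incremental_load(conn, source))  # Monday: 5 rows loaded
source += [(6, 9.0), (7, 11.0), (8, 14.0)]
print(incremental_load(conn, source))  # Tuesday: only the 3 new rows
```
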
Today, organizations are moving away from processing and loading data in large batches, preferring – or driven by business needs – to process in real-time using stream processing, meaning that as client applications write data to the data source, data is treated, transformed and saved to the target data store. Today, tools exist to enable this process, such as Apache Samza, Apache Storm, and Apache Kafka. This eliminates many of the drawbacks of traditional processing and loading, increases speed and decreases complexity.
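The record-at-a-time idea can be illustrated with a toy callback, independent of any particular streaming tool; the event shape and sink are assumptions for illustration:

```python
# Each event is transformed and written to the target as it arrives,
# instead of waiting for a nightly batch.
target = []

def on_event(event):
    """Called once per incoming record, e.g. by a stream-consumer callback."""
    transformed = {"id": event["id"], "amount_cents": round(event["amount"] * 100)}
    target.append(transformed)  # a real sink would be a warehouse write

for event in [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.5}]:
    on_event(event)

print(target)
```
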
Challenges With ETL and Data Warehouse
When it comes to loading data through the ETL process into the ETL Data Warehouse itself, particularly with incremental loading, a number of challenges are encountered.
Monitoring: as data is extracted from disparate sources and transformed, there are bound to be errors or anomalies. These can be caused by such things as the expiration of API credentials, or difficulty in communicating with an API. These errors or anomalies have to be identified and rectified as quickly as possible.
Incompatibility: new records can be added that render the existing data invalid; for example, a field that is expected to be an integer receives a date value. This is particularly problematic when real-time data is being added: end users querying the data receive incorrect or incomplete results, and a bottleneck is created as new records cannot be added.
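A hypothetical pre-load compatibility check along these lines, with an illustrative schema, can reject a bad record before it blocks the pipeline:

```python
# Expected target schema (illustrative): field name -> expected Python type.
SCHEMA = {"order_id": int, "quantity": int, "placed_on": str}

def check(record):
    """Return a list of type mismatches; empty means the record is loadable."""
    return [
        f"{field}: expected {typ.__name__}, got {type(record.get(field)).__name__}"
        for field, typ in SCHEMA.items()
        if not isinstance(record.get(field), typ)
    ]

print(check({"order_id": 101, "quantity": 2, "placed_on": "2024-01-15"}))
# a date value landed in an integer field:
print(check({"order_id": 102, "quantity": "2024-01-15", "placed_on": "x"}))
```
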
Order: data pipelines are often distributed systems, so that availability is maximized. This can result in data being processed in a different order than it was received, which is especially pertinent if data is being updated or deleted.
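One common way to cope with out-of-order updates is a last-writer-wins rule keyed on a sequence number; a toy sketch, with illustrative keys and values:

```python
state = {}  # key -> (sequence number, value) currently held by the target

def apply(key, seq, value):
    """Apply an update only if it is newer than what the target holds."""
    if key in state and state[key][0] >= seq:
        return False  # a stale update arrived late; ignore it
    state[key] = (seq, value)
    return True

apply("order-1", 2, "shipped")  # the later event arrives first
apply("order-1", 1, "created")  # the earlier event arrives second: ignored
print(state["order-1"])         # the target keeps the newest state
```
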
Dependencies: understanding dependencies between processes, or sub-processes, is critical when it comes to ETL management. For example, will you execute process 2 if process 1 failed? This becomes more complex as processes and sub-processes increase.
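A minimal dependency gate might look like this; the step names are illustrative, and real schedulers are far more sophisticated:

```python
def run_pipeline(steps, deps):
    """steps: name -> callable returning True/False; deps: name -> [names].
    Assumes steps are listed in dependency order."""
    status = {}
    for name, fn in steps.items():
        if all(status.get(d) for d in deps.get(name, [])):
            status[name] = fn()
        else:
            status[name] = None  # skipped: an upstream step failed
    return status

steps = {"extract": lambda: True,
         "transform": lambda: False,  # simulate a failure in process 1's child
         "load": lambda: True}
deps = {"transform": ["extract"], "load": ["transform"]}
print(run_pipeline(steps, deps))
# extract ran, transform failed, so load was skipped rather than executed
```
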
Reconciliation: another key challenge is the reconciliation process to ensure data in an ETL Data Warehouse is correct and consistent. While ETL testing should be done regularly, Data Warehouse reconciliation is a continuous process.
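A basic reconciliation pass can compare row counts and totals between source and target; the table names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE tgt (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, 10.0), (2, 5.0)])
conn.executemany("INSERT INTO tgt VALUES (?, ?)", [(1, 10.0)])  # one row missing

def reconcile(conn):
    """Compare aggregate counts and sums; non-zero differences need review."""
    src = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM src").fetchone()
    tgt = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM tgt").fetchone()
    return {"rows_missing": src[0] - tgt[0], "amount_diff": src[1] - tgt[1]}

print(reconcile(conn))  # flags the missing row and the amount discrepancy
```

Aggregate checks like this are cheap enough to run continuously, with row-level comparison reserved for batches the aggregates flag.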
These challenges have to be understood, mitigated, and managed.
An example of managing this is finding the balance between parallel and serial processing.
Serial processing is when one task is completed at a time, and tasks are run by the processor in a sequence. So one task can start only when the previous one has ended. Parallel processing, as its name implies, is a type of processing in which multiple tasks are completed at the same time.
When it comes to loading data, if processes have no dependencies on one another, they can be processed in parallel, resulting in a significant time saving. If there are dependencies however, this can result in complications, and data will have to be processed serially.
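For instance, independent load jobs can run in parallel with a thread pool, while a step that depends on all of them runs serially afterwards; the source names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def load(source):
    """Stand-in for a real extract-and-load job against one source."""
    return f"{source}: loaded"

independent = ["crm", "erp", "payments"]  # no dependencies on one another
with ThreadPoolExecutor() as pool:
    results = list(pool.map(load, independent))  # safe to run in parallel

# This step depends on all three loads, so it must wait and run serially.
summary = "aggregate built from " + ", ".join(independent)
print(results)
print(summary)
```
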
Parallelism can also be achieved at three levels: data parallelism, splitting a single sequential file into smaller data files to provide parallel access; pipeline parallelism, allowing several components on the same data stream to run simultaneously; and component parallelism, running multiple processes on different data streams in the same job simultaneously.
These efficiencies can also be seen in data warehouse tools such as Amazon Redshift and Google's BigQuery. With Redshift, for example, columnar storage and massively parallel processing (MPP) enable high-performance analytical query processing. BigQuery is serverless and provides data warehouse as a service, managing the data warehouse and enabling very fast queries on large data sets.
What’s Needed To Be Successful In Production
The ETL process in a Data Warehouse is complex and involves many stakeholders. These can include data scientists, data engineers, and infrastructure engineers, not to mention those generating and/or consuming the data, such as marketing executives. Ultimately, of course, the entire ETL process, including the ETL Data Warehouse, is useless if the various pieces don't come together to produce data that can be trusted by all parties and acted upon where necessary. This data also has to be available, operational downtime must be minimized, data must be queryable quickly and easily, and all of this should serve the needs of the different types of users of the system.
This, of course, creates frustration: data scientists get exasperated at the slow pace of the data engineers, data engineers get upset about the unrealistic expectations and poor technical understanding of the data scientists, infrastructure engineers have to keep adapting to ever-increasing resource demands, and end users demand more features and better performance.
To help those building and consuming dashboards and data visualizations better understand what's needed to be successful in production, it is important that there is a general understanding of the process.
Often there are complaints from end users such as “The dashboard is taking too long to load”, or expectations that might be unrealistic, such as “I need this data before I get into work Monday morning”.
By making all users aware of the process and its pitfalls – from those overseeing the ETL architecture to those consuming the data in dashboards or other BI visualization tools – an organization can successfully leverage its data to achieve its business goals.
Having said this, there are some concrete steps that can be taken to ensure that the ETL process loads data successfully to an ETL Data Warehouse, and that everyone benefits:
Gain an understanding of the overall system, including both the Data Source and the Target. This will help keep the team focused, minimize the impact of irrelevant data, and ensure seamless and speedy loading of data.
Planning: this is arguably the most important step in the process, and yet the one where many teams spend the least time or effort. Too many teams jump straight into the ETL process itself, trying to get the data loaded quickly; they end up cutting corners and costing the organization even more down the line through increased maintenance, scaling problems, error fixing, and disrupted business operations.
Use the Team: by involving all the relevant stakeholders in the process, you are more likely to get buy-in, assistance, insights, and help when things go wrong or take an unexpected turn. Everyone from the marketing team to the engineers should be involved to ensure a smooth process that everyone owns and in whose success everyone is invested.
Review often. Ensure performance is consistent, get feedback from users, and review plans and back-ups to ensure every foreseeable situation is planned for.
Keep Things Simple
With the traditional ETL methodology, every aspect of the process, from extracting data, to transforming it and finally loading it to the ETL Data Warehouse, is fraught with risk. The complexity, the moving parts, and the size of the team involved, mean that there are multiple things that can go wrong.
Traditional ETL, while having faithfully served for so long, has drawbacks that make coping with modern business demands increasingly difficult and resource-intense.
With Panoply, you can automate the ingestion of any data from diverse sources and make tables that are clear, configurable, and immediately queryable. It also seamlessly connects you to any BI tool you need so you can start visualizing, analysing, and sharing data insights in just minutes.
The platform enables you to collect data with zero coding – from Salesforce and Facebook Ads, to PostgreSQL and everything in between. With over 100 native data integrations and an SDK to build custom integrations, just select your data source and start loading data into your Panoply ETL Data Warehouse. As new data – or even new data types – become available, they’re instantly added to your cloud ETL Data Warehouse. Find out more at https://panoply.io/platform/