It’s easier than ever to load data into the Amazon Redshift data warehouse. In 2018 there are three primary ways to ETL data into the Redshift data warehouse:
- Build your own ETL workflow
- Use Amazon’s managed ETL service, Glue
- Use one of several third-party cloud ETL services that work with Redshift
In this post you’ll learn how AWS Redshift ETL works and the best method to use for your use case.
What is ETL?
Extract-Transform-Load (ETL) is the process of pulling structured data from data sources like OLTP databases or flat files, cleaning and organizing the data to facilitate analysis, and loading it to a data warehouse.
What is Amazon Redshift?
Redshift is a petabyte-scale, managed data warehouse from Amazon Web Services. You can easily build a cluster of machines to store data and run very fast relational queries. Start small and scale up indefinitely by adding more machines or more Redshift clusters (for higher concurrency).
1. Build your own Redshift ETL Pipeline
Amazon recommends you design your ETL process around Redshift’s unique architecture, to leverage its performance and scalability.
Follow these best practices to design an efficient ETL pipeline for Amazon Redshift:
COPY from multiple files of the same size—Redshift uses a Massively Parallel Processing (MPP) architecture (like Hadoop). Workloads are broken up and distributed to multiple “slices” within compute nodes, which run tasks in parallel. Ensure each slice gets the same amount of work by splitting data into equal-sized files, between 1MB-1GB.
Bulk load data from S3—retrieve data from data sources and stage it in S3 before loading to Redshift. Use Amazon manifest files to list the files to load to Redshift from S3, avoiding duplication. Use temporary staging tables to hold data for transformation, and run the
ALTER TABLE APPENDcommand to swap data from staging tables to target tables. This is faster than
CREATE TABLE ASor
Multiple steps in a single transaction—commits to Amazon Redshift are expensive. If you have multiple transformations, don’t commit to Redshift after every one. Run multiple SQL queries to transform the data, and only when in its final form, commit it to Redshift. Below is an example provided by Amazon:
Begin CREATE temporary staging_table; INSERT INTO staging_table SELECT .. FROM source (transformation logic); DELETE FROM daily_table WHERE dataset_date =?; INSERT INTO daily_table SELECT .. FROM staging_table (daily aggregate); DELETE FROM weekly_table WHERE weekending_date=?; INSERT INTO weekly_table SELECT .. FROM staging_table(weekly aggregate); Commit
Perform table maintenance regularly—Redshift is a columnar database. To avoid performance problems over time, run the
VACUUMoperation to re-sort tables and remove deleted blocks. Frequently run the
ANALYZEoperation to update statistics metadata, which helps the Redshift Query Optimizer generate accurate query plans. The Analyze & Vacuum Utility helps you schedule this automatically.
Use workload management—Redshift is optimized primarily for read queries. To avoid commit-heavy processes like ETL running slowly, use Redshift’s Workload Management engine (WLM). Define a separate workload queue for ETL runtime. Configure to run with 5 or fewer slots, claim extra memory available in a queue, and take advantage of dynamic memory parameters.
UNLOADto extract large result sets—in Redshift, fetching a large number of rows using
SELECTstalls the cluster leader node, and thus the entire cluster. Instead, use the
UNLOADcommand to extract large result sets directly to S3, writing data in parallel to multiple files, without stalling the leader node.
Use Amazon Redshift Spectrum for ad hoc processing—for ad hoc analysis on data outside your regular ETL process (for example, data from a one-time marketing promotion) you can query data directly from S3. Amazon Redshift Spectrum can run ad-hoc relational queries on big data in the S3 data lake, without ETL.
Monitor daily ETL health using diagnostic queries—use monitoring scripts provided by Amazon to monitor ETL performance, and resolve problems early before they impact data loading capacity.
For more details on these best practices, see this excellent post on the AWS Big Data blog.
2. Redshift with AWS Glue
Amazon Web Services offers a managed ETL service called Glue, based on a serverless architecture, which you can leverage instead of building an ETL pipeline on your own.
The advantage of AWS Glue vs. setting up your own AWS data pipeline, is that Glue automatically discovers data model and schema, and even auto-generates ETL scripts. However, it comes at a price—Amazon charges $0.44 per Digital Processing Unit hour (between 2-10 DPUs are used to run an ETL job), and charges separately for its data catalog and data crawler.
AWS Glue offers the following capabilities:
Integrated Data Catalog—a persistent metadata store that stores table definitions, job definitions, and other control information to help you manage the ETL process. Glue automatically creates partitions to make queries more efficient.
Automatic schema discovery—Glue crawlers connect to your data, runs through a list of classifiers to determine the best schema for your data, and creates the appropriate metadata in the Data Catalog.
Code generation—Glue automatically generates Scala or Python code, written for Apache Spark, to extract, transform, flatten, enrich, and load your data.
Developer endpoints—Glue connects to your IDE and let you edit the auto-generated ETL scripts. Add custom readers, writers, or transformations as custom libraries.
Job scheduler—Glue runs ETL jobs in parallel, either on a pre-scheduled basis, on-demand, or triggered by an event. It lets you define dependencies to build complex ETL processes. Logs are pushed to CloudWatch.
3. Third-Party Redshift ETL Tools
You can leverage several lightweight, cloud ETL tools that are pre-integrated with Amazon Redshift. The main advantages of these services is that they come pre-integrated with dozens of external data sources, whereas Glue is only integrated with Amazon infrastructure.
If all your data is on Amazon, Glue will probably be the best choice. Run a simulation first to compare costs, as they will vary depending on use case.
Stitch to Redshift
Stitch lets you select from multiple data sources, connect to Redshift, and load data to it. Stitch provides detailed documentation on how data loading behaves depending on the status of keys, columns and tables in Redshift. Stitch does not allow arbitrary transformations on the data, and advises using tools like Google Cloud Dataflow to transform data once it is already in Redshift.
Blendo to Redshift
Blendo lets you pull data from S3, Amazon EMR, remote hosts, DynamoDB, MySQL, PostgreSQL or dozens of cloud apps, and load it to Redshift. Blendo offers automatic schema recognition and transforms data automatically into a suitable tabular format for Amazon Redshift.
Leveraging the Power of Redshift without ETL
Panoply is a pioneer of data warehouse automation. Our data warehouse is based on Amazon infrastructure and provides similar or improved performance compared to Redshift. It offers the advantage of loading data, and making it immediately available for analysis, without requiring an ETL pipeline at all.
Panoply uses machine learning and natural language processing (NLP) to model data, clean and prepare it automatically, and move it seamlessly into a cloud-based data warehouse. Perform transformations on the fly using Panoply’s UI, and then immediately start analyzing data with a BI tool of your choice.
To see how Panoply offers the power of Redshift without the complexity of ETL, sign up for our free trial.
Learn More About Amazon Redshift, ETL and Data Warehouses