Redshift is a managed data warehouse provided by Amazon Web Services. One of its key features, and part of the secret behind its amazing performance, is its columnar data structure.
In this article:
- Quickly understand what a columnar data structure means, and how it is implemented in Amazon Redshift.
- Learn what column-based compression and column encoding is, and how to set column encoding in Amazon Redshift: either automatically using the COPY command, and manually for advanced users.
- A few challenges you might experience with Redshift’s columnar structure.
What is a Columnar Database?
Many databases store their data by rows, which look similar to a spreadsheet. All the data about a record is stored in one row. By contrast, a columnar database stores its data by columns, with the data for each column stored together.
Take this data for example:
111, Bob Smith, 21 Main Street, 555-1212 112, Jane Doe, 14 First Avenue, 444-2222
In a SQL or standard flat file database the data would be stored like this:
|111||Bob Smith||21 Main Street||555-1212|
|112||Jane Doe||14 First Avenue||444-2222|
In a columnar database, the data would be stored like this:
|ID Number||111, 112|
|Name||Bob Smith, Jane Doe|
|Address||22 Main Street, 14 First Avenue|
Columnar databases are typically used to store large amounts of data for analysis, such as in Business Intelligence systems. By storing all the records for one field together, columnar databases can query and perform analysis on that similar data far quicker than row-based databases.
As an example, if you want to know the mean order total for all of your customers, a columnar database would only need to look at the order total column to pull the data, and can quickly calculate the mean. Performing the same operation in a row-based database might require scanning millions or billions of rows to gather all the values.
Amazon Redshift Columnar Storage
As a large scale, cloud-hosted data warehouse solution, Redshift is optimized for fast data analysis on vast amounts of data. Because Redshift is used for analytical queries, which are concerned with aggregates across many rows, a columnar structure performs much better:
- Using columnar storage, each data block can hold column field values for as many as three times the records as row-based storage. This reduces the number of I/O operations by 2/3. In tables with very large numbers of columns and a large number of rows, storage efficiency is even greater.
- An added advantage is that a columnar structure enables much better compression. In a database table, each column contains the same data type, typically with similar data. This creates an opportunity for much more efficient compression compared to a traditional database structure.
- The savings in storage space also carry over to retrieving and storing the data in memory. Many database operations only need to access or operate on a small number of columns at a time, and so you can save memory space by only retrieving the columns you actually need for your query.
For example, consider a table that contains 100 columns. A query that uses five columns will only need to read about 5% of the data. This saving is repeated for possibly billions or even trillions of records in large datasets. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.
Column Compression in Redshift
By default, Redshift stores data in a raw, uncompressed format, and you can choose whether to compress data. Each column within a table can use a different type of compression.
There are several ways to encode columnar data when compressing it; choosing the right type of encoding for each data type is key to achieving efficient compression. Redshift supports seven column encoding formats:
|Encoding type||Data types|
|Raw (no compression)||All|
|Byte dictionary||All except BOOLEAN|
|Delta||SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL
INT, BIGINT, DATE, TIMESTAMP, DECIMAL
|LZO||All except BOOLEAN, REAL, and DOUBLE PRECISION|
|Mostlyn||SMALLINT, INT, BIGINT, DECIMAL
INT, BIGINT, DECIMAL
It is possible to let Redshift automatically select encoding for column compression, or select it manually when creating a table.
Automatic Compression with the COPY Command
Columns defined as sort keys are assigned a RAW compression, which means that they are not compressed. Analysts should consider this when selecting a column as a sort key.
- Columns of a numerical type, like REAL and DOUBLE PRECISION, as well as BOOLEAN, are also assigned a RAW compression.
- Any other Column is assigned the LZO compression, which provides a tradeoff between performance and compression ratio.
To apply automatic compression:
Ensure the table is empty, and run the following command:
Load the table with a single COPY command, set the COMPUPDATE parameter to ON to overwrite previous compression settings in the table. By default Redshift will select 100,000 rows as its sample for analyzing the data for compression.
copy biglist from 's3://mybucket/biglist.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '|' COMPUPDATE ON;
- Examine the new schema for the table to see the auto-selected encoding schemes:
select "column", type, encoding from pg_table_def where tablename = 'biglist';
- Finally, verify that the correct number of rows was loaded to the table:
Selecting Compression Encoding Manually
Advanced users can select encoding types manually for certain columns. This should be done based on an analyst’s understanding of the data. For example, if the analyst knows that a certain column contains a limited number of “dictionary” terms, such as countries, the Byte Dictionary encoding type can be manually applied.
To manually apply compression encodings, specify them in the create table command like this:
create table customer( custkey int encode delta, custname varchar(30) encode raw, -- this text column uses raw encoding gender varchar(7) encode text255, -- this text column uses text255 encoding address varchar(200) encode text255, city varchar(30) encode text255, state char(2) encode raw, zipcode char(5) encode bytedict, start_date date encode delta32k); -- this date column uses deta32k encoding
This was only a brief introduction to Redshift’s compression and encoding options - there are more details to be considered. We recommend you review the Choosing a Column Compression Type section in the Redshift documentation for more information.
Challenges with Redshift’s Columnar Structure
For each column, you’ll need to define one of the supported Redshift data types. Unlike other RDBMS systems, Redshift is used solely for data warehousing. You will not see many of the data types like LOBs or LONG in Redshift.
You can design a table for data warehousing using data types like SMALLINT, INTEGER, and BIGINT to store whole numbers of various ranges. Use DECIMAL or NUMERIC to store values with user-defined precision. Similarly, you can use VARCHAR, DATE, TIMESTAMP for its respective data.
The challenge lies in data integrity. You need to be careful about how the conversion and compatibility of number data type works while manipulating or querying data. Make sure you select only the data types the table needs, rather than choosing data types at random. Careful selection of data type reduces the size of the rows, and the database, making reads and writes faster and more efficient.
Do not select data types nor design the ETL in a hurry as it can cause errors, or worse, data corruption. Knowing how each data type behaves and the events that occur when the checks are performed will be of help when working on table design. Compatibility checks and conversion occur during various database operations, and if the wrong data type is defined or data is not converted properly, data integrity can be compromised.
Panoply is a smart data warehouse based on Redshift, which can help you automatically map your data to Redshift’s data types, guaranteeing data integrity. Panoply learns your data and makes intelligent choices, making sure all data makes the proper transition from the old data types to the types allowed in Redshift.
Learn more about Panoply’s smart data warehouse tool, based on Amazon Redshift.