Redshift Columnar Storage 101

What is Redshift's columnar storage? How do you implement it? Find out why columnar data storage is part of the secret behind Redshift's amazing performance.

Redshift is a managed data warehouse provided by Amazon Web Services. One of its key features, and part of the secret behind its amazing performance, is its columnar data structure.

In this article:

What is a Columnar Database?

Many databases store their data by rows, which look similar to a spreadsheet. All the data about a record is stored in one row. By contrast, a columnar database stores its data by columns, with the data for each column stored together.

Take this data for example:

111, Bob Smith, 21 Main Street, 555-1212
				112, Jane Doe, 14 First Avenue, 444-2222

 

In a SQL or standard flat file database the data would be stored like this:

ID Number Name Address Telephone
111 Bob Smith 21 Main Street 555-1212
112 Jane Doe 14 First Avenue 444-2222

In a columnar database, the data would be stored like this:

ID Number 111, 112
Name Bob Smith, Jane Doe
Address 22 Main Street, 14 First Avenue
Telephone 555-1212, 444-2222

Columnar databases are typically used to store large amounts of data for analysis, such as in Business Intelligence systems. By storing all the records for one field together, columnar databases can query and perform analysis on that similar data far quicker than row-based databases.

As an example, if you want to know the mean order total for all of your customers, a columnar database would only need to look at the order total column to pull the data, and can quickly calculate the mean. Performing the same operation in a row-based database might require scanning millions or billions of rows to gather all the values.

Amazon Redshift Columnar Storage

As a large scale, cloud-hosted data warehouse solution, Redshift is optimized for fast data analysis on vast amounts of data. Because Redshift is used for analytical queries, which are  concerned with aggregates across many rows, a columnar structure performs much better:

  • Using columnar storage, each data block can hold column field values for as many as three times the records as row-based storage. This reduces the number of I/O operations by 2/3. In tables with very large numbers of columns and a large number of rows, storage efficiency is even greater.
  • An added advantage is that a columnar structure enables much better compression. In a database table, each column contains the same data type, typically with similar data. This creates an opportunity for much more efficient compression compared to a traditional database structure.
  • The savings in storage space also carry over to retrieving and storing the data in memory. Many database operations only need to access or operate on a small number of columns at a time, and so you can save memory space by only retrieving the columns you actually need for your query.

For example, consider a table that contains 100 columns. A query that uses five columns will only need to read about 5% of the data. This saving is repeated for possibly billions or even trillions of records in large datasets. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.

Column Compression in Redshift

By default, Redshift stores data in a raw, uncompressed format, and you can choose whether to compress data. Each column within a table can use a different type of compression.

There are several ways to encode columnar data when compressing it; choosing the right type of encoding for each data type is key to achieving efficient compression. Redshift supports seven column encoding formats:

Encoding type Data types
Raw (no compression) All
Byte dictionary All except BOOLEAN
Delta SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL
INT, BIGINT, DATE, TIMESTAMP, DECIMAL
LZO All except BOOLEAN, REAL, and DOUBLE PRECISION
Mostlyn SMALLINT, INT, BIGINT, DECIMAL
INT, BIGINT, DECIMAL
BIGINT, DECIMAL
Run-length All
Text VARCHAR only
VARCHAR only
Zstandard All

It is possible to let Redshift automatically select encoding for column compression, or select it manually when creating a table.

Automatic Compression with the COPY Command

Columns defined as sort keys are assigned a RAW compression, which means that they are not compressed. Analysts should consider this when selecting a column as a sort key.

  • Columns of a numerical type, like REAL and DOUBLE PRECISION, as well as BOOLEAN, are also assigned a RAW compression.
  • Any other Column is assigned the LZO compression, which provides a tradeoff between performance and compression ratio.

To apply automatic compression:

  1. Ensure the table is empty, and run the following command:
    truncate biglist;

  2. Load the table with a single COPY command, set the COMPUPDATE parameter to ON to overwrite previous compression settings in the table. By default Redshift will select 100,000 rows as its sample for analyzing the data for compression. 

copy biglist from 's3://mybucket/biglist.txt'

iam\_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

delimiter '|' COMPUPDATE ON;

 

  1. Examine the new schema for the table to see the auto-selected encoding schemes:
select "column", type, encoding

from pg\_table\_def where tablename = 'biglist';

 

Column Type Encoding
listid integer delta
sellerid integer delta32k
+numtickets smallint delta
priceperticket numeric(8,2) delta32k
totalprice numeric(8,2) mostly32
  1. Finally, verify that the correct number of rows was loaded to the table:
    select count(\*)from biglist;

Selecting Compression Encoding Manually

Advanced users can select encoding types manually for certain columns. This should be done based on an analyst’s understanding of the data. For example, if the analyst knows that a certain column contains a limited number of “dictionary” terms, such as countries, the Byte Dictionary encoding type can be manually applied.

To manually apply compression encodings, specify them in the create table command like this:

create table customer(

custkey int encode delta,

custname varchar(30) encode raw,   -- this text column uses raw encoding

gender varchar(7) encode text255,    -- this text column uses text255 encoding

address varchar(200) encode text255,

city varchar(30) encode text255,

state char(2) encode raw,

zipcode char(5) encode bytedict,

start\_date date encode delta32k); -- this date column uses deta32k encoding

 

This was only a brief introduction to Redshift’s compression and encoding options - there are more details to be considered. We recommend you review the Choosing a Column Compression Type section in the Redshift documentation for more information.

Challenges with Redshift’s Columnar Structure

For each column, you’ll need to define one of the supported Redshift data types. Unlike other RDBMS systems, Redshift is used solely for data warehousing. You will not see many of the data types like LOBs or LONG in Redshift.

You can design a table for data warehousing using data types like SMALLINT, INTEGER, and BIGINT to store whole numbers of various ranges. Use DECIMAL or NUMERIC to store values with user-defined precision. Similarly, you can use VARCHAR, DATE, TIMESTAMP for its respective data.

The challenge lies in data integrity. You need to be careful about how the conversion and compatibility of number data type works while manipulating or querying data. Make sure you select only the data types the table needs, rather than choosing data types at random. Careful selection of data type reduces the size of the rows, and the database, making reads and writes faster and more efficient.

Do not select data types nor design the ETL in a hurry as it can cause errors, or worse, data corruption. Knowing how each data type behaves and the events that occur when the checks are performed will be of help when working on table design. Compatibility checks and conversion occur during various database operations, and if the wrong data type is defined or data is not converted properly, data integrity can be compromised.

Panoply is a smart data warehouse based on Redshift, which can help you automatically map your data to Redshift’s data types, guaranteeing data integrity. Panoply learns your data and makes intelligent choices, making sure all data makes the proper transition from the old data types to the types allowed in Redshift.

Learn more about Panoply’s smart data warehouse tool, based on Amazon Redshift.

Learn more about Data Warehouses