
Archiving

By default, Panoply.io stores all of the data consumed by the data sources you define, regardless of how frequently it's accessed. This is simple and safe: you always have instant access to your data, even tables or rows from several years back that you rarely need.

However, this can have two negative implications:

Cost. You’re storing all of the data, even if you don’t normally access large parts of it.

Performance. Each of your queries needs to scan massive tables only to filter out the rows you're not interested in.

To overcome this, you can configure your tables to be automatically archived at regular intervals. First, you need to define an archive attribute for the table: the column used to determine whether a row should be kept in the data warehouse or archived. This is usually a date column indicating when the row was first created. Then, you need to define the retention value: how many days in the past you want to keep unarchived.

For example, if you have an events table with records like this:

{type: 'click', created_on: '…'}

You can configure the table to archive on the created_on field and retain, say, only the past three months. Everything older will be archived daily.
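The retention check above can be sketched in a few lines. This is an illustrative sketch, not Panoply's implementation: the row shape and the `should_archive` helper are assumptions chosen to mirror the events example, with a 90-day window standing in for "the past three months".

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window: roughly three months.
RETENTION_DAYS = 90

def should_archive(row, now=None):
    """Return True if the row's created_on (the archive attribute)
    falls outside the retention window."""
    now = now or datetime.now(timezone.utc)
    created = datetime.strptime(row["created_on"], "%Y-%m-%d").replace(
        tzinfo=timezone.utc
    )
    return (now - created) > timedelta(days=RETENTION_DAYS)

# A row well outside the window would be archived; a recent one kept.
old_row = {"type": "click", "created_on": "2023-01-15"}
new_row = {"type": "click", "created_on": "2025-06-01"}
```

The same decision would typically be expressed as a date filter in the warehouse itself; the Python form just makes the rule explicit.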

You can also define aggregation transformations that generate aggregated, unarchived data from the raw data before it's archived. This lets you retain access to the insights contained in the data without paying the cost and performance penalties of keeping the raw rows in your data warehouse.
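As a sketch of what such an aggregation might produce, the snippet below rolls raw event rows up into daily counts per event type, a stand-in for a SQL GROUP BY. The row shape and the `aggregate_daily` helper are hypothetical, chosen to match the events example; Panoply's actual transformations are configured in the product, not written this way.

```python
from collections import Counter

def aggregate_daily(rows):
    """Count events per (date, type) pair -- the kind of compact summary
    that can stay in the warehouse after the raw rows are archived."""
    counts = Counter((r["created_on"], r["type"]) for r in rows)
    return [
        {"date": d, "type": t, "events": n}
        for (d, t), n in sorted(counts.items())
    ]

# Hypothetical raw rows about to be archived.
raw_rows = [
    {"type": "click", "created_on": "2023-01-15"},
    {"type": "click", "created_on": "2023-01-15"},
    {"type": "view", "created_on": "2023-01-15"},
]
```

The aggregated table is tiny compared to the raw event stream, so it keeps queries fast while the detailed rows move to the archive.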