Incremental Key

Panoply uses an incremental key to pull only the information that was updated since the last pull, and then to update only the rows that have changed. Because only new or updated rows are collected, incremental keys shorten update times, improve performance, and allow for more frequent scheduling.

By default, Panoply’s data sources are configured to read all of the data from the source every time they’re executed. For many data sources, such as APIs with small data sets, that’s fine: the data set is small enough to be fully consumed in just a few minutes, and thanks to Panoply’s upsert mechanism, no duplicate entries are created.
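
The effect of that upsert behavior can be illustrated with a short sketch (hypothetical rows and field names, not Panoply’s internal code): rows are keyed on a primary key, so re-reading the full source overwrites existing rows instead of duplicating them.

    # Illustration only: rows keyed on a primary key are overwritten, not duplicated.
    def upsert(table, rows, primary_key="id"):
        for row in rows:
            table[row[primary_key]] = row  # same key -> overwrite the existing row
        return table

    table = {}
    rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
    upsert(table, rows)
    upsert(table, rows)   # a full re-read of the same source
    print(len(table))     # 2 -- no duplicate entries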

However, with larger data sets this quickly becomes impractical. With a fast-growing data set, reading the entire source might take hours, most of which is spent re-reading unchanged data. For these data sources, Panoply supports the ability to define an incremental key. Most sources already include a suitable incremental key, so there’s no need to make any schema changes in your data source.

The incremental key is an attribute of the generated data that can be reliably assumed to indicate the last update point for the rows in that data source. Simple examples are attributes like update_date. These attributes, once configured, can be used by Panoply to skip over all unchanged rows in the data source, and only fetch the new or updated rows on every iteration – thus reducing the processing time from hours to minutes or seconds.

Whenever Panoply reads a data source, it first reads the last value stored for that incremental key. Then, it uses this value to skip over all rows below that point, assumed to be unchanged since the last time the data source was consumed.
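
As a rough sketch of that flow (the updated_at field name and the in-memory state are hypothetical, not Panoply’s actual implementation):

    # Only rows whose incremental-key value is above the last stored value are pulled;
    # the stored value is then advanced to the new high-water mark.
    def incremental_pull(rows, state, key="updated_at"):
        last_value = state.get(key)
        new_rows = [r for r in rows if last_value is None or r[key] > last_value]
        if new_rows:
            state[key] = max(r[key] for r in new_rows)
        return new_rows

    state = {}
    source = [{"id": 1, "updated_at": "2023-01-01T00:00:00.000Z"},
              {"id": 2, "updated_at": "2023-01-02T00:00:00.000Z"}]
    print(len(incremental_pull(source, state)))  # 2 -- first run reads everything
    print(len(incremental_pull(source, state)))  # 0 -- unchanged rows are skipped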

In some data sources, this is simple to reason about and set up. For example, with Amazon S3 we can use the ModifiedDate to avoid re-reading files that were already processed and weren’t modified. But in other data sources, like Postgres or MySQL, Panoply can’t make any assumptions about the data unless you explicitly define an incremental key for that data source.
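
For the S3 case, a minimal Python sketch using boto3 might look like the following; the bucket name and stored cutoff are placeholders, and boto3 exposes the modification time as LastModified:

    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")
    last_run = datetime(2023, 1, 1, tzinfo=timezone.utc)  # last stored incremental value

    # Keep only files that were added or modified since the last run.
    resp = s3.list_objects_v2(Bucket="example-bucket")
    changed_keys = [obj["Key"] for obj in resp.get("Contents", [])
                    if obj["LastModified"] > last_run]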

To configure the incremental key, enter the name of the relevant field (unlike the primary key option). For database and file system sources, you can enter any field as the incremental key, but certain types of fields are most common, such as timestamps or incremental IDs. For APIs, the most common timestamp syntax is the ISO 8601 format: yyyy-mm-ddThh:mm:ss.sssZ.
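
For reference, a timestamp in that format can be produced with Python’s standard library:

    from datetime import datetime, timezone

    # Render the current UTC time as yyyy-mm-ddThh:mm:ss.sssZ
    now = datetime.now(timezone.utc)
    iso_value = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    print(iso_value)  # e.g. 2023-01-02T15:04:05.123Z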

Destination Table

A destination table is a configuration assigned to each data source that controls the name of the target table where the data should be saved. Each data source effectively generates a single stream of records – rows or events. These records are processed and then stored in the preconfigured destination table. If the destination table doesn’t already exist, or doesn’t have the correct schema, Panoply creates the table and any missing columns to accommodate the structure of the incoming data.
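
Conceptually (this is an illustration, not Panoply’s internal code), accommodating the incoming structure amounts to adding any columns the destination table doesn’t have yet:

    # Add any columns present in the incoming record but missing from the table schema.
    def ensure_columns(schema, record, default_type="varchar"):
        for column in record:
            schema.setdefault(column, default_type)
        return schema

    schema = {"id": "int"}
    ensure_columns(schema, {"id": 1, "email": "a@example.com"})
    print(schema)  # {'id': 'int', 'email': 'varchar'}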

In many cases, the incoming records represent the same logical object, like users or pageviews. In other cases, a data source might produce records with completely unrelated structures. One simple example is a Postgres data source that reads multiple tables in parallel – producing records that are completely unrelated. In these cases, we don’t want a single destination table to collect all of these unrelated records; instead, we’d like to break up the stream of records into multiple destinations.

To achieve this, the destination table configuration can be a dynamic pattern rather than a flat string value. Using curly brackets, you can construct any destination table per record based on the available attributes from the data source. When the data is read from the source, Panoply uses the destination field to construct the name of the destination table per row, instead of once for the whole data source. The records are then broken up and sent to their separate destinations.

Example

If we have a data source that produces the following records:

    {table: 'users', user_id: 15, name: 'Alice'}
    {table: 'users', user_id: 16, name: 'Bob'}
    {table: 'pageview', user_id: 16, page: '/'}
    {table: 'pageview', user_id: 16, page: '/profile'}

We can use the following destination name: postgres_{table}. This generates a separate destination value per record, where {table} is replaced by the value of the table attribute for that record, effectively generating two output tables: postgres_users and postgres_pageview.
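
The substitution itself can be sketched in a few lines of Python (an illustration of the pattern, not Panoply’s internal code):

    # Resolve the destination name per record and group the records accordingly.
    records = [
        {"table": "users", "user_id": 15, "name": "Alice"},
        {"table": "users", "user_id": 16, "name": "Bob"},
        {"table": "pageview", "user_id": 16, "page": "/"},
        {"table": "pageview", "user_id": 16, "page": "/profile"},
    ]

    destinations = {}
    for record in records:
        name = "postgres_{table}".format(**record)  # substitute the curly-bracket field
        destinations.setdefault(name, []).append(record)

    print(sorted(destinations))  # ['postgres_pageview', 'postgres_users']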

All data sources (except for file system data sources) have a default destination name; for databases, for example, it’s the original table name. Note that nested values in the stream might cause additional subtables to be created.