
Incremental Key

An incremental key is a configuration assigned to a data source that controls which attribute is used to incrementally load data from the source, instead of repeatedly re-loading the entire dataset.

By default, Panoply.io's data sources are configured to read all of the data from the source every time they're executed. For many data sources, like APIs exposing small data sets, that's fine: the data set is small enough to be fully consumed in just a few minutes, and thanks to Panoply.io's upsert mechanism, no duplicate entries are created.

However, with larger data sets this quickly becomes infeasible. For a fast-growing data set, reading the entire source might take hours, most of which are spent re-reading unchanged data. For these data sources, Panoply.io supports defining an incremental key.

The incremental key is an attribute of the generated data that can be reliably assumed to indicate the last update point for the rows in that data source. A simple example is an attribute like update_date. Once configured, this attribute lets Panoply.io skip over all unchanged rows in the data source and fetch only the new or updated rows on every iteration, reducing the processing time from hours to minutes or seconds.

Whenever Panoply.io reads a data source, it first reads the last value stored for that incremental key. It then uses this value to skip over all rows below that point, which are assumed to be unchanged since the last time the data source was consumed.

For some data sources this is simple to reason about and set up. For example, with Amazon S3 we can use the ModifiedDate to avoid re-reading files that were already processed and weren't modified since. But for other data sources, like Postgres or MySQL, no such assumption can be made about the data unless you explicitly define an incremental key for that data source.
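
To make this concrete, here is a minimal sketch of what an incremental read against a SQL source could look like. It is purely illustrative: the orders table, the updated_at column, and the read_increment helper are assumptions made for the example, not Panoply.io's actual implementation.

    import sqlite3  # stand-in for any SQL source; a Postgres or MySQL driver behaves the same way

    INCREMENTAL_KEY = "updated_at"  # hypothetical attribute configured as the incremental key

    def read_increment(conn, table, last_value):
        """Fetch only the rows whose incremental key is past the last stored value."""
        query = (
            f"SELECT * FROM {table} "
            f"WHERE {INCREMENTAL_KEY} > ? "
            f"ORDER BY {INCREMENTAL_KEY}"
        )
        rows = [dict(row) for row in conn.execute(query, (last_value,))]
        # Remember the highest value seen, so the next run skips everything before it.
        new_last_value = max((r[INCREMENTAL_KEY] for r in rows), default=last_value)
        return rows, new_last_value

    # Usage, assuming an orders table with an updated_at column already exists:
    conn = sqlite3.connect("example.db")
    conn.row_factory = sqlite3.Row
    rows, last_value = read_increment(conn, "orders", last_value="2020-01-01 00:00:00")

Each run only touches rows added or changed after the previously stored value, which is where the time savings described above come from.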

To configure the incremental key, simply type in the attribute name as is (unlike the primary key option).

Destination Table

A destination table is a configuration assigned to each data source that controls the name of the table where the data should be saved. Each data source effectively generates a single stream of records (rows or events). These records are processed and then stored in the preconfigured destination table. If the destination table doesn't already exist, or doesn't have the right schema, Panoply.io creates the table and any missing columns to accommodate the structure of the incoming data.
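
As a rough illustration of that last step, the sketch below creates a table and adds any columns that appear in an incoming record but are missing from it. The SQLite calls and the naive type mapping are assumptions made for the sake of the example, not Panoply.io's actual logic.

    import sqlite3

    def accommodate(conn, table, record):
        """Create the destination table and any columns missing for an incoming record (sketch)."""
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if not existing:  # table doesn't exist yet
            conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
            existing = {"id"}
        for column, value in record.items():
            if column not in existing:
                # Naive type mapping, just for illustration.
                col_type = "REAL" if isinstance(value, (int, float)) else "TEXT"
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")

    conn = sqlite3.connect(":memory:")
    accommodate(conn, "users", {"id": 15, "name": "Alice", "age": 30})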

In many cases the incoming records all represent the same logical object, like users or pageviews. But in other cases a data source might produce records with completely unrelated structures. A simple example is a Postgres data source that reads multiple tables in parallel. In these cases we don't want a single destination table to collect all of these unrelated records; instead, we'd like to break the stream of records up into multiple destinations.

To achieve this, the destination table configuration can be a dynamic pattern rather than a flat string value. Using curly brackets, you can construct a destination table name per record, based on the attributes available from the data source. When the data is read from the source, Panoply.io uses the destination field to construct the name of the destination table per row, rather than once for the whole data source. The records are then broken up and sent to their separate destinations.

Example:

If we have a data source that produces the following records:

{table: 'users', user_id: 15, name: 'Alice'}
{table: 'users', user_id: 16, name: 'Bob'}
{table: 'pageview', user_id: 16, page: '/'}
{table: 'pageview', user_id: 16, page: '/profile'}

We can use the following destination name: 'postgres_{table}'.

This generates a separate destination value per record, where the '{table}' part is replaced by the value of the table attribute for that record, effectively generating two output tables: 'postgres_users' and 'postgres_pageview'.
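
Here is a minimal sketch of how such a pattern could be resolved per record; the grouping logic is only illustrative, not Panoply.io's implementation.

    from collections import defaultdict

    records = [
        {"table": "users", "user_id": 15, "name": "Alice"},
        {"table": "users", "user_id": 16, "name": "Bob"},
        {"table": "pageview", "user_id": 16, "page": "/"},
        {"table": "pageview", "user_id": 16, "page": "/profile"},
    ]

    destination_pattern = "postgres_{table}"

    # Resolve the destination per record by substituting its attributes into the pattern,
    # then group the records by their resolved destination table.
    by_destination = defaultdict(list)
    for record in records:
        by_destination[destination_pattern.format(**record)].append(record)

    print(sorted(by_destination))  # ['postgres_pageview', 'postgres_users']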

All data sources (except file-system data sources) have a default destination name; for databases, for example, it's the original table name.

Note that if the stream contains nested values, additional subtables might be created.