On data sources that support it, Panoply uses an incremental key to only pull the information that was updated since the last pull and then updates the rows that have been changed. An incremental key can be used to collect only the most up-to-date rows, saving vast amounts of time in the collection process and improving performance.
For API sources, Panoply by default collects the entire data set from the source every time they’re executed. For many data sources with small data sets, that’s fine. In these cases, the data set is small enough to be fully consumed in just a few minutes. Due to Panoply’s upsert mechanism, no duplicate entries are created.
However, with larger data sets, such as Shopify or Stripe, this can quickly become unfeasible. With a fast-growing data set, reading the entire source might take hours, most of which is spent re-reading unchanged data. For these data sources, Panoply supports the ability to define an incremental key. In most API sources, incremental keys are already included, so there’s no need to make any schema changes in your data source.
The incremental key is an attribute of the generated data that can be reliably assumed to indicate the last update point for the rows in that data source. Simple examples are attributes like
update_date. These attributes, once configured, can be used by Panoply to skip over all unchanged rows in the data source, and only fetch the new or updated rows on every iteration – thus reducing the processing time from hours to minutes or seconds.
Whenever Panoply reads a data source, it first reads the last value stored for that incremental key. Then, it uses this value to skip over all rows below that point, assumed to be unchanged since the last time the data source was consumed.
In some data sources, this is simple to reason out and set up. For example, using Amazon S3, we can use the
ModifiedDate to avoid re-reading files that were already processed and weren’t modified. But in other data sources, like Postgres or Mysql, we really can’t make any assumption about the data, unless you explicitly define an incremental key from that data source.
For database and file system sources, you can enter anything as the incremental key, but certain types of fields are most common, such as timestamps or incremental IDs. For APIs, the most common timestamp syntax is ISO 8601 format: