Duplicate Data
Updated on 26 Jul 2021
Primary Keys
When Panoply ingests data, we upsert the data (update rows that already exist, insert rows that don't) based on the selected primary key. That primary key can be the default for a data source, inserted by Panoply, or user-configured.
API data sources have a default primary key provided by the API response. For other data sources, such as file systems (Google Drive, Amazon S3, etc.) or database systems (MongoDB, Postgres, etc.), Panoply will look for an id field to use.
If there is no id field and the user does not specify the primary key in the Advanced Settings when setting up the data source, Panoply will insert an id, formatted as a GUID, such as 2cd570d1-a11d-4593-9d29-9e2488f0ccc2.
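As a rough illustration of that fallback order (not Panoply's actual implementation), here is a minimal Python sketch; `resolve_primary_key` and `configured_key` are hypothetical names:

```python
import uuid

def resolve_primary_key(record: dict, configured_key: str | None = None) -> str:
    """Pick the record's primary key: a user-configured field first,
    then an "id" field, then a generated GUID."""
    if configured_key and configured_key in record:
        return str(record[configured_key])
    if "id" in record:
        return str(record["id"])
    # No usable key: fill in a GUID such as 2cd570d1-a11d-4593-9d29-9e2488f0ccc2
    record["id"] = str(uuid.uuid4())
    return record["id"]
```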
Update vs. Insert
Panoply decides whether to update existing rows or to insert new rows into an existing table based on the defined or default primary key. For example, suppose a user collects a CSV file (via File Upload) with five rows on Tuesday, then collects it again on Wednesday with five more rows. Without a primary key, Panoply cannot know whether to update or insert, so the table now has ten rows. If one of Wednesday's rows was actually an update to an existing row, Panoply has added a new record instead of updating the existing one, creating a duplicate.
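To make the Tuesday/Wednesday example concrete, here is a small Python sketch (illustrative only; `upsert` is a hypothetical helper, and an in-memory dict and list stand in for the warehouse table):

```python
def upsert(table: dict, rows: list[dict], key: str) -> None:
    """Update-or-insert: a matching key overwrites, a new key inserts."""
    for row in rows:
        table[row[key]] = row

tuesday = [{"id": i, "qty": 1} for i in range(1, 6)]    # rows 1-5
wednesday = [{"id": 3, "qty": 9}] + [
    {"id": i, "qty": 1} for i in range(6, 10)           # update row 3, add rows 6-9
]

keyed: dict = {}
upsert(keyed, tuesday, "id")
upsert(keyed, wednesday, "id")
print(len(keyed))        # 9 -- row 3 was updated in place

unkeyed: list = []
unkeyed.extend(tuesday)  # with no primary key, every incoming row is appended
unkeyed.extend(wednesday)
print(len(unkeyed))      # 10 -- row 3 now appears twice, a duplicate
```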
Truncate
Primary Keys are one method to prevent duplication. Panoply also provides the option to Truncate data (see Advanced Settings), which means that each time the data source is collected, Panoply first deletes the data in the existing tables. Panoply never deletes data during the ingestion process unless you have selected the truncate option in the data source setup.
Without a set primary key, and without truncate enabled, you'll get a duplicate set of data every time you ingest. That's because we can't tell which of the incoming records are already in the system and because we do not delete records unless the truncate option is checked.
What distinguishes one record from another is the primary key, not the data in any other fields. Without a defined primary key, Panoply protects against data loss by erring on the side of duplication: even if an incoming row is identical to an existing row, Panoply adds it as a new record, resulting in duplicated data.
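A minimal sketch of the truncate behavior (again illustrative; `collect` is a hypothetical helper and a Python list stands in for the destination table):

```python
def collect(table: list, source_rows: list[dict], truncate: bool = False) -> None:
    """One ingestion pass: optionally delete existing data first, then load."""
    if truncate:
        table.clear()  # truncate: the only time existing rows are deleted
    table.extend(source_rows)
```

With truncate off and no primary key, repeated runs over the same file keep growing the table; with truncate on, the table always mirrors the latest pull.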
Deleted Data
To further complicate matters, data sources deal differently with deleted data. For example, Salesforce has two different ways to delete a record. Its "soft" delete puts items in the Recycle Bin by setting the Boolean isdeleted attribute to True, but doesn't actually remove the record.
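A downstream consumer that wants to ignore soft-deleted rows has to filter on that flag itself; a one-line sketch:

```python
def active_records(records: list[dict]) -> list[dict]:
    """Keep only rows whose isdeleted flag is not set."""
    return [r for r in records if not r.get("isdeleted", False)]
```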
From the perspective of the ingestion process, there is often no definitive answer to whether a record was deleted from the data source. Consider a case in which a destination table in Panoply has 100 records in it, but during the last ingestion run, the data source had only 10 records. What does this mean? Were 90 records deleted? It is impossible to know if the data source contains the full set of source records, holds a subset of previously-loaded data, or contains entirely new data, especially if the data source is a database or file system.
What can you do to avoid ending up with duplicate data? There's no one-size-fits-all solution. One approach to solving data duplication issues is to do a full sync every time with the truncate option turned on. In other words: Always load all the data. Because truncate is enabled, this removes current records and replaces them with source data. Records that were removed from the source since the last ingestion are not re-created in Panoply.
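To illustrate why this works, a tiny sketch with in-memory lists standing in for the warehouse table and the source:

```python
warehouse = [{"id": 1}, {"id": 2}, {"id": 3}]  # loaded on a previous run
source = [{"id": 1}, {"id": 3}]                # id 2 was deleted upstream

# Full sync with truncate: wipe, then reload everything from the source.
warehouse.clear()
warehouse.extend(source)
print([r["id"] for r in warehouse])  # [1, 3] -- the upstream delete is reflected
```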
A "load it all" approach does not work for data sources that load data incrementally because each incremental pass would remove the previously loaded data. Consult the data source documentation for details on whether a given data source uses an incremental loading strategy.
Ultimately, you need to know your data well enough to specify the appropriate primary key, determine whether you should be truncating your data on each ingestion, and plan for how to identify and remove records that were deleted in the source data. If you see records that you think are unexpected, reach out to Panoply's Support team for help identifying their origin.