Duplicate Source Records
You may come across a scenario in which it appears you have duplicate data. This section aims to explain why this can occur and what to do about it.
When Panoply ingests data, we upsert the data (update and then insert) based on the selected primary key. That primary key can be the default for a data source, or it can be user-configured.
If the data source has a default primary key, Panoply knows which field in the source data is the unique identifier; otherwise, Panoply creates its own unique identifier for each record. However, for some kinds of data sources, such as databases or CSV file uploads, Panoply often can’t know which fields are the primary key. Unless your data has an id field, you need to specify the primary key.
Here’s an example of how you can end up with unexpected records. Imagine we import a CSV file on Tuesday. Then on Wednesday, we re-import the same CSV file but with five new rows. Panoply can’t know whether to update or insert unless you specify a primary key. Without a user-defined primary key, Panoply creates new records for each row in the newly uploaded CSV file.
When you re-upload data, it is possible to create duplicate records. One factor in this is that we never delete data during the ingestion process unless you have selected the truncate option in the data source setup. Without a set primary key, and without truncate enabled, you’ll get a duplicate set of data every time you ingest. That’s because we can’t tell which of the incoming records are already in the system and because we do not delete records unless the truncate option is checked.
You might expect that Panoply will not create duplicate records because Panoply should be able to detect that identical data already exists in the system. However, it’s not relevant that the record’s other values are identical. What distinguishes one record from another is the primary key. Without a defined primary key, Panoply protects against data loss by erring on the side of duplication.
To further complicate matters, data sources deal differently with deleted data. For example, Salesforce has two different ways to delete a record. Their “soft” delete puts items in the Recycle Bin by setting the Boolean isdeleted attribute to True, but it but doesn’t actually remove the record.
From the perspective of the ingestion process, there is often no definitive answer to whether a record was deleted from the data source. Consider a case in which a destination table in Panoply has 100 records in it, but during the last ingestion run, the data source had only 10 records. What does this mean? Were 90 records deleted? It is impossible to know if the data source contains the full set of source records, holds a subset of previously-loaded data, or contains entirely new data, especially if the data source is a database or file system.
What can you do to avoid ending up with duplicate data? There’s no one-size-fits-all solution. One approach to solving data duplication issues is to do a full sync every time with the truncate option turned on. In other words: Always load all the data. Because truncate is enabled, this removes current records and replaces them with source data. Records that were removed from the source since the last ingestion are not re-created in Panoply.
Be aware that a “load it all” approach does not work for data sources that load data incrementally because each incremental pass would remove the previously loaded data. Consult the data source documentation for details on whether a given data source uses an incremental loading strategy.
Ultimately, there is no source of truth to say whether a record should or shouldn’t exist. You have to know your data well enough to specify the appropriate primary key, determine whether you should be truncating your data on each ingestion, and plan for how to identify and remove records that were deleted in the source data. If you have records that you think are unexpected, you can reach out to Panoply’s Support team to help identify the origin of these records.