Primary Keys

Primary keys are used to ensure the uniqueness of the rows in a table. The Primary Key is comprised of one or more columns that can be used to identify the row in the data warehouse. The process guarantees that no two rows have the same value in their primary key.

When new data is inserted into a table, it is first checked against the existing data to determine if a row with the same value in the primary key already exists. If it does, the new row is used to update the existing row, rather than inserting a new one. However, unlike many data warehouses, Panoply doesn't do that by rejecting attempts to insert the new conflicting row with a unique error. Instead, it uses the new row to update some, or all, of the attributes (columns) in the existing row.

Default Primary Keys

Default primary keys are different based on whether your data warehouse is built on Amazon Redshift or Google BigQuery.

Amazon Redshift

All of the API data sources have a default primary key for each and every table created from them. All the data sources use a field named id as the default primary key with some exceptions (MongoDB, for example, uses the _id attribute). If no primary key is given at set up, and there is no default one in the data source (id field, for example), then a random value is generated instead. To successfully set the primary key, use dynamic fields by placing each attribute name in curly brackets, for example: {first_name}-{last_name} will use the first_name and last_name attributes as the primary key of each row.

Redshift Primary Key Scenarios

Source `id` column	Enter a primary key	Outcome
yes	no	Panoply will automatically select the `id` column and use it as the primary key.
yes	yes	Not recommended. Panoply will use the `id` column but will overwrite the original source values. If you want Panoply to use your source's `id` column, do not enter a value into the Primary Key field.
no	no	Panoply creates an `id` column formatted as a GUID, such as `2cd570d1-a11d-4593-9d29-9e2488f0ccc2`.
no	yes	Panoply creates a hashed `id`column using the primary key values entered, while retaining the source columns.

Google BigQuery

For Panoply on BigQuery, the default primary key will be __panoply_id. Users may specify a different primary key in the Advanced Settings when configuring the data source, and Panoply will use that to create the __panoply_id field. If there are any sub-tables, this becomes the foreign key in that sub-table in the format of __{parent_table}_panoply_id.

Primary Key Scenarios

Source `id` column	Enter a primary key	Outcome
yes	no	Panoply will automatically select the `id` field and use it to populate the `__panoply_id` column.
yes	yes	Panoply will automatically select the `id` field and use it to populate the `__panoply_id`.
no	no	Panoply creates the `__panoply_id` column formatted as a GUID, such as `2cd570d1-a11d-4593-9d29-9e2488f0ccc2`.
no	yes	Panoply creates the `__panoply_id`column using the primary key values entered, while retaining the source columns.

Any user-entered primary key will be used across all the destination tables.