Data Ingestion Engine

As your data travels from a data source into your Panoply database, it passes through Panoply's Data Ingestion Engine. This article explains the Data Ingestion Engine's constraints, the standards it adheres to, and the conversions it performs.

For example, you may have three data sources that each format dates differently. As data passes from those sources into your Panoply database, the Data Ingestion Engine standardizes the disparate formats into one consistent date format.

Data Ingestion Engine Specifications

The following sections explain how the Data Ingestion Engine handles destinations, dates, timestamps, numbers, transformations, and deleted source records.

Destinations

A destination is a string of characters used to define the table(s) in your Panoply database where your data will be stored. When you set up a data source, you can supply a destination or leave this field blank and use the default destination. A destination can include a combination of literals and symbols, as defined below. The Data Ingestion Engine converts all alphabetic characters to lowercase.

Literal

A literal is a raw string of characters.

Allowed characters:

  • A-Z (will be converted to lowercase)
  • a-z
  • 0-9
  • _ (underscore)
  • - (dash)
  •  (space)

Symbol

A symbol is the name of an object key. The Data Ingestion Engine substitutes each symbol with its related value.

Allowed characters: all

Example

A destination of {category}-{sub} includes one literal (-) between two symbols ({category} and {sub}).

For an object {"category": "toys", "sub": "puzzles"}, a destination of {category}-{sub} resolves to toys-puzzles.
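
To make the substitution concrete, here is a minimal sketch (illustrative Python, not Panoply's actual implementation) of how a destination template containing symbols could be resolved against an ingested object. The template and object are the ones from the example above.

    import re

    def resolve_destination(template, record):
        """Replace each {symbol} with its value from the record,
        then lowercase the result as the Data Ingestion Engine does."""
        def substitute(match):
            key = match.group(1)
            return str(record[key])
        resolved = re.sub(r"\{([^}]+)\}", substitute, template)
        return resolved.lower()

    record = {"category": "toys", "sub": "puzzles"}
    print(resolve_destination("{category}-{sub}", record))  # -> toys-puzzles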

Dates

Dates are converted to strings and saved in the format: YYYY-MM-DDThh:mm:ss.sssZ. This is compliant with ISO-8601.

Panoply supports these date formats:

  • ANSI C: Mon Jan _2 15:04:05 2006
  • Unix Date: Mon Jan _2 15:04:05 MST 2006
  • Ruby Date: Mon Jan 02 15:04:05 -0700 2006
  • RFC 1123: Mon, 02 Jan 2006 15:04:05 -0700
  • RFC 3339 (ISO 8601 profile): 2013-03-31T10:05:04.9385623+03:00
  • year/month/day: 2013-03-28 10:05:00 +0000 UTC
  • 2-digit year: 08/21/71
  • Date without day: 2014-04
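
As an illustration only (not Panoply's internal parser), the sketch below shows how a few of these formats can be normalized to the YYYY-MM-DDThh:mm:ss.sssZ form with Python's standard library. The list of format strings is a hypothetical subset chosen for the example.

    from datetime import datetime, timezone

    # A hypothetical subset of the supported input formats.
    INPUT_FORMATS = [
        "%a %b %d %H:%M:%S %Y",      # ANSI C style:     Mon Jan 02 15:04:05 2006
        "%a, %d %b %Y %H:%M:%S %z",  # RFC 1123 style:   Mon, 02 Jan 2006 15:04:05 -0700
        "%m/%d/%y",                  # 2-digit year:     08/21/71
        "%Y-%m",                     # date without day: 2014-04
    ]

    def to_iso8601(value):
        """Parse a date string and render it as YYYY-MM-DDThh:mm:ss.sssZ (UTC)."""
        for fmt in INPUT_FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
            except ValueError:
                continue
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=timezone.utc)
            utc = parsed.astimezone(timezone.utc)
            return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"
        raise ValueError(f"Unrecognized date format: {value}")

    print(to_iso8601("Mon, 02 Jan 2006 15:04:05 -0700"))  # 2006-01-02T22:04:05.000Z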

Timestamps

Panoply supports both string and integer timestamps. A timestamp must be between 8 and 14 digits long; values outside that range are not treated as timestamps.

Timestamp resolution is in seconds. The Data Ingestion Engine resolves 1432399705 and 1432399705000 to the same UTC date of 2015-05-23T16:48:25Z.
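
The following sketch (illustrative Python, not the engine's actual logic) shows one way second-resolution and millisecond-resolution timestamps can end up at the same UTC date, using the values from the example above. The 11-digit cutoff for detecting milliseconds is an assumption made for this illustration.

    from datetime import datetime, timezone

    def timestamp_to_utc(ts):
        """Normalize an 8- to 14-digit timestamp to a UTC datetime.
        Values with more than 11 digits are assumed to be milliseconds."""
        ts = int(ts)                  # string and integer timestamps both accepted
        if len(str(ts)) > 11:         # e.g. 1432399705000 -> milliseconds
            ts //= 1000               # drop to second resolution
        return datetime.fromtimestamp(ts, tz=timezone.utc)

    print(timestamp_to_utc(1432399705))       # 2015-05-23 16:48:25+00:00
    print(timestamp_to_utc("1432399705000"))  # 2015-05-23 16:48:25+00:00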

Numbers

Panoply uses double-precision floating-point format for numbers. This means the largest integer Panoply can parse without losing precision is 9,007,199,254,740,991 (2^53 - 1).
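
The 9,007,199,254,740,991 limit is the largest integer that a double can count by ones, which you can verify in any language that uses IEEE 754 doubles; here is a quick Python check.

    MAX_SAFE = 2**53 - 1                       # 9,007,199,254,740,991

    print(float(MAX_SAFE))                     # 9007199254740991.0 - stored exactly
    print(float(2**53) == float(2**53 + 1))    # True - precision is lost past the limit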

Transformations

Although it is not possible to apply transformations during ingestion, Panoply addresses the same need another way: it supports creating materialized views immediately after an ingestion.

Materialized views cache the results of a query as a physical table rather than as a non-cached, virtual view. A materialized view is especially useful for frequently accessed data. Materialized views increase query performance because queries go to the materialized view rather than to the underlying detail tables.
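
As a conceptual illustration only (Panoply's own workflow for this runs through its UI), caching a frequently used aggregation as a materialized view might look like the sketch below. It assumes a Postgres/Redshift-compatible connection that accepts standard CREATE MATERIALIZED VIEW syntax; the connection details, table, and view names are hypothetical.

    import psycopg2  # assumes a Postgres-compatible connection to the warehouse

    # Connection parameters below are placeholders, not real credentials.
    conn = psycopg2.connect(host="example-host", dbname="analytics",
                            user="analyst", password="...")
    with conn, conn.cursor() as cur:
        # Cache a frequently run aggregation so queries hit the cached result.
        cur.execute("""
            CREATE MATERIALIZED VIEW daily_order_totals AS
            SELECT order_date, SUM(amount) AS total_amount
            FROM orders
            GROUP BY order_date
        """)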

Deleted Source Records

You may come across a scenario in which it appears you have duplicate data. This section aims to explain why this can occur and what to do about it.

When Panoply ingests data, we upsert the data (update rows that match the primary key, then insert the rest) based on the selected primary key. That primary key can be the default for a data source, or it can be user-configured.

If a data source has a default primary key, Panoply either knows which field in the source data is the unique identifier or else Panoply creates its own unique identifier for each record. However, for some kinds of data sources, such as databases or CSV file uploads, Panoply often can’t know which fields are the primary key. Unless your data has an id field, we need you to specify the primary key.

Here’s an example of how you can end up with unexpected records. Imagine we import a CSV file on Tuesday. Then on Wednesday we re-import the same CSV file but with five new rows. Panoply can’t know whether to update or insert unless you specify a primary key. Without a user-defined primary key, Panoply creates new records for each row in the newly uploaded CSV file.
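
To see why the primary key matters, here is a scaled-down Python sketch of that scenario (illustrative only, not Panoply's code): the same rows loaded twice with no primary key are simply appended, while an upsert keyed on a user-defined primary key updates the existing rows and inserts only the new ones.

    def ingest(table, rows, primary_key=None):
        """Toy model of the upsert: with a primary key, matching rows are
        updated in place; without one, every incoming row becomes a new record."""
        if primary_key is None:
            table.extend(rows)                 # no way to match -> append everything
            return
        by_key = {row[primary_key]: row for row in table}
        for row in rows:
            by_key[row[primary_key]] = row     # update if the key exists, insert if not
        table[:] = list(by_key.values())

    tuesday   = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
    wednesday = tuesday + [{"id": 3, "name": "Cai"}]   # same file plus a new row

    no_key = []
    ingest(no_key, tuesday)
    ingest(no_key, wednesday)
    print(len(no_key))    # 5 - Tuesday's rows are duplicated

    with_key = []
    ingest(with_key, tuesday, primary_key="id")
    ingest(with_key, wednesday, primary_key="id")
    print(len(with_key))  # 3 - existing rows updated, one new row inserted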

When you re-upload data, it is possible to create duplicate records. One factor is that we never delete data during the ingestion process unless you have selected the truncate option in the data source setup. Without a primary key set, and without truncate enabled, you'll get a duplicate set of data every time you ingest, because we can't tell which of the incoming records are already in the system and we don't remove the existing ones.

You might expect that Panoply will not create duplicate records because Panoply should be able to detect that identical data already exists in the system. However, it’s not relevant that the record’s other values are identical. What distinguishes one record from any other record is the primary key. Without a defined primary key, Panoply protects against data loss by erring on the side of duplication.

To complicate matters further, data sources deal differently with deleted data. For example, Salesforce has two different ways to delete a record. Its “soft” delete puts items in the Recycle Bin by setting the Boolean isdeleted attribute to True, but it doesn’t actually remove the record.

From the perspective of the ingestion process, there is often no definitive answer to whether a record was deleted from the data source. Consider a case in which a destination table in Panoply has 100 records in it, but during the last ingestion run the data source had only 10 records. What does this mean? Were 90 records deleted? It is impossible to know if the data source contains the full set of source records, holds a subset of previously-loaded data, or contains entirely new data, especially if the data source is a database or file system.

What can you do to avoid ending up with duplicate data? There’s no one-size-fits-all solution. One approach to solving data duplication issues is to do a full sync every time with the truncate option turned on. In other words: always load all the data. Because truncate is enabled, each sync removes the current records and replaces them with the source data. Records that were removed from the source since the last ingestion are not re-created in Panoply.
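
Continuing the toy model above (illustrative only), a full sync with truncate simply clears the destination before loading, so the table ends up mirroring the source exactly, including any deletions:

    def full_sync_with_truncate(table, source_rows):
        """Toy model of a full sync: truncate first, then load everything."""
        table.clear()                  # truncate: drop all current records
        table.extend(source_rows)      # reload the complete source data

    table = [{"id": 1}, {"id": 2}, {"id": 3}]
    source = [{"id": 1}, {"id": 3}]    # id 2 was deleted at the source

    full_sync_with_truncate(table, source)
    print(table)   # [{'id': 1}, {'id': 3}] - the deleted record does not reappear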

Be aware that a “load it all” approach does not work for data sources that load data incrementally, because each incremental pass would remove the previously loaded data. Consult the data source documentation for details on whether a given data source uses an incremental loading strategy.

Ultimately, there is no source of truth to say whether a record should or shouldn’t exist. You have to know your data well enough to specify the appropriate primary key, know whether you should be truncating your data on each ingestion, and plan for how to identify and remove records that were deleted in the source data. If you have records that you think are unexpected, you can reach out to Panoply’s Support team to help identify the origin of these records.