Documentation

Transformations

Panoply.io’s data sources don’t allow complex transformations, Panoply.io simply copies the data from the data source, along with some built-in cleans and processing. It might appear limiting at first, but Panoply.io’s approach, using ELT instead of ETL, allows us to deliver your data in near real time and optimal performance, without compromising on the complexity and customizability of your business-specific transformations.

In Panoply.io’s ELT, the transformations take place immediately after the data is first injected from the data source into your data warehouse. Experienced data engineers know that transformation scripts grow exponentially in complexity and runtime as the requirements evolve, reaching a point where it might take hours to run. Over months and years, many transformations become irrelevant as new ones takes their place. For these reasons, Panoply.io’s Transformations layer, described below, doesn’t actually process the data unless it’s really needed and used frequently.

In Panoply.io, a transformation is simply a view or multiple views used together to produce a new, well-formatted table that’s tailored to your needs.

It’s up to you to define these transformations with standard SQL and user-defined functions. Using transformation views instead of pre-processing all of the data has several key benefits:

  • It’s available immediately. As soon as new data comes in, you can query it. There’s no delay due to processing time of thousands of lines of ETL code.
  • It’s agile. When you modify your views, the changes take place immediately, for all future and past data, with no need to re-run the process over and over again. You can iterate quickly, and get the results you need in seconds.
  • It’s fast. Because ad-hoc queries can benefit from filtering and indexing that’s not available for pre-processing all of the data ahead of time, your views can immediately access the raw data you’re searching for.
  • It’s standard, portable SQL. Unlike many ETL tools, you’re not vendor-locked into a specific technology that would make it difficult to port or use outside of Panoply.

But there’s one big downside: the data has to be re-generated every time you run a query against the view. As a result, if you have many different queries hitting your views, the overall performance might actually not be better.

To overcome this, Panoply.io automatically materializes some of your views when new data comes in and saves the results into a new well-optimized table. These materializations take advantage of your query statistics, so that only actively used transformations are materialized, instead of pre-processing everything, as in many cases not all transformation are accessed and used as frequently. The generated table is also pre-optimized in terms of indexing, partitioning and compression to provide your queries with the best possible performance.

The end result is the same: you’re still able to build any transformation logic imaginable, and you’re still getting pre-processed tables that are optimized for your queries. But you can achieve this without spending a lot of time up-front for data processing, and without being locked into a cumbersome, rigid process that takes days to iterate.