Overview

Welcome to Panoply. You’ve come to the right place for easy, automated data management. Get ready to accelerate your data journey! We’ve put together a little guide to help you get up and running with your data. But before we launch into the step-by-step, we should cover a little bit of background:

What is Panoply?

Panoply is a cloud data platform made up of three parts:

  1. High-powered, automated data warehouse
  2. Automated ETL (data Extraction, Transformation, and Loading) tools
  3. Data engineering magic (that you don’t have to manage)

This means you can store all your data in one place and know that it’s always up-to-date, ready to be combined, queried, and analyzed.

Panoply and Your Team

With Panoply, you can work with your data as fast as you want and you can think as big as you want. You can analyze, transform, and iterate with your data without burdening your IT and engineering team.

CTOs trust Panoply because it runs on AWS or Microsoft Azure infrastructure, handles everything from access permissions and SOC 2 to GDPR, and requires no architectural planning or ongoing maintenance. Panoply also scales automatically and offers real-time chat support from a team of savvy data architects who can help you when things get complex or unfamiliar.

Now that we’ve covered all the reasons you’re going to love Panoply, let’s jump into the guide.

Getting Started

In order to start this guide, you’ll need to log into Panoply and go to the data sources screen.

Collect Data

Choose a data source that you are familiar with—this guide is about exploring your data, so you might as well have fun with it. If you have any questions while connecting Panoply to your data source, click the chat icon and chat with a Panoply Data Architect to get started. For more instructions for this step, refer to our data source documentation.

Your Data in Panoply

After you collect your data into Panoply, it will be organized into one or more tables, structured like this:

Schema: public (collection of tables and other “objects”):

  • Table: your_crm_contacts
    • Column: id
    • Column: email
    • Column: created_date
  • Table: your_crm_companies
    • Column: id
    • Column: name
    • Column: city
  • Table: your_crm_deals
    • Column: id
    • Column: company_id -- could join to your_crm_companies.id (see the example query after this list)
    • Column: signed_date
  • Table: your_analytics_tool_data
    • Column: event_id
    • Column: timestamp
    • Column: user_id
  • Views: Saved queries can be queried like tables (if defined)
  • Other Objects: User Defined Functions, Triggers, etc
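
The comment on company_id above hints at how these tables relate to each other. As a minimal sketch using the hypothetical tables from this example schema (swap in your own table and column names), a query joining deals to companies could look like:

SELECT your_crm_companies.name, your_crm_deals.signed_date
FROM your_crm_deals
JOIN your_crm_companies
ON your_crm_deals.company_id = your_crm_companies.id

Joins like this are covered in more detail later in the guide.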

To view the collected data, click Tables in the navigation pane.

The Table Rows count includes rows deleted today. Our automated process permanently removes deleted rows once per day. You can sort this list of tables by name, size, and row count.

When you click on a table, you can see the following:

  • Data – A sample of the table’s data so that you can get a feel for its structure.
  • Metadata – The list of the table’s columns and their data types. Clicking a field allows you to configure it further. Panoply also generates additional fields for each table to denote when each row was last retrieved from the source (__senttime) and when the row was last written into Panoply (__updatetime); see the example query after this list.
  • Connection Details – Lists database connection details. Use these to connect your external SQL tools to your Panoply data.
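
Once you have a SQL client connected (covered in the next section), you can see those generated fields in action by sorting a table by them. Here's a quick sketch using the hypothetical your_crm_contacts table from the example schema above (substitute one of your own tables):

SELECT id, email, __senttime, __updatetime
FROM your_crm_contacts
ORDER BY __updatetime DESC
LIMIT 10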

Checkpoint #1 You’ve reached the first checkpoint! Hopefully you’ve gotten a good overview of what Panoply is best at: easily collecting and integrating data from multiple sources and managing the back end maintenance without a dedicated data engineer. If you want to continue with the guide as a step-by-step, you should make sure to have collected at least one data source by this point. If you want to practice SQL joins (more on this later), make sure you have at least 2 related tables in your Panoply data warehouse.

Connect your Tools to Panoply

While your data is flowing in, let’s set up the tools you’ll use to work with it. You’re going to want to get comfortable with a SQL workbench in addition to your analytics tool(s) of choice. The steps for connecting a SQL workbench or a BI tool are largely the same. First, go to platform.panoply.io/#/connect and copy your connection details. You should see something like this:

Connection details:

  • Host – db.panoply.io
  • Port – 5439
  • Database – The database name you used when you created your Panoply account. If you don’t remember it, check the top-right corner of the platform.
  • User / Password – The same credentials you used to sign up for panoply.io. If you don’t remember them, you can reset your password.
  • Driver – Amazon Redshift or Postgres

As long as the tool you want to connect to Panoply uses ODBC/JDBC or has a built-in connector for Postgres or AWS Redshift, you should be able to use the same connection details for everything. If you need extra help setting up your connections, check out our in-depth guides to connecting BI tools to your Panoply instance.

Get comfortable in a SQL client

Panoply is designed to allow you to access and manage your data using SQL, specifically PostgreSQL—even more specifically Redshift SQL. So in addition to your analytics and visualization toolset, you’ll be using SQL to prepare and model your data for analytics and reporting.

Panoply provides a simple web-based SQL workbench that you can use to explore your data, but for more extensive and complex query building, you’ll probably want to set up a SQL client. Fortunately, there are some excellent options out there. If you need or want to try a new one, here are a few great free recommendations:

Recommended SQL Clients for Mac (with free versions)

  • Postico - Fast and simple Mac desktop client

  • Metabase - Fast, web-based client, easy visualization workflow

  • Mode - Cloud-based data science workflow with built-in SQL client

  • PopSQL / Seekwell - A newer option. Slower performance but has stronger visualization capabilities

Recommended SQL Clients for Windows (with free versions)

  • Metabase - Fast, web-based client, easy visualization workflow

  • Mode - Cloud-based data science workflow with built-in SQL client

  • DBeaver - Free, open source universal database tool

Checkpoint #2 You’ve hit the second checkpoint. At this point, you should have some data stored in Panoply. You should also have a SQL workbench connected to Panoply, and you should be able to see the data that you collected into Panoply within the “public” schema. If you already feel comfortable navigating relational databases with SQL, feel free to skip through the next section to checkpoint 3.

If you’d like to check your progress so far using SQL, try running this query in your SQL client (while connected to Panoply):

SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog'
   AND schemaname != 'information_schema';

pg_catalog.pg_tables is a system catalog that lists all the tables in your Panoply database. The WHERE clause filters out the system tables that would otherwise clutter up the results.

The next steps will show you how you can gain familiarity with your data and start running queries to answer your business questions.

Get to Know your Data

This step is the foundation to sound data analysis. Often called data profiling, it is all about getting familiar with new data, from a granular level to a high level. The first step is to get a sense of the tables that Panoply generates when it collects a data source. After that, you’ll run some SQL queries to understand the structure and contents of your new tables.

Different sources make different tables

When Panoply collects data from a connected data source, it will generate a table–or multiple tables–in your database containing all the data from that source. There are several reasons why Panoply might create multiple tables from a single source. The most common is that the data source has multiple endpoints that each serve a different segment of your data. For example, if you collect data from Facebook Ads, you’ll see at least three tables in your Panoply database after the collection is finished: Insights, Ads, and Adsets. That’s the way the Facebook API organizes the data, and we try to preserve that structure as closely as possible.

Another common way for multiple tables to be created from a single data source involves subtable creation. Panoply is designed to optimize your data for analysis, which means that all data points need to be put into standardized tables. For many data sources, like CSVs and MySQL databases, this is just a matter of porting over whatever table structures already exist. However, if your data source is a NoSQL database or an API that serves nested JSON, Panoply will automatically unpack the nested objects into their own tables.

In practice, this means that ingesting from these types of sources will create multiple, smaller tables in your Panoply database. Twitter makes for a great example of nested object flattening. When Panoply collects data from Twitter, which serves its data in highly nested structures, it will create multiple subtables. The table twitter_tweets will also generate twitter_tweets_user, twitter_tweets_retweeted—and multiple other tables—as the nested objects are each extracted and collected into their own tables.

This is Panoply’s default behavior, but you can opt to turn on table flattening if you want to avoid table proliferation. This will unpack nested data directly into columns in its parent tables, rather than creating new subtables. For more detailed information on how Panoply deals with nested data, check out the docs page.

Profiling with SQL

Now that we’ve covered how tables are generated and what to expect from them, let’s take a look at some ways you can use SQL to start exploring your data in Panoply. Open up your SQL client or head to the Panoply workbench for this step. After you’re set up, you can start with a simple query that will show you all the tables relevant to a specific data source, like this:

SELECT DISTINCT tablename AS "name"
FROM pg_tables
WHERE tablename LIKE '{data source}_%'

Note: Remember to replace {data source} with the data source prefix you specified when collecting your data. If, for example, you used the default settings to collect data from Twitter, your tables would all start with twitter_, as in the filled-in query below.
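
For example, with the default Twitter prefix, the filled-in query would read:

SELECT DISTINCT tablename AS "name"
FROM pg_tables
WHERE tablename LIKE 'twitter_%'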

This query will return a single column that lists every table that was generated from your data source. Once you have this list, the fastest way to figure out how the data you need is spread across different tables is just to start digging in and exploring individual tables. You can get a quick look at what might be in an individual table by checking the column names with a query like this:

SELECT table_name, column_name
FROM information_schema.columns
WHERE table_name = '{table of interest name}'

Replace {table of interest name} with the actual name of the table you’re interested in. If you spot the columns you’re looking for in a particular table, you can then run a quick check on the data itself with a query like this:

SELECT *
FROM {table of interest name}
LIMIT 100

As usual, replace {table of interest name} with the name of the table you’re taking a closer look at. This will give you the first 100 records in the selected table, which will allow you to take a closer look at the data itself.
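
It can also help to check how many rows a table holds before digging further; a simple count works for that, using the same placeholder as above:

SELECT COUNT(*)
FROM {table of interest name}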

A simple join

We’re talking relational table structures here, so it’s helpful to touch on joins at this point. If you’re not familiar with joins, we’ll cover enough of the basics here for you to get something done. Be sure to check out more in-depth documentation when you have a chance, though.

It often happens that the data you need to answer a specific question is spread across multiple tables in a relational database. This is by design, and helps keep your database tidy and reliable. In practice, this means that if you have a customers table, a sales table and a products table and want to see total sales broken out by product and customer type, you will need to join multiple tables in your query.

Simple join query

In order to join two tables, you will need to identify a key that links the two tables. Most databases are designed to provide designated, unique keys for this purpose, but you will often find that some other column works as a key for the purposes of joining two or more tables.

A basic two-table join will have the general form of:

SELECT {column1}, {column2}, {column3}, ..., {columnN}
FROM {table1}
JOIN {table2}
ON {table1}.{key} = {table2}.{key}

In this example {key} will be the name of the column that you will be using as a join key. The name of the column may be different between tables–the important thing is that the underlying data matches. As a hypothetical example, suppose that you were trying to join sales and products. In the sales table, products are listed only by their product_id. In products, the corresponding column is named simply id. In that case, the ON statement from the above query would look like:

ON sales.product_id = products.id
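
To make that concrete, here’s a minimal sketch of a full query against those hypothetical sales and products tables (the name column and the count are illustrative; only the product_id/id key comes from the example above):

SELECT products.name, COUNT(*) AS total_sales
FROM sales
JOIN products
ON sales.product_id = products.id
GROUP BY products.name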

Twitter example

In order to tie everything together, let’s look at a specific example of a join that covers all aspects of the Panoply data collection and integration process that we’ve covered so far. If you collect your Twitter data with subtable creation enabled, you’ll find that each of the main tables–tweets, likes and followers–will spawn multiple subtables. This is helpful in many cases, as specific data from table rows will be broken out into relatable tables.

So let’s suppose you wanted to break out each individual hashtag attached to a specific tweet. You would need to join twitter_tweets to twitter_tweets_entities_hashtags. Here’s how you’d do that:

SELECT twitter_tweets."text" as "tweet", twitter_tweets_entities_hashtags."text" as "tag"
FROM twitter_tweets
JOIN twitter_tweets_entities_hashtags
ON twitter_tweets.id = twitter_tweets_entities_hashtags.twitter_tweets_entities_id

You can add whatever tweaks and permutations you’d like, but that’s the basic query that you should start with.
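
For instance, one common tweak is to count how often each hashtag appears instead of listing every tweet, using the same tables and join key as above:

SELECT twitter_tweets_entities_hashtags."text" AS "tag", COUNT(*) AS tag_count
FROM twitter_tweets
JOIN twitter_tweets_entities_hashtags
ON twitter_tweets.id = twitter_tweets_entities_hashtags.twitter_tweets_entities_id
GROUP BY twitter_tweets_entities_hashtags."text"
ORDER BY tag_count DESC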

Other ways to explore your data

If you want to play around more in your new SQL sandbox, our partners at Chartio have put together a list of SQL queries that can help you explore in other ways. You can find some of the queries excerpted below.

A query to preview any table:

SELECT TOP 10 *
FROM table_name

This will give you a quick look at the top 10 rows of a table of interest, similar to the head command in bash or the head method in pandas. (Adding LIMIT 10 to the end of a query works the same way.)

A query to select distinct values of an interesting column:

SELECT DISTINCT column_name
FROM table_name

This will help you get a sense of how many possible values there are in a column of interest. Useful for binning data in plots or just getting a sense of the range of the underlying data.
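
If you only need to know how many distinct values there are, rather than see the full list, you can count them instead:

SELECT COUNT(DISTINCT column_name)
FROM table_name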

A query to aggregate values based on an interesting column:

SELECT column_name, count(DISTINCT id)
FROM table_name
GROUP BY column_name
ORDER BY 2 DESC

This query groups the rows by the values in column_name and counts the distinct ids within each group. The ORDER BY statement arranges the results in descending order of those counts, so the most numerous items appear at the top.

A query to filter rows that contain an interesting value in an interesting column:

SELECT *
FROM table_name
WHERE column_name = 'name of the value'

This query will allow you to display all rows from a table that contain the value you specify. If, for example, you wanted to filter all your shoe sales to display only casual sneakers, you would use a query like this.
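
If you don’t know the exact value, or want to match on part of it, LIKE with a wildcard pattern works the same way (the pattern here is just an illustration for the sneaker example):

SELECT *
FROM table_name
WHERE column_name LIKE '%sneaker%'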

Checkpoint #3 At this point, you should feel comfortable collecting, organizing and profiling your data. You should also have a grasp on the basics of pulling together your data from different tables for the purposes of analysis.

Advanced data management

Tables and data types

Now that you’ve gotten a chance to get familiar with your data, you can start to use some of Panoply’s more advanced features to tune it to your needs. First, head back to the tables screen.

Select a table you’d like to modify or explore in more depth to view the table’s columns and associated Metadata.

Clicking a column header from the Metadata view allows you to dig deeper into the table’s inner structure. You can also rename the column, delete it, and change its type or other technical details.

Generally, there’s no need to change these values because Panoply automatically assigns them for you.

Job and query management

After running a query, you can save the query as a transformation or export the resulting table into your own system. Panoply offers two types of transformation views: simple views and materialized views. A transformation typically starts as a simple view. When needed, based on performance, simple views can be converted into materialized views, which are always accessible and up to date.
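
Saving and managing transformations happens through the Panoply interface, but if you’re curious what a simple view looks like in plain SQL, here’s a minimal sketch reusing the hypothetical CRM tables from earlier in the guide (the view and column names are illustrative):

CREATE VIEW signed_deals AS
SELECT your_crm_deals.id, your_crm_deals.signed_date, your_crm_companies.name
FROM your_crm_deals
JOIN your_crm_companies
ON your_crm_deals.company_id = your_crm_companies.id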

To track your job progress, select Jobs from the navigation pane. You can filter the jobs list according to status (All Statuses, Pending, Running, Error, and Success) and job type (All Types, Collect, Collect-Direct, Materialize, Alter, and Export).

Scheduling data collection

Once you get your data sources connected and your analysis pipeline built, you’ll want to make sure you’re working with the most up-to-date data you have at your disposal. You can manage future collections manually as new data flows into your individual data sources, but this can quickly become labor intensive as the number of sources increases. The easier option is to let Panoply automatically collect new data for you.

Panoply provides a simple-to-manage data collection scheduling interface that you can access from the data sources pane. Next, follow these steps:

Find the data source you’d like to schedule for collection, and select it. A clock icon will appear on the “collect data source” line of the next pane.

Select the clock icon. A new pane will appear with configuration options.

Use the dropdown menus to configure your collection schedule and hit accept. Panoply will now continuously collect your data, adding only the most recent data to your warehouse unless you configure it to collect data from scratch each time.

Checkpoint #4 You’re up and running! Now you can collect your data, configure your tables and set up collection schedules to support your analysis. You’ve just built a cutting edge data analytics pipeline in the time it takes to step through this guide. Now you can spend more time generating insights and less time managing data.

If you’d like to take your Panoply data warehouse to the next level, be sure to set up a data review with a member of our Data Architect Team.
