Basic SQL

There are difference in the syntax required when using SQL that will depend on where your data warehouse is built. For more information see the documentation for Google BigQuery or Redshift.

Panoply is designed to allow you to access and manage your data using SQL, so in addition to your analytics and visualization toolset, you’ll be using SQL to prepare and model your data for analytics and reporting.

Panoply provides a simple web-based SQL workbench that you can use to explore your data, but for more extensive and complex query building, you’ll probably want to set up an SQL client.

Getting started with SQL

Now that we’ve covered how tables are generated and what to expect from them, let’s take a look at some ways you can use SQL to start exploring your data in Panoply. What follows are some simple suggestions, but remember to check your SQL resources to be sure you are using the correct syntax.

Open up your SQL client or head to the Panoply workbench for this step. After you are set up, you can start with a simple query that will show you all the tables relevant to a specific data source, like this:

SELECT DISTINCT tablename as 'name'
FROM pg_tables
WHERE tablename LIKE '{data source}_%%'

Note: Remember to replace {data source} with the data source prefix specified when collecting your data. For example, if you used default settings to collect data from Twitter, your prefix would be twitter_.

This query will return a single column that lists every table that was generated from your data source. Once you have this list, the fastest way to figure out how the data you need is spread across different tables is just to start digging in and exploring individual tables. You can get a quick look at what might be in an individual table by checking the column names with a query like this:

SELECT table_name, column_name
FROM information_schema.columns
WHERE table_name = '{table of interest name}'

Replace {table of interest name} with the actual name of the table you’re interested in exploring. If you spot the columns you’re looking for in a particular table, you can then run a quick check on the data itself with a query like this:

SELECT *
FROM {table of interest name}
LIMIT 100

As usual, replace {table of interest name} with the name of the table you’re taking a closer look at. This will give you the first 100 records in the selected table, which will allow you to take a closer look at the data itself.

A simple join

We’re talking relational table structures here, so it’s helpful to touch on joins at this point. If you’re not familiar with joins, we’ll cover enough of the basics here for you to get something done.

It often happens that the data you need to answer a specific question is spread across multiple tables in a relational database. This is by design, and helps keep your database tidy and reliable. In practice, this means that if you have a customers table, a sales table and a products table and want to see total sales broken out by product and customer type, you will need to join multiple tables in your query.

Simple join query

In order to join two tables, you will need to identify a key that links the two tables. Most databases are designed to provide designated, unique keys for this purpose, but you will often find that some other column works as a key for the purposes of joining two or more tables.

A basic two-table join will have the general form of:

SELECT {column1}, {column2}, {column3} … {columnN}
FROM {table1}
JOIN {table2}
ON {table1}.{key} = {table2}.{key}

In this example {key} will be the name of the column that you will be using as a join key. The name of the column may be different between tables–the important thing is that the underlying data matches. As a hypothetical example, suppose that you were trying to join sales and products. In the sales table, products are listed only by their product_id. In products, the corresponding column is named simply id. In that case, the ON statement from the above query would look like:

ON
sales.product_id = products.id

Twitter example

In order to tie everything together, let’s look at a specific example of a join that covers these aspects of the Panoply data collection and integration process. If you collect your Twitter data with subtable creation enabled, you’ll find that each of the main tables–tweets, likes and followers–will spawn multiple subtables. This is helpful in many cases, as specific data from table rows will be broken out into relatable tables.

So let’s suppose you wanted to break out each individual hashtag attached to a specific tweet. You would need to join twitter_tweets to twitter_tweets_entities_hashtags. Here’s how you’d do that:

SELECT twitter_tweets."text" as "tweet", twitter_tweets_entities_hashtags."text" as "tag"
FROM twitter_tweets
JOIN twitter_tweets_entities_hashtags
ON twitter_tweets.id = twitter_tweets_entities_hashtags.twitter_tweets_entities_id

You can add whatever tweaks and permutations you’d like, but that’s the basic query that you should start with.