Data modeling is an integral part of any organization’s ability to analyze and extract value from its data. Everyone involved, from collection to consumption, should know what data modeling is and how they, as stakeholders, can contribute to a successful data modeling practice. Read on to learn how developing a strong data model drives growth and productivity throughout your organization.
What is Data Modeling?
A data model determines how data is exposed to the end user. Optimally creating and structuring database tables to answer business questions is the desired role of data modeling, setting the stage for the best data analysis possible by exposing the end user to the most relevant data they require.
Data Modeling Is Not Data Analysis
There is some overlap between data modeling and data analysis (fields need to be understood in order to be mapped or tabled properly, or analyzing performance of the data model as a whole, for example), but it’s an important distinction to remember: proper data modeling is optimizing the process implemented to deliver clean, useable data for analysis.
Data Modeling Makes Analysis Easier
The fundamental objective of data modeling is to only expose data that holds value for the end user. Clearly delineating what questions a table should answer is essential, and deciding on how different types of data will be modeled creates optimal conditions for data analysis. So, while data modeling itself is highly technical, it nonetheless relies on input from non-technical personnel in order to properly define business objectives.
A good business example to consider is marketing attribution, where comparing and contrasting data from both first touch and last touch attribution perspectives may be very significant. Digging deeper, like building a marketing strategy based exclusively off anything “last touch” in the sales funnel — the final tweet, text alert, email promo, etc., that led to a conversion — requires amassing the raw data and filtering in just the last touch of the journey for analysis. This is hard to do with just a single query, and why it’s important to execute before the time of analysis.
With the objectives outlined, database tables can be assembled with each field itself clearly defined. These definitions become part of a data dictionary, an integral part of any successful data model.
The Role of the Data Dictionary
Data definition is essential. An effective data dictionary is an inventory that describes the data objects and items in a data model, and will include at least two key ingredients: properly labelled tables and properly defined terms. At its core, these define the rows (elements) and columns (attributes). Clarity is key here, and it’s important to remember that tables without definitions are counterintuitive (at best). Whether it’s about marketing, web traffic, an email campaign, etc., the goal is exposing clean, raw data. It’s imperative to any successful data model that the definitions for the terms used are clear, concise, and uniform, and that any ambiguity when labelling and defining terms has been removed.
The data dictionary should be maintained by all the data’s stakeholders but especially those responsible for collection and storage. More on this below, but as we want to democratize data for a data-driven culture, having a data expert adjusting or pruning the data model as needed ensures data collection is structured in a way that enhances downstream use.
Where Is the Data Going? (And to Whom?)
Increased data volumes can produce barriers to accessibility, or provide a wealth of insight. All kinds of business questions arise, requiring data to be structured accordingly. The comprehension level of the end user is a factor, but the guiding principle is modeling data in a way that makes it very easy to query, so that any end user can utilize the data once received. In other words, it’s meant to be useful. Flooding the user with extraneous and irrelevant data is as frustrating as it is time-wasting. Because there are always fields for engineers (like a update timestamps or batch IDs) that hold zero benefit for the end user, attention must be paid to the key take away: what fields are exposed to the end users, and how much will those fields denote true business value?
All data have different kinds of structure and granularity. Tables are structured to suit end user needs, and granularity defines the level of detail the data provides. Take transactional data as an example. Each row of data could represent an item purchased, and include where it was purchased, how it was purchased, or when it was purchased, even down to the second. (As an example, the latter might be a significant metric for anyone in retail monitoring sales on Black Friday or the day after Christmas.)
Another common business reference is the construction of a churn model, and the various parameters inherent in the end user’s needs. These needs are loosely defined as a time component, with contractual and non-contractual factors playing a role as well. Customer onboarding and retention behavior can vary substantially, and what the end user needs often exists at a more granular customer level: one day after a promo, one month after a free trial, measuring client satisfaction a year out, and so on.
Since the requirements are clear, a solution is easily modeled: the end user defines the stages or fields they care about, and the data modeler creates the model with tables exposing all relevant data. By exposing churn rate data at specific intervals, interpreting and then “bucketing” the interpreted data — adding an extra column to the table to provide better insight — a data model has been constructed that produces significant business value.
This is part of the best practices approach to data modeling: two deciding (human) factors — someone that understands the right questions to ask, and someone to build the data tables that provide answers and insights.
Data Modeling in Practice
Utilizing a Domain Knowledge Expert
As mentioned above, defining what a table should look like means modeling data in a way that makes it very easy to query — in essence, so any end user or BI tool can use it. The data engineers do the heavy lifting once they understand the business questions to be answered. And just as someone with business domain knowledge is required for providing the right questions to ask, a data domain expert is necessary for interpreting the technical nuances in the data, what it looks like in its raw form, understanding the instrumentation of the data, and translating it into a model that’s easy to comprehend. Their knowledge is key to what you can and can’t model, and how the tools utilized will be implemented most effectively.
What event the data represents will most likely vary by perspective: for example, a marketing person may see the event as part of a funnel stage — one step has been completed, while another has not — whereas from an engineering standpoint the event might be defined as when a specific POST request was sent. This speaks to another best practice of data modeling: Trust. Both types of expertise require the other to complete the picture and create a model that works for everyone.
As business priorities evolve, the data model must likewise adapt and modify. The entities — and relationships between entities — that make up the schema for queries will change with time and the demands of the enterprise; a data domain expert will ensure that the data model stays up to date and agile enough to continue exposing raw data that is relevant and purposeful.
Runtime vs Preprocessing: It’s About Optimizing Performance
Mapping arcane, technical details within a raw data source and directing it to a user-friendly, easy-to-read outcome can be done with database views and processed at query time. However, if a new table is built on top of that within a data warehouse, modeling the data appropriately as a specific schedule might dictate, that data will be preprocessed.
When weighing the tradeoffs between using runtime for modeling over preprocessed, or pre-calculated, choosing runtime over non-runtime is preferred whenever possible. The more that can be done with the model in runtime, the better (in general), as this translates to less maintenance, while multiple steps with persistent data equate to more management.
Preprocessing is preferred when it’s both calculation-intensive and necessary, as in the churn model referenced previously: looping it through logic is inefficient in runtime, since it would require measuring a ton of data — multiple queries — thereby taking too long to deliver timely insights. Documenting past or forecasting future customer churn rates require different models, each using preprocessed output tables to give desired numbers.
Drawing the line between runtime and preprocessing is the job of an experienced data engineer; as a general rule, it’s good to start “raw” and trend toward more complex models as enterprise needs become more nuanced. Single query works for some tasks; numerous queries may require preprocessing.
The Role of the Analyst
New models are not created overnight. Having to wait hours (or longer) for data processing jobs to arrive, or only receiving once-a-day batched data, will continue to diminish in frequency. End users become more comfortable deploying BI tools for everyday tasks, and the tools themselves continue to become more powerful, reducing the complexity of queries to do analyses, and enabling “self-service” analytics. All are positive developments, but without the interaction and oversight of a data analyst the potential exists for end users to just as easily draw the wrong conclusions from the accelerated access to data.
An analyst assesses data quality and performs data structure checks, isolating and removing bad values. They may create new tables that track volume of data or row counts of data from a specific raw table. The analyst can also automate a data quality model on top of a model that sets a query for customization, identifying poor quality and outliers.
For example, a query structured to evaluate sales data for the current Monday when compared to the previous six months of Mondays would benefit with build-in exceptions into the quality model — think Cyber Monday or Labor Day Weekend — that furnish more nuanced, useable analytics.
Defining the role of the analyst ties into the essence of defining the data model, helping shape what the tables will look like and what queries those tables will serve. And that analyst is part of a team serving a data warehouse, all operating with the goal of delivering relevant, real-time, 360-degree data for all end users. When a change to the logic of a model occurs, they’ll be the ones testing it to make sure it’s robust.
Advances in cloud data warehousing have tied the capability to source both structured and unstructured data directly to an end user’s ability to analyze and implement that data within minutes. A comprehensive, pragmatic data model further facilitates BI tools turning raw data into business value.