Data Democratization: Three Keys to Getting Started

Data democratization describes barrier-free access to data. In this article, we address three important concepts worth internalizing for any company looking to fully embrace data democratization: Cultivating Trust, Usability, and Speed to Insight.

Data Democratization Overview

Simply put, data democratization means providing unfettered, enterprise-wide access to data. Everyone within an organization benefits from accelerated access to data, and should have that ability at their fingertips. The “enterprise” could be humanity itself — the printing press was a massive leap in democratizing data, as is allowing citizens uncomplicated, true access to the data on government spending via the internet, for example. In this context, however, we’re focusing on the roadmap an organization puts forward in the quest to achieve data democratization.

Implementing data democratization requires a data program structured to be self-aware. With greater company-wide access to data, protocols should be in place to ensure that end users understand the data they are seeing, so that nothing is misinterpreted, and that overall data security is maintained, since greater accessibility may also increase risk to data integrity. The effort these safeguards require, while necessary, is far outweighed by the value of observation and data input from all corners of an organization. With participation enabled and encouraged across an organization’s ecosystem, further insight becomes possible, driving innovation and company performance.

I. Cultivating Trust

Data analytics projects are constantly hampered by lack of trust. When individuals across all sectors of an organization are empowered by the enhanced decision-making capabilities that come with accelerated and unfettered access to data, this drives growth and performance. But this empowerment can quickly turn to disillusion if the system in place is rife with data quality issues. There is no trust when the exposed data is perceived to be (or simply is) not credible. Trust is the biggest issue to address at the outset of any data analytics project, and those building the data stack must meet the challenge of instilling trust in all end users.

Anything built to last has, at its core, a solid foundation. Building a competent data analytics stack requires several components in both the data pipeline and data warehouse that are analogous to an actual brick and mortar dwelling containing information. For example, properly structuring database tables in the data model and/or implementing and maintaining a comprehensive data dictionary are as essential as, say, the steel and glass housing a local library, or the indexing system at every Amazon distribution center. But trust empowers you to feel confident about entering the building, seeking the knowledge contained therein and extracting value from it. For those relying on visualization tools to derive insights from data analytics, trust is built atop a foundation of thoughtful data management processes and transparency through all layers of the analytics stack.

One of the more difficult problems to solve is the actual connection between source systems and a data warehouse. The more connectors in play, the faster data can be obtained. A lot of nuance is involved in the construction of a connector, and this is a huge value that may not be apparent to those who don’t deal with data systems, though the connector’s worth becomes readily visible once it’s in place and operational.

The solution architecture — the tools and technology that are built into the data stack — shows people that progress is happening in their quest to solve the problem of multiple data sources. These should be viewed as the nuts and bolts of the analytics stack: the ETL tool to extract, transform, and load data (more on that below), the connectors interfaced with the ETL process, the data warehouse responsible for storage and its technology, and, finally, the visualization layer that ideally sits atop it all, allowing users to query data without having to understand how any of it works. When these essential components are properly architected, assembled, and deployed, desired results become achievable, and trust is reinforced.

The Processes of Data Management

With many companies, data management isn’t prioritized as it should be, and data engineering is not an in-house skill set. Yet those same companies often understand the value of having an analytics environment to direct their own product or sales strategy, marketing campaigns, and so on. This may lead them to use whatever technical resources they have on hand to write a build script (code to transform and load data). The result may be personnel whose primary role is centered on product engineering or development, but who are now tasked with ensuring that data collection and management are operational, let alone optimal.

They will undoubtedly do their utmost to build what they consider to be the best data stack available, but may not realize how hard it can be to have data jobs run smoothly and efficiently and — equally significant — possess the ability and agility to respond and recover when failure strikes. A dedicated team (even if that means a team of one), whose primary job is development and oversight of the data stack, is always preferred.

ETL and Data Integration

The Extract, Transform, Load (ETL) process describes the infrastructure built to pull the raw data required for analysis (extract), convert and prepare it in some useful form for a business need (transform), and deliver it to a data warehouse (load). Moving data from source to storage, from multiple locales and/or disparate systems to an organized database, renders data homogeneous, ready, and useful. Integrating data from various sources in this manner has been the standard for decades, though a modern, highly agile process named ELT (extracting and immediately loading data, with transformations then taking place in the warehouse) holds much promise. ELT reduces the undesirable characteristics associated with ETL: its complex nature and the length of time it takes to set up and execute. This is excellent news for analysts or anyone relying on BI tools to query data: greater access, increased speed, less complexity.
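To make the difference between the two orderings concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a data warehouse. The table names, the sample rows, and the functions run_etl and run_elt are all hypothetical, chosen only for illustration; a real pipeline would use its own sources and warehouse.

```python
import sqlite3

# Hypothetical raw records, standing in for rows pulled from a source system.
RAW_ORDERS = [
    ("1001", " Alice ", "19.50"),
    ("1002", "BOB", "5.00"),
]


def run_etl(conn):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, customer TEXT, amount REAL)"
    )
    cleaned = [
        (int(order_id), customer.strip().lower(), float(amount))
        for order_id, customer, amount in RAW_ORDERS
    ]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows as-is, then transform with SQL inside the warehouse."""
    conn.execute(
        "CREATE TABLE orders_raw (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM orders_raw
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
print(etl_rows == elt_rows)
```

Both paths end with the same cleaned table. The practical difference in the ELT path is that the untouched raw rows remain in the warehouse (orders_raw here), so a transformation can be revised and rerun without extracting the data again.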

However the data is warehoused, it is essential that data arrive without any loss and without being corrupted by error. Checks can be placed in various locations, depending upon what makes sense given the vendors you’re using, the different pieces, and how easily it all fits together. Often, you’ll want to put these checks within the ETL layer, as far upstream as you can, since that provides the earliest possible warning of anything amiss.
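As an illustration of what such checks might look like, the sketch below runs two simple tests on one batch: a row-count comparison to detect loss, and a required-field test to detect corruption. The names CheckResult and check_batch, and the sample rows, are assumptions made for this example and do not come from any particular ETL tool.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_batch(source_rows, loaded_rows, required_fields):
    """Run simple loss and corruption checks on one batch of rows (dicts)."""
    results = []

    # Loss check: every extracted row should have been loaded.
    results.append(
        CheckResult(
            "row_count",
            len(source_rows) == len(loaded_rows),
            f"source={len(source_rows)} loaded={len(loaded_rows)}",
        )
    )

    # Corruption check: required fields must be present and non-empty.
    bad_rows = [
        index
        for index, row in enumerate(loaded_rows)
        if any(row.get(field) in (None, "") for field in required_fields)
    ]
    results.append(
        CheckResult("required_fields", not bad_rows, f"rows with gaps: {bad_rows}")
    )
    return results


# Hypothetical batch: the second loaded row lost its email value in transit.
source = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
loaded = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": ""}]
for result in check_batch(source, loaded, required_fields=["id", "email"]):
    print(result.name, "ok" if result.passed else "FAILED", result.detail)
```

Running checks like these immediately after extraction, before anything reaches the warehouse, is one way to get the early warning described above.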

The Data Model and Data Dictionary

With data management, it’s key to remember that the goal of the data model is exposing the end user to the most relevant data they require. Data must be clean and prepared, and must reflect the expressed needs of the non-technical personnel relying on the data for use with their BI tools. This input is required for business objectives to be realized.

When data is collected and stored, properly defining each data field’s meaning and purpose requires a comprehensive data dictionary. This collection of definitions describes the dimensions of the data itself; its terms are clear and uniform, updated as needed, and exhibit zero ambiguity. Understanding how a data dictionary is assembled begins with how a particular metric or piece of data fits into a business story.
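One lightweight way to picture a data dictionary is as a lookup from field name to an agreed type and definition, which can then be used to flag fields nobody has documented. The sketch below assumes that shape; the entries in DATA_DICTIONARY and the helper names are hypothetical, and a real dictionary would be written with the business users who rely on the data.

```python
# Hypothetical dictionary entries; real ones would come from the business.
DATA_DICTIONARY = {
    "order_id": {"type": int, "definition": "Unique identifier for one order."},
    "order_total": {
        "type": float,
        "definition": "Order value in US dollars, tax included.",
    },
    "placed_at": {
        "type": str,
        "definition": "UTC timestamp (ISO 8601) when the order was placed.",
    },
}


def undocumented_fields(row):
    """Return field names in the row that have no dictionary entry."""
    return sorted(name for name in row if name not in DATA_DICTIONARY)


def type_mismatches(row):
    """Return documented fields whose value does not match the documented type."""
    return sorted(
        name
        for name, value in row.items()
        if name in DATA_DICTIONARY
        and not isinstance(value, DATA_DICTIONARY[name]["type"])
    )


row = {"order_id": 17, "order_total": "42.50", "promo_flag": True}
print("Undocumented:", undocumented_fields(row))
print("Wrong type:", type_mismatches(row))
```

Here promo_flag is reported as undocumented and order_total as mistyped (a string where a number was agreed), which are the kinds of ambiguity a maintained dictionary is meant to eliminate.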

These two essential components are key to providing truth and fostering insight. A more detailed overview of both may be found in our articles Data Modeling Examples for Analytics and Practical Data Dictionary Design and Maintenance.

Accurate and Faithful Data

Accurate data collection means acquiring clean data that is faithful to its origin. While it’s easy to think about moving data from Point A to Point B, it’s often not practical to take billions of rows of data and move them all at once. You can’t just forklift everything every time, especially if you’re looking to decrease the latency of that data. The trick is to load data incrementally, only bringing over data that you haven’t already loaded. Any data that’s new or changed (and only that) is what should be loaded, greatly reducing the volume of data transferred and decreasing the latency as well. With checks in place, this also means there is far less data to monitor for accuracy, streamlining the process of ensuring that data is faithful to actual events.
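A common way to implement incremental loading is a high-water mark: remember the latest modification time already loaded, and on each run bring over only rows modified after it. The following is a minimal sketch under that assumption. The updated_at field, the dictionary standing in for the warehouse, and the function incremental_load are all hypothetical.

```python
from datetime import datetime


def incremental_load(source_rows, warehouse, last_watermark):
    """Upsert rows changed since last_watermark; return (rows_loaded, new_watermark)."""
    changed = [row for row in source_rows if row["updated_at"] > last_watermark]
    for row in changed:
        # Keyed by id, so a changed row overwrites its older version.
        warehouse[row["id"]] = dict(row)
    new_watermark = max(
        (row["updated_at"] for row in changed), default=last_watermark
    )
    return len(changed), new_watermark


source = [
    {"id": 1, "status": "new", "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "status": "new", "updated_at": datetime(2024, 1, 1, 10, 0)},
]
warehouse = {}

# First run: everything is newer than the initial watermark.
loaded, watermark = incremental_load(source, warehouse, datetime.min)

# One row changes at the source; the second run moves only that row.
source[0] = {"id": 1, "status": "shipped", "updated_at": datetime(2024, 1, 2, 8, 0)}
loaded_again, watermark = incremental_load(source, warehouse, watermark)
print(loaded, loaded_again)
```

One caveat with this design: a strict "newer than" comparison can miss a row that arrives late with the same timestamp as the stored watermark, so real pipelines often re-read a small overlap window and rely on the upsert to keep the result consistent.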

Data Stack Technologies

Your data analytics stack is based on the framework of four main components: the sources of your data, the ETL process, the data warehouse, and the BI tools for visualizing data. The goal is equally straightforward: build a stack that provides end users a 360-degree view of their data. Visibility and control over data processing is key. Whether you have transparency and clarity into a vendor’s schema and any schema changes, or cannot view the schema at all (a white box versus black box scenario), is a serious business consideration. One of the main objectives is data trust, and trust is, at its core, facilitated by visibility.

When building a data stack, collaborative involvement with non-technical parties who have a vested interest in clean data ensures that key business questions are addressed. Maintaining vigilance over the technical aspects (designing for flexibility and growth, deploying redundant checks along the data pipeline, being agile and responsive if data fails to load or only partially loads, and so on) further advances a strong foundation atop which data democratization can flourish.

Trusted Vendors

Trust in data vendors is also an essential, and obvious, component to factor in. Whether your business is financial services or forecasting the weather, vendors must be reliable and their data accurate. Choose vendors that can handle known pitfalls or potential glitches, for example by anticipating a schema change in the database (such as an added, renamed, or removed column) and addressing it so seamlessly that it goes unnoticed. They need to be able to handle failure easily, be responsive, and allow you to schedule data jobs at your convenience.
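To give a sense of what handling a schema change involves, the sketch below compares the columns a pipeline expects with the columns a source currently exposes and reports what was added, removed, or changed in type. The function diff_schema and both example schemas are hypothetical; they do not reflect how any specific vendor does this.

```python
def diff_schema(expected, actual):
    """Compare two {column_name: type_name} mappings and report the differences."""
    expected_cols, actual_cols = set(expected), set(actual)
    return {
        "added": sorted(actual_cols - expected_cols),
        "removed": sorted(expected_cols - actual_cols),
        "retyped": sorted(
            column
            for column in expected_cols & actual_cols
            if expected[column] != actual[column]
        ),
    }


# Hypothetical schemas: what the pipeline was built for vs. what the source has now.
expected_schema = {"id": "integer", "email": "text", "signup_date": "date"}
actual_schema = {"id": "bigint", "email": "text", "referrer": "text"}

changes = diff_schema(expected_schema, actual_schema)
if any(changes.values()):
    print("Schema change detected:", changes)
```

A vendor that handles this well would detect such a difference before loading and adapt the destination tables accordingly, rather than letting the job fail or silently drop data.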

Another significant factor is understanding the people within the organization and company culture as a whole. Does it reflect a mindset that doesn’t fully understand or appreciate that there are solutions available that aren’t solely internal or proprietary? This may extend to levels of resistance from engineers leery of outside vendors, thinking the vendors may introduce a black box that’s inaccessible.

While the engineers may subscribe to an “if you give engineers a problem, they will want to build it” approach, those incentives aren’t necessarily aligned when contrasted with timetables that are geared more toward company growth, speed, and agility.

Trust can also be destroyed in a white box system: you might input something incorrectly, the data might be far less accurate than previously thought, or quality checks might be lacking. But with a black box system it becomes worse because you lose visibility into exactly why a number is the way it is. There’s no way to trace the history or why you’re seeing it as such.

Irrespective of vendors or the technology used, the overarching goal is still ensuring data quality and integrity. As mentioned above when referencing the ETL process, running data quality checks throughout the pipeline is always a good policy. Some vendors may simplify this task, but it’s up to the implementation team to execute, deploying as many checks as possible along the pipeline.

II. Usability

The road map to data democratization includes moving past the traditional methods of siloing data in favor of smart cloud data warehousing. The end users benefit from increased access and speed to historical data, and both analysts and non-technical users alike integrate, query, and visualize data more quickly with preferred BI tools at their disposal.

It’s important to note, however, that while barrier-free access can provide a wealth of insight, it may also allow for an overwhelming influx of unmanageable data if the comprehension level of the end user has not been properly factored in. Data should be structured according to business questions and needs. Embracing data democratization therefore requires thought leadership from the data owners tasked with governance.

Tailor-Made and the Teachable Moment

If the end user does not understand or cannot apply what they have now been granted access to, usability rapidly descends to zero. The analytics stack must be tailored to suit the needs of the end user, and they must be trained accordingly. The mutual benefit of this symbiotic relationship — being properly trained on a tailor-made data stack — is not to be underestimated: it provides end users with custom-built wheels, and puts each one in the driver’s seat, brimming with confidence.

Ideally, the BI tool is one that end users find easy to understand and intuitive, and that yields the best experience. The high usability of a BI tool will itself drive data democratization and mass adoption of the data, which then changes company culture. This is a big piece of the modern data stack, because at this point most people, often with very little training, are free to ask questions directly of the data, with no waiting on IT and no degree in computer science required.

III. Speed to Insight

Historically, data has been owned and managed by those in the IT department. Business decisions requiring data were made at the pace of access, which was often delayed or “bottlenecked.” While data ownership may remain unchanged, data accessibility should be universal throughout the organization. Failure to adopt such a policy is becoming as antiquated as it is inefficient. It’s not a pragmatic approach, and will be regarded as dinosaur-like soon enough. And, absent an event like that which removed the actual dinosaurs, there is no scenario where the production of data will stop exploding exponentially, or where the need to derive value from the ongoing data explosion will somehow decrease.

Velocity Toward Accessibility

Having agreed that data should be accessible to all, and that a data stack built to collect and prepare clean, faithful data promotes data trust, we must also allow for certain aspects of time with respect to the data. The speed at which the data becomes available and how often it can be refreshed are two examples of the time component. In this regard, the speed to insight depends on the velocity toward accessibility.

But there are other significant factors as well, such as the additional datasets that are often desired once the basic data has been exposed to the end user. After this initial phase, it’s highly likely that they will discover another dataset they’ve always wanted to see in a central place, or merged in with some other data. Speed here means the time from “this solution doesn’t exist” to “when’s the first time I can query the data?” It is not necessarily query time itself or performance of the system, but how fast the end user can get new datasets into their hands.

And, as was stated above, accessibility to and usage of the data is commensurate with the comfort level of the end user touching that data. There is training on how to use the BI tools and there is training on how to comprehend the data itself, particularly given the strides of innovation in furnishing access to a much larger amount of raw detail. Once people develop a greater understanding of just how much power is at their fingertips, a lot more becomes possible, reinforcing data trust and driving the adoption of data democratization along the way.

Conclusion

Data trust is a necessary component when bridging data intermediaries in the quest toward full data democratization. And for those tasked with ensuring data quality, the more they stay agile and responsive, the more data trust itself is perpetually reaffirmed.