Data Platforms Updated Aug 29 2022

What is Data Discovery: Definitions & Overview

AUTHOR | Michael Segner

Modern data teams manage hundreds or even thousands of tables, views, and dashboards spread across their platforms. Finding the right data asset at the right time has become one of the biggest productivity challenges they face. Data discovery addresses this challenge by enabling teams to locate and understand relevant data sets across their entire data platform. It makes data engineering and analytical engineering tasks more efficient and can enable self-service access for other types of data consumers.

Just as knowledge workers search through shared drives to find the right document or presentation, data professionals face similar challenges with data assets like tables, models, and dashboards. The stakes are even higher since using the wrong data can lead to flawed analysis, wasted resources, and poor business decisions.

Without proper context, teams struggle with common problems. They might select outdated tables, spend hours searching for the right data set, or have no idea who to contact for help. These issues compound as data volumes grow and teams scale.

This article explores how data discovery solves these problems. We’ll cover the benefits, examine the challenges, and look at emerging trends like automated data discovery tools. Whether you’re building discovery capabilities from scratch or improving existing processes, you’ll learn practical approaches to help your team find and use data more effectively.

What is data discovery?

Data discovery is the process of identifying, locating, and understanding data assets, though its specific meaning varies across domains. For privacy, security, and compliance teams, for example, data discovery is the ability to scan collaboration data to identify, classify, and protect sensitive data, particularly to prevent the misuse of personally identifiable information (PII).

Others use the term more loosely to describe the process, mainly done by analytics engineers or data analysts, of preparing, visualizing, and reporting on large amounts of data to draw insights and correlations. Data discovery is also frequently confused with data mining, the process of scraping and extracting open-source data, such as public websites, at large scale. A common data mining use case is scraping a site like Amazon to provide ecommerce retailers with data and insights on online item stock and pricing levels.

Benefits of data discovery

Data discovery delivers value across multiple dimensions of your data organization, from operational efficiency to strategic enablement. Let’s explore the key benefits that make data discovery an essential capability for modern data teams.

Ensures teams use the right data

One of the main benefits of data discovery is that it allows data engineers and others involved in the ETL process to be sure they are leveraging the most up-to-date (and often, correct) data set for a given use case. Without proper discovery tools, teams risk building on outdated or deprecated tables, leading to inaccurate analytics and wasted effort. Data discovery eliminates the guesswork by clearly identifying which data assets are current, validated, and appropriate for specific purposes.

Manages complexity at scale

Once your organization’s operations start to exceed 50 tables or so, even the most experienced data engineers start to lose their mental map of the entire platform. That’s because it’s one thing to remember 50 discrete items; it’s another to understand the purpose of each and the connections between them all. Data discovery tools provide the external brain your team needs, maintaining detailed metadata about relationships, dependencies, and usage patterns that would be impossible to track manually.

Accelerates team onboarding

Without strong data discovery capabilities, it takes longer for new members of the data team to onramp and become familiar with the environment. Instead of spending weeks in knowledge transfer sessions or diving through documentation, new team members can use discovery tools to quickly understand the data architecture, find relevant assets, and see how different components connect. This self-guided exploration dramatically reduces time-to-productivity for new hires.

Enables data democratization

Data discovery plays a key role outside of the data team. One of the most strategic initiatives of any organization is to become more data-driven or to accelerate data democratization. For these initiatives to succeed, reliable data self-service is a must, and data discovery capabilities (along with documentation) are foundational to that effort. When business users can find and understand data independently, they make better decisions faster without creating bottlenecks at the data team.

Unlocks data team capacity

When data teams unlock data self-service access through effective discovery tools, they also unlock additional capacity within their team. Now data engineers spend less time acting as a personal guide to the data platform and more time adding value to the business. Instead of fielding repetitive questions about where to find data or what a particular table contains, engineers can focus on building new capabilities, optimizing performance, and solving complex data challenges.

Data discovery challenges

While the benefits of data discovery are clear, implementing it successfully requires overcoming several persistent obstacles. Recognizing these challenges helps teams set realistic expectations and develop strategies to address them proactively.

Documentation remains a persistent struggle

We have decades of experience to show that no one, whether they are a software engineer or a data engineer, likes writing documentation. Even in the best-case scenarios (which are rare), with strong enforcement and policing mechanisms in place, there are gaps. Documentation tasks typically fall outside the normal workflow, and there is simply too much work to be done. This is why automated data discovery tools have become so valuable. They capture metadata and relationships without relying on manual documentation efforts.
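To make the idea of automated metadata capture concrete, here is a minimal sketch of scraping table and column metadata directly from a database's own system catalog rather than relying on hand-written docs. SQLite stands in for a real warehouse here purely so the example is self-contained; production discovery tools query the equivalent of `information_schema` in the same spirit, and the table names are made up for illustration.

```python
import sqlite3

def build_catalog(conn: sqlite3.Connection) -> dict:
    """Return {table_name: [(column, declared_type), ...]} scraped
    from the database's system catalog."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = [(c[1], c[2]) for c in cols]
    return catalog

# Illustrative schema only; a real tool would connect to your warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, created_at TEXT)")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
print(build_catalog(conn))
```

Because the metadata comes from the catalog itself, it stays current as schemas change, which is exactly the gap manual documentation leaves open.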

The producer-consumer divide

Data is constantly in motion and there is typically a gap between those who produce it and those who consume it. This means that the producer is typically not thinking about how to make data assets or products as usable or as discoverable as possible. They’re focused on operational concerns, not analytical ones. Ironically, data discovery tools and mechanisms like data contracts can be extremely helpful to bridge this divide, creating visibility and accountability across teams.

Data literacy limitations

Data discovery depends on making data sets easily accessible and usable, but it also depends on consumers who know how to surface the data and use it effectively once they have. Not everyone in the organization has the skills to write SQL queries or interpret data relationships. Analytics engineers occupy a relatively new role on the data team, specifically designed to cover the last mile between data in its raw state and in a more accessible, usable form.

Ad-hoc requests persist

You can have great data discovery in place, but keep in mind there will always be that executive who needs a new report on a unique set of metrics yesterday. These ad-hoc requests can be dramatically reduced with good data discovery, but rarely eliminated entirely. The key is building discovery tools that help fulfill these requests faster, rather than expecting them to disappear.

Organizational alignment complexity

Strong data discovery means having clear lines of ownership across the organization for each data asset. It also means creating shared definitions of what certain metrics mean and how they are used. Without this alignment, different teams may use the same data differently or define key metrics inconsistently. This is where approaches like data mesh and decentralizing the data team can really help to reduce these gray areas of overlap and make the process more agile.

Data discovery 101 and building understanding

Now that we understand the benefits and challenges of data discovery, let’s dive into some of the best practices. 

While “providing context to understand data sets” is easy to say, it’s much more involved in reality. To do so, you need to answer the “who, what, when, where, why” of data. 

  • Who: Who owns and is accountable for this table? Who is an expert on this table and the processes it’s involved in? Who is impacted if this table has an issue? 
  • What: How is this data organized? What is its schema? What requirements or service level agreements is it being held to? Is the data quality high?
  • When: How frequently is this data updated or accessed? When does it need to be delivered? When were the last changes or modifications made, and by whom? 
  • Where: Answering this question isn’t as easy as it seems. Data isn’t static and stored in a single table or system; it flows like water. It also has a lot of dependencies: if there is a leak upstream, everything downstream is impacted. This is why data lineage, understanding how different data assets are connected from source system to final consumption, is so critical. 
  • Why: What business process does this data contribute to? How does it add value? What is the business logic behind it?
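The who/what/when/where/why questions above can be sketched as a single metadata record per asset. The field names below are illustrative, not any particular tool's schema, and the example asset is made up.

```python
from dataclasses import dataclass, field

@dataclass
class DataAsset:
    """Hypothetical metadata record a discovery tool might maintain."""
    name: str
    owner: str                       # who: accountable party
    schema: dict                     # what: column -> type
    sla_freshness_hours: int         # what: service level agreement
    update_frequency: str            # when: e.g. "hourly", "daily"
    last_updated_by: str             # when: most recent change, and by whom
    upstream: list = field(default_factory=list)    # where: lineage parents
    downstream: list = field(default_factory=list)  # where: lineage children
    business_purpose: str = ""       # why: process this data supports

orders = DataAsset(
    name="analytics.fct_orders",
    owner="data-eng@company.example",
    schema={"order_id": "int", "amount": "float"},
    sla_freshness_hours=6,
    update_frequency="hourly",
    last_updated_by="dbt run",
    upstream=["raw.orders"],
    downstream=["bi.revenue_dashboard"],
    business_purpose="Revenue reporting",
)
print(orders.name, orders.owner)
```

The `upstream` and `downstream` fields are the "where" answer: chain them across assets and you have the lineage graph from source system to final consumption.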

Of course the context and relevance changes based on who is consuming the data and what they are trying to do with it. A data engineer is likely going to be more focused on data lineage and data reliability, but an analytics engineer may be more concerned with the schema and the frequency of read/writes. 

The best data discovery tools will answer these questions within a clean interface that prioritizes user experience according to the data consumer onion. At the core are data engineers, the practitioners manipulating data in its rawest form every day.

Then, that data is consumed by analytics engineers who clean it up to make it more usable for the next layer of consumers: data scientists and data analysts. Next come data power users and stakeholders, such as software engineers and product managers, followed by your everyday business consumer. 

Core users for data discovery
When developing your data discovery capabilities it’s important to build for your core users and their use cases.

Data discovery and the data mesh

The concept of data discovery aligns with the distributed, domain-oriented architecture of the data mesh paradigm of data management proposed by Zhamak Dehghani and Thoughtworks. 

Data discovery works well when different data owners embrace the data-as-a-product mindset, taking accountability for their data assets and facilitating communication about distributed data across different locations. Once data has been served to and transformed by a given domain, the domain data owners can leverage the data for their operational or analytic needs. 

Data discovery can provide domain-specific, dynamic understanding of your data based on how it’s being ingested, stored, aggregated, and used by a set of specific consumers. Governance standards and tooling are federated across these domains (allowing for greater accessibility and interoperability).

Data discovery tools

Data discovery tools come in four basic flavors: discovery only, data observability, data catalog, and homegrown.

There are some solutions that focus exclusively on data discovery with an emphasis on the democratization of data. These are similar to the homegrown solutions at mega-tech companies like Meta, Airbnb, and Uber.

More often, though, you will see data discovery capabilities as part of a larger solution with automated data lineage, such as a data observability platform or a data catalog. Each has its own flavor, and we’ve weighed in on how data teams should prioritize a data observability vs data catalog decision.

The difference between the two is the emphasis and the additional capabilities that come with the rest of the platform. For example, Monte Carlo builds our data lineage and data discovery capabilities with data engineers in mind, to prevent data incidents and accelerate their resolution, whereas a catalog’s discovery capabilities are designed more for data stewards to support their cataloging and governance activities.

A data observability platform also features real-time data monitoring and alerting capabilities to ensure your data is trustworthy and reliable. As Datanami points out, data catalogs provide a lot of value from their ability, “to provide a bridge between how business talks about data and how that data is technically stored. Nearly all data catalog tools in the market–and there are close to 100 of them now–can do that.”
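To illustrate the kind of check a data observability platform automates, here is a minimal freshness-monitoring sketch: flag any asset that has not been updated within its SLA. The table names, timestamps, and the 6-hour SLA are all made up for the example; real platforms derive these from warehouse metadata and learned patterns rather than hardcoded values.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated: datetime, sla: timedelta, now: datetime) -> bool:
    """True if the asset has breached its freshness SLA."""
    return now - last_updated > sla

# Hypothetical last-update timestamps per asset.
now = datetime(2022, 8, 29, 12, 0, tzinfo=timezone.utc)
assets = {
    "analytics.fct_orders": datetime(2022, 8, 29, 11, 0, tzinfo=timezone.utc),
    "analytics.dim_customers": datetime(2022, 8, 28, 9, 0, tzinfo=timezone.utc),
}
sla = timedelta(hours=6)

# Anything breaching its SLA would trigger an alert to the asset's owner.
alerts = [name for name, ts in assets.items() if is_stale(ts, sla, now)]
print(alerts)
```

Here only the customers table breaches the SLA (27 hours since its last update), so it alone would fire an alert.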

We’ve previously (and perhaps controversially) written that data catalogs are dead, and that’s partly due to the emergence of data discovery and data observability.

The future of data discovery

Through a combination of smart data discovery tools and automation, it’s easier than ever to ensure your data team has the context they need to find and effectively use data assets across your ecosystem. 

And that, in a nutshell, is the answer to the question: “what is data discovery?”

Keep in mind that creating new categories and types of tooling is an iterative process; we have no doubt that what data discovery looks like will continue to change as data volumes grow, the pursuit of real-time analytics continues, and artificial intelligence improves. Balancing speed and reliability remains vital, as does the need for proper governance and compliance with regulatory measures.

Did we miss something? Feel free to leave your thoughts in the comments below.

Interested in learning about how data observability can help your team make the most of their data discovery investments? Schedule time with us by filling out the form below.

Our promise: we will show you the product.