How to Reduce Your Data + AI Downtime
The large language model is officially a commodity. In just two short years, API-based LLMs have gone from incomprehensible to smartphone-accessible. The pace of AI innovation is slowing. Real-world use cases are coming into focus. Going forward, the value of your genAI applications will exist solely in the fitness and reliability of your own first-party data.
In 2008, a minute of Amazon.com downtime would have cost the company $31,000. By 2021, that same minute of downtime would cost approximately $9 million.
The reason is simple: as Amazon’s online retail operations grew—and more orders were placed—downtime became more costly. This story isn’t unique to Amazon, of course. SaaS products, traditional data products—and now AI applications—have become mission-critical to virtually every industry. And when those tools don’t work, the consequences can be quite costly.
In the data and AI industry, we call this data downtime: the amount of time a data + AI product is unusable because the data powering it is missing, inaccurate, or otherwise erroneous. And if you thought solving for software downtime was costly and resource-intensive, try root-causing an issue generated by an AI model.
Just like SaaS wasn’t able to thrive until cloud-based performance management solutions solved its reliability issues, we won’t see production-ready AI truly thrive until the problem of data reliability is solved as well.
At its heart, AI is a data product. To scalably solve the problem of data + AI downtime, we need to look toward best practices for traditional data products: disciplines like data reliability engineering and AI data engineering, technologies like data testing and data observability, and strategies like data SLAs, data contracts, DataOps, and governance and metadata management.
In this post, I’ll walk through a four-step process you can use to reduce downtime, improve your AI reliability, and protect the value of your data + AI applications.
Table of Contents
- Step 1: Measure the state of your data quality
- Step 2: Identify priorities & set SLAs
- Step 3: Track improvements and optimize processes
- Step 4: Proactively communicate & certify your data
- Data adoption is at the crossroads of data access and reliability
Step 1: Measure the state of your data quality
It’s either ironic or tragic—I’m not sure which—that most data teams don’t have good metrics in place to measure the health and reliability of their data + AI products.
As a result, many teams are judged by business stakeholders on a qualitative basis, which is often overly weighted on the amount of time that has passed since they experienced a data incident. To make matters worse, data teams routinely underestimate the severity of the problem, which leads them to underinvest in the tooling and processes required to solve it.
So, the first step to any data + AI reliability strategy is to take an accurate accounting of your data quality issues. Here are some example metrics you might want to consider:
- Data downtime: To calculate this metric, take the total number of data incidents and multiply it by your average time to detection (TTD) plus your average time to resolution (TTR). (Without end-to-end data quality coverage, that figure will only include the incidents you actually caught. If that's the case, you can estimate based on an average of ~67 data incidents per year for every 1,000 tables in your environment.) A recent survey from Wakefield suggests it takes upwards of 8 hours to find and resolve a data quality incident in production. See the sketch after this list for a worked example of the calculation.
- Total engineering resources spent on data quality issues: Survey your engineering team to understand what percentage of their time is spent finding, triaging, and resolving data quality issues. Most industry surveys, including our own, peg this consistently between 30 and 50%. From there, it's a simple exercise to convert those hours into salary and understand your labor cost. You can also review your OKRs and KPIs to see how many are related to improving data quality, or are a consequence of poor data quality.
- Data + AI trust: Survey your stakeholders to see how much they trust the data + AI products your team is responsible for curating. You can get a rough proxy for this by counting how many times data issues are caught by people outside your team. (Our recent State of Reliable AI survey found that over 68% of data leaders aren't completely confident in the quality of the data that powers their AI applications.)
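To make the data downtime metric concrete, here is a minimal Python sketch of the calculation described above. The table count, TTD, and TTR inputs are hypothetical placeholders; swap in the figures from your own environment.

```python
# Minimal sketch: estimating annual data downtime.
# Inputs are hypothetical placeholders; use your own environment's figures.

def estimated_incidents(num_tables: int, incidents_per_1000_tables: float = 67) -> float:
    """Estimate yearly incident count when you lack end-to-end coverage."""
    return num_tables / 1000 * incidents_per_1000_tables

def data_downtime_hours(incidents: float, avg_ttd_hours: float, avg_ttr_hours: float) -> float:
    """Data downtime = number of incidents x (average TTD + average TTR)."""
    return incidents * (avg_ttd_hours + avg_ttr_hours)

if __name__ == "__main__":
    incidents = estimated_incidents(num_tables=2500)  # ~168 incidents/year at ~67 per 1,000 tables
    downtime = data_downtime_hours(incidents, avg_ttd_hours=4, avg_ttr_hours=4)  # ~8 hours per incident
    print(f"Estimated incidents per year: {incidents:.0f}")
    print(f"Estimated data downtime: {downtime:.0f} hours per year")
```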
Oftentimes, your data + AI products will have data quality “hot spots”: pernicious little areas where quality issues seem to reoccur more frequently. By measuring these failure points, you’ll be able to more easily identify areas for improvement—whether that’s rethinking how your data is managed or aligning on who’s responsible for its overall quality.
By measuring your data and AI health based on quantitative benchmarks, you can set realistic goals to reduce your downtime, improve trust, and activate preventative measures to mitigate future financial and reputational risks.
Step 2: Identify priorities & set SLAs
For this step, you’ll need to talk to your business users to understand how they use data and AI in their daily workflows. Once you understand those users and their needs more clearly, you can translate that information into data SLAs to track performance and identify tooling or process gaps. Some examples of SLAs include the following (a minimal sketch of a few of them expressed as automated checks follows the list):
- Freshness: Data will be refreshed by 7:00 am daily (great for cases where the CEO or other key executives are checking their dashboards at 7:30 am); data will never be older than X hours.
- Distribution: Column X will never be null; column Y will always be unique; field X will always be equal to or greater than field Y.
- Volume: Table X will never decrease in size.
- Schema: No fields will be deleted on this table.
- Overall downtime: We will reduce incidents X%, time to detection X%, and time to resolution X%.
- Ingestion (great for keeping external partners accountable): Data will be received by 5am each morning from partner Y.
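Here is that sketch: a hedged illustration of how a few of these SLAs might be expressed as automated checks. It runs against an in-memory SQLite table with hypothetical names (orders, order_id, updated_at); in practice, rules like these would typically run against your warehouse through a testing or observability tool.

```python
# Minimal sketch: a few of the SLAs above expressed as automated checks.
# Table and column names ("orders", "order_id", "updated_at") are hypothetical.
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, updated_at TEXT)")
now = datetime.now(timezone.utc).isoformat()
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, now), (2, now)])

def check_freshness(max_age_hours: float = 24) -> bool:
    """Freshness SLA: data will never be older than X hours."""
    (latest,) = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()
    age = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
    return age <= timedelta(hours=max_age_hours)

def check_not_null() -> bool:
    """Distribution SLA: order_id will never be null."""
    (nulls,) = conn.execute("SELECT COUNT(*) FROM orders WHERE order_id IS NULL").fetchone()
    return nulls == 0

def check_volume(previous_row_count: int) -> bool:
    """Volume SLA: the table will never decrease in size."""
    (rows,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
    return rows >= previous_row_count

print("freshness ok:", check_freshness())
print("not-null ok:", check_not_null())
print("volume ok:", check_volume(previous_row_count=2))
```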
You can’t always control how a model is developed or trained. But by managing the data feeding your RAG pipelines or the training data within your own small language models, you can effectively maintain and improve the reliability of your outputs for a given use case.
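As one illustration of managing the data feeding a RAG pipeline, here is a minimal sketch of a pre-ingestion quality gate that only lets documents passing basic freshness, completeness, and attribution checks reach the index. The document shape, thresholds, and index_document callable are all hypothetical assumptions, not a prescribed interface.

```python
# Minimal sketch: a quality gate in front of a hypothetical RAG ingestion step.
# Thresholds and the document structure below are illustrative placeholders.
from datetime import datetime, timedelta, timezone

MAX_DOC_AGE = timedelta(days=180)  # hypothetical freshness threshold
MIN_DOC_CHARS = 200                # hypothetical completeness threshold

def passes_quality_gate(doc: dict) -> bool:
    """Reject stale, near-empty, or unattributed documents before indexing."""
    fresh = datetime.now(timezone.utc) - doc["updated_at"] <= MAX_DOC_AGE
    complete = len(doc.get("text", "")) >= MIN_DOC_CHARS
    attributed = bool(doc.get("source"))
    return fresh and complete and attributed

def ingest(docs: list[dict], index_document) -> int:
    """Index only documents that pass the gate; return how many were skipped."""
    skipped = 0
    for doc in docs:
        if passes_quality_gate(doc):
            index_document(doc)  # your embedding/indexing call would go here
        else:
            skipped += 1
    return skipped

docs = [{"text": "example " * 50, "updated_at": datetime.now(timezone.utc), "source": "wiki"}]
print("documents skipped:", ingest(docs, index_document=lambda d: None))
```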
Step 3: Track improvements and optimize processes
Now that you can reliably measure your data + AI health, you can proactively and efficiently invest resources where they’re needed. For example, you may have six warehouses running smoothly, and one warehouse for a given domain that’s experiencing repeated issues. Zeroing in on problems like these gives you the ability to improve trust and reliability in a targeted and meaningful way.
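One lightweight way to find those repeat offenders is to group your incident history by domain or warehouse. Here is a minimal sketch, assuming you can export incidents as simple records; the records below are hypothetical examples.

```python
# Minimal sketch: surfacing data quality "hot spots" from an incident log.
# The incident records are hypothetical; in practice they might be exported
# from your observability tool, ticketing system, or on-call notes.
from collections import Counter

incidents = [
    {"domain": "finance",   "table": "revenue_daily",  "downtime_hours": 6},
    {"domain": "finance",   "table": "revenue_daily",  "downtime_hours": 9},
    {"domain": "marketing", "table": "campaign_spend", "downtime_hours": 2},
]

downtime_by_domain = Counter()
for incident in incidents:
    downtime_by_domain[incident["domain"]] += incident["downtime_hours"]

# Rank domains by total downtime to decide where to invest first.
for domain, hours in downtime_by_domain.most_common():
    print(f"{domain}: {hours} hours of downtime")
```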
This is also a great time to strengthen your team’s data quality culture and processes: improving your triage workflow, augmenting your data with additional metadata, and shortening your incident response and resolution times.
Another example is leveraging an improved understanding of data lineage and how your data + AI assets are connected to prioritize efforts around your most important tables and products, or conversely, to deprecate old, unused tables without worrying about unexpected downstream consequences.
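As a sketch of that lineage idea, here is how you might walk a simple table-level lineage graph to confirm an asset has no downstream consumers before deprecating it. The adjacency map is a hypothetical stand-in; real lineage would typically come from your observability or catalog tooling.

```python
# Minimal sketch: using table-level lineage to check for downstream consumers
# before deprecating a table. The lineage map below is a hypothetical example.
from collections import deque

# upstream table -> tables that read from it
lineage = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["orders_daily", "orders_by_region"],
    "legacy_orders_snapshot": [],  # candidate for deprecation
}

def downstream_consumers(table: str) -> set[str]:
    """Breadth-first walk of the lineage graph starting from one table."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream_consumers("raw_orders"))              # downstream tables still depend on it
print(downstream_consumers("legacy_orders_snapshot"))  # empty set: safer to consider deprecating
```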
Step 4: Proactively communicate & certify your data
If your downtime is reduced in the forest and no one is around to hear…did it really happen?
In an ideal world, if you’ve made it to this step in your journey, you’ve dramatically improved your data and AI reliability and you’re well on your way to driving more value and building trust with your stakeholders. But in order to fully realize the benefits of your hard-fought uptime, you also need a strategy for communicating your progress across the business.
That means proactively answering this question for your business users: “How do I know I can trust this data?”
Data certification is the process by which data assets are approved for business users across the organization after achieving mutually agreed upon SLAs (often tiered gold/silver/bronze) for data quality, observability, ownership/accountability, issue resolution, and communication.
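To make that tiering concrete, here is a small sketch of how certification criteria might be represented and applied. The tier names follow the gold/silver/bronze convention mentioned above, but the specific fields and thresholds are illustrative assumptions, not a standard.

```python
# Minimal sketch: representing tiered data certification criteria as configuration.
# Thresholds and required attributes below are hypothetical examples.
from dataclasses import dataclass

@dataclass
class CertificationTier:
    name: str
    max_downtime_hours_per_month: float
    requires_owner: bool
    requires_freshness_sla: bool

TIERS = [  # ordered from most to least demanding
    CertificationTier("gold", 1, requires_owner=True, requires_freshness_sla=True),
    CertificationTier("silver", 8, requires_owner=True, requires_freshness_sla=True),
    CertificationTier("bronze", 24, requires_owner=True, requires_freshness_sla=False),
]

def qualifying_tier(downtime_hours: float, has_owner: bool, has_freshness_sla: bool) -> str:
    """Return the highest tier an asset currently qualifies for, else 'uncertified'."""
    for tier in TIERS:
        if (downtime_hours <= tier.max_downtime_hours_per_month
                and (has_owner or not tier.requires_owner)
                and (has_freshness_sla or not tier.requires_freshness_sla)):
            return tier.name
    return "uncertified"

print(qualifying_tier(downtime_hours=5, has_owner=True, has_freshness_sla=True))  # silver
```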
In Monte Carlo, we offer features like the Data Product Dashboard and automated profiling to expedite this process for governance managers and domain users alike. Using native tools like these, stakeholders can understand at a glance how their data quality practices are performing against stated goals, and how AI-ready the data they own is for future use cases.
By labeling assets and communicating data reliability proactively, you can cultivate trust with stakeholders, optimize efficiency, and minimize the risk of low-quality or hallucinatory outputs from your data + AI applications.
Data adoption is at the crossroads of data access and reliability
An AI revolution is on the horizon. But to unleash its true potential, data leaders need to prioritize data reliability first and frequently.
If production-ready AI is on your horizon, it’s time to invest in improving your data reliability with thoughtful metrics, clear goals, efficient processes, and proactive communication.
Until then—here’s wishing you no data + AI downtime.