
How Generative AI is Transforming Data Engineering

By Lior Gavish


Generative AI is taking the world by storm – here’s what it means for data engineering and why data observability is critical for this groundbreaking technology to succeed.

Maybe you’ve noticed the world has dumped the internet, mobile, social, cloud and even crypto in favor of an obsession with generative AI.

But is there more to generative AI than a fancy demo on Twitter? And how will it impact data? 

Let’s assess.

How generative AI will disrupt data

With the advent of generative AI, large language models became much more useful to the vast majority of humans. 

Need a drawing of a dinosaur riding a unicycle for your three-year-old’s birthday party? Done. How about a draft of an email to employees about your company’s new work from home policy? Easy as pie. 

It’s inevitable that generative AI will disrupt data, too. After speaking with hundreds of data leaders across companies from Fortune 500s to startups, we came up with a few predictions:

Access to data will become much easier – and more ubiquitous.

Chat-like interfaces will allow users to ask questions about data in natural language. People who are not proficient in SQL and business intelligence will no longer need to ask an analyst or analytics engineer to build a dashboard for them. Simultaneously, those who are proficient will be able to answer their own questions and build data products faster and more efficiently. 

This will not displace SQL and business intelligence (or data professionals, for that matter), but it will lower the bar for data access and open it up to more stakeholders across more use cases. As a result, data will become more ubiquitous and more useful to organizations, with the opportunity to drive greater impact.

Simultaneously, data engineers will become more productive.

In the long term, bots may eat us (just kidding – mostly), but for the foreseeable future, generative AI won’t replace data engineers – it will just make their lives easier, and that’s great. Check out what GitHub Copilot does if you need more evidence. While generative AI will relieve data professionals of some of their more ad hoc work, it will also give them AI-assisted tools to more easily build, maintain, and optimize data pipelines. Generative AI models are already great at writing, debugging, and optimizing SQL and Python code, and they will only get better.

These enhancements may be baked into current staples of your data stack, or arrive as entirely new solutions engineered by a soon-to-be-launched seed-stage startup. Either way, the outcome will be more data pipelines and more data products to be consumed by end users. 

Still, like any change, these advancements won’t be without their hurdles. Greater data access and greater productivity increase both the criticality and the complexity of data, making it harder to govern and trust. I don’t predict that bots shaped like Looker dashboards and Tableau reports will run amok, but I do foresee a world in which pipelines turn into figurative Frankenstein’s monsters and business users rely on data with little insight into where it came from or guidance on how to use it. In this brave new world, data governance and reliability will become much more important. 

Software engineering teams have long been practicing DevOps and automating their tooling to improve developer workflows, increase productivity, and build more useful products – all while ensuring the reliability of complex systems. Similarly, we are going to have to step up our game in the data space and become more operationally disciplined than ever before. Data observability will play a similar role for data teams to manage the reliability of data – and data products – at scale, and will become more critical and powerful.

How generative AI is changing data engineering workflows

Behind the gleaming facades of Fortune 500 companies, a technological shift is quietly reshaping one of the most critical yet invisible functions of modern business: data engineering. In server rooms and cloud environments from Silicon Valley to Wall Street, artificial intelligence is fundamentally transforming how companies build and maintain the digital pipelines that power business intelligence.

Generative AI in data engineering workflows is transforming how teams approach the traditionally labor-intensive work of moving, cleaning, and organizing vast quantities of corporate data. Tasks that once required specialized engineers working for days or weeks, such as writing complex SQL queries, designing data pipelines, or inferring database schemas from messy sources, can now be accomplished in minutes with AI assistance. Consider schema inference, where data engineers previously spent hours manually mapping field types and identifying patterns in unfamiliar datasets. AI tools can now automatically analyze data sources and propose optimal database structures, proving particularly valuable for companies integrating acquisitions or legacy infrastructure.
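To make schema inference concrete, here is a minimal, hand-written sketch of the core idea – scan sample records and propose a column type for each field. This is an illustration of the technique, not any vendor’s implementation; real AI-assisted tools also detect keys, patterns, and cross-table relationships.

```python
from datetime import datetime

def infer_schema(records):
    """Propose a column type for each field by scanning sample records."""
    observed = {}
    for row in records:
        for field, value in row.items():
            observed.setdefault(field, set()).add(classify(value))
    # Collapse each field's observed types into a single proposal.
    return {f: resolve(types) for f, types in observed.items()}

def classify(value):
    if value is None or value == "":
        return "NULL"
    try:
        int(value)
        return "INTEGER"
    except (TypeError, ValueError):
        pass
    try:
        float(value)
        return "FLOAT"
    except (TypeError, ValueError):
        pass
    try:
        datetime.fromisoformat(str(value))
        return "TIMESTAMP"
    except ValueError:
        return "TEXT"

def resolve(types):
    types = types - {"NULL"}          # nulls don't change the base type
    if not types:
        return "TEXT"                 # all-null column: no evidence either way
    if types == {"INTEGER"}:
        return "INTEGER"
    if types <= {"INTEGER", "FLOAT"}:
        return "FLOAT"
    if types == {"TIMESTAMP"}:
        return "TIMESTAMP"
    return "TEXT"                     # mixed types fall back to TEXT
```

Given a handful of messy rows, `infer_schema` returns a proposal like `{"id": "INTEGER", "amount": "FLOAT", ...}` that an engineer can review rather than write from scratch.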

The transformation is most visible in SQL generation and pipeline creation. Snowflake’s Copilot feature allows analysts to request complex data aggregations in plain English, while the platform generates optimized SQL code behind the scenes. Similarly, dbt Cloud’s AI features can automatically suggest data transformation logic and identify potential quality issues before they corrupt downstream analytics. Pipeline generation, traditionally requiring deep knowledge of specific frameworks and careful consideration of error handling, can now be scaffolded by AI assistants based on high-level requirements, dramatically reducing barriers to data automation.

However, this efficiency comes with new challenges. As AI handles more routine tasks, data engineers find themselves evolving into AI supervisors, focused on validating outputs and handling edge cases that automation misses. The role is shifting from hands-on coding toward strategic oversight, a transition that mirrors broader changes across the technology industry as artificial intelligence reshapes traditional technical work.

Real-world use cases of generative AI in data engineering

While the theoretical benefits of AI-powered data engineering capture headlines, the practical applications already deployed across corporate America reveal the technology’s immediate impact. From financial services firms processing millions of transactions daily to retail giants managing global supply chains, companies are discovering that generative AI excels at solving specific, concrete problems that have long plagued data teams.

Auto-generating ETL code

The most immediate transformation is occurring in Extract, Transform, Load (ETL) development, where AI tools can generate functional code from business requirements. At major financial institutions, data teams now describe desired transformations in natural language and receive production-ready Python or SQL code within minutes. For instance, a request to “aggregate daily credit card transactions by merchant category and flag anomalies exceeding three standard deviations” can produce complete ETL pipelines, including error handling and data validation steps. This capability has reduced development cycles from weeks to days, allowing companies to respond more rapidly to changing business requirements and regulatory demands.
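For illustration, a pipeline like the one that prompt describes might look like the following Python sketch. This is our own reconstruction of what such AI-generated code could resemble – the record fields (`timestamp`, `merchant_category`, `amount`) are assumptions, and production versions would add error handling and validation.

```python
from collections import defaultdict
from statistics import mean, stdev

def flag_anomalies(transactions):
    """Aggregate transactions by (day, merchant category) and flag daily
    totals more than three standard deviations from that category's mean."""
    # Sum amounts per day per merchant category.
    daily = defaultdict(float)
    for t in transactions:
        day = t["timestamp"][:10]     # assumes ISO-8601 timestamps
        daily[(day, t["merchant_category"])] += t["amount"]

    # Collect each category's distribution of daily totals.
    by_category = defaultdict(list)
    for (day, category), total in daily.items():
        by_category[category].append(total)

    # Flag totals beyond three standard deviations of the category mean.
    flagged = []
    for (day, category), total in sorted(daily.items()):
        totals = by_category[category]
        if len(totals) > 1:
            mu, sigma = mean(totals), stdev(totals)
            if sigma and abs(total - mu) > 3 * sigma:
                flagged.append({"day": day, "merchant_category": category,
                                "total": total})
    return flagged
```

The point is not that this exact code is what a model emits, but that a one-sentence business requirement maps to a complete aggregate-and-flag pipeline that previously took an engineer hours to write and test.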

Creating documentation from metadata

Perhaps more revolutionary is AI’s ability to automatically generate comprehensive documentation from existing database metadata and code repositories. Companies struggling with legacy infrastructure often discover undocumented data sources created by long-departed employees. AI tools can now analyze table structures, column relationships, and data lineage to produce human-readable documentation that explains data flows and business logic. Major consulting firms report that this capability has transformed data discovery projects, reducing the time required to understand complex data ecosystems from months to weeks, while ensuring that institutional knowledge is preserved and accessible.
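The underlying pattern can be sketched without the LLM: render catalog metadata into readable documentation. In practice, tools pass this kind of metadata (plus lineage and query history) to a model that drafts the prose descriptions, which a human then reviews; the field names here are illustrative, not any catalog’s actual schema.

```python
def document_table(table: dict) -> str:
    """Render catalog metadata for one table as markdown documentation."""
    lines = [
        f"## {table['name']}",
        "",
        table.get("description", "_No description yet._"),
        "",
        "| Column | Type | Notes |",
        "|--------|------|-------|",
    ]
    for col in table["columns"]:
        lines.append(f"| {col['name']} | {col['type']} | {col.get('notes', '')} |")
    return "\n".join(lines)
```

Run over every table in a legacy warehouse, even this trivial renderer yields a browsable data dictionary; the AI contribution is filling in the descriptions and business logic that the metadata alone cannot supply.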

Debugging data pipeline errors with AI prompts

When data pipelines fail, the traditional debugging process often requires deep technical expertise and extensive log analysis. AI assistants can now interpret error messages, analyze pipeline configurations, and suggest specific fixes based on common failure patterns. Production teams at e-commerce companies describe submitting cryptic database error codes to AI tools and receiving targeted recommendations, such as index optimizations or query restructuring suggestions. This capability has proven particularly valuable during high-stakes periods like Black Friday, when rapid error resolution directly impacts revenue and customer experience.
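The first step in that workflow is packaging the failure context into a prompt. A hedged sketch, with an assumed set of context fields – the more relevant context the prompt carries, the more targeted the model’s suggestion can be:

```python
def build_debug_prompt(error_message: str, sql: str, table_stats: dict) -> str:
    """Assemble a structured debugging prompt from a pipeline failure.

    Bundles the error, the failing query, and basic table stats so a
    model has enough context to suggest a specific fix (e.g. an index
    or query restructuring) rather than a generic one.
    """
    stats = "\n".join(f"- {t}: {n:,} rows" for t, n in table_stats.items())
    return (
        "A data pipeline step failed. Suggest the most likely fix.\n\n"
        f"Error message:\n{error_message}\n\n"
        f"Failing SQL:\n{sql}\n\n"
        f"Table sizes:\n{stats}\n"
    )
```

The resulting string is what gets submitted to the AI assistant; the response still needs an engineer’s judgment before anything changes in production.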

Challenges of using generative AI in data engineering

Despite the compelling efficiency gains, the integration of generative AI into critical data infrastructure has exposed significant risks that many organizations are only beginning to understand. As companies move beyond pilot projects to production deployments, they are discovering that the same capabilities that make AI tools so powerful also introduce new categories of failure that traditional data engineering practices were never designed to handle.

Hallucination risks

The most immediate challenge facing data teams is AI’s tendency to generate plausible but fundamentally incorrect code or solutions. Unlike traditional software bugs that typically produce obvious errors, AI hallucinations can create subtly flawed logic that passes initial testing but corrupts data over time. Financial services companies have reported instances where AI-generated SQL queries appeared syntactically correct and returned reasonable results during development, only to produce systematic calculation errors when processing large datasets. This phenomenon is particularly dangerous in data engineering, where incorrect transformations can cascade through multiple downstream applications before being detected, potentially affecting business decisions worth millions of dollars.

How to solve this challenge

Leading data teams are implementing multi-layered validation frameworks that treat AI-generated code as inherently untrusted until proven otherwise. Companies like JPMorgan Chase have developed automated testing suites that compare AI-generated outputs against known benchmarks using synthetic datasets with predetermined expected results. These validation pipelines include statistical tests that verify data distributions match expected patterns, unit tests that check edge cases the AI might have missed, and integration tests that ensure new code doesn’t break existing functionality. Many organizations have also instituted mandatory peer review processes where senior engineers must approve any AI-generated code before production deployment, often requiring the reviewer to manually trace through the logic to identify potential flaws that automated testing might miss.
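The benchmark-comparison idea can be sketched as a small validation harness – a generic illustration, not any company’s actual suite: run the candidate transformation against synthetic inputs with predetermined expected outputs, plus a simple distribution check.

```python
import statistics

def validate_transformation(candidate, synthetic_rows, expected, tolerance=1e-9):
    """Treat AI-generated code as untrusted: run it against a synthetic
    dataset with known expected results and return a list of failure
    strings (empty list means the candidate passed this gate)."""
    failures = []
    actual = candidate(synthetic_rows)
    if len(actual) != len(expected):
        failures.append(f"row count {len(actual)} != expected {len(expected)}")
    else:
        for i, (a, e) in enumerate(zip(actual, expected)):
            if abs(a - e) > tolerance:
                failures.append(f"row {i}: {a} != {e}")
    # Distribution sanity check: output mean should match the benchmark.
    if actual and abs(statistics.mean(actual) - statistics.mean(expected)) > tolerance:
        failures.append("mean drifted from benchmark")
    return failures
```

Real suites layer unit tests, edge cases, and integration tests on top of this, and a passing result still feeds into human peer review rather than replacing it.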

Data governance

The introduction of AI-generated code has created unprecedented challenges for data governance teams responsible for regulatory compliance and audit trails. In highly regulated industries like healthcare and finance, every line of code that touches sensitive data must be traceable, explainable, and defensible to regulators. AI tools that generate code based on natural language prompts disrupt traditional approval workflows, making it difficult to establish clear accountability when something goes wrong. Companies are struggling to adapt existing governance frameworks to accommodate AI-assisted development while maintaining the rigorous documentation and approval processes required by regulations like GDPR and SOX compliance.

How to solve this challenge

Organizations are developing comprehensive data governance frameworks that blend human oversight with automated documentation capabilities to maintain regulatory compliance while preserving AI efficiency gains. These frameworks begin with creating pre-approved prompt libraries that have been vetted by legal and compliance teams, ensuring that AI-generated code follows established patterns that meet regulatory requirements. AI governance committees now include data engineers, compliance officers, and legal representatives who review and approve new use cases before deployment, while detailed logging captures all AI interactions, including prompts, iterations, and human modifications.

This human-driven governance process is significantly enhanced by data observability platforms that automatically document data lineage and transformation logic, providing complete audit trails that show which AI-generated transformations touched sensitive data, when they executed, and what changes occurred. This automated documentation capability addresses one of the most time-consuming aspects of AI governance while ensuring that audit trails remain complete and accessible to regulators.
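The interaction logging described above might look like the following minimal sketch; the entry fields are assumptions, not any platform’s schema. Hashing the generated code makes later tampering or drift detectable without storing the code twice.

```python
import hashlib
from datetime import datetime, timezone

def log_ai_interaction(audit_log: list, prompt: str, model: str,
                       generated_code: str, approved_by: str) -> dict:
    """Append one audit-trail entry for an AI-assisted code change."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "code_sha256": hashlib.sha256(generated_code.encode()).hexdigest(),
        "approved_by": approved_by,
    }
    audit_log.append(entry)
    return entry
```

With every generation logged this way, a regulator’s question – who prompted what, which model answered, and who approved the result – has a direct answer.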

Reproducibility and version control issues

Perhaps most troubling for engineering teams is the inherent unpredictability of AI-generated solutions. The same prompt submitted to an AI tool on different days can produce significantly different code, making it nearly impossible to reproduce exact results or debug issues that emerge weeks or months later. This variability conflicts with fundamental software engineering principles that emphasize consistent, repeatable processes. Data teams are discovering that traditional version control approaches are insufficient for managing AI-assisted development, requiring new methodologies to track not just the final code but also the prompts, model versions, and contextual factors that influenced each generation.

How to solve this challenge

Engineering teams are tackling reproducibility through sophisticated versioning strategies that capture the complete context of AI-assisted development while implementing real-time monitoring to detect when reproducibility breaks down. These strategies involve creating “prompt repositories” that store standardized, tested prompts alongside traditional code repositories, with version control tracking changes to both prompts and outputs over time. Companies are also implementing deterministic AI workflows by setting specific random seeds and documenting exact model versions, API parameters, and environmental conditions used during code generation. Internal AI development platforms now automatically log all interactions and maintain audit trails that link generated code back to specific prompts and timestamps.
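The deterministic-workflow idea can be sketched as a fingerprint over every input that influences a generation run; the field set here is illustrative, and a real system would also capture API parameters and environmental conditions.

```python
import hashlib
import json

def generation_fingerprint(prompt: str, model_version: str,
                           temperature: float, seed: int) -> str:
    """Derive a stable fingerprint for one code-generation run.

    Identical fingerprints mean identical generation inputs, so the run
    should be reproducible (given a deterministic model endpoint).
    """
    context = {
        "prompt": prompt,
        "model_version": model_version,
        "temperature": temperature,
        "seed": seed,
    }
    # Canonical JSON (sorted keys) makes the hash order-independent.
    canonical = json.dumps(context, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Storing this fingerprint alongside the generated code links each artifact back to the exact prompt, model version, and parameters that produced it, which is precisely what traditional version control fails to capture on its own.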

This proactive approach to versioning is complemented by data observability platforms that establish baseline patterns for pipeline outputs and immediately alert teams when results deviate from expected norms, even when the underlying code hasn’t changed. This combination helps identify when model updates or environmental changes affect previously reliable AI-generated transformations, allowing teams to quickly isolate and resolve reproducibility issues before they impact business operations.

Data observability will make generative AI better – and vice versa

Data observability gives teams critical insight into the health of their data at each stage of the pipeline, automatically monitoring data and alerting you when systems break. It also surfaces rich context – field-level lineage, logs, correlations, and other insights – that enables rapid triage, incident resolution, and effective communication with stakeholders impacted by data reliability issues, all critical for both trustworthy analytics and AI products.
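As a toy illustration of two core monitors – freshness and volume – consider the sketch below. The thresholds are hard-coded here for clarity; real observability platforms learn expected patterns from historical behavior instead.

```python
from datetime import datetime, timedelta

def check_table_health(last_loaded_at: datetime, row_count: int,
                       expected_rows: int, now: datetime,
                       max_staleness: timedelta = timedelta(hours=6),
                       volume_tolerance: float = 0.5):
    """Return a list of alerts for one table: freshness and volume.

    Freshness: has the table loaded recently enough?
    Volume: is today's row count close to the expected count?
    """
    alerts = []
    if now - last_loaded_at > max_staleness:
        alerts.append("freshness: table has not loaded recently")
    if expected_rows and abs(row_count - expected_rows) / expected_rows > volume_tolerance:
        alerts.append("volume: row count deviates from expected")
    return alerts
```

Multiply this by thousands of tables, add lineage to trace an alert to its downstream dashboards, and the scale of the correlation problem – and the opportunity for LLM assistance – becomes clear.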

Simultaneously, data observability workflows often involve correlating vast amounts of information and writing complex queries and configurations. These workflows lend themselves very well to generative AI, and we have already identified several dozen opportunities to simplify, streamline, and accelerate data observability using LLMs. 

At Monte Carlo, we are hard at work making these a reality and helping our users get their work done faster. Not only do we see data observability as critical to the success of generative AI, but we’re also dedicated to building the only solution that’s generative AI-first, complete with generative AI-enabled features. 

In fact, we’re already integrating with OpenAI's API to offer users troubleshooting advice when their SQL queries fail, speeding up the creation and deployment of data monitoring rules. And in the coming months, we plan to extend our use of AI to help users monitor their data environments and resolve data incidents more efficiently.

Data observability is critical to the future success of generative AI and Monte Carlo is charting the path forward. Will you join us? 

Learn how Monte Carlo and data observability are making generative AI reliable and successful at scale. Request a demo today.