How can we use Agentic AI to solve the hard problems in software engineering?

07/25/2025

It's 2:37 AM. Your phone vibrates with that familiar hum: an on-call incident. A 30% spike in latency.

What happens next isn't just about troubleshooting; it shines a spotlight on how we fundamentally work with our engineering systems. You begin your detective work by opening your array of observability tools. Each provides a clue into a different aspect of your complex environment: error signatures in logs, hidden trails in tracing UIs, or suspicions about recent deployments. Each tool offers isolated breadcrumbs that often lead nowhere. You’re trying to piece together a coherent theory of “what might have happened?”

This ritual, familiar to most engineers, isn't only about the inherent stress of on-call. It should prompt a critical realization: we’re drowning in data, but starving for insight. The problem isn’t limited to finding the right information; we understand very little about our production systems. Why?

Why system understanding is a hard problem to solve

We are working with systems that exceed human cognition: Perhaps most fundamentally, with hundreds or thousands of interconnected services, each generating its own telemetry and exhibiting complex dependencies, no individual can maintain a complete mental model of the system.

We cannot prepare for unknown unknowns: The most challenging incidents arise from conditions never previously encountered, the "unknown unknowns". Investigating them requires generating novel hypotheses about potential failure modes, testing those hypotheses against available evidence, and iteratively refining understanding as new information emerges.

We don’t have access to engineering intuition in our tools: Engineering knowledge exists in two forms: explicit and tacit. Our tools excel at storing explicit knowledge: configurations, architectures, and documented procedures. But they struggle with tacit knowledge: the contextual awareness that seasoned engineers develop over years, an ability to “just know” how to correlate across seemingly unrelated data sets, systems, and timeframes. This "tribal knowledge" isn't magic; it's highly refined pattern recognition. Replicating this intuition computationally is a profound technical challenge that remains largely unsolved.

We focus more on clues than on solving the puzzle

We have powerful tools that collect vast amounts of telemetry – metrics, logs, traces. We visualize this data on intricate dashboards, set up sophisticated alerts, and even employ AI to help us query it. But we’ve built this entire stack on a flawed premise, assuming the core problem is “information retrieval”. Here’s where the problem lies:

We’ve confused information access with understanding: Our current tools excel at answering "What is happening?", though often through siloed views that offer a fragmented picture. They struggle with "Why is it happening?". The fragmented story, coupled with the sheer volume of data, makes it a struggle to separate signal from noise. Inevitably, the more data and views we add, the harder it can become to synthesize a coherent understanding, especially under pressure. Today, the cognitive burden of assembling the puzzle remains squarely on the engineer.

We use passive tools for solving active problems: More importantly, the tools we give our engineers are passive. For example, our observability stack responds only to human queries rather than actively participating in the investigation. This creates an asymmetry: our systems can fail at any time and in complex ways, but we limit our investigations to manual tools and sequential processes. The burden rests heavily on engineers to navigate to the right data and formulate precise questions, as if we expected them to intuit the answer with only a single clue in hand.
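To make the passive/active distinction concrete, here is a minimal Python sketch, not any particular product's implementation. The names `latency_spiked`, `on_latency_sample`, and `start_investigation`, along with the 30% threshold borrowed from the opening scenario, are illustrative assumptions. The only point is that the signal itself starts the investigation, instead of waiting for a human to form a query.

```python
from typing import Callable, Optional


def latency_spiked(baseline_ms: float, current_ms: float, threshold: float = 0.30) -> bool:
    """Return True when latency rose by at least `threshold` relative to baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (current_ms - baseline_ms) / baseline_ms >= threshold


def on_latency_sample(
    service: str,
    baseline_ms: float,
    current_ms: float,
    start_investigation: Callable[[str], str],
) -> Optional[str]:
    """Active path: a qualifying signal triggers the investigation callback."""
    if latency_spiked(baseline_ms, current_ms):
        return start_investigation(service)
    return None  # nothing unusual, so no investigation is opened


# Hypothetical callback standing in for whatever actually runs the investigation.
def start_investigation(service: str) -> str:
    return f"investigation opened for {service}"


result = on_latency_sample(
    "checkout", baseline_ms=200.0, current_ms=270.0,
    start_investigation=start_investigation,
)
```

In a passive setup, the equivalent of `start_investigation` runs only after an engineer has been paged and has decided which dashboard to open first.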

How can AI help us bridge the gap?

Although the natural inclination is to turn to AI for help, how you use AI makes all the difference. We're currently seeing two main approaches emerge:

Passive AI (e.g., Chat interfaces): You can "ask" the chat interface questions like "What's the error rate for service X?" It feels intuitive and lowers the barrier for simple data retrieval.

What is the limitation? Chat requires the engineer to know what questions to ask, which is precisely what's most difficult during novel system failures. It provides a better experience, but at its core it is still information retrieval. Chat interfaces also struggle with episodic memory: they process each interaction independently, lacking a persistent mental model of the system that evolves over time. And they are fundamentally passive. They respond to human queries but don't autonomously identify issues or generate hypotheses.

Agentic AI: Agentic AI is designed to be an active participant in the understanding process. Instead of just responding to queries, agentic AI aims to directly address the cognitive bottleneck.

How does it work?

Proactively investigates the issue: You need not “ask” an AI agent to investigate a certain way. It picks up an issue from a signal and autonomously pursues potential causes, asking its own questions of the system data.
Builds a deep understanding of your systems: When paired with approaches like a knowledge graph, Agentic AI can construct a comprehensive model of your systems and their behaviors across time periods.
Reasons like expert engineers, at machine scale: Most critically, agents perform the cognitive work of connecting disparate evidence into coherent causal narratives. They don't just find data; they construct meaning.
Improves with every interaction: Each investigation, piece of feedback, or change is a signal for an AI agent to improve its understanding of your systems and context. This means its diagnostic capabilities sharpen over time, making its solutions increasingly valuable with every interaction.
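The investigative part of the list above can be sketched as a loop that tests candidate hypotheses against evidence and records which ones hold up. This is a deliberately tiny illustration under stated assumptions: the `Hypothesis` and `Investigation` types, the `investigate` function, and the evidence values are all hypothetical, and a real agent would generate hypotheses and gather evidence itself rather than receive both as arguments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    description: str
    # Hypothetical predicate: True when the evidence supports this hypothesis.
    supported_by: Callable[[Dict[str, float]], bool]


@dataclass
class Investigation:
    signal: str
    supported: List[str] = field(default_factory=list)
    ruled_out: List[str] = field(default_factory=list)


def investigate(
    signal: str, evidence: Dict[str, float], hypotheses: List[Hypothesis]
) -> Investigation:
    """Check every hypothesis against the evidence and sort it into a bucket."""
    result = Investigation(signal)
    for h in hypotheses:
        bucket = result.supported if h.supported_by(evidence) else result.ruled_out
        bucket.append(h.description)
    return result


# Made-up evidence snapshot; in practice this would come from telemetry.
evidence = {"minutes_since_last_deploy": 12.0, "db_p99_ms": 35.0, "error_rate": 0.002}

hypotheses = [
    Hypothesis("recent deployment", lambda e: e["minutes_since_last_deploy"] < 30),
    Hypothesis("slow database", lambda e: e["db_p99_ms"] > 100),
    Hypothesis("elevated error rate", lambda e: e["error_rate"] > 0.01),
]

report = investigate("latency up 30%", evidence, hypotheses)
```

The sketch covers only the "test hypotheses against evidence" step. It does not model the system over time, generate novel hypotheses, or learn from feedback, which are the harder parts the list describes.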

What’s the real measure of success?

The real measure of observability isn't how much data we can collect or how quickly we can query it. It's how effectively we turn that data into actionable understanding. As systems grow increasingly complex, the gap between data collection and meaningful comprehension widens. Closing this gap requires more than better interfaces to the same old paradigm. It demands tools that fundamentally share the cognitive burden of system understanding. Agentic AI offers a promising path forward, acting as your pair operator by automating investigation and constructing understanding from complex data streams.

About Resolve AI

Resolve AI is the agentic AI company for software engineering, founded by the co-creators of OpenTelemetry. By combining our deep expertise in building developer tools and observability with state-of-the-art agentic AI, our mission is to increase engineering velocity by transforming the way engineers build, deploy, and maintain real-world software systems.

Resolve AI autonomously troubleshoots and resolves production issues, freeing up engineers to focus on building. Our agentic AI understands your production environments, reasons like your seasoned engineers, and learns from every interaction to give your engineering teams decisive control over on-call incidents with autonomous investigations and clear resolution guidance.

With Resolve AI, customers like Datastax, Tubi, and Rappi have increased engineering velocity and systems reliability by putting machines on-call for humans and letting engineers just code. Interested in learning more about our Agentic AI approach to production systems? Say hello.

Spiros Xanthos

Founder and CEO

Spiros is the Founder and CEO of Resolve AI. He loves learning from customers and building. He helped create OpenTelemetry and started Log Insight (acquired by VMware) and Omnition (acquired by Splunk). Most recently, he was an SVP and the GM of the Observability business at Splunk.

Varun Krovvidi

Product Marketing Manager

Varun is a product marketer at Resolve AI. As an engineer turned marketer, he is passionate about making complex technology accessible by blending technical fluency with storytelling. Most recently, he was at Google, bringing the story of multi-agent systems and products like the Agent2Agent protocol to market.
