It's 2:37 AM. Your phone vibrates with that familiar hum – an on-call incident. A 30% spike in latency.
What happens next isn't just about troubleshooting; it’s a window into how we fundamentally work with our engineering systems. You begin your detective work by opening your array of observability tools. Each provides a clue into a different aspect of your complex environment: error signatures in logs, hidden trails in tracing UIs, or suspicions about recent deployments. Each tool offers isolated breadcrumbs that often lead nowhere. You’re trying to piece together a coherent theory of “what might have happened?”
This ritual, familiar to most engineers, isn't only about the inherent stress of on-call. It should prompt a critical realization: we’re drowning in data but starving for insight. The problem isn’t limited to finding the right information; we understand very little about our production systems. Why?
We are working with systems that exceed human cognition: Perhaps most fundamentally, our distributed systems comprise hundreds or thousands of interconnected services, each generating its own telemetry and exhibiting complex dependencies. No individual can maintain a complete mental model of such a system.
We cannot prepare for unknown unknowns: The most challenging incidents arise from conditions never previously encountered – the "unknown unknowns". Investigations in such scenarios require generating novel hypotheses about potential failure modes, testing these hypotheses against available evidence, and iteratively refining understanding as new information emerges.
We don’t have access to engineering intuition in our tools: Engineering knowledge exists in two forms: explicit and tacit. Our tools excel at storing explicit knowledge – configurations, architectures, and documented procedures. But they struggle with tacit knowledge: the contextual awareness that seasoned engineers develop over years. That awareness becomes a powerful intuition, an ability to “just know” how to correlate across seemingly unrelated data sets, systems, and timeframes. This "tribal knowledge" isn't magic; it's highly refined pattern recognition. Replicating this intuition computationally is a profound technical challenge that remains largely unsolved.
We have powerful tools that collect vast amounts of telemetry – metrics, logs, traces. We visualize this data on intricate dashboards, set up sophisticated alerts, and even employ AI to help us query it. But we’ve built this entire stack on a flawed premise, assuming the core problem is “information retrieval”. Here’s where the problem lies:
We’ve confused information access with understanding: Our current tools excel at answering "What is happening?", though often through siloed tools that offer a fragmented picture. They struggle with "Why is it happening?". The fragmented story, coupled with the sheer volume of data, makes it a struggle to separate signal from noise. Inevitably, the more data and views we add, the harder it can become to synthesize a coherent understanding, especially under pressure. Today, the cognitive burden of assembling the puzzle remains squarely on the engineer.
We use passive tools for solving active problems: More importantly, the tools we provide to help our engineers are passive. For example, our observability stack responds only to human queries rather than actively participating in the investigation. This creates an asymmetry: our systems can fail at any time and in complex ways, but we limit our investigations to manual tools and sequential processes. The burden rests heavily on engineers to navigate to the right data and formulate precisely the right questions, almost as if we expect them to intuit the answer with only a single clue in hand.
The natural inclination is to turn to AI for help, but how you use AI makes all the difference. We're currently seeing two main approaches emerge:
Passive AI (e.g., Chat interfaces): You can "ask" the chat interface questions like "What's the error rate for service X?" It feels intuitive and lowers the barrier for simple data retrieval.
What is the limitation?: Chat requires the engineer to know what questions to ask – precisely what's most difficult during novel system failures. It provides a better experience, but it is still information retrieval at its core. Chat interfaces also struggle with episodic memory. They process each interaction independently, lacking a persistent mental model of the system that evolves over time. Chat interfaces are fundamentally passive. They respond to human queries but don't autonomously identify issues or generate hypotheses.
Agentic AI: Agentic AI is designed to be an active participant in the understanding process. Instead of just responding to queries, agentic AI aims to directly address the cognitive bottleneck.
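To make the contrast concrete, here is a minimal Python sketch of the two styles. It is purely illustrative and is not how Resolve AI or any particular product works: the `Hypothesis` type, the `passive_query` and `investigate` functions, the metric names, the scores, and the threshold are all hypothetical, and the loop makes a single pass rather than iteratively refining. A passive tool answers only the question it is asked, while an active loop tests a set of candidate explanations against the evidence and ranks the ones that hold up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical telemetry snapshot; real evidence would come from metrics, logs, and traces.
Evidence = Dict[str, float]


@dataclass
class Hypothesis:
    name: str
    # Returns a score in [0, 1]: how strongly the evidence supports this hypothesis.
    check: Callable[[Evidence], float]


def passive_query(evidence: Evidence, metric: str) -> float:
    """Passive style: return only the value the engineer explicitly asked for."""
    return evidence[metric]


def investigate(
    evidence: Evidence,
    hypotheses: List[Hypothesis],
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> List[Tuple[str, float]]:
    """Active style: score every hypothesis and rank those at or above the threshold."""
    scored = [(h.name, h.check(evidence)) for h in hypotheses]
    supported = [(name, score) for name, score in scored if score >= threshold]
    return sorted(supported, key=lambda item: item[1], reverse=True)


# Made-up numbers loosely based on the 2:37 AM latency spike scenario.
evidence = {
    "latency_increase_pct": 30.0,
    "minutes_since_deploy": 12.0,
    "error_rate_pct": 0.4,
    "db_cpu_pct": 35.0,
}

hypotheses = [
    Hypothesis(
        "recent deployment regression",
        lambda e: 0.9 if e["minutes_since_deploy"] < 30 else 0.1,
    ),
    Hypothesis("database saturation", lambda e: e["db_cpu_pct"] / 100.0),
    Hypothesis(
        "elevated error rate",
        lambda e: 0.8 if e["error_rate_pct"] > 1.0 else 0.2,
    ),
]

# The passive call answers one question; the active loop surfaces a ranked explanation.
error_rate = passive_query(evidence, "error_rate_pct")
ranked = investigate(evidence, hypotheses)
print(error_rate)
print(ranked)
```

In this toy setup the passive query returns a single number and leaves the interpretation to the engineer, whereas the loop narrows three candidate explanations down to the one the evidence supports. A real agent would also need to generate hypotheses it was not handed and revisit them as new evidence arrives, which this sketch does not attempt.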
The real measure of observability isn't how much data we can collect or how quickly we can query it. It's how effectively we turn that data into actionable understanding. As systems grow increasingly complex, the gap between data collection and meaningful comprehension widens. Closing this gap requires more than better interfaces to the same old paradigm. It demands tools that fundamentally share the cognitive burden of system understanding. Agentic AI offers a promising path forward, acting as your pair operator by automating investigation and constructing understanding from complex data streams.
Resolve AI is the agentic AI company for software engineering founded by the co-creators of OpenTelemetry. By combining our deep expertise in building developer tools and observability with state-of-the-art agentic AI, our mission is to increase engineering velocity by transforming the way engineers build, deploy, and maintain real-world software systems.
Resolve AI autonomously troubleshoots and resolves production issues, freeing up engineers to focus on building. Our agentic AI understands your production environments, reasons like your seasoned engineers, and learns from every interaction to give your engineering teams decisive control over on-call incidents with autonomous investigations and clear resolution guidance.
With Resolve AI, customers like Datastax, Tubi, and Rappi have increased engineering velocity and systems reliability by putting machines on-call for humans and letting engineers just code. Interested in learning more about our Agentic AI approach to production systems? Say hello.
Spiros Xanthos
Founder and CEO
Spiros is the Founder and CEO of Resolve AI. He loves learning from customers and building. He helped create OpenTelemetry and started Log Insight (acquired by VMware) and Omnition (acquired by Splunk). Most recently, he was an SVP and the GM of the Observability business at Splunk.
Varun Krovvidi
Product Marketing Manager
Varun is a product marketer at Resolve AI. As an engineer turned marketer, he is passionate about making complex technology accessible by blending technical fluency with storytelling. Most recently, he was at Google, bringing the story of multi-agent systems and products like the Agent2Agent protocol to market.