How we built Resolve AI

07/25/2025

Taming the complexity of Production Engineering

Production systems are dynamic and complex. Addressing common production engineering concerns like incident troubleshooting, operations, compliance, and security involves painfully piecing together information from many teams (service on-call rotations, infra, SRE, etc.) and multiple (routinely 10+) different tools (observability, CI/CD, infrastructure, paging, chat, etc.). These tools were not designed to work together, which pushes the complexity onto humans. It’s almost impossible for a single engineer to have full and up-to-date knowledge of these rapidly evolving systems. Non-code artifacts (e.g. runbooks) tend to become outdated quickly. AI-powered coding assistants have accelerated the creation of new code and functionality, but production engineering tools have lagged behind.

Resolve AI is building an AI Production Engineer with the goal of automating the majority of tasks across incident management, operations, coding, security engineering, compliance, and cost. As the first step in our ambitious journey, we are automating incident troubleshooting as the most direct way to prevent outages and improve reliability while relieving engineers from the most stressful part of their job. Our goal is to automate the resolution of 80%+ of alerts and incidents without human involvement.

To achieve this level of automation, we need an AI that has:
1. In-depth understanding of the entire production environment from code to changes, to infra and telemetry
2. Ability to take action using the tools that are in place
3. Ability to reason about novel tasks and incidents using data, without relying solely on anything hard-coded or potentially stale (e.g. runbooks)
4. Ability to support claims with concrete evidence and data, as the burden of proof is high in these mission-critical situations
5. Data security and privacy as foundational principles - we are running in production alongside critical customer systems within highly secure and compliant boundaries

Autonomous and complete understanding of production systems

Cloud applications are inherently complex and dynamic. Production engineering tasks require grasping the lineage and interconnections of all production entities, as well as the multitude of tools used to operate and maintain them. An AI that lacks an understanding of this complexity and connectedness cannot effectively:
• Reason about Upstream and Downstream Dependencies: such as other services, infrastructure components, and managed external or cloud provider services.
• Deduplicate Across Environments: to distinguish between dev and prod, or to handle multiple clusters and deployments of the same service, as seen in large multi-tenant applications.
• Accurately Determine Causality: and filter out noise from unrelated behaviors that merely coincide in time, which always occur in large environments, even within the same entities.

Resolve AI automatically maps a complete knowledge graph of any environment and keeps it up to date, without needing any upfront training or user input. Resolve AI understands infrastructure, application, and source code entities, and all changes to them. It understands which signals and knowledge within tools (e.g. dashboards, alerts, KPIs) are the most relevant for any entity and situation. It builds this understanding by automatically inspecting available content in these tools, by learning from its interactions with users, and by using the tools as described next.
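To make the idea of a knowledge graph concrete, here is a minimal sketch in Python. It is not Resolve AI's implementation: the class name `DependencyGraph`, its methods, and the example entities are all hypothetical. It only illustrates two of the points above: entities are keyed by environment so that dev and prod copies of a service stay distinct, and dependencies can be followed transitively in either direction.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Toy directed graph. An edge A -> B means "A depends on B".

    Entities are (environment, name) tuples, an assumption made here so
    that the same service in dev and prod is never conflated.
    """

    def __init__(self):
        self._depends_on = defaultdict(set)  # entity -> entities it relies on
        self._used_by = defaultdict(set)     # entity -> entities relying on it

    def add_dependency(self, source, target):
        self._depends_on[source].add(target)
        self._used_by[target].add(source)

    @staticmethod
    def _reachable(start, edges):
        # Breadth-first traversal; the `seen` set also protects against cycles.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen

    def dependencies_of(self, entity):
        """Everything `entity` relies on, directly or transitively."""
        return self._reachable(entity, self._depends_on)

    def dependents_of(self, entity):
        """Everything that could be affected if `entity` misbehaves."""
        return self._reachable(entity, self._used_by)


graph = DependencyGraph()
graph.add_dependency(("prod", "checkout"), ("prod", "payments"))
graph.add_dependency(("prod", "payments"), ("prod", "postgres"))
graph.add_dependency(("dev", "checkout"), ("dev", "payments"))

# Which prod entities could be affected by a problem in the prod database?
affected = graph.dependents_of(("prod", "postgres"))
```

In this toy example, `affected` contains the prod `payments` and `checkout` services and nothing from dev. A real graph would also need to carry code, change, and telemetry entities, which this sketch leaves out.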

Teaching AI to use the tools

On-call engineers rely on a multitude of complex tools to diagnose issues and remediate incidents. For an AI to take automated actions effectively, it must not only integrate with all these tools but also operate them with deep intelligence about each task and knowledge of the specific tool’s capabilities. An AI that lacks adaptability across diverse tools cannot effectively:
• Adapt to Organizational Conventions: Each organization, and even each team within it, uses a different set of tools and has unique conventions for data (e.g., names of metrics, labels on logs), which the AI must comprehend and navigate.
• Perform Complex Actions: Without understanding tool-specific setup and capabilities, it cannot page on-callers, apply scaling actions, or implement configuration changes.
• Handle Scale and Live Data: These tools hold large volumes of constantly churning data. Trying to brute-force all of it through retrieval-augmented generation (RAG) is a non-starter.

Resolve AI uses a series of models that combine traditional and generative AI approaches, each purpose-trained for different skills and tools. It automatically learns how each team organizes data within their tools and leverages that knowledge, along with past interactions, to improve its ability to operate and retrieve data from these tools.
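As a small illustration of why per-team conventions matter, the sketch below resolves the same question ("what is the error signal for this service?") into different metric names and label keys depending on the team. Everything here is hypothetical: the team names, metric names, and the `error_selector` helper are invented for the example, and the table is written by hand, whereas the post describes these conventions as being learned automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Convention:
    """How one team names its error metric and its service label (hypothetical)."""
    error_metric: str
    service_label: str


# Hand-written stand-in for conventions that would be learned from the tools.
CONVENTIONS = {
    "payments-team": Convention(error_metric="http_errors_total", service_label="service"),
    "search-team": Convention(error_metric="request_failures", service_label="app_name"),
}


def error_selector(team: str, service: str) -> dict:
    """Translate a tool-agnostic question into a team-specific metric selector."""
    convention = CONVENTIONS[team]
    return {
        "metric": convention.error_metric,
        "labels": {convention.service_label: service},
    }


payments_query = error_selector("payments-team", "checkout")
search_query = error_selector("search-team", "search-api")
```

The two results name different metrics and use different label keys, even though the question asked was identical.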

Agentic AI that troubleshoots repeat and novel incidents

Cloud software is constantly and rapidly evolving. Services and their underlying infrastructure keep changing their operational behavior, and runbooks rarely stay up to date. When runbooks are a dead end, humans are good at making data-driven decisions and going back to first-principles debugging, using their knowledge of how the system is composed and how each component is supposed to behave when changes are introduced. For an AI to be effective beyond trivial or repeat incidents, it must be capable of conducting data-driven, multi-step investigations: systematically triaging issues, forming hypotheses based on observed symptoms, and proving or disproving those hypotheses with evidence from any of the available tools, just as humans do. Like a junior engineer, an AI agent may not be able to troubleshoot every novel incident on its own, so it should be able to collaborate with experts while still offloading much of the grunt work from them.
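The hypothesis-driven loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Resolve AI's method: the `Hypothesis` and `Finding` types, the signal names, the thresholds, and the in-memory `observations` dictionary standing in for real tool queries are all hypothetical. The point it shows is that each hypothesis declares which signals it needs, and each verdict is stored together with the evidence it was based on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    description: str
    signals: List[str]                         # names of the signals to fetch
    check: Callable[[Dict[str, float]], bool]  # True if the evidence supports it


@dataclass
class Finding:
    description: str
    supported: bool
    evidence: Dict[str, float]                 # the values the verdict rests on


def investigate(hypotheses, fetch_signal):
    """Test each hypothesis against only the signals it asks for."""
    findings = []
    for hypothesis in hypotheses:
        evidence = {name: fetch_signal(name) for name in hypothesis.signals}
        findings.append(
            Finding(hypothesis.description, hypothesis.check(evidence), evidence)
        )
    return findings


# Stand-in for live queries against observability tools (made-up values).
observations = {
    "error_rate_before_deploy": 0.01,
    "error_rate_after_deploy": 0.12,
    "db_cpu_utilization": 0.35,
}

hypotheses = [
    Hypothesis(
        "Recent deploy raised the error rate",
        ["error_rate_before_deploy", "error_rate_after_deploy"],
        lambda e: e["error_rate_after_deploy"] > 5 * e["error_rate_before_deploy"],
    ),
    Hypothesis(
        "Database is CPU-saturated",
        ["db_cpu_utilization"],
        lambda e: e["db_cpu_utilization"] > 0.9,
    ),
]

findings = investigate(hypotheses, observations.__getitem__)
```

With these made-up values, the first hypothesis is supported and the second is ruled out, and each `Finding` carries the numbers behind its verdict so a human can review them.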

The Resolve AI Agentic Intelligence platform consists of several agents with specialized (and composable) capabilities needed to accomplish production engineering tasks. Each agent brings together deep domain intelligence and extensive training on how to accomplish tasks with external tools. A planner takes a task and orchestrates execution across these agents until the desired goal is achieved. The planner leverages state-of-the-art techniques to balance quality, explainability, and latency. All agents can draw on learnings from prior interactions and create new learnings going forward, enabling high coverage over time even for novel incidents. The learnings are decomposed so that they can be reused in a variety of situations, not just the one in which they were acquired. This is a very challenging system to build, as it needs a large and diverse set of agents operating on multi-modal and dynamic data.
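A planner that orchestrates specialized agents can be pictured with the following minimal Python sketch. It is a deliberately simplified assumption, not a description of Resolve AI's planner: the agent functions, the `choose_next` policy, and the shared-state dictionary are hypothetical, and a real planner would weigh quality, explainability, and latency rather than follow fixed rules.

```python
from typing import Callable, Dict, Optional

Agent = Callable[[dict], dict]  # reads the shared state, returns new facts


def run_plan(
    agents: Dict[str, Agent],
    choose_next: Callable[[dict], Optional[str]],
    state: dict,
    max_steps: int = 10,
) -> dict:
    """Call agents one at a time until the policy says the goal is reached."""
    state = dict(state)   # do not mutate the caller's dictionary
    trace = []            # which agents ran, in order, for explainability
    for _ in range(max_steps):
        name = choose_next(state)
        if name is None:  # policy reports nothing left to do
            break
        state.update(agents[name](state))
        trace.append(name)
    state["trace"] = trace
    return state


# Two hypothetical agents returning canned results.
def logs_agent(state: dict) -> dict:
    return {"error_signature": "TimeoutError in payments"}


def deploys_agent(state: dict) -> dict:
    return {"suspect_change": "payments config change"}


def choose_next(state: dict) -> Optional[str]:
    # Hard-coded policy for illustration only.
    if "error_signature" not in state:
        return "logs"
    if "suspect_change" not in state:
        return "deploys"
    return None


result = run_plan(
    {"logs": logs_agent, "deploys": deploys_agent},
    choose_next,
    {"alert": "checkout latency high"},
)
```

The `max_steps` bound keeps a misbehaving policy from looping forever, and the recorded `trace` gives a simple account of which agents contributed to the result.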

On the Fly UI to Build Trust

Generative AI is probabilistic in nature and not always 100% accurate. AI models without the full context hallucinate, which can mislead users. For an AI that takes actions, building trust with users is of paramount importance: the burden of proof is high, and evidence must always be presented for any decision or action. At the same time, our AI processes vast amounts of data, which can overwhelm users if presented unprocessed. All these challenges require a novel approach to user experience (UX).

Traditional UIs, such as dashboard-centered or chat interfaces, are too inflexible and lack the depth needed for complex tasks. This led us to create an on-the-fly UI, specifically designed to adapt to various tasks and incidents, providing the right tools and insights exactly when and where they’re needed. Think of generated dashboards with many different components, built on-the-fly, for the task at hand.

Resolve AI is also trained to understand and interpret tasks like a human engineer. It can explain to the user how it interpreted the provided task and, if the task is unclear, ask for clarification. It has a memory that stores acquired knowledge about each environment, so it does not ask the same question twice.

To enhance our models' reasoning capabilities, Resolve AI's UI empowers humans to collaborate and provide guidance when there are gaps in analysis. This guidance feeds back into the AI’s learnings and ensures that the next interaction will be better, in similar scenarios and even in different ones.

Non-negotiable focus on Data Privacy and Security

We never train any model on customer data. Furthermore, we pull data from a tool only to the extent needed to perform a task; the data always resides in the tools that were chosen for the task.

Our Engineering Culture

Products and technology like Resolve AI cannot be built in a vacuum. Our team is passionate about being customer-centric and has been collaborating with a group of design partners from day one. Every feature and capability is tied to a user outcome in order to maximize our rate of learning in these uncharted waters. At the same time, we are big believers in the transformative role of AI in software and product engineering, and we use AI heavily to accelerate all aspects of product development, including dogfooding our own product. The team is built around respect and trust, and all engineers are empowered to bring their opinions and shape our strategy. We value ownership, persistence, and out-of-the-box thinking, and we foster a culture of psychological safety that encourages taking risks and learning from failures.

We want to hear from you

We live and breathe technology. We would love to hear from you:
• Feedback from the community about your experience with similar systems and technologies
• Potential engineers who are looking to grow in their career, work on cutting edge technologies, and drive huge industry impact
• Potential users who are feeling the pain of production engineering and on-call and want to leverage AI to balance the scales

Mayank Agarwal

Founder and CTO
