Taming the complexity of Production Engineering
Production systems are dynamic and complex. Addressing common production engineering concerns like incident troubleshooting, cloud operations, security, compliance, and cost involves painfully piecing together information from many teams (service on-call rotations, Platform, SRE, etc.) and multiple (routinely 10+) different tools (observability, CI/CD, infrastructure, paging, chat, etc.). These tools were not designed to work together, pushing the complexity onto humans. In large organizations, it’s almost impossible for a single engineer to maintain full and up-to-date knowledge of rapidly evolving production systems. This challenge is compounded by the fact that non-coding artifacts (e.g., runbooks) tend to become outdated quickly. While AI coding assistants are helping developers ship code faster than ever, engineering teams’ capacity to handle the increasing volume and complexity of deployments in production is falling behind.
Resolve AI is tackling this challenge by building an AI Production Engineer with the goal of automating the majority of tasks across incident management, cloud operations, security engineering, compliance, and cost management. As the first step in our ambitious journey, we are automating incident troubleshooting, because it is the most direct way to prevent outages and improve reliability while relieving engineers of the most stressful part of their job. Our goal is to automate the resolution of 80%+ of alerts and incidents without human involvement. While our current system is already very effective, more breakthroughs are required to achieve this level of automation. Below is the blueprint for an AI, developed with our customers, that is designed to become as effective as humans at performing Production Engineering tasks independently.
Deep understanding of production systems and tools without training
Engineers rely on a multitude of complex tools to diagnose issues and remediate incidents – source code, CI/CD, infra, observability, runbooks, chat, and more. Performing production engineering tasks requires understanding the lineage and connections of all production entities to each other and to all of these tools. For an AI to take automated actions, it must integrate with all of these tools, choose the best tool for any task, and use them the way humans do (e.g., write queries and read charts), while being able to:
• Adapt to organizational conventions: Each organization, and even individual teams within it, uses a different set of tools and has unique conventions for data (e.g., names of metrics, labels on logs), which the AI must comprehend and navigate
• Join dependencies and knowledge from multiple systems: Accurately piece together information and dependencies about a service, deployment, etc. from multiple tools (e.g., dashboards, incident reports, CI/CD, infrastructure, source code)
• Handle scale, limits, and live data: These tools hold large volumes of constantly changing data; brute-force RAG over all of it is a non-starter for cost, latency, and quality
Resolve AI automatically maps and keeps up to date a complete knowledge graph of any production environment, without needing any upfront training or user input. It builds knowledge of which tools and signals are relevant for any situation. It comes pre-built with models for various tool categories such as metrics, logs, traces, and infrastructure, seamlessly connecting with category- and vendor-specific products like Prometheus, Splunk, GCP, AWS, Azure, and others. These models automatically and continuously adapt to each customer’s environment. Resolve utilizes a dynamic RAG system that selects the optimal models for each task while retrieving only relevant data, minimizing overhead on the tools it uses.
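To make the idea concrete, here is a minimal, hypothetical sketch of knowledge-graph-scoped retrieval in Python. The entities, signal-to-tool mappings, and the `plan_retrieval` helper are illustrative assumptions for this post, not Resolve AI’s actual implementation.

```python
# Hypothetical sketch (not the actual implementation): how a dynamic RAG layer
# over a production knowledge graph might pick tools and scope retrieval for a
# single troubleshooting task. All names and mappings are illustrative.
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in the production knowledge graph, e.g. a service or database."""
    name: str
    kind: str                                               # "service", "database", ...
    signals: dict[str, str] = field(default_factory=dict)   # signal type -> tool
    depends_on: list[str] = field(default_factory=list)


class KnowledgeGraph:
    def __init__(self, entities: list[Entity]):
        self._by_name = {e.name: e for e in entities}

    def blast_radius(self, name: str) -> list[Entity]:
        """Return the entity plus its direct dependencies: the scope worth querying."""
        root = self._by_name[name]
        return [root] + [self._by_name[d] for d in root.depends_on if d in self._by_name]


def plan_retrieval(graph: KnowledgeGraph, alert_entity: str, signal_types: list[str]) -> list[dict]:
    """Dynamic retrieval planning: only the tools and entities relevant to this
    alert are queried, instead of bulk-indexing every tool up front."""
    plan = []
    for entity in graph.blast_radius(alert_entity):
        for signal in signal_types:
            tool = entity.signals.get(signal)
            if tool:  # skip signals this entity doesn't emit
                plan.append({"entity": entity.name, "signal": signal, "tool": tool,
                             "query_scope": f"last 30m, labels for {entity.name}"})
    return plan


graph = KnowledgeGraph([
    Entity("checkout", "service", {"metrics": "prometheus", "logs": "splunk"}, ["payments-db"]),
    Entity("payments-db", "database", {"metrics": "cloudwatch"}),
])
print(plan_retrieval(graph, "checkout", ["metrics", "logs"]))
```

The key design point the sketch tries to show is that retrieval is planned from the graph (entity, signal, tool, scope) before any data is pulled, rather than retrieving everything and filtering afterwards.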
Agentic AI that autonomously troubleshoots repeat and novel incidents
Services and underlying infrastructure continually change operational behavior, and runbooks rarely stay up to date. Humans overcome this challenge by reasoning from first principles and making data-driven decisions, relying on their knowledge of system composition and expected behaviors. For an AI to be effective at such multi-step and open-ended interactions, it must:
• Deal with novel incidents: Many incidents are novel, so pattern matching alone won’t help. Even repeat incidents usually vary enough that an AI that cannot generalize won’t be effective and, worse, will misguide users
• Accurately determine causality: Filter out noise from unrelated but temporally correlated behaviors that are always present in large environments, even within the same entities
• Learn as it encounters new situations: Each system has its own intricacies, and even humans take time to learn its operational behavior. The AI should be able to learn on the job as it collaborates with humans, and generalize effectively so it doesn’t require the same guidance for a sub-task in a different context
• Perform complex actions using tools: The AI should be able to perform complex tasks such as loading a dashboard, paging on-call engineers, or applying scaling and configuration changes following org-specific conventions
Resolve’s Agentic AI consists of several agents with specialized (and composable) capabilities. Each agent combines deep domain intelligence with the ability to use tools the way a human would, and can reason over dynamic, multi-modal data. A reasoning engine takes a task and orchestrates execution across multiple agents to achieve the desired goal, leveraging state-of-the-art techniques to balance quality, explainability, and latency. All agents leverage learnings from prior interactions and know how to generalize, generating new knowledge over time and enabling high coverage of novel incidents. The learnings are decomposed so they can be reused across different situations.
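As an illustration only, a multi-agent orchestration loop of this kind might look roughly like the following; the agent classes, routing logic, and hard-coded plan are hypothetical stand-ins, not the actual reasoning engine.

```python
# Hypothetical sketch: a reasoning loop that decomposes a troubleshooting goal
# into sub-tasks and routes each one to the specialized agent best suited to it.
from typing import Protocol


class Agent(Protocol):
    name: str
    def can_handle(self, task: str) -> bool: ...
    def run(self, task: str, context: dict) -> dict: ...


class MetricsAgent:
    name = "metrics"
    def can_handle(self, task):
        return "latency" in task or "error rate" in task
    def run(self, task, context):
        return {"finding": f"checked metrics for: {task}", "evidence": ["p99 latency chart"]}


class DeployAgent:
    name = "deploys"
    def can_handle(self, task):
        return "deploy" in task or "rollout" in task
    def run(self, task, context):
        return {"finding": f"checked recent deploys for: {task}", "evidence": ["CI/CD history"]}


class ReasoningEngine:
    def __init__(self, agents):
        self.agents = agents

    def plan(self, goal: str) -> list[str]:
        # In practice a model would produce this plan; hard-coded here for illustration.
        return [f"check error rate for {goal}", f"check recent deploy of {goal}"]

    def investigate(self, goal: str) -> list[dict]:
        context, findings = {"goal": goal}, []
        for task in self.plan(goal):
            agent = next((a for a in self.agents if a.can_handle(task)), None)
            if agent is None:
                continue  # no specialist available; a real engine would re-plan or ask a human
            result = agent.run(task, context)
            context[agent.name] = result  # later agents see earlier findings
            findings.append({"task": task, "agent": agent.name, **result})
        return findings


engine = ReasoningEngine([MetricsAgent(), DeployAgent()])
for step in engine.investigate("checkout service"):
    print(step)
```

The point of the sketch is the shape of the loop: plan, route each sub-task to a specialist, and carry earlier findings forward as context so later agents can build on them.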
On-the-fly UI
Generative AI is inherently probabilistic and not always 100% accurate. Without full context, AI models may hallucinate, potentially misleading users. For an AI that takes actions, building user trust is paramount; it must present clear evidence for any decision or action. At the same time, vast amounts of data can overwhelm users if presented unfiltered. These challenges require a novel approach to user experience (UX):
• Support claims with evidence: AI must provide transparent evidence for its decisions, along with the knowledge and reasoning behind its conclusions, ensuring users can confidently rely on its actions
• Present data in context: Raw, unprocessed data can overwhelm users and hinder informed decision-making, while static visualizations add complexity and are hard to interpret
• Collaborate with humans in natural language: AI must use an interface that integrates naturally with the tools users are familiar with (e.g., Slack, Zoom) and take responsibility for clarifying vague natural language, so it can learn or seek guidance when needed
Resolve AI doesn’t rely on static, predefined constructs like dashboards. It generates an on-the-fly UI for each incident and task, providing the right visualizations and insights exactly when needed and building components tailored to the situation. Resolve AI explains how it interpreted the task and asks for clarification where needed. It displays relevant evidence, including where it was sourced from and how it was interpreted. It remembers acquired knowledge to avoid repeating questions and uses it to improve its ability to answer new ones. The UI empowers users to collaborate in-line and provide guidance when there are gaps in the analysis, feeding this back into the AI’s learning to improve future interactions.
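A minimal sketch of the idea, assuming a hypothetical `build_incident_view` helper and declarative component specs (none of which are Resolve AI’s real interfaces): each finding is rendered from a spec generated for that incident, with the backing evidence and its source attached to every claim.

```python
# Hypothetical sketch of an "on-the-fly UI": findings are turned into a
# declarative component spec per incident instead of a predefined dashboard.
import json


def build_incident_view(incident_id: str, findings: list[dict]) -> dict:
    """Turn agent findings into a UI spec: each claim is paired with the
    evidence and source tool that back it, so users can verify the reasoning."""
    components = []
    for f in findings:
        components.append({
            "type": "chart" if f.get("series") else "callout",
            "title": f["claim"],
            "evidence": {
                "source_tool": f["source_tool"],       # where the data came from
                "query": f["query"],                   # how it was retrieved
                "interpretation": f["interpretation"]  # how the AI read it
            },
            "data": f.get("series", []),
        })
    return {"incident": incident_id, "layout": "timeline", "components": components}


spec = build_incident_view("INC-1234", [{
    "claim": "Error rate on checkout spiked after the 14:02 deploy",
    "source_tool": "prometheus",
    "query": 'rate(http_errors_total{service="checkout"}[5m])',
    "interpretation": "Spike begins within 2 minutes of the rollout",
    "series": [[0, 0.2], [1, 4.7]],
}])
print(json.dumps(spec, indent=2))
```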
Non-negotiable focus on data privacy and security
Production systems process valuable and sensitive data, making privacy, security, and regulatory compliance absolutely critical. We strongly believe that AI must be designed with the following principles:
• Data protection and security: Maintain clear boundaries between customer datasets to prevent unauthorized access and leakage, and implement industry best practices for secure data handling, the Principle of Least Privilege (PoLP), access control, and protection of sensitive data
• High compliance: Depending upon the environment and the nature of the data handled, it must meet the requirements of SOC 2 Type II, ISO 27001, PCI, GDPR, and possibly FedRAMP
• Strong AI guardrails: Ensure that the system only answers intended queries and is protected from both traditional and AI-specific attack vectors
Resolve AI upholds an uncompromising commitment to data privacy and security. We don’t use customer data to train our models. We obtain minimally scoped, read-only access, retrieve data from tools only to the extent necessary to perform a task, and provide additional controls to further limit the data we can access. Resolve only stores metadata about which tools the data lives in; the actual data always resides within the customer’s tools. This approach minimizes data exposure and keeps sensitive information secure and confined to its original environment.
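As a hedged sketch of the metadata-only approach, assuming a hypothetical `EvidencePointer` record and a customer-provided read-only client (both illustrative, not the actual implementation): only a pointer to where the evidence lives is persisted, and the data itself is re-fetched on demand from the customer’s tools.

```python
# Hypothetical sketch of "store metadata, not data": the system keeps only a
# pointer to where evidence lives in the customer's tools and re-fetches it on
# demand with read-only credentials. Illustrative names only.
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidencePointer:
    """What gets persisted: enough to re-run the query, but none of the payload."""
    tool: str        # e.g. "splunk"
    query: str       # how to retrieve the evidence again
    time_range: str  # e.g. "2024-06-01T14:00Z/2024-06-01T14:30Z"


def fetch_evidence(pointer: EvidencePointer, read_only_client) -> list[dict]:
    """Re-resolve a pointer at view time; raw results are never written to disk."""
    return read_only_client.search(pointer.query, time_range=pointer.time_range)
```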
Exciting challenges ahead
While we are off to a great start, there are still a lot of hard and interesting problems to solve. We need to keep training highly specialized models to operate effectively across a growing number of tasks and tools, each of which brings its own challenges and capabilities. Our agentic AI needs to keep improving its reasoning abilities for increasingly complex tasks through better planning, learning, and orchestration. Finally, our customers face problems not just in incident troubleshooting, but in a variety of production engineering areas, which need to be built out on the same platform.
See Resolve AI in action or request access by booking a demo. If you’re passionate about what we’re building, consider joining our team.