AI is helping us generate code much faster, but we aren’t shipping it as fast as we write it. The missing piece is production context: how we understand production systems is still stuck in polyglot queries and manually constructed narratives. Consider this contrast:
| Code generation in 2025 | Debugging production in 2025 |
| --- | --- |
| "Create a payment service that handles retries and timeouts" -> AI generates an implementation with error handling, using the context of your code base | "Improve latency on the payment service" -> check Datadog for metrics -> switch to Loki for logs -> cross-reference deployment history -> correlate timestamps -> build a mental model of the payment sequence... and so on |
At Resolve AI, we are making debugging conversational, what we like to call "Vibe Debugging": collapsing the entire loop of hypothesis -> evidence -> validation into a single conversation. Along the way, we also realized that logs are the most valuable evidence yet the most challenging to navigate.
In conversational debugging, logs carry the highest value because they contain the ground truth. Metrics tell you what happened (latency increased), traces show where (a bottleneck in service X), but logs explain why (the connection pool was exhausted). They're how engineers leave debugging breadcrumbs and where the actual failure reasons live.
The manual debugging process is a complex, iterative cycle. Engineers form hypotheses ("Maybe it's a database issue?"), switch between different tools (jumping from database metrics to application logs), craft platform-specific queries (learning Grafana's syntax, then LogQL for your log platform), gather evidence by staring at charts and searching for anomalies, manually correlate timestamps across disparate systems, and synthesize information to build a mental model of what happened. When the hypothesis proves wrong, the entire cycle begins again.
Building AI agents for log investigation isn't just "ChatGPT for logs." Logs are fundamentally unstructured. Unlike metrics (time series) or traces (structured events), logs are free-form text with infinite variety. Every service logs differently. This creates a paradox: the most valuable debugging information is trapped in the least structured format.
Making logs conversational requires solving problems that traditional log analysis tools sidestep entirely:
Translating questions into different query languages, at scale
Every log investigation starts with a human question like "Why did checkout break?" But answering it requires translating intent across different platforms and query languages. Even a simple search for error logs forces you to become a translator.
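For illustration, here is roughly how the same question, "show me recent error logs for the payment service," might be phrased across a few common platforms (the field names and index are assumptions; every environment names them differently):

```text
# Loki (LogQL)
{app="payment-service"} |= "ERROR"

# Datadog log search
service:payment-service status:error

# Elasticsearch / Kibana (KQL)
service.name:"payment-service" and level:"error"

# Splunk (SPL)
index=prod service=payment-service log_level=ERROR
```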
You're fighting semantic differences across platforms. Is it service.name or app or component? Does "error" mean log level, HTTP status, or exception presence? Each platform has evolved its own semantic model, creating substantial translation challenges between natural language questions and executable queries.
To understand the magnitude of this challenge, consider that even Text2SQL (converting natural language to structured database queries) remains largely unsolved despite years of research. The best GPT-4-based systems today achieve only ~60% exact-match accuracy on structured database queries with well-defined schemas¹. Agentic systems that layer iteration and error correction on top can require up to 10 attempts per query² while still struggling with complex joins.
If AI can't reliably handle structured database queries, debugging distributed systems with unstructured logs becomes exponentially harder. You're essentially doing Text2SQL, but instead of structured tables, you're working with millions of unstructured text entries, inconsistent formats across services, no predefined schema, and temporal correlations spanning hours or days.
Analyzing causality across different systems
Real incidents demonstrate this complexity perfectly. When you ask "Why did checkout break?", you're correlating error patterns across multiple services and building temporal causality. Consider this debugging scenario:
```text
14:23:15 payment-service: ERROR Connection timeout to auth-db
14:23:15 auth-service: INFO Processing token validation
14:23:16 payment-service: WARN Retrying connection to auth-db
14:23:17 database-pool: ERROR Max connections reached (100/100)
14:23:18 payment-service: ERROR Transaction failed: unable to validate auth
```
You immediately see the story: the auth service is overwhelming the database, causing payment failures. But extracting that narrative requires correlating timestamps across services, knowing how payment-service, auth-service, and the database pool depend on each other, and recognizing that the exhausted connection pool is the cause rather than just another symptom.
The hardest part isn't finding errors. It's understanding which error caused the cascade. A memory leak in service A might trigger timeouts in service B, which overwhelms service C. Traditional correlation fails here. You need systems that understand how failures propagate through specific architectures.
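As a toy illustration (a sketch of the intuition, not how our agents are implemented), one way to encode "errors in deeper dependencies are better root-cause candidates than errors in their callers" looks like this; the dependency map and the heuristic are assumptions made up for the example:

```python
# Hand-written dependency map for the scenario above; in practice this has to be
# learned from traces, configuration, and deploy metadata.
DEPENDS_ON = {
    "payment-service": ["auth-service", "database-pool"],
    "auth-service": ["database-pool"],
    "database-pool": [],
}

# (timestamp, service) pairs for the ERROR lines in the snippet.
errors = [
    ("14:23:15", "payment-service"),  # Connection timeout to auth-db
    ("14:23:17", "database-pool"),    # Max connections reached (100/100)
    ("14:23:18", "payment-service"),  # Transaction failed
]

def depth(service):
    """Distance from a service to the leaves of the dependency graph.
    Leaf components (here, database-pool) get depth 0."""
    deps = DEPENDS_ON.get(service, [])
    return 0 if not deps else 1 + max(depth(d) for d in deps)

# Prefer errors in the deepest dependency, breaking ties by earliest timestamp.
root_cause = min(errors, key=lambda e: (depth(e[1]), e[0]))
print(root_cause)  # ('14:23:17', 'database-pool')
```

Real systems need far more than this, but the shape of the problem is the same: causality lives in the topology and the timestamps, not in any single log line.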
Finding the right signal at scale
Production logs run to millions of entries. Feeding everything to an LLM risks hallucination and quickly blows past the context window; sampling risks missing the one critical error that explains everything. The challenge becomes particularly acute when investigating distributed-system failures that span multiple services and time periods.
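One common way to tame the volume (a minimal sketch, not a description of our pipeline) is to collapse lines into templates and keep counts plus a few representative samples, so that rare patterns, which are often the interesting ones, surface first. The normalization rules below are illustrative assumptions:

```python
import re
from collections import Counter, defaultdict

def to_template(line):
    """Collapse variable parts (hex ids, quoted strings, numbers) into placeholders."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r'"[^"]*"', "<STR>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def compress(lines, samples_per_template=1):
    """Reduce a large log stream to (count, template, sample lines), rarest first."""
    counts, samples = Counter(), defaultdict(list)
    for line in lines:
        tpl = to_template(line)
        counts[tpl] += 1
        if len(samples[tpl]) < samples_per_template:
            samples[tpl].append(line)
    # A one-off ERROR is more interesting than the millionth routine INFO line.
    return sorted(((c, t, samples[t]) for t, c in counts.items()), key=lambda x: x[0])

logs = [
    "INFO Processing token validation for user 4182",
    "INFO Processing token validation for user 9443",
    "ERROR Max connections reached (100/100)",
]
for count, template, _ in compress(logs):
    print(f"{count}x  {template}")
# 1x  ERROR Max connections reached (<NUM>/<NUM>)
# 2x  INFO Processing token validation for user <NUM>
```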
Log patterns change across different organizations and teams
Every organization, and sometimes even different teams within the same organization, has its own logging patterns. Field names, error formats, and service naming conventions vary dramatically, and learning these specific patterns well enough to construct a coherent narrative is hard. Traditional approaches struggle because they can't adapt to organization-specific patterns and the relationships between services.
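A small, made-up illustration of the problem: the same canonical concept maps to different field names in different environments, so any query has to be rewritten per organization before it can run. The alias tables here are hypothetical:

```python
# Hypothetical per-organization alias tables; real ones have to be learned.
FIELD_ALIASES = {
    "org-a": {"service": "service.name", "severity": "level"},
    "org-b": {"service": "app", "severity": "log_level"},
    "org-c": {"service": "component", "severity": "status"},
}

def build_filter(org, service, severity):
    """Rewrite a canonical filter into the field names a given org actually uses."""
    aliases = FIELD_ALIASES[org]
    return {aliases["service"]: service, aliases["severity"]: severity}

print(build_filter("org-a", "payment-service", "error"))
# {'service.name': 'payment-service', 'level': 'error'}
print(build_filter("org-b", "payment-service", "error"))
# {'app': 'payment-service', 'log_level': 'error'}
```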
To ship software as quickly as we write it, debugging needs to be as conversational and intuitive as modern code generation.
To make this happen, AI agents should be able to query the right logs across the tools in your stack to look for errors, exceptions, or patterns. Today, engineers are forced to translate queries across different tools, pattern-match, and correlate information across sources to validate their hypotheses during debugging. Solving this requires breakthrough approaches to natural language understanding, distributed systems reasoning, and human-AI collaboration. These are urgent engineering problems: every minute saved working with production translates into faster incident response and higher engineering velocity.
At Resolve AI, we're building AI agents that provide an interface to your production systems. Our agents learn how your production systems operate, combining structured metrics, semi-structured traces, and unstructured logs into a coherent understanding. To help you debug, they reason through your specific system's patterns, not just general debugging heuristics. When you ask a question, they pursue multiple hypotheses and correlate evidence from the right data sources. Lastly, they understand your tool stack as an engineer would: they can interpret Grafana dashboards and follow your conventions to generate code suggestions, Git commits, or PRs.
With Resolve AI, customers like Datastax, Tubi, and Rappi have increased engineering velocity and system reliability by putting machines on-call for humans and letting engineers just code. If building systems that can reason about distributed infrastructure at scale excites you, we'd love to talk. We're looking for engineers who want to tackle these challenging problems and define what the next generation of debugging looks like.
¹ DIN-SQL + GPT-4 achieves 60.0% exact set match accuracy on the Spider benchmark (Yale Spider Leaderboard, 2024). While execution accuracy can reach 85%+, exact match (getting the SQL query precisely right) remains challenging even for state-of-the-art models on structured database queries.
² Agentic Text2SQL systems can require "up to 10 iterations to refine a SQL query" with multi-stage workflows involving schema linking, candidate generation, self-correction, and evaluation (Hexgen-Text2SQL research, 2024). This iterative approach increases computational complexity while still struggling with accuracy on complex queries.
Priyatham Bollimpalli
Research Engineer
@ Resolve AI
With 9+ years of ML engineering experience, Priyatham brings incredible expertise from his time as a Senior ML Applied Scientist at Microsoft, where he led domain-specific model quality improvements for Microsoft Copilot Studio, serving over 2.1 million monthly active users. His work spans the full spectrum of AI innovation -- from fine-tuning code generation models and developing systems to optimizing multilingual support across 20+ languages.