AI is helping us generate code much faster, but we aren’t shipping it as fast as we write it. The missing piece is production context: how we understand production systems is still stuck in polyglot queries and manually constructed narratives. Consider this contrast:
| Code generation in 2025 | Debugging production in 2025 |
| --- | --- |
| "Create a payment service that handles retries and timeouts" -> AI generates an implementation with error handling, using the context of your code base | "Improve latency on the payment service" -> check Datadog for metrics -> switch to Loki for logs -> cross-reference deployment history -> correlate timestamps -> build a mental model of the payment sequence... and so on |
At Resolve AI, we are making debugging conversational, what we like to call "Vibe Debugging": collapsing the entire loop of hypothesis -> evidence -> validation into a single conversation. Along the way, we also realized that logs are the most valuable evidence yet the most challenging to navigate.
In conversational debugging, logs carry the highest value because they contain the ground truth. Metrics tell you what happened (latency increased), traces show where (a bottleneck in service X), but logs explain why (the connection pool was exhausted). They're how engineers leave debugging breadcrumbs and where the actual failure reasons live.
The manual debugging process is a complex, iterative cycle. Engineers form hypotheses ("Maybe it's a database issue?"), switch between different tools (jumping from database metrics to application logs), craft platform-specific queries (learning Grafana's syntax, then LogQL for your log platform), gather evidence by staring at charts and searching for anomalies, manually correlate timestamps across disparate systems, and synthesize information to build a mental model of what happened. When the hypothesis proves wrong, the entire cycle begins again.
Building AI agents for log investigation isn't just "ChatGPT for logs." Logs are fundamentally unstructured. Unlike metrics (time series) or traces (structured events), logs are free-form text with infinite variety. Every service logs differently. This creates a paradox: the most valuable debugging information is trapped in the least structured format.
Making logs conversational requires solving problems that traditional log analysis tools sidestep entirely:
Translating questions into different query languages, at scale
Every log investigation starts with a human question like "Why did checkout break?" But answering it requires translating intent across different platforms and query languages. Even a simple search for error logs forces you to become a translator.
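For illustration, here is roughly how the same question, "show me recent error logs for the payment service," might be phrased across a few common platforms (the field names and index are assumptions; every environment names them differently):

```text
# Loki (LogQL)
{app="payment-service"} |= "ERROR"

# Datadog log search
service:payment-service status:error

# Elasticsearch / Kibana (KQL)
service.name:"payment-service" and level:"error"

# Splunk (SPL)
index=prod service=payment-service log_level=ERROR
```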
You're fighting semantic differences across platforms. Is it service.name or app or component? Does "error" mean log level, HTTP status, or exception presence? Each platform has evolved its own semantic model, creating substantial translation challenges between natural language questions and executable queries.
To understand the magnitude of this challenge, consider that even Text2SQL (converting natural language to structured database queries) remains largely unsolved despite years of research. The best GPT-4-based systems today achieve only ~60% exact-match accuracy on structured database queries with well-defined schemas¹. Agentic systems that layer iteration and error correction on top can require up to 10 attempts per query² while still struggling with complex joins.
If AI can't reliably handle structured database queries, debugging distributed systems with unstructured logs becomes exponentially harder. You're essentially doing Text2SQL, but instead of structured tables, you're working with millions of unstructured text entries, inconsistent formats across services, no predefined schema, and temporal correlations spanning hours or days.
Analyzing causality across different systems
Real incidents demonstrate this complexity perfectly. When you ask "Why did checkout break?", you're correlating error patterns across multiple services and building temporal causality. Consider this debugging scenario:
```text
14:23:15 payment-service: ERROR Connection timeout to auth-db
14:23:15 auth-service: INFO Processing token validation
14:23:16 payment-service: WARN Retrying connection to auth-db
14:23:17 database-pool: ERROR Max connections reached (100/100)
14:23:18 payment-service: ERROR Transaction failed: unable to validate auth
```
You immediately see the story: the auth service is overwhelming the database, causing payment failures. But extracting that narrative requires correlating timestamps across services, knowing how payment-service, auth-service, and the database pool depend on each other, and recognizing that the exhausted connection pool is the cause rather than just another symptom.
The hardest part isn't finding errors. It's understanding which error caused the cascade. A memory leak in service A might trigger timeouts in service B, which overwhelms service C. Traditional correlation fails here. You need systems that understand how failures propagate through specific architectures.
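As a toy illustration (a sketch of the intuition, not how our agents are implemented), one way to encode "errors in deeper dependencies are better root-cause candidates than errors in their callers" looks like this; the dependency map and the heuristic are assumptions made up for the example:

```python
# Hand-written dependency map for the scenario above; in practice this has to be
# learned from traces, configuration, and deploy metadata.
DEPENDS_ON = {
    "payment-service": ["auth-service", "database-pool"],
    "auth-service": ["database-pool"],
    "database-pool": [],
}

# (timestamp, service) pairs for the ERROR lines in the snippet.
errors = [
    ("14:23:15", "payment-service"),  # Connection timeout to auth-db
    ("14:23:17", "database-pool"),    # Max connections reached (100/100)
    ("14:23:18", "payment-service"),  # Transaction failed
]

def depth(service):
    """Distance from a service to the leaves of the dependency graph.
    Leaf components (here, database-pool) get depth 0."""
    deps = DEPENDS_ON.get(service, [])
    return 0 if not deps else 1 + max(depth(d) for d in deps)

# Prefer errors in the deepest dependency, breaking ties by earliest timestamp.
root_cause = min(errors, key=lambda e: (depth(e[1]), e[0]))
print(root_cause)  # ('14:23:17', 'database-pool')
```

Real systems need far more than this, but the shape of the problem is the same: causality lives in the topology and the timestamps, not in any single log line.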
Finding the right signal at scale
Production logs run to millions of entries. Feeding everything to an LLM risks hallucination and quickly blows past the context window; sampling risks missing the one critical error that explains everything. The challenge becomes particularly acute when investigating distributed-system failures that span multiple services and time periods.
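One common way to tame the volume (a minimal sketch, not a description of our pipeline) is to collapse lines into templates and keep counts plus a few representative samples, so that rare patterns, which are often the interesting ones, surface first. The normalization rules below are illustrative assumptions:

```python
import re
from collections import Counter, defaultdict

def to_template(line):
    """Collapse variable parts (hex ids, quoted strings, numbers) into placeholders."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r'"[^"]*"', "<STR>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def compress(lines, samples_per_template=1):
    """Reduce a large log stream to (count, template, sample lines), rarest first."""
    counts, samples = Counter(), defaultdict(list)
    for line in lines:
        tpl = to_template(line)
        counts[tpl] += 1
        if len(samples[tpl]) < samples_per_template:
            samples[tpl].append(line)
    # A one-off ERROR is more interesting than the millionth routine INFO line.
    return sorted(((c, t, samples[t]) for t, c in counts.items()), key=lambda x: x[0])

logs = [
    "INFO Processing token validation for user 4182",
    "INFO Processing token validation for user 9443",
    "ERROR Max connections reached (100/100)",
]
for count, template, _ in compress(logs):
    print(f"{count}x  {template}")
# 1x  ERROR Max connections reached (<NUM>/<NUM>)
# 2x  INFO Processing token validation for user <NUM>
```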
Log patterns change across different organizations and teams
Every organization, and sometimes even different teams within the same organization, has its own logging patterns. Field names, error formats, and service naming conventions vary dramatically, and learning these specific patterns well enough to construct a coherent narrative is hard. Traditional approaches struggle because they can't adapt to organization-specific patterns and the relationships between services.
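A small, made-up illustration of the problem: the same canonical concept maps to different field names in different environments, so any query has to be rewritten per organization before it can run. The alias tables here are hypothetical:

```python
# Hypothetical per-organization alias tables; real ones have to be learned.
FIELD_ALIASES = {
    "org-a": {"service": "service.name", "severity": "level"},
    "org-b": {"service": "app", "severity": "log_level"},
    "org-c": {"service": "component", "severity": "status"},
}

def build_filter(org, service, severity):
    """Rewrite a canonical filter into the field names a given org actually uses."""
    aliases = FIELD_ALIASES[org]
    return {aliases["service"]: service, aliases["severity"]: severity}

print(build_filter("org-a", "payment-service", "error"))
# {'service.name': 'payment-service', 'level': 'error'}
print(build_filter("org-b", "payment-service", "error"))
# {'app': 'payment-service', 'log_level': 'error'}
```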
To ship software as quickly as we write it, debugging needs to be as conversational and intuitive as modern code generation.
To make this happen, AI agents should be able to query the right logs across the tools in your stack to look for errors, exceptions, or patterns. Today, engineers are forced to translate queries across different tools, pattern-match, and correlate information across sources to validate their hypotheses during debugging. Solving this requires breakthrough approaches to natural language understanding, distributed systems reasoning, and human-AI collaboration. These are urgent engineering problems: every minute saved working with production translates into faster incident response and higher engineering velocity.
At Resolve AI, we're building AI agents that provide an interface to your production systems. Our agents learn how your production systems operate, combining structured metrics, semi-structured traces, and unstructured logs into a coherent understanding. To help you debug, they reason through your specific system's patterns, not just general debugging heuristics. When you ask a question, they pursue multiple hypotheses and correlate evidence from the right data sources. Lastly, they understand your tool stack as an engineer would: they can interpret Grafana dashboards and follow your conventions to generate code suggestions, Git commits, or PRs.
With Resolve AI, customers like Datastax, Tubi, and Rappi have increased engineering velocity and system reliability by putting machines on-call for humans and letting engineers just code. If building systems that can reason about distributed infrastructure at scale excites you, we'd love to talk. We're looking for engineers who want to tackle these challenging problems and define what the next generation of debugging looks like.
¹ DIN-SQL + GPT-4 achieves 60.0% exact set match accuracy on the Spider benchmark (Yale Spider Leaderboard, 2024). While execution accuracy can reach 85%+, exact match (getting the SQL query precisely right) remains challenging even for state-of-the-art models on structured database queries.
² Agentic Text2SQL systems can require "up to 10 iterations to refine a SQL query" with multi-stage workflows involving schema linking, candidate generation, self-correction, and evaluation (Hexgen-Text2SQL research, 2024). This iterative approach increases computational complexity while still struggling with accuracy on complex queries.
Priyatham Bollimpalli
Research Engineer
@ Resolve AI
With 9+ years of ML engineering experience, Priyatham brings incredible expertise from his time as a Senior ML Applied Scientist at Microsoft, where he led domain-specific model quality improvements for Microsoft Copilot Studio, serving over 2.1 million monthly active users. His work spans the full spectrum of AI innovation -- from fine-tuning code generation models and developing systems to optimizing multilingual support across 20+ languages.