Co-work with agents to resolve incidents.
Agent Teams investigate incidents in parallel. Engineers steer and remediate through Workbench.
Specialized agents for harder investigations.
A team of domain-specialized agents investigates in parallel and verifies findings against production evidence, the way a war room of senior engineers would.
pgdb-orders-instance-3: migrations 0335 and 0337 were silently skipped weeks ago.
event_outcome column) was merged 2026-04-24, but Drizzle's migrator skipped it due to timestamp ordering (PR #27504).
column "event_outcome" does not exist and relation "order_doc_state" does not exist errors across 5 services.
Work alongside your agents in Workbench.
Interrogate any finding, piece of evidence, or theory by interacting directly with the report.
What happened
Alert fired at Wed May 6 1:19am for PostgreSQL High Rollback Rate on pgdb-orders-instance-3 (database: orders, cluster: orders-db-cluster). Rollback ratio was ~2.2% at alert time, with peaks up to 4.3%. Alert cleared at Wed May 6 1:29am (~10 minutes after firing).
Chronic schema drift on orders-db-cluster — missing event_outcome column and order_doc_state table. The order_doc_state escalation was the acute trigger that pushed rollback ratio above threshold.
Errors stopped at ~Wed May 6 7:33am, indicating remediation (migrations) was applied.
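As a rough illustration of the metric behind this alert, the sketch below computes a rollback ratio from two snapshots of PostgreSQL's cumulative `xact_commit` and `xact_rollback` counters (columns of `pg_stat_database`). The formula, the function names, and the sample counter values are assumptions for illustration; the report does not state how the alert itself is computed.

```python
def rollback_ratio(commits: int, rollbacks: int) -> float:
    """Share of finished transactions that rolled back (0.0 if none finished)."""
    total = commits + rollbacks
    return rollbacks / total if total else 0.0


def ratio_over_window(before: dict, after: dict) -> float:
    """Rollback ratio for the interval between two cumulative-counter snapshots."""
    return rollback_ratio(
        after["xact_commit"] - before["xact_commit"],
        after["xact_rollback"] - before["xact_rollback"],
    )


# Made-up snapshots chosen to land near the ~2.2% reported at alert time.
before = {"xact_commit": 50_000, "xact_rollback": 400}
after = {"xact_commit": 50_978, "xact_rollback": 422}
ratio = ratio_over_window(before, after)
```

Working from deltas rather than lifetime totals matters here: the counters accumulate from the last statistics reset, so a short burst of rollbacks would otherwise be diluted by weeks of healthy traffic.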
Root cause
Missing database migrations on orders-db-cluster
Drizzle migrations 0335_event_outcome_column.sql and 0337_quick_argent.sql were never applied to the orders-db-cluster orders database. Application code deployed 8–13 days earlier references schema objects that don't exist, causing every transaction touching them to roll back.
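One way to surface this kind of drift before it pages anyone is to compare the migrations present in the repository with those recorded as applied on each database. The sketch below is a minimal version of that check. The function `unapplied_migrations`, the placeholder tags `0334_example` and `0336_example`, and the idea that applied tags are available as a set are all assumptions, not Drizzle's or Resolve AI's actual mechanism.

```python
def unapplied_migrations(repo_tags: list[str], applied_tags: set[str]) -> list[str]:
    """Return repo migrations missing from the database, in repo order."""
    return [tag for tag in repo_tags if tag not in applied_tags]


repo_tags = [
    "0334_example",               # hypothetical placeholder
    "0335_event_outcome_column",  # named in the report
    "0336_example",               # hypothetical placeholder
    "0337_quick_argent",          # named in the report
]
applied_tags = {"0334_example", "0336_example"}  # assumed database state

missing = unapplied_migrations(repo_tags, applied_tags)
```

Comparing the full sets, rather than only looking at the newest applied entry, is what lets the check catch gaps in the middle of the sequence.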
Causal chain:
Investigate with your team, agents, or on your own.
Pursue parallel hypotheses, redirect agents, and add context as new evidence emerges.
Main Investigation
Follow the agent's reasoning and steer it directly
Search for 5xx HTTP errors during the deadlock window
Verify PostgreSQL deadlock issue resolution status
What triggered the scaling event on October 23?
What happened
PostgreSQL deadlock alert fired on Thu October 23, 2025 4:55pm for the orders database on pgdb-orders-instance-2. The order-events-ingest service's OrderReconciler experienced 25 deadlocks over 14 minutes when attempting to batch update order events with resolution timestamps.
Impact
- 100% error rate on updateOrderEvent API for 2 minutes
- 456-second latency spike on getOrderEventCounts API
- 68% error rate on spanRequest API
- 12+ downstream services affected, including investigation system and analytics APIs
- Self-resolved when load normalized; no data loss
Root cause
Confirmed (HIGH confidence): Lack of deterministic lock ordering in batchUpdateOrderEvents() method. The SQL VALUES clause processes order IDs in arbitrary iteration order, allowing concurrent requests to acquire row locks in different sequences. A scaling event deployed ~27 new pods (vs baseline 1–2), dramatically increasing concurrency and triggering circular lock dependencies.
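A common remedy for this class of deadlock is to make every caller acquire row locks in the same order. The sketch below shows the idea under stated assumptions: the `order_events` table, its `id` column, the `%s` placeholder style, and the helper `lock_order` are hypothetical and are not taken from the actual `batchUpdateOrderEvents()` code.

```python
def lock_order(order_ids: list[str]) -> list[str]:
    """Deduplicate and sort IDs so every caller requests locks in one sequence."""
    return sorted(set(order_ids))


# Hypothetical statement: lock the target rows in sorted order first,
# then run the batch update inside the same transaction.
LOCK_SQL = "SELECT id FROM order_events WHERE id = ANY(%s) ORDER BY id FOR UPDATE"

# Two concurrent requests that touch the same rows in different input orders.
request_a = ["ord-7", "ord-2", "ord-9"]
request_b = ["ord-9", "ord-7", "ord-2", "ord-7"]

plan_a = lock_order(request_a)
plan_b = lock_order(request_b)
```

Because both requests end up with the same plan, neither can hold a lock the other needs while waiting on a lock the other already holds, which removes the circular wait regardless of how many pods run concurrently.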
Gets you to verified root cause.
Every finding is backed by production evidence for you to verify or explore further.
Missing schema migrations on orders-db causing transaction rollbacks
Two schema migrations — adding the event_outcome column and the order_doc_state table — were merged but never applied to orders-db. The migration runner is invoked manually rather than through the deploy pipeline, so the cluster was silently overlooked. When a morning deployment triggered a traffic surge, transactions hit the missing schema and the rollback ratio crossed the 2% alert threshold. Applying both migrations cleared the errors.
event_outcome column never applied to orders-db
order_doc_state table never applied to orders-db
orders-db after dependent code shipped
order-fulfillment, checkout-router, inventory-sync, catalog-service, payments-api
Remediate from the same surface.
Trigger commit reverts, GitHub Actions, and alert silencing without leaving the context or interface.
Revert: disable checkout-v2-routing in production
Revert recent enablement of enableCheckoutV2Routing, identified as the trigger for elevated p95 latency on checkout-router. Restores the previous routing path.
This change:
- Sets enableCheckoutV2Routing: false in helm/values/production/values.yaml
Used and loved by engineers
Removing the toil of investigations, war rooms, and on-call.
We pull fewer engineers into war rooms, on-call is materially better, and that translates directly to advertiser trust and revenue protection.
Shahrooz Ansari
Sr. Director of Engineering, DoorDash
I don't need more numbers or more data. What I need is a root cause.
Chris Umbel
AIOps Lead & SRE, Zscaler
Resolve AI proved it could deliver real results in a constrained environment. It identified dependencies, surfaced accurate root causes 73% faster than our teams, all while integrating cleanly into our existing stack.
Angelo Marletta
Staff Software Engineer, Coinbase
Resolve AI makes our junior on-call engineers as effective as our seniors, flattening the experience curve. We've seen a 2x productivity lift while eliminating the runbook gap.
A.D.
Sr. Director of Engineering, Financial Services Company
Shipping every week.
- May 2026
Agent Teams
Specialized agents investigating in parallel with verified findings.
- May 2026
Workbench
Shared workspace with real-time visibility and engineer steering.
- May 2026
Closed-loop actions
Commit reverts, GitHub Actions, and alert silencing from investigation findings.
- April 2026
Adaptive knowledge
Every investigation makes the platform smarter.