By Jeff Aronhalt, Principal Backend Engineer, Gametime
Gametime is a last-minute live event ticketing platform where reliability isn't just an engineering concern - a degraded purchase flow during a sold-out game is lost revenue with no recovery window. For most of my seven years here, our on-call model matched that urgency: one backend rotation, 12-14 engineers, one week per shift, with each person paged only for the services they knew well.
As the company scaled, we made a deliberate choice to evolve how engineering teams were structured. We shifted to cross-functional product squads - backend, web, mobile, and data engineers owning a full product domain together - to drive real ownership, reduce handoffs, and give teams accountability for user outcomes rather than just their slice of the stack. It was the right move for how we wanted to build. What it exposed was that our on-call model hadn't kept pace with how we were now organized.
Moving to squads without changing the on-call structure widened the disconnect between engineers and the alerts they received. You're on call once every 14 weeks, but you're not even involved in four of the five domains you're being paged for.
We have 60-plus services, multiple databases, Lambda processes, Kubernetes workloads, and a web of external supplier integrations - any of which could be the source of an incident. The tribal knowledge that backend engineers had carried for years was now fragmented across six squads, so when something broke, we'd end up with three engineers in a room comparing notes across three different dashboards before anyone had even formed a hypothesis.
What we needed wasn't better alerting or more dashboards. We needed a way for any engineer on the rotation - web, mobile, or backend - to cross domain boundaries during an investigation without escalating to a subject matter expert at 3am. That's what brought us to Resolve AI.
We evaluated Resolve AI because it was solving a problem nobody else was directly addressing: giving engineers the full operational context of a system they don't own, at the moment they need it, without requiring them to already know where to look. We'd looked at other options, but most tools assumed the person investigating already had the domain knowledge. Resolve AI was designed for the opposite case.
What sealed it was the partnership. Throughout the pilot, the Resolve AI team treated this like a shared problem to solve, not a product to sell - and that approach gave us confidence that the calibration work ahead of us would actually go somewhere.
Getting Resolve AI to perform well isn't plug-and-play, and that's worth being direct about. Like onboarding a new engineer to go on-call, Resolve AI needs context before it can reason well: which alerts are noise, which services have known degraded-but-acceptable behavior, which users and transactions actually matter to revenue. That context lives in guidance files - structured operational knowledge that you load into Resolve AI about your specific environment. The more you invest in them, the more accurate the reasoning becomes. The difference is that where a new human engineer might take months to reach the point where you'd trust their judgment on an unfamiliar service, Resolve AI gets there in days.
Our platform team connected Resolve AI to Kubernetes via Satellite, and wired in Argo for event sourcing, Grafana Loki for log ingestion, Slack, and Datadog. Before the full rollout was finished, we ran it retroactively against real production incidents - Datadog APM only, no logs, no Satellite - and even with that limited surface, it was identifying root causes accurately. Once we built out the guidance files and brought in the full integration layer, it became a materially different AI teammate: one that could orient any engineer on the rotation, regardless of which squad owned the service.
The moment that validated the whole approach happened with our supply team. We had an ongoing issue preventing orders for inventory through one of our ticketing partner integrations - direct revenue impact. A post-incident review had been kicked off, and the working hypothesis, already forming in the communication chain and being socialized, was that this was a partner-side problem. Reasonable enough: failures concentrated in the purchase flow, the integration was the obvious candidate, and our own services looked healthy in isolation.
When I ran the investigation through Resolve AI, the answer was different. The root cause was a code change made three days prior in an upstream internal service - not the integration, not the order API directly, but two to three degrees of separation removed from where the symptom was visible. Our partner API service was showing high-latency failures, gateway timeouts, and internal server errors, all of which correlated with the order failures, and that's exactly where human intuition anchors. Resolve AI traced causality upstream through the dependency graph and surfaced the code change that was actually responsible.
It was fascinating to see that - and it's something that's genuinely hard for a team of humans to do, especially without full context across the dependency graph. Even our domain experts didn't catch it. Resolve AI sees and understands the entire production context; that's the specific capability that matters in a distributed system with multi-faceted squad ownership. The post-incident review was about to close on the wrong conclusion, and Resolve AI found the actual root cause before that happened.
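To make the idea of tracing causality upstream concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how Resolve AI works internally: the service names, the `DEPENDS_ON` map, the `LAST_CHANGE` deploy log, and the `recent_upstream_changes` helper are all hypothetical. It assumes you already have a dependency map and a record of each service's most recent code change, and it walks breadth-first from the service showing symptoms to find anything upstream that changed within a chosen window.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "order-api": ["partner-api"],
    "partner-api": ["inventory-sync"],
    "inventory-sync": ["pricing-core"],
    "pricing-core": [],
}

# Hypothetical deploy log: most recent code change per service.
NOW = datetime(2026, 3, 10, 12, 0)  # arbitrary reference time for the example
LAST_CHANGE = {
    "order-api": NOW - timedelta(days=30),
    "partner-api": NOW - timedelta(days=21),
    "inventory-sync": NOW - timedelta(days=14),
    "pricing-core": NOW - timedelta(days=3),
}


def recent_upstream_changes(symptom, depends_on, last_change, now, window):
    """Walk breadth-first from the symptomatic service through its
    dependencies and return (service, hops, changed_at) for every service
    whose last change falls inside the window, nearest first."""
    seen = {symptom}
    queue = deque([(symptom, 0)])
    hits = []
    while queue:
        service, hops = queue.popleft()
        changed_at = last_change.get(service)
        if changed_at is not None and now - changed_at <= window:
            hits.append((service, hops, changed_at))
        for upstream in depends_on.get(service, []):
            if upstream not in seen:  # guard against cycles and repeat visits
                seen.add(upstream)
                queue.append((upstream, hops + 1))
    return hits


candidates = recent_upstream_changes(
    "partner-api", DEPENDS_ON, LAST_CHANGE, NOW, timedelta(days=7)
)
for service, hops, changed_at in candidates:
    print(f"{service}: {hops} hops upstream, changed {changed_at:%Y-%m-%d}")
```

In this toy data the service showing errors has not changed recently, while a service two hops upstream has. That is the shape of the case above: the symptom and the cause sit in different places, and the walk only finds the cause because it is not limited to the service where the errors appear.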
Before Resolve AI, an incident felt like being paged into a dark room - you knew something was wrong, but orienting yourself around which service, which layer, and whose problem took most of the investigation time. Now it's like walking in with the lights already on; the context is there, and a hypothesis is forming before I've even opened a second tab. Resolve AI has fundamentally changed how I think about on-call: it's no longer about who has the most context - it's about who can ask the right question, because the context is always there.
The practical version of this: I've woken up in the middle of the night, seen an alert, and asked Resolve AI whether it's actually critical before deciding whether to get out of bed. Before, that question cost 30-45 minutes of investigation regardless of the answer. Now I get the answer in seconds, and if it isn't critical, I go back to sleep. That's not a small thing at 3am.
The broader shift is harder to describe but more significant. We're now shipping roughly 200 PRs a week, seventy to eighty percent of which are AI-generated, and code volume has tripled since January 2026. In that environment, I've stopped trying to maintain a mental model of the full codebase the way I used to - and that's the right trade. The job is thinking at higher-order levels of abstraction: about capabilities, about composing them for novel features, about what we're actually trying to build for the user. When I need to understand system behavior in production, I now delegate much of it to AI, and Resolve AI is where I turn.
That shift has a cost that isn't obvious until you look for it. When code is generated faster than any engineer can absorb system complexity, the operational coordination problem gets worse, not better - more changes, more potential failure modes, more surface area, and less individual intuition to draw on. A significant incident a few weeks back created a measurable trough in our commit velocity: days of engineering time absorbed by root cause analysis and remediation. That trough is what Resolve AI compresses. Faster code generation is not the same thing as better operations, and closing the gap between those two is what actually matters at scale - that's exactly what Resolve AI does for us.
The bottleneck in shipping right now is not generating code - we solved that. The bottleneck is operational confidence: getting a change to production and knowing fast whether something degraded, work that is still mostly manual and partially cancels the velocity gains from AI coding agents.
The incident that changed how we think about root cause - Resolve AI catching a three-day-old code change that even our domain experts missed - is the baseline for what we already have. That context matters when I think about what Resolve AI's new teams of investigation agents mean for us going forward. What we already had was state of the art; their new approach raises that ceiling again. Instead of a single agent working through a problem sequentially, Resolve AI now runs a coordinated team: agents pursuing independent threads across logs, metrics, traces, and code simultaneously, with another agent that independently re-runs the same queries to reproduce the team's conclusions before standing behind them. What makes this tangible for me is also their new workbench - the surface where I can co-work with those agents in real time, steer hypotheses as new evidence emerges, and remediate from the same context without switching tools or losing the thread. The investigations we get are going to be sharper, deeper, and more defensible - and I'm genuinely excited to see what that means for the problems that have historically been the hardest to crack.
Background agents are what I'm most excited about, because the shift they represent goes beyond faster incident response. The way I think about it: I open Resolve AI and my priority feed is already there - pre-investigated alerts, deployment summaries, operational reports - each one with verified findings and recommended next steps waiting for me, rather than a queue of unknowns I have to orient myself around from scratch. The routine operational work I currently direct and execute myself becomes work I direct and observe, with agents handling the execution on schedule or on trigger under my oversight. I'm still setting the quality bar and making the judgment calls at the boundaries that matter, but I'm doing it as someone managing a capable team at scale rather than doing every task personally. That's a fundamentally different relationship with production than I've ever had - not AI replacing engineering judgment, but AI scaling what one engineer with good judgment can actually hold at once.
I don't see a world where Gametime two years from now doesn't have AI running production alongside us - and honestly, given where things are heading, I'd be surprised if it takes that long.
