Accelerating Zero-Trust Network Incident Response

About Zscaler
Zscaler operates the world’s largest zero-trust security cloud, providing secure access between users and applications from anywhere. The platform protects more than 50 million users across over 9,400 customers, including 30 percent of the Forbes Global 2000. Every day, Zscaler inspects and secures global internet traffic, preventing more than 9 billion security incidents such as malware downloads, phishing attempts, and data exfiltration.
To deliver this level of protection, Zscaler runs a globally distributed production environment. The platform processes more than 500 billion transactions per day on hundreds of thousands of systems in more than 160 data centers worldwide. The infrastructure is highly bespoke, performance-critical, and spans a hybrid mix of bare metal and cloud environments.
Because Zscaler sits directly on the critical path of user access and security, reliability is fundamental. Production issues, elevated latency, or outages can immediately disrupt access to applications, impact employee productivity, and erode customer trust.
The Problem: When secure access breaks, everything stops
Zscaler’s scale and architectural complexity generate a constant stream of operational signal. On average, the organization sees more than 150,000 alerts per month, with roughly 120 escalating into full incidents that require complex live coordination.
When incidents occur, impact is rarely isolated to a single service or team. Issues often ripple across networking, infrastructure, application, and security layers, pulling 20 to 30 engineers into live incident bridges. Engineers must manually stitch together context from logs, dashboards, metrics, change events, CI/CD systems, and bespoke internal tools, often under significant time pressure.
Even mid-severity incidents can take more than an hour to resolve. During that time, users may experience degraded access or elevated latency, support teams field customer reports, and senior engineers are pulled away from building and improving the platform. The longer resolution takes, the greater the downstream impact across customers and internal teams.
Zscaler’s SRE leadership recognized that more dashboards or raw data would not materially reduce this impact. What they needed were faster, higher-confidence answers about why issues were happening, how broadly they affected users, and where to focus engineering effort to restore reliable access quickly.
The Solution: Autonomous incident investigations for reliability at global scale
Zscaler deployed Resolve AI as its AI for production to autonomously investigate alerts and support engineers during live incidents.
Resolve AI connects directly to Zscaler’s existing production systems, including logs, metrics, dashboards, alerts, change events, collaboration tools, and internal knowledge sources. It builds a living model of how services, infrastructure, and dependencies interact, allowing it to reason about incidents the way an experienced team of production engineers would.
When alerts fire, Resolve AI begins investigating immediately. It determines which dashboards and metrics matter, crafts and refines log queries, correlates signals across systems, and evaluates multiple root-cause hypotheses in parallel. Rather than producing walls of telemetry, Resolve AI surfaces clear working theories with production evidence that enable teams to act quickly and with confidence.
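To make the idea of evaluating several root-cause hypotheses in parallel more concrete, here is a minimal Python sketch. It is an illustration only and does not describe how Resolve AI is implemented. The hypothesis names, metric names, thresholds, and the scoring rule (the fraction of checks that find supporting evidence) are all hypothetical assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical types for illustration; not part of any real product API.
Telemetry = Dict[str, float]
Check = Callable[[Telemetry], Optional[str]]  # returns evidence text, or None


@dataclass
class Finding:
    hypothesis: str
    score: float          # fraction of checks that supported the hypothesis
    evidence: List[str]   # human-readable evidence behind the score


def threshold(metric: str, limit: float) -> Check:
    """Build a check that reports evidence when a metric exceeds a limit."""
    def check(telemetry: Telemetry) -> Optional[str]:
        value = telemetry.get(metric, 0.0)
        return f"{metric}={value} exceeds {limit}" if value > limit else None
    return check


def evaluate(name: str, checks: List[Check], telemetry: Telemetry) -> Finding:
    """Run every check for one hypothesis and collect the evidence found."""
    evidence = [e for check in checks if (e := check(telemetry)) is not None]
    score = len(evidence) / len(checks) if checks else 0.0
    return Finding(name, score, evidence)


def investigate(hypotheses: Dict[str, List[Check]],
                telemetry: Telemetry) -> List[Finding]:
    """Evaluate all hypotheses concurrently and rank them by score."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(evaluate, name, checks, telemetry)
                   for name, checks in hypotheses.items()]
        findings = [future.result() for future in futures]
    # sorted() is stable, so ties keep the order in which hypotheses were given.
    return sorted(findings, key=lambda f: f.score, reverse=True)


# Assumed example data: metric names and limits are made up for this sketch.
HYPOTHESES: Dict[str, List[Check]] = {
    "dns_resolver_degraded": [threshold("dns_timeout_rate", 0.05),
                              threshold("dns_latency_ms", 200)],
    "recent_deploy": [threshold("deploys_last_hour", 0),
                      threshold("error_rate", 0.02)],
    "capacity_exhaustion": [threshold("cpu_utilization", 0.9),
                            threshold("queue_depth", 1000)],
}

SAMPLE_TELEMETRY: Telemetry = {
    "dns_timeout_rate": 0.12, "dns_latency_ms": 450,
    "deploys_last_hour": 0, "error_rate": 0.03,
    "cpu_utilization": 0.4, "queue_depth": 10,
}

if __name__ == "__main__":
    for finding in investigate(HYPOTHESES, SAMPLE_TELEMETRY):
        print(f"{finding.hypothesis}: {finding.score:.2f} {finding.evidence}")
```

In this sketch each finding carries its evidence alongside its score, which mirrors the goal of surfacing a working theory together with the data behind it rather than raw telemetry. A real investigation would query live systems and weigh evidence far more carefully than a fixed threshold can.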
During Zscaler’s evaluation, Resolve AI autonomously investigated a DNS resolution issue and identified the underlying cause more than two hours before a human incident bridge was created. By the time the incident escalated, engineers already had actionable context, reducing time spent assembling teams and narrowing the window of user impact.
The Impact: Faster recovery, fewer escalations, more reliable access
With Resolve AI in place, Zscaler has significantly improved how it responds to production issues that directly affect customer access and security.
Benefits Zscaler has seen with Resolve AI:
- 75 percent reduction in incident investigation time, shortening the duration of degraded access
- More than 30 percent fewer engineers involved per incident, reducing unnecessary escalations and interruptions
- Earlier diagnosis of issues, in some cases hours before incident bridges begin
- Thousands of engineering hours saved annually, allowing teams to focus on building and operating the platform
Resolve AI now autonomously investigates alerts as they occur and can be invoked on demand during other active incidents. This enables incident commanders to move faster with fewer people, restoring reliable access sooner and limiting downstream customer and support impact.
Looking ahead, Zscaler plans to expand the role of Resolve AI across additional production workflows, including capacity planning and supervised remediation. The long-term goal is to reduce mean time to resolution for many incidents to 15 minutes, ensuring that secure access remains reliable even as traffic, complexity, and customer expectations continue to grow.
