©Resolve.ai - All rights reserved
Drawing from Meta's experience with Instagram's infrastructure, this article explores how efficiency on-call engineering has become crucial for modern software operations. As tech companies face increasing pressure to optimize costs while maintaining growth, a new specialized on-call role has emerged: the efficiency engineer. This role focuses on ensuring systems operate within compute and storage budgets while supporting continuous product innovation. The shift toward efficiency-focused operations represents a significant evolution in how technology companies manage their infrastructure and resources. While the specific examples come from Meta's journey, the principles and practices are valuable for organizations of any size managing cloud resources and infrastructure costs.
The tech industry has undergone a fundamental shift from an era of unlimited infrastructure growth to one of careful resource management. This transformation was particularly evident around 2022 when major tech companies faced multiple business headwinds, including increased competition, privacy changes affecting advertising revenue, and rising interest rates. These factors created a situation where revenue growth could no longer keep pace with continuously rising infrastructure costs.
Meta realized its revenue growth could no longer sustain ever-increasing infrastructure costs. This led to a significant cultural shift toward efficiency-first engineering, with Mark Zuckerberg declaring 2023 "the year of efficiency." The challenge intensified as Meta, like many companies, began allocating more computing resources to GenAI training and inference clusters, leaving traditional products with stricter capacity constraints. This scenario is increasingly common across the industry as organizations balance existing product needs with investments in strategic technologies.
Efficiency on-call engineers are tasked with monitoring and optimizing system resource usage (tracked as virtual resources) across various services within each product. Virtual resources can be:
Web services and async jobs: CPU cycles (MIPS)
Databases: queries per second (QPS) and storage utilization (TBs)
Engineers must ensure that no service exceeds its allocated virtual resource quota, typically setting critical thresholds at 95% (SEV4) and 100% (SEV3) of capacity. A key concept in efficiency operations is the capacity regression: an increase in a service’s resource usage without a corresponding increase in its quota. These regressions must be identified and addressed quickly to prevent service degradation.
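The threshold logic above can be sketched in a few lines. This is a minimal illustration of the described policy, not Meta's actual tooling; function names and figures are hypothetical.

```python
from typing import Optional

def classify_usage(usage: float, quota: float) -> Optional[str]:
    """Return a severity level if usage crosses a critical threshold."""
    utilization = usage / quota
    if utilization >= 1.00:
        return "SEV3"   # at or over 100% of allocated capacity
    if utilization >= 0.95:
        return "SEV4"   # within 5% of the quota ceiling
    return None         # healthy headroom

def is_capacity_regression(prev_usage: float, curr_usage: float,
                           prev_quota: float, curr_quota: float) -> bool:
    """A regression: usage grew without a corresponding quota increase."""
    return curr_usage > prev_usage and curr_quota <= prev_quota
```

For example, a service at 96 MIPS against a 100 MIPS quota would page as a SEV4, while growth from 80 to 90 MIPS under an unchanged quota counts as a regression even though no threshold has fired yet.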
Managing infrastructure efficiency in a fast-paced tech environment is a multifaceted challenge that extends far beyond on-call duties. A significant portion of our work involves deep collaboration with teams to forecast their capacity needs, ensuring projects stay within organizational budgets while balancing top-line impact, regression risks, and leadership priorities. We maintain a rigorous process for evaluating ad-hoc projects, carefully weighing their resource demands against existing service quotas and capacity constraints.
Major efficiency initiatives often involve complex design work, extensive refactoring, and rigorous testing cycles. Our rapid experimentation culture, while driving innovation, can lead to frequent capacity regressions that demand thorough investigation and resolution. Unlike many organizations, we maintain strict efficiency SLAs, which we continuously review and refine to meet evolving business needs.
A substantial portion of our effort also goes into developing and maintaining internal tooling, ensuring accurate virtual resource measurements, and supporting capabilities like A/B testing platforms. This infrastructure needs to precisely track and convert resource usage into operational expenditure, requiring ongoing coordination with multiple teams. Additionally, we maintain comprehensive documentation, including detailed on-call runbooks, dataset descriptions, tool guides, and sample queries – essential resources that enable our team to operate efficiently and respond quickly to incidents.
Modern efficiency engineering relies on sophisticated monitoring and analysis tools. Resource monitoring dashboards track capacity usage and quotas across products and services at daily granularity, converting virtual resources into power consumption (MW) and operational expense ($/year) via rate cards.
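A rate-card conversion like the one described can be sketched as follows. The rate values below are invented for illustration; real rate cards are maintained internally and vary by hardware generation and region.

```python
# Hypothetical rate card: virtual-resource unit -> power and annual cost.
RATE_CARD = {
    "mips": {"mw_per_unit": 2e-9, "usd_per_unit_year": 0.04},   # web/async CPU
    "qps":  {"mw_per_unit": 5e-9, "usd_per_unit_year": 0.10},   # database queries
    "tb":   {"mw_per_unit": 1e-6, "usd_per_unit_year": 20.0},   # storage
}

def to_opex(resource: str, units: float) -> dict:
    """Convert a virtual-resource quantity into MW and $/year."""
    rate = RATE_CARD[resource]
    return {
        "mw": units * rate["mw_per_unit"],
        "usd_per_year": units * rate["usd_per_unit_year"],
    }
```

Expressing regressions in dollars per year rather than raw MIPS is what makes dashboards like these persuasive in prioritization discussions.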
Sampling-based profiling tools provide deep insight into each service’s performance, logging performance data (CPU cycles, CPU time, I/O time, wall time, memory usage, stack traces, read and write QPS, query latency, etc.) at function-level granularity. The ability to understand, query, visualize, and communicate this data effectively is the single most important skill for identifying performance bottlenecks, designing optimization projects, and root-causing capacity regressions.
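At its core, this kind of profiling aggregates sampled stack traces into per-function time, the raw material for icicle and flame charts. A minimal sketch, assuming a hypothetical sample format of (stack, seconds) pairs:

```python
from collections import Counter

def aggregate_samples(samples):
    """Attribute each sample's CPU time to the leaf function on its stack."""
    cpu_time = Counter()
    for stack, seconds in samples:
        cpu_time[stack[-1]] += seconds   # leaf frame is where time was spent
    return cpu_time

# Illustrative samples from a hypothetical request handler.
samples = [
    (["handler", "serialize"], 0.5),
    (["handler", "db_query"], 1.2),
    (["handler", "serialize"], 0.3),
]
```

Summing by leaf frame gives "self time" per function; summing over every frame in the stack instead would give the inclusive time shown in icicle charts.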
Automated regression detection systems like FBDetect can help identify potential issues, though they require careful interpretation to distinguish significant regressions from normal variations.
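The "significant vs. normal variation" distinction can be illustrated with a simple statistical gate. FBDetect itself is far more sophisticated; this sketch only conveys the idea of comparing a new reading against the baseline's day-to-day noise.

```python
import statistics

def is_significant_regression(baseline, today, sigmas=3.0):
    """Flag today's usage only if it exceeds the baseline mean by more
    than `sigmas` standard deviations of normal daily variation."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return today > mean + sigmas * stdev
```

With a baseline hovering around 100 units and ~2 units of daily jitter, a reading of 104 is noise while 120 is a genuine regression worth investigating.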
Effective infrastructure management requires exceptional communication and analytical skills across multiple domains. When performance bottlenecks emerge from new experiments, it's crucial to present clear, data-driven explanations that help engineering teams understand why a feature launch must be delayed or an experiment rolled back. These high-stakes decisions demand both compelling quantitative evidence and the ability to build consensus among stakeholders.
Success in this role requires deep technical understanding of various codebases. Discovering optimization opportunities, designing efficiency improvements, and diagnosing capacity regressions all depend on meaningful dialogue with code owners. Through targeted questions, we uncover the reasoning behind implementation choices and assess the safety of proposed changes.
Incident response brings its own communication challenges. When investigating SLA violations or SEVs, we coordinate across teams - collaborating with data scientists to identify relevant metrics, working with engineers to pinpoint when and why bottlenecks emerged, and developing strategies to prevent future issues. During critical incidents, we must diplomatically but firmly manage project delays and feature freezes.
Capacity planning demands both technical depth and strategic thinking. We work closely with teams to forecast infrastructure requirements, considering edge cases and usage spikes. This involves asking probing questions about technical requirements while balancing competing priorities across the organization. These discussions inform our recommendations for project feasibility and resource allocation, ensuring we optimize for both innovation and stability.
Instagram’s approach to efficiency incidents offers valuable lessons for the industry. The process begins with distinguishing genuine problems from temporary anomalies by analyzing peak usage patterns over one to two days. This is particularly important for consumer applications like Instagram that experience regular traffic fluctuations.
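The "wait one to two days" heuristic can be expressed as a sustained-breach check: treat a quota breach as genuine only if daily peak usage stays elevated for consecutive days, filtering out one-off traffic spikes. The parameters here are illustrative.

```python
def sustained_breach(daily_peaks, quota, threshold=0.95, days=2):
    """True if the last `days` daily peaks all exceeded the alert threshold."""
    recent = daily_peaks[-days:]
    return len(recent) == days and all(p >= threshold * quota for p in recent)
```

A single spiky day (say, peaks of 80, 97, then back to 90 against a quota of 100) would not fire, while two consecutive elevated peaks would.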
The investigation process involves examining multiple layers of the system:
Engineers identify which specific service has exceeded its allocation, such as Django applications, async jobs, or databases. Within each service, usage can be broken down by endpoint, job, or database object to pinpoint the primary contributors to the regression.
Once the problematic components are identified, engineers utilize performance profiling tools to generate icicle charts showing CPU time distribution across functions. Delta charts can reveal percentage differences in function performance between timeframes.
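The delta charts described above boil down to comparing per-function time between two timeframes and sorting by growth. A sketch, assuming hypothetical {function: seconds} aggregates from two profile snapshots:

```python
def profile_delta(before, after):
    """Return (function, absolute_delta, pct_change) sorted by largest growth."""
    deltas = []
    for fn in set(before) | set(after):
        old, new = before.get(fn, 0.0), after.get(fn, 0.0)
        pct = ((new - old) / old * 100) if old else float("inf")
        deltas.append((fn, new - old, pct))
    return sorted(deltas, key=lambda d: d[1], reverse=True)
```

Sorting by absolute delta rather than percentage keeps attention on the functions that actually moved the capacity needle, since a 300% jump in a tiny helper matters less than a 10% jump in a hot path.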
Root causes typically fall into several categories: code changes (identified through version control history), new experiment launches, or A/B test deployments. Engineers must examine each possibility systematically, often using specialized tools that track experiment deployments and feature launches.
A compelling example from Instagram illustrates the critical nature of efficiency engineering. The team faced a severe incident when the notifications team exceeded its quota, with async job usage climbing above 110% of capacity and the team's overall usage surpassing 100% of its quota. This crisis required immediate intervention: any further regression would trigger throttling, potentially causing millions of async job failures and impacting user engagement and revenue. The challenge was complex: there was no single cause behind the regression. Instead, it stemmed from dozens of A/B tests launched over weeks, each contributing small capacity costs that accumulated into a significant problem. The resolution required a three-pronged approach:
Immediate mitigation through pausing non-critical A/B tests and feature launches
Many technical optimizations including function memoization, query batching, and logging deprecation
Implementation of infrastructure DEFCON practices for graceful degradation
This incident led to lasting improvements in Instagram's efficiency practices, including better bottleneck detection and scaling protocols.
The introduction of efficiency on-call has catalyzed significant cultural changes in engineering organizations. Teams now must balance aggressive testing and feature development with strict resource constraints. This has led to the development of more rigorous testing processes that include capacity impact assessments before deployment.
Engineers have become more conscious of code performance implications, leading to improved engineering practices. The focus on efficiency often results in simpler, more maintainable code bases with reduced complexity. This simplified code architecture typically leads to more reliable systems and reduced operational overhead.
Organizations implementing efficiency on-call should maintain comprehensive runbooks that document common problems and their solutions. These runbooks should include detailed procedures for investigating capacity regressions, along with sample queries and debugging approaches for different services.
Teams should implement clear service level objectives (SLOs) for resource usage and establish automated alerting systems. It's crucial to develop robust degradation mechanisms that can gracefully reduce service load when necessary, such as reducing async job traffic for less critical operations.
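A degradation mechanism like the one described can be sketched as a priority-based shedding policy: as utilization rises, progressively drop less-critical async job traffic. The tiers and cutoffs below are illustrative, not a real DEFCON policy.

```python
# Shedding tiers: at each utilization cutoff, only jobs at or below the
# given priority number may run (lower number = more critical).
SHED_LEVELS = [
    (1.00, 1),  # >=100% utilization: run only priority-1 (critical) jobs
    (0.95, 2),  # >=95%: drop priority-3 (best-effort) jobs
]

def should_run(job_priority: int, utilization: float) -> bool:
    """Decide whether an async job runs given current quota utilization."""
    for cutoff, max_priority in SHED_LEVELS:
        if utilization >= cutoff:
            return job_priority <= max_priority
    return True  # healthy headroom: run everything
```

The key design choice is deciding priorities ahead of time, during calm planning, so that under incident pressure the shedding decision is mechanical rather than negotiated.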
Documentation plays a vital role in efficiency operations. Beyond runbooks, teams should maintain detailed records of past incidents, solutions, and optimization strategies. This knowledge base accelerates future incident resolution and helps prevent recurring issues.
Unlike traditional automation, which operates on fixed rules or predefined workflows, agentic AI systems can adapt to new situations, reason through complex problems, and learn from their environments. Here’s how they can automate the tedious parts of efficiency on-call:
Enhanced Anomaly Detection
Agentic AI can connect the dots across diverse data sources—such as infrastructure metrics, logs, and usage patterns—to identify meaningful deviations. This approach reduces noise and surfaces critical information, helping engineers focus on what truly matters.
Smarter Root Cause Analysis
AI can streamline this process by analyzing dependencies, telemetry, and recent changes to propose a ranked list of likely causes. By reasoning about system behavior, these tools can address issues without requiring exhaustive manual effort.
Adaptive Interfaces for Better Decision-Making
Adaptive interfaces powered by AI can adjust to the context of the task, showing only the most relevant data and insights. This makes it easier for engineers to focus on solving problems without being overwhelmed by unnecessary information.
While Meta's experience with Instagram provides valuable insights into efficiency engineering at scale, the principles apply broadly. Whether managing cloud instances or global infrastructure, efficiency on-call engineering drives both technical excellence and business sustainability. The role combines technical expertise with business acumen, requiring engineers to balance system performance with operational costs.
The impact of efficiency engineering extends beyond immediate cost savings. By promoting better engineering practices and simpler system architectures, it contributes to overall system reliability and maintainability. As companies continue to scale while optimizing costs, the importance of efficiency engineering will only grow, making it an increasingly critical specialization in the field of software operations.