©Resolve.ai - All rights reserved
Drawing from Meta's experience with Instagram's infrastructure, this article explores how efficiency on-call engineering has become crucial for modern software operations. As tech companies face increasing pressure to optimize costs while maintaining growth, a new specialized on-call role has emerged: the efficiency engineer. This role focuses on ensuring systems operate within compute and storage budgets while supporting continuous product innovation. The shift toward efficiency-focused operations represents a significant evolution in how technology companies manage their infrastructure and resources. While the specific examples come from Meta's journey, the principles and practices are valuable for organizations of any size managing cloud resources and infrastructure costs.
The tech industry has undergone a fundamental shift from an era of unlimited infrastructure growth to one of careful resource management. This transformation was particularly evident around 2022 when major tech companies faced multiple business headwinds, including increased competition, privacy changes affecting advertising revenue, and rising interest rates. These factors created a situation where revenue growth could no longer keep pace with continuously rising infrastructure costs.
Meta realized its revenue growth could no longer sustain ever-increasing infrastructure costs. This led to a significant cultural shift toward efficiency-first engineering, with Mark Zuckerberg declaring 2023 "the year of efficiency." The challenge intensified as Meta, like many companies, began allocating more computing resources to GenAI training and inference clusters, leaving traditional products with stricter capacity constraints. This scenario is increasingly common across the industry as organizations balance existing product needs with investments in strategic technologies.
Efficiency on-call engineers are tasked with monitoring and optimizing system resource usage (tracked as virtual resources) across various services within each product. Virtual resources can be:
Web services and async jobs: CPU cycles (MIPS)
Databases: queries per second (QPS) and storage utilization (TBs)
Engineers must ensure that no service exceeds its allocated virtual resource quota, typically setting critical thresholds at 95% (SEV4) and 100% (SEV3) of capacity. A key concept in efficiency operations is the capacity regression: an increase in a service’s resource usage without a corresponding increase in its quota. These regressions must be identified and addressed quickly to prevent service degradation.
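The threshold logic above can be sketched in a few lines. This is a minimal illustration of the described policy, not Meta's actual tooling; function names and figures are hypothetical.

```python
from typing import Optional

def classify_usage(usage: float, quota: float) -> Optional[str]:
    """Return a severity level if usage crosses a critical threshold."""
    utilization = usage / quota
    if utilization >= 1.00:
        return "SEV3"   # at or over 100% of allocated capacity
    if utilization >= 0.95:
        return "SEV4"   # within 5% of the quota ceiling
    return None         # healthy headroom

def is_capacity_regression(prev_usage: float, curr_usage: float,
                           prev_quota: float, curr_quota: float) -> bool:
    """A regression: usage grew without a corresponding quota increase."""
    return curr_usage > prev_usage and curr_quota <= prev_quota
```

For example, a service at 96 MIPS against a 100 MIPS quota would page as a SEV4, while growth from 80 to 90 MIPS under an unchanged quota counts as a regression even though no threshold has fired yet.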
Managing infrastructure efficiency in a fast-paced tech environment is a multifaceted challenge that extends far beyond on-call duties. A significant portion of our work involves deep collaboration with teams to forecast their capacity needs, ensuring projects stay within organizational budgets while balancing top-line impact, regression risks, and leadership priorities. We maintain a rigorous process for evaluating ad-hoc projects, carefully weighing their resource demands against existing service quotas and capacity constraints.
Major efficiency initiatives often involve complex design work, extensive refactoring, and rigorous testing cycles. Our rapid experimentation culture, while driving innovation, can lead to frequent capacity regressions that demand thorough investigation and resolution. Unlike many organizations, we maintain strict efficiency SLAs, which we continuously review and refine to meet evolving business needs.
A substantial portion of our effort also goes into developing and maintaining internal tooling, ensuring accurate virtual resource measurements, and supporting capabilities like A/B testing platforms. This infrastructure needs to precisely track and convert resource usage into operational expenditure, requiring ongoing coordination with multiple teams. Additionally, we maintain comprehensive documentation, including detailed on-call runbooks, dataset descriptions, tool guides, and sample queries – essential resources that enable our team to operate efficiently and respond quickly to incidents.
Modern efficiency engineering relies on sophisticated monitoring and analysis tools. Resource monitoring dashboards track capacity usage and quotas across products and services at daily granularity, converting virtual resources into power consumption (MW) and operational expense ($/year) via rate cards.
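A rate-card conversion like the one described can be sketched as follows. The rate values below are invented for illustration; real rate cards are maintained internally and vary by hardware generation and region.

```python
# Hypothetical rate card: virtual-resource unit -> power and annual cost.
RATE_CARD = {
    "mips": {"mw_per_unit": 2e-9, "usd_per_unit_year": 0.04},   # web/async CPU
    "qps":  {"mw_per_unit": 5e-9, "usd_per_unit_year": 0.10},   # database queries
    "tb":   {"mw_per_unit": 1e-6, "usd_per_unit_year": 20.0},   # storage
}

def to_opex(resource: str, units: float) -> dict:
    """Convert a virtual-resource quantity into MW and $/year."""
    rate = RATE_CARD[resource]
    return {
        "mw": units * rate["mw_per_unit"],
        "usd_per_year": units * rate["usd_per_unit_year"],
    }
```

Expressing regressions in dollars per year rather than raw MIPS is what makes dashboards like these persuasive in prioritization discussions.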
Sampling-based profiling tools provide deep insight into each service’s performance, logging performance data (CPU cycles, CPU time, I/O time, wall time, memory usage, stack traces, read and write QPS, query latency, etc.) at function-level granularity. The ability to understand, query, visualize, and communicate this data effectively is the single most important skill for identifying performance bottlenecks, designing optimization projects, and root-causing capacity regressions.
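At its core, this kind of profiling aggregates sampled stack traces into per-function time, the raw material for icicle and flame charts. A minimal sketch, assuming a hypothetical sample format of (stack, seconds) pairs:

```python
from collections import Counter

def aggregate_samples(samples):
    """Attribute each sample's CPU time to the leaf function on its stack."""
    cpu_time = Counter()
    for stack, seconds in samples:
        cpu_time[stack[-1]] += seconds   # leaf frame is where time was spent
    return cpu_time

# Illustrative samples from a hypothetical request handler.
samples = [
    (["handler", "serialize"], 0.5),
    (["handler", "db_query"], 1.2),
    (["handler", "serialize"], 0.3),
]
```

Summing by leaf frame gives "self time" per function; summing over every frame in the stack instead would give the inclusive time shown in icicle charts.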
Automated regression detection systems like FBDetect can help identify potential issues, though they require careful interpretation to distinguish significant regressions from normal variations.
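The "significant vs. normal variation" distinction can be illustrated with a simple statistical gate. FBDetect itself is far more sophisticated; this sketch only conveys the idea of comparing a new reading against the baseline's day-to-day noise.

```python
import statistics

def is_significant_regression(baseline, today, sigmas=3.0):
    """Flag today's usage only if it exceeds the baseline mean by more
    than `sigmas` standard deviations of normal daily variation."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return today > mean + sigmas * stdev
```

With a baseline hovering around 100 units and ~2 units of daily jitter, a reading of 104 is noise while 120 is a genuine regression worth investigating.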
Effective infrastructure management requires exceptional communication and analytical skills across multiple domains. When performance bottlenecks emerge from new experiments, it's crucial to present clear, data-driven explanations that help engineering teams understand why a feature launch must be delayed or an experiment rolled back. These high-stakes decisions demand both compelling quantitative evidence and the ability to build consensus among stakeholders.
Success in this role requires deep technical understanding of various codebases. Discovering optimization opportunities, designing efficiency improvements, and diagnosing capacity regressions all depend on meaningful dialogue with code owners. Through targeted questions, we uncover the reasoning behind implementation choices and assess the safety of proposed changes.
Incident response brings its own communication challenges. When investigating SLA violations or SEVs, we coordinate across teams - collaborating with data scientists to identify relevant metrics, working with engineers to pinpoint when and why bottlenecks emerged, and developing strategies to prevent future issues. During critical incidents, we must diplomatically but firmly manage project delays and feature freezes.
Capacity planning demands both technical depth and strategic thinking. We work closely with teams to forecast infrastructure requirements, considering edge cases and usage spikes. This involves asking probing questions about technical requirements while balancing competing priorities across the organization. These discussions inform our recommendations for project feasibility and resource allocation, ensuring we optimize for both innovation and stability.
Instagram’s approach to efficiency incidents offers valuable lessons for the industry. The process begins with distinguishing genuine problems from temporary anomalies by analyzing peak usage patterns over one to two days. This is particularly important for consumer applications like Instagram that experience regular traffic fluctuations.
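The "wait one to two days" heuristic can be expressed as a sustained-breach check: treat a quota breach as genuine only if daily peak usage stays elevated for consecutive days, filtering out one-off traffic spikes. The parameters here are illustrative.

```python
def sustained_breach(daily_peaks, quota, threshold=0.95, days=2):
    """True if the last `days` daily peaks all exceeded the alert threshold."""
    recent = daily_peaks[-days:]
    return len(recent) == days and all(p >= threshold * quota for p in recent)
```

A single spiky day (say, peaks of 80, 97, then back to 90 against a quota of 100) would not fire, while two consecutive elevated peaks would.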
The investigation process involves examining multiple layers of the system:
Engineers identify which specific service has exceeded its allocation, such as Django applications, async jobs, or databases. Within each service, usage can be broken down by endpoint, job, or database object to pinpoint the primary contributors to the regression.
Once the problematic components are identified, engineers utilize performance profiling tools to generate icicle charts showing CPU time distribution across functions. Delta charts can reveal percentage differences in function performance between timeframes.
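The delta charts described above boil down to comparing per-function time between two timeframes and sorting by growth. A sketch, assuming hypothetical {function: seconds} aggregates from two profile snapshots:

```python
def profile_delta(before, after):
    """Return (function, absolute_delta, pct_change) sorted by largest growth."""
    deltas = []
    for fn in set(before) | set(after):
        old, new = before.get(fn, 0.0), after.get(fn, 0.0)
        pct = ((new - old) / old * 100) if old else float("inf")
        deltas.append((fn, new - old, pct))
    return sorted(deltas, key=lambda d: d[1], reverse=True)
```

Sorting by absolute delta rather than percentage keeps attention on the functions that actually moved the capacity needle, since a 300% jump in a tiny helper matters less than a 10% jump in a hot path.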
Root causes typically fall into several categories: code changes (identified through version control history), new experiment launches, or A/B test deployments. Engineers must examine each possibility systematically, often using specialized tools that track experiment deployments and feature launches.
A compelling example from Instagram illustrates the critical nature of efficiency engineering. The team faced a severe incident when the notifications team exceeded its quota, with async job usage climbing above 110% of capacity and the team's overall usage surpassing 100% of its quota. This crisis required immediate intervention: any further regression would trigger throttling, potentially causing millions of async job failures and impacting user engagement and revenue. The challenge was complex: there was no single cause behind the regression. Instead, it stemmed from dozens of A/B tests launched over weeks, each contributing small capacity costs that accumulated into a significant problem. The resolution required a three-pronged approach:
Immediate mitigation through pausing non-critical A/B tests and feature launches
Many technical optimizations including function memoization, query batching, and logging deprecation
Implementation of infrastructure DEFCON practices for graceful degradation
This incident led to lasting improvements in Instagram's efficiency practices, including better bottleneck detection and scaling protocols.
The introduction of efficiency on-call has catalyzed significant cultural changes in engineering organizations. Teams now must balance aggressive testing and feature development with strict resource constraints. This has led to the development of more rigorous testing processes that include capacity impact assessments before deployment.
Engineers have become more conscious of code performance implications, leading to improved engineering practices. The focus on efficiency often results in simpler, more maintainable code bases with reduced complexity. This simplified code architecture typically leads to more reliable systems and reduced operational overhead.
Organizations implementing efficiency on-call should maintain comprehensive runbooks that document common problems and their solutions. These runbooks should include detailed procedures for investigating capacity regressions, along with sample queries and debugging approaches for different services.
Teams should implement clear service level objectives (SLOs) for resource usage and establish automated alerting systems. It's crucial to develop robust degradation mechanisms that can gracefully reduce service load when necessary, such as reducing async job traffic for less critical operations.
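A degradation mechanism like the one described can be sketched as a priority-based shedding policy: as utilization rises, progressively drop less-critical async job traffic. The tiers and cutoffs below are illustrative, not a real DEFCON policy.

```python
# Shedding tiers: at each utilization cutoff, only jobs at or below the
# given priority number may run (lower number = more critical).
SHED_LEVELS = [
    (1.00, 1),  # >=100% utilization: run only priority-1 (critical) jobs
    (0.95, 2),  # >=95%: drop priority-3 (best-effort) jobs
]

def should_run(job_priority: int, utilization: float) -> bool:
    """Decide whether an async job runs given current quota utilization."""
    for cutoff, max_priority in SHED_LEVELS:
        if utilization >= cutoff:
            return job_priority <= max_priority
    return True  # healthy headroom: run everything
```

The key design choice is deciding priorities ahead of time, during calm planning, so that under incident pressure the shedding decision is mechanical rather than negotiated.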
Documentation plays a vital role in efficiency operations. Beyond runbooks, teams should maintain detailed records of past incidents, solutions, and optimization strategies. This knowledge base accelerates future incident resolution and helps prevent recurring issues.
Unlike traditional automation, which operates on fixed rules or predefined workflows, agentic AI systems can adapt to new situations, reason through complex problems, and learn from their environments. Here’s how they can automate the tedious parts of efficiency on-call:
Enhanced Anomaly Detection
Agentic AI can connect the dots across diverse data sources—such as infrastructure metrics, logs, and usage patterns—to identify meaningful deviations. This approach reduces noise and surfaces critical information, helping engineers focus on what truly matters.
Smarter Root Cause Analysis
AI can streamline this process by analyzing dependencies, telemetry, and recent changes to propose a ranked list of likely causes. By reasoning about system behavior, these tools can address issues without requiring exhaustive manual effort.
Adaptive Interfaces for Better Decision-Making
Adaptive interfaces powered by AI can adjust to the context of the task, showing only the most relevant data and insights. This makes it easier for engineers to focus on solving problems without being overwhelmed by unnecessary information.
While Meta's experience with Instagram provides valuable insights into efficiency engineering at scale, the principles apply broadly. Whether managing cloud instances or global infrastructure, efficiency on-call engineering drives both technical excellence and business sustainability. The role combines technical expertise with business acumen, requiring engineers to balance system performance with operational costs.
The impact of efficiency engineering extends beyond immediate cost savings. By promoting better engineering practices and simpler system architectures, it contributes to overall system reliability and maintainability. As companies continue to scale while optimizing costs, the importance of efficiency engineering will only grow, making it an increasingly critical specialization in the field of software operations.