How to debug OOMKilled errors in Kubernetes?
Pod stuck in an OOMKilled loop? Learn to distinguish between container-level and node-level OOM, analyze memory growth patterns, and fix Kubernetes Exit Code 137.
What is Kubernetes OOMKilled?
When a container terminates with Reason: OOMKilled, the Linux kernel's Out-of-Memory (OOM) killer has forcibly terminated a process because it exceeded its memory allocation. This is one of the most common failure modes in Kubernetes (k8s) environments.
The OOMKilled status indicates that a process inside your container attempted to allocate more memory than was available to it. When this happens, the kernel's OOM killer selects a process to terminate and frees its memory. In containerized environments, this almost always means the primary process in your container.
You'll typically see this in pod events or when describing a crashed container:
State: Terminated
Reason: OOMKilled
Exit Code: 137
Exit code 137 indicates the process was terminated by SIGKILL (128 + 9 = 137); combined with the OOMKilled reason, it tells you that signal came from the kernel's OOM killer.
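For example, one way to confirm the reason and exit code from the command line (the pod name and namespace are placeholders):

# Show the container state, reason, and exit code for a crashed pod
kubectl describe pod <pod-name> -n <namespace> | grep -E "State|Reason|Exit Code"
# Or pull the last termination reason directly (assumes a single-container pod;
# for multi-container pods, select the container by name rather than index 0)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'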
What is the impact of OOMKilled errors?
Misconfiguring memory is not just a technical error; it is a massive driver of cloud waste. According to the 2024 Sysdig Cloud-Native Security and Usage Report, only about 34% of allocated memory in Kubernetes clusters is actually used. At the other extreme, Datadog's State of Kubernetes 2024 report indicates that nearly 40% of organizations have at least one service that is regularly OOMKilled due to aggressive under-provisioning or memory leaks.
How to differentiate container-level OOM vs. node-level OOM?
There are two distinct scenarios that result in OOMKilled, and distinguishing between them affects your remediation approach.
- Container-level OOM occurs when a container exceeds its configured memory limit. Kubernetes sets a cgroup memory limit based on your resources.limits.memory specification. When the container's resident memory exceeds this limit, the kernel terminates it, regardless of how much memory is available on the node. This is the most common case.
- Node-level OOM occurs when the node itself runs out of memory. The kubelet has eviction thresholds (by default, 100Mi of available memory) that trigger pod eviction before the node becomes completely memory-starved. If memory pressure builds faster than the kubelet can respond, the kernel's OOM killer may terminate processes directly. In this case, you might see OOMKilled containers even when they're well within their configured limits.
To identify which scenario you're dealing with, check the node's memory pressure condition and system logs around the time of the OOMKilled event. If other pods on the same node were also killed or evicted simultaneously, you're likely dealing with node-level pressure.
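In practice, a quick triage might look like the sketch below. Node and namespace names are placeholders, kubectl top assumes the metrics-server add-on is installed, and the journalctl command assumes you can get a shell on the node:

# Check the node's MemoryPressure condition and current usage
kubectl describe node <node-name> | grep -A 8 "Conditions:"
kubectl top node <node-name>
# Look for evictions in the same time window
kubectl get events --all-namespaces --field-selector reason=Evicted
# On the node itself, check whether the kernel OOM killer fired
journalctl -k --since "1 hour ago" | grep -i "out of memory"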
Resource requests and limits
Kubernetes memory configuration has two components that interact in ways that often surprise engineers.
- Requests define the minimum memory the scheduler guarantees for your pod. The scheduler uses requests to decide which node can accommodate your pod. A pod won't be scheduled on a node unless the node has enough allocatable memory to satisfy all requests.
- Limits define the maximum memory your container can use. This is enforced by cgroup constraints. When your container tries to exceed this limit, the OOM killer terminates it.
The relationship between these matters. If you set limits significantly higher than requests, you're allowing your container to use "burst" memory that isn't guaranteed. Under node memory pressure, containers using more than their requested amount are first in line for termination, even if they're below their limits.
A common configuration mistake is setting limits without requests, or setting requests much lower than actual usage. This leads to containers that schedule successfully but get killed under pressure because they're consuming more than their guaranteed share.
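As a reference, a minimal pod spec with both values set might look like the sketch below. The names and numbers are illustrative, not recommendations; size them from your workload's observed usage:

apiVersion: v1
kind: Pod
metadata:
  name: example-app          # hypothetical workload
spec:
  containers:
  - name: app
    image: example/app:1.0   # placeholder image
    resources:
      requests:
        memory: "512Mi"      # guaranteed share; the scheduler uses this for placement
      limits:
        memory: "768Mi"      # hard ceiling; exceeding it triggers the OOM killer

Keeping the limit reasonably close to the request reduces the unguaranteed burst headroom that makes a container an early target under node pressure. If requests equal limits for CPU and memory across every container, the pod lands in the Guaranteed QoS class, which the kubelet evicts last.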
How to diagnose the root cause for OOMKilled errors?
When you see OOMKilled, the first question is whether your memory limit is simply too low for your workload's actual requirements, or whether something is wrong with the application itself.
- Check your baseline memory usage. Look at memory consumption over time for healthy instances of the same workload (the example queries after this list show one way to do this). If your limit is close to or below the normal operating memory, the container will be killed during normal operation. The fix here is straightforward: increase the limit to accommodate actual usage plus reasonable headroom.
- Look for memory growth patterns. If memory usage steadily increases over time until the container gets killed, you're likely dealing with a memory leak. The container might run fine for hours or days before accumulating enough leaked memory to hit the limit. Restarting "fixes" the problem temporarily because it resets memory state, but the leak persists.
- Identify sudden spikes. Some workloads have legitimate memory spikes during certain operations like processing large files, handling traffic bursts, or running batch jobs. If kills correlate with specific events, you might need higher limits or need to redesign how the application handles memory-intensive operations.
- Examine what's consuming memory. This is where things get difficult. The OOMKilled event tells you that memory was exceeded, but not what was using it. Standard metrics show container-level memory consumption, not the breakdown by function, object type, or allocation pattern.
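A sketch of how to establish that baseline and watch for growth, assuming the metrics-server add-on for kubectl top and a Prometheus stack scraping cAdvisor and kube-state-metrics; namespace, pod, and container names are placeholders:

# Current working-set memory per container
kubectl top pod <pod-name> -n <namespace> --containers
# PromQL: memory usage over time for the workload (plot over hours or days)
container_memory_working_set_bytes{namespace="<namespace>", container="<container>"}
# PromQL: usage as a fraction of the configured limit (values approaching 1.0 mean an OOMKill is imminent)
max by (namespace, pod, container) (container_memory_working_set_bytes{namespace="<namespace>", container="<container>"})
  / on (namespace, pod, container)
  kube_pod_container_resource_limits{namespace="<namespace>", container="<container>", resource="memory"}
# PromQL: a sustained positive slope over hours suggests a leak rather than a spike
deriv(container_memory_working_set_bytes{namespace="<namespace>", container="<container>"}[6h])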
Common causes of unexpected memory consumption
Beyond outright leaks, several patterns frequently lead to OOMKilled:
- Unbounded caches or buffers. In-memory caches without eviction policies, or buffers that grow with input size without limits, can consume arbitrary amounts of memory depending on workload.
- Connection pool accumulation. Database or HTTP connection pools that grow under load but don't shrink, especially in languages with garbage collection that may not release memory back to the OS promptly.
- Large request handling. Processing large payloads, such as file uploads, bulk API requests, or large query results, can temporarily require memory proportional to the payload size.
- JVM heap sizing. For JVM-based applications, the default maximum heap size may exceed your container's memory limit, guaranteeing an eventual OOMKilled. Container-aware JVM flags (-XX:+UseContainerSupport, enabled by default in recent versions) help, but explicit heap sizing is often still necessary; see the sketch after this list.
- Sidecar containers. Don't forget that a pod's memory footprint is the sum of all its containers. A sidecar logging agent or proxy using more memory than expected can push the pod over its budget even if your application container is well-behaved.
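As one illustration of the JVM point above, a common pattern on reasonably recent JDKs is to cap the heap as a percentage of the container's memory limit via JAVA_TOOL_OPTIONS. The image name and numbers below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: java-app-example         # hypothetical workload
spec:
  containers:
  - name: app
    image: example/java-app:1.0  # placeholder image
    env:
    - name: JAVA_TOOL_OPTIONS    # picked up automatically by the JVM at startup
      value: "-XX:MaxRAMPercentage=75.0"
    resources:
      requests:
        memory: "768Mi"
      limits:
        memory: "1Gi"            # heap capped at roughly 75% of this limit

The remaining 25% is a rough allowance for metaspace, thread stacks, and off-heap buffers; the right split depends on the application.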
Why these memory issues are hard to debug
OOMKilled events are particularly frustrating to investigate because the information you need is scattered across multiple systems, and the most useful data often isn't captured at all.
The symptom and cause are disconnected: Kubernetes tells you a container was killed for exceeding memory. It doesn't tell you which allocations caused the growth, what triggered them, or whether the issue is in application code, configuration, or the environment. You're left correlating timestamps across metrics dashboards, deployment logs, and application logs. You will often find that the actual cause predates the kill event by hours or days.
Standard observability shows the "what" but not the "why": Container metrics tell you memory usage increased over time. Logs might show the application was processing requests normally. Neither reveals which code paths are allocating memory or which objects are accumulating. The gap between "memory grew" and "this function is leaking because of this pattern" requires instrumentation that most production environments don't have in place.
Memory leaks are intermittent and state-dependent: A leak might only manifest under specific traffic patterns, data shapes, or sequences of operations. Reproducing the issue in development often fails because the conditions are different. By the time you notice the problem, the container has restarted and the evidence is gone.
Profiling data is voluminous and noisy: Continuous profiling tools like Grafana Pyroscope or Parca can capture allocation patterns over time, which is invaluable for memory debugging. But the output is dense. A healthy application's allocation profile looks similar to the early stages of a leak. Finding the problematic growth pattern means comparing profiles across time, across instances, and against baseline behavior. It's manual, time-consuming work that requires familiarity with both the profiling tool and the application's expected behavior.
The investigation spans multiple domains: Memory issues can stem from application code, resource configuration, node-level pressure, recent deployments, traffic changes, or interactions between all of these. No single tool captures this full picture. An engineer debugging OOMKilled typically has to check Kubernetes events, query Prometheus or their metrics platform, review recent deployments, examine application logs, and possibly dig into profiling data. This means synthesizing information from five or six different systems to form a hypothesis.
General-purpose AI tools hit the same walls. Pasting an OOMKilled error into ChatGPT or a coding assistant will get you generic advice: check your memory limits, look for leaks, consider profiling. This is reasonable guidance, but it's the same guidance you'd find in any Kubernetes troubleshooting doc. These tools don't have access to your metrics, your deployment history, your profiling data, or your system's topology. They can explain what OOMKilled means; they can't tell you why your container was killed or what to do about it. The hard part of debugging isn't understanding OOMKilled in the abstract. It is about connecting the event to your specific system's behavior, and that requires access to and reasoning across your actual production data.
How Resolve AI approaches memory issues
Memory-related failures exemplify the broader challenge in production debugging: the symptom appears in one place, but the cause lies elsewhere, whether in application code, resource configuration, node-level behavior, or interactions between recent changes and current state. Understanding what happened requires connecting information that lives in different systems, is owned by different teams, and often can't easily be queried together.
Resolve AI investigates memory issues by working across these domains simultaneously. When an OOMKilled event occurs, it correlates memory consumption patterns leading up to the failure, identifies what changed recently (deployments, configuration updates, traffic patterns), and examines whether the issue is container-specific or part of broader node pressure. The investigation pursues multiple hypotheses in parallel: was this a limit misconfiguration, a leak, a sudden spike from a specific operation, or node-level resource contention?
For organizations using continuous profiling, Resolve AI incorporates profiling data into its investigation, identifying allocation patterns that correlate with memory growth. Rather than manually comparing flame graphs across time, the system surfaces which code paths show anomalous growth alongside the deployment history and traffic patterns for that service.
The goal is to close the gap between "this container was OOMKilled" and "memory consumption increased by X% over the past Y hours, concentrated in these functions, following this deployment," without requiring an engineer to manually pull data from metrics, logs, profiling tools, deployment records, and Kubernetes events. Memory issues span code, infrastructure, and telemetry, and diagnosing them requires reasoning across all three.
Citations
- Sysdig: 2024 Cloud-Native Security and Usage Report - Data on 66% memory over-provisioning.
- Datadog: State of Kubernetes 2024 - Benchmarks on OOMKilled frequency and pod resource metrics.
- Cast.ai: 2025 Kubernetes Waste Report - Discussion on resource utilization vs. reliability.
- Kerno/Spacelift: Technical Analysis of Exit Code 137 - Calculation of SIGKILL + 128