How to Fix Kubernetes 'Service 503' (Service Unavailable) Error
A 503 from your Kubernetes cluster means the ingress controller knows it has nowhere to send your traffic. Unlike a 502, where the request reaches a pod and gets a bad response, a 503 means there are no healthy backends available at all. The ingress tried to route the request and found an empty pool.
This usually points to one of a few things: your pods aren't running, your service has no endpoints, or your health checks are failing across every pod in the set.
502 vs 503: Why the Distinction Matters
Both errors look similar from the outside, but they tell you very different things about where to look. A 502 means "I sent the request to a backend and got garbage back." A 503 means "I have no backend to send this to." With a 502, at least one pod is receiving traffic. With a 503, zero pods are in the healthy pool. That changes your debugging path completely.
Common Causes
No pods running. The simplest case. Your deployment scaled to zero, a failed rollout terminated all pods, or resource quotas prevented new pods from scheduling.
kubectl get pods -n <namespace>
kubectl get deployment <deployment-name> -n <namespace>
If you see zero pods or all pods in Pending state, check whether your cluster has enough resources to schedule them:
kubectl describe pod <pending-pod> -n <namespace>
Look for scheduling failures in the Events section. Common blockers are insufficient CPU or memory on available nodes, node affinity rules that can't be satisfied, or PersistentVolumeClaim bindings that are stuck.
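When pods are stuck Pending on resource pressure, the numbers to check are the container resource requests. A minimal sketch of the relevant fragment (the values are hypothetical, not from any specific deployment):

```yaml
# Hypothetical container spec fragment. The scheduler reserves the
# "requests" amounts per pod; if no node has that much free, the pod
# stays Pending with a FailedScheduling event.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
```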
All pods failing readiness checks. This is the most common cause of 503s that surprise people. Your pods are Running, so everything looks fine at a glance. But every single pod is failing its readiness probe, so Kubernetes has removed all of them from the service endpoint list. The service exists, the pods exist, but the service has zero healthy endpoints.
kubectl get endpoints <service-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
In the describe output, look for readiness probe failures in the Events section. The reason every pod is failing simultaneously is usually one of these: your application depends on an external service (database, cache, config server) that's down and the readiness probe checks connectivity to it; a bad config was deployed that prevents the app from starting properly; or a shared resource like a mounted volume is unavailable.
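For reference, a typical readiness probe looks like the sketch below (the path, port, and timings are assumptions, not taken from any particular service):

```yaml
# Hypothetical container spec fragment. After failureThreshold
# consecutive failures, Kubernetes removes the pod from the
# service's endpoint list until the probe passes again.
readinessProbe:
  httpGet:
    path: /ready   # if this handler also checks a database,
    port: 8080     # a database outage fails every pod at once
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```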
Service has no endpoints because selectors don't match. You've deployed new pods with different labels, or someone changed the service selector, and now the service can't find any pods. From the ingress controller's perspective, this looks identical to having no healthy pods.
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> -l <key>=<value>
If the second command returns nothing, your selector doesn't match any running pods. This happens more often than you'd think after Helm chart upgrades or label refactors.
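The service selector must match the pod template labels exactly. A minimal sketch of the pairing that has to line up (names and labels are hypothetical):

```yaml
# Hypothetical Service: routes to pods labeled app=my-app.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app        # must match the pod template labels below
  ports:
    - port: 80
      targetPort: 8080
---
# Hypothetical Deployment: the pod template carries the matching label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app    # if a refactor renames this, endpoints go empty
    spec:
      containers:
        - name: app
          image: my-app:1.0
```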
Ingress pointing to a nonexistent service. Your ingress resource references a service that doesn't exist in the namespace. Maybe it was deleted, maybe the name has a typo, maybe it's in a different namespace.
kubectl get ingress <ingress-name> -n <namespace> -o yaml
kubectl get svc -n <namespace>
Compare the backend service name in the ingress spec with the actual services in the namespace. Kubernetes won't warn you about this mismatch at apply time.
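The field to compare is the backend service reference in the ingress spec. A sketch with hypothetical names:

```yaml
# Hypothetical Ingress fragment. The backend service name must match
# a Service in the same namespace; Kubernetes won't validate this at
# apply time, so a typo here turns into a silent 503.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app   # compare against `kubectl get svc`
                port:
                  number: 80
```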
Rate limiting or circuit breaking. If you're running a service mesh like Istio or Linkerd, 503s can come from the mesh's own traffic management. Circuit breakers tripping because of high error rates, rate limits being exceeded, or outlier detection ejecting all endpoints can all result in 503s that have nothing to do with your pods' health.
For Istio:
kubectl get destinationrule <rule-name> -n <namespace> -o yaml
Check for connectionPool, outlierDetection, and trafficPolicy settings. An aggressive outlier detection config can eject all endpoints if error rates spike briefly, turning a temporary problem into a full outage.
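As a sketch of what an aggressive config looks like, here is a hypothetical DestinationRule fragment (the names and thresholds are assumptions for illustration):

```yaml
# Hypothetical Istio DestinationRule. With maxEjectionPercent: 100,
# a brief error spike can eject every endpoint, producing 503s even
# though the pods themselves are healthy.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 100   # lower this to keep some endpoints in the pool
```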
HPA scaled to zero or can't scale up. If you're using a Horizontal Pod Autoscaler with a minimum replica count of zero (common with KEDA for event-driven scaling), there may be no pods running during low-traffic periods. When traffic arrives, there's a cold start delay before pods are ready.
Even with a minimum of one, if the HPA is trying to scale up but new pods can't schedule (resource constraints, node pool exhaustion), you can end up with all existing pods overwhelmed and failing readiness checks while replacement pods are stuck in Pending.
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
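A minimal HPA sketch for comparison, assuming a CPU-based target (the names and thresholds are hypothetical):

```yaml
# Hypothetical HPA. minReplicas: 1 avoids the scale-to-zero cold-start
# window, but replacement pods that can't schedule will still leave
# the endpoint pool empty under load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```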
Debugging Sequence
Start with endpoints. This is the fastest way to confirm what the ingress controller sees:
kubectl get endpoints <service-name> -n <namespace>
If the endpoint list is empty, the 503 makes sense. Now figure out why.
Check if pods exist and their status. kubectl get pods -n <namespace> tells you if pods are running at all. If they're Running but not Ready (you'll see 0/1 in the READY column), readiness probes are failing.
Look at pod events. kubectl describe pod <pod-name> will show you readiness probe failures, scheduling issues, or volume mount problems in the Events section. This is usually where the actual root cause surfaces.
Check what the readiness probe depends on. If the probe hits an endpoint that checks database connectivity, and the database is down, every pod in the deployment will fail its readiness check simultaneously. This is a design choice worth revisiting: readiness probes that depend on external services can turn a partial outage into a complete one.
Verify the ingress controller is healthy. If endpoints exist and look correct but you're still getting 503s, the ingress controller itself might be struggling:
kubectl logs <ingress-controller-pod> -n ingress-nginx | grep 503
Preventing 503s
The most impactful prevention is getting your readiness probes right. A readiness probe should check whether this specific pod can serve traffic, not whether the entire system is healthy. If your readiness probe fails when a downstream database is unreachable, a database blip takes out your entire service instead of just degrading it.
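One way to apply that: keep the probe shallow and let the application degrade gracefully when a dependency is down. A sketch, assuming the app exposes a local-only health endpoint:

```yaml
# Hypothetical probe pointing at a handler that checks only whether
# this process can accept requests — no database or cache calls —
# so a downstream outage degrades the service instead of emptying
# the endpoint pool.
readinessProbe:
  httpGet:
    path: /healthz   # assumed local-only check
    port: 8080
  periodSeconds: 10
```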
For deployments, set maxUnavailable in your rolling update strategy to a value that guarantees some pods remain in the endpoint pool during rollouts. If you have 3 replicas and maxUnavailable: 100%, all old pods can terminate before new ones are ready.
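A sketch of a safer rollout strategy for that 3-replica case (values are illustrative):

```yaml
# Hypothetical deployment strategy fragment. With 3 replicas, this
# guarantees at most one pod is out of the endpoint pool during a
# rollout; maxSurge lets a new pod come up before an old one stops.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
```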
Pod Disruption Budgets help protect against involuntary evictions (node drains, spot instance reclamation) taking out too many pods at once:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
When It Gets Complicated
The clean cases are when one thing is wrong: pods aren't running, or labels don't match, or a probe is misconfigured. The harder cases are when multiple things interact. A node goes down, the scheduler can't place replacement pods due to resource pressure, the remaining pods get overloaded and start failing health checks, and now you have zero healthy endpoints.
Tracing through that chain from the 503 you're seeing back to the original trigger requires pulling context from pod events, node conditions, scheduling decisions, and deployment history simultaneously. It's the kind of investigation that takes an experienced engineer 30 minutes of context-switching across kubectl, dashboards, and deployment logs.
What is Resolve AI
Resolve AI investigates production issues across your code, infrastructure, and telemetry. When you're staring at a 503 and need to trace it back through endpoints, pod health, scheduling, and recent deploys, Resolve pulls context from across your stack and reasons through the investigation the way a senior SRE would.
If your team spends too much time on investigations like this, see Resolve AI in action.