Omnath Ganapure

Cloudsmith

Kubernetes Production Is Not Theory

Kubernetes in production is not about running pods; it's about containing failures.
This article explains three real production scenarios every DevOps engineer eventually faces and how they are solved step by step.

Scenario 1: The Noisy Neighbor Problem

In a shared Kubernetes cluster, multiple teams run workloads together.
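If one pod starts consuming excessive CPU or memory, it can starve the other workloads on its node, and as those pods are evicted and rescheduled the damage can spread to every other application in the cluster.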

This misbehaving pod is called a Noisy Neighbor.

What happens

  • One pod leaks memory

  • It consumes most cluster resources

  • Other pods start crashing

  • Entire cluster becomes unstable

This is called an uncontained blast radius.

First Fix: ResourceQuotas (Namespace-Level Protection)

To stop one team from killing the entire cluster, we apply ResourceQuotas at the namespace level.

What ResourceQuota does

  • Limits total CPU & memory per namespace

  • Prevents one team from consuming everything
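
As a rough illustration of the quota described above, the sketch below builds a ResourceQuota manifest in Python and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The namespace `team-a`, the helper name `build_resource_quota`, and the figures (8 CPUs, 16Gi) are assumptions chosen for the example, not values from a real cluster.

```python
import json


def build_resource_quota(namespace, cpu, memory):
    """Return a ResourceQuota manifest capping total CPU and memory in one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # Hypothetical naming scheme: one quota object per team namespace.
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Sum of all pod requests in the namespace may not exceed these.
                "requests.cpu": cpu,
                "requests.memory": memory,
                # Sum of all pod limits in the namespace may not exceed these.
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }


# Assumed example values; size them from the team's real usage.
quota = build_resource_quota("team-a", cpu="8", memory="16Gi")
print(json.dumps(quota, indent=2))
```

One side effect worth knowing: once a namespace has a quota on CPU or memory, pods that do not declare the corresponding requests or limits can be rejected at creation time.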

Now:

  • The cluster is safe

  • Other namespaces are protected

But the problem still exists inside the namespace: a single pod can still use up the whole quota and starve its own neighbors.

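Final Fix: Resource Limits (Pod-Level Isolation)

To fully contain failures, DevOps engineers define resource limits on every container in a pod.

What resource limits do

  • Set max CPU/memory per container

  • If a container exceeds its memory limit → Kubernetes kills only that container (OOMKilled)

  • If a container hits its CPU limit → it is throttled, not killed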

Result:

  • Faulty pod dies

  • Other pods stay healthy

  • Blast radius = one pod
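
The per-container limits described above might look like the following sketch, which builds a Pod manifest as a Python dict and prints it as JSON. The pod name `orders-api`, the image URL, the helper `container_with_limits`, and every CPU and memory figure are hypothetical. The `requests` values are included because limits are usually set together with them.

```python
import json


def container_with_limits(name, image, cpu_request, cpu_limit, mem_request, mem_limit):
    """Return a container spec with explicit requests and limits."""
    return {
        "name": name,
        "image": image,
        "resources": {
            # What the scheduler reserves for the container.
            "requests": {"cpu": cpu_request, "memory": mem_request},
            # Hard ceiling: CPU above this is throttled, memory above this is OOM-killed.
            "limits": {"cpu": cpu_limit, "memory": mem_limit},
        },
    }


# Assumed example workload; names and numbers are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "orders-api", "namespace": "team-a"},
    "spec": {
        "containers": [
            container_with_limits(
                "orders-api",
                "registry.example.com/orders-api:1.0",
                cpu_request="250m",
                cpu_limit="500m",
                mem_request="256Mi",
                mem_limit="512Mi",
            )
        ]
    },
}
print(json.dumps(pod, indent=2))
```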

Scenario 1 Takeaway

Two-layer defense is mandatory

  • ResourceQuota → protects the cluster

  • Resource limits → protect applications

Scenario 2: The OOMKilled Mystery (Memory Leak Debugging)

Even after applying resource limits, a pod keeps restarting with:

Reason: OOMKilled
Exit Code: 137

You have already:

  • Set correct limits

  • Benchmarked memory

  • Configured Kubernetes properly

So why is it still crashing?

Because Kubernetes only tells you what happened, not why.
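
The exit code is one of the few clues Kubernetes does give. By the usual Unix convention, a code above 128 means the process was killed by signal number `code - 128`, so 137 = 128 + 9, which is SIGKILL, the signal the kernel's OOM killer sends. A small sketch of that arithmetic, assuming a Linux host (the helper name `describe_exit_code` is made up for this example):

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            # Not a signal number known on this platform.
            return f"exit code {code} does not map to a known signal"
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
```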

The DevOps Responsibility

Important production truth:

DevOps engineers do not debug application code

Their role is to:

  • Capture evidence

  • Preserve state before crash

  • Hand data to developers

This evidence comes from dumps.

Heap Dump & Thread Dump Explained

Heap Dump

  • Snapshot of application memory

  • Shows all objects, sizes, references

  • Used to find memory leaks

Thread Dump

  • Snapshot of running threads

  • Shows blocked, waiting, and deadlocked threads

  • Used for CPU / hang issues

For OOMKilled → Heap dump is critical
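
A minimal sketch of capturing both dumps from a running pod, under several assumptions the article does not state: the application is a JVM, the JDK tools `jmap` and `jstack` are present in the container image, the JVM runs as PID 1, and the image contains `tar` (which `kubectl cp` needs). The pod and namespace names are placeholders. The function only builds the commands so they can be reviewed before anyone runs them.

```python
def dump_commands(pod, namespace, pid=1, out_dir="/tmp"):
    """Build kubectl commands that capture a heap dump and a thread dump from a pod."""
    exec_prefix = ["kubectl", "exec", "-n", namespace, pod, "--"]
    heap_file = f"{out_dir}/heap.hprof"
    return [
        # Heap dump: binary snapshot of all objects, written inside the container.
        exec_prefix + ["jmap", f"-dump:format=b,file={heap_file}", str(pid)],
        # Thread dump: printed to stdout, so redirect it to a file when running it.
        exec_prefix + ["jstack", str(pid)],
        # Copy the heap dump out before the container restarts and the file is lost.
        ["kubectl", "cp", f"{namespace}/{pod}:{heap_file}", "./heap.hprof"],
    ]


# Hypothetical pod and namespace names.
for cmd in dump_commands("orders-api-7d9f", "team-a"):
    print(" ".join(cmd))
```

Timing matters here: once the kernel has OOM-killed the container, the process and its memory are gone, so the dump has to be taken while memory is climbing but the pod is still alive.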

Scenario 2 Takeaway

OOMKilled is usually application memory misuse, not a Kubernetes failure

DevOps enables diagnosis.
Developers fix the code.

Scenario 3: The High-Wire Kubernetes Upgrade (Zero Downtime)

Every Kubernetes cluster must eventually be upgraded.

Example:

  • From Kubernetes 1.29 → 1.30

Risks:

  • API deprecations

  • Control plane failure

  • Application downtime

  • Complete cluster outage

This is one of the highest-risk DevOps tasks.

The Correct Upgrade Strategy

Upgrades must never be ad-hoc.
They follow a strict, repeatable playbook.

Phase 1: Preparation

  • Backup etcd

  • Read release notes

  • Identify breaking changes
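
Identifying breaking changes can be partly automated. The sketch below scans manifests for API versions that are no longer served at the target release. The `REMOVED_IN` table is a small illustrative excerpt, not a complete list, and the manifest names are invented. A real check should build the table from the release notes and the official deprecation guide for the exact versions involved.

```python
def parse_version(text):
    """Turn '1.30' into (1, 30) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


# Illustrative excerpt only: (apiVersion, kind) -> release in which it stopped being served.
REMOVED_IN = {
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("flowcontrol.apiserver.k8s.io/v1beta2", "FlowSchema"): "1.29",
}


def find_removed_apis(manifests, target_version, removed_in=REMOVED_IN):
    """Return (name, apiVersion, kind, removed_in) for manifests that break at target_version."""
    target = parse_version(target_version)
    hits = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = removed_in.get(key)
        if removed is not None and parse_version(removed) <= target:
            name = manifest.get("metadata", {}).get("name")
            hits.append((name, key[0], key[1], removed))
    return hits


# Hypothetical manifests, as they might be loaded from a Git repository.
manifests = [
    {"apiVersion": "batch/v1", "kind": "CronJob", "metadata": {"name": "nightly-report"}},
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "legacy-cleanup"}},
]
print(find_removed_apis(manifests, "1.30"))
```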

Phase 2: Control Plane Upgrade

  • Upgrade API server

  • Upgrade scheduler & controller manager

  • Validate cluster health

Phase 3: Rolling Worker Node Upgrade

For each node:

  • Cordon node

  • Drain pods

  • Upgrade kubelet

  • Uncordon node

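Done one node at a time, this keeps applications available, provided each workload runs multiple replicas (ideally protected by a PodDisruptionBudget) so that drained pods have somewhere else to run.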
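
The per-node loop can be sketched as a dry-run plan that prints the commands instead of executing them. The node names are placeholders, and the kubelet upgrade itself is left as a comment line because it depends on how the cluster was installed (package manager, kubeadm, or a managed service). The drain flags shown are the ones commonly needed when nodes run DaemonSets or pods with emptyDir volumes; whether they are appropriate is a per-cluster decision.

```python
def node_upgrade_plan(nodes):
    """Return the ordered steps for upgrading worker nodes one at a time."""
    plan = []
    for node in nodes:
        # Stop new pods from being scheduled onto the node.
        plan.append(f"kubectl cordon {node}")
        # Evict running pods; DaemonSet pods are skipped, emptyDir data is discarded.
        plan.append(f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data")
        # Placeholder: the actual upgrade command is installation-specific.
        plan.append(f"# upgrade kubelet on {node} (installation-specific, not shown)")
        # Allow scheduling again once the node reports Ready on the new version.
        plan.append(f"kubectl uncordon {node}")
    return plan


# Hypothetical node names.
for step in node_upgrade_plan(["node-1", "node-2"]):
    print(step)
```

Finishing one node completely before starting the next is the point of the design: at any moment only one node's worth of capacity is out of service.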

Scenario 3 Takeaway

Production upgrades succeed due to discipline, not confidence

2026 — Built by Omnath Ganapure