KubeFM

Hosted by KubeFM

Technology · Interviews guests

Episodes

100

Latest episode

May 2026

Language

EN-US

About the show

Discover all the great things happening in the world of Kubernetes, learn (controversial) opinions from the experts and explore the successes (and failures) of running Kubernetes at scale.

Listen to episodes

60 recent

May 19, 2026 · Episode 17 · 21 min

The Hidden Cost of Slow Autoscaling, with John Ford

Forced platform migrations are usually treated as something to survive. At Scout24, a mandatory OS migration became an opportunity to rethink Kubernetes autoscaling, node provisioning, and infrastructure efficiency.

John Ford explains how Scout24 moved its EKS-based Infinity platform from a polling autoscaler and over-provisioned capacity to Karpenter and Bottlerocket. The result was faster node startup, a safer migration path, and about a 30% infrastructure reduction without major downtime.

In this interview:
- Why two-minute node provisioning forced a 25% capacity buffer
- How Karpenter made the Bottlerocket migration safer
- What broke around EC2 metadata, AWS SDKs, and cgroups
- How the new foundation enables Spot, ARM, and GPU workloads

Sponsor: This episode is sponsored by LearnKube. Get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info: Find all the links and info for this episode here: https://ku.bz/DdmVC2_7v

May 12, 2026 · Episode 16 · 36 min

The Namespaces Scaling Trap, with Brian Stack

Most teams scale Kubernetes by thinking about pods and nodes. At Render, Brian Stack ran into a different dimension: hundreds of thousands of namespaces per cluster, multiplied across DaemonSets that list-watch every namespace.

Brian explains how Render traced the issue through Calico and Vector, worked with upstream maintainers, and turned memory profiling into operational wins: lower node costs, lighter API-server load, and faster rollouts.

In this interview:
- Why namespaces can become a hidden scaling bottleneck
- How DaemonSets multiply memory and control-plane pressure
- How profiling, staging clusters, and upstream collaboration freed 7 TiB
- Why pushing from an 80% fix to a complete fix can make teams faster

More info: Find all the links and info for this episode here: https://ku.bz/0mrvCsXrV

May 5, 2026 · Episode 15 · 38 min

AI Agents Running Kubernetes, with Mike Solomon

What happens when an AI agent stops generating Kubernetes YAML and starts operating the cluster directly?

Mike Solomon, software engineer at AIATELLA, explains how his team moved from a sprawling Helm setup to Markdown-driven infrastructure specs that Claude Code can execute, test, and refine.

You will learn:
- Why Helm became hard to maintain for a fast-moving medical infrastructure repo
- How Claude debugged Argo, TLS conflicts, kubectl patches, and private registry credentials
- How runbooks plus agent memory files capture failures so deployments become reproducible

It is a practical look at where Kubernetes automation may be heading: less hand-written YAML, more precise intent, and a sharper definition of when the human must stay in the loop.

More info: Find all the links and info for this episode here: https://ku.bz/y70mLvWNs

April 28, 2026 · Episode 14 · 35 min

SaaS with Kubernetes Operators and Garbage Collection, with Alexander Held

A single Kubernetes CRD for every service request turns small changes into full-platform reconciliations.

Alexander Held, former platform engineer at Mercedes-Benz Tech Innovation, describes a production refactor from a 2,000-line CRD to purpose-built resources and controllers. He shows how teams can model business workflows as Kubernetes APIs and then use owner references, finalizers, and events to keep platform operations predictable.

You will learn:
- Why monolithic CRDs create performance and troubleshooting problems
- How controllers turn database provisioning and backups into reconciliation loops
- How finalizers clean up external resources such as S3 backups
- Why Kubernetes events make platform workflows easier to debug

More info: Find all the links and info for this episode here: https://ku.bz/TGy4Qn7Qs

April 21, 2026 · Episode 13 · 1 hr 29 min

What Hip-Hop Can Teach Us About Kubernetes, with Kelsey Hightower, Eric Abercrombie, and Julius Payne II

Kelsey Hightower, Eric Abercrombie, and Julius Payne II reflect on life after achievement, entering the Kubernetes world for the first time, and how music, creativity, and lived experience shape the way they think about technology.

In this interview:
- Why fundamentals, patience, and repetition still matter more than shortcuts
- How Kubernetes, community, and confidence intersect for people entering cloud-native work
- What hip-hop, production, and storytelling can teach us about ownership, authenticity, and finding your voice

More info: Find all the links and info for this episode here: https://ku.bz/czrCCXSLt

April 7, 2026 · Episode 12 · 30 min

Intelligent Kubernetes Load Balancing, with Rohit Agrawal

You're running gRPC services in Kubernetes and load balancing looks fine on the dashboard, but some pods are burning at 80% CPU while others sit idle, and adding more replicas only partially helps.

Rohit Agrawal, a Staff Software Engineer on the traffic platform team at Databricks, explains why this happens and how his team replaced Kubernetes's default networking with a proxy-less, client-side load-balancing system built on the xDS protocol.

In this episode:
- Why kube-proxy's Layer 4 routing breaks down under high-throughput gRPC: it picks a backend once per TCP connection, not per request
- How Databricks built an Endpoint Discovery Service (EDS) that watches Kubernetes directly and streams real-time pod metadata to every client
- How zone-aware spillover cut cross-availability-zone costs without sacrificing availability
- Why CPU-based routing failed (monitoring lag creates oscillation) and what signals to use instead

The system has been running in production for three years across hundreds of services, handling millions of requests.

More info: Find all the links and info for this episode here: https://ku.bz/y803JMhBk
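To make the per-connection versus per-request point from this episode concrete, here is a minimal simulation sketch. It is not Databricks' system and uses no xDS or Kubernetes APIs: the function names, pod names, and numbers are all hypothetical, and per-request balancing is approximated with simple client-side round-robin.

```python
import itertools
import random


def per_connection_load(clients, backends, requests_per_client, rng):
    """Layer 4 style: each client opens one long-lived connection.

    The backend is picked once, when the connection is opened, and every
    request multiplexed over that connection lands on the same backend.
    """
    load = {backend: 0 for backend in backends}
    for _ in range(clients):
        backend = rng.choice(backends)  # chosen once per connection
        load[backend] += requests_per_client
    return load


def per_request_load(clients, backends, requests_per_client):
    """Client-side style: each client round-robins every request."""
    load = {backend: 0 for backend in backends}
    for _ in range(clients):
        rotation = itertools.cycle(backends)  # hypothetical per-client picker
        for _ in range(requests_per_client):
            load[next(rotation)] += 1
    return load


if __name__ == "__main__":
    # Hypothetical setup: 4 busy clients, 10 pods, 1000 requests per client.
    pods = [f"pod-{i}" for i in range(10)]
    print("per connection:", per_connection_load(4, pods, 1000, random.Random(7)))
    print("per request:   ", per_request_load(4, pods, 1000))
```

With fewer long-lived connections than pods, the per-connection model necessarily leaves some pods idle no matter how many replicas exist, which matches the symptom described above. Real client-side balancers typically weigh more signals than plain round-robin.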

March 31, 2026 · Episode 11 · 28 min

That Time I Found a Service Account Token in my Log Files, with Vincent von Büren

You're integrating HashiCorp Vault into your Kubernetes cluster and adding a temporary debug log line to check whether the ServiceAccount token is being passed correctly. Three months later, that log line is still in production, and the token it prints has a 1-year expiry with no audience restrictions.

Vincent von Büren, a platform engineer at ipt in Switzerland, lived through exactly this incident. In this episode, he breaks down why default Kubernetes ServiceAccount tokens are a quiet security risk hiding in plain sight.

You will learn:
- What's actually inside a Kubernetes ServiceAccount JWT (issuer, subject, audience, and expiry)
- Why tokens with no audience scoping enable replay attacks across internal and external systems
- How Vault's Kubernetes auth method and JWT auth method compare, and when to choose each
- What projected tokens are, why they dramatically reduce blast radius, and what's holding teams back from using them
- Practical steps for auditing which pods actually need API access and disabling auto-mounting everywhere else

More info: Find all the links and info for this episode here: https://ku.bz/LTnB_Ntbc
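As a rough illustration of the claims mentioned in this episode (issuer, subject, audience, expiry), the sketch below decodes the payload of a JWT and flags a missing audience or a long lifetime. It is only a sketch under stated assumptions: the demo token is fabricated in the code, the claim values and the one-hour threshold are hypothetical, and the signature is not verified, so this is suitable for inspection only, never for authentication.

```python
import base64
import json


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(data):
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def decode_claims(token):
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def audit(claims, max_lifetime_seconds=3600):
    """Flag the two weaknesses described above (threshold is hypothetical)."""
    findings = []
    if not claims.get("aud"):
        findings.append("no audience restriction")
    issued_at, expiry = claims.get("iat"), claims.get("exp")
    if expiry is None:
        findings.append("no expiry")
    elif issued_at is not None and expiry - issued_at > max_lifetime_seconds:
        findings.append("lifetime exceeds limit")
    return findings


def make_demo_token(claims):
    """Build an unsigned, made-up token purely for this demonstration."""
    header = {"alg": "none", "typ": "JWT"}
    return ".".join([b64url_encode(header), b64url_encode(claims), "placeholder"])


if __name__ == "__main__":
    # Hypothetical claims: a one-year token with no "aud" claim.
    demo = make_demo_token({
        "iss": "https://kubernetes.default.svc",
        "sub": "system:serviceaccount:default:my-app",
        "iat": 0,
        "exp": 365 * 24 * 3600,
    })
    claims = decode_claims(demo)
    print(claims["iss"], claims["sub"], audit(claims))
```

Running the same `audit` over a short-lived token that carries an `aud` claim returns an empty list, which is the shape of token the episode's discussion of projected tokens points toward.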

March 24, 2026 · Episode 10

GPU Containers as a Service, with Landon Clipp

Running GPU workloads on Kubernetes sounds straightforward until you need to isolate multiple tenants on the same server. The moment you virtualize GPUs for security, you lose access to NVIDIA kernel drivers, and almost every tool in the ecosystem assumes those drivers exist.

Landon Clipp built a GPU-based Containers as a Service platform from scratch, solving each isolation layer (from kernel separation with Kata Containers + QEMU to NVLink fabric partitioning to network policies with Cilium/eBPF), and shares exactly what broke along the way.

In this interview:
- Why standard NVIDIA tooling (GPU Operator) fails in multi-tenant setups, and how to use CDI with PCI topology scanning to make GPUs visible to Kubernetes without kernel drivers
- How to partition the NVLink fabric between tenants using a trusted service VM running Fabric Manager, and why the physical PCIe wiring differs between Supermicro HGX and NVIDIA DGX systems
- Why gVisor doesn't work for GPU workloads: NVIDIA's unstable ioctl ABI means Google has to update gVisor for every driver release, and they only support a handful of GPUs
- What caused 8-GPU VMs to take 30+ minutes to boot, and the specific fixes (IOMMUFD, cold plugging, kernel upgrades) that brought it down to minutes
- How Cilium network policies enforce tenant isolation at the Kubernetes identity level instead of fragile IP-based rules
- Where Containers as a Service fits best: inference workloads where AI teams want to ship an OCI image without managing infrastructure or signing multi-million dollar cluster contracts

More info: Find all the links and info for this episode here: https://ku.bz/jjK_yJTDz

March 17, 2026 · Episode 9 · 20 min

How We Cut Build Debugging Time by 75% with AI, with Ron Matsliah

Build failures in Kubernetes CI/CD pipelines are a silent productivity killer. Developers spend 45+ minutes scrolling through cryptic logs, often just hitting rerun and hoping for the best.

Ron Matsliah, DevOps engineer at Next Insurance, built an AI-powered assistant that cut build debugging time by 75%, delivered not as a dashboard but directly in Slack, where developers already work.

In this episode:
- Why combining deterministic rules with AI produces better results than letting an LLM guess alone
- How correlating Kubernetes events with build logs catches spot instance terminations that produce misleading errors
- Why integrating into existing workflows and building feedback loops from day one drove adoption
- The prompt engineering lessons learned from testing with real production data instead of synthetic examples

The takeaway: simple rules plus rich context consistently outperform complex AI queries on their own.

More info: Find all the links and info for this episode here: https://ku.bz/PDdYfC00w

March 10, 2026 · Episode 8 · 25 min

Migrating Kubernetes Off Big Cloud, with Fernando Duran

Managed Kubernetes on a major cloud provider can cost hundreds or even thousands of dollars a month, and much of that spending hides behind defaults, minimum resource ratios, and auxiliary services you didn't ask for.

Fernando Duran, founder of SadServers, shares how his GKE Autopilot proof of concept ran close to $1,000/month on a fraction of the CPU of the actual workload, and how he cut that to roughly $30/month by moving to Hetzner with Edka as a managed control plane.

In this interview:
- Why Kubernetes hasn't delivered on its original promise of cost savings through bin packing, and what it actually provides instead
- A real cost comparison: $1,000/month on GKE vs. $30/month on Hetzner with Edka for the same nominal capacity
- What you need to bring with you (observability, logging, dashboards) when leaving a fully managed cloud provider

The decision comes down to how tightly coupled you are to cloud-specific services and whether your team can spare the cycles to manage the gaps.

More info: Find all the links and info for this episode here: https://ku.bz/6nSDbz9m4