CloudChat

Hosted by Carl and Brandon

Technology · Education · Interviews guests

Episodes

Latest episode

Jun 2026

About the show

Conversations about building software and designing architecture in the cloud natively.

Listen to episodes

34 recent

June 8, 2026 · 1 hr 17 min

Ep. 33 - Route of All Evil

Episode 0033 - Route of All Evil Cloud networking still breaks when teams assume the platform will "just handle it," and Carl and Brandon dig into why. They challenge that myth and show where parity falls apart across providers: VNet, VPC, and VCN primitives look familiar but behave differently in defaults, region and zone design, and routing/security expectations. From there, the episode moves into foundational design pressure points such as IPv4 range planning, overlapping CIDRs, Kubernetes networking overlays, and the route-level surprises that cause hard-to-diagnose failures, including asymmetric paths, BGP mistakes, and MTU mismatches. The second half focuses on the practical failure modes teams feel in production: SNAT exhaustion that appears as random timeouts, endpoint and DNS choices that silently change traffic paths, and egress patterns that impact both reliability and cost. Load balancing choices (Layer 4 vs Layer 7), TLS termination strategy, and cloud-specific security control models all shape the final behavior of a system. The throughline is consistent: make network intent explicit, treat egress and observability as first-class design surfaces, and standardize repeatable patterns that survive provider changes. 
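The overlapping-CIDR problem mentioned above can be caught before any peering or VPN is attempted. A minimal sketch using Python's standard `ipaddress` module; the network names and ranges are made up for illustration and do not come from the episode.

```python
import ipaddress
from itertools import combinations

# Hypothetical address plan; names and ranges are illustrative only.
PLAN = {
    "azure-vnet-hub": "10.0.0.0/16",
    "aws-vpc-apps": "10.0.128.0/17",  # sits inside the hub range
    "oci-vcn-data": "10.1.0.0/16",
}

def find_overlaps(plan):
    """Return sorted pairs of network names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    )

if __name__ == "__main__":
    for a, b in find_overlaps(PLAN):
        print(f"overlap: {a} <-> {b}")
```

Running a check like this in a pipeline, against every range the team has claimed across providers, is one way to make address intent explicit before two networks ever need to talk to each other.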
Links

Core Networking Concepts
  RFC 1918: Address Allocation for Private Internets
  RFC 4632: Classless Inter-domain Routing (CIDR)
  RFC 4271: Border Gateway Protocol 4 (BGP-4)
  Maximum Transmission Unit (MTU)
  Asymmetric routing

Cloud Networking and Edge Services
  Azure Front Door overview
  How Cloudflare works
  Azure Virtual Network overview
  Amazon VPC user guide
  OCI Virtual Cloud Network (VCN) overview

NAT, Private Access, and Egress
  Azure NAT Gateway
  AWS NAT Gateway
  Azure Private Endpoint overview
  Azure bandwidth pricing (egress)

Load Balancing, Security, and Operations
  Azure Load Balancer overview
  Azure Application Gateway overview
  Azure Traffic Manager overview
  Microsoft Zero Trust guidance
  Azure Network Watcher flow logs overview

Hybrid Connectivity
  AWS Direct Connect
  Azure ExpressRoute overview
  Oracle FastConnect overview

Visit us at: twitter.com/CloudChatTech discord.cloudchat.tech cloudchatpodcast@gmail.com linkedin.com/company/cloudchat

May 11, 2026 · 1 hr 19 min

Ep. 32 - Rolling, Rolling, Rolling…

Episode 0032 - Rolling, Rolling, Rolling… Logs are ground truth — high-fidelity, event-level data that anchor observability alongside metrics and traces. Carl and Brandon argue the biggest mistake teams make is treating "more logs" as "better logs." If everything is logged, nothing is useful, and they both share recent troubleshooting sessions where verbose, unstructured output forced them into KQL gymnastics just to find the actual error. Brandon walks through a 503 that turned out to be a database fault hidden one layer down, and Carl recounts a customer whose "unplanned" VM reboots were actually planned Kubernetes node maintenance — a story you can only untangle by correlating infrastructure and platform logs. Along the way they cover the six log sources worth thinking about (application, infrastructure, platform/managed service, security, audit, and access logs), with a detour into a customer whose minute-long latency vanished once infra logs revealed a VPN routing New York users through Texas. The middle of the episode is a clinic on log hygiene. Carl walks through log levels — debug/verbose, info, warn, error, fatal — and the distinction Brandon draws between an exception (a code construct) and an error (a log level): a caught exception is an error, an uncaught one becomes fatal. They make the case for structured logging into stores like Kusto or via OpenTelemetry so keys can be projected, indexed, and fed directly into dashboards, and Brandon's tip on not pre-computing expensive log arguments is a reminder that a disabled verbose call still costs CPU if you build its message eagerly. Centralized logging pipelines beat rolling your own helper class — log4-anything frameworks exist for a reason — and UTC alone won't save you when scaled-out instances drift apart in time. Correlation and trace IDs, especially parent/child IDs from OpenTelemetry, are the thread that stitches a single user's journey back together across microservices. 
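Brandon's point about eagerly built log arguments is easy to see in code. The sketch below uses Python's standard `logging` module as a stand-in (the episode does not name a language), and `expensive_dump` is a made-up helper that counts how often it runs.

```python
import logging

log = logging.getLogger("orders")
log.setLevel(logging.INFO)  # debug/verbose output is disabled

calls = {"count": 0}

def expensive_dump(order):
    """Stand-in for a costly serialization; counts how often it runs."""
    calls["count"] += 1
    return ", ".join(f"{k}={v}" for k, v in sorted(order.items()))

order = {"id": 42, "items": 3}

# Eager: the argument is built before debug() can decide to drop the record.
log.debug("order state: %s", expensive_dump(order))
eager_calls = calls["count"]

# Guarded: the costly call only runs if DEBUG is actually enabled.
if log.isEnabledFor(logging.DEBUG):
    log.debug("order state: %s", expensive_dump(order))
guarded_calls = calls["count"] - eager_calls

print(f"eager: {eager_calls} call(s), guarded: {guarded_calls} call(s)")
```

Neither call writes a log line, yet the eager form still pays for the serialization. The level check is the simplest guard; many logging frameworks offer an equivalent, such as passing a callable or message template that is only evaluated when the level is enabled.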
Carl and Brandon close on cost and discipline. Logging budgets balloon fast, so production should not run at verbose level, retention should be tiered (a month of exceptions is plenty once the fix ships), teams sending the same logs to Log Analytics plus Event Hubs plus a storage account should pick one source of truth, and Application Insights-style sampling can collapse repetitive traffic into representative events. Compliance logs that sit for years belong in cold or frozen storage tiers where the access pattern actually matches the cost. Their do's and don'ts land on a simple posture: log with intent, redact secrets and connection strings, standardize across teams, and, especially if AI agents are writing your code, make sure the logging conventions travel with the work. Point an agent at a recent run and ask where the gaps and noise are; it's a fast way to audit whether your logs are actually doing their job.

Links

Observability and logging concepts
  OpenTelemetry
  OpenTelemetry traces and spans
  W3C Trace Context (correlation IDs)
  Structured logging overview (Microsoft Learn)
  Log levels in .NET (LogLevel enum)

Logging frameworks
  log4j (Apache)
  log4net (Apache)
  Serilog (structured logging for .NET)

Azure platform logging
  Azure Monitor Logs / Log Analytics
  Azure diagnostic settings
  Azure Application Insights sampling
  Kusto Query Language (KQL)
  Azure Event Hubs
  Azure Blob Storage access tiers (hot/cool/cold/archive)

Security and supply chain
  XZ Utils backdoor (CVE-2024-3094)
  Veritasium: "The Internet Was Weeks Away From Disaster and No One Knew"

April 6, 2026 · 1 hr 20 min

Ep. 31 - AI All the Things?

Episode 0031 - AI All the Things? Traditional sprint ceremonies start getting in the way when AI-assisted development outpaces the cadence they were built for. Carl and Brandon unpack why that happens and what to do about it — starting with the basics. Brandon defines context windows, distinguishes original vibe coding from the sloppy way the term is used today, and walks through the software factory model where requirements, source code, and tests live in separate repos. Carl shares how he continuously refines his Copilot instructions file, instructs the agent to detect and document recurring patterns, and leans on intent-based prompting over tactical step-by-step descriptions — a three-sentence prompt describing preset themes and macOS Focus Mode integration wrote his Swift UI code nearly flawlessly. Both hosts dig into context management: plan mode to review before implementing, the "Ralph Wiggum" pattern of starting fresh sessions with just the plan, and Architectural Decision Records that give future sessions a trail to follow. Different models suit different jobs — Claude for architecture, Codex for implementation — and MCP servers let those models reach Git and GitHub without a copy-paste workflow. Brandon argues AI is a tool like the Internet — some roles will shift, but learning and adapting has always been the core tech-industry skill. Carl backs that up with a study showing senior engineers only see productivity gains when they change their process, not when they bolt AI onto the old one. On the junior side, Carl mentors a developer to focus on data structures and algorithms — not for the implementation details, but for knowing when to apply them. An MIT study pegs realistic job displacement at around 11.7 percent, and cases like Box's layoffs look more like post-COVID overcorrection than proof that AI is replacing everyone. 
Links

AI-Assisted Development
  GitHub Copilot
  Anthropic Claude
  OpenAI Codex
  Model Context Protocol (MCP)
  T3 Chat: Compare LLM Outputs

Development Concepts
  Strangler Fig Pattern (Martin Fowler)
  Test-Driven Development (TDD)
  Behavior-Driven Development (BDD)

Tools Mentioned
  Swift UI (Apple Developer)
  Draw.io

February 2, 2026 · 1 hr 3 min

Ep. 30 - Local‑First Lifeboats: Architecting for Post‑EOL Usability

Episode 0030 - Local‑First Lifeboats: Architecting for Post‑EOL Usability This episode is about designing for the last day, not just the launch day. Carl kicks off with the Bose SoundTouch situation: a vendor moves toward EOL on a cloud-tethered API, users push back, and the outcome (at least in spirit) becomes a blueprint we wish was more common: keep the hardware useful by enabling local control paths and leaning on protocols that already work without your cloud. From there we broaden the conversation to the bigger problem: products and services that do something totally reasonable in a LAN suddenly need a round trip to the internet just to respond to a button press. Carl and Brandon talk through concrete "this actually happened" examples and what good looks like. Belkin's Wemo sunset email is a solid reference: clear dates, repeated notices, and a reality check that local APIs and ecosystems like HomeKit and Matter can keep devices working even when a vendor endpoint is shut off. We contrast that with the messier side of the industry: thermostats and other home gear that still function locally, but lose their main value when the cloud connection is removed, and cloud-only platforms like Stadia where "no backend" means "hard stop" (with the one bright spot being things like refunds and a final firmware update to unlock a controller for normal Bluetooth use). On the builder side, we get practical about how to retire things without surprising your users. We cover technical signaling (Deprecation and Sunset headers), the need for human-friendly comms beyond "put it in the docs," and the architecture patterns that make "minimum viable offline" real: local-first state, local discovery and control surfaces, and fallbacks that do not require re-pairing or re-auth when identity systems go away. 
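The technical signaling mentioned above (Deprecation and Sunset headers) can be sketched in a few lines. This is an illustration under stated assumptions, not code from the episode: the dates and the `example.com` URL are hypothetical, and the function only builds a header dictionary that a web framework would attach to responses.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def retirement_headers(deprecated_at, sunset_at, info_url):
    """Build response headers announcing a deprecation and a shutdown date.

    Both datetimes must be timezone-aware UTC values.
    """
    return {
        # RFC 9745: a Structured Fields date, i.e. "@" plus Unix seconds.
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # RFC 8594: an HTTP-date such as "Thu, 31 Dec 2026 23:59:59 GMT".
        "Sunset": format_datetime(sunset_at, usegmt=True),
        # Hypothetical URL; points humans at the migration notes.
        "Link": f'<{info_url}>; rel="sunset"',
    }

headers = retirement_headers(
    deprecated_at=datetime(2026, 6, 30, tzinfo=timezone.utc),
    sunset_at=datetime(2026, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
    info_url="https://example.com/api/v1/retirement",
)

for name, value in headers.items():
    print(f"{name}: {value}")
```

Headers like these only reach clients that look for them, which is why the hosts pair them with human-friendly communication rather than treating them as the whole notice.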
We also touch on SaaS escrow and continuity as a way to build trust (especially for startups) and close with a simple gut check: if your cloud disappeared tonight, what can your users still do tomorrow morning?

Links

News and examples we discussed
  Bose is open-sourcing its old smart speakers instead of bricking them | The Verge
  Belkin Wemo cloud service end-of-support notice
  Google Stadia - Strategy change and shutdown (2021–2023) | Wikipedia
  Google Stadia controller Bluetooth mode help article

API deprecation and shutdown mechanics
  Deprecation HTTP response header (RFC 9745)
  Sunset HTTP response header (RFC 8594)

Smart-home protocols and "local-first" connectivity
  Matter (Connectivity Standards Alliance)
  Thread protocol overview (Thread Group)
  Multicast DNS (mDNS) (RFC 6762)

Tools and patterns
  Local-first software (Ink & Switch)
  Strangler Fig Application pattern (Martin Fowler)
  Automerge (CRDT) - GitHub
  Yjs (CRDT) - GitHub

Contracts and continuity
  SaaS escrow overview (Escrow London)
  SaaS escrow overview (PRAXIS Escrow)
  Software escrow overview (EscrowTech)

Other links of interest
  Microsoft Modern Lifecycle Policy
  EU Right to Repair overview (European Commission)

January 5, 2026 · 1 hr 2 min

Ep. 29 - New Year's ☁️ Resolutions

Episode 0029 - New Year's ☁️ Resolutions "In 2026, your cloud is not allowed to have the same incidents for the same reasons as last year." Carl and Brandon treat this episode like a retrospective (the kind any good agile team would run), but instead of talking about sprint tickets, they write a New Year's resolution list on behalf of your cloud team. The format is simple: Stop, Start, Keep. Small, opinionated constraints that change day-to-day habits, not vague wishes about "better reliability, security, and cost." The Stop list hits the repeat-incident patterns: single-region "global" apps, treating infrastructure-as-code as optional (and living in the portal), mystery ownership with no clear tags or escalation path, one-off production fix scripts that never get documented, dashboards that are always green while users are hurting, and "temporary" exceptions that turn into permanent risk. The Start list is the muscle-building: run realistic failover/incident drills, measure change and recovery (DORA-style signals and MTTR, not just uptime), budget reliability and cost together, treat internal platforms like products with golden paths, standardize secrets and identity, and add a regular "delete day" so old environments and artifacts do not drag into the new year. The Keep list is what compounds: automate repetitive toil, invest in observability tied to real user flows, keep blameless postmortems with concrete follow-ups, and keep platform/SRE work visible so it does not get squeezed out by features. We hope you and your team are able to embrace some of these resolutions in the coming year, and hope that listening to more CloudChat is at the top of your list. Happy New Year everybody!

Links
  DORA: What is DevOps?
  Site Reliability Engineering (SRE Book)
  Azure Well-Architected Framework
  AWS Well-Architected Framework
  Google Cloud Architecture Framework
  Azure Bicep documentation
  Terraform documentation
  Azure Key Vault overview

December 1, 2025 · 1 hr 4 min

Ep. 28 - Respect My (DNS) Awe-Thor-Ih-TAY!!

Episode 0028 - Respect My (DNS) Awe-Thor-Ih-TAY!! Your cloud is humming along, then an edge breaks. What lever do you actually still have to steer users? In this episode, Carl and Brandon dig into DNS as a control plane and why "it is always DNS" keeps being true in 2025. DNS was designed for a slower internet with long TTLs and infrequent changes, but we now treat it like a real-time steering wheel for global failover. That mismatch shows up in outages where the backend is fine but nobody can resolve the hostname that front doors, CDNs, and APIs live behind. We unpack how TTL and caching really work (including negative caching and serve-stale), why modern edge products like Azure Front Door and Cloudflare can still turn into global single points of failure, and how DNS-based load balancers actually behave when you flip weights or priorities. From there we move into patterns and mitigations. We walk through hub-and-spoke vs mesh topologies and where public vs private DNS sit in each, plus concrete strategies for what to do when your edge is broken: bypass patterns, equivalent services, and multi-product designs that let you route around a failing front door. We also hit the observability side so "it is DNS" becomes a graph and an alert instead of a guess in a war room. We close with a look at emerging record types like SVCB/HTTPS and how they may help you advertise alternate endpoints and protocol hints without building another fragile tower of CNAMEs. 
Links

DNS Fundamentals
  RFC 1034: Domain Names - Concepts and Facilities
  RFC 1035: Domain Names - Implementation and Specification
  RFC 2308: Negative Caching of DNS Queries
  RFC 8767: Serving Stale Data to Improve DNS Resiliency

DNS Load Balancing and Edge Services
  Azure Traffic Manager documentation
  Azure DNS alias records
  Amazon Route 53 health checks and failover
  Cloudflare Load Balancing
  Akamai Global Traffic Management

Azure, AWS, and Cloudflare Outage Reading
  Azure Front Door service documentation
  AWS DynamoDB and Route 53 service health history
  Cloudflare status history

Architectures and Private DNS
  Azure Private DNS zones
  Azure DNS Private Resolver
  Azure Virtual WAN DNS guidance

Emerging DNS Records and HTTP/3
  Service binding (SVCB) and HTTPS resource records

November 3, 2025 · 51 min

Ep. 27 - Whoops, No VM's!!!

Episode 0027 - Whoops, No VM's!!! You've planned for redundancy, scaling, and failover, but what happens when the cloud itself runs out of space? In this episode, Carl and Brandon untangle capacity (what the provider physically or logically has available in a region or zone) versus quota (the soft limit on what you can consume). Mixing the two leads to painful surprises during scale events and failovers. We talk through how capacity shortfalls show up in real life (zones that are full, SKUs that vary by location, and limited supply for GPU-heavy instances) and the patterns that help: design for multiple zones and regions, add retry and fallback logic with flexible SKUs, balance spot with on-demand, and hold a baseline with reservations or time-bound commitments. We close on the business side: the price of headroom, when commitments make sense, and simple pipeline and monitoring checks so "no capacity" errors fail fast instead of 30 minutes into a deploy.

Links
  AWS Auto Scaling allocation strategies
  AWS EC2 Capacity Reservations
  AWS insufficient capacity guidance
  AWS Savings Plans
  AWS Service Quotas
  Azure On-demand Capacity Reservations
  Azure quotas overview
  Azure region pairs
  Azure subscription and service limits
  Azure VM allocation failures
  Azure VM Scale Sets orchestration modes (Flexible)
  GCP Compute Engine Reservations
  GCP quota alerts and monitoring
  GCP Regional Managed Instance Groups
  GCP resource availability errors
  Google Cloud quotas overview
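The "retry and fallback logic with flexible SKUs" pattern from this episode might look roughly like the sketch below. Everything named here is an assumption for illustration: `CapacityError`, the `allocate` callback, and the zone and SKU names are hypothetical and do not correspond to any provider's SDK.

```python
class CapacityError(Exception):
    """Hypothetical error raised when a zone/SKU combination is full."""

def allocate_with_fallback(allocate, candidates):
    """Try each (zone, sku) pair in order; return the first that succeeds.

    `allocate` is a caller-supplied function (an assumption here, not a real
    SDK call) that raises CapacityError when the provider has no room.
    Returns the allocation result plus the list of candidates that failed.
    """
    failures = []
    for zone, sku in candidates:
        try:
            return allocate(zone, sku), failures
        except CapacityError as exc:
            failures.append((zone, sku, str(exc)))
    raise CapacityError(f"no capacity in any candidate: {failures}")

# Fake allocator for demonstration: only one zone/SKU pair has room.
def fake_allocate(zone, sku):
    if (zone, sku) != ("zone-2", "gpu-small"):
        raise CapacityError(f"{sku} unavailable in {zone}")
    return f"vm-{zone}-{sku}"

CANDIDATES = [
    ("zone-1", "gpu-large"),  # preferred
    ("zone-2", "gpu-large"),  # same SKU, different zone
    ("zone-2", "gpu-small"),  # smaller SKU as a last resort
]

vm, skipped = allocate_with_fallback(fake_allocate, CANDIDATES)
print(vm, "after", len(skipped), "failed attempt(s)")
```

Ordering the candidate list is where the design decision lives: it encodes which trade-off (a different zone or a different SKU) the team prefers, and exhausting the list raises immediately so a deploy fails fast instead of hanging.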

October 6, 2025 · 58 min

Ep. 26 - Are Your Cloud Costs Too Damn High???

Episode 0026 - Are Your Cloud Costs Too Damn High??? Cloud cost optimization is about designing systems that perform efficiently without wasting money. In this episode, Carl and Brandon break down how AWS, Azure, and Google Cloud help teams rightsize compute, manage storage tiers, and control networking costs. They talk through savings plans, spot instances, lifecycle management, and data transfer strategies that keep performance high and waste low. The discussion then moves into monitoring, automation, and FinOps culture, where budgets, policies, and shared accountability make optimization stick. They cover dashboards, tagging, auto-shutdown routines, and partner-led programs that unlock funding and deeper discounts. Real-world stories from enterprises and startups highlight one key truth: cost management is not a cleanup exercise; it is an ongoing habit that keeps cloud architectures both efficient and sustainable.

Links
  AWS: Well-Architected Framework – Cost Optimization pillar
  AWS: How to Use AWS Well-Architected with Trusted Advisor for Cost Optimization
  AWS: AWS Savings Plans
  AWS: Amazon EC2 Spot Instances
  Azure: Microsoft Cost Management + Billing (overview)
  Azure: Quickstart: Start using Cost Analysis
  Azure: Common cost analysis uses in Cost Management
  Azure: Control Azure spending and manage bills (learning path)
  GCP: Create, edit, or delete budgets and budget alerts (Cloud Billing)
  GCP: Cloud Billing Budget API overview
  GCP: Committed Use Discounts (Compute)
  GCP: Understand your bill – pricing & billing (Google Developers)

September 8, 2025 · 1 hr 6 min

Ep. 25 - The Sound of Security

Episode 0025 - The Sound of Security Security is more than a feature; it's a pillar of the Well-Architected Framework. In this episode, Carl and Brandon explore how AWS, Azure, and GCP approach security across identity and access, infrastructure defense, data protection, monitoring, governance, and the shared responsibility model. They compare tools and practices like IAM, RBAC, and conditional access; network firewalls, WAFs, and DDoS protection; encryption at rest and in transit; and incident detection and automated remediation. The conversation also dives into security testing, drift detection with IaC, compliance posture, and how policy enforcement differs across the big three. The episode closes with a reminder that cloud security is always shared and never finished.

Links
  AWS: Well-Architected Framework – Security pillar
  AWS: Identity and Access Management (IAM)
  AWS: AWS Shield and WAF
  AWS: Amazon Macie
  AWS: Amazon GuardDuty
  AWS: AWS Config
  Azure: Azure Well-Architected Framework – Security
  Azure: Microsoft Entra ID (Azure AD)
  Azure: Azure Role-Based Access Control (RBAC)
  Azure: Azure Key Vault
  Azure: Defender for Cloud
  Azure: Microsoft Sentinel
  Google Cloud: Google Cloud Architecture Framework – Security
  Google Cloud: IAM overview
  Google Cloud: Cloud Armor
  Google Cloud: Cloud KMS
  Google Cloud: Data Loss Prevention (DLP) API
  Google Cloud: Security Command Center
  Google Cloud: Assured Workloads

August 4, 2025 · 52 min

Ep. 24 - Operating Excellently

Episode 0024 - Operating Excellently Operational excellence goes beyond uptime; it's about building and operating cloud systems with discipline, automation, and continuous improvement. Carl and Brandon break down what operational excellence really means, drawing a distinction between striving for perfection and building resilient, adaptable systems. They discuss how principles from AWS, Azure, and GCP converge around key practices like repeatable automation, structured change management, and process validation. The episode dives into real-world strategies for automation, incident readiness, and observability, including where and how to insert gates, use feature flags, and integrate infrastructure as code across cloud platforms. From avoiding certificate-induced outages to catching misconfigurations early, the key theme is consistency at scale. The discussion also emphasizes the cultural side: why shared ownership, retrospectives, and iterative postmortems matter just as much as tooling.
Links
  Ansible: Ansible community documentation
  AWS Docs: Amazon CloudWatch documentation overview
  AWS Docs: Operational Excellence whitepaper
  AWS Docs: Prescriptive Guidance: Operational Excellence
  AWS Docs: Using CloudWatch dashboards and alarms
  AWS Docs: Well-Architected Framework – Operational Excellence pillar
  AWS: Getting started with Amazon CloudWatch
  Google Cloud: Continuously improve and innovate
  Google Cloud: Manage incidents and problems
  Google Cloud: Operational Excellence pillar overview
  Google Cloud: Operational readiness & performance using CloudOps
  HashiCorp Docs: Terraform configuration language reference
  HashiCorp Docs: Terraform documentation
  Microsoft Docs: Automation of tasks with PowerShell in Power Platform
  Microsoft Learn: Azure Automation documentation
  Microsoft Learn: Azure Monitor documentation
  Microsoft Learn: Operational Excellence maturity model
  Microsoft Learn: Operational Excellence overview & quickstart
  Microsoft Learn: Operational Excellence principles (maturity model, practices)
  Microsoft Learn: PowerShell documentation
  PowerShell Universal Docs: PowerShell Universal platform guide
  Red Hat Docs: Ansible Automation Platform guide