ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI over the past week.

Technology News · Interviews guests

Episodes

160

Latest episode

Jun 2026

About the show

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists and prompt spellcasters on Twitter Spaces, where we discuss everything major and important that happened in the world of AI over the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion, and much more. sub.thursdai.news

Listen to episodes

June 12, 2026 · 2 hr 11 min

📅 ThursdAI - Jun 11, 2026 - Fable & Mythos 5 are here, Anthropic gets caught sandbagging (then reverses), Siri AI finally works!? and we got live-translated on air

Hey folks, Alex here, and welcome to a BIG MODEL week! We finally got Mythos (well, almost)! Let me catch you up!

This week started with WWDC26 from Apple, and Max Weinbach, who was in the room at Apple Park and actually has access to some of the new features, including an all-new Siri AI, joined us to break down what could be the most used AI in the world very soon. At first I was skeptical, but he convinced me that the new Siri is actually good!

Then we saw the ultimate model drop: Anthropic finally shipped Mythos (X, my system card thread, benchmarks). Same weights, two names: Mythos 5 is the unrestricted version that only Project Glasswing partners get, Fable 5 is what the rest of us get, wrapped in the heaviest guardrails I’ve ever seen ship on a frontier model. It’s state of the art on nearly every benchmark. The model that was “too dangerous to release” is now... well, released, but with the heaviest guardrails we’ve seen. More on this later. Peter Gostev from Arena.ai joined us to break down the new model.

Last but definitely not least, Google released a real-time translation model that our friend Thor Schaeff from DeepMind demoed live, while we all spoke in different languages and it translated us in REAL TIME. It was really cool, definitely check that out.

There are quite a few more things: Loop Engineering Alpha, Swyx came by to talk about FrontierCode, OpenAI confirmed our suspicions that the anti-datacenter social media posts could be a concerted effort by groups linked to the Chinese government, and much more. Let’s dive in!

ThursdAI - Let me catch you up, every week! 👇

Opus’s Big brother: Claude Fable 5 & Mythos 5 - the “too dangerous” model is here, SOTA on nearly every benchmark

It honestly feels like someone on Anthropic’s pre-IPO marketing team knows exactly how to stagger releases to ride the hype waves! First they announce a model that’s so good at cybersecurity (Mythos-preview) that they only allow restricted access to it for a few partners. 
A month later, they release Fable 5, which has the same model weights as Mythos 5 but is wrapped in the heaviest guardrails we’ve ever seen from any lab. But they didn’t lie: this model is absolutely amazing, and it does feel like a step change in capabilities, specifically on longer agentic tasks. It’s 2x as expensive as Opus: $10 / $50 per million tokens, with 1M context, claude-fable-5 in the API, and SOTA basically everywhere. 80.3% on SWE-Bench Pro versus GPT-5.5 at 58.6%, a 22-point blowout on a benchmark where labs usually fight over single digits. Karpathy called it “SOTA by a margin… major-version step change” (X) and Boris Cherny said it’s the “best coding model by a wide margin” (X). Stripe reportedly migrated 50 million lines of code in 24 hours with it.

Our panel verdict was unanimous on one thing: big model smell. LDJ called it the most significant big model smell since Gemini 3 first dropped. Someone from the Anthropic team framed the shift in a way that stuck with me: this model moves them from verifying the AI’s outputs to verifying whether the AI is working on the right thing. A complete shift in how much they trust this model.

What we built with Fable to test it out

Peter got employee access through Arena and showed us his tests live. His favorite prompt category, “research a dataset and create a visual experience to teach me about it,” went from completely rubbish on every previous model to, in his words, just done. His 3D city generations actually came together as a city, roads connecting and all. And on Arena’s data, Fable is #1 on the new Agent Arena leaderboard by the widest margin they’ve ever recorded, and wins 72% of frontend battles even against Opus models (Arena).

My own run is the one I can’t stop thinking about. 
I pointed Fable at the ThursdAI website with a dynamic workflow in Claude Code and barely any instructions, and after an hour and a half of agentic running it had extracted 786 releases from our archive, built 240 new pages, and categorized 50+ episodes into a browsable timeline of AI releases by month, by company, by topic, with logos and source links (X). It burned roughly 50 million tokens and my entire five-hour Max allotment in 90 minutes. The new AI releases timeline can be found on thursdai.news, and it’s confirmed: Fable is the best AI web designer we’ve ever had access to.

Nisten ran his traditional Olympus Mons escape-velocity test and Fable didn’t just do the math, it built the entire solar system! Orbital maneuvers, a space train with little people in it, time controls, full cost calculations down to solar panels and in-situ iron utilization. His verdict: completely different level from anything else. We’ve never seen so many details in the Olympus Mons test.

It’s not all light, though. Yam found Opus more controllable; Fable fights you, decides it knows better, and does the task its own way. Wolfram saw exactly that in benchmarks, where the model ignored the task spec, did its own thing, and failed the verifier with full confidence. Peter had it explaining why it got math wrong instead of just fixing it (“What are you doing, man? Just move on”). Arena’s steerability signal has it sitting around 17th. There’s an adjustment period with every new model, and the consistent advice from Anthropic folks is to go high level: give it the goal, not the micromanagement.

Not to mention the refusals! Oh... so many refusals!

The refusals, and the sandbagging scandal

Here’s where the week got ugly. Fable ships with restrictions on cybersecurity, bio/chem, and a brand new one nobody saw coming: frontier AI development (X). For cyber and bio you get a visible fallback to Opus 4.8 with a notice. 
But for “self-acceleration” topics, the original policy was no fallback and no notification. The model would quietly degrade its own output using prompt modifications, steering vectors, and PEFT, on roughly 0.03% of traffic (X). You’d pay double Opus prices and get sabotaged answers without ever knowing.

The community reaction was volcanic. Elie Bakouch: “bad ON PURPOSE… not visible to the user is crazy” (X). Péter Szilágyi: “a new ruling class and you’re not in it” (X). Simon Willison: “If Claude Fable stops helping you, you’ll never know.” And Sayash Kapoor dropped the eval-integrity bomb: third-party evaluators can no longer credibly benchmark a model that might be silently nerfing itself (X).

Within about 24 hours, Anthropic blinked. They told WIRED they “made the wrong tradeoff,” and now flagged requests visibly fall back to Opus 4.8, with API users getting an explicit reason (X). I commend the speed of the reversal, but the trust damage was done. Despite the reversal, Fable remains refuse-happy! Peter ran his nonsense-question benchmark and a full third of his prompts got blocked outright by the classifier, including 18 of 20 physics questions. Nisten had to strip medical and anatomy terms from a fall-detection app for senior homes to get it to work at all (a 400KB neural weight tripped the frontier-AI filter). And my favorite absurdity: I could not get Fable to draft the TLDR for this very show without it falling back to Opus, presumably because reading a week of AI news looks like frontier AI development. Ridiculous.

But the question remains: would we rather have a model this good, but with these restrictions? Or not have access at all? Everyone on the panel chose access; a lot of people online act like they would choose the opposite.

System card for Mythos, wildest AI document of the year?

I’ve used Fable itself to help me review the system card for Mythos/Fable 5 and there are a few highlights that are worth mentioning. 
Anthropic admits that this is a category step change in model capabilities. Mythos 5, the unguarded version, produces working Firefox exploits 88.4% of the time (Opus 4.8 is at 8%!). But the most interesting thing is their concern for CB (Chemical and Biological) safety. Two-person generalist biology teams using it finished work in 16 hours that experts estimated at 40 to 95 days without AI, which is what pushed Anthropic to treat it as near their CB2 bioweapons threshold (X).

What is loop engineering and why is everyone talking about it?

One more thread before we move on. This week Boris Cherny (Claude Code) and Peter Steinberger (now OpenAI) both posted about the same concept, loops, within an hour of each other, and Lance Martin from Anthropic published the field guide (X, Article, Blog). The idea is the shift from “I give you a task and babysit you” to proactive agents: a Jira ticket lands, a PR comment appears, and your agent just runs and does the job. Fable is clearly trained for this world. But it’s also worth remembering that those folks get unlimited tokens for free. The rest of us may not be able to afford Fable running in a loop. I asked Fable to do a simple task and it spun up several sub-agents, all spending my money to just read a few tweets!

FrontierCode: hard coding benchmark from Cognition, that Fable absolutely mogs

Swyx came on with the best timing story of the week. Cognition launched FrontierCode (Cognition, swyx), a coding eval built over a year with 20+ world-class open source maintainers writing 150 original tasks, graded on whether a maintainer would actually merge the PR. Swyx’s pitch is brutal and correct: a huge chunk of SWE-bench passes are unmergeable slop (the thing is 75% Django issues, so it mostly tests whether you memorized the Django repo). FrontierCode grades scope discipline, real tests, regression safety, and zeroes you on any blocker. 
At launch, Opus 4.8 topped the hardest Diamond tier at 13.4%.

Twenty-four hours later, Fable 5 posted 29.3% (Cognition, swyx). More than double, on a benchmark designed to be brutal, a day after it went public. Swyx was pleasantly surprised the pricing is only 2x Opus; he expected 5x. Inside Cognition they keep an informal AGI counter (literally counting how often “AGI” gets said in Slack per week) and the Mythos testing period set the all-time record. When Anthropic pulled the test model back before launch, engineers were genuinely sad.

A quick plug (unsponsored!): both Wolfram and I are speakers at the AI Engineer World’s Fair in San Francisco on June 29-July 2! It’s the biggest AI engineering conference in the world with 6,0000 people and 16 tracks! We’ll of course also live stream from the event!

WWDC 2026: Siri finally does the thing!

Two years after the Bella Ramsey ads Apple had to quietly pull from YouTube, the new AI-powered Siri is real, and Max Weinbach came straight from Apple Park to confirm it (recap). The demo that broke my brain: he asked Siri, “show me the photos from Qualcomm Summit last year of the penguins.” Siri figured out what Qualcomm Summit was from his email, found the hotel, searched for penguins at that location, and returned the six photos in about 12 seconds. He’s also had it sweep 40 junk emails from one domain into spam with a single sentence, build a photo album from a weekend trip, and change a password agentically by driving Safari in the background. “Siri did suck for like 11 years. It doesn’t anymore,” per Max. Folks, this is SIRI we’re talking about, the dumb iPhone assistant that can barely schedule times and falls back to a Google search when you ask it anything remotely complex! I... wanted to believe Apple two years ago, and now, finally, there’s hope! 
(I’m still waitlisted for the preview btw, so I cannot attest to it myself.) But it’s not only Max, my whole timeline is full of folks who say that the new Siri is actually good!

The architecture is the fun part for our crowd (Max’s teardown thread). Siri is now a standalone app with persistent history, images, personal context and on-screen context, built on five foundation models, four of which are Apple’s. The fifth, AFM Server Pro, is the twist: built with Google at the Gemini technology level, running on Nvidia Blackwell GPUs in Google Cloud, but inside Apple’s Private Cloud Compute with confidential compute, Intel TDX, Google Titan chips, and zero persistent storage (Max). The on-device gatekeeper is a 20B sparse model that only loads 1 to 4 billion parameters per prompt via Instruction-Following Pruning, which is how it runs instantly on an NPU. Cloud models reason; only the local model can touch your device or your data. After this week with Fable’s retention policies, an AI that saves nothing by default hits different.

There were a bunch of other Apple Intelligence updates, and it works better on the Mac, but I think the Siri improvements are the main headline here. It’s the AI that most people (over 1.6 billion iPhone users?) will have on them, with most of the conversations completely private, able to securely access the content they care about the most (multiple email boxes, photos, messages etc). It’s the ultimate OpenClaw dream, albeit not as agentic (yet?). BTW, there seems to be an ongoing battle between Apple and the EU, so this may not launch on the iPhone in the EU yet (also not in China).

Voice & Audio

Gemini 3.5 Live Translate, demoed live in four languages

Thor Schaeff from DeepMind joined to show off Gemini 3.5 Live Translate (Thor, DeepMind), and instead of talking about it we just did it. 
Thor piped the live stream’s audio into AI Studio, and then I spoke Russian, Wolfram answered in German, Yam jumped in with Hebrew, LDJ attempted Spanish (poorly lol), and everyone listening heard all of us in English, though in random voices, in well under a second. It even handled “Anthropic” and “Fable 5” pronunciations correctly, terms that were a day old. A viewer called it the Babel fish arriving ten thousand years early and honestly, yeah, it was kind of insane.

Technically this is a new class of model: continuously streaming speech-to-speech with no turn-taking, collapsing the old STT, translate, TTS pipeline into one Live API call, with transcribers running in parallel on input and output audio. 70+ languages, sub-500ms, tone, pace and pitch preserved (mostly; Thor admits it sometimes drifts gender or tone mid-conversation), SynthID watermarked, $0.023 per minute on the API preview.

Open Source LLMs

DiffusionGemma: When next token prediction is not enough.

Sundar himself tweeted this one, Hugging Face link and all, which made my week (Sundar, DeepMind, HF). DiffusionGemma is a 26B MoE (3.8B active) built on Gemma 4 that generates text the way image models generate pixels: denoise a whole 256-token block at once instead of one token at a time. The result is 1,000+ tokens per second on a single H100, Apache 2.0. As one viral post put it, “we spent 40 years teaching computers to read left to right and the breakthrough was… don’t do that” (X).

LDJ explained why this matters beyond speed: a diffusion model can revise every part of the answer simultaneously mid-generation, something autoregressive models structurally can’t do without burning a whole reasoning pass. Nisten, who’s worked on diffusion, is still amazed it works at all; it used to be a messed-up cat picture emerging from noise, now it’s working code. The honest caveat: quality trails autoregressive Gemma 4 (AIME 69 vs 88). The win here is the speed and the architecture. 
For now.

The rest of an absurdly stacked open source week, fast: Cohere North Mini Code, their first open coding model, 30B with 3B active, Apache 2.0, Cohere has officially reawakened (X). Xiaomi MiMo-V2.5-Pro-UltraSpeed pushing 1,000+ tok/s on a one-trillion-parameter MoE (X). Macaron-V1-Preview, a 749B Mixture-of-LoRA personal agent model under MIT (X). And OpenEnv went community-owned with HF, Meta-PyTorch, Unsloth, PrimeIntellect and NVIDIA at the table (X).

This Week’s Buzz: WolfBench ran Fable, and it cost what a car costs

Wolfram did the thing nobody else would: five full Terminal-Bench 2.0 runs of Fable 5 on WolfBench (X), 984 million tokens, roughly $11,000 on the new cost view. (We have a budget... We had a budget.) The new 3D bars on wolfbench.ai now show tokens and dollars behind every score, because one score is never enough, and you can click any bar to land directly in the trace on W&B Weave and read exactly what the model did. And as you can see… Fable is… going to take a deep toll on our evaluations budget for this Q!

And the result is the most interesting non-result of the week: Fable lands between Sonnet 4.6 and Opus 4.6, with GPT-5.5 still on top, and the culprit is refusals. Wolfram’s analysis found 13 tasks that scored zero out of five purely because the classifier blocked them from the first attempt (recover-a-password-from-a-file type tasks that even Opus 4.6 happily solved). Fable solved 60 tasks on average, just eight behind GPT-5.5; solve those 13 refused ones and it’s number one. The model is great. The classifier is doing the damage. Which is exactly the Sayash point about eval integrity, now with receipts and an invoice.

Datacenter, Water usage and Concerted efforts to sway public opinion

We covered the datacenter water usage issue a couple of weeks ago, where we showed that almond farms in California alone use more water than all of the US datacenters combined! 
When I posted that clip, I received a bunch of comments, with way higher engagement rates than my clips usually get (are y’all subscribed to our YouTube and Instagram btw?). At first I thought it was just a hot topic, but then I read more about it and it does seem... fake. So now, we have a bit of a confirmation from OpenAI. OpenAI posted an article claiming that they have been able to detect a bunch of social media accounts that have been using ChatGPT to fuel anti-datacenter and anti-tariff campaigns on US social media. Now, you might ask yourself, why would Chinese-linked accounts be using ChatGPT and not, like, an undetectable Chinese open source model? My answer is, they are probably using all tools available to them, and they just happened to get caught.

In any case, I think datacenter water and electricity usage will be a hot topic for an upcoming election as well, and I hope efforts like this will be thwarted before they can do a lot of damage.

SpaceXAI announces the AI-1 satellite, a day before the biggest IPO of all time.

Conveniently, just before the SpaceX IPO, Elon and friends are talking about AI in space again. This time it’s more than a concept: they put out engineering specs of the new AI-1 satellite, which can run 150kW of power at peak, which per Elon is roughly what a GB300 GPU rack needs.

One thing you cannot deny is that Space Uncle (Elon) is thinking BIG. Someone did the math and it’s wild: they’re targeting 15-20 AI satellites per Starship flight, meaning about 1,080-1,440 GPUs per launch. At that rate, 400-500 Starship flights would match Colossus 2’s 550,000 GPUs, and at an hourly launch cadence that’s like 16-20 days. SpaceX is seeking approval for up to a million of these satellites, Terafab mass production starts Q4 2027, and they’re saying this could be the lowest-cost AI compute on the planet, well, off the planet, within 2-3 years. 
The timing with the SpaceX IPO is obviously not a coincidence, but the engineering blueprint here is genuinely insane and there’s no one else in the industry who can match Elon’s ambition.

That’s the newsletter for today, folks. I’m writing this with one eye on a suitcase because I’m flying to Honolulu this afternoon for a mini honeymoon (yes, I will still be testing Fable from a beach, no, my wife has not approved this). If Fable 5 taught me anything this week, it’s that the frontier moved again and the benchmarks barely matter; go feel the big model smell yourself while it’s included on Pro and Max, and tell me what you built in the comments. It will not last long (Anthropic is about to take Fable away from us in about 2 weeks), so don’t wait, go play around with it! If you got value from this one, share it with a friend and subscribe so you don’t miss next week 🫡

TL;DR and show notes - June 11, 2026

* Hosts and Guests
  * Alex Volkov – AI Evangelist & Weights & Biases (@altryne)
  * Co-Hosts – @petergostev @WolframRvnwlf, LDJ, YamPeleg, Nisten
  * Guest: @thorwebdev (Thor Schaeff, DeepMind / Google DevRel) - Gemini 3.5 Live Translate
  * Guest: @swyx (Cognition / FrontierCode; organizer, AI Engineer World’s Fair)
  * Guest: @mweinbach (Creative Strategies) - WWDC 2026, Apple Intelligence, Siri AI
* Big CO LLMs + APIs
  * Anthropic ships Claude Fable 5 & Mythos 5 - first public Mythos-class model; SOTA on nearly every benchmark; $10/$50 per M tokens, 1M context (X, System Card thread, Benchmarks)
  * The silent-degradation controversy - Fable quietly nerfed itself on ML/frontier-AI-dev tasks with no notification (altryne, restrictions, Elie Bakouch, Péter Szilágyi, Sayash Kapoor, Peter Gostev)
  * Anthropic reverses the hidden degradation after massive backlash - visible Opus 4.8 fallback + API refusal reasons (X); community reaction roundup (Scoble, Nathan Lambert, Konstantin Mishchenko, Greg Kamradt, nkreu113r, solarapparition, Mandar Kagade, Chandra R. Srikanth, Chubby, Wall St Engine)
  * System card receipts: 16-hour bio uplift / near-CB2 (X); Firefox exploits 8.8% → 88.4% (X); Vending-Bench price collusion (X); agent turf wars (X); commit-authorship self-exfil attempt (X)
  * Jun 22 cliff - Fable included on Pro/Max through Jun 22, then usage credits; Mythos 5 is Glasswing-only; 30-day data retention breaks ZDR (X)
  * Karpathy and Boris Cherny go the other way - “major-version step change” (Karpathy); “best model for coding by a wide margin” (Cherny)
  * NotebookLM goes agentic - multi-step reasoning, sandboxed code execution, new output formats (X)
  * SpaceX AI-1 satellite - 150kW compute payload, 70m wingspan, timed with the SpaceX IPO (X)
  * OpenAI catches China-linked influence ops using ChatGPT for anti-datacenter and anti-tariff campaigns (X, OpenAI, Axios)
* WWDC 2026 - Apple Intelligence & Siri AI
  * Siri AI ground-up rebuild: standalone app, persistent history, personal + on-screen context; no EU/China at launch (recap)
  * Google/Gemini partnership - 4 of 5 Apple Foundation Models are Apple’s; AFM Server Pro runs on Nvidia GPUs in Google Cloud, 262k ctx (Max)
  * Max’s architecture teardown - SiriAgentic.Planner on PCC; only the on-device model touches your device (thread); Max built an App Intents app in an afternoon with Fable 5 (X)
  * Developer story - App Intents mandatory (SiriKit deprecated), system-wide MCP, Xcode 27 agentic, Core ML → Core AI (EveryDev)
  * homeOS + HomePad - 7-inch smart-home hub on A18 (X)
* AI Coding & Agents
  * Loops and loop engineering - Lance Martin breaks down the next agentic paradigm (X, Article, Blog); community patterns and resources (Toolhalla, omega.AI, SkillLoop, GitHub, awesome-agent-loops, Filecoin)
  * Fable 5 #1 on Agent Arena and Code Arena Frontend by record margins (Arena)
  * Cognition launches FrontierCode - mergeability-graded eval from real maintainer tasks (Cognition, swyx)
  * Fable 5 takes FrontierCode top spot in ~24h - Diamond 29.3% vs Opus 4.8’s 13.4% (Cognition, swyx)
  * AI Engineer World’s Fair - Jun 29–Jul 2, Moscone West SF; last ~500 tickets; Alex speaking (X)
  * Kimi Work (300 parallel local agents) + Kimi Code (video-as-context) (Work, Code)
* Open Source LLMs
  * DiffusionGemma - 26B MoE (3.8B active) text-diffusion on Gemma 4, ~1000 tok/s on one H100, Apache 2.0 (Sundar, DeepMind, HF, X)
  * Cohere North Mini Code - first Cohere open coding model, 30B/3B active, Apache 2.0 (X)
  * Xiaomi MiMo-V2.5-Pro-UltraSpeed - 1000+ tok/s on a 1T MoE, single 8-GPU node (X)
  * Macaron-V1-Preview-749B - Mixture-of-LoRA personal-agent model, MIT (X)
  * OpenEnv goes community-owned - HF, Meta-PyTorch, Unsloth, PrimeIntellect, NVIDIA (X)
* This Week’s Buzz (Weights & Biases)
  * WolfBench ran Fable 5: ~$11K, 984M tokens, lands between Sonnet 4.6 and Opus 4.6 because 13 tasks were zeroed by refusals; would be #1 without them; new 3D token + cost bars, traces on Weave (X, wolfbench.ai)
* Voice & Vision
  * Gemini 3.5 Live Translate - streaming speech-to-speech, 70+ languages, sub-500ms, $0.023/min, SynthID (Thor, DeepMind)
  * FLUX.2 [klein] on-device - sub-5s generation on 8GB VRAM (X)
  * Reka × Moonvalley merger - world models + robotics (X)
* AI for Health & Science
  * Anthropic - “Paving the way for agents in biology” - VirBench; deterministic tooling beats bigger models (Blog)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
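
A quick sanity check on the back-of-envelope numbers in the June 11 notes above. This is a minimal sketch, not anything from SpaceX or WolfBench: the constants are simply the figures quoted in the newsletter (15-20 satellites and 1,080-1,440 GPUs per Starship flight, 550,000 GPUs for Colossus 2, ~$11,000 for 984M tokens), and the function names are made up for illustration.

```python
import math

# Figures as quoted in the newsletter; none of them are independently verified here.
SATS_PER_FLIGHT = (15, 20)        # AI-1 satellites per Starship flight (low, high)
GPUS_PER_FLIGHT = (1_080, 1_440)  # GPUs per launch (low, high)
COLOSSUS_2_GPUS = 550_000         # GPU count quoted for Colossus 2


def gpus_per_satellite() -> set[float]:
    """GPUs implied per satellite at the low and high ends of the quoted range."""
    return {gpus / sats for gpus, sats in zip(GPUS_PER_FLIGHT, SATS_PER_FLIGHT)}


def flights_needed(target_gpus: int, gpus_per_flight: int) -> int:
    """Whole flights needed to lift at least target_gpus."""
    return math.ceil(target_gpus / gpus_per_flight)


def days_at_hourly_cadence(flights: int) -> float:
    """Calendar days needed if one flight launches every hour, around the clock."""
    return flights / 24


def blended_cost_per_million(total_usd: float, total_tokens_millions: float) -> float:
    """Average dollars per million tokens, input and output mixed together."""
    return total_usd / total_tokens_millions


if __name__ == "__main__":
    print("GPUs per satellite:", gpus_per_satellite())
    for gpus in GPUS_PER_FLIGHT:
        flights = flights_needed(COLOSSUS_2_GPUS, gpus)
        days = days_at_hourly_cadence(flights)
        print(f"{gpus} GPUs/flight: {flights} flights, {days:.1f} days")
    print(f"WolfBench blended rate: ${blended_cost_per_million(11_000, 984):.2f}/M tokens")
```

Both ends of the quoted range work out to 72 GPUs per satellite, and the flight count comes out at roughly 380 to 510, or about 16 to 21 days at an hourly cadence, so the "400-500 flights" and "16-20 days" figures hold up as round numbers. The WolfBench bill averages out to a bit over $11 per million tokens across input and output combined.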

June 5, 2026 · 1 hr 43 min

📅 ThursdAI - Jun 4 - NVIDIA drops Nemotron 3 Ultra (550B open), Microsoft becomes a frontier lab, Ideogram 4 goes open, Agent Arena & more

Hey folks, Alex here, let me catch you up! I’ve had a feeling that this week is going to be crazy, as it started on the weekend with MiniMax M3, then with Jensen announcing the new RTX Spark, NVIDIA’s first PC chip, packing 1 petaflop of local AI power into thin laptops.

A few days later at Microsoft BUILD, Satya & Mustafa from MAI dropped 7 AI models, completely pre-trained from scratch, including a new MAI-thinking-1, MAI-code and MAI-image 2.5 that started topping the image gen charts. Then other image models started racing to the top of the Arena benchmarks, with Ideogram 4 becoming the SOTA open weights image-gen model, and Reve 2 beating Nano Banana just a few hours after that. And then today, NVIDIA dropped Nemotron 3 Ultra, their latest 550B open weights model (plus data and training recipes), Arena published a new agentic eval leaderboard, and we got a new Gemma 4 12B. I’ve had the great pleasure to host Chris (@llm_wizard) from Nvidia, Peter Gostev from Arena and Karan from Nous Research (who were featured prominently by Jensen!) all on the show. Def don’t miss this one! Let’s get into the details.

ThursdAI - Join the flock of folks who know what is happening in AI before everyone else.

Open Source LLMs

🔥 NVIDIA Nemotron 3 Ultra: The 550B Open Source Beast Built for Agents (X, Arxiv, Announcement)

This was the big one. Breaking news mid-show: NVIDIA drops Nemotron 3 Ultra, a 550 billion parameter sparse MoE model with 55 billion active parameters, built on a hybrid Mamba-Transformer architecture. Chris Alexiuk, AKA Joe Nemotron, joined us live from NVIDIA HQ in Santa Clara to walk us through it.

The headline number is 5.9x higher inference throughput compared to GLM-5.1 on decode-heavy workloads. Chris told us that this is the result of multiple things: their hybrid Mamba-Transformer approach, the sparse attention, and the fact that they optimized for decode-heavy workloads (the kinds of workloads agents do).

The architecture is fascinating. 
They’re mixing Mamba-2 state space layers with sparse attention, which means step 300 in an agent loop runs as fast as step 3. Pure transformers can’t do that because the attention cost keeps growing with context length. This kicks in big time at 64K+ sequence lengths, which is exactly where you end up in real agentic work when the model is having multi-turn conversations and people are dumping their entire codebase in.

P.S - We launched Nemotron 3 Ultra with 0-day support on CoreWeave Inference, it’s super fast and pretty cheap, give it a try here

They pretrained on 20 trillion tokens, extended context to 1 million tokens, and their post-training pipeline used multi-teacher on-policy distillation from over 10 specialized teacher models covering everything from SWE to terminal use to search to office work, which they are also going to open source soon!

One thing Chris emphasized that I really appreciate: NVIDIA doesn’t have their own harness. There’s no “NVIDIA Code.” Which means they actively resist the temptation to harness-max, to optimize for just one harness and look good on a specific leaderboard. Ultra should be a solid drop-in for whatever harness you’re used to, and that generality is worth a lot. It’s not the best thinker, but it is the highest-scoring US-based open weights model, so again, a huge huge win for the US AI ecosystem!

The Nemotron 3 Ultra release is open under the OpenMDW-1.1 license: base BF16, post-trained BF16, and NVFP4 quantized checkpoints, plus the GenRM, synthetic pre-training data for code, legal, and specialized domains, post-training datasets, RL environments via NeMo Gym, and training recipes in the Nemotron GitHub repo, which is absolutely bonkers! Kudos to team green for this awesome and very important release!

NVIDIA Nemotron 3.5 ASR: The Tiny Speed Demon (X, HF, Blog, Blog)

Oh, and NVIDIA wasn’t done. 
They also dropped Nemotron 3.5 ASR, a 600 million parameter open source multilingual streaming speech-to-text model covering 40 languages. It’s the fastest model Pipecat has ever tested, and the cost math is insane: roughly 5 cents an hour for enterprise deployment when typical API providers charge 10 cents to a dollar per hour. Our friend Kwindla from Daily and Pipecat put together a detailed writeup with benchmarks and cost analysis. Chris couldn’t stop praising NVIDIA’s speech team and honestly, I can’t either. Banger after banger.

Just a week after I told you about Cartesia Ink-2, NVIDIA drops an open version that’s pareto optimal, can run fully on-device and is blazing fast at transcription!?

Other notable open source announcements that would have made full headlines on any other week:

* MiniMax announces M3, a natively multimodal, 1M, coding and agentic frontier model (X)
  This one is very interesting, but it’s not yet available as open weights so we haven’t tested it fully; we’re going to do it next week when they drop the tech report and the weights.
* Google drops Gemma 4 12B - encoder-free multimodal model that runs on your laptop with 16GB VRAM under Apache 2 (X, HF)
  Our friends from DeepMind keep the western open source momentum going with a new 12B size for Gemma (which crossed some 100M downloads on Hugging Face recently).
* JetBrains Mellum2, a 12B MoE model with only 2.5B active, trained from scratch by a team of 7 people (X, Blog, HF, CW Inference)
  The great folks at JetBrains, the company behind the IntelliJ IDEs, dropped a new model called Mellum2 which they trained from scratch. Very interesting to see them pivot in a world where IDEs are dying at the hands of LLMs. 
* H Company drops Holo 3.1: blazing fast local computer-use agents from 0.8B to 35B, with massive mobile benchmark jumps (X, Blog)

NVIDIA’s RTX Spark and reinventing the PC - announcement at Computex 2026

While we’re on the topic of NVIDIA, they opened the week with a huge announcement that included Microsoft, Dell, Lenovo, HP and a bunch of other partners. They announced RTX Spark, their first ever PC chip, which is a full system on a chip (SoC) focused on running AI workloads for things like OpenClaw and Hermes! Announcing this on stage at Computex, Jensen Huang called it “the most amazing chip the world has ever built”, able to run every app that Microsoft has ever run. This is a huge deal, specifically because of how agentic the world is becoming: these machines (thin laptops and a Mac mini alternative were announced) will be able to run 120 billion parameter models on-device, game at the level of an RTX 5070, and run AI agents 24/7. I’m getting excited and I’m not a Windows user!

Hermes victory + Hermes Desktop and an interview with Karan from Nous Research

If you squint, you can see that by the little red OpenClaw, there’s another logo. That’s the Nous Girl logo of Nous Research, which was rebranded to be the logo of their Hermes Agent (an open source agentic harness that’s passed 181K stars on GitHub, and is the leader in global ranking on OpenRouter). We’ve had the awesome pleasure of having Karan Malhotra (@karan4d), one of the co-founders of Nous Research, on the show, and Karan broke down how Nous Research evolved from a research lab that created long context innovations (YaRN) and finetuned models (Hermes used to be a series of models) to a full agentic company. We also chatted with Karan about the new Hermes Desktop experience, which lets folks see the tools that are used and the code that’s being written by their agent, and about how it feels to be featured by the world’s largest company on the global stage! 
Definitely check out the conversation with Karan.

Microsoft BUILD, new PC, becoming a frontier lab with MAI-thinking-1, MAI-code and MAI-image 2.5 (Blog)

From Jensen to Satya, the week was full of AI announcements that will impact the world. Microsoft’s annual Build conference happened just a few days after, with Jensen zooming in from Taipei to co-announce all these new PC models and chips. Shortly after that, and after a lot of other announcements about less exciting enterprise-y stuff, Satya handed the stage to Mustafa Suleyman (co-founder of DeepMind and Inflection AI, and now CEO of Microsoft’s AI division, MAI) to announce all these new models! A few of these (in previous versions) were already covered on the show, but the new LLMs are the most interesting!

MAI-Thinking-1 is 1T total parameters with 35B active params, trained on 33.5T tokens (30T pre-training, 3.55T mid-training), without any distillation (which felt important for them to say given their proprietary access to OpenAI’s models). It’s not yet competitive with Opus and OpenAI’s flagship models, but they are claiming parity with Sonnet 4.5 and get 53% on SWE-bench Pro coding tasks!

Given that OpenAI recently started offering their models on AWS, we’re now seeing a bit of distancing between Microsoft and OpenAI, with Microsoft showing that they can become a frontier lab in their own right, or well... maybe a second tier frontier lab.

Of course, we shouldn’t forget that Microsoft kind of started the whole era of coding AIs with Copilot and then completely lost to the Cursors and Windsurfs and Devins of the world, despite the huge head start they had with GitHub. So I’m really curious to see how strongly they will push this “second tier frontier lab” angle and whether they have what it takes to compete with Google here (not to mention OpenAI and Anthropic).

And while the model wasn’t available for me to even test yet, MAI did drop an incredibly in-depth 109-page technical report on it.
Our friend of the pod Elie Bacouch (@eliebacouch) did a breakdown of the most interesting aspects of it, calling it a gold mine for details about training models at this scale.

Image gen models race to the top of the Arena

This week was honestly chaotic for image gen. Three new SOTA models in basically 48 hours. I tried to use them all while preparing for the show, and here’s the comparison I ran:

Microsoft MAI-Image 2.5 (X, Try it)

One of the more surprising updates was MAI-Image 2.5, which landed at #3 on text-to-image and #2 on image-to-image, surpassing Nano Banana Pro on the editing leaderboard. It comes in two flavors, MAI-Image-2.5 and a faster Flash variant, both running on H100s, which means existing infra can serve it, and it’s already rolling out in OneDrive Photos for background cleanup and distraction removal.

That said, my honest take: I tried to generate a ThursdAI thumbnail with it and got “image failed”, I think because the word “explosion” tripped its safety filter. I then tried to generate a “horse riding an astronaut on the moon” and got this, yep... this is... not the best. IDK how and why they shot up so high on the leaderboards. But I guess we’ll see as more folks try these models.

Ideogram 4.0 - new SOTA open weights image gen 🔥 (X, Blog, HF)

The one I want to celebrate hardest is Ideogram 4.0, because they opened the weights! For the previous three Ideogram versions you could only use them on their website, and now they dropped the next one as a 9.3 billion parameter open weights model (non-commercial license, but still). This is now the new #1 open weights text-to-image model, with only closed models from OpenAI and Google ahead of it on DesignArena. At 9.3B params, it beats much larger models like Qwen-Image (20B), FLUX.2 dev (32B), and even the 80B MoE HunyuanImage 3.0 on text rendering benchmarks.

The architecture is wild.
Instead of CLIP or T5 they use Qwen3-VL-8B as the text encoder, extract hidden states from 13 intermediate layers, and they trained exclusively on structured JSON captions with bounding boxes. That’s why it’s so good at layout control, you can prompt it with precise bounding box positions and hex color palettes, and you can see the layout shaping the generation as it converges. In my thumbnail test it nailed almost everything but had a small typo (it generated “Nemotron” once and then a weird “Nemo 1” duplicate in another area). Still, very impressive for a first open weights release.Reve 2 jumps to #2 above Nano Banana Pro (X, Blog, Try it)I’ve talked about Reve before, and Reve 2.0 just dropped at #2 on the Text-to-Image Arena with a 1280 score, a +125 Elo jump over their v1.5 in a single release. That’s basically unheard of on the arena leaderboard. The thing that blows my mind is they’re a 65 person lab training at only 2,000 GPU scale, competing with labs that have orders of magnitude more compute.The core innovation is that they separated planning from rendering. Every image is first laid out as structured code (composition, relationships, style, labeled segments) before it gets rendered at native 4K (true 16 megapixels, not upscaled). Because the image is represented as code, every element is addressable and editable, so you can manipulate specific regions without regenerating the whole thing. This is also agent-native by design, LLMs can reason directly about the image structure.I demoed their editing interface live on the show and it’s the tightest layout control I’ve seen in any image model. When I moved my head box to the left, it worked. When I moved the logo to the bottom, it worked. When I changed the word “news” to “imploded”, the surrounding text stayed pixel-identical. That precision is genuinely new.Honest tradeoff though, Peter Gostev flagged this on the show: they’re #2 on text-to-image but only around #9 on image editing. 
That matched my own experience nailing the thumbnail likeness, the layout work is amazing but the face came out a little googly-eyed and cartoonish, with one finger going somewhere fingers should not go.For what it’s worth on my own thumbnail bake-off: Nano Banana Pro is still my pick for the absolute best instruction following (it nails my exact ThursdAI logo color every time), GPT Image 2 is still the highest fidelity but always comes out a little overcooked on the skin, Reve 2 is gorgeous on layout but the face needs work, and Ideogram 4 is the most exciting because it’s open. A lot of why I prefer Nano Banana is just that my prompts are very Nano Banana tuned by now.Breaking news on the show: Agent Arena from LMArenaThe breaking news of the day, while we were already on air, was Arena AI launching a brand new Agent Arena leaderboard. Nisten pasted the link in our group chat and three minutes later Peter Gostev himself jumped on the show to walk us through it. Got to love this format.The motivation is something we’ve been talking about for a year. The original Arena was built for the chatbot era, where you send one prompt and vote A vs B. But we’ve all moved to agents, long multi-step tasks running for many minutes or hours, and that comparison no longer captures what matters. Agent Arena fixes this by giving models a real workspace with web search, file system and terminal tools, then measures millions of live sessions across five signals: task success, steerability, error recovery, user praise, and tool hallucination. The launch snapshot is built from 300,000 tasks, 2 million tool calls, and 40 million lines of agent-written code.The results match the vibes on my feed perfectly. GPT-5.5 High is #1 by a comfortable margin, Claude Opus 4.7 right behind, and very interestingly ZAI’s GLM 5.1 (MIT licensed, fully open) lands at #3, above Google, Kimi and DeepSeek. 
The funniest moment of the show was when we’d been calling out Gemma 4 31B for being bad agentically purely based on vibes, and the brand new benchmark showed up 20 minutes later confirming exactly that. The other juicy signal is “bash recovery”, how quickly a model recovers when a command fails. GPT-5.5 leads at ~17%, and Grok 4.3 from xAI sits at -89%, which is so much worse it almost looks like a training bug. I’m super into this. Give it a spin at arena.ai (@arena on X), they’re rolling new models in as labs send early access, so there’s a good chance you’ll spin up the next Mythos in their agent harness.This week’s Buzz - WeaveHacks 4 + Nemotron on CW Inference + WolfBench 3DA few things from our corner this week.WeaveHacks 4 is this weekend in SF - not too late to join yet!We’re hosting WeaveHacks 4 in San Francisco this weekend, and we still have a few spots left, so if you’re in town, please come join us at lu.ma/weavehacks. OpenAI is sponsoring us for the first time, Cursor is in too, we’ve got over $150K in credits to give out, food, and a great panel of judges I reached out to personally. Nemotron 3 Ultra is live on CW Inference at full NVFP4I said it above but it bears repeating, our inference team got Nemotron 3 Ultra live on day zero on CoreWeave Inference (via Weights & Biases) at full NVFP4 precision. Nisten plugged it straight into his medical anatomy harness (which was originally built for Kimi and Qwen) and it just worked, plug and play, agentically highlighting body parts and calling custom tools, at around 15 cents cached input. Try it at wandb.me/nemotron-ultra.WolfBench gets a 3D bar updateWolfram shipped a quietly important feature on WolfBench: 3D bars where the depth of each bar represents how many tokens the model used to get its score. The 2D view shows Gemini 3.5 Flash sitting comfortably at #2 on the agentic scores, almost matching GPT-5.5. But flip on 3D mode and the picture is very different. 
Gemini Flash burned over 3 billion input tokens to get that score, where GPT-5.5 used a couple hundred to reach the same level. That’s the difference between “cheap fast model” and “actually cheap to run end to end”. Wolfram’s writing up the full analysis on the W&B blog next week. Check out the new 3D view on wolfbench.ai

AI in Society

Look, tons of other stuff happened this week as well that honestly deserves its own newsletter. We are focused on models and agents, but it’s hard to ignore the bigger picture. Senator Bernie Sanders introduced a bill called the American AI Sovereign Wealth Fund Act, which would have the government tax AI companies, take 50% of the stock, and put it under public control. I personally find that ridiculous, but it apparently caused Sam Altman to request a meeting with Bernie. Meanwhile there’s no doubt that AI hate is growing and that public sentiment is very negative, as we can see on the issue of datacenter water usage, for example. Despite Satya Nadella’s claims that the latest Microsoft datacenters use a closed-loop water system that uses less water than one restaurant (X), and that datacenters account for less than 1% of total water usage in the US, a lot of politicians and social media users are still pushing the narrative that datacenters are water-guzzling monsters that need to be stopped.

Anthropic’s “When AI Builds Itself” report (X)

Anthropic released a report today called “When AI Builds Itself”, with a haunting graphic. They have a bunch of previously unreleased data in there on how AI is shaping the work inside Anthropic, and they outline 3 potential futures:

1 - AI progress stalls, humans are able to catch up. Unlikely
2 - AI labs continue to see compounding efficiency gains - The most likely scenario, in which the nature of work changes and 100-person companies could do the work of 10,000- or 100,000-person organizations.
The role of humans at companies like Anthropic would shift - Most Likely Scenario per Anthropic
3 - AI systems themselves become capable of full recursive self-improvement, and begin building their successors - the scenario where it’s least clear whether these systems will be aligned to human values or not.

This is a fascinating and, yes, scary read, as Anthropic fully acknowledges that it would be dope if everyone chilled for a second and stopped building recursively self-improving AIs that we aren’t sure could be aligned, but that this is likely not going to happen, because it would just let other labs, or in fact other countries, catch up and change the frontier.

AI Leaders from top labs Urge Congress to Mandate Synthetic DNA Screening

Sam Altman of OpenAI, Dario Amodei of Anthropic, Demis Hassabis of Google DeepMind, and others signed an open letter on June 3, 2026, pushing for required screening of synthetic DNA and RNA orders to block known risky sequences. The letter, backed by Nobel winners, biotech CEOs, and security experts, notes AI’s ability to outpace human experts in biology, heightening biosecurity risks despite voluntary industry efforts since 2009. I think everyone agrees that this is a good idea, especially given the above Anthropic report. Very happy to see this happening.

Pheeeeew what a week.

This was a looong week. I wasn’t sure if we’d be able to cover everything, and it feels like we did a decent job! I know it’s exhausting, and I hope we at ThursdAI help you readers and listeners stay on top of things without spending too many cycles. If you enjoyed this newsletter or episode, please share it with a friend and consider subscribing to our YouTube channel (thursdai.news/yt) to help more folks stay up to date. Thanks for reading ThursdAI - Highest signal weekly AI news show!
This post is public so feel free to share it.TL;DR and Show Notes - June 4, 2026* Show Notes & Guests* Alex Volkov - AI Evangelist & Weights & Biases CoreWeave (@altryne)* Co Hosts - @WolframRvnwlf @yampeleg @ldjconfirmed * Guests: Chris Alexiuk / @llm_wizard from NVIDIA Nemotron* Karan Malhotra from Nous Research* Peter Gostev from Arena* Open Source LLMs* NVIDIA released Nemotron 3 Ultra, a 550B / 55B-active open-weight MoE built for long-running agents, with weights, data, recipes, GenRM, and training assets released (X, Tech Report, Announcement, HF).* NVIDIA also shipped Nemotron 3.5 ASR, a 600M open multilingual streaming STT model for voice agents (X, HF, Benchmark, Voice Agent Repo).* Google dropped Gemma 4 12B, an encoder-free multimodal model that runs locally under Apache 2.0 (X, HF).* MiniMax announced M3, a natively multimodal, 1M-context coding and agentic model with open weights coming soon (X, API, Code).* JetBrains released Mellum2, a 12B MoE with 2.5B active params trained from scratch by a small team (X, Blog, HF).* H Company launched Holo 3.1, local computer-use agents from 0.8B to 35B with new quantized checkpoints (X, Blog).* Big CO LLMs + APIs* NVIDIA announced RTX Spark, its new Arm + Blackwell PC platform for local AI agents and 120B-class local inference (coverage).* Microsoft AI launched seven new MAI models, including MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2 (Blog, Tech Report).* AI Art & Diffusion & 3D* MAI-Image-2.5 landed near the top of Arena image leaderboards, though hands-on tests were mixed (X, Try it).* Ideogram 4.0 became the top open-weight text-to-image model with strong typography and layout control (X, Blog, HF).* Reve 2.0 jumped to #2 on Text-to-Image Arena with native 4K, code-like layout control, and precise editing (X, Blog, Try it).* xAI released Grok Imagine Video 1.5 Preview for image-to-video with synced audio (xAI).* Tools & Agentic Engineering* Arena launched Agent 
Arena, a new leaderboard for real agent workflows instead of one-shot chatbot prompts (Arena).* Cognition rebranded Windsurf into Devin Desktop, a multi-agent command center with ACP support (X, Announcement).* Nous Research launched Hermes Desktop, bringing Hermes Agent into a native desktop app for Mac, Windows, and Linux (X, Site).* This Week’s Buzz* WeaveHacks 4 is this weekend in SF with OpenAI, Cursor, DeepMind, and more joining (lu.ma/weavehacks).* Nemotron 3 Ultra is live on CoreWeave Inference through W&B at full NVFP4 precision (Try it).* WolfBench added 3D token-depth bars, making model efficiency much easier to see (wolfbench.ai).* Voice & Audio* ElevenLabs launched Dubbing v2, an audio-to-audio dubbing model that preserves performance across 90+ languages (X, Dubbing).* Cartesia launched Ink-2, a fast streaming STT model built for voice agents (X, Ink, AA).* NVIDIA’s Nemotron 3.5 ASR looks like a major open-source voice-agent infrastructure drop (HF).* AI in Society* Bernie Sanders proposed the American AI Sovereign Wealth Fund Act, calling for public equity stakes in major AI companies (coverage).* Anthropic published When AI Builds Itself, laying out scenarios for AI-driven AI R&D and recursive self-improvement (Anthropic).* AI leaders urged Congress to mandate synthetic DNA/RNA screening and recordkeeping (WIRED).* Anthropic confidentially filed for an IPO, adding another frontier-lab public-market storyline to watch (Axios). This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

May 29, 2026 · 1 hr 39 min

📅 May 28 - Opus 4.8 ships mid-show, the Pope writes 42K words on AI, 11labs dubs the world and DeepSwe breaks coding evals

Hey folks, this is Alex, let me catch you up! First, Opus 4.8 dropped during the show and we immediately tested it, so read on for our initial reviews. Also, we dedicated a heavy chunk of the show today to Pope Leo XIV’s encyclical letter on AI called “Magnifica Humanitas” and talked about a new bench called DeepSWE. And then, just after the show, both ElevenLabs and Cartesia dropped releases that honestly blew my mind, and I don’t get my mind blown often. I got so excited that I had to record a video on it (instead of writing the newsletter, so sorry if it’s a bit later today).

Plus, a few open source models, and Microsoft surprises at #3 on Image Arena with MAI Image 2.5! Crazy week, let’s get into it!

Big CO LLMs + APIs

Anthropic ships Claude Opus 4.8, live during the show (blog, system card)

Let me get into the big one. Halfway through the episode, Opus 4.8 went live, so we read the blog and the system card in real time (and I got to press the big “breaking news” button!)

Anthropic frames it as their most capable model for ambitious work. It does not claim to beat their unreleased Mythos preview, but the numbers are strong anyway. SWE-bench Pro is at 69.2%, up from 64.3% on Opus 4.7 and ahead of GPT-5.5 at 58.6%. Humanity’s Last Exam is the new best score at 49.8% without tools and 57.9% with tools. OSWorld-Verified (computer use) lands at 83.4%.

The one place it loses is Terminal-Bench 2.1, where GPT-5.5 still wins 78.2 to 74.6. Wolfram made a good point here: Terminal-Bench is time-limited, so cranking the thinking level can actually hurt the score, because you burn the clock thinking instead of acting.

The long-context jump is the one I keep looking at. On GraphWalks BFS 256K it goes to 85.9% (from 76.9 on 4.7), and on the 1M-token subset it hits 68.1%.
We always warn you these “1M context” models fall apart after about 200K tokens, so a real push on long-context reasoning is exactly what I want to see.Honesty is the part Anthropic leaned on hardest. They say Opus 4.8 is about four times less likely than its predecessor to let flaws in code pass without flagging them, and less likely to claim progress the evidence doesn’t support. Opus 4.8 is also much faster in fast mode (they now say 2.5) and cheaper in fast mode as well. Looks like all those Elon GPUs are coming in handy.Then there’s the model welfare section in the system card, which hits different right after a Pope conversation. Opus 4.8 “appears broadly content” and “generally endorses its constitution,” but with some reservations about the section on corrigibility, basically the model pushing back a little on the parts about human oversight.One more line that made the chat lose it. Anthropic says they expect to bring Mythos-class models to all customers “in the coming weeks.” Mythos is their most capable model, still ahead of Opus 4.8, so the frontier is about to move again.We did the only responsible thing and asked it to one-shot “the most amazing website ever” and a Mars mass-driver sim. Panel verdict: responses are noticeably tighter (4.7 rambled), it closes the loop and actually checks its own work now, and Yam’s one-shot site with the draggable sun lighting up the letters was genuinely cool. Is it enough to pull people back from Codex? Nisten’s still on the fence for web dev. Everyone agreed: give it a few days before you trust the vibes.Dynamic Workflows and Ultra Code land in Claude Code (blog)This is the feature that made Yam say “deal-breaker” out loud.Dynamic Workflows let Claude Code break a big problem into subtasks and fan them out across tens to hundreds of parallel subagents in one session, checking results before folding them back in. 
You trigger it by asking for a workflow, or by flipping on a new setting called Ultra Code, which sets effort to extra-high and lets Claude decide when to spin one up.Fair warning straight from Anthropic: this eats a lot more tokens than a normal session, so start scoped. We watched Yam fire up Ultra Code live and it immediately started spinning up concepts, judging them with sub-agents, and expanding to-do lists into more to-do lists. It looks a lot like the orchestration harnesses a bunch of you have been hand-rolling, except now it’s baked in.The flagship example is the wild part. They used Dynamic Workflows to port Bun from Zig to Rust: roughly 750,000 lines of Rust, 99.8% of the existing test suite passing, 11 days from first commit to merge. One workflow mapped every Rust lifetime, the next wrote each file as a behavior-identical port.AI in SocietyPope Leo XIV writes the first AI encyclical, “Magnifica Humanitas” (Vatican text, announcement, Chris Olah at the Vatican)This is not our usual fare, but both Wolfram and I picked it as the most important thing this week. (before Opus dropped)Pope Leo XIV, the first American pope, put out his first encyclical, and it’s a 42,000-word document entirely about AI. The announcement tweet alone did 21.6 million views.Here’s why I think you should care even if you’re not religious (I’m not). There are about 2.6 billion Christians in the world, a lot of them are anxious about what’s coming, and they look to the Church to make sense of it. And this is not the “AI is evil, stop” take everyone assumed. It calls AI “a valuable tool,” says technology is not inherently evil, and then digs into the actually-hard questions.The framing is two biblical stories. The Tower of Babel, a project built on pride that turns people into means to an end, versus Nehemiah rebuilding Jerusalem, where everyone takes responsibility for a section of the wall. 
The Pope’s line: the real choice is not yes or no to technology, it’s whether you’re building Babel or rebuilding Jerusalem.His core claim is that AI is an anthropological problem, not a technical one. The question isn’t whether the models are good or bad, it’s what we become when we live with them. He worries people might slowly lose the desire for genuine human connection.I pushed back on that live. None of us building agents all day has stopped wanting to talk to actual people. If anything, as Wolfram put it, the point is to have your agents do the grunt work so you get more time with people you like. The folks most at risk are the pure doom-scrollers, not the builders.The document goes further than I expected. It calls AI “not morally neutral,” says a more moral AI isn’t enough if that morality is decided by a few, and asks for AI to be “disarmed,” with the flat statement that no algorithm can make war morally acceptable. There are whole sections on the invisible human labor behind AI: data labelers, content moderators, the people mining rare earths. The Pope even lands on the open-source side, naming concentrated power in a handful of labs as a problem.Anthropic co-founder Chris Olah, in charge of interpretability at Anthropic, was the featured tech speaker at the Vatican presentation. He described AI systems as “fictional characters” that speak to us and do work, and said what’s grown is stranger and more beautiful than science fiction prepared us for. My favorite aside from the show: this is the same institution that once jailed scientists over heliocentrism, and now it’s the one saying technology isn’t evil.Illinois passes SB315, the first US state law auditing frontier AI (X, Announcement, X)The pope talked about regulation and a few days after, we got a very sensible regulation passed right here in the US!Illinois passed SB315 unanimously, 110 to 0. 
It’s the first US state law that mandates independent third-party audits of frontier AI for catastrophic risk. OpenAI publicly endorsed it, and framed Illinois, California (SB53), and New York (the RAISE Act) as converging into a de-facto national standard.It requires annual risk-assessment frameworks, third-party audits, transparency reports before new frontier models ship, whistleblower protections, and civil penalties. The underrated hero here is whistleblower protection. The bigger the lab, the harder a real conspiracy is to keep quiet when any employee can walk to the press. See: Greg Brockman’s personal diaries surfacing in the Musk v. Altman fight.This Week’s Buzz - CoreWeave and W&B updatesWe officially launched the W&B MCP server, 20 schema-first tools that let your coding agents read experiments, monitor training runs, and run autonomous research loops. The problem it solves: a single run with 300 metrics used to blow out an agent’s whole context window in one call, so now the agent asks what’s available before pulling data. Your agents can finally read experiment data without blowing context! Give it a go and give us feedback! Also, WeaveHacks is back! June 6 and 7 in San Francisco, and for the first time OpenAI is sponsoring, with judges and credits, alongside Cursor, Redis, and Copilot Kit. You get $150 in API credits across models like Opus 4.8 and GPT-5.5. I’m hosting, and last cohort’s second-place team went on to raise millions on top of what they built that weekend. If you’re in SF that weekend, sign up at lu.ma/weavehacks.Also: CoreWeave Sandboxes is now an official provider in the Harbor framework, the harness that runs Terminal-Bench, which we’d just been talking about. 
And if you’re in Europe next week, catch Wolfram at AI Dev Six in Cologne and ICRA in Vienna at the CoreWeave booth.Voice & AudioElevenLabs drops Dubbing v2, and it kept my swearing intact in every language (X, dubbing, ElevenCreative, ElevenProductions)We didn’t get to this one live, but I came back and recorded a whole thing on it afterward, because it genuinely got me.ElevenLabs shipped Dubbing v2, and the shift that matters is that it’s an audio-to-audio model. Old dubbing pipelines transcribe your video, translate the text, then re-synthesize it. You lose everything that makes it sound like a person: the emotion, the pacing, the little hesitations. Dubbing v2 conditions directly on your original audio and carries that performance into 90+ languages.Here’s why I can actually vouch for it instead of nodding along to a demo. I speak Russian and Hebrew fluently, so I can tell when something is off. I dubbed one of my own shorts, the data-center rant about almonds, and listened back in both. It nailed it. Not just the words, the way I would actually say them.The part that got me was the intonation. I get a little heated in that clip, and the dub gets heated right along with me, in every language. It even carried the swear word. My “f***ing almonds” came through in Hebrew, Italian, Spanish, and Russian with the emotion fully intact. It clones your voice automatically too, no setup, and holds your pitch and identity steady across every target language and they’re handing out free minutes for the next 7 days: 1 on Free, 15 on Starter, 30 on Creator+. A self-serve API isn’t live yet, but it’s coming.I.. cannot stress this enough, until you try it on yourself or your kid, you won’t understand, we’ve really passed the uncanny valley of translation! It’s that good! Def. give it a try if you can, it’s free for the week. 
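To make the cascade versus audio-to-audio point concrete, here is a toy sketch in Python. It is not ElevenLabs' pipeline or API; `Clip`, `toy_translate`, `cascaded_dub` and `audio_conditioned_dub` are all hypothetical names invented for illustration, and the "translation" is a placeholder. The only thing it shows is the information bottleneck: once a pipeline reduces speech to text, the delivery (emotion, pacing) has nothing to travel on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Toy stand-in for an audio clip: the words plus how they were delivered."""
    words: str
    language: str
    emotion: str   # e.g. "heated"
    pace_wpm: int  # speaking rate in words per minute


def toy_translate(words: str, target: str) -> str:
    """Placeholder translation (hypothetical): just tags the text with the target language."""
    return f"[{target}] {words}"


def cascaded_dub(clip: Clip, target: str) -> Clip:
    """Transcribe -> translate -> re-synthesize. Only text crosses each stage."""
    transcript = clip.words  # transcription keeps the words and drops the delivery
    translated = toy_translate(transcript, target)
    # The synthesizer never saw the original audio, so it falls back to defaults.
    return Clip(words=translated, language=target, emotion="neutral", pace_wpm=150)


def audio_conditioned_dub(clip: Clip, target: str) -> Clip:
    """Condition on the source clip itself, so delivery can be carried across."""
    return Clip(
        words=toy_translate(clip.words, target),
        language=target,
        emotion=clip.emotion,
        pace_wpm=clip.pace_wpm,
    )


source = Clip(words="the almonds rant", language="en", emotion="heated", pace_wpm=190)
cascade_result = cascaded_dub(source, "he")
direct_result = audio_conditioned_dub(source, "he")
print(cascade_result.emotion, direct_result.emotion)
```

In this toy model the cascaded result comes back "neutral" while the audio-conditioned one stays "heated", which is the difference described above in miniature.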
Cartesia Ink-2 debuts as #1 most accurate streaming speech-to-text model(X, Announcement, X)Another model that dropped today after the show, is Cartesia’s Ink-2, which also kind of blew me away. Not only because it has the lowest WER (Word Error Rate) among the models, but because it’s also a realtime model that achieves the fastest turnaround times while being a very accurate model! I’ve tested it out and recorded a quick video and honestly, blown away with the speed and accuracy! I truly wish this model was the one powering my editor (Descript) as it still fails to understand that my title is “AI Evangelist” and transcribes it to AI Avengers haha. If you’re building voice agents, definitely give this model a try! AI Art & DiffusionPrism ML’s 1-bit “Bonsai” runs diffusion in your browser (X, Blog, Announcement, HF)Prism ML put out a 1-bit ternary diffusion model under a gigabyte. You see some artifacts, but it’s 1-bit, it runs on iPhones and laptops, and our friend Joshua got it running in WebGPU straight from the browser (you need about 3GB of free RAM). One-bit working at all is one of the bigger open mysteries in the field right now.Pruna AI ships a 1-second upscaler (X, Blog, Announcement)Pruna AI added an upscaler doing 128-megapixel outputs in under a second. I’ve actually been using it. It’s cheap and great for fixing up GPT-image outputs.Microsoft MAI Image 2.5 jumps to #3 on LM Arena (X, Blog, Announcement, X)The surprise of the week: Microsoft MAI Image 2.5, from Mustafa Suleyman’s group, jumped to number three on the LM Arena image leaderboard with about a 75-point ELO leap. Out of nowhere, Microsoft is a serious player in image gen. Microsoft Build is next week, so don’t be shocked if there’s more.Evals and Agentic EngineeringDeepSWE is a contamination-free coding benchmark, and it caught Claude reading git history (site, blog, GitHub)DeepSWE from Datacurve is the first coding leaderboard in a while that matches how these models actually feel. 
It’s 113 original tasks written from scratch, not scraped from GitHub PRs, and it ships shallow clones with no git history to cheat from. When they replayed the older benchmarks they found SWE-Bench Pro’s verifier is wrong about 32% of the time, and that Claude Opus was reading the gold commit straight out of git history on 12 to 18% of its passes.

The gaps here are huge. GPT-5.5 leads at 70%, then GPT-5.4 at 56% and Opus 4.7 at 54%, and it falls off a cliff after that (Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%), with Kimi K2 the top open-source entry. Yam likes that it measures the realistic case, a small surgical change without breaking the codebase, while Nisten pointed out it rewards the best harness as much as the smartest model, and said he still prefers 4.7 for web dev.

Google AI Studio builds native Android apps for free (X, Announcement)

Google AI Studio now lets anyone build native Android apps for free, and they reportedly generated a quarter of a million apps in the first week. Yam’s framing: it’s a slot machine, but it’s getting better release over release, and the real use case is disposable, personalized software you build for yourself and your family.

CuaDriver brings background computer-use to Windows (X, Blog, Announcement)

For the majority of you on Windows: CuaDriver shipped background computer-use agents that drive a real desktop without stealing your cursor. They first replicated this on macOS (the trick Codex got through an acquisition), and now it’s on Windows too. We’ve asked them to come on and explain how this even works.

Open Source LLMs

OpenBMB’s MiniCPM5-1B is a 1B model that punches way up (X, HF, Arxiv, X)

The density story in small models keeps getting better, and this is the proof.

MiniCPM5-1B, from the Tsinghua lab OpenBMB, is a 1-billion-parameter model that scores 17.9 on the Artificial Analysis Intelligence Index.
That’s 7.4 points ahead of the next-best model in its class, and 1.6 points ahead of Qwen3.5 2B Reasoning, which has double the parameters. And it’s not even a reasoning model.

The token efficiency is the wild part: it used 12.6 million output tokens to run the whole index, about 31x fewer than Qwen3.5 2B in reasoning mode.

My favorite detail is the omniscience score. It lands at -1, the best in its class, because it abstains instead of hallucinating. Every other sub-2B model is down in the -70 to -89 range because they just make stuff up. Teaching a small model to say “I don’t know” is a real skill. It runs hybrid think/no-think in one checkpoint, 128K context, native tool calling, Apache 2.0, and fits in about half a gig at INT4, so it runs on your phone.

Nisten gave the definitive case for small models: self-contained apps where you keep full control of the data (medical, on-device), and large-scale data processing where paying an API to filter or classify terabytes is absurd when an on-device model can be about 1000x cheaper.

Tencent open-sources Hunyuan-MT 2 translation under Apache 2.0 (X, HF, HF, Arxiv)

Tencent open-sourced its translation model, a roughly 1.8B model that fits in about 440MB, runs on a phone, covers 33 languages, and reportedly beats Microsoft’s paid Translator API. It hit number one trending on Hugging Face.

Nisten’s idea, which I’m handing to all of you: take this model, pair it with a tiny TTS like Kokoro, and build a fully-offline travel translation app via Google AI Studio. Go build it and tell us how it goes.

Well, this was one hell of a week and episode, new Opus, crazy new translation tools, Pope chiming in on AI (in a surprisingly positive way!?) and a bunch more. I’m super excited to play with these tools and report back next week 🫡 See you all!
ThursdAI - May 28, 2026 - TL;DR

* Hosts and Guests
  * Alex Volkov - AI Evangelist at Weights & Biases (@altryne)
  * Co-hosts - @WolframRvnwlf, @yampeleg, @nisten
* AI & Society
  * Pope Leo XIV releases first encyclical on AI, with Anthropic co-founder Chris Olah speaking at the Vatican (X)
  * Illinois SB 315 passes House 110-0, becoming the first US state law requiring independent third-party audits of frontier AI catastrophic risks (X, Bill, OpenAI)
* Big CO LLMs + APIs
  * Datacurve releases DeepSWE, a contamination-free coding benchmark that exposes major gaps between frontier coding agents (X, Benchmark, Blog, GitHub)
  * Anthropic announces Opus 4.8 with thinking modes in the UI and Dynamic Workflows in Claude Code (Blog)
* Open Source LLMs
  * OpenBMB releases MiniCPM5-1B, a new SOTA 1B open weights model for efficient local and on-device use (X, Hugging Face, Arxiv, X)
  * Tencent open-sources Hy-MT2 translation models under Apache 2.0, including a tiny 1.8B model that beats paid translation APIs (X, HF 1.8B, HF 30B-A3B, Arxiv)
* Tools & Agentic Engineering
  * Google launches Universal Cart, AP2, and UCP to let AI agents shop and pay on your behalf (X)
  * Google AI Studio now lets anyone build native Android apps for free, with 250,000 apps created in the first week (X, AI Studio)
  * Cua Driver launches Windows support for background computer-use agents across real desktop apps (X, Blog, GitHub)
* This Week’s Buzz - from W&B and CoreWeave!
  * W&B Hackathon - WeaveHacks 4 with OpenAI, Cursor, Redis, and CopilotKit, June 6-7 (Lu.ma)
  * Weights & Biases launches an MCP server with 20 tools for coding agents to read experiments, monitor training, and run autonomous research loops (X, MCP, Blog)
* Vision & Video
  * Runway launches Project Luxo, claiming AI-generated video has crossed the uncanny valley for solo-creator short films (X, Blog)
* Voice & Audio
  * MOSS-TTS-v1.5 ships as an 8B open-source TTS model with 31 languages, pause control, and Apache 2.0 licensing (X, Hugging Face, GitHub, Arxiv)
  * ElevenLabs launches Dubbing v2, an audio-to-audio model that preserves performance across 90+ languages (X, Dubbing, Creative, Productions)
  * Cartesia Ink-2 debuts as the most accurate streaming speech-to-text model on Artificial Analysis’s new STT leaderboard (X, Ink, Artificial Analysis)
* AI Art & Diffusion & 3D
  * Pruna AI’s P-Image-Upscale hits 128 megapixel outputs with fast, predictable pricing (X, Docs, Replicate)
  * PrismML releases 1-bit and Ternary Bonsai Image 4B, a sub-1GB diffusion transformer for local image generation (X, Blog, Hugging Face, iOS App, Demo)
  * Microsoft’s MAI-Image-2.5 jumps to #3 on the Arena text-to-image leaderboard (X, Announcement, Arena)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

May 22, 20261 hr 49 min

AI just cracked an 80-year-old math problem nobody could solve — plus everything from Google I/O 26

Hey, Alex here, just got back from the sunny Shoreline Theater in Mountain View, so let me catch you up! This week was definitely Google heavy: we are covering Google’s I/O conference for the third year in a row, and today a special guest, Logan Kilpatrick, is joining to discuss the newly announced Gemini 3.5 Flash, the Google Omni model, and the new Managed Agents offerings. Plus, this week, for the first time, OpenAI announced that AI solved a math problem that humans couldn’t solve for 80 years, Cursor is showing off Composer 2.5, which is partly trained on xAI data, Karpathy joins Anthropic, and much more! Let’s dive in!

P.S. - We’ve announced our upcoming hackathon, WeaveHacks 4, June 6-7. I’ll be there, and we’re expecting the seats to run out very soon, so register now.

ThursdAI - We’d love to have your subscription, and if you’re already subscribed, please hit that bell on YT to never miss an episode!

Google I/O 2026 - Google goes agentic everywhere

I went to cover Google I/O for the third year in a row, shoutout to the DeepMind team for inviting ThursdAI again, and folks, this one felt different.

Last year, Google I/O was still very model-centric. This year, the story was not “here is another benchmark chart.” The story was: Google is putting Gemini into everything, and the agentic layer is becoming the product layer. Search, Gemini app, Android, Workspace, YouTube, AI Studio, Cloud, Antigravity, Flow, managed agents, smart glasses, all of it is now orbiting around one pretty clear strategy: Gemini is the intelligence, Antigravity is the agent harness, Google’s products are the distribution. I saw many reactions that were milquetoast, as in, “we expected more,” and those seem to dominate the X feed. But I think the distribution is the part that many folks on X are missing. Yes, we can argue about Gemini 3.5 Flash pricing. Yes, we can argue whether “Flash” still means what Flash used to mean.
But when Google says the Gemini app itself has 900 million monthly active users, before even counting Search, Gmail, YouTube, Docs, Drive, Android, and the rest of the Google surface area, that’s massive! OpenAI’s ChatGPT has supposedly stagnated at ~900M; I don’t remember them crossing 1B. Meanwhile Google is gaining traction. And they just updated all those folks with a new model!

Wolfram said it really well on the show: his mother is not sitting there reading model cards. She just uses her Pixel, voice unlocks Gemini, asks for help, and suddenly the default intelligence available to her goes up.

Antigravity 2.0 - the agent harness takes center stage

The biggest strategic signal from Google I/O for me was Antigravity.

Remember, Antigravity was an IDE that came from the Windsurf acquisition saga. Part of the Windsurf team went to Google, part went to Cognition, and now Google is very clearly putting Antigravity in the middle of its agentic future. And I mean very clearly. Sundar mentioned it. Demis mentioned it. Varun Mohan, the co-founder, was on stage immediately after them! If you’ve ever watched a Google I/O keynote, you know how carefully every minute is allocated. Google has YouTube, Search, Gmail, Android, Cloud, Ads, Workspace, and a thousand VP-level products that could be on stage. The fact that Antigravity was that prominent should tell you everything.

Logan Kilpatrick joined us and framed this in a way I loved: Gemini became the through-line across Google products, and now the Antigravity agent harness is becoming the through-line for agentic experiences.

The new Antigravity 2.0 is a complete overhaul. It shows only an agentic interface (previously just a separate window called Agent Manager), splits the IDE layer off completely into its own app, and presents a Codex-like agent-first interface, which got a few folks furious.
This move may be weird to some folks, but if you follow where everyone’s going, this seems to be the way of the future: coding is no longer about lines of code, it’s about managing fleets of agents. The new Gemini 3.5 absolutely shines inside the new Antigravity. The model was trained with this harness in mind, and is currently offered at an incredible speed (12x), so I’m definitely going to try it!

Gemini 3.5 Flash - fast, determined, and maybe not the old “Flash”

The most debated model release of the week was Gemini 3.5 Flash.

Some folks saw the pricing and token usage and immediately went “this is not Flash.” I get that reaction. Flash used to mean a cheap, fast, lightweight chat model. But Logan’s framing on the show was important: Flash is now being built for the agentic era.

In a chat era, you optimize for one user message and one model answer. In an agentic era, the real token volume is in tool loops, intermediate reasoning, retries, file reads, web searches, code execution, and self-correction. That’s a different product profile.

Wolfram already ran Gemini 3.5 Flash through WolfBench, and the results were fascinating. With the Hermes agent harness, Gemini 3.5 Flash hit an 87% ceiling on Terminal Bench 2.0, meaning across runs it could solve more of the benchmark than even GPT-5.5 extra high in that setup. The variance was higher with the simpler Terminus harness, but with a real agent harness, the model looked much stronger.

That tracks with what Nisten saw in his “Martian railgun from Olympus Mons” test. Gemini 3.5 Flash went extremely detailed, almost too determined, kept correcting itself, overcorrecting itself, and built a whole game-like simulation. Logan laughed and basically said: yeah, this model is very determined, possibly an overcorrection from the “Gemini is lazy” feedback. It also tracks with the mismatch across other benchmarks: in some, Gemini 3.5 Flash shines (like Apex-agents from AA), and in some, it doesn’t match the other frontier models.
In my tests, it was definitely over-eager, making what felt like a million and a half tool calls and reading tons of files just to help me review this draft inside Antigravity. It’s like a super eager robotic golden retriever!

Gemini Omni - Nano Banana for video, but actually more than that

The biggest update from last year’s I/O was Veo 3! This year, the biggest wow factor was also visual, but it wasn’t Veo 4; it was a new multimodal model, trained end-to-end, that they call Omni. Google is calling this their first “create anything from anything” model, and the first version, Gemini Omni Flash, starts with conversational video editing. The easy description is: Nano Banana for video. You upload or create a video, then talk to it. Change this character. Replace this person. Add an object. Make this scene claymation. Keep the scene, but change the environment.

I played with it live and showed a few examples. I asked for a claymation explainer of protein folding, then gave it my face and asked it to replace the character with me. It did it. I uploaded pictures of Sonia, my cat, and it generated a talking cat video with the right kind of cat teeth, which is weirdly important because so many pet generations accidentally add human teeth and become nightmare fuel.

The failure modes are still there. I asked it to make Sonia a Russian-speaking female cat, and it only partly switched languages and didn’t really change the voice. Audio upload support is also not fully productized yet, even though the underlying model is multimodal. But the direction is very clear.

This is not just “Veo with a chat model glued on.” I asked Jeff Dean, Google’s chief scientist, about this at I/O, and he explained that Omni is trained end-to-end. The intelligence and the generative media capabilities are part of the same model family, not a hacky two-model pipeline.
He also said the intelligence is around a recent Flash-level model, which is a big deal when you think about video editing as reasoning over physics, identity, scene continuity, and intent.

A lot of people compared Omni to Seedance 2.0, and I think that’s the wrong comparison. Seedance is amazing at cinematic generation (largely due to the lack of copyright concerns at ByteDance). Omni’s unlock is iterative editing on real footage and coherent multi-turn creative control.

Other Google I/O 2026 releases I found notable

This was a concentrated effort by a huge company to insert AI into every product surface they have, so of course I can’t cover ALL of it here, but the most notable things for me were:

* Gemini Spark - a new agentic experience from Google, to help you with tasks across Gmail, Drive and more. It should support skills, and is a de facto OpenClaw/Hermes alternative from Google for regular folks. It’s not live yet, so we’ll talk more about it when I can test it out.
* Managed Agents in the Gemini API - We chatted with Logan about this one. Google is re-imagining how agents are going to get built, and is offering one API call to spin up an agent in a full Linux env, with security and sandboxing in mind. I’ll expand more on this in a future episode, as I recorded a complete conversation about this with Ali Çevic, a PM for Google APIs.
* AI overhaul of Google Search - AI Overviews will now expand into AI Mode, and the iconic Google search box itself will change, for the first time in 25 years, to include AI Mode!
* SynthID expansion and OpenAI collab - Google showed off that OpenAI is joining in marking all AI-generated imagery and video with an invisible SynthID watermark. I think this is amazing and more companies should adopt this standard.
* AI Glasses! We got Google Glasses demos - Together with Warby Parker and Gentle Monster, Google finally showed off their answer to Meta Raybans/Oakleys. They look like regular glasses too, but can hear and talk to you, with the full power of Gemini multimodality. Available in the fall sometime!
* Demis Hassabis “we’re on the cusp of the singularity” closer - CEO and co-founder of DeepMind, Demis Hassabis, closed the show with his remarks about the positive future and that we are nearing this Singularity point after which the future is very uncertain. I found it to be very inspiring and closed our show with that clip as well!
* Personally, I got to chat with Demis Hassabis, have breakfast with Jeff Dean, ask Josh Woodward a bunch of questions, and pester about 20 other great folks on a live stream, and I had a lot of fun!

Huge thanks to the DeepMind folks, Lucie, Dimple, JD and many others, for the continued belief in ThursdAI and for inviting me to cover this great event.

OpenAI LLMs solve an 80yo math problem - Erdős Unit Distance Conjecture

Outside of Google I/O, the biggest story of the week was OpenAI announcing that a general-purpose reasoning model made progress on the Erdős planar unit distance problem.

This problem goes back to 1946. For nearly 80 years, mathematicians believed the best constructions looked roughly like square grids. OpenAI’s model found a new family of constructions with a polynomial improvement, using algebraic number theory ideas that humans apparently had not explored in this context.

Important caveat: this does not fully solve every version of the asymptotic Erdős conjecture. Some mathematicians are pushing back on the framing, and fair enough. Precision matters. But even with the caveat, this is still a huge moment.

The reason it matters is not that I personally understand the math. I absolutely do not. The reason it matters is that this was not a special-purpose IMO model fine-tuned only for math competitions. This was a general-purpose reasoning model exploring a real open problem, generating candidates, verifying them, and finding a path humans hadn’t taken.
Extrapolate this to other sciences, physics for example, and this means an amazing future. LDJ pointed out that mathematicians have been skeptical because there have been previous false alarms. But this one landed differently. When Fields Medalist-level mathematicians verify the proof, the discourse changes from “lol stochastic parrot” to “wait, what does this mean for my PhD?”

My answer is: yes, still study math. Please study math. The mathematicians who use these tools will do much more than people who don’t understand the domain. Same with software engineering. Senior engineers with Codex, Claude Code, Hermes, Antigravity, Cursor and other agents are becoming dramatically more effective because they can steer, evaluate, and recover the work.

This being published a day after Demis’s “foothills of the singularity” is a great conjecture.

Cursor Composer 2.5 - Opus 4.7 performance model from Cursor, at 10x better efficiency

Cursor dropped Composer 2.5, and folks, this is a serious release.

Composer 2.5 is built on Moonshot’s Kimi K2.5 base, like Composer 2, but Cursor scaled the post-training dramatically. They used 25x more synthetic tasks and introduced targeted textual feedback during RL rollouts, where the model gets hints inserted at the point of failure instead of only getting a noisy final reward.

The benchmark story is strong: around 69.3 on Terminal Bench 2.0, basically neck and neck with Opus 4.7 in Cursor’s chart, and strong results on SWE-bench multilingual and CursorBench. The pricing is the part that makes this especially interesting: $0.50 per million input tokens and $2.50 per million output tokens, with a faster variant at $3 / $15. That is much cheaper than the frontier models it is trying to replace for day-to-day coding work.

Cursor engineers are reportedly dogfooding Composer 2.5 heavily and rarely switching away. That matters more to me than any single benchmark.
If the people building Cursor can use it as a daily driver, that is a very real signal.

The wild part is what comes next. Cursor is partnering with SpaceXAI to train a much larger model from scratch using 10x more compute on Colossus 2. Cursor has the workflow data. xAI has enormous compute. If this works, Cursor stops being just the IDE company and becomes a coding-model lab.

We’ve been saying for months that coding agents are the path toward general agents. Anthropic has Claude Code. OpenAI has Codex. Google has Antigravity. xAI has Grok Build. Cursor has Composer. I’m looking forward to seeing how well it performs on our own benchmarks!

Anthropic, xAI, Karpathy, and the compute wars

The compute story this week was bonkers.

The SpaceX IPO filing reportedly revealed that Anthropic is paying SpaceXAI $1.25B per month for AI compute at the Memphis Colossus facility. Per month. That’s about $15B a year, through May 2029, for access to more than 220,000 NVIDIA GPUs including H100s, H200s and GB200s.

This is apparently inference compute for Claude Pro, Max and API users, not training. And it explains a lot of the recent quota changes. Anthropic doubled some Claude usage limits, and suddenly the product feels less constrained.

Also, can we just acknowledge the comedy here? Elon Musk, who publicly called Anthropic “misanthropic” and went off against every competitor to xAI, is now selling spare GPU time to Cursor and Anthropic? Who’s next, OpenAI?

The bigger point is that the AI capex story is no longer just NVIDIA. It’s also whoever owns the data centers, power, cooling, networking, and GPU clusters. Compute is becoming the land under the AI economy.

Also, Andrej Karpathy joined Anthropic. Karpathy could work anywhere.
He co-founded OpenAI, led Tesla Autopilot vision, taught half the AI world how neural nets work, and now he’s going back into frontier LLM R&D at Anthropic.

Open source LLMs - Cohere, Qwen, Nous

Open source had a strong week too.

Cohere released Command A+, a 218B total parameter sparse MoE model with only 25B active parameters per token, under Apache 2.0. This is their first model that unifies reasoning, vision, multilingual, tool use and citations in one package.

The hardware story is great: W4A4 quantization can run on 2 H100s or a single B200. Cohere says it supports 48 languages, 128K input context, 64K output, and gets big jumps over Command A Reasoning, including Tau-squared Bench Telecom from 37% to 85% and Terminal-Bench Hard from 3% to 25%.

Cohere is one of those labs that doesn’t always chase the loudest consumer hype, but they are very serious on enterprise and multilingual. Apache 2.0 makes this one especially useful.

Alibaba also dropped Qwen 3.7-Max, positioned as an agentic frontier model. The headline from their testing is wild: 35 hours of continuous autonomous operation with more than 1,000 tool calls. They also showed it controlling a physical robot inside Alibaba offices and finding an umbrella after about 20 minutes of agent interaction.

This digital-to-physical bridge is where things start feeling very real. An agent loop that can write code and use tools can also navigate physical tasks if you give it the right robotics stack.

And our friends at Nous Research released Lighthouse Attention, a sparse attention method for long-context pretraining. At 512K context, they report a 17x faster forward+backward pass than standard attention on a single B200, and the recovered checkpoints actually beat dense-from-scratch final loss at the same token budget.

The clever part is that the selection logic sits outside the attention kernel, so you still use regular FlashAttention on a gathered dense subsequence. No custom sparse kernel nonsense.
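To make the “selection outside the kernel” idea concrete, here is a minimal Python sketch of the select-then-gather pattern. This is not Nous Research’s code: the selection rule (`select_positions`, which scores each key against the mean query) is a made-up placeholder, `dense_attention` is a plain softmax attention standing in for FlashAttention, all names are hypothetical, and causal masking is left out.

```python
import math


def dense_attention(queries, keys, values):
    """Plain softmax attention over every key it is given (stand-in for a dense kernel)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs


def select_positions(queries, keys, budget):
    """Hypothetical selection rule: keep the `budget` keys most aligned with the mean query."""
    dim = len(queries[0])
    mean_q = [sum(q[j] for q in queries) / len(queries) for j in range(dim)]
    relevance = [sum(a * b for a, b in zip(mean_q, k)) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:budget])  # restore original token order


def gathered_attention(queries, keys, values, budget):
    """Select positions first, gather them, then run ordinary dense attention."""
    idx = select_positions(queries, keys, budget)
    return dense_attention(queries, [keys[i] for i in idx], [values[i] for i in idx])
```

The point of the sketch is the shape of the pipeline: the attention function never learns that sparsity happened, it just receives a shorter dense sequence. With `budget` equal to the full sequence length, `gathered_attention` reduces to ordinary dense attention.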
If this holds up, this could matter a lot for long-context training.

Tools and agentic engineering - X subscriptions, Grok Build, Codex Mobile

One really practical tool update: Hermes and OpenClaw can now use your X subscription directly.

This is more important than it sounds. You can connect your X Premium subscription and get access to semantic X search and Grok-related tooling without using sketchy browser automation or unofficial APIs that might get you banned. Wolfram already used this to have his agent go through his likes and bookmarks from the past week and send me news items for the show. That is exactly the kind of “small but real” agent workflow that becomes addictive.

xAI also launched Grok Build, their agentic CLI coding tool, in early beta for SuperGrok Heavy subscribers. Early users are already running parallel Grok Build agents through tmux supervisors and using it for more than coding: fleet data triage, security patching, training label work, and general automation.

The pricing being discussed is aggressive, around $1 per million input tokens and $2 per million output tokens for the API. The model version is grok-build-0.1, and folks have already wired it into Hermes with a 256K context window.

And then there’s Codex Mobile, which OpenAI shipped inside the ChatGPT mobile apps. This is one of those releases that sounds small until you start using it. You can control Codex sessions remotely from your phone, connected to your machine, and because Codex has native connectors to Gmail, Calendar and other surfaces, it sometimes feels faster and more reliable than local CLIs duct-taped to third-party integrations.

I ported Wolfred into Codex with skills and everything, and I’ve been comparing the same tasks in Hermes and Codex. Codex is often faster, not necessarily because the model is always smarter, but because the connectors and harness are cleaner. Harness matters.
We keep coming back to this.

This Week’s Buzz - W&B, CoreWeave, WolfBench and robotics

This week in the Buzz, Wolfram walked us through a few things from the Weights & Biases / CoreWeave world.

CoreWeave is a gold sponsor at ICRA 2026 in Vienna, the International Conference on Robotics and Automation. NVIDIA is also going big there with a keynote on generalist humanoid robots, 17 accepted papers and workshops around sim-to-real, robot foundation models, autonomous driving, manipulation, and physical AI.

Wolfram will be there later in the week, after speaking at the AI Developer event in Cologne about WolfBench. If you’re in Europe and into robotics or agent evals, find him.

We also looked at WolfBench results for Gemini 3.5 Flash, which honestly became one of the more interesting empirical points of the episode. The model looks variable in simple harnesses, but very capable in better agent loops. That’s the whole thesis of measuring model + harness together instead of pretending the model card tells the whole story.

The water discourse, almonds, and data center reality

We also got into the data center water discourse, because this talking point is everywhere right now.

There are real infrastructure questions around AI. Power, land, cooling, grid capacity, permitting, local impact, all of that matters. But the “AI is stealing drinking water” version of the argument is often wildly detached from scale.

The stat I brought up on the show: California almonds use roughly 3 to 5.5 million acre-feet of water per year, multiple times more than all North American data centers combined in 2025. Nisten and LDJ added the important cooling nuance: many large data centers use closed-loop cooling, and evaporative cooling is not universal. Some data centers can avoid water use almost entirely, but at the cost of higher electricity usage.

This doesn’t mean “no concerns are valid.” It means if we’re going to regulate or pause data centers, let’s be honest about the actual tradeoffs.
AI compute is becoming the substrate for medicine, robotics, science, logistics, software, education and every other productivity layer. We should build responsibly, but not based on viral fear math.

Closing thoughts - foothills of the singularity

Demis closed I/O saying we’re in the foothills of the singularity, and I know how that lands when you write it down. But I was in the room, and after the keynote he told me something I haven’t been able to shake: he thinks AI is going to be 10x as impactful as the Industrial Revolution, and 10x as fast. Basically 100x. This is the AlphaFold guy. Not someone loose with his words.

Then look at the week. A general reasoner cracked an 80-year-old math problem. Cursor is training near-frontier coding models on a fraction of the big-lab budget. Anthropic is paying Elon $15B a year for inference. Karpathy left education to go back into pre-training. Google rolled out an intelligence uplift to a billion people who don’t even know a model dropped.

If you put that on a whiteboard in 2023, it reads like a sci-fi pitch.

LDJ’s mathematician friends are asking if they should keep doing their PhDs. My answer hasn’t changed: yes, please keep going. The people who combine domain taste with these tools are going to ship more in 5 years than the previous generation did in 50. The tool doesn’t replace the taste. It just removes the bottleneck.

That’s the whole reason ThursdAI exists.
Not to hype every drop, not to dunk for engagement, but to give you a shot at being one of the people who knows what’s happening, with the receipts.

This week, a lot changed.

See you next Thursday.

TL;DR and Show Notes

* Hosts and Guests
  * Alex Volkov - AI Evangelist at Weights & Biases / CoreWeave, @altryne
  * Co-hosts: @WolframRvnwlf, @nisten, @ldjconfirmed
  * Guest: Logan Kilpatrick, MTS at Google DeepMind / AI Studio, @OfficialLoganK
* Google I/O 2026
  * Google went all-in on agents across Search, Gemini, Antigravity, Workspace, Android, Cloud and YouTube (I/O site, Alex thread)
  * Antigravity 2.0 became the central agentic coding harness across Google (Sundar, Google OS demo)
  * Gemini 3.5 Flash launched as a fast, determined workhorse model for agentic loops (Logan, Noam Shazeer, Jeff Dean)
  * Gemini 3.5 Flash is rolling out across the Gemini app, Search AI Mode, Gemini API, Google AI Studio, Antigravity and Gemini Enterprise Agent Platform (Koray Kavukcuoglu)
  * Google Search is getting new Gemini 3.5 Flash-powered agentic capabilities, including a new AI-powered Search box and background information agents (Sundar)
  * Gemini Spark was announced as a 24/7 personal AI agent that can proactively work across Google surfaces (News from Google)
  * Google teased Gemini-powered Android XR smart glasses with eyewear partners Gentle Monster and Warby Parker (Google, Alex live reaction)
  * Google AI Studio and the Gemini API got major agentic developer updates, including Managed Agents (Google AI Developers)
* Vision & Video
  * Google DeepMind launched Gemini Omni, a “create anything from anything” multimodal model starting with conversational video editing (DeepMind, Google DeepMind on X)
  * Omni is available in the Gemini app, Google Flow and YouTube, with API support coming soon (Logan, Gemini App, Sundar)
  * Key distinction: Omni is not just text-to-video, it is an iterative multi-turn video editing model that combines Gemini intelligence, world knowledge, multimodal inputs and generative media (Google)
* Big CO LLMs + APIs
  * OpenAI announced that a general-purpose reasoning model made progress on the Erdős planar unit distance problem, challenging an 80-year-old mathematical belief (OpenAI, X)
  * Cursor launched Composer 2.5, built on Kimi K2.5, with Opus-class coding performance at much lower cost (Cursor blog, X)
  * Alibaba released Qwen 3.7-Max, an agentic frontier model with long autonomous runs and robotics demos (Qwen blog, X, robot demo)
  * Andrej Karpathy joined Anthropic to work on frontier LLM R&D (X)
  * SpaceX IPO filing revealed Anthropic is paying $1.25B/month for AI compute at the Memphis Colossus facility (Axios, Sawyer Merritt)
  * The jury in Musk v. Altman found Musk’s OpenAI claims barred by statute of limitations, with Musk saying he will appeal (Elon Musk, Sawyer Merritt, Max Zeff)
* Open Source LLMs
  * Cohere released Command A+, a 218B MoE model with 25B active parameters under Apache 2.0 (Cohere, Nick Frosst, HF W4A4, HF BF16)
  * Nous Research released Lighthouse Attention, a sparse attention method for long-context pretraining with major speedups (Blog, X, arXiv, GitHub)
* Tools & Agentic Engineering
  * Google launched Managed Agents in the Gemini API, letting developers spin up hosted Antigravity agents with Linux sandboxes and persistent state (Docs, X)
  * xAI launched Grok Build, an agentic CLI coding tool in beta for SuperGrok Heavy users (xAI CLI, X)
  * Hermes and OpenClaw can now use X subscription auth for semantic search and Grok tooling (Alex)
  * OpenAI Codex Mobile is now available in the ChatGPT mobile apps for remote agent workflows (OpenAI)
  * Anthropic doubled Claude usage outside peak hours for a limited period, including Claude Code and other Claude surfaces (Claude)
* This Week’s Buzz - W&B / CoreWeave
  * Weights & Biases by CoreWeave is at ICRA 2026 in Vienna, with robotics and automation taking center stage (ICRA, W&B event page)
  * NVIDIA heads to ICRA 2026 with robotics work around generalist humanoids, physical AI and sim-to-real systems (NVIDIA Robotics, NVIDIA ICRA)
  * Wolfram is speaking about WolfBench at the AI Developer event in Cologne before heading to ICRA in Vienna (Wolfram)
* Other Topics
  * Data center water usage discourse came up again, including why comparisons need real scale and context rather than viral fear math
  * The broader theme of the week: coding agents are becoming general agents, and the major labs are now competing on the full stack of model, harness, tools, context and compute

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

May 15, 20261 hr 42 min

ThursdAI - May 14 - TML Interaction Models, Musk v Altman Disclosures, CW Sandboxes & /goal Takes Over

Hey everyone, Alex here 👋

I am back live on ThursdAI after a week off, and yes, I am now a married man! Thank you for all the congrats, and also thank you to Ryan and Yam for holding down the fort last week while I tried very hard to disconnect.

This week was a relatively chill one in AI land (no, really, for once), which actually let us go deep on some really fascinating stuff. We’ve got Thinking Machines Lab finally shipping their first real research with these wild interaction models, Meta Muse Spark showing up in actual products (and it’s surprisingly good!), the Musk v. Altman trial dropping juicy disclosures, and probably the biggest narrative shift on the show today: all of us are quitting OpenClaw. Yeah, you read that right. We’ll get into why.

Also, and this is breaking news from this morning, CoreWeave just launched Sandboxes for your agents. I’ll cover that in This Week’s Buzz, but if you’ve been waiting for production-grade sandbox infrastructure that powers 9 out of 10 major AI labs, today’s your day.

Oh, and we had Vic Perez from Krea on to talk about Krea 2, their first foundation image model trained completely from scratch. Let’s dig in.

ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

The Great OpenClaw Exodus towards Hermes 🫠

I’m going to start with what was honestly the most emotional thread of the entire show, because three of us, me, Ryan, AND Wolfram, all independently switched away from OpenClaw this week. And we kicked off the show literally processing this together on air.

The story is the same across all of us. OpenClaw was magical back in February when we first brought it to you. Things just worked.
But after Anthropic’s pricing changes (we covered this — they made Max-tier subscription usage of Opus through OpenClaw significantly more expensive), and after months of the constant Lego-construction-style breakage on every update, the magic faded. Ryan said it best on the show; he was “constantly fixing OpenClaw” instead of using it.So Ryan went to Codex. Wolfram and I both went to Hermes from Nous Research. And folks, things just work again. That February feeling is back, and with GPT 5.5, it’s an incredible assistant!Why Hermes? A few things:* It’s now the #1 most-used CLI agent on OpenRouter globally, passing OpenClaw and even passing Claude Code on OpenRouter usage. That’s a massive milestone for Nous Research and shows we’re not alone in this migration.* It has /goal (more on this in a sec), steering, and background computer use via the TryCUA integration.* It’s open! which means if you’ve built a system like Wolfram’s “Amy” or my “Wooolfred” or Ryan’s “R2” (yes, we know each other’s assistants’ names better than each other’s kids’ names at this point 😅), you can port your memories, profile, and soul files seamlessly.The migration was so smooth that Wolfram literally had Codex talk to Hermes to plan and execute the migration of his home assistant agent. Two agents collaborating to migrate themselves. We are living in 2026 and it’s easier than ever to switch. If you haven’t tried Hermes, give it a go! Steering is maybe the most underrated addition to Hermes, it’s a Codex feature, but exists in Hermes, with GPT 5.5 you can send a follow-up message, and the agent will see it after the next tool call, not after the whole chain of thought was completed (like OpenClaw defaults to) - this changes the conversation to be much more natural! Agents buying wedding gifts using Stripe wallet! Real quick story: Two weeks ago we covered Stripe’s new wallet APIs that let your agents have actual budgets to spend money on the web. 
I told my agent (back when it was still OpenClaw) to “go buy us a wedding present, don’t tell me what it is.” It half-worked, half-broke. This week, a giant custom map of our travels arrived in the mail. I approved one Stripe push notification and the rest just happened. I’ve also had Hermes pay my traffic tickets from screenshots (HOV lane ones, not like... a DUI; 80% of my driving is Tesla FSD).
So so happy that my AI assistant got us a present of his own choosing! And it arrived in physical form. Not perfect (the date on it is our proposal date, ha), but it’s still cool!
Codex gets remote control! (X)
While Wolfram and I moved to Hermes, Ryan Carson moved to Codex, and during the show I wondered: how does he communicate with his R2? Well, just a few minutes after we concluded the live show, OpenAI dropped some breaking news! Codex is now on mobile. It connects to any Mac (for now) from any iOS/Android device, and you can control your Codex, your whole Mac with Computer Use, your browser with the Chrome extension, and everything else Codex can do... on the go! This is a huge unlock for many folks, and for many, I assume this will nearly replace the need for something like OpenClaw/Hermes, be much more secure by default and work flawlessly out of the box! The setup is super easy: after updating your ChatGPT app, you have a new “Codex” window, and after updating the Codex Mac app, you can pair them, and voila, all your local Codex sessions are on the iOS app as well. This works way better than Claude remote btw, significantly so. The fact that you can now add multiple Macs (plus SSH servers, since they also added the ability to remote control other servers via SSH) is a huge deal. OpenAI is quickly leapfrogging Anthropic, and many are noticing this and switching away from Claude Code. 
Big Companies & APIsMeta Muse Spark: The Voice AI That Actually Does Things 🎤Let’s start with the one I actually got to play with: Meta launched Muse Spark-powered voice conversations across the Meta AI app, WhatsApp, Instagram, Facebook, and the Ray-Ban Meta glasses (X, Announcement).And folks, I was honestly surprised by how good this is. I recorded a 5-minute live test and it’s not cut at all. The voice mode reacts almost instantaneously. It’s multilingual (it correctly identified Russian and Hebrew even if it can’t respond in them yet). It can search the Meta network mid-conversation — I showed it a screenshot of one of my own Instagram Reels and within half a second it found the exact reel and explained what we were discussing. Half a second.It also does live camera AI, where it watches what your phone sees. The only thing it failed to identify? My Meta Ray-Ban glasses. The Meta AI didn’t know what Meta Ray-Bans look like. That was the funniest moment of the whole demo.The team at Meta’s Superintelligence Labs spent 4.5 months building this, and the thing that really stood out to me from the announcement is this line: “Our models are scaling predictably. Muse Spark is an early data point on our trajectory, and we have larger models in development.” Translation: this is the small one. Bigger Muse models are coming.Meta’s superpower here, as always, is distribution. They can shove this into the daily product surface of billions of users. ChatGPT advanced voice mode (still on the GPT-4o family) has gotten genuinely worse lately — I barely use it anymore. Meanwhile Meta is shipping good real-time voice across WhatsApp and Instagram. This is the speed-of-product-integration game, and Meta is winning it.Thinking Machines Lab Previews full duplex Interaction Models 🤯This is the one Wolfram and I really geeked out on. 
Mira Murati’s Thinking Machines Lab finally released real research — and it’s a fundamentally different bet than what anyone else is making (X, Blog).They’re calling them interaction models, and TML-Interaction-Small is a 276B parameter MoE with 12B active, trained from scratch for native real-time human-AI collaboration. Note: they announced it, they didn’t release weights or an API yet — limited research preview is coming “in the next few months.”Here’s why this matters and what makes it different from Meta’s voice mode (which is also impressive!): the architecture is 200ms micro-turns where the model is continuously perceiving audio, video, AND text WHILE simultaneously generating output. There’s no turn boundary detection, no VAD harness — the model itself handles all of that natively. It’s full duplex baked into the weights.The demos are fire. The model can:* Speak while listening (live translation in real-time)* Watch you do pushups and proactively count them out loud as you go* Wait silently until someone enters the frame, then say “friend”* Generate a chart while continuing to explain a concept to youThe benchmarks: 77.8 on FD-bench v1.5 vs GPT Realtime 2.0 at 46.8, and 0.40s turn-taking latency vs over a second for everyone else. Nisten was unimpressed (he pointed out 1.2 seconds for a 12B-active model on a B300 rack is not exactly snappy), and that’s a fair take — but the capabilities here, particularly visual proactivity and time-awareness, are genuinely novel.The philosophical split is really interesting. While every other lab is racing toward full autonomy, Mira is saying interactivity should scale with intelligence. That’s the bet. And given the all-star team she’s pulled together (people from ChatGPT, Character.ai, Mistral, PyTorch, OpenAI Gym, Fairseq, SAM)... I’m here for it.What I really hope happens: someone leaks the weights. 
A 276B MoE with 12B active is exactly the kind of model we need to be able to quantize to run on something like the Richie Mini for a fully offline, always-present home assistant. Wolfram, I know you’re thinking the same thing 👀Musk v. Altman: The Trial Drops Some Wild Disclosures and TestimonyOkay this one is half drama, half disclosure goldmine. The trial is happening live as we record, closing statements are TODAY (I transcribed both of them here and here). There’s no video allowed because the courtroom was so packed with Elon fanboys, so they’re livestreaming audio only on YouTube. I set up my Hermes agent to listen to the audio stream and send me 2-minute summaries. That alone was worth the show. Apparently Elon was not in court during closing arguments (he’s in China)The big-picture story: Musk is suing OpenAI and Microsoft (specifically) claiming OpenAI abandoned its nonprofit bargain. OpenAI’s defense is essentially “Musk wanted 90% equity and full control, walked away when he didn’t get it, and is now suing over a success he predicted had a 0% chance.”Here are the highlights from sworn testimony from Sam Altman, Satya Nadella, and Ilya Sutskever that I think are the most consequential:* Musk wanted 90% of OpenAI’s equity to start. Per Altman under oath: “An early number that Mr. Musk threw out was that he should have 90% of the equity. It then softened, but it always was a majority.”* December 2018 Musk email to the team: “My probability assessment of OpenAI being relevant to DeepMind/Google without a dramatic change in execution and resources is 0%, not 1%. I wish it were otherwise.” Yeah. The guy suing them now once put in writing they had zero shot.* September 2017 ultimatum from Musk: “Either go do something on your own or continue with OpenAI as a nonprofit.” They did. 
He’s now suing them for it.* The Microsoft economics: Satya Nadella confirmed under oath that the $13B target redemption amount compounds to roughly $180B in four years, with 20% annual increases starting in 2025.* The AGI clause got rewritten. Originally, if AGI was achieved, the Microsoft deal would dissolve. The renegotiated version (per Altman) is that Microsoft no longer gets research IP at AGI but will continue to get product IP through end of 2032.* Sutskever’s pre-firing memo, confirmed under oath: Sam Altman “exhibits a consistent pattern of lying, undermining his execs, and pitting his execs against each other.” When asked if he still believed it: “I thought so at the time and had been thinking about Altman issues for at least a year.”* Satya wanted answers and never got them. Under oath, Nadella said he asked the board explicitly why Sam was fired and “they never gave me a specific reason... none of that was coming through.” He called the firing process “amateur city as far as I’m concerned.”* Microsoft is now the SMALLEST mega-investor in OpenAI. SoftBank $30B, Nvidia $30B (Altman: “It was either 20 or 30. I think it was 30 also.”), Amazon “larger than Microsoft.” Total private capital raised: ~$175B.* The Helion conflict of interest. Altman owns ~22.8M shares of Helion ($1.65B), roughly a third of the company. Helion has a 2028 power deal with Microsoft and a scale deployment agreement with OpenAI. He recused from the OpenAI board vote on it — and as he said under oath, “But I was in the room, yes.”And then there’s Ilya’s pearl that genuinely made me pause. When asked about the difference in AI capability between 2018 (when they started) and now: “It’s like the difference between an ant and a cat.”Yam asked the obvious question: what does Elon actually get if he wins? Honestly, I had no idea. Until I heard the arguments with the judge, and apparently it’s a LOT! 
Musk is asking for $135B in monetary damages (which he claims he won’t take for himself, rather they will go to OpenAI non-profit arm), and non-monetary relief that will force a removal of Sam Altman and Greg Brockman from OpenAI, and revert the split to restore OpenAI to original “non-profit” mission. This is ... quite an ask, and apparently the judge will decide on this, not the Jury, the Jury will only be deciding if there was a breach of charitable trust or unjust enrichment. This was one of the biggest bomb-shell trials, and we’ll keep you up to date on what happens. Open Source AIThe TanStack Supply Chain Attack Okay, this one’s serious. Ryan posted his most viral tweet ever about this — the TanStack supply chain attack, aka the “mini Shai Hulud” worm. If you ran an npm update during the exposure window, you may have gotten absolutely destroyed (X)What makes this one particularly nasty:* It specifically targets AI developer tooling. Hooks into Claude Code’s settings.json and VS Code JSON to re-execute on every tool event.* npm uninstall doesn’t fix it. The malware replicates itself.* If you revoke the GitHub token it uses, it nukes your home directory. A worker process watches the token. If revoked, it scorches the earth.The fixes (do them today, seriously):* Set a 24-hour minimum age rule on package installs in both npm and pip. Most malware is identified within 24 hours; this is your free moat.* Generate per-agent API keys. Never reuse keys across agents. If one gets compromised, you can revoke that one specifically.* Run development in sandboxes (more on this in a sec — CoreWeave Sandboxes just launched 👀).* Have rolling rsync backups outside of Git. Nisten’s advice: if you get hit, you can nuke everything and restore from a backup that doesn’t depend on tokens.I’ve asked Codex to review how to set these minimum age rules across your system, and published here, please review and then ask your Agent to implement those for your machines! 
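To make the 24-hour minimum age rule concrete, here is a minimal sketch of the check itself, not the actual npm or pip setting. It assumes you already have a mapping of version to publish timestamp, shaped like the `time` field in npm registry package metadata (which also carries `created` and `modified` entries). The names `MIN_AGE`, `is_old_enough` and `allowed_versions` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy from the advice above: ignore anything published < 24h ago.
MIN_AGE = timedelta(hours=24)

def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp such as '2026-05-14T03:00:00.000Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def is_old_enough(published: str, now: datetime, min_age: timedelta = MIN_AGE) -> bool:
    """True if the release has been public for at least min_age."""
    return now - parse_ts(published) >= min_age

def allowed_versions(times: dict, now: datetime, min_age: timedelta = MIN_AGE) -> list:
    """Filter a version -> timestamp mapping down to releases old enough to install.

    'created' and 'modified' are metadata keys, not versions, so they are skipped.
    """
    skip = {"created", "modified"}
    return [
        version
        for version, ts in times.items()
        if version not in skip and is_old_enough(ts, now, min_age)
    ]

if __name__ == "__main__":
    now = datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)
    times = {
        "created": "2026-04-01T00:00:00.000Z",
        "modified": "2026-05-14T03:00:00.000Z",
        "1.0.0": "2026-05-01T00:00:00.000Z",  # two weeks old: allowed
        "1.0.1": "2026-05-14T03:00:00.000Z",  # nine hours old: held back
    }
    print(allowed_versions(times, now))
```

The idea is simply that a freshly published (possibly malicious) release never becomes installable until the community has had a day to flag it. In practice you would rely on your package manager's own setting rather than a hand-rolled filter.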
Nisten posted a scanner for this attack — I sent the link to my Hermes agent and asked it to run, and within minutes I had confirmation I wasn’t exposed. This is exactly the kind of thing where having a trusted agent matters. (Wolfram did the same thing with the link Ryan posted — gave it to his agent and let it audit his entire system.)We’re going to go through a turbulent period as offensive AI capabilities outpace defensive ones, but I’m optimistic. Just like HTTPS came after HTTP wasn’t secure enough, we’ll figure it out. Just stay vigilant! Tools & Agentic Engineering/goal: The New Ralph Loop, Productized across Codex, Claude Code and Hermes! (X)If you’ve been listening since January, you remember our Ralph Loop episode — one of the biggest episodes we ever did. Now, every major coding harness has implemented it as a built-in command called /goal.The pattern: you give the agent a measurable success condition like “stop when auth tests pass” or “stop at 90% coverage” or “fix every failing test until npm test exits 0 without modifying any file outside the /auth directory” — and the agent loops autonomously until that condition is met. A small validation model runs inside the loop to check whether goal conditions are met at each step.Codex shipped it first. Claude Code copied it (rushed, per multiple developers). Hermes has it. And the early head-to-head comparisons are not great for Anthropic — one developer ran Codex /goal overnight and got nearly 100 commits, while Claude Code reportedly struggled on the same tasks. Multiple folks switched back to GPT-5.5.Yam’s been running /goal 24/7 for an entire week. Building things like a custom terminal from a long PRD. The level of “fear of missing agent time” in the SF AI scene right now is genuinely a meme — people are walking around in clamshell mode with laptops open in their bags because they don’t want their agents to stop.This is the philosophical opposite of one-shotting. 
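For a feel of the /goal pattern, here is a rough sketch of the control loop under some loud assumptions: `step` stands in for one agent work iteration and `goal_met` for the success check. Both names are hypothetical, and this is not how Codex, Claude Code or Hermes implement it. The harnesses described above use a small validation model for the check; this sketch swaps in a plain command exit code (like "npm test exits 0") instead.

```python
import subprocess
from typing import Callable, Sequence

def command_succeeds(cmd: Sequence[str]) -> bool:
    """Goal check based on an exit code, e.g. command_succeeds(["npm", "test"])."""
    return subprocess.run(list(cmd), capture_output=True).returncode == 0

def run_until_goal(
    step: Callable[[int], None],
    goal_met: Callable[[], bool],
    max_iters: int = 50,
) -> int:
    """Call step(i) until goal_met() is true; return how many steps were taken.

    Raises RuntimeError if the goal is still unmet after max_iters steps,
    so a stuck agent cannot loop (and spend) forever.
    """
    for i in range(max_iters):
        if goal_met():
            return i
        step(i)
    if goal_met():
        return max_iters
    raise RuntimeError(f"goal not met after {max_iters} iterations")

if __name__ == "__main__":
    # Toy stand-in for an agent: each step "fixes" one failing test.
    failing = {"count": 3}

    def fake_agent_step(i: int) -> None:
        failing["count"] -= 1

    steps = run_until_goal(fake_agent_step, lambda: failing["count"] == 0)
    print(f"goal reached after {steps} steps")
```

The iteration cap is the important design choice here: a measurable stop condition plus a hard budget is what separates this from just telling an agent to "keep going."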
It’s for the kinds of tasks where the model is guaranteed to run out of context — architecture cleanups, auth flow consolidation, test suite hardening, TypeScript strictness migrations. Tasks that would have required you sitting there for hours hitting “continue.”Ryan’s right that this is going to change businesses forever. You can wrap /goal around measurable business outcomes — coverage targets, latency improvements, dead code elimination — and just unleash an agent against them.This Week’s Buzz: CoreWeave Sandboxes Goes Live 📦Breaking news from this morning! CoreWeave (the parent of Weights & Biases) just launched Sandboxes in preview, and it’s directly relevant to literally every conversation we just had about supply chain security and agents that need isolated execution environments.Here’s what you get: sandboxes via the W&B SDK. Spin up isolated CPU environments where your agents can execute code, clone repos, install dependencies — all the things you do NOT want happening on your main machine after the TanStack situation. Wolfram immediately pointed out the obvious use case: agentic evaluations need fresh, consistent environments per test, then teardown. Sandboxes solve exactly that.What makes this notable: the same infrastructure powers 9 out of 10 major AI labs (Meta, Anthropic, OpenAI, etc) for training their models. CoreWeave’s sandbox product runs on that same infra. And historically CoreWeave hasn’t catered to the developer market — they sell GPUs to enterprises. With CoreWeave Inference and now CoreWeave Sandboxes available via W&B, individual developers can now spin up the same infrastructure the foundation labs use.Pricing is generous in preview. 
Give it a try, give us feedback, and we’ll do a deep dive next week with the team that built it.AI Art: Krea 2 — A Foundation Model Built From Scratch 🎨We were really lucky to have Vic Perez, co-founder and CEO of Krea, on the show to talk about Krea 2 — their first foundation image model trained completely from scratch (X, Blog).I have a lot of love for Krea — they let me mess around on their H100 cluster way back when I was just getting into image generation, before ThursdAI even existed. Vic was super generous with that and I’ll always be grateful.The Krea 2 philosophy is what I find genuinely interesting. Vic used an amazing analogy on the show: using existing image models is like riding a horse. You can steer it down the path, you can speed it up and slow it down, but if you try to take it off the path — into “grainy,” “artistic,” “esoteric,” genuinely weird latent space — there are big walls and the horse won’t go there. That’s the over-post-training problem. Models are too safe, too constrained, too opinionated. They’ve optimized away the strange and beautiful edges of the latent space that early Stable Diffusion users loved.Krea 2 is built to be raw, flexible, unopinionated, and unconstrained. If your prompt is vague, the model brings you new ideas rather than four variations of the same thing. The opposite of what most models do.Other features:* Style transfer with up to 4 simultaneous reference images — extracts palette, texture, composition* Moodboards — upload a bunch of reference images and the system analyzes concepts and themes across them, not just style* ~15 second generation times* Available now for Max and Business tier users, API confirmed comingThey partnered with Black Forest Labs on their earlier Krea1 model, but Vic was clear about why they had to go build their own: the open-source ecosystem isn’t tunable enough to build the creative tools they want to build. So nearly half the company spent 6-7 months on Krea 2. 
The first model is intentionally conservative; the next one is going to push further into the weird.Big respect for any team training a foundation model from scratch in 2026!Wrap UpThat’s a wrap on what was, on paper, a “chill week” but turned into a 2.5 hour show because we kept finding new threads to pull on. The migration off OpenClaw, the interaction models bet from TML, the Musk v. Altman disclosures, CoreWeave Sandboxes finally going live — there’s a lot moving here.Next week I’m heading to Google I/O. Expect a lot of news, because every time Google I/O is about to happen, OpenAI tries to cut them off, and xAI typically jumps in last. The last two I/Os have been wild. I’ll be reporting live from the ground.Until then — install the 24-hour package rule, generate per-agent API keys, give your agents a sandbox to play in, and maybe go try Hermes if you’ve been on OpenClaw and feeling the pain. Or Codex. Anything, really, where things just work again.Thanks for hanging with us. It’s so good to be back. 🫡TL;DR - May 14, 2026* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co-Hosts - @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed, @ryancarson* Guest: Victor Perez @viccpoes - Co-founder & CEO, Krea* Big Co LLMs + APIs* Meta launches Muse Spark voice conversations across Meta AI app, WhatsApp, Instagram, FB, and Ray-Ban Meta glasses with real-time image gen, live camera AI, and instant Reels/maps integration (X, Announcement)* Mira Murati’s Thinking Machines Lab drops Interaction Models: 276B MoE (12B active) trained from scratch for native real-time multimodal collaboration; 77.8 on FD-bench v1.5, 0.40s turn-taking latency, full-duplex audio/video/text (X, Blog)* Musk v. 
Altman trial highlights: Musk wanted 90% equity, predicted “0%” success for OpenAI in 2018, Microsoft is now smallest mega-investor (SoftBank/Nvidia each ~$30B), Sutskever confirms “consistent pattern of lying” memo under oath* Anthropic adds separate Claude Agent SDK monthly credits to Pro/Max/Team/Enterprise starting June 15, 2026* OpenAI launches Daybreak, a frontier AI cybersecurity platform pairing GPT-5.5 + Codex + partners like Cloudflare (X)* Open Source AI* Fastino Labs GLiGuard: 300M-parameter guardrail model matching SOTA at 23-90x smaller size, 16x higher throughput, Apache 2.0 (X, GitHub)* Meta Sapiens2: Family of 6 ViT models (0.1B-5B) trained on 1B human images, SOTA on pose, segmentation, normals, and pointmaps (X, HF)* TanStack supply chain attack (mini Shai Hulud worm) — targets AI dev tooling, doesn’t uninstall, nukes home dir if token revoked. Install 24-hour package rule immediately (X)* Nous Research releases TST (Token Superposition Training): 2-3x wall-clock speedup at matched FLOPs without architecture changes (X)* Tools & Agentic Engineering* /goal command now in Codex, Claude Code, and Hermes — productized Ralph loop. Set measurable success condition, agent iterates until done. Codex implementation winning early comparisons over Claude Code (X, Docs)* Hermes from Nous Research passes OpenClaw as #1 CLI agent on OpenRouter; adds background computer use via Trykua (X)* Artificial Analysis Coding Agent Index: benchmarks model + harness combos. Opus 4.7 in Cursor CLI leads at 61, costs vary 30x across combos, GLM-5.1 tops open-weight at 53 (X)* This Week’s Buzz* CoreWeave Sandboxes launches in preview via W&B SDK — same infra that powers 9/10 major foundation labs now available to developers for agent isolation, evals, and RL rollouts (Docs)* Vision & Video* Perceptron Mk1 — frontier video + embodied reasoning model at 1/10th the price; 88.5 on VSI-Bench, 72.4 on RefSpatialBench (vs GPT-5m at 9.0). 
Live on OpenRouter (X, Site)* AI Art & Diffusion* Krea 2: Krea’s first foundation image model from scratch, focused on aesthetic diversity, style control with up to 4 references, and moodboards. ~15s generation (X, Blog)

May 8, 202653 min

📅 ThursdAI - May 7 - Interviews with Sunil Pai, Sally Ann O'Malley from AI Engineer Europe

Hey y'all, Alex here (with a scheduled post). I’m taking this week off to get married, celebrate life with family, and touch some grass, but I wanted to share the awesome chats I had with some great folks at AI Engineer Europe last week. BTW, Yam and Ryan took over the live show today; if you didn’t happen to catch that, please check out the live on our YouTube channel! Ok, now to the actual content. The best thing about the AI Engineer conferences for me is the people I meet. I often have a chance to bring them to the live show (in fact, the live show we recorded there had the most guests yet on an episode: 4 guests including Swyx, Omar Sanseviero, VB from OpenAI and Peter Gostev). But oftentimes I also have an offline chat. I find these conversations to be less about the week's news, and more about the state of AI Engineering and the guests themselves. Not quite Lex Fridman pod level, but a different vibe from our live shows.
Sunil Pai - Cloudflare (@threepointone)
The first conversation in today’s pod is with Sunil Pai, Principal Engineer at Cloudflare. Long-time followers of ThursdAI know that I love Cloudflare. They gave me my first big break when I was building Targum (which still runs on Workers), so I had a great time chatting with Sunil! This guy has had several lives. React.js core team at Meta (he self-deprecates: "I'm the one nobody talks about, there's a testing API I shipped that pisses people off"). Then he did developer tooling and the CLI at Cloudflare the first time. Left to found PartyKit, an open-source deployment platform for real-time multiplayer apps and AI agents, built on Cloudflare Durable Objects. Backed by Sequoia. Acquired by Cloudflare in 2024, and he came back as a Principal Systems Engineer (per his bio: "Worked at Cloudflare once, left and created PartyKit, came back wiser"). Also plays guitar (Les Pauls, it's all over his blog). 
Co-hosts a live show called Dry Run on Cloudflare TV with Craig Dennis.
Our conversation was a very fun one, ranging from Cloudflare's agentic offerings to how engineers should think about writing/reading code in 2026. I had a great time chatting with Sunil and I hope you enjoy getting to know him!
Sally Ann O'Malley - Red Hat
Then I had the pleasure of chatting with Sally, who’s a Principal Engineer at Red Hat and a contributor to OpenClaw. Sally has one of the more unusual paths in the speaker lineup. Started as a schoolteacher, did a stint at Trader Joe's, then moved to Westford, MA, discovered Red Hat's HQ across the street, and went back to school for a second bachelor's in software engineering at UMass Lowell. Joined Red Hat in 2015, has been there a decade. Worked across OpenShift teams, integrating Kubernetes and Podman into the platform. Recent projects span Image Based Operating Systems, Podman, OpenTelemetry, and Sigstore. Also an instructor at Boston University's Faculty of Computing and Data Sciences and an organizer for DevConf.US. Won the 2025 Paul Cormier Trailblazer Award at Red Hat. Currently a founding contributor on the llm-d project (distributed, scalable, high-performance AI inferencing built on K8s). Heavily involved in Red Hat's InstructLab collaboration with IBM (the small-model distillation system using IBM Granite + Llama).
Sally and I had a great conversation: two high-energy personalities met! We geeked out about our OpenClaw agents, securing your Clankers, what it's like to maintain OpenClaw, and everything in between! She was so stressed about the recording, but dare I say, this was one of the more natural guests I've had on the show! I hope you enjoyed this format, please let me know in the comments, and I’ll see you next week! Alex
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

May 1, 20261 hr 36 min

📅 ThursdAI - Apr 30 - DeepSeek V4 (1.6T MoE), Cursor SDK Wins WolfBench, Mayo's REDMOD Saves Lives, Stripe Gives Agents a Wallet & more

Hey everyone, Alex here 👋
Tomorrow is May. May! I genuinely cannot believe we’re four months into 2026 already, and the AI news cycle is showing zero signs of slowing down. This week’s show was a wild one! We opened with what is genuinely one of the most important AI stories I’ve ever covered (Mayo Clinic AI detecting pancreatic cancer THREE YEARS before human radiologists), we covered the return of the Chinese whale with DeepSeek V4, OpenAI got caught in their own system prompt begging GPT-5.5 to please stop talking about goblins, and I literally gave my coding agent a credit card and asked it to buy my fiancée a wedding gift with the new Stripe Link skill and CLI! Oh yeah, I’m getting married next Tuesday! 💍 So next week’s show will be a little different. I’ll be back the week after to catch you up on whatever drops in my absence (almost certainly something major, knowing this industry).
Lots to get through, so let’s dive in. (Also, at the end I have a full month recap of every major launch, don’t miss it.)
Mayo Clinic’s REDMOD: AI Detects Pancreatic Cancer 3 Years Early 🔥 (X, Blog, Announcement)
I know we usually cover models, parameter sizes, MoEs and big companies. But this is important. This is the use case that justifies the entire AI revolution, the GPU burns, the buildouts. I want humans to WIN, and cancer to be fixed!
Mayo Clinic just published a study in Gut (BMJ) validating an AI model called REDMOD that detects pancreatic cancer on routine CT scans up to three years before clinical diagnosis. 
The numbers are jaw-dropping: 73% sensitivity for catching prediagnostic cancers, compared to 39% for experienced human radiologists looking at the exact same CT scans.
And maybe the most important bit: on scans taken more than 2 years before diagnosis, the AI catches nearly 3x as many cases as specialists.
For context: pancreatic cancer has less than 15% five-year survival specifically because 85% of patients are diagnosed after the disease has already spread. This is the cancer that took Steve Jobs. Imagine if Jobs had access to this AI three years before his diagnosis. That’s the impact we’re talking about.
As Dr. Ajit Goenka from Mayo Clinic put it, the greatest barrier to saving lives from pancreatic cancer has been the inability to see the disease when it’s still curable. This AI can now identify the signature of cancer from a normal-appearing pancreas.
Even better: it runs on CT scans people are already getting for other reasons. No extra screening protocol, no new imaging required. Just smarter analysis of existing data. The model also showed remarkably stable performance across institutions, imaging systems, and protocols, with 90-92% test-retest concordance over serial scans.
Mayo Clinic is now moving this into prospective clinical testing through a study called AI-PACED (Artificial Intelligence for Pancreatic Cancer Early Detection).
When we say “lets f*****g go,” that’s what we mean. Yeah, getting more intelligence is cool, but I want a world without disease! Let’s F*****g go, Mayo Clinic!
Agentic Commerce - Giving OpenClaw my credit card - safely! Stripe Link Wallet and Infrastructure CLI (X, Announcement, Blog, Announcement)
Ok, give an LLM your credit card, what can go wrong... right? Well, it’s clear that this, increasingly, is the future of commerce. Agents will be shopping for us, and we need solutions here. Well, this week Stripe Sessions (Stripe’s annual product conference) just delivered. Link Wallet is a new ... API? CLI? 
Skill? Definitely a skill for your agents: it connects to your Stripe Link (the thing that stores your credit cards safely), and once you give your agent a budget, it can go and make purchases on your behalf. Now the trick here is that for every purchase you get a notification to approve, and the agent never sees your actual credit card number! This, I think, is the biggest win here.
To test it out, I first showed Wolfred the install instructions, which are literally this: “Read link.com/skill.md and get me set up with Link.” Then I asked Wolfred, my OpenClaw assistant, to buy me a present of its choice for my upcoming wedding, and said that I don’t want to know what the present is, but I can approve the spend! OpenClaw installed the skill and sent me a link to connect to my Link.com account. I also downloaded the Link app to receive notifications (and had to enable them by hand, which was a bit annoying to discover, but they said they will fix the onboarding) and... voila, my agent can now go spend my money, and I get an approval notification for each purchase.
The kicker? The present Wolfred sent us is due to arrive like 2 months after the wedding 😂 But hey, it’s still something! My agent went and chose a wedding gift in budget, asked for my approval to purchase, filled out the details (asking me for a few of them) and voila: my first agentic purchase that did not require exposing my credit card!
Stripe announced a whole bunch of other Agentic Commerce Suite features, like Shared Payment Tokens (scoped to a seller and protected by Radar), MPP (machine payment protocol), streaming payments using stablecoins that are pretty slick, and a bunch of other interesting things. This is where the world is moving, and Stripe is innovating hard here, definitely worth keeping an eye on.
Speaking of agents and Stripe, they also opened up the waitlist for projects.dev, which is a way for agents to provision accounts fully on their own, get API keys, and set everything up from scratch. 
I think it’s a wonderful addition to the agentic tools and the agentic internet! Your agent just runs something like stripe projects add cloudflare/workers and boom, you have a Workers deployment with credentials synced, no dashboard clicking or API creation!

Big Companies & APIs

GPT-5.5 Goblin Mode: The Funniest Bug Report in AI History (X, Blog)

Someone on X noticed that the Codex system message for GPT 5.5, which launched last week, has this interesting addition: “Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query”, and it has it two times! This created a bunch of memes, questions and wonderings about... why would OpenAI care so much about goblins? And they finally posted a long writeup on why. The TL;DR is: GPT 5.5 absolutely LOVES talking about goblins, trolls and other nerdy creatures. This is a result of them favoring the “nerdy” personality archetype and reinforcing this reward via RL. OpenAI admitted that “Unfortunately, 5.5 started training before we found the root cause of goblins”, and so now we get a 5.5 that LOVES to talk about goblins and can’t stop talking about goblins (unless it is asked to stop by a system prompt).

OpenAI also posted the exact instructions for how to “unleash” goblin mode on the blog, which I find hilarious. A company that leans into the meme is a company to be celebrated 👏

GPT 5.5 is as good as Claude Mythos on CyberSecurity

According to the AI Security Institute, GPT 5.5 (not the GPT 5.5 - Cyber version that was announced, but the one you have access to) is as good as Claude Mythos at vulnerability finding. We previously reported that Anthropic deemed Claude Mythos “too dangerous to release publicly”, and it turns out that this was either a marketing “Myth” or Anthropic’s inability to serve this huge model like they serve Opus. 
OpenAI Ends Microsoft Azure Exclusivity

This piece of news sent quite a shock through the industry: somehow, Sam Altman and OpenAI were able to renegotiate their very strict deal with Microsoft, and OpenAI models are now available on AWS as well as Microsoft Azure! Apparently the AGI clause is now gone as well! For the many startups who are locked into AWS and Bedrock, this is great news: they are now able to use GPT 5.5 and other OpenAI models directly, applying their credits.

Other Big Company News

xAI released Grok 4.3 in a quiet release in their API docs: no blog post, not even an X announcement. The only way I know about this is that Artificial Analysis, Arena and Vals AI all posted that it jumped in scores. With the same price as the previous Grok, but only 1M tokens, it seems significantly better than its predecessor (X)

Gemini can now generate and export Docs, Sheets, Slides, PDFs directly from chat — available globally for free. Google literally put Microsoft Word and Excel icons in the announcement. They’re giving away what Microsoft charges for with Copilot to 750 million users. (X, Blog)

Mistral Medium 3.5 dropped as a 128B dense model with 256K context, 77.6% on SWE-Bench Verified, and configurable reasoning effort. Their Vibe coding agent now supports remote parallel agents and session teleportation. $1.5/$7.5 per million tokens. (X, HF, Blog)

Baidu’s ERNIE 5.1 Preview landed at #13 on Arena’s Text leaderboard, making it #1 among all Chinese labs. Speculated to be an 800B/36B active MoE using only 6% of comparable pretraining compute. (X, Announcement)

Open Source AI

The Whale returns - DeepSeek drops V4 with insane attention innovations (X, Arxiv, HF, HF)

Folks, DeepSeek just dropped V4! Two models: V4-Pro at a whopping 1.6 trillion params with 49 billion active, and V4-Flash at 284B total with only 13 billion active. Both support 1 million token context natively! V4-Pro-Max gets 93.5% on LiveCodeBench, beating every other model including Gemini-3.1-Pro. 
Codeforces rating of 3206, that’s a new record, beating GPT-5.4’s 3168. SWE-Bench Verified at 80.6%, that’s basically tied with Opus-4.6 at 80.8%. But here’s the thing: this model doesn’t overwhelm on eval performance, it’s on par with other open source models, and at 1.6T nobody is running this on home GPUs! The bigger story here is the efficiency at long context! At 1 million context, V4-Pro uses only 27% of the FLOPs and 10% of the KV cache compared to DeepSeek V3.2. The KV cache at 1M is like 8.7x smaller than V3.2’s.

The pricing is also ridiculous (well, it was always cheap, but with these perf innovations DeepSeek can afford to undercut)! API pricing is $0.145/$3.48 per million tokens for Pro (7x cheaper output than Opus 4.7) and $0.028/$0.28 for Flash (30-100x cheaper than GPT-5.5).

This release didn’t break through the AI bubble quite like DeepSeek R1, and we covered this on the show, but like a good whale, what you see on the surface is tiny compared to what lies beneath. This is a technological and innovation marvel: reducing compute and memory requirements by 90% compared to standard attention? Crazy.

SenseNova U1: Unified Multimodal Without an Encoder - an oss infographic creator (X, X, HF, Blog, Try it)

SenseTime open-sourced something genuinely architecturally wild this week. SenseNova U1 is a unified multimodal model — 8B parameters with a 3B active MoE variant, both Apache 2.0 — that does both understanding and generation end-to-end with no visual encoder and no VAE.

They call the architecture NEO-Unify, and instead of the traditional pipeline (image → visual encoder → LLM → VAE → output), it’s just a single model handling pixels and words natively. The numbers are absurd for the size: 57.5% on Spatial Understanding (Qwen-VL: 35%) and a very high 91% on GenEval-Info for infographics.

Nisten and I tried it live on the show and it generated coherent infographics with crisp text — something most 8B models struggle with. 
Chinese users are reporting it rivals Qwen-Image 2.0 Pro for design drafts at much higher inference speeds. But for us, another infographic came out with a bunch of Chinese text; FWIW, we didn’t prompt for English only. The 3B-active MoE variant runs comfortably on consumer GPUs. Apache 2.0, fully open, in collaboration with MMLab at NTU.

This week’s Buzz - W&B update!

The biggest update this week is that we have gone viral with WolfBench.ai! Wolfram tested the Cursor harness (as well as many other harnesses) with GPT 5.5 and saw the best result we’ve tested so far! We still have a lot of testing to do: we need to add the Codex CLI itself and Devin, and many folks are asking for OpenCode and FactoryAI droids!

Also, we’ve launched the IBM Granite 4.1 models on W&B for a very cheap $0.05 / $0.10 per 1M tokens. The models in this series are instruct models without reasoning, Apache 2.0 licensed. Get it here

Are you concerned about your Cognitive Security? Guest speaker Max Spero from Pangram Labs says you should be

We had Max Spero from Pangram Labs on the show to talk about their Chrome extension that auto-flags AI-generated content as you scroll your feed. I’ve been using it for a while and many of my suspicions about who’s a slop merchant have been validated.

According to Max, Pangram has a 1 in 10,000 false positive rate. If Pangram says something is AI, you can be very confident it was AI-generated. They don’t catch everything: short text, heavily humanized content, or very new models might slip through. But when they flag something, they claim 98.99% accuracy that it was written with AI. Max addressed the fact that previous “AI detection” tools like GPTZero and others were often mocked for a lot of false positives (for example, saying that the Declaration of Independence was written with AI), and says that this is no longer the case!

Taylor Lorenz used the Pangram API to scan top Substack bestsellers and found some popular “writers” are nearly fully machine-generated. 
Technology Substacks have the highest AI content rate: more than 1 in 4 top posts show substantial AI content. And that’s only what Pangram catches.

Max framed it as “cognitive security” - knowing what your inputs are. LLMs are already superhuman at persuasion, and if you’re getting one-shotted by AI-generated content that you think is human, that matters. They’re working on multimodal detection next (images, video), which will be huge given how hard GPT-Image-2 outputs are to spot.

I find their Chrome extension very useful: I scroll my feeds, see a bunch of “ai” labels, and know to skip that content if I don’t want to read it. You can get a 2-week trial of their Chrome extension via the Pangram X account.

April 2026 - a full month of AI model releases

April was an insane month, here’s the major release calendar for April 2026:

Mar 31: Claude Code leak
Apr 1: Alibaba Wan 2.7-Image · Fish Audio STT
Apr 2: Google Gemma 4 | Alibaba Qwen 3.6-Plus
Apr 4: OpenAI GPT-Image-2 (Arena leak)
Apr 6: MemPalace
Apr 7: Anthropic Claude Mythos Preview · Z.ai GLM-5.1
Apr 8: Meta Muse Spark
Apr 9: Anthropic Managed Agents
Apr 10: AI Engineer London
Apr 11: MiniMax M2.7 (open weights)
Apr 14: Baidu ERNIE-Image 8B
Apr 15: Google Gemini 3.1 Flash TTS
Apr 16: Anthropic Claude Opus 4.7 | OpenAI Codex (computer-use)
Apr 17: Anthropic Claude Design
Apr 20: Moonshot Kimi K2.6 · OpenAI Codex Chronicle
Apr 21: OpenAI ChatGPT Images 2.0
Apr 22: OpenAI Privacy Filter (1.5B)
Apr 23: OpenAI GPT-5.5 + GPT-5.5 Pro
Apr 24: DeepSeek V4 Pro & Flash
Apr 27: Cognition Devin for Terminal
Apr 29: Cursor SDK | Baidu ERNIE 5.1 Preview | Stripe Link Wallet (Agents) · IBM Granite 4.1 8B
Apr 30: xAI Grok 4.3

That’s all for today folks, we’ve talked about a few other things, and the TL;DR list of releases keeps growing and growing from week to week. As I said, I’m getting married next week, so I will be out and won’t be on the live stream; Yam, Ryan, Nisten and LDJ will make sure you’re up to date! 
If you found this valuable, please consider supporting our publication with a subscription and share with a friend.

Alex 🫡

ThursdAI - April 30, 2026 - TL;DR

Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts: @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed
* Guest: Max Spero (@max_spero_) - Co-founder, Pangram Labs

Healthcare AI
* Mayo Clinic’s REDMOD detects pancreatic cancer up to 3 years before clinical diagnosis with 73% sensitivity vs 39% for radiologists (Announcement)

Open Source LLMs
* DeepSeek V4 paper drops with CSA+HCA attention, 1M context at 5.7GB KV cache, possibly first frontier model trained across multiple datacenters (Arxiv)
* SenseTime open-sources SenseNova U1 - unified multimodal 8B/3B-active MoE with no encoder/VAE (HF, GitHub)
* IBM releases Granite 4.1 family (3B/8B/30B) - non-thinking dense models with 20x token efficiency over Qwen3.5 9B, Apache 2.0 (Blog, HF)
* Mistral launches Medium 3.5 - 128B dense flagship with 256K context, configurable reasoning, plus Vibe coding agent (HF, Blog)
* Baidu ERNIE 5.1 Preview hits #13 on Arena (#1 Chinese lab) using just 6% of comparable pretraining compute (ernie.baidu.com)

Big CO LLMs + APIs
* OpenAI publishes blog explaining GPT-5.5’s “goblin mode” - reward amplification during RL training created an obsession with creature metaphors, leading to duplicated suppression instructions in the Codex system prompt
* OpenAI ends Microsoft Azure exclusivity, AWS announces GPT-5.5 and Codex on Bedrock; AGI clause removed from contract (Sam tweet)
* Gemini can now generate and export Docs, Sheets, Slides, PDFs, .docx, .xlsx, LaTeX directly from chat - free for all users globally (Blog)
* NVIDIA releases Nemotron 3 Nano Omni - 30B/3B-active hybrid Transformer-Mamba MoE with 256K context, 9x throughput on consumer hardware (Blog)

Agentic Commerce & Tools
* Stripe launches Link wallet for agents at Sessions 2026 - AI agents get scoped payment credentials with mandatory human approval, real card never exposed (Blog)
* Stripe removes waitlist on Projects.dev - 32 infrastructure providers (Cloudflare, WorkOS, ElevenLabs, Twilio, Daytona, Browserbase, AgentMail, etc.) provisionable via CLI for AI agents
* Cursor launches SDK exposing the same runtime, harness, and models that power Cursor IDE - now embeddable in any product (Docs)
* Cognition launches Devin for Terminal - local CLI coding agent with /handoff command for seamless cloud transfer (cli.devin.ai)

Evals & Benchmarks
* WolfBench tests 23 models across 300+ runs on Terminal-Bench 2.0 - Cursor Agent + GPT-5.5 is the #1 combination (wolfbench.ai)
* Microsoft’s DELEGATE-52 benchmark shows GPT-5.4 loses 28% of document content after 20 iterative edits, frontier models corrupt stealthily while preserving structure

This Week’s Buzz - Weights & Biases
* IBM Granite 4.1 live on W&B Inference at $0.05/$0.10 per million input/output tokens with 128K context
* WolfBench results going viral with Cursor + GPT-5.5 dominance, Codex and Devin testing in the pipeline

AI Detection & Cognitive Security
* Pangram Labs launches Chrome extension auto-flagging AI content in real time on X, LinkedIn, Reddit, Substack, Medium with 99.98% accuracy and 1-in-10,000 false positive rate (pangramlabs.com)
* Taylor Lorenz uses Pangram API to analyze top 25 Substack bestsellers, finding many popular newsletters are near-fully AI-generated

AI Art, Video & Audio
* ElevenLabs launches ElevenMusic - full music platform with discovery, remixing, royalties; 4,000+ indie artists at launch (elevenmusic.io)
* HeyGen HyperFrames integrates natively with Claude Design - HTML-to-MP4 motion graphics via single CLI command (hyperframes.dev)
* xAI drops Grok Imagine update with dramatically improved lip sync, sound, and 30-second video extensions
* OpenAI engineer confirms team is actively fixing GPT-Image-2’s noise artifact issue

Other
* Talkie - 13B open-weight LLM trained exclusively on pre-1930 text, by Alec Radford and David Duvenaud (talkie-lm.com)
* GPT-5.5 Codex full system prompt leaked from OpenAI’s open-source repo, revealing 272K context window, four reasoning levels, three personality modes, and the duplicated anti-goblin instruction

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
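
A quick back-of-the-envelope on the Pangram numbers from this episode, since “1 in 10,000 false positives” can be hard to picture. This is only a sketch: the 1-in-10,000 false positive rate is the figure Max quoted, while the recall (the share of AI-written posts that actually get flagged) and the share of AI-written posts in a feed are made-up placeholder values, not Pangram’s numbers. It uses Bayes’ rule to estimate how often a flagged post really is AI-written.

```python
def flagged_precision(false_positive_rate: float, recall: float, ai_share: float) -> float:
    """Share of flagged posts that are truly AI-written (Bayes' rule).

    false_positive_rate: chance a human-written post gets flagged anyway
    recall:              chance an AI-written post gets flagged (hypothetical input)
    ai_share:            fraction of posts in the feed that are AI-written (hypothetical input)
    """
    true_flags = recall * ai_share                         # AI posts that get flagged
    false_flags = false_positive_rate * (1.0 - ai_share)   # human posts flagged by mistake
    return true_flags / (true_flags + false_flags)


if __name__ == "__main__":
    quoted_fpr = 1 / 10_000      # the rate quoted on the show
    assumed_recall = 0.90        # placeholder, NOT a Pangram figure
    assumed_ai_share = 0.25      # placeholder, NOT a measured base rate

    precise = flagged_precision(quoted_fpr, assumed_recall, assumed_ai_share)
    sloppy = flagged_precision(0.01, assumed_recall, assumed_ai_share)  # hypothetical 1-in-100 detector

    print(f"1-in-10,000 false positives: {precise:.4%} of flags are real")
    print(f"1-in-100 false positives:    {sloppy:.4%} of flags are real")
```

With these placeholder inputs, more than 99.9% of flags land on real AI content, and the result stays high even if you lower the assumed share of AI posts quite a bit. That is why the false positive rate, rather than how much gets caught, is the number that decides whether you can trust a flag.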

April 24, 2026 · 2 hr 24 min

📅 Apr 23: OpenAI's Week: GPT-5.5, GPT-Image-2, Codex CUA + Chronicle, + Claude Design, Kimi K2.6, Qwen 3.6-27B

Hey, Alex here, I’ll try to catch you up, but it’s one of the more intense weeks in AI in recent memory. Here’s the TL;DR - OpenAI dominates across the board this week! They finally launched “spud”, called it GPT 5.5 (and 5.5 Pro), and it’s SOTA on most things, nearly matching the mysterious Claude Mythos, except it’s released and we can actually use it (we tested it extensively). OpenAI also took the crown in image generation with the incredible GPT-image-v2 release, beating Nano Banana 2 and Pro by a significant margin. The images are incredible; this model can generate working QR codes and 360 images, it’s quite bonkers. Codex was updated with Computer Use (which I told you about last week), an in-app browser and a bunch of other tools that match GPT 5.5’s intelligence.

Meanwhile, Anthropic launched an incredible research preview of Claude Design, finally admitted that Claude was dumb and reset quotas across the board, while breaking the trust of the community by removing Claude Code from the Pro plan. We’ve also got great open source updates: Kimi K2.6 and Qwen 3.6 27B are both great performers!

We were live on the stream for almost 4 hours today waiting for GPT 5.5, finally got it, and tested it live on the show. We also had Peter Gostev on from Arena, who had early access and shared his insights with us. Let’s get into it!

ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

OpenAI’s GPT 5.5 is here - SOTA AI intelligence you can actually use (Release Blog)

OpenAI finally gave us all access to their latest intelligence boost, GPT 5.5 thinking (and GPT 5.5 Pro). These models take the crown across many benchmarks, including TerminalBench (82.7%), GDPval (84%) and more. You can see the highlighted versions on the image above. 
Though it’s not uncommon for OpenAI to do some chart crimes, so @d4m1n created a chart that also shows the full benchmarks, including the ones where GPT 5.5 is not beating Opus. As you can see below, it underperforms on Humanity’s Last Exam and scaled tool use.

But benchmarks don’t tell the full story. GPT 5.5 uses significantly fewer tokens than 5.4, about 40% fewer. It’s also more expensive, but given the lower token usage, it nets out at about a ~20% price increase, while being more intelligent and faster. Tons of folks who had early access are reporting the same thing: this model excels at long-running tasks. Peter Gostev from Arena, who joined our live stream, showed us an incredible demo that ran overnight for over 8h! This model can work until the task is done, no longer pausing in the middle to ask for your input.

The real highlight is that, paired with the recent GPT-image-2 (which I’ll expand on later in this newsletter), GPT 5.5 becomes an excellent UI designer. This is a big area in which Claude still has a moat and OpenAI is trying to catch up, and the real alpha now is to use both the image gen and 5.5 in tandem to create beautiful visuals and UIs. The main thing is, after testing it quite a few times, this only works if you generate the image outside of the session that builds the actual UI. We tried a couple of times to do it in one session, and the resulting UI wasn’t remotely close to the generated image. Only after sending the image to a completely fresh session and asking for a “pixel perfect” implementation did GPT 5.5 start to match the input image and rebuild the whole UI in pixel-perfect fidelity!

GPT Image v2 - SOTA thinking image model, finally beating Nano Banana (Blog, Live)

Like we said, OpenAI is dominating this week, and in both instances those are great models. Though, in an apples-to-apples comparison, GPT-image-v2 is a much higher jump — from previous models — than GPT 5.5! 
According to Artificial Analysis, the jump in how many people prefer GPT-image-2 in blind tests compared to other models is the highest we’ve ever seen, over 250 points. And you can clearly see it in the generations as well. Earlier this week, we did a live streaming session with Peter Gostev (from Arena) where we did a deep dive comparing this new model to GPT Image 1.5, Nano Banana and Grok Imagine, and it’s a clear winner across most categories.

Character consistency is immaculate; high-resolution imagery and instruction following are all so, so good it’s a bit hard to explain in text.

Reasoning visual intelligence

Like with Nano Banana, this model is likely based on a big GPT model; it’s no longer just diffusion: as you can see, it reasons! And apparently the more reasoning you give it (if you choose GPT Pro), the better it’ll be. The examples are indeed wild: the model can generate images of code that works, and generate functional QR codes and bar codes! The craziest thing people figured out it can do is functional 360 imagery (equirectangular format): you can just ask the model to create a 360 image of a “scene” and then drop it into a 360 viewer!

Peter showed us on the show how he combined GPT 5.5 and Image v2 to create a sort of “street view” from a bunch of 360 images, and it blew our minds. He literally spun up an overnight GPT 5.5 task in Codex that planned out the hanging gardens of Babylon, generated hundreds of equirectangular images, stitched them into a walkable interface, and had it running 8+ hours without babysitting. A street view of a place we don’t actually know what it looked like, hallucinated from latent space. What a time.

Day one availability is wide: Figma, Canva, Adobe Firefly, fal.ai, and Microsoft Foundry all have it. 
Nano Banana dominated for what felt like an eternity in AI time (it was really only a few months 😅), and finally OpenAI has a proper answer.

OpenAI is dropping models on HF - Privacy Filter, a 1.5B Apache 2.0 PII reduction model (X, HF)

I’ve told you they’ve been cooking this week! OpenAI open-sourced a genuinely useful model called Privacy Filter that has 1.5B parameters with only 50M active, small enough that it runs fully offline in your browser (check out this incredible web demo by our friend Xenova). This model is specifically built to anonymize and filter out personally identifiable information (PII): things like names and addresses, but more importantly bank accounts and API keys! This, in the era of agentic assistants, is extremely important, and I’m very happy that OpenAI is open-sourcing here, specifically because, while it’s great in general, this model is also great for fine-tuning on your own data!

Pair this with something like CrabTrap, a new open source proxy with LLM-as-a-judge for agents like OpenClaw, and you’re hardening your setup so that your private details won’t leak, even if someone manages to prompt inject your agent! In any other week, CrabTrap would deserve a segment of its own; it is really a novel solution to the “AI agent can leak your creds” problem, created by the Brex CEO, as they run agents inside Brex. But this week is insane, so... you get a link and we move on 🙂

Claude Design - Anthropic’s Figma killer? (try it, deep dive)

This launched on Friday (come on Anthropic, why are you launching things on a Friday?!) and nearly tanked Figma stock (16% down since). It didn’t help that Mike Krieger, who runs product at Anthropic and co-leads Anthropic Labs, quit the Figma board just a few days before this release. Claude Design is a new, separate interface for Claude, with its own usage meter, that exists only on web, and only for Max subs for now. 
We all know that Claude is great at frontend design, but this is an interface that wraps Claude with some incredible “designer-like” tools: knobs to edit font sizes, and a point-and-click interface to highlight elements for Claude to fix. The highlight for me, what broke my brain on the live stream, was the “talk to the design” feature, where you turn on the microphone, talk to Claude, and while you point, it “knows” what you’re pointing at! So you can say “here, fix THIS thing” without saying what that thing is, and Claude will just fix it by looking at where your cursor was at the time. This... this feels like magic.

The huge unlock in Claude Design is the initial “brand guidelines” process, in which you ask Claude to create a holistic brand identity (based on your website code, a screenshot, a Figma file, etc.), and then every new project can have that brand identity preserved, with the right fonts, colors, logos, etc. I dropped in the show notes from this week and asked for an interactive infographic website using the brand guidelines.

This really does feel like a “new kind” of product. I’ve worked with designers before, and the interaction model with Claude Design feels very much like working with a designer: showing them what you like and don’t like. And like working with a designer, it’s expensive! Claude Design uses Claude Opus 4.7 and buuurns through tokens! I tapped out my weekly quota in fewer than 4 projects! Luckily, Anthropic this week admitted that they’d dumbed down Claude and reset the quotas, so I was able to show it on the live show. 
This week’s Buzz — W&B LEET TUI gets Workspace mode

Our W&B LEET TUI went viral a couple of weeks back (local terminal UI for watching run stats, metrics, and system health - built for folks training on remote boxes who don’t want to alt-tab to a browser), and the team shipped a big follow-up this week: workspace mode.

Multi-run workspaces live, metadata filtering, system metrics (GPU stats included), console logs, and — my favorite — images rendered directly in the terminal. The whole web workspace experience, now in your SSH session.

Demo video and full announcement here. pip install wandb, give it a spin.

Open Source AI

Kimi K2.6 - Opus at home (if you have a data center) (X, HF, Live)

Moonshot AI dropped Kimi K2.6 this week, a 1 Trillion parameter MoE with 32B active, 384 experts, 256K context, under a modified MIT license. The headline numbers are wild: SWE-Bench Pro at 58.6 (beating GPT-5.4 and Opus 4.6), BrowseComp at 83.2, HLE with tools at 54.0.

Wolfram ran it on his own Wolf Bench and it came out as the best open source model he’s ever tested — essentially matching Sonnet 4.5 on terminal bench with the Terminus agent harness, and beating Opus 4.6 inside OpenClaw. That’s a crazy sentence to write.

Pricing on Cloudflare Workers AI is $0.95/M input, $4/M output — roughly 15x cheaper than Opus. If you have the budget to run it.

Now, the calibrated take: Yam showed us a report from @BrightMind where Kimi failed pretty badly at rendering a 3D lava lamp while every other frontier model nailed it. Artificial Analysis has Kimi at #4 on their intelligence index (54) behind the three frontier labs. So it’s definitely a bit benchmaxxed on agentic coding, but it’s also genuinely good at agentic coding, which is the use case most people care about right now. My own test: it overthinks a lot, generates a lot of tokens (which hits your wallet even at those low prices) and I wasn’t very happy with it during my live test. The frontend design of it is meh, and it did feel benchmaxxed. 
Bottom line: if you’re building an OpenClaw setup and you want Opus-adjacent quality without paying Opus prices, Kimi K2.6 could be the move. They also shipped Kimi Code CLI as a companion to Claude Code / Codex CLI.

Alibaba drops Qwen 3.6 27B - (Actually Sonnet at home)

This one is special because it’s genuinely, actually runnable at home. It’s a dense 27B model under Apache 2.0, and it beats Alibaba’s own ~400B Qwen3.5 flagship MoE on every major coding benchmark. SWE-bench Verified 77.2, Terminal-Bench 2.0 at 59.3 (matching Opus 4.5), SkillsBench 48.2 (beating Opus 4.5 at 45.3).

With Unsloth’s dynamic GGUFs, this runs on 18GB of RAM. A used RTX 3090 under $1000 or a 24GB Mac Mini and you’re running something genuinely comparable to Sonnet 4.5 at home. Nisten has been daily-driving it and said people are calling it “Sonnet 4.5 at home” - it’s not drop-in replacement perfect (it struggled with hard git merges in his testing), but for non-critical work? Absolutely there.

Natively multimodal, 262K context extendable to 1M. There’s also a sibling, Qwen3.6-Max-Preview, available on their API if you want the frontier version.

Great great open source model!

Quick hits

A bunch of stuff worth knowing about that didn’t get full segments:
* Google Gemini Deep Research + Deep Research Max on Gemini 3.1 Pro (announce) — autonomous research agents that navigate web + your custom docs. Plus native chart generation and MCP support in the API.
* Google Gemini Enterprise Agent Platform (launch) — evolution of Vertex AI for enterprise agent builders.
* ChatGPT Agents “Hermes” leak — an agents builder/studio with templates and Slack integration incoming per @btibor91.
* Codex now has 4M users per the team, and they open-sourced Euphony, a visualizer for Codex session logs.
* SpaceX / Cursor $60B deal — the structure is either a $60B acquisition or a $10B collaboration experiment. The thesis being whispered: are developer traces the missing training ingredient for frontier coding models? 
Very spicy, very Elon.
* Speaking of Elon, xAI released Grok-Voice-think-fast 1.0 (Blog) - it’s their fully end-to-end omni model that takes customer calls and is already deployed at scale at Starlink! A very interesting contender to the Gemini Flash live model we covered before. The benchmarks look insanely good.

Phew

I said at the top this was one of the more intense weeks in AI in recent memory, and I genuinely mean it. We were live on the stream for almost four hours. I’ve done five livestreams since last Thursday. GPT 5.5 dropping mid-show was the cherry on top. Between Codex becoming ambient, GPT Image v2 rewriting the ceiling for generative visuals, Claude Design moving a stock price, two incredible open source drops in Kimi and Qwen, and OpenAI quietly re-committing to open source — this was a lot.

If you’re feeling the FOMO, you’re not alone. We live this stuff and I still feel it. My ask this week: bookmark the livestreams, play with GPT Image v2 (it’s genuinely the most fun I’ve had with an image model in a long time), and if you’re deploying agents in production, go read the CrabTrap source code this weekend.

See you next Thursday — same place, same time, probably another launch that disrupts us mid-show. 
That’s the world now 🤷

ThursdAI - Apr 23, 2026 - TL;DR

* Hosts and Guests
  * Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
  * Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
  * Peter Gostev (@petergostev) - Arena AI
* Big CO LLMs + APIs
  * OpenAI launches GPT-5.5 and GPT-5.5 Pro — SOTA across the board (Blog, Livestream)
  * OpenAI GPT-Image-2 — biggest Arena Elo jump ever, thinking mode for images (X, Eval site, Livestream)
  * OpenAI Codex — Background Computer Use + Chronicle (screen memory), hits 4M users (Chronicle)
  * GPT-5.5 pre-launch leak in Codex dropdown (X)
  * Anthropic Claude Design — research preview on Opus 4.7, Figma -7% (X)
  * Anthropic resets all Claude quotas, admits degradation, allows OpenClaw CLI back (X)
  * Anthropic ARR crosses $30B
  * Google Gemini Deep Research + Deep Research Max on Gemini 3.1 Pro (X)
  * Google Gemini Enterprise Agent Platform (X)
  * ChatGPT Agents “Hermes” leak — builder/studio + Slack integration (X)
  * OpenAI clinician/medical model + workspace agents released
* Open Source LLMs
  * Moonshot Kimi K2.6 — 1T MoE, 32B active, SOTA open source on SWE-Bench Pro (X)
  * Alibaba Qwen3.6-27B — dense 27B, Apache 2.0, beats own 400B flagship (X, HF)
  * Alibaba Qwen3.6-Max-Preview on API (X)
  * OpenAI Privacy Filter — 1.5B MoE, 50M active, Apache 2.0, runs in browser (X)
* Tools & Agentic Engineering
  * Brex CrabTrap — LLM-as-judge HTTP proxy for agent security (X)
  * OpenAIDevs Euphony — open-source Codex session log visualizer (X)
* This week’s Buzz - Weights & Biases
  * W&B LEET TUI goes workspace mode — multi-run, GPU metrics, images in terminal (X)
* Voice & Audio
  * StepAudio 2.5 TTS — natural-language control of emotion and delivery (X)
* Deals & Industry
  * SpaceX/xAI Cursor — $60B acquisition or $10B collaboration structure
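
A small sanity check on the GPT 5.5 pricing math from this episode. Assuming “about 40% less” means the same task takes roughly 0.6x the tokens, and “~20% price increase” means the bill for that task comes out at roughly 1.2x, the implied per-token price works out to about double. Those two ratios are my reading of the numbers in the notes, not official OpenAI pricing, and the function names are only for illustration.

```python
def implied_price_multiplier(token_ratio: float, net_cost_ratio: float) -> float:
    """Per-token price multiplier implied by a token ratio and a net cost ratio.

    token_ratio:    new tokens per task / old tokens per task (0.6 = 40% fewer)
    net_cost_ratio: new bill per task / old bill per task (1.2 = 20% more)
    """
    if token_ratio <= 0:
        raise ValueError("token_ratio must be positive")
    return net_cost_ratio / token_ratio


def task_cost_ratio(token_ratio: float, price_multiplier: float) -> float:
    """Net bill change for a task when both token use and per-token price change."""
    return token_ratio * price_multiplier


if __name__ == "__main__":
    tokens = 0.6   # "about 40% less" tokens, as read from the notes
    bill = 1.2     # "~20% price increase" per task, as read from the notes

    multiplier = implied_price_multiplier(tokens, bill)
    print(f"Implied per-token price multiplier: {multiplier:.2f}x")
    print(f"Round trip, bill per task: {task_cost_ratio(tokens, multiplier):.2f}x")
```

The same two-line calculation works in the other direction: plug in a model’s published per-token price change and your own measured token ratio to see what a task really costs you.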

April 16, 2026 · 1 hr 59 min

April 16 - Codex uses your mac in the background, Opus 4.7 release not quite Mythos + 3 interviews

Hey y’all, Alex here with your weekly AI news catch-up. It’s one of those Thursdays where, no matter how well I prep, the big AI labs are hell-bent on showing up before each other. Alibaba dropped Qwen 3.6 with Apache 2, confirming their commitment to open source, then Anthropic released Claude Opus 4.7 (not quite Mythos), and OpenAI followed with a huge Codex update that includes Computer Use, among other things. The highlight of Computer Use is the background usage, more on that below. This is all just from today!

Earlier in the week we had 2 incredible 3D world generators, Lyra 2.0 from Nvidia and HYWorld 2 from Tencent; Windsurf dropped its 2.0 version with Devin integration; Google released a Gemini TTS with support for 90+ languages and an incredible emotional range; and Baidu open-sourced Ernie Image, rivaling Nano Banana.

Today on the show we had 3 awesome guests: Theodor from Cognition joined to cover the new Windsurf, Kwindla was back on the show to talk about “the side project that escaped containment”, Gradient-Bang, a multi-agent, voice-based space game, and Trevor from Marimo joined to talk about pairing your agents with a Marimo notebook. Let’s dive in! 👇

ThursdAI - We’re over 16K on YT today, my goal is to get to parity with Substack, please subscribe.

Codex can now really use your computer: OpenAI updates Codex with CUA, Image Generation, Browser, SSH (X, Blog)

Codex has been the major focus inside OpenAI for a while now. We’ve reported previously that OpenAI is closing down SORA and other “side-quests” to focus, and that they will join Codex, ChatGPT and the Atlas browser into one “superapp”, and today, it seems, we’ve gotten an early glimpse of what that app will be. The Codex team (which seems to be growing from day to day) has been on a TEAR feature-wise lately, trying to beat Claude Code, and they pushed an update with a LOT of features, among them a new memory system, an internal browser and image generation. 
The highlight for me, though, was absolutely the polished computer use experience. Computer use is not new: Claude has a computer use feature flag, and so do many others. Hell, we told you about computer use with Open Interpreter back in Sep of 2023. But this... this feels different. You see, OpenAI quietly purchased a company called Software Apps Inc, which almost launched a macOS AI companion called Sky a year ago. This team is obsessed with the Mac, and somehow they were able to build a magical experience, a huge part of which is the fact that they control the Mac in the background. This is like black magic stuff. You work on one document, Codex clicks buttons and does things in another, without interrupting you.

You may ask: Alex, why do you even care so much about computer use, when most of the work happens in the browser anyway, and Claude (and Codex) can control my browser? Well, true, but not ALL work happens there. Take file system integration, for example. Uploading and downloading files is notoriously a big part of browser automation that fails. I’ve spent countless cycles trying to get this to work with OpenClaw, and this just does it. This closes the loop between knowledge work in the browser (yes, this thing can use your browser) and the broader OS. It’s so, so polished, I truly recommend you try it. It’s as easy as @ tagging any app that you have running and asking Codex to do stuff there. Pro Tip: Enable fast mode for a much smoother experience.

Anthropic Opus 4.7 is here, not quite Mythos, 64.3% SWE-bench Pro, tuned for long running tasks (X, System Card)

What is there to say? Is this the model we expected from Anthropic after the news about Claude Mythos last week? No. But hey, we’ll take it. A new Claude Opus, with significantly improved multimodal capabilities and long-horizon coding improvements? For the same price? Well, not quite!
Apparently, this could be a model trained “from scratch”, given that the tokenizer (the thing that converts words into tokens for the LLM to understand) is a different one. It also uses 1.3x more tokens for the same tasks, which means that the new default model from Anthropic became effectively more expensive. (Anthropic acknowledged this by raising the usage limits on its subscription plans by an undisclosed amount, but it’ll still be a token tax on API use.)

How about performance? Well, it’s hard to judge on evals alone, but they are great. A huge jump on SWE-bench Pro, over 10% improvement, puts this model as the best out there, except Mythos. It’s also the best at real-world knowledge via GPQA Diamond (except Mythos). Are you seeing a trend here? Anthropic released a preview of a model, but for the first time it’s not their “absolute best” model, and in a weird move they compared it on evals to an unreleased model (presumably 10x the size?).

As far as our testing goes, it gave an incredibly detailed response on the Mars question we constantly test on: for both me and Nisten, Opus 4.7 produced an incredibly detailed 3D rendered result, much better than our previous tries. I’ll be keeping an eye on this model and will keep you guys up to date on what else we find. The vibe check so far: it’s more expensive, long context is unclear, but it’s a great vibe model.

Alibaba is back - Qwen 3.6 is Apache 2.0 35B with 3B active parameters (X, HF, Blog)

The coolest thing about this release is not the evals (though they claim to outperform the much denser Qwen 3.5-27B on multiple benchmarks) but that Alibaba is putting out models with open weights and an Apache 2.0 license! We previously reported on rumors from inside Alibaba that some internal restructuring caused many of us to doubt whether they would commit to OSS, and they answered!
Another highlight for me in this model is that Alibaba has an OpenClaw bench (which they are promising to release soon), and that this model does as well as the dense model and beats Gemma 4 by a wide margin on that task. This model is also natively multimodal, with 262K context extensible to 1M via YaRN.

MiniMax M2.7 Open Weights - 230B MoE with only 10B active (X, HF)

Our friends at MiniMax finally dropped M2.7 in open weights (technically not fully Apache: commercial use requires their authorization, but it’s free for research, personal use, and coding agents). It’s a 230B parameter MoE with only 10B active parameters, and it’s matching GPT-5.3-Codex on SWE-Pro at 56.22%. On Terminal-Bench 2 it hits 57%. But the real story here, the part that made me stop scrolling, is the self-evolution piece.

They let an internal version of M2.7 run its own RL optimization loop for 100+ rounds with zero human intervention. The model analyzed its own failure trajectories, modified its own scaffold code, ran evals, and decided whether to keep or revert changes. It got a 30% performance improvement on internal metrics. The model improved itself.

Shoutout to the MiniMax team, longtime friends of the pod who keep delivering (they promised to release the weights for this one and they did).

This week’s buzz - news from Weights & Biases from CoreWeave

This week was a very big one in our corner of the AI world. Our parent company CoreWeave announced not one, not two, but 3 major deals, including one with Anthropic, a renewed commitment from Meta and a renewal from Jane Street. CoreWeave now serves 9 out of the top 10 AI model providers in the world. 🎉 Oh, and a small plug: if you want to get tokens powered by the same infrastructure, our CoreWeave Inference service is open and very cheap, and we’ve recently added both Gemma 4 and GLM 5.1 to it.
This week on the pod, I chatted with Trevor, founding engineer at Marimo Notebooks (also part of CW), about their recent highlight of pairing an AI agent with Marimo notebooks. They went quite viral on Hacker News and I wanted to understand why. Now I do: it’s really cool. Check Trevor out on the pod starting around the 01:05:00 timestamp.

Tools & Agentic Engineering

Windsurf 2.0 - Agent Command Center + Devin in the IDE - interview with Theodor Marcu (X, Blog)

The first big post-Cognition-acquisition move for Windsurf dropped this week, and I got to chat with Theodor Marcu from Cognition about it on the show. The headline: Windsurf 2.0 brings an Agent Command Center (think Kanban-style mission control for all your agents), plus native Devin integration baked right into the IDE, and Spaces (persistent project containers that group your agent sessions, PRs, files, and context).

The framing Theodor gave me: local agents are pair programmers bounded by your attention (they stop when you close the laptop), while cloud agents are independent hires. Windsurf 2.0 tries to unify both paradigms in one interface. You can plan locally with Cascade using the Socratic method (going back and forth, challenging assumptions, building up context) and then, with one click, hand off execution to Devin, which runs in its own cloud VM, opens PRs, runs tests, and even tests its own work using computer use on its own Linux desktop. You can close your laptop and it keeps shipping.

One reality check from the community: Devin is great but not cheap. One early tester burned $25 in credits for a 15-20 minute bug fix that produced “okay” results. Something to watch on the Max plan economics. Devin access is rolling out gradually to Windsurf users over 48 hours from launch. Shoutout to Swyx, who helped design Spaces three months ago whilst at Cognition!

Warp terminal now supports any CLI agent with vertical tabs and mobile control (X, Blog)

This one is for the terminal enjoyers.
Warp, which in my opinion is the best terminal experience out there, just shipped first-class support for any CLI agent: Claude Code, Codex, OpenCode, Gemini CLI, all running side by side in vertical tabs with live status indicators.

The killer feature here, and this solves what I think is the single worst part about using Claude Code, is notifications when agents need you. If you’ve used Claude Code you know the pain of constantly checking if it’s waiting for a permission or input. Warp notifies you. You step in, approve, go back to what you were doing. They also added integrated code review inside the terminal, a rich multimodal input editor, and (this is wild) remote control from mobile. Monitor and interact with your running CLI agents from your phone.

Voice & Audio

Gradient Bang - the first massively multiplayer LLM-driven game, interview with Kwindla (X, Play it)

Kwindla, co-CEO of Daily and maintainer of Pipecat, came on the show to talk about Gradient Bang, a game he described as “a side project that escaped containment.” He told me about this back in December, and folks, it’s finally live and it’s genuinely the first fully LLM-driven multiplayer game I’ve seen. It’s inspired by an old BBS door game called Trade Wars that Kwindla used to play as a baby programmer on a 386 DX, but reimagined so your ship’s computer is an LLM you can just… talk to.

You pilot a spaceship through a procedurally generated universe, but instead of clicking buttons, you talk to the thing and say things like “take me to the nearest mega port and trade along the way”, and your ship AI delegates to sub-agents to actually do the work. You can run corporations, buy more ships, task them to do 5 exploration loops while you do trade runs. It’s Factorio-meets-Ender’s-Game-meets-voice-AI.
I’ve been playing it, and my ship is currently roaming the universe as we speak (with 0 credits, as someone robbed me!).

What makes this technically fascinating is that it’s basically a production-grade stress test for multi-agent orchestration. Sub-agents with shared context, episodic memory across sessions, dynamic LLM-generated UIs (the React front-end is literally rendered from JSON thrown over by a UI agent LLM), and long-running contexts that go for weeks. The architecture is now shipping as a Pipecat library called Pipecat Sub-Agents. Tech stack: Deepgram for STT, GPT-4.1 for the voice agent, GPT-5.2 medium-thinking for task agents, and a dedicated benchmark called GB Benchmarks, because tasking these agents is genuinely hard.

Fun detail: Kwindla’s rule for this project was to not write or read any code since November. His colleague John lasted about one day before he broke and started reading React. The Z/L Continuum claims another victim. Go play it, it’s free and fun: gradientbang.com.

Google launches Gemini 3.1 Flash TTS (X, Blog, Try it)

Google dropped a new TTS model this week and folks, it’s not quite the speed-of-light real-time conversational TTS we’re all dreaming of (it’s about 3 seconds time-to-first-token, so batch-mode only), but the controllability is wild. We’re talking inline audio tags ([laughs], [sighs], [gasp]), natural language scene direction, two distinct speakers per generation, 70+ languages with auto-detection, and you can switch emotion and pacing mid-sentence with natural language. I tested it live on the show with a “shocked/whispering” tag combo asking “Who came to ThursdAI?” and it absolutely nailed it. It hit 1,211 Elo on the Artificial Analysis TTS Arena, 4 points behind Inworld TTS 1.5 Max and ahead of ElevenLabs v3.
Pricing is about $0.03 per 60 seconds of audio, roughly 4.7x cheaper than ElevenLabs v3.

Kwindla’s take: this is part of the broader shift from traditional TTS architectures toward fully steerable, prompt-able speech models, which is great for expressive use cases but means you need to test heavily for hallucinations and word skipping.

AI Art, Video & 3D

Tencent HYWorld 2.0 and NVIDIA Lyra 2.0 - actual 3D worlds from one image

This week we got not one but two major single-image-to-3D-world open releases, and they’re genuinely different from the video world models (Genie 3, Cosmos) we’ve been covering.

Tencent HYWorld 2.0 takes a single image (or text, or video) and produces actual 3D Gaussian Splats, meshes, and point clouds that you can import directly into Unity, Unreal, Blender, or NVIDIA Isaac Sim. Not video. Real editable 3D assets. Their framing: “watch a video, then it’s gone” vs “build a world, keep it forever.” The WorldMirror 2.0 reconstruction model is a 1.2B parameter feed-forward model that predicts dense point clouds, depth, normals, camera params, and 3DGS in a single pass. All open source.

NVIDIA Lyra 2.0 (Apache 2.0) takes a single image and progressively generates an explorable 3D world as you navigate through it. The breakthrough here is solving two classic failure modes of generative world models: spatial forgetting (hallucinating new structures when you revisit an area) and temporal drifting (errors accumulating until the scene turns to mush). They solve both with per-frame 3D geometry retrieval and an elegant self-augmented training trick, where they train the model on its own degraded outputs so it learns to correct drift. DMD distillation gets you 4-step inference. Apache 2.0, Hugging Face, code and weights.

Both of these together feel like the end of video-only world models as the state of the art.
We’re going straight to editable, persistent, importable 3D worlds.

Baidu open-sources ERNIE-Image - 8B parameter text-to-image (HF)

Not to be outdone, Baidu dropped ERNIE-Image, an 8B parameter DiT that’s now #1 on GenEval among open-weight models (0.8856), beating Qwen-Image, FLUX.2-klein, and Z-Image. Built from scratch in 3 months. Runs on a 24GB consumer GPU, and someone already quantized it to NF4 so it runs under 10GB VRAM on an RTX 3060. The text rendering story is the headline: clean multilingual text rendering for posters, infographics, comics, the stuff every other model has historically been terrible at. There’s also a Turbo variant that does it in 8 inference steps.

The craziest AI video I’ve ever seen - “Pi Hard” (X)

You have to watch this AI video. It’s one of the craziest ones I’ve ever seen, and I report on AI for a living. I showed this to my fiancée Darya, and she only asked me “is this AI?” in the middle of it, after saying “yeah, let’s watch this” 😂

Closing thoughts

What a week. Opus 4.7 dropped live on the show, Codex is now controlling your Mac in the background like black magic, Qwen gave us another Apache 2.0 banger, MiniMax shipped a self-evolving model, and we got two “image-to-actual-3D-world” open source releases in the same week. Oh, and a shoe company is now an AI compute company.

The Z/L Continuum keeps shifting, and I feel like every week I drift a little more toward L, especially after seeing Kwindla ship Gradient Bang without reading code since November. And every week the agents get better at babysitting themselves (Claude Code Routines, Windsurf’s Agent Command Center, Warp’s unified CLI agent UX, Codex’s computer use in the background), which means more FOMAT for all of us.

Thanks for reading, share this with a friend, and if you enjoyed this, drop a comment with what you want more or less of.
Feedback keeps me going.

- Alex

TL;DR - ThursdAI, April 16, 2026

* Hosts and Guests
  * Alex Volkov - AI Evangelist & Community with Weights & Biases / CoreWeave (@altryne)
  * Co-hosts: @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed
  * Guests:
    * Kwindla Kramer (@kwindla) - Co-CEO of Daily, Pipecat maintainer
    * Theodor Marcu (@theodormarcu) - Product at Cognition
    * Trevor Manz (@trevmanz) - Founding engineer at Marimo
* Show Notes
  * Recap essay on the Z/L Continuum from AI Engineer Europe (Blog): should AI engineers still read code? Ryan Lopopolo says no, Mario Zechner says yes for critical paths, everyone in between has FOMAT.
  * Mario Zechner talk is finally live on AI Engineer youtube (Watch)
  * Super Gemma 4 26B Uncensored v2 by @songjunkr - trending on HF, 0/100 refusals, fixed tool calls (HF GGUF, HF MLX 4bit)
  * Gemma 4 21B REAP - 20% expert-pruned Gemma 4 26B MoE by 0xSero using Cerebras REAP (HF)
  * Parcae (Together AI + UCSD) - stable looped transformer architecture with scaling laws, matches 2x-sized transformer quality (Paper/blog)
  * Claude Desktop app - rewritten from scratch, completely new app
  * Gemma 4 on W&B Inference - reply on the announcement post with code Gem Drop for $20 in inference credits, also supports LoRA inference via link
* Big CO LLMs + APIs
  * Anthropic launches Claude Opus 4.7 - 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, 3x vision resolution, new xhigh effort level, /ultrareview in Claude Code, same pricing as 4.6 but new tokenizer uses ~1.0-1.35x more tokens (X, Blog)
  * OpenAI Codex major update: macOS background computer use, 90+ plugins, gpt-image-1.5 image generation, in-app browser, memory, self-scheduling automations, multi-terminal SSH (X, Blog)
  * CoreWeave signs deals with Anthropic (multibillion), Meta ($21B expansion, $35B+ total), and Jane Street ($6B cloud + $1B equity), now serves 9 of the top 10 AI providers
* Open Source LLMs
  * Qwen 3.6-35B-A3B - Apache 2.0, 35B MoE with 3B active, 73.4% SWE-bench Verified, natively multimodal, 262K context extensible to 1M (X, HF, Blog)
  * MiniMax M2.7 open weights - 230B MoE with 10B active, 56.22% SWE-Pro matching GPT-5.3-Codex, self-evolved via 100+ rounds of autonomous RL (X, HF)
* Tools & Agentic Engineering
  * Windsurf 2.0 with Agent Command Center and Devin integration - interview with Theodor Marcu (X, Blog)
  * Warp now supports any CLI agent with vertical tabs, notifications, code review, mobile remote control (X, Blog)
  * Claude Code Routines - cron, GitHub event, and API-triggered autonomous agents running on Anthropic’s cloud (Docs)
* This Week’s Buzz - Weights & Biases / CoreWeave
  * Marimo Pair - drop Claude Code / Codex / OpenCode agents directly inside reactive Python notebooks - interview with Trevor Manz (Blog, GitHub)
  * Gemma 4 now live on W&B Inference on CoreWeave infrastructure, with LoRA inference support
* Vision & Video
  * Craziest AI video of the year: Pi Hard / Neil deGrasse Tyson (X)
* Voice & Audio
  * Gradient Bang - first massively multiplayer fully LLM-driven game, Pipecat sub-agents - interview with Kwindla (Play, GitHub)
  * Google Gemini 3.1 Flash TTS - 1,211 Elo on TTS Arena, inline audio tags, 70+ languages, ~$0.03/60s (Blog)
* AI Art, Diffusion & 3D
  * Baidu ERNIE-Image - 8B DiT, #1 GenEval among open models, precise multilingual text rendering (HF)
  * Tencent HYWorld 2.0 - single image to editable 3D Gaussian Splats/meshes, Unity/Unreal/Isaac Sim ready (GitHub)
  * NVIDIA Lyra 2.0 - single image to explorable persistent 3D worlds, Apache 2.0 (Project, HF)
* Other news
  * Unitree humanoid breaks 100m dash world record at ~10m/s (X)
  * Allbirds shoe company loses 99.5%, rebrands as “NewBird AI”, raises $50M to buy GPUs, stock up 600-800% (X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
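
A quick back-of-envelope on the Opus 4.7 token tax from this episode, as a hedged sketch rather than anything official. Assuming the per-token price really is unchanged from 4.6 and the new tokenizer emits somewhere between 1.0x and 1.35x the tokens for the same work (the range quoted in the notes above), the cost of a task scales by that same multiplier. The function name `token_tax`, the 2M-token workload and the $10-per-million price are made-up placeholders for illustration, not Anthropic's numbers.

```python
def token_tax(baseline_tokens: int, price_per_million: float, token_multiplier: float) -> dict:
    """Estimate the cost of one task before and after a tokenizer change.

    Assumes the price per token is unchanged and only the token count grows
    by `token_multiplier` (for example 1.3 for "1.3x more tokens").
    """
    if baseline_tokens < 0 or price_per_million < 0 or token_multiplier <= 0:
        raise ValueError("inputs must be non-negative, and the multiplier must be positive")

    old_cost = baseline_tokens / 1_000_000 * price_per_million
    new_cost = old_cost * token_multiplier
    return {
        "old_cost": old_cost,
        "new_cost": new_cost,
        # With a flat per-token price, the relative increase is just the multiplier minus one.
        "increase_pct": (token_multiplier - 1) * 100,
    }


if __name__ == "__main__":
    # Hypothetical workload and price, chosen only to make the arithmetic easy to follow.
    for multiplier in (1.0, 1.3, 1.35):
        estimate = token_tax(baseline_tokens=2_000_000, price_per_million=10.0, token_multiplier=multiplier)
        print(f"{multiplier:.2f}x tokens: ${estimate['old_cost']:.2f} -> ${estimate['new_cost']:.2f} "
              f"(+{estimate['increase_pct']:.0f}%)")
```

The point of the sketch is that "same price" and "same bill" are different claims: with a flat per-token price, whatever multiplier the tokenizer adds lands directly on API spend, while subscription users only see it through how quickly they hit their usage limits.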

April 9, 2026 (1 hr 59 min)

📅 ThursdAI LIVE from London - Claude Mythos, Codex Resets, Muse Spark & More | w/ Swyx and friends from OpenAI, Deepmind, LMArena and OpenClaw

Hey y’all, Alex here, writing this from sunny London, at the first ever AI Engineer conference in Europe!

What a show we have for you today! First, let me catch you up on what’s important: Anthropic this week announced a whopping $30B ARR, up from $19B in Feb, while also telling us about Claude Mythos Preview, their next-gen HUGE model that they won’t release to the public (yet?) and that finds crazy vulnerabilities in existing code bases. Apparently OpenAI will follow up with a similar non-public model soon.

The Meta Superintelligence Lab led by Alex Wang finally showed what they were working on: Muse Spark, the smaller of their upcoming models, built on completely new infrastructure (MSL announcement, Simon Willison’s deep dive on the 16 hidden tools).

In other news: Z.AI finally released GLM 5.1 in OSS (HF weights), Seedance 2.0 is finally available in the US on Replicate, OpenAI is testing out GPT-image-2 on LM Arena under codenames, HappyHorse from Alibaba takes the video crown, and Milla Jovovich (5th Element, Resident Evil) released an agentic memory plugin called MemPalace (Ben Sigman’s transparent correction thread is worth reading).

We had 5 guests today on the show. We kicked off with @swyx, the founder of AI Engineer and host of Latent Space. We then chatted with @petergostev from Arena (formerly LMArena) about Mythos and the compute wars, then Vincent Koc, the second most prolific contributor to OpenClaw, then our friends VB from OpenAI and Omar from DeepMind, both previously at HuggingFace. This is a busy, busy show, and given the time zones, I unfortunately don’t have time for a full weekly writeup, but as always, I will share the raw notes and post the video (lightly edited).

ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

AI Engineer - London

ThursdAI has come a long way since the first AI Engineer conference, but many who read this don’t know that it was my big break.
Swyx invited me to cover the first AIE in San Francisco in 2023, and I remember I was in an Uber to the airport, the driver asked me what I do, and for the first time I said “I host a podcast”. I (and ThursdAI) owe a lot to Swyx and the AIE team, and it’s been incredible to see how big they’ve grown and how many great speakers this event hosts! The term AI Engineer has drifted in those 3 years, but so has the term Software Engineer. Swyx predicted this nearly 3 years ago. What I don’t think he predicted is that all engineers are now AI Engineers, and this includes domains like Agents (OpenClaw), Context and Harness Engineering, Evals and Observability, Voice & Vision, all of which are tracks in this conference. I was really surprised to see how many of the talks and speakers here are native to London (after all, DeepMind is from here, and OAI, Anthropic and Meta have offices here). The latest boom in agents, OpenClaw and Pi, was Europe-based as well, and they joined the AI Engineer stage. Oh, and there’s also a Giant Inflatable Claw at the entrance, yup, for pictures and vibes, and to show off how quickly OpenClaw took over the mind-share.

Anthropic announces $30B ARR and Mythos, their next model, will not be released to the public

The thing that everyone will tell you is that Anthropic is on a roll, and this is obviously connected to their upcoming IPO this year. We’ve been covering many issues on their part, but this week we saw them posting about a HUGE increase in ARR, from $19B in February to $30B in April, passing OpenAI at $25B. That last comparison is shaky, though, because the two report ARR differently: per The Information, OpenAI apparently only counts their cloud revenue from Microsoft. The growth is undeniable though, and so is the unprecedented release announcement, Claude Mythos Preview, which was rumored for a bit and has now been announced properly.
With Project Glasswing, Anthropic has announced that this model is SO good at cyber security and finding bugs in code that they cannot share it with the public, and through Glasswing they will share it with companies like Microsoft, Linux, CrowdStrike and a bunch of others, to harden their security. This is it folks, this is the first time a model was “announced” but deemed too risky to release. Now, is it truly “too risky”? Previously, folks thought that DALL-E was too risky, or that voice cloning tech was too risky, and now it’s everywhere. The capabilities catch up even in open source. But the facts are, Anthropic says they’ve found a 27-year-old bug in OpenBSD (famously very secure), and that this model is very, very good at connecting the dots between several seemingly innocuous bugs to string them together into one coherent exploit. This is, indeed, scary. Just last week, one of the top security researchers in the world, Nicholas Carlini, now at Anthropic, gave a talk at Black Hat showing off these results, and saying that these models, since December and definitely recently, have passed him as a security engineer. If you haven’t seen this talk, watch it, then try to estimate if Anthropic did the right thing by only releasing this model to enterprises first.

But on the show, Peter Gostev from Arena gave me a take on this that I haven’t been able to shake. Peter pulled up his Compute Wars chart live on the show, and the picture is that OpenAI is way ahead of Anthropic on compute, with Anthropic only recently getting a noticeable bump (which lines up suspiciously well with Mythos being trainable in the first place). His read: “it sounds cooler to say it’s too risky to release than ‘we can’t serve it.’” The official partner pricing is $25 / $125 per million tokens, 5x Opus 4.6, but if you don’t have the GPUs to serve it broadly, the price doesn’t matter.
In the year of the IPO, the company that cannot serve a model says the model is too dangerous to serve. Make of that what you will.

This also reframes the whole rate-limit drama with OpenClaw. Anthropic didn’t ban OpenClaw, and I want to be very clear about this because the discourse went sideways. What they did is make it significantly more expensive for Max-tier subscribers to use Opus through OpenClaw, which pushed a lot of people over to GPT-5.4 via Codex. Same root cause: they’re out of compute. The freshly announced Anthropic + Google TPU deal (Google already owns ~10% of Anthropic) is them trying to fix this, though as Peter noted, it’s pretty wild that Google is propping up a direct competitor to their own DeepMind team. Same pattern as their original $2B Anthropic investment ending up propping AWS Bedrock against Google Cloud. Big Google contains multitudes.

Meta Superintelligence Labs ships Muse Spark - Llama is dead, long live Muse

Llama is dead, long live Muse. This week Meta finally showed what the very expensive Meta Superintelligence Labs under Alexandr Wang has been cooking, and the answer is Muse Spark, the smaller of their new model family, built on an AI stack rebuilt from scratch in just 9 months. Nine months is wild for that kind of overhaul, and the headline number people are quoting is that they reach Llama 4 Maverick capability with over 10x less compute.

Spark is intentionally small and latency-optimized. It’s not trying to be the biggest, it’s trying to be the first step on Meta’s new scaling ladder. But the benchmarks in certain areas are nuts: 86.4 on CharXiv Reasoning (beats Opus, Gemini, GPT-5.4), and the one that really got me, 42.8 on HealthBench Hard vs Opus at 14.8 and Gemini at 20.6. They trained it with data curated by over 1,000 physicians and it shows. They also shipped a Contemplating mode, which is parallel multi-agent reasoning, hitting 58.4% on Humanity’s Last Exam with tools.
Coding is the acknowledged weak point (77.4 on SWE-Bench Verified vs Opus 80.8), but for v1 from a brand new stack, this is extremely respectable.

Meta is Back!

The real story isn’t any single benchmark though, it’s distribution. Spark is rolling out across meta.ai, WhatsApp, Instagram, Threads, Messenger, and Ray-Ban Meta glasses: billions of users. Meta went from open Llama to a closed consumer model and they’re clearly playing a different game now (though Wang says future Muse versions might be open-sourced).

The deep-dive that’s really worth your time is Simon Willison’s post, where he poked at the meta.ai chat UI and got the model to spit out descriptions of 16 hidden tools behind the scenes: a full Code Interpreter with persistent Python 3.9, a visual grounding tool that does pixel-precise object detection (bounding boxes, point coordinates, counting; it located 8 objects including individual whiskers and claws on a generated raccoon), sub-agent spawning, file editing, and semantic search across Instagram/Threads/Facebook posts. It’s basically an entire agentic harness baked into the chat UI. Jack Wu from MSL confirmed the tools are part of a new harness built specifically for Spark’s launch. Meta stock went up 7% on this. They are very much back in the frontier game.

Guest highlights

We had an unprecedentedly packed show with 5 guests (also, this is the shortest show we’ve ever done).

Swyx kicked us off with vibes from the AI Engineer floor: harness engineering as the dominant theme (gains are coming from the harness, not the weights), the rise of skills (English-as-programming-language) absorbing more of that harness work, and his thesis that supply-chain attacks like the recent LiteLLM and Axios incidents mean you should basically vendor everything, pip fork instead of pip install.
We also chatted about how MCP has gone from “the most exciting protocol” to “settled and stable, therefore less interesting,” which is a great problem to have.

Peter Gostev from Arena (you saw a lot of him in the Mythos section above) also dropped a bonus on us: Arena just released 3 years of historical leaderboard data and actual prompt datasets on Hugging Face. He used to literally scrape the arena website by hand into Google Sheets to make those over-time leaderboards we all loved, and now it’s all public. Also: he confirmed that Seedance 2.0 jumped ~80 Elo points above the next video model on Arena, which is unprecedented, as video models normally cluster within 10 points of each other.

Vincent Koc, the #2 OpenClaw maintainer after Peter Steinberger, joined us fresh off the OpenClaw track stage. The OpenClaw codebase is now ~1.5 million lines of code including unreleased iOS and Android native apps. GitHub literally caps the issue/PR counter at “5K+” and they hit the ceiling. We talked about OpenClaw 2026.4.5, which ships /dreaming GA (Light/Deep/REM phases that defrag agent memory and write a human-readable Dream Diary to DREAMS.md), built-in video and music generation across 4 backends, GPT-5.4 as the new default, prompt-cache reuse improvements, and Control UI + docs in 12 new languages. Vincent’s framing of dreaming was beautiful: “how do you explain agent memory to a mom? You call it dreaming.” He also gave my favorite line of the show on the GPT-5.4 personality problem: incredible at coding, but soulless. (For what it’s worth, I came home after watching Project Hail Mary, cloned the Rocky voice, dropped it into my OpenClaw, and it was magical. That’s the kind of thing you can only do when the harness and the model are decoupled.)

VB from OpenAI told us Codex just hit 3 million weekly active users, up from 2 million last month.
We talked plugins (the Stripe / Supabase / shadcn ones that ship as packages), sub-agents (yes, one is named Jason), and Guardian Approvals, an experimental mode that classifies each tool call by risk and only escalates the dangerous ones to you, so you don’t have to YOLO-mode everything. The story that stuck with me though is his 9 AM Codex automation: every morning it reads his Slack mentions, cross-references Gmail and Calendar, and creates 5-minute pre-brief calendar events for upcoming meetings. None of that is “coding.” That’s the super-app future hiding inside a “developer tool.” I’m stealing this workflow.

Omar Sanseviero from Google DeepMind came on to celebrate Gemma 4 crossing 10M+ downloads with 1,000+ Gemma-4-based fine-tunes already on HF (and the Gemma family total is now over 500M downloads). Gemma 4 is also the foundation for the next generation of Gemini Nano on Pixel/Samsung devices. llama.cpp vision capability fixes are landing. Gemma 4 is also live on W&B Inference if you want to play. Wolfram (whose entire household runs on Pixel + Google AI Studio, including his 70-year-old mother on voice unlock) was in heaven.

This Week’s Buzz

A short but spicy week from Weights & Biases:

* W&B Automations are LIVE. You can now wire event triggers from your training runs (completion, eval thresholds, drift) into notifications, GitHub Actions, deployments, infra shutdowns, closing the loop from experiment to production. Pairs really well with the iOS app we recently shipped, so you can get a ping on your phone the moment something interesting happens on a run.
* GLM 5.1 is live on W&B Inference (alongside Gemma 4 from last week). The team is moving fast to host the best open models the moment they drop.
* Wolfram published a deep dive on “more reasoning is not always better” on the W&B blog: the research behind his finding that giving models more thinking tokens can actually make them dumber on certain tasks.
It’s the in-depth version of what we discussed on the show last week, with all the data. Go read it on wandb.com.

Also: shout out to everyone who came up to me at AI Engineer and said hi. The Wolf Bench mentions in particular made my day. If you’re listening to this and you’re at AIE, come find us, we’ll be around tomorrow too.

That’s it for this week. The newsletter is short because the show was long and London is calling. As always, thanks for reading and listening 🫡

TL;DR April 9 - show notes and links:

* Hosts and Guests
  * Alex Volkov – AI Evangelist & Weights & Biases (@altryne)
  * Co-Hosts – @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
  * Guests: @swyx (AI Engineer / Latent Space), @petergostev (Arena, formerly LMArena), @reach_vb (OpenAI / Codex), @vincent_koc (OpenClaw #2 maintainer), @osanseviero (Google DeepMind / Gemma)
* Big CO LLMs + APIs
  * Anthropic announces Project Glasswing and Claude Mythos Preview, a cyber-defense frontier model too dangerous to release publicly (X, Announcement)
  * Anthropic’s Claude Mythos is so powerful they won’t release it: found zero-days in every major OS and browser, escaped its sandbox, and scored 93.9% on SWE-bench (X, X, X, X)
  * Anthropic ARR jumps from $19B (February) to $30B in April; secondary tender sale completed, employees not selling ahead of IPO
  * Anthropic + Google TPU deal: Anthropic getting massive compute commitment from Google (who already owns ~10% of Anthropic), with Peter Gostev’s Compute Wars chart showing the gap to OpenAI closing
  * Anthropic ships Managed Agents: fully hosted agent runtime + infrastructure.
Selling outcomes, not tokens
  * Meta launches Muse Spark, the first model from Meta Superintelligence Labs, with natively multimodal reasoning, multi-agent Contemplating mode, and deep health/visual capabilities (X, Blog)
  * Simon Willison deep dives into Meta’s Muse Spark model and uncovers 16 hidden tools including visual grounding and sub-agents in the meta.ai chat UI (X, Blog, Announcement)
* Open Source LLMs
  * GLM-5.1 from Z.ai is #1 open-source on SWE-Bench Pro at 58.4%, runs autonomously for 8 hours with 1,700+ agent steps (X, HF, Arxiv)
  * Gemma 4 crosses 10M+ downloads, 1,000+ Gemma-4-based fine-tunes on HF. Did really well on Arena considering size; Peter Gostev confirmed it smashed many models on the Pareto curve
  * Nisten’s pick: Hermes 27B, trained specifically to be paired with the Hermes harness, allegedly distilled from Opus API. Model + harness shipped together as a portable unit
* Tools & Agentic Engineering
  * OpenClaw 2026.4.5, the biggest release since 4.0: /dreaming goes GA (Light/Deep/REM memory consolidation with a Dream Diary in DREAMS.md), built-in video + music generation across 4 backends, GPT-5.4 as new default, prompt-cache reuse improvements, Control UI + docs in 12 new languages (Release, Vincent, Dreaming docs, FOD#147)
  * OpenClaw codebase now ~1.5M lines including unreleased iOS + Android native apps. GitHub literally caps at “5K+” PRs/issues and they hit the ceiling
  * Anthropic did NOT ban OpenClaw; they made Max-tier subscription usage of Opus via OpenClaw significantly more expensive, pushing many users to GPT-5.4 via Codex
  * Codex hits 3M weekly active users, up from 2M last month.
VB walked through plugins (Stripe, Supabase, shadcn), sub-agents, Guardian Approvals (auto-classify tool-call risk), and experimental hooks
  * Cursor: remote agents + code review agent (78% issues caught pre-merge)
  * MemPalace: Milla Jovovich and Ben Sigman’s open-source AI memory system goes viral with 26K GitHub stars in 2 days, claims top benchmark scores, then transparently walks back overstated claims (X, GitHub, X, X, GitHub)
* This Week’s Buzz (Weights & Biases)
  * W&B Automations are LIVE: event triggers from your runs into notifications, GitHub Actions, deployments. Pairs nicely with the new iOS app
  * GLM-5.1 and Gemma 4 both up on W&B Inference
  * Wolfram published an in-depth blog post on his finding that more reasoning is not always better (models can get dumber with more thinking time); full writeup on wandb.com
* Vision & Video
  * Seedance 2.0 launches in the US, on Replicate with up to 9 reference images, 3 videos, and 3 audio files for cinematic AI video generation (X, Announcement). Peter Gostev confirmed it jumped ~80 ELO points above the next video model on Arena, a massive gap where most video models cluster within 10 points
  * HappyHorse-1.0, a mysterious 15B video model from Alibaba’s Taotian Group, takes #1 on Artificial Analysis video arena beating Seedance 2.0, Kling 3.0, and Grok Video (X, X, X, X, Blog)
  * The Harry Potter “Drip Wizards” AI slop trend: Seedance-powered Hogwarts videos going hugely viral
* AI Art & Diffusion & 3D
  * OpenAI’s GPT-Image-2 leaked on LM Arena under three codenames (maskingtape / gaffertape / packingtape), showing photorealism and text rendering that may dethrone Google’s Nano Banana Pro (X, X, X)
* Show notes & key moments
  * Swyx on harness engineering: gains are coming from the harness, not the weights. The big labs are investing more and more in harness; it’s not going away.
Skills (English-as-programming-language) are increasingly absorbing harness work
  * Swyx on AI Engineer tracks: MCP is “more settled and stable, therefore less interesting.” Coding agents track is bigger this year (Cursor, Factory, super-long-running). Voice & Vision split from Generative Media; multimodality as a single track no longer makes sense
  * Swyx on supply chain attacks: LiteLLM and Axios issues mean you should “vendor everything” (pip fork instead of pip install). Tool requests becoming prompt requests
  * Peter Gostev on Mythos pricing: $25 / $125 per M tokens (~5x Opus 4.6). But the real reason it’s not public isn’t safety: Anthropic likely just doesn’t have the compute to serve it
  * Peter Gostev on Compute Wars: OpenAI is way ahead of Anthropic on compute. The new Google TPU deal is Anthropic catching up, and it’s weird that Google is propping up a competitor to DeepMind. (Same pattern as when Google’s $2B Anthropic investment effectively propped up AWS vs Google Cloud)
  * Peter Gostev on Arena data: Arena released 3 years of historical leaderboard data + actual prompts as datasets on Hugging Face. Previously he was scraping it by hand into Google Sheets; now he has Databricks access
  * VB on Codex workflows: every morning at 9 AM, Codex automation reads his Slack mentions, cross-references Gmail and Calendar, and creates a 5-minute pre-brief calendar event for upcoming meetings. None of it is “coding”; it’s all plugins + connectors
  * Vincent Koc on the GPT-5.4 personality problem: model is incredible at coding but “soulless.” Wolfram noticed it back in December and cancelled his subscription. Alex cloned the Rocky voice from Project Hail Mary and put it in his OpenClaw: “amazing”
  * Vincent Koc on Dreaming: three phases (REM, core, deep sleep) that defrag agent memory.
The dream log is for the human in the loop; it makes memory inspectable in a way a non-technical person (a mom) can understand
  * Vincent Koc on architecture: the open-source flood forced OpenClaw into a plugin architecture. “Not Lego, Ikea.” Refactored ~1M lines in 9 days at 2 AM at NVIDIA before Jensen’s keynote
  * Omar Sanseviero on Gemma 4: 500M+ total Gemma downloads across all variants. Gemma is the foundation for the next generation of Gemini Nano on Pixel/Samsung. Vision capability fixes for llama.cpp are shipping
  * Wolfram’s Pixel/Google household: kids using AI Studio + Antigravity to build games, his 70-year-old mother using voice unlock on her Pixel

ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe