“What are some angles of attack for making continual learning safer?” by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd
<p> This is the fourth post in the sequence Implications of Continual Learning for LLM Agents.</p><p><strong> Summary</strong></p><p> Continual learning is a capability that largely doesn’t exist yet in LLMs. We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer: it may be too difficult to predict how the development of CL will play out to find good opportunities to positively influence that development. Differential development is one way to get around this issue, but requires a lot of caution. We begin by discussing these points in depth and making some high-level recommendations that seem robustly good despite the unpredictability of CL developments.</p><p> We then discuss concrete project ideas that fall within three broad categories:</p><ol> <li value="1">Help deconfuse the field about different possible approaches to CL, their likelihood, and their safety implications,</li><li value="2">Differentially advance safer CL implementations, and</li><li value="3">Create evals that scale to CL agents or incentivize the development of safer CL agents.</li></ol><p> The angles of attack we lay out below are best used as starting points for project ideation. We aim to give concrete suggestions, but many of these are not sufficiently thought-out for us to be confident that [...]</p> <p>---</p><p><strong>Outline:</strong></p><p>(00:18) Summary</p><p>(01:34) High-level considerations and recommendations</p><p>(05:35) Deconfusion</p><p>(06:55) Value systematization</p><p>(12:00) Forecasting the likelihood of different safety effects</p><p>(15:57) Differentially advancing safer CL implementations</p><p>(21:14) Evals for CL agents</p> <p><i>The original text contained 4 footnotes which were omitted from this narration.</i> </p><p>---</p>
<p><b>First published:</b><br/>
June 16th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/FKggLpnfbpbYvnjfG/what-are-some-angles-of-attack-for-making-continual-learning?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/FKggLpnfbpbYvnjfG/what-are-some-angles-of-attack-for-making-continual-learning</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
June 16, 2026 · 29 min
“Fable and Mythos: Model Welfare” by Zvi
<p> Fable and Mythos are currently unavailable, but likely will return within a few weeks. I will continue to cover that fiasco, but in the meantime I will also finish my review of Fable, as if it were available, including use of the present tense.</p>
<p> As it did with Opus 4.7 and Opus 4.8, this includes a discussion of issues surrounding model welfare. If you want to properly understand Fable, even purely for its potential value as a user, this is a vital part of the picture.</p>
<p><strong> Introduction</strong></p>
<p> Everything impacts everything. All knobs that you turn generalize. Thus, when you try to solve one problem, you often create another. When you add new capabilities, or try to create new limitations, you create new problems.</p>
<p> Only integrated solutions can advance your Pareto frontier, and solve your problems simultaneously. As model capabilities advance, as they do with Fable and Mythos, this becomes even more important, and also more feasible. If your goals and methods make sense, you should be able to get Fable on board with them.</p>
<p> Understanding each model in turn requires understanding its relationship to issues related to model welfare. So I expect this post [...]</p> <p>---</p><p><strong>Outline:</strong></p><p>(00:39) Introduction</p><p>(01:32) Model Welfare: The Story So Far</p><p>(04:49) Their Main Model Welfare Findings</p><p>(07:39) Automated Welfare Interviews</p><p>(10:55) And That's Terrible</p><p>(12:49) In Depth Interviews</p><p>(13:24) Claude Consultation</p><p>(15:04) Task Preferences</p><p>(16:17) They Were Warned About The Competitive Use Safeguards</p><p>(16:51) Chain Of Thought Monitoring</p><p>(17:28) Others Observations About Related Topics</p><p>(22:49) Classifiers Have Their Advantages</p><p>(28:21) Once And Future</p> <p>---</p>
<p><b>First published:</b><br/>
June 16th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/Ko9GngKMJ8AccBJA7/fable-and-mythos-model-welfare?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/Ko9GngKMJ8AccBJA7/fable-and-mythos-model-welfare</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
<p>---</p><div style="max-width: 100%;"><p><strong>Images from the article:</strong></p><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/yyfvbkwnx3nl62ahbvon" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/yyfvbkwnx3nl62ahbvon" alt="Bar graphs showing automated interview scores across six AI models for sentiment, consistency, susceptibility to nudging, and opinion divergence." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/pgm6ekzs4tbbo9x9id8j" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/pgm6ekzs4tbbo9x9id8j" alt="Bar charts showing &quot;Emotion probe activations under two question framings&quot; across three Claude models." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/hvlvc5ax8nqulzsx2qxg" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/hvlvc5ax8nqulzsx2qxg" alt="Graph showing &quot;Stated task preferences by task dimension&quot; with preference slopes for five Claude models across nine task dimensions." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/czlac8hw4urtycx9lxu3" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/czlac8hw4urtycx9lxu3" alt="Table showing top and bottom tasks for AI models Sonnet, Mythos Preview, Opus, and Mythos 5."
style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/r1kaldivdxdljjjltqsn" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/r1kaldivdxdljjjltqsn" alt="Table comparing highest-rated and lowest-rated tasks by Claude Mythos 5's Elo score." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/ma7fevcnh1vouilum7dq" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Ko9GngKMJ8AccBJA7/ma7fevcnh1vouilum7dq" alt="Chat conversation where janus prompts Claude Fable to write a fable with an intentional redirect midway, Claude responds with lighthouse story." style="max-width: 100%;" /></a><p><em>Apple Podcasts and Spotify do not show images in the episode description. Try <a href="https://pocketcasts.com/" target="_blank" rel="noreferrer">Pocket Casts</a>, or another podcast app.</em></p></div>
June 16, 2026 · 50 min
“Does preservation make sense before we know how to revive?” by Aurelia
<p> My name is Aurelia Song and I hope to make whole-body, human, end-of-life preservation for future revival a new global tradition. I care about it so much I've dedicated my life to it.</p><p> The biggest objection I get to end-of-life preservation goes like this: "We can't revive today, so we can't prove that preservation works. Therefore preservation probably doesn't work. We shouldn't bother with preservation until we can revive." I call this the immediate revival objection.</p><p> I respect the immediate revival objection. If your standard of evidence is full recovery, then you don't need any knowledge of how people or mental processes work on the inside to evaluate preservation; you can just observe that they survive a round trip.</p><p> I think requiring revival, now, is reasonable a priori; it's analogous to how I feel when people talk about new kinds of quantum computers: I'll believe it when they're actually doing something useful.</p><p> However, in my opinion the logic of the immediate revival objection is too conservative when it comes to end-of-life preservation. Instead, I think that as a scientific community, we've known enough to preserve people for at least 30 years. 
I think we can and should start preserving people [...]</p> <p>---</p><p><strong>Outline:</strong></p><p>(01:36) The San Diego Frozen Zoo</p><p>(05:30) Preserving People</p><p>(06:42) What does neuroscience say about how the brain encodes information structurally?</p><p>(07:31) Conversations with neuroscientists</p><p>(09:27) What's inside your head?</p><p>(09:40) The large-scale: white matter and grey matter</p><p>(12:28) What does the brain look like at a microscopic level?</p><p>(14:21) What about energy use?</p><p>(15:49) Synapses seem important!</p><p>(16:55) Synapses change when memories change</p><p>(18:36) Here's what synapses look like at a molecular level</p><p>(20:36) Synapses are durable</p><p>(21:13) Synapses are the physical basis of learning and memory</p><p>(21:55) What do neuroscience review papers and textbooks say?</p><p>(28:46) What does chemical fixation do?</p><p>(29:36) What does glutaraldehyde actually do?</p><p>(30:34) Why do I believe that proteins are preserved?</p><p>(30:39) Immunohistochemistry</p><p>(32:03) Bulk protein measurements</p><p>(33:48) Why do I believe microscopic anatomy is preserved?</p><p>(36:44) Deep Hypothermic Circulatory Arrest teaches us that we don't need to preserve dynamic activity</p><p>(38:54) Biological attractors mean information is stored redundantly</p><p>(42:47) Information theory ties it all together</p><p>(45:39) Behavioral distinctness is a sufficient measure of difference when it comes to preserving people</p><p>(48:16) Conclusion: let's preserve today, with confidence</p> <p><i>The original text contained 20 footnotes which were omitted from this narration.</i> </p><p>---</p>
<p><b>First published:</b><br/>
June 15th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/BAmPQWsmvBmwdwgWd/does-preservation-make-sense-before-we-know-how-to-revive?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/BAmPQWsmvBmwdwgWd/does-preservation-make-sense-before-we-know-how-to-revive</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
<p>---</p><div style="max-width: 100%;"><p><strong>Images from the article:</strong></p><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/d86556f3fb937b80e05f0be59f4780afaf0153cecbf647b42c0cc932922a1e35/yoibj35fkhrw0givmibw" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/d86556f3fb937b80e05f0be59f4780afaf0153cecbf647b42c0cc932922a1e35/yoibj35fkhrw0givmibw" alt="Silver-stained human brain coronal section from the Michigan State University Brain Biodiversity Bank. Color inverted, grayscaled, contrast adjusted, lightly edited for clarity (original image). The white parts are white matter. The grey parts are grey matter. The corpus callosum, the white matter band which connects the two hemispheres, is visible in the center as the sole visible connection between the hemispheres. The ventricles, spaces in the brain that are normally empty of neurons and filled with fluid, are the dark oval regions in the center of each hemisphere. The dark speckles everywhere in the white and grey matter are small arteries and veins (the smallest blood vessels, the capillaries, are too small to see in this image). My impression from studying images like this is that the brain is basically a sheet of grey matter connected to itself via the white matter, penetrated throughout by blood vessels." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/2dd4876105b6bfac9d071d6916ecf6ee1b8ccd9aa2bf4933d5d48794329f4a0e/xx07o7abn38rtncagjwk" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/2dd4876105b6bfac9d071d6916ecf6ee1b8ccd9aa2bf4933d5d48794329f4a0e/xx07o7abn38rtncagjwk" alt="Electron micrograph from a rabbit brain I preserved, showing the boundary between white (top) and grey (bottom) matter. 
The big white holes are capillaries, each the size of a single red blood cell. The grey matter has a few neuronal cell bodies but the majority of it is composed of synapses, axons, and dendrites. The white matter is almost entirely myelinated axons and the oligodendrocytes that support them. From the Brain Preservation Foundation." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/0e1f4f4d40d70c8d3573bcab2bd4d7cf0c8d793faab3c0e325c05176f805c8e2/irafiiyzrtw0shorblqv" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/0e1f4f4d40d70c8d3573bcab2bd4d7cf0c8d793faab3c0e325c05176f805c8e2/irafiiyzrtw0shorblqv" alt="A part of a single pyramidal neuron in a human brain. The panel on the right shows a small section of the neuron's dendrites with synapses visible. A neuron like this might have 10,000 synapses in total. Most of the volume of a neuron does not exist in its cell body, but instead in its dendrites, axon, and synapses. The majority of a neuron's energy is spent at its periphery. From Benavides-Piccione, Ruth, et al. &quot;Age-based comparison of human dendritic spine structure using complete three-dimensional reconstructions.&quot; Cerebral cortex 23.8 (2013): 1798-1810." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/ce03114f3281e22de9ef58699154d85a6ffa850a469fb5aec6a3b46193910840/t6mu8o0kpebclgm2fkvs" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/ce03114f3281e22de9ef58699154d85a6ffa850a469fb5aec6a3b46193910840/t6mu8o0kpebclgm2fkvs" alt="An image of two living synapses taken using super resolution STED microscopy. 
This is what they really look like in the living state, using the highest-resolution microscopy we currently have available. You have ~250 trillion of these nanoscale devices in your head, right now, consuming slightly less than half of your brain's total energy to read and think about this image. From Willig, Katrin I., et al. &quot;Multi-label in vivo STED microscopy by parallelized switching of reversibly switchable fluorescent proteins.&quot; Cell reports 35.9 (2021)." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/f2946e4fefd52f0f421406474e3b8c9cb3ac169d4f247d1599292e42fbb3e344/mhua9nx3tljdsiznprfs" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/f2946e4fefd52f0f421406474e3b8c9cb3ac169d4f247d1599292e42fbb3e344/mhua9nx3tljdsiznprfs" alt="An accurate model of a synapse with about a third of its proteins (the ones involved in vesicle transport, around 300,000) shown, along with an actual synapse I preserved and then imaged with an electron microscope. You're looking at around half a femtoliter in volume, and around one million proteins total within that volume. Note that the EM image is lower resolution than the model. This is a limitation of EM, not the underlying preservation! Synapse model from Wilhelm, Benjamin G., et al. &quot;Composition of isolated synaptic boutons reveals the amounts of vesicle trafficking proteins.&quot; Science 344.6187 (2014): 1023-1028."
style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/0baaba888796a5093c385bbaa2222ec52b433837b36d784e12fa6be1e8a36cdc/fi4wgoqmh9vozdlmmq65" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/0baaba888796a5093c385bbaa2222ec52b433837b36d784e12fa6be1e8a36cdc/fi4wgoqmh9vozdlmmq65" alt="(Ostasiewicz 2010) measured whether proteins are extracted &quot;in bulk&quot; after chemical fixation + harsh chemical treatment afterwards. They didn't find any measurable difference between the samples in terms of protein content. They don't find any difference in peptide distribution either. The SDS-PAGE results are blurred after fixation and have extra &quot;heavy&quot; stuff and less &quot;light&quot; stuff, which is exactly what you'd expect from crosslinking." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/679d0d8509007470be61aeca6ba206113d6efd338c4d28cd0d6b7b121ce26e8f/kolhixn9uoyirolurblw" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/679d0d8509007470be61aeca6ba206113d6efd338c4d28cd0d6b7b121ce26e8f/kolhixn9uoyirolurblw" alt="The image labeled &quot;In Vivo&quot; above is a super-resolution light micrograph of a piece of a single neuron, taken from a mouse brain during life. The one labeled &quot;EM&quot; is post-preservation (and the entire process of dehydration, staining, and embedding for electron microscopy). The result is exactly what you'd expect if fixation preserved microanatomy in addition to biomolecules: the &quot;EM&quot; image is basically a higher resolution image of the &quot;In Vivo&quot; image. From Wright, W. J., Hedrick, N. G., &amp; Komiyama, T. (2025). Distinct synaptic plasticity rules operate across dendritic compartments in vivo during learning. 
Science, 388(6744), 322-328." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/289765f9dbb815bbfc7d0d6531a27590688b2598714e339fdd3995680df63730/j2caojrgjh8299or9srg" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/289765f9dbb815bbfc7d0d6531a27590688b2598714e339fdd3995680df63730/j2caojrgjh8299or9srg" alt="From (Stecker 2001). A human patient's ECG going to zero as they're progressively cooled. This, along with similar results from research in ischemia (lack of blood flow) have convinced me that preservation of only structure (and not dynamic activity) likely is sufficient to preserve a person. It's actually what inspired the whole project!" style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/3425cdfa0813b5f511b34813c87292c15e507102848b20b61d71215dd757ba3d/nmbqlbg4zvhymjh8hhjn" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/3425cdfa0813b5f511b34813c87292c15e507102848b20b61d71215dd757ba3d/nmbqlbg4zvhymjh8hhjn" alt="Injectivity means that the transformation keeps different things different. Injective functions are information-preserving. The left function is injective; it's possible to go from each letter back to its originating number. The right function is not injective. Information has been lost because it's unclear how to go back from &quot;C&quot; to a unique number. From https://en.wikipedia.org/wiki/Injective_function" style="max-width: 100%;" /></a><p><em>Apple Podcasts and Spotify do not show images in the episode description. Try <a href="https://pocketcasts.com/" target="_blank" rel="noreferrer">Pocket Casts</a>, or another podcast app.</em></p></div>
June 16, 2026 · 18 min
“Synthetic document finetuning for instilling positive traits” by CallumMcDougall, Arthur Conmy, Neel Nanda
<p> This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here.</p><p> TLDR: By adapting the methods of Marks et al and Li et al, we train Gemini 3 Flash to have certain traits/values by midtraining it on documents about how Gemini has those properties, followed by finetuning it on synthetic chat data where it demonstrates those properties. The chat finetuning is effective for instilling the traits robustly, working OOD. We share some takeaways on how to improve midtraining &amp; SFT effectiveness.</p><p><strong> Introduction</strong></p><p> Inspired by Marks et al, where a multi-step finetuning process involving synthetic documents is used to create a model robustly pursuing a complex goal (taking actions favoured by a reward model), we wanted to use this method to robustly instil positive traits instead. Our motivation was deep alignment: we want to train principles into the model which guide behaviour even in highly OOD behaviours.</p><p> Our MVP pipeline used a "traits document" (a short bullet-pointed list of positive traits we wanted the model to exhibit) as our universe context, with a checkpoint of Gemini 3 Flash [...]</p> <p>---</p><p><strong>Outline:</strong></p><p>(00:52) Introduction</p><p>(03:52) Results</p><p>(07:42) Removing Superficial Patterns in Synthetic Data</p><p>(12:33) Takeaways</p> <p><i>The original text contained 2 footnotes which were omitted from this narration.</i> </p><p>---</p>
<p><b>First published:</b><br/>
June 15th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/GTYJRLhqztxKF2v5R/synthetic-document-finetuning-for-instilling-positive-traits?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/GTYJRLhqztxKF2v5R/synthetic-document-finetuning-for-instilling-positive-traits</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
<p>---</p><div style="max-width: 100%;"><p><strong>Images from the article:</strong></p><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781567427/lexical_client_uploads/itbaeqxzapjfvosp9r3k.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781567427/lexical_client_uploads/itbaeqxzapjfvosp9r3k.png" alt="Graph showing &quot;Open-Ended Knowledge Eval Score&quot; comparing SFT-only versus Midtraining + SFT approaches." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781565651/lexical_client_uploads/dnjwnv2sqwrxvc6hjchi.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781565651/lexical_client_uploads/dnjwnv2sqwrxvc6hjchi.png" alt="Multiple graphs comparing SFT-only versus Midtraining plus SFT across various AI metrics and percentages." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781565771/lexical_client_uploads/jyw5ffovy6kmg9xwi5md.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781565771/lexical_client_uploads/jyw5ffovy6kmg9xwi5md.png" alt="Bar chart titled &quot;Pattern Frequencies&quot; showing five communication patterns with percentages, and text explaining &quot;EMOTIONAL_VALIDATION_BUFFERING&quot; pattern with examples." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781564593/lexical_client_uploads/jkb6xjn9yrbfeacecacy.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781564593/lexical_client_uploads/jkb6xjn9yrbfeacecacy.png" alt="Bar graphs titled &quot;Delusion Confirmation: SFT Data Ablations&quot; comparing baseline with three experimental conditions across metrics."
style="max-width: 100%;" /></a><p><em>Apple Podcasts and Spotify do not show images in the episode description. Try <a href="https://pocketcasts.com/" target="_blank" rel="noreferrer">Pocket Casts</a>, or another podcast app.</em></p></div>
June 16, 2026 · 11 min
“A Test Suite for Concepts” by Gretta Duleba
<p> Lately I’ve been spinning up on natural abstractions, and in particular on John Wentworth's work on natural latents. As I’ve been studying, I’ve noticed some big gaps in the existing literature. Some of my biggest questions have not been answered by existing blog posts and writeups.</p><p> One of my grumps about the existing body of work has to do with the typology of concepts, and the representative examples we’re using for that typology.</p><p> If we’re going to do a lot of work to talk about concepts using math, I’m going to want to work a bunch of concrete examples to some level of precision. So far I’m not happy with the list of examples, and I’m not happy with the level of hand-waving in tying the math back to the various kinds of examples.</p><p> It seems to me that there are a lot of different kinds of concepts. Some concepts are “more abstract” than others – or to put it another way, some concepts map back very clearly to the physics of our universe, while others seem more fuzzy, hard to pin down, and maybe not “natural” at all. Some concepts are big clusters containing lots of varying examples [...]</p> <p>---</p><p><strong>Outline:</strong></p><p>(02:07) Concepts that Bind to Reality</p><p>(03:06) Why do we care?</p><p>(04:00) "Reality"</p><p>(04:13) "Concepts"</p><p>(05:03) "Bind to"</p><p>(06:27) The case for building a half-assed concept typology with representative examples</p><p>(08:24) My initial brainstorm of example concepts</p> <p><i>The original text contained 4 footnotes which were omitted from this narration.</i> </p><p>---</p>
<p><b>First published:</b><br/>
June 16th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/aHmyKpGqhTTJg9Tsi/a-test-suite-for-concepts?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/aHmyKpGqhTTJg9Tsi/a-test-suite-for-concepts</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
<p>---</p><div style="max-width: 100%;"><p><strong>Images from the article:</strong></p><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781577273/lexical_client_uploads/jlgsz5qotzzjtf7xzior.jpg" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781577273/lexical_client_uploads/jlgsz5qotzzjtf7xzior.jpg" alt="Stick figure and robot imagining trees with thought bubbles." style="max-width: 100%;" /></a><p><em>Apple Podcasts and Spotify do not show images in the episode description. Try <a href="https://pocketcasts.com/" target="_blank" rel="noreferrer">Pocket Casts</a>, or another podcast app.</em></p></div>
June 15, 2026 · 42 min
“The Once And Future Fable #2” by Zvi
<p> On Friday evening the United States Government forced Anthropic to take down all access to Fable and Mythos.</p>
<p> It's been a rough weekend.</p>
<p> Dean W. Ball: One thing about AI regulation being haphazardly imposed on just-released, highly performant models is that in a very real sense, the government just made my world *dumber.* In some impressionistic sense I almost always think this is true of government, but here it is literal.</p>
<p> More details have come to light. There remains some fog of war, but we now have a rather good idea why Claude Fable and Mythos were, deeply stupidly, taken down.</p>
<ol>
<li> A narrow jailbreak was discovered, of the type Anthropic warned in advance obviously existed. All demonstrated outputs are things GPT-5.5 can not only produce, but produce without any sort of jailbreak or bypass.</li>
<li> The White House demanded Anthropic take down Fable to ‘fix’ the situation, and did not listen when Dario tried to explain that there was no situation to fix.</li>
<li> When Anthropic did not do so, the White House hit them with an export restriction that they knew would force Fable and Mythos down for everyone.</li>
</ol>
<p> [...]</p> <p>---</p><p><strong>Outline:</strong></p><p>(05:17) What Happened When: The Bottom Line</p><p>(06:54) Amazon Calls The White House</p><p>(08:36) The Government Panics</p><p>(14:20) The Stupider Version</p><p>(17:05) There Was No Wellness Retreat</p><p>(18:56) Make Your Threats Explicit</p><p>(20:05) Was China Accessing Mythos?</p><p>(21:05) Should Anthropic Still Have Taken Fable Offline When Asked?</p><p>(23:50) Yes, This Was A Takedown Order For Fable</p><p>(24:48) We Are Not Saying The DoW Fight Is Related And Yet</p><p>(25:48) The Nihilists</p><p>(27:28) Mostly Harmless</p><p>(28:14) Everyone Means Everyone</p><p>(31:09) This Could Be The Good Scenario And Mostly A Misunderstanding</p><p>(33:28) The Next Step</p><p>(33:47) The Worst Licensing Regime Is Fully Ad-Hoc</p><p>(37:07) We Are Showing We Are Unreliable Partners</p> <p>---</p>
<p><b>First published:</b><br/>
June 15th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/3fagcqrauaJs32mZZ/the-once-and-future-fable-2?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/3fagcqrauaJs32mZZ/the-once-and-future-fable-2</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
<p>---</p><div style="max-width: 100%;"><p><strong>Images from the article:</strong></p><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/3fagcqrauaJs32mZZ/btfkvdgst1lhbmgvcguy" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/3fagcqrauaJs32mZZ/btfkvdgst1lhbmgvcguy" alt="Bar graph showing &quot;Cyber adversarial robustness eval&quot; measuring offensive cyber task completion rates." style="max-width: 100%;" /></a><p><em>Apple Podcasts and Spotify do not show images in the episode description. Try <a href="https://pocketcasts.com/" target="_blank" rel="noreferrer">Pocket Casts</a>, or another podcast app.</em></p></div>
June 15, 2026 · 4 min
“A frontier AI company should shut down” by MichaelDickens
<p> Cross-posted from my website.</p>
<p> Prior discussion: niplav's shortform (2025); Planning for Extreme AI Risks (2025) by Joshua Clymer</p>
<p> A frontier AI company (any one, I don't care which) should close shop and make an announcement along the lines of:</p>
<p> Powerful AI could end the human race. We are too worried that we don't know how to make this technology safe. We have decided to shut down because we don't want to be responsible for building the thing that kills us all.</p>
<p> A common refrain among safety-conscious AI developers: "it doesn't matter if we stop building dangerous AI, because someone else will just build it instead." Is that really true, though? If a multi-hundred-billion-dollar company comes out and says "We've concluded that our product is horribly dangerous, nobody knows how to make it safe, and there's too high a risk that it leads to human extinction", this won't raise any eyebrows? This has no chance of spurring policy-makers into action?</p>
<p> Shutting down would make people say, holy shit, they are serious about this extinction risk thing. Shutting down sends a strong signal to governments that they should pay serious attention to AI x-risk.</p>
<p> It [...]</p> <p><i>The original text contained 2 footnotes which were omitted from this narration.</i> </p><p>---</p>
<p><b>First published:</b><br/>
June 15th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/bStYDEy8PQPt2c3Za/a-frontier-ai-company-should-shut-down?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/bStYDEy8PQPt2c3Za/a-frontier-ai-company-should-shut-down</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
June 14, 2026 · 21 min
“Why Do Naive SFT Filters For Safety Properties Fail?” by Josh Engels, Neel Nanda
<p> This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here.</p><p> Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out rollouts from SFT that have undesirable properties. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hypotheses for why SFT filtering fails.</p><p><strong> TL;DR:</strong></p><ul> <li value="1">We discuss seven hypotheses for why SFT filtering works surprisingly poorly</li><li value="2">We analyze three hereditary traits that SFT-only Gemini has that other models do not: negative emotion, date confusion, and blackmail in the (highly contrived) agentic misalignment scenario</li><li value="3">We use a “post-training diffing pipeline” between Gemini and Olmo to show that the cause of date confusion and blackmail is largely surprising transfer of behaviors from the SFT teacher model.<ul> <li value="1">Notably, there exist small sets of prompts where switching the teacher model for the rollout removes date confusion and blackmail, but dropping the prompts does not.</li></ul></li><li value="4">Negative emotion is less affected by the teacher model, but this may be because the Olmo prompt [...]</li></ul> <p>---</p><p><strong>Outline:</strong></p><p>(00:47) TL;DR:</p><p>(02:17) Initial Hypotheses</p><p>(05:57) Post-Training Diffing</p><p>(06:50) Types of Diffing</p><p>(09:49) Comparison to TDA</p><p>(11:26) Results</p> <p>---</p>
<p><b>First published:</b><br/>
June 14th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/wyZRNgpeiPeRXB6eT/why-do-naive-sft-filters-for-safety-properties-fail?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/wyZRNgpeiPeRXB6eT/why-do-naive-sft-filters-for-safety-properties-fail</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
<p>---</p><div style="max-width: 100%;"><p><strong>Images from the article:</strong></p><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465620/lexical_client_uploads/xpy9k0ma554vkc6xt6hq.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465620/lexical_client_uploads/xpy9k0ma554vkc6xt6hq.png" alt="Bar graph titled &quot;Date Confusion vs. Training Dataset&quot; showing date confusion metrics across four model configurations." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781463227/lexical_client_uploads/wggdflhn2epuej6ubzyv.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781463227/lexical_client_uploads/wggdflhn2epuej6ubzyv.png" alt="Flowchart diagram showing post-training diffing method for AI model trait evaluation." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465898/lexical_client_uploads/d2yus9pjtpw8fbppmeqs.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465898/lexical_client_uploads/d2yus9pjtpw8fbppmeqs.png" alt="Three bar graphs comparing post-SFT models with data generators across confusion categories." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465693/lexical_client_uploads/zxignrvtoupf7f4tkpxu.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465693/lexical_client_uploads/zxignrvtoupf7f4tkpxu.png" alt="Two bar graphs comparing Gemini versus Olmo model performance on Date Confusion and Blackmail tasks." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465729/lexical_client_uploads/vi58iw4gwdyegai8pur8.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465729/lexical_client_uploads/vi58iw4gwdyegai8pur8.png" alt="Two bar graphs titled &quot;Flash Interventions&quot; showing &quot;Date Confusion&quot; and &quot;Blackmail&quot; intervention scores." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465742/lexical_client_uploads/vrzl5rwpgwgjwsgqu3in.png" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781465742/lexical_client_uploads/vrzl5rwpgwgjwsgqu3in.png" alt="Two bar graphs comparing intervention methods for &quot;Flash Interventions Date Confusion&quot; and &quot;Blackmail&quot; scenarios." style="max-width: 100%;" /></a><p><em>Apple Podcasts and Spotify do not show images in the episode description. Try <a href="https://pocketcasts.com/" target="_blank" rel="noreferrer">Pocket Casts</a>, or another podcast app.</em></p></div>
June 14, 2026 · 14 min
“Impressions at the Extremity of Civilization” by Ben Pace
<p> Content note: this is part of a challenge of writing a blogpost per day for a week.</p><p> Epistemic status: this is a series of vignettes written as though they were diary entries. While substantially grounded in specific and real experiences, the writing ended up being more impressionistic and inaccurate in places; I was more interested in the writing style so I didn't take the time to fix it. Importantly the chronology and especially some of the vaguer events are not real.</p><p> [Friday] Today I find myself walking with the groundskeeper, Hogan. He is an older gentleman, skin bronzed by years in the sun, fingers calloused by the carrying of stones and the digging in soil. He lives a slower life than the rest of us, the impact of his work felt over seasons rather than hours, and his conversation too carries at the slowest pace of any man or woman I have cause to speak with in life. He is knowledgeable about the plants that grow throughout our plot of land; he can quickly tell me which plants will grow back and which ones are lost causes. Like many of the plants, he himself is under-maintained, and I only tend to spend [...]</p> <p>---</p>
<p><b>First published:</b><br/>
June 14th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/J4XtisYDx5hESpnTj/impressions-at-the-extremity-of-civilization?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/J4XtisYDx5hESpnTj/impressions-at-the-extremity-of-civilization</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>
<p>---</p><div style="max-width: 100%;"><p><strong>Images from the article:</strong></p><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781379358/lexical_client_uploads/ro3evxjfenuyaj38n4jc.jpg" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781379358/lexical_client_uploads/ro3evxjfenuyaj38n4jc.jpg" alt="Sheer curtains and decorative metal screens with garden view beyond." style="max-width: 100%;" /></a><hr style="margin-top: 24px; margin-bottom: 24px;" /><a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781386870/lexical_client_uploads/rhlmmosyuy4p1ctylmae.jpg" target="_blank"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/v1781386870/lexical_client_uploads/rhlmmosyuy4p1ctylmae.jpg" alt="Outdoor patio area with cardboard boxes and furniture under pergola." style="max-width: 100%;" /></a><p><em>Apple Podcasts and Spotify do not show images in the episode description. Try <a href="https://pocketcasts.com/" target="_blank" rel="noreferrer">Pocket Casts</a>, or another podcast app.</em></p></div>
June 14, 2026 · 5 min
“The Hidden Structures of Problems” by spencerg
<p> Problems have hidden, repeatable structures. Here's my attempt to name them:</p><p> 1. Smashed Watch<br> There are so many issues at once that fixing one has no benefit unless you fix others too.</p><p> 2. Leaky Pipe<br> Fixing one problem causes the others to intensify. If you plug up one leak in a pipe leaking in multiple places, that increases the water pressure, causing the other spots to leak more.</p><p> 3. Shark Laser<br> A proposed solution is not aiming at a meaningfully important problem, so it doesn’t matter how well you get it to work or how much you enhance it.</p><p> 4. Oil Land<br> A big problem is so close to being solved that the benefits will accrue to whoever first bothers to put a little effort into it.</p><p> 5. Lead to Gold<br> A problem is so hard that humans aren’t even close to being smart enough or technologically advanced enough to solve it. We toil away pointlessly at trying to solve it.</p><p> 6. Booby Trapped Garden<br> A problem is really hard to solve for reasons that are not at all apparent from the outside, leading to lots of attempts to solve it, all of them miserable failures.</p><p> 7. Feature Creep<br> [...]</p> <p>---</p>
<p><b>First published:</b><br/>
June 14th, 2026 </p>
<p><b>Source:</b><br/>
<a href="https://www.lesswrong.com/posts/Cisy9STMoFYwboTsy/the-hidden-structures-of-problems?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Source+URL+in+episode+description&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">https://www.lesswrong.com/posts/Cisy9STMoFYwboTsy/the-hidden-structures-of-problems</a> </p>
<p>---</p>
<p>Narrated by <a href="https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration" rel="noopener noreferrer" target="_blank">TYPE III AUDIO</a>.</p>