Redwood Research Blog

By Redwood Research


Category: Technology

Subscribers: 2
Reviews: 0
Episodes: 115

Description

Narrations of Redwood Research blog posts. Redwood Research is a research nonprofit based in Berkeley. We investigate risks posed by the development of powerful artificial intelligence and techniques for mitigating those risks.

Episode Date
“Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models” by Anders Cairns Woodruff
Jun 10, 2026
“Efficient tradeoffs and the safety-usefulness tradeoff model” by Buck Shlegeris
Jun 08, 2026
“Retrying vs Resampling in AI Control” by James Lucassen, Adam Kaufman
May 29, 2026
“Advice for making robust-to-training model organisms” by Alek Westover, Sebastian Prasanna, Vivek Hebbar, Julian Stastny
May 28, 2026
“Full automation of AI R&D probably yields a large speed up even without a software-only singularity” by Ryan Greenblatt
May 27, 2026
“Incriminating misaligned AI models via distillation” by Alek Westover, Sebastian Prasanna, Alex Mallen, Alexa Pan, Julian Stastny
May 18, 2026
“Risk reports need to address deployment-time spread of misalignment” by Alex Mallen
May 15, 2026
“How useful is the information you get from working inside an AI company?” by Buck Shlegeris, Anders Cairns Woodruff
May 11, 2026
“A review of “Investigating the consequences of accidentally grading CoT during RL”” by Buck Shlegeris
May 07, 2026
“Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen
May 01, 2026
“Research Sabotage in ML Codebases” by Eric Gan
Apr 29, 2026
“Recursive forecasting” by Arun Jose, Alex Mallen
Apr 28, 2026
“AI companies should publish security assessments” by Ryan Greenblatt
Apr 27, 2026
“Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation” by Anders Cairns Woodruff, Alex Mallen
Apr 27, 2026
“A taxonomy of barriers to trading with early misaligned AIs” by Alexa Pan
Apr 21, 2026
“Introducing LinuxArena” by Tyler
Apr 20, 2026
“Current AIs seem pretty misaligned to me” by Ryan Greenblatt
Apr 15, 2026
“Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes” by Alex Mallen, Ryan Greenblatt
Apr 14, 2026
“Logit ROCs: Monitor TPR is linear in FPR in logit space” by Kerrick Staley, Aryan Bhatt, Julian Stastny
Apr 12, 2026
“If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines” by Ryan Greenblatt
Apr 11, 2026
“My picture of the present in AI” by Ryan Greenblatt
Apr 07, 2026
“AIs can now often do massive easy-to-verify SWE tasks” by Ryan Greenblatt
Apr 06, 2026
“Blocking live failures with synchronous monitors” by James Lucassen, Adam Kaufman
Mar 30, 2026
“Reward-seekers will probably behave according to causal decision theory” by Alex Mallen
Mar 28, 2026
“AI’s capability improvements haven’t come from it getting less affordable” by Anders Cairns Woodruff
Mar 27, 2026
“Are AIs more likely to pursue on-episode or beyond-episode reward?” by Anders Cairns Woodruff, Alex Mallen
Mar 12, 2026
“The case for satiating cheaply-satisfied AI preferences” by Alex Mallen
Mar 10, 2026
“Frontier AI companies probably can’t leave the US” by Anders Cairns Woodruff
Feb 26, 2026
“Announcing ControlConf 2026” by Buck Shlegeris
Feb 26, 2026
“Will reward-seekers respond to distant incentives?” by Alex Mallen
Feb 16, 2026
“How do we (more) safely defer to AIs?” by Ryan Greenblatt
Feb 12, 2026
“Distinguish between inference scaling and “larger tasks use more compute”” by Ryan Greenblatt
Feb 11, 2026
“Fitness-Seekers: Generalizing the Reward-Seeking Threat Model” by Alex Mallen
Jan 29, 2026
The inaugural Redwood Research podcast
Jan 06, 2026
“Recent LLMs can do 2-hop and 3-hop latent (no CoT) reasoning on natural facts” by Ryan Greenblatt
Jan 01, 2026
“Measuring no CoT math time horizon (single forward pass)” by Ryan Greenblatt
Dec 26, 2025
“Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance” by Ryan Greenblatt
Dec 22, 2025
“BashArena and Control Setting Design” by Adam Kaufman, jlucassen
Dec 18, 2025
“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck Shlegeris
Dec 04, 2025
“Will AI systems drift into misalignment?” by Josh Clymer
Nov 15, 2025
“What’s up with Anthropic predicting AGI by early 2027?” by Ryan Greenblatt
Nov 03, 2025
“Sonnet 4.5’s eval gaming seriously undermines alignment evals” by Alexa Pan, Ryan Greenblatt
Oct 30, 2025
“Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?” by Alek Westover
Oct 23, 2025
“Is 90% of code at Anthropic being written by AIs?” by Ryan Greenblatt
Oct 22, 2025
“Reducing risk from scheming by studying trained-in scheming behavior” by Ryan Greenblatt
Oct 16, 2025
“Iterated Development and Study of Schemers (IDSS)” by Ryan Greenblatt
Oct 10, 2025
“The Thinking Machines Tinker API is good news for AI control and security” by Buck Shlegeris
Oct 09, 2025
“Plans A, B, C, and D for misalignment risk” by Ryan Greenblatt
Oct 08, 2025
“Notes on fatalities from AI takeover” by Ryan Greenblatt
Sep 23, 2025
“Focus transparency on risk reports, not safety cases” by Ryan Greenblatt
Sep 22, 2025
“Prospects for studying actual schemers” by Ryan Greenblatt, Julian Stastny
Sep 19, 2025
“What training data should developers filter to reduce risk from misaligned AI?” by Alek Westover
Sep 17, 2025
“AIs will greatly change engineering in AI companies well before AGI” by Ryan Greenblatt
Sep 09, 2025
“Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro” by Ryan Greenblatt
Sep 03, 2025
“Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)” by Ryan Greenblatt
Aug 27, 2025
“Notes on cooperating with unaligned AI” by Lukas Finnveden
Aug 24, 2025
“Being honest with AIs” by Lukas Finnveden
Aug 21, 2025
“My AGI timeline updates from GPT-5 (and 2025 so far)” by Ryan Greenblatt
Aug 20, 2025
“Four places where you can put LLM monitoring” by Fabien Roger, Buck Shlegeris
Aug 09, 2025
“Should we update against seeing relatively fast AI progress in 2025 and 2026?” by Ryan Greenblatt
Jul 28, 2025
“Why it’s hard to make settings for high-stakes control research” by Buck Shlegeris
Jul 18, 2025
“Recent Redwood Research project proposals” by Ryan Greenblatt, Buck Shlegeris, Julian Stastny, Josh Clymer, Alex Mallen, Vivek Hebbar
Jul 14, 2025
“What’s worse, spies or schemers?” by Buck Shlegeris, Julian Stastny
Jul 09, 2025
“Ryan on the 80,000 Hours podcast” by Buck Shlegeris
Jul 08, 2025
“How much novel security-critical infrastructure do you need during the singularity?” by Buck Shlegeris
Jul 05, 2025
“Two proposed projects on abstract analogies for scheming” by Julian Stastny
Jul 04, 2025
“There are two fundamentally different constraints on schemers” by Buck Shlegeris
Jul 02, 2025
“Jankily controlling superintelligence” by Ryan Greenblatt
Jun 27, 2025
“What does 10x-ing effective compute get you?” by Ryan Greenblatt
Jun 24, 2025
“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck Shlegeris
Jun 23, 2025
“Making deals with early schemers” by Julian Stastny, Olli Järviniemi, Buck Shlegeris
Jun 20, 2025
“Prefix cache untrusted monitors: a method to apply after you catch your AI” by Ryan Greenblatt
Jun 20, 2025
“AI safety techniques leveraging distillation” by Ryan Greenblatt
Jun 19, 2025
“When does training a model change its goals?” by Vivek Hebbar, Ryan Greenblatt
Jun 12, 2025
“The case for countermeasures to memetic spread of misaligned values” by Alex Mallen
May 28, 2025
“AIs at the current capability level may be important for future safety work” by Ryan Greenblatt
May 12, 2025
“Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking” by Julian Stastny, Buck Shlegeris
May 08, 2025
“Training-time schemers vs behavioral schemers” by Alex Mallen
May 06, 2025
“What’s going on with AI progress and trends? (As of 5/2025)” by Ryan Greenblatt
May 03, 2025
“How can we solve diffuse threats like research sabotage with AI control?” by Vivek Hebbar
Apr 30, 2025
“7+ tractable directions in AI control” by Ryan Greenblatt
Apr 29, 2025
“Clarifying AI R&D threat models” by Josh Clymer
Apr 25, 2025
“How training-gamers might function (and win)” by Vivek Hebbar
Apr 24, 2025
“Handling schemers if shutdown is not an option” by Buck Shlegeris
Apr 18, 2025
“Ctrl-Z: Controlling AI Agents via Resampling” by Buck Shlegeris
Apr 16, 2025
“To be legible, evidence of misalignment probably has to be behavioral” by Ryan Greenblatt
Apr 15, 2025
“Why do misalignment risks increase as AIs get more capable?” by Ryan Greenblatt
Apr 11, 2025
“An overview of areas of control work” by Ryan Greenblatt
Apr 09, 2025
“An overview of control measures” by Ryan Greenblatt
Apr 06, 2025
“Buck on the 80,000 Hours podcast” by Buck Shlegeris
Apr 05, 2025
“Notes on countermeasures for exploration hacking (aka sandbagging)” by Ryan Greenblatt
Apr 04, 2025
“Notes on handling non-concentrated failures with AI control: high level methods and different regimes” by Ryan Greenblatt
Apr 03, 2025
“Prioritizing threats for AI control” by Ryan Greenblatt
Mar 19, 2025
“How might we safely pass the buck to AI?” by Josh Clymer
Feb 19, 2025
“Takeaways from sketching a control safety case” by Josh Clymer
Jan 30, 2025
“Planning for Extreme AI Risks” by Josh Clymer
Jan 29, 2025
“Ten people on the inside” by Buck Shlegeris
Jan 28, 2025
“When does capability elicitation bound risk?” by Josh Clymer
Jan 22, 2025
“How will we update about scheming?” by Ryan Greenblatt
Jan 19, 2025
“Thoughts on the conservative assumptions in AI control” by Buck Shlegeris
Jan 17, 2025
“Extending control evaluations to non-scheming threats” by Josh Clymer
Jan 13, 2025
“Measuring whether AIs can statelessly strategize to subvert security measures” by Buck Shlegeris, Alex Mallen
Dec 20, 2024
“Alignment Faking in Large Language Models” by Ryan Greenblatt, Buck Shlegeris
Dec 18, 2024
“Why imperfect adversarial robustness doesn’t doom AI control” by Buck Shlegeris
Nov 18, 2024
“Win/continue/lose scenarios and execute/replace/audit protocols” by Buck Shlegeris
Nov 15, 2024
“Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming” by Buck Shlegeris
Oct 10, 2024
“A basic systems architecture for AI agents that do autonomous research” by Buck Shlegeris
Sep 26, 2024
“How to prevent collusion when using untrusted models to monitor each other” by Buck Shlegeris
Sep 25, 2024
“Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” by Buck Shlegeris
Aug 26, 2024
“Fields that I reference when thinking about AI takeover prevention” by Buck Shlegeris
Aug 13, 2024
“Getting 50% (SoTA) on ARC-AGI with GPT-4o” by Ryan Greenblatt
Jun 17, 2024