Episode | Date |
---|---|
AF - Meta Questions about Metaphilosophy by Wei Dai | Sep 01, 2023 |
AF - Red-teaming language models via activation engineering by Nina Rimsky | Aug 26, 2023 |
AF - Causality and a Cost Semantics for Neural Networks by scottviteri | Aug 21, 2023 |
AF - "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them by Nora Ammann | Aug 20, 2023 |
AF - A Proof of Löb's Theorem using Computability Theory by Jessica Taylor | Aug 16, 2023 |
AF - Reducing sycophancy and improving honesty via activation steering by NinaR | Jul 28, 2023 |
AF - How LLMs are and are not myopic by janus | Jul 25, 2023 |
AF - Open problems in activation engineering by Alex Turner | Jul 24, 2023 |
AF - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope | Jul 23, 2023 |
AF - Priorities for the UK Foundation Models Taskforce by Andrea Miotti | Jul 21, 2023 |
AF - Alignment Grantmaking is Funding-Limited Right Now by johnswentworth | Jul 19, 2023 |
AF - Measuring and Improving the Faithfulness of Model-Generated Reasoning by Ansh Radhakrishnan | Jul 18, 2023 |
AF - Using (Uninterpretable) LLMs to Generate Interpretable AI Code by Joar Skalse | Jul 02, 2023 |
AF - Agency from a causal perspective by Tom Everitt | Jun 30, 2023 |
AF - Catastrophic Risks from AI #4: Organizational Risks by Dan H | Jun 26, 2023 |
AF - LLMs Sometimes Generate Purely Negatively-Reinforced Text by Fabien Roger | Jun 16, 2023 |
AF - Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS) by Scott Emmons | May 31, 2023 |
AF - PaLM-2 and GPT-4 in "Extrapolating GPT-N performance" by Lukas Finnveden | May 30, 2023 |
AF - Wikipedia as an introduction to the alignment problem by SoerenMind | May 29, 2023 |
AF - [Linkpost] Interpretability Dreams by DanielFilan | May 24, 2023 |
AF - Conjecture internal survey \| AGI timelines and estimations of probability of human extinction from unaligned AGI by Maris Sala | May 22, 2023 |
AF - Some background for reasoning about dual-use alignment research by Charlie Steiner | May 18, 2023 |
AF - $500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions by johnswentworth | May 16, 2023 |
AF - Difficulties in making powerful aligned AI by DanielFilan | May 14, 2023 |
AF - AI doom from an LLM-plateau-ist perspective by Steve Byrnes | Apr 27, 2023 |
AF - How Many Bits Of Optimization Can One Bit Of Observation Unlock? by johnswentworth | Apr 26, 2023 |
AF - Endo-, Dia-, Para-, and Ecto-systemic novelty by Tsvi Benson-Tilsen | Apr 23, 2023 |
AF - Thinking about maximization and corrigibility by James Payor | Apr 21, 2023 |
AF - Concave Utility Question by Scott Garrabrant | Apr 15, 2023 |
AF - Shapley Value Attribution in Chain of Thought by leogao | Apr 14, 2023 |
AF - Announcing Epoch's dashboard of key trends and figures in Machine Learning by Jaime Sevilla | Apr 13, 2023 |
AF - Lessons from Convergent Evolution for AI Alignment by Jan Kulveit | Mar 27, 2023 |
AF - What happens with logical induction when... by Donald Hobson | Mar 26, 2023 |
AF - EAI Alignment Speaker Series #1: Challenges for Safe and Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes by Curtis Huebner | Mar 23, 2023 |
AF - The space of systems and the space of maps by Jan Kulveit | Mar 22, 2023 |
AF - [ASoT] Some thoughts on human abstractions by leogao | Mar 16, 2023 |
AF - What is a definition, how can it be extrapolated? by Stuart Armstrong | Mar 14, 2023 |
AF - Implied "utilities" of simulators are broad, dense, and shallow by porby | Mar 01, 2023 |
AF - Scarce Channels and Abstraction Coupling by johnswentworth | Feb 28, 2023 |
AF - Agents vs. Predictors: Concrete differentiating factors by Evan Hubinger | Feb 24, 2023 |
AF - AI that shouldn't work, yet kind of does by Donald Hobson | Feb 23, 2023 |
AF - The Open Agency Model by Eric Drexler | Feb 22, 2023 |
AF - EIS VII: A Challenge for Mechanists by Stephen Casper | Feb 18, 2023 |
AF - EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety by Stephen Casper | Feb 17, 2023 |
AF - Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) by Lawrence Chan | Feb 16, 2023 |
AF - EIS IV: A Spotlight on Feature Attribution/Saliency by Stephen Casper | Feb 15, 2023 |
AF - The Cave Allegory Revisited: Understanding GPT's Worldview by Jan Kulveit | Feb 14, 2023 |
AF - Inner Misalignment in "Simulator" LLMs by Adam Scherlis | Jan 31, 2023 |
AF - Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") by David Scott Krueger | Jan 30, 2023 |
AF - Quick thoughts on "scalable oversight" / "super-human feedback" research by David Scott Krueger | Jan 25, 2023 |
AF - Thoughts on hardware / compute requirements for AGI by Steve Byrnes | Jan 24, 2023 |
AF - Gemini modeling by Tsvi Benson-Tilsen | Jan 22, 2023 |
AF - Shard theory alignment requires magic. by Charlie Steiner | Jan 20, 2023 |
AF - Thoughts on refusing harmful requests to large language models by William Saunders | Jan 19, 2023 |
AF - Löbian emotional processing of emergent cooperation: an example by Andrew Critch | Jan 17, 2023 |
AF - Underspecification of Oracle AI by Rubi J. Hudson | Jan 15, 2023 |
AF - World-Model Interpretability Is All We Need by Thane Ruthenis | Jan 14, 2023 |
AF - AGISF adaptation for in-person groups by Sam Marks | Jan 13, 2023 |