Daily Paper Cast

By Jingwen Liang, Gengyu Wang

Category: Science

Subscribers: 0
Reviews: 0
Episodes: 1610

Description

We update every weekday to discuss the highest-voted papers from Hugging Face Daily Papers (https://huggingface.co/papers). Both the podcast scripts and the audio are generated by AI. Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, LLM ML, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Episode	Date
Urban Socio-Semantic Segmentation with Vision-Language Reasoning Read the full episode description	Jan 17, 2026
STEP3-VL-10B Technical Report Read the full episode description	Jan 17, 2026
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs Read the full episode description	Jan 17, 2026
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning Read the full episode description	Jan 17, 2026
Controlled Self-Evolution for Algorithmic Code Optimization Read the full episode description	Jan 16, 2026
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation Read the full episode description	Jan 16, 2026
MAXS: Meta-Adaptive Exploration with LLM Agents Read the full episode description	Jan 16, 2026
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning Read the full episode description	Jan 16, 2026
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning Read the full episode description	Jan 16, 2026
SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL Read the full episode description	Jan 16, 2026
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG Read the full episode description	Jan 16, 2026
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding Read the full episode description	Jan 16, 2026
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences Read the full episode description	Jan 15, 2026
Solar Open Technical Report Read the full episode description	Jan 15, 2026
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions Read the full episode description	Jan 15, 2026
User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale Read the full episode description	Jan 15, 2026
ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands Read the full episode description	Jan 15, 2026
ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking Read the full episode description	Jan 15, 2026
MemoBrain: Executive Memory as an Agentic Brain for Reasoning Read the full episode description	Jan 15, 2026
Motion Attribution for Video Generation Read the full episode description	Jan 15, 2026
3AM: Segment Anything with Geometric Consistency in Videos Read the full episode description	Jan 15, 2026
BabyVision: Visual Reasoning Beyond Language Read the full episode description	Jan 14, 2026
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning Read the full episode description	Jan 14, 2026
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head Read the full episode description	Jan 14, 2026
X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests Read the full episode description	Jan 14, 2026
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts Read the full episode description	Jan 14, 2026
Lost in the Noise: How Reasoning Models Fail with Contextual Distractors Read the full episode description	Jan 14, 2026
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent Read the full episode description	Jan 14, 2026
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization Read the full episode description	Jan 13, 2026
MMFormalizer: Multimodal Autoformalization in the Wild Read the full episode description	Jan 13, 2026
CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature Read the full episode description	Jan 13, 2026
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning Read the full episode description	Jan 13, 2026
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards Read the full episode description	Jan 13, 2026
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis Read the full episode description	Jan 13, 2026
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking Read the full episode description	Jan 13, 2026
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Read the full episode description	Jan 10, 2026
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers Read the full episode description	Jan 10, 2026
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes Read the full episode description	Jan 10, 2026
Token-Level LLM Collaboration via FusionRoute Read the full episode description	Jan 10, 2026
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting Read the full episode description	Jan 09, 2026
Evolving Programmatic Skill Networks Read the full episode description	Jan 09, 2026
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning Read the full episode description	Jan 09, 2026
Benchmark^2: Systematic Evaluation of LLM Benchmarks Read the full episode description	Jan 09, 2026
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields Read the full episode description	Jan 08, 2026
LTX-2: Efficient Joint Audio-Visual Foundation Model Read the full episode description	Jan 08, 2026
MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization Read the full episode description	Jan 08, 2026
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence Read the full episode description	Jan 08, 2026
NitroGen: An Open Foundation Model for Generalist Gaming Agents Read the full episode description	Jan 08, 2026
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits Read the full episode description	Jan 07, 2026
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation Read the full episode description	Jan 07, 2026
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer Read the full episode description	Jan 07, 2026
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation Read the full episode description	Jan 07, 2026
GARDO: Reinforcing Diffusion Models without Reward Hacking Read the full episode description	Jan 07, 2026
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams Read the full episode description	Jan 07, 2026
VINO: A Unified Visual Generator with Interleaved OmniModal Context Read the full episode description	Jan 07, 2026
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization Read the full episode description	Jan 06, 2026
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos Read the full episode description	Jan 06, 2026
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation Read the full episode description	Jan 06, 2026
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation Read the full episode description	Jan 06, 2026
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning Read the full episode description	Jan 06, 2026
Deep Delta Learning Read the full episode description	Jan 06, 2026
AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction Read the full episode description	Jan 06, 2026
Nested Learning: The Illusion of Deep Learning Architectures Read the full episode description	Jan 06, 2026
Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling Read the full episode description	Jan 03, 2026
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space Read the full episode description	Jan 03, 2026
mHC: Manifold-Constrained Hyper-Connections Read the full episode description	Jan 02, 2026
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models Read the full episode description	Jan 02, 2026
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem Read the full episode description	Jan 02, 2026
GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction Read the full episode description	Jan 02, 2026
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss Read the full episode description	Dec 31, 2025
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation Read the full episode description	Dec 31, 2025
Yume-1.5: A Text-Controlled Interactive World Generation Model Read the full episode description	Dec 31, 2025
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents Read the full episode description	Dec 31, 2025
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation Read the full episode description	Dec 31, 2025
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion Read the full episode description	Dec 31, 2025
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone Read the full episode description	Dec 31, 2025
SpotEdit: Selective Region Editing in Diffusion Transformers Read the full episode description	Dec 31, 2025
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models Read the full episode description	Dec 31, 2025
InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion Read the full episode description	Dec 30, 2025
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding Read the full episode description	Dec 30, 2025
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents Read the full episode description	Dec 30, 2025
Latent Implicit Visual Reasoning Read the full episode description	Dec 27, 2025
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning Read the full episode description	Dec 27, 2025
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times Read the full episode description	Dec 26, 2025
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models Read the full episode description	Dec 26, 2025
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation Read the full episode description	Dec 26, 2025
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation Read the full episode description	Dec 26, 2025
SemanticGen: Video Generation in Semantic Space Read the full episode description	Dec 25, 2025
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies Read the full episode description	Dec 25, 2025
LongVideoAgent: Multi-Agent Reasoning with Long Videos Read the full episode description	Dec 25, 2025
SpatialTree: How Spatial Abilities Branch Out in MLLMs Read the full episode description	Dec 25, 2025
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI Read the full episode description	Dec 24, 2025
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding Read the full episode description	Dec 24, 2025
Region-Constraint In-Context Generation for Instructional Video Editing Read the full episode description	Dec 24, 2025
QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation Read the full episode description	Dec 24, 2025
Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation Read the full episode description	Dec 24, 2025
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction Read the full episode description	Dec 24, 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows Read the full episode description	Dec 23, 2025
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence Read the full episode description	Dec 23, 2025
When Reasoning Meets Its Laws Read the full episode description	Dec 23, 2025
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience Read the full episode description	Dec 23, 2025
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation Read the full episode description	Dec 23, 2025
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing Read the full episode description	Dec 23, 2025
Are We on the Right Way to Assessing LLM-as-a-Judge? Read the full episode description	Dec 23, 2025
Kling-Omni Technical Report Read the full episode description	Dec 20, 2025
Adaptation of Agentic AI Read the full episode description	Dec 20, 2025
LLaDA2.0: Scaling Up Diffusion Language Models to 100B Read the full episode description	Dec 20, 2025
Next-Embedding Prediction Makes Strong Vision Learners Read the full episode description	Dec 20, 2025
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors Read the full episode description	Dec 20, 2025
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model Read the full episode description	Dec 20, 2025
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation Read the full episode description	Dec 20, 2025
Generative Refocusing: Flexible Defocus Control from a Single Image Read the full episode description	Dec 20, 2025
DeContext as Defense: Safe Image Editing in Diffusion Transformers Read the full episode description	Dec 20, 2025
Step-GUI Technical Report Read the full episode description	Dec 19, 2025
DEER: Draft with Diffusion, Verify with Autoregressive Models Read the full episode description	Dec 19, 2025
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing Read the full episode description	Dec 19, 2025
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices Read the full episode description	Dec 19, 2025
Puzzle Curriculum GRPO for Vision-Centric Reasoning Read the full episode description	Dec 19, 2025
MMGR: Multi-Modal Generative Reasoning Read the full episode description	Dec 18, 2025
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans? Read the full episode description	Dec 18, 2025
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling Read the full episode description	Dec 18, 2025
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling Read the full episode description	Dec 18, 2025
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics Read the full episode description	Dec 18, 2025
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value Read the full episode description	Dec 18, 2025
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding Read the full episode description	Dec 17, 2025
Towards Scalable Pre-training of Visual Tokenizers for Generation Read the full episode description	Dec 17, 2025
Memory in the Age of AI Agents Read the full episode description	Dec 17, 2025
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management Read the full episode description	Dec 17, 2025
LongVie 2: Multimodal Controllable Ultra-Long Video World Model Read the full episode description	Dec 17, 2025
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Read the full episode description	Dec 17, 2025
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents Read the full episode description	Dec 17, 2025
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics Read the full episode description	Dec 17, 2025
KlingAvatar 2.0 Technical Report Read the full episode description	Dec 17, 2025
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment Read the full episode description	Dec 17, 2025
EgoX: Egocentric Video Generation from a Single Exocentric Video Read the full episode description	Dec 16, 2025
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry Read the full episode description	Dec 16, 2025
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder Read the full episode description	Dec 16, 2025
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties Read the full episode description	Dec 16, 2025
T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground Read the full episode description	Dec 13, 2025
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving Read the full episode description	Dec 13, 2025
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation Read the full episode description	Dec 13, 2025
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification Read the full episode description	Dec 13, 2025
Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning Read the full episode description	Dec 13, 2025
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation Read the full episode description	Dec 12, 2025
BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain Read the full episode description	Dec 12, 2025
OmniPSD: Layered PSD Generation with Diffusion Transformer Read the full episode description	Dec 12, 2025
Composing Concepts from Images and Videos via Concept-prompt Binding Read the full episode description	Dec 12, 2025
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance Read the full episode description	Dec 11, 2025
Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform Read the full episode description	Dec 11, 2025
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality Read the full episode description	Dec 11, 2025
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory Read the full episode description	Dec 11, 2025
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning Read the full episode description	Dec 10, 2025
Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs Read the full episode description	Dec 10, 2025
Unified Video Editing with Temporal Reasoner Read the full episode description	Dec 10, 2025
Voxify3D: Pixel Art Meets Volumetric Rendering Read the full episode description	Dec 10, 2025
Scaling Zero-Shot Reference-to-Video Generation Read the full episode description	Dec 10, 2025
DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems Read the full episode description	Dec 10, 2025
TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows Read the full episode description	Dec 09, 2025
EditThinker: Unlocking Iterative Reasoning for Any Image Editor Read the full episode description	Dec 09, 2025
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks Read the full episode description	Dec 09, 2025
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture Read the full episode description	Dec 09, 2025
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle Read the full episode description	Dec 06, 2025
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length Read the full episode description	Dec 06, 2025
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction Read the full episode description	Dec 06, 2025
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning Read the full episode description	Dec 06, 2025
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation Read the full episode description	Dec 06, 2025
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion Read the full episode description	Dec 06, 2025
PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing Read the full episode description	Dec 06, 2025
Qwen3-VL Technical Report Read the full episode description	Dec 05, 2025
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach Read the full episode description	Dec 05, 2025
PretrainZero: Reinforcement Active Pretraining Read the full episode description	Dec 05, 2025
ViDiC: Video Difference Captioning Read the full episode description	Dec 05, 2025
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models Read the full episode description	Dec 04, 2025
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration Read the full episode description	Dec 04, 2025
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework Read the full episode description	Dec 04, 2025
MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory Read the full episode description	Dec 04, 2025
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch Read the full episode description	Dec 04, 2025
DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation Read the full episode description	Dec 04, 2025
Guided Self-Evolving LLMs with Minimal Human Supervision Read the full episode description	Dec 04, 2025
SimScale: Learning to Drive via Real-World Simulation at Scale Read the full episode description	Dec 04, 2025
InnoGym: Benchmarking the Innovation Potential of AI Agents Read the full episode description	Dec 04, 2025
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling Read the full episode description	Dec 03, 2025
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights Read the full episode description	Dec 03, 2025
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices Read the full episode description	Dec 03, 2025
How Far Are We from Genuinely Useful Deep Research Agents? Read the full episode description	Dec 03, 2025
What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards Read the full episode description	Dec 03, 2025
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout Read the full episode description	Dec 03, 2025
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment Read the full episode description	Dec 03, 2025
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models Read the full episode description	Dec 03, 2025
LFM2 Technical Report Read the full episode description	Dec 03, 2025
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer Read the full episode description	Dec 02, 2025
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models Read the full episode description	Dec 02, 2025
Vision Bridge Transformer at Scale Read the full episode description	Dec 02, 2025
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning Read the full episode description	Dec 02, 2025
Architecture Decoupling Is Not All You Need For Unified Multimodal Model Read the full episode description	Dec 02, 2025
Multimodal Evaluation of Russian-language Architectures Read the full episode description	Nov 28, 2025
Latent Collaboration in Multi-Agent Systems Read the full episode description	Nov 28, 2025
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation Read the full episode description	Nov 28, 2025
GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms Read the full episode description	Nov 27, 2025
MedSAM3: Delving into Segment Anything with Medical Concepts Read the full episode description	Nov 27, 2025
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning Read the full episode description	Nov 27, 2025
SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation Read the full episode description	Nov 27, 2025
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation Read the full episode description	Nov 27, 2025
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward Read the full episode description	Nov 27, 2025
GigaWorld-0: World Models as Data Engine to Empower Embodied AI Read the full episode description	Nov 27, 2025
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space Read the full episode description	Nov 27, 2025
Soft Adaptive Policy Optimization Read the full episode description	Nov 27, 2025
General Agentic Memory Via Deep Research Read the full episode description	Nov 26, 2025
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning Read the full episode description	Nov 26, 2025
Computer-Use Agents as Judges for Generative User Interface Read the full episode description	Nov 26, 2025
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation Read the full episode description	Nov 26, 2025
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research Read the full episode description	Nov 26, 2025
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios Read the full episode description	Nov 26, 2025
In-Video Instructions: Visual Signals as Generative Control Read the full episode description	Nov 26, 2025
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe Read the full episode description	Nov 25, 2025
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story Read the full episode description	Nov 25, 2025
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization Read the full episode description	Nov 25, 2025
SAM 3: Segment Anything with Concepts Read the full episode description	Nov 25, 2025
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks Read the full episode description	Nov 21, 2025
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation Read the full episode description	Nov 21, 2025
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity Read the full episode description	Nov 21, 2025
VisPlay: Self-Evolving Vision-Language Models from Images Read the full episode description	Nov 21, 2025
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset Read the full episode description	Nov 21, 2025
VIDEOP2R: Video Understanding from Perception to Reasoning Read the full episode description	Nov 20, 2025
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models Read the full episode description	Nov 20, 2025
AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models Read the full episode description	Nov 20, 2025
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space Read the full episode description	Nov 20, 2025
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark Read the full episode description	Nov 20, 2025
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs Read the full episode description	Nov 20, 2025
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding Read the full episode description	Nov 20, 2025
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Read the full episode description	Nov 19, 2025
P1: Mastering Physics Olympiads with Reinforcement Learning Read the full episode description	Nov 19, 2025
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling Read the full episode description	Nov 19, 2025
Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance Read the full episode description	Nov 19, 2025
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model Read the full episode description	Nov 19, 2025
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation Read the full episode description	Nov 19, 2025
GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning Read the full episode description	Nov 19, 2025
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models Read the full episode description	Nov 19, 2025
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image Read the full episode description	Nov 19, 2025
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models Read the full episode description	Nov 18, 2025
DoPE: Denoising Rotary Position Embedding Read the full episode description	Nov 18, 2025
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation Read the full episode description	Nov 18, 2025
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation Read the full episode description	Nov 18, 2025
AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery Read the full episode description	Nov 18, 2025
LiteAttention: A Temporal Sparse Attention for Diffusion Transformers Read the full episode description	Nov 18, 2025
Virtual Width Networks Read the full episode description	Nov 18, 2025
One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models Read the full episode description	Nov 15, 2025
PAN: A World Model for General, Interactable, and Long-Horizon World Simulation Read the full episode description	Nov 15, 2025
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist Read the full episode description	Nov 15, 2025
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains Read the full episode description	Nov 11, 2025
DeepEyesV2: Toward Agentic Multimodal Model Read the full episode description	Nov 11, 2025
Visual Spatial Tuning Read the full episode description	Nov 11, 2025
VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks Read the full episode description	Nov 11, 2025
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Read the full episode description	Nov 08, 2025
V-Thinker: Interactive Thinking with Images Read the full episode description	Nov 08, 2025
Scaling Agent Learning via Experience Synthesis Read the full episode description	Nov 08, 2025
Diffusion Language Models are Super Data Learners Read the full episode description	Nov 07, 2025
LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation Read the full episode description	Nov 07, 2025
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions Read the full episode description	Nov 07, 2025
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization Read the full episode description	Nov 06, 2025
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Read the full episode description	Nov 06, 2025
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought Read the full episode description	Nov 06, 2025
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation Read the full episode description	Nov 05, 2025
Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph Read the full episode description	Nov 05, 2025
The Underappreciated Power of Vision Models for Graph Structural Understanding Read the full episode description	Nov 05, 2025
UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback Read the full episode description	Nov 05, 2025
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation Read the full episode description	Nov 05, 2025
PHUMA: Physically-Grounded Humanoid Locomotion Dataset Read the full episode description	Nov 05, 2025
UniREditBench: A Unified Reasoning-based Image Editing Benchmark Read the full episode description	Nov 05, 2025
World Simulation with Video Foundation Models for Physical AI Read the full episode description	Nov 05, 2025
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning Read the full episode description	Nov 04, 2025
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats Read the full episode description	Nov 04, 2025
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning Read the full episode description	Nov 04, 2025
The End of Manual Decoding: Towards Truly End-to-End Language Models Read the full episode description	Nov 01, 2025
Kimi Linear: An Expressive, Efficient Attention Architecture Read the full episode description	Nov 01, 2025
Surfer 2: The Next Generation of Cross-Platform Computer Use Agents Read the full episode description	Nov 01, 2025
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark Read the full episode description	Nov 01, 2025
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation Read the full episode description	Nov 01, 2025
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations Read the full episode description	Oct 29, 2025
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning Read the full episode description	Oct 24, 2025
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping Read the full episode description	Oct 24, 2025
LoongRL:Reinforcement Learning for Advanced Reasoning over Long Contexts Read the full episode description	Oct 24, 2025
Language Models are Injective and Hence Invertible Read the full episode description	Oct 24, 2025
GigaBrain-0: A World Model-Powered Vision-Language-Action Model Read the full episode description	Oct 24, 2025
LightMem: Lightweight and Efficient Memory-Augmented Generation Read the full episode description	Oct 23, 2025
Efficient Long-context Language Model Training by Core Attention Disaggregation Read the full episode description	Oct 23, 2025
World-in-World: World Models in a Closed-Loop World Read the full episode description	Oct 23, 2025
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation Read the full episode description	Oct 23, 2025
Chem-R: Learning to Reason as a Chemist Read the full episode description	Oct 23, 2025
MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation Read the full episode description	Oct 23, 2025
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs Read the full episode description	Oct 23, 2025
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model Read the full episode description	Oct 23, 2025
IF-VidCap: Can Video Caption Models Follow Instructions? Read the full episode description	Oct 23, 2025
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science Read the full episode description	Oct 22, 2025
PICABench: How Far Are We from Physically Realistic Image Editing? Read the full episode description	Oct 22, 2025
Glyph: Scaling Context Windows via Visual-Text Compression Read the full episode description	Oct 22, 2025
FineVision: Open Data Is All You Need Read the full episode description	Oct 22, 2025
TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model Read the full episode description	Oct 22, 2025
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation Read the full episode description	Oct 22, 2025
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling Read the full episode description	Oct 22, 2025
A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning Read the full episode description	Oct 21, 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM Read the full episode description	Oct 21, 2025
NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks Read the full episode description	Oct 21, 2025
Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs Read the full episode description	Oct 21, 2025
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset Read the full episode description	Oct 21, 2025
Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery Read the full episode description	Oct 21, 2025
Latent Diffusion Model without Variational Autoencoder Read the full episode description	Oct 21, 2025
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA Read the full episode description	Oct 18, 2025
Agentic Entropy-Balanced Policy Optimization Read the full episode description	Oct 18, 2025
WithAnyone: Towards Controllable and ID Consistent Image Generation Read the full episode description	Oct 18, 2025
AI for Service: Proactive Assistance with AI Glasses Read the full episode description	Oct 18, 2025
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Read the full episode description	Oct 18, 2025
ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints Read the full episode description	Oct 18, 2025
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents Read the full episode description	Oct 18, 2025
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding Read the full episode description	Oct 18, 2025
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar Read the full episode description	Oct 18, 2025
BitNet Distillation Read the full episode description	Oct 18, 2025
Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model Read the full episode description	Oct 16, 2025
Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training Read the full episode description	Oct 16, 2025
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation Read the full episode description	Oct 16, 2025
Scaling Language-Centric Omnimodal Representation Learning Read the full episode description	Oct 16, 2025
Robot Learning: A Tutorial Read the full episode description	Oct 16, 2025
Detect Anything via Next Point Prediction Read the full episode description	Oct 16, 2025
A Survey of Vibe Coding with Large Language Models Read the full episode description	Oct 16, 2025
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution Read the full episode description	Oct 16, 2025
Dr.LLM: Dynamic Layer Routing in LLMs Read the full episode description	Oct 16, 2025
Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models Read the full episode description	Oct 16, 2025
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs Read the full episode description	Oct 15, 2025
Diffusion Transformers with Representation Autoencoders Read the full episode description	Oct 15, 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs Read the full episode description	Oct 15, 2025
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States Read the full episode description	Oct 15, 2025
Spotlight on Token Perception for Multimodal Reinforcement Learning Read the full episode description	Oct 15, 2025
RLFR: Extending Reinforcement Learning for LLMs with Flow Environment Read the full episode description	Oct 15, 2025
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training Read the full episode description	Oct 15, 2025
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration Read the full episode description	Oct 15, 2025
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models Read the full episode description	Oct 15, 2025
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions Read the full episode description	Oct 15, 2025
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI Read the full episode description	Oct 14, 2025
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation Read the full episode description	Oct 14, 2025
TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling Read the full episode description	Oct 14, 2025
AutoPR: Let's Automate Your Academic Promotion! Read the full episode description	Oct 14, 2025
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs Read the full episode description	Oct 14, 2025
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities Read the full episode description	Oct 14, 2025
StreamingVLM: Real-Time Understanding for Infinite Video Streams Read the full episode description	Oct 14, 2025
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels Read the full episode description	Oct 14, 2025
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Read the full episode description	Oct 14, 2025
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? Read the full episode description	Oct 14, 2025
Agent Learning via Early Experience Read the full episode description	Oct 11, 2025
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization Read the full episode description	Oct 11, 2025
MemMamba: Rethinking Memory Patterns in State Space Model Read the full episode description	Oct 11, 2025
UniVideo: Unified Understanding, Generation, and Editing for Videos Read the full episode description	Oct 11, 2025
From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning Read the full episode description	Oct 11, 2025
When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs Read the full episode description	Oct 11, 2025
Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning Read the full episode description	Oct 11, 2025
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning Read the full episode description	Oct 11, 2025
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety Read the full episode description	Oct 11, 2025
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense Read the full episode description	Oct 11, 2025
Cache-to-Cache: Direct Semantic Communication Between Large Language Models Read the full episode description	Oct 10, 2025
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer Read the full episode description	Oct 10, 2025
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding Read the full episode description	Oct 10, 2025
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models Read the full episode description	Oct 10, 2025
MATRIX: Mask Track Alignment for Interaction-aware Video Generation Read the full episode description	Oct 10, 2025
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training Read the full episode description	Oct 10, 2025
Vibe Checker: Aligning Code Evaluation with Human Preference Read the full episode description	Oct 10, 2025
Less is More: Recursive Reasoning with Tiny Networks Read the full episode description	Oct 09, 2025
TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning Read the full episode description	Oct 09, 2025
Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs Read the full episode description	Oct 09, 2025
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use Read the full episode description	Oct 09, 2025
Fast-dLLM v2: Efficient Block-Diffusion LLM Read the full episode description	Oct 09, 2025
CoDA: Coding LM via Diffusion Adaptation Read the full episode description	Oct 09, 2025
Drax: Speech Recognition with Discrete Flow Matching Read the full episode description	Oct 09, 2025
Paper2Video: Automatic Video Generation from Scientific Papers Read the full episode description	Oct 08, 2025
MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information Read the full episode description	Oct 08, 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models Read the full episode description	Oct 08, 2025
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation Read the full episode description	Oct 08, 2025
Imperceptible Jailbreaking against Large Language Models Read the full episode description	Oct 08, 2025
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models Read the full episode description	Oct 08, 2025
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights Read the full episode description	Oct 08, 2025
Optimal Scaling Needs Optimal Norm Read the full episode description	Oct 08, 2025
Apriel-1.5-15b-Thinker Read the full episode description	Oct 07, 2025
Large Reasoning Models Learn Better Alignment from Flawed Thinking Read the full episode description	Oct 07, 2025
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation Read the full episode description	Oct 07, 2025
LongCodeZip: Compress Long Context for Code Language Models Read the full episode description	Oct 04, 2025
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation Read the full episode description	Oct 04, 2025
ExGRPO: Learning to Reason from Experience Read the full episode description	Oct 04, 2025
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions Read the full episode description	Oct 04, 2025
Interactive Training: Feedback-Driven Neural Network Optimization Read the full episode description	Oct 04, 2025
ModernVBERT: Towards Smaller Visual Document Retrievers Read the full episode description	Oct 04, 2025
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? Read the full episode description	Oct 04, 2025
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search Read the full episode description	Oct 03, 2025
GEM: A Gym for Agentic LLMs Read the full episode description	Oct 03, 2025
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators Read the full episode description	Oct 03, 2025
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation Read the full episode description	Oct 03, 2025
PIPer: On-Device Environment Setup via Online Reinforcement Learning Read the full episode description	Oct 03, 2025
SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights Read the full episode description	Oct 03, 2025
ACON: Optimizing Context Compression for Long-horizon LLM Agents Read the full episode description	Oct 03, 2025
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Read the full episode description	Oct 02, 2025
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain Read the full episode description	Oct 02, 2025
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play Read the full episode description	Oct 02, 2025
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning Read the full episode description	Oct 02, 2025
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning Read the full episode description	Oct 02, 2025
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training Read the full episode description	Oct 02, 2025
OceanGym: A Benchmark Environment for Underwater Embodied Agents Read the full episode description	Oct 02, 2025
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models Read the full episode description	Oct 02, 2025
Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners Read the full episode description	Oct 02, 2025
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder Read the full episode description	Oct 02, 2025
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention Read the full episode description	Oct 01, 2025
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs Read the full episode description	Oct 01, 2025
Multiplayer Nash Preference Optimization Read the full episode description	Oct 01, 2025
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark Read the full episode description	Oct 01, 2025
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR Read the full episode description	Oct 01, 2025
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing Read the full episode description	Oct 01, 2025
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer Read the full episode description	Oct 01, 2025
Democratizing AI scientists using ToolUniverse Read the full episode description	Oct 01, 2025
Visual Jigsaw Post-Training Improves MLLMs Read the full episode description	Oct 01, 2025
When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance Read the full episode description	Oct 01, 2025
LongLive: Real-time Interactive Long Video Generation Read the full episode description	Sep 30, 2025
Quantile Advantage Estimation for Entropy-Safe Reasoning Read the full episode description	Sep 30, 2025
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning Read the full episode description	Sep 30, 2025
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Read the full episode description	Sep 30, 2025
ReviewScore: Misinformed Peer Review Detection with Large Language Models Read the full episode description	Sep 30, 2025
Variational Reasoning for Language Models Read the full episode description	Sep 30, 2025
Language Models Can Learn from Verbal Feedback Without Scalar Rewards Read the full episode description	Sep 30, 2025
MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning Read the full episode description	Sep 30, 2025
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning Read the full episode description	Sep 30, 2025
No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping Read the full episode description	Sep 30, 2025
VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models Read the full episode description	Sep 27, 2025
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines Read the full episode description	Sep 27, 2025
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources Read the full episode description	Sep 27, 2025
Tree Search for LLM Agent Reinforcement Learning Read the full episode description	Sep 27, 2025
Seedream 4.0: Toward Next-generation Multimodal Image Generation Read the full episode description	Sep 27, 2025
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets Read the full episode description	Sep 27, 2025
AutoIntent: AutoML for Text Classification Read the full episode description	Sep 27, 2025
Video models are zero-shot learners and reasoners Read the full episode description	Sep 26, 2025
SIM-CoT: Supervised Implicit Chain-of-Thought Read the full episode description	Sep 26, 2025
Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR Read the full episode description	Sep 25, 2025
Reinforcement Learning on Pre-Training Data Read the full episode description	Sep 25, 2025
Do You Need Proprioceptive States in Visuomotor Policies? Read the full episode description	Sep 25, 2025
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe Read the full episode description	Sep 25, 2025
LIMI: Less is More for Agency Read the full episode description	Sep 24, 2025
Qwen3-Omni Technical Report Read the full episode description	Sep 24, 2025
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models Read the full episode description	Sep 24, 2025
OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System Read the full episode description	Sep 24, 2025
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs Read the full episode description	Sep 24, 2025
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation Read the full episode description	Sep 23, 2025
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Read the full episode description	Sep 23, 2025
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification Read the full episode description	Sep 23, 2025
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data Read the full episode description	Sep 20, 2025
FlowRL: Matching Reward Distributions for LLM Reasoning Read the full episode description	Sep 20, 2025
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Delibration Read the full episode description	Sep 20, 2025
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation Read the full episode description	Sep 20, 2025
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning Read the full episode description	Sep 20, 2025
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation Read the full episode description	Sep 20, 2025
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale Read the full episode description	Sep 19, 2025
SAIL-VL2 Technical Report Read the full episode description	Sep 19, 2025
PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era Read the full episode description	Sep 19, 2025
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research Read the full episode description	Sep 18, 2025
Scaling Agents via Continual Pre-training Read the full episode description	Sep 18, 2025
WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning Read the full episode description	Sep 18, 2025
Towards General Agentic Intelligence via Environment Scaling Read the full episode description	Sep 18, 2025
WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents Read the full episode description	Sep 18, 2025
ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization Read the full episode description	Sep 18, 2025
Single-stream Policy Optimization Read the full episode description	Sep 18, 2025
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling Read the full episode description	Sep 17, 2025
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning Read the full episode description	Sep 17, 2025
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts Read the full episode description	Sep 17, 2025
IntrEx: A Dataset for Modeling Engagement in Educational Conversations Read the full episode description	Sep 16, 2025
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Read the full episode description	Sep 16, 2025
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model Read the full episode description	Sep 13, 2025
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning Read the full episode description	Sep 13, 2025
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning Read the full episode description	Sep 13, 2025
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs Read the full episode description	Sep 13, 2025
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents Read the full episode description	Sep 13, 2025
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis Read the full episode description	Sep 13, 2025
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark Read the full episode description	Sep 13, 2025
Can Understanding and Generation Truly Benefit Together -- or Just Coexist? Read the full episode description	Sep 13, 2025
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining Read the full episode description	Sep 13, 2025
A Survey of Reinforcement Learning for Large Reasoning Models Read the full episode description	Sep 12, 2025
RewardDance: Reward Scaling in Visual Generation Read the full episode description	Sep 12, 2025
3D and 4D World Modeling: A Survey Read the full episode description	Sep 12, 2025
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning Read the full episode description	Sep 12, 2025
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning Read the full episode description	Sep 11, 2025
Visual Representation Alignment for Multimodal Large Language Models Read the full episode description	Sep 11, 2025
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search Read the full episode description	Sep 11, 2025
Reconstruction Alignment Improves Unified Multimodal Models Read the full episode description	Sep 11, 2025
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward Read the full episode description	Sep 11, 2025
Reverse-Engineered Reasoning for Open-Ended Generation Read the full episode description	Sep 10, 2025
Does DINOv3 Set a New Medical Vision Standard? Read the full episode description	Sep 10, 2025
Symbolic Graphics Programming with Large Language Models Read the full episode description	Sep 09, 2025
Set Block Decoding is a Language Model Inference Accelerator Read the full episode description	Sep 09, 2025
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth Read the full episode description	Sep 06, 2025
From Editor to Dense Geometry Estimator Read the full episode description	Sep 06, 2025
Towards a Unified View of Large Language Model Post-Training Read the full episode description	Sep 06, 2025
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks Read the full episode description	Sep 06, 2025
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions? Read the full episode description	Sep 06, 2025
Open Data Synthesis For Deep Research Read the full episode description	Sep 05, 2025
Robix: A Unified Model for Robot Interaction, Reasoning and Planning Read the full episode description	Sep 05, 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning Read the full episode description	Sep 04, 2025
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model Read the full episode description	Sep 04, 2025
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding Read the full episode description	Sep 04, 2025
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion Read the full episode description	Sep 04, 2025
Baichuan-M2: Scaling Medical Capability with Large Verifier System Read the full episode description	Sep 04, 2025
Kwai Keye-VL 1.5 Technical Report Read the full episode description	Sep 04, 2025
Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic Read the full episode description	Sep 04, 2025
PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning Read the full episode description	Sep 03, 2025
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning Read the full episode description	Sep 02, 2025
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers Read the full episode description	Sep 02, 2025
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling Read the full episode description	Aug 28, 2025
VibeVoice Technical Report Read the full episode description	Aug 28, 2025
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics Read the full episode description	Aug 28, 2025
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space Read the full episode description	Aug 28, 2025
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation Read the full episode description	Aug 28, 2025
Spacer: Towards Engineered Scientific Inspiration Read the full episode description	Aug 28, 2025
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning Read the full episode description	Aug 28, 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Read the full episode description	Aug 27, 2025
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation Read the full episode description	Aug 27, 2025
MV-RAG: Retrieval Augmented Multiview Diffusion Read the full episode description	Aug 27, 2025
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs Read the full episode description	Aug 26, 2025
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR Read the full episode description	Aug 26, 2025
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks Read the full episode description	Aug 26, 2025
Intern-S1: A Scientific Multimodal Foundation Model Read the full episode description	Aug 23, 2025
Mobile-Agent-v3: Foundamental Agents for GUI Automation Read the full episode description	Aug 23, 2025
Deep Think with Confidence Read the full episode description	Aug 23, 2025
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries Read the full episode description	Aug 23, 2025
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization Read the full episode description	Aug 22, 2025
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models Read the full episode description	Aug 22, 2025
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Read the full episode description	Aug 22, 2025
MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds Read the full episode description	Aug 22, 2025
Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization Read the full episode description	Aug 22, 2025
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL Read the full episode description	Aug 21, 2025
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos Read the full episode description	Aug 21, 2025
Prompt Orchestration Markup Language Read the full episode description	Aug 21, 2025
Ovis2.5 Technical Report Read the full episode description	Aug 20, 2025
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning Read the full episode description	Aug 20, 2025
4DNeX: Feed-Forward 4D Generative Modeling Made Easy Read the full episode description	Aug 20, 2025
Next Visual Granularity Generation Read the full episode description	Aug 20, 2025
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models Read the full episode description	Aug 20, 2025
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs Read the full episode description	Aug 20, 2025
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study Read the full episode description	Aug 20, 2025
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds Read the full episode description	Aug 20, 2025
SSRL: Self-Search Reinforcement Learning Read the full episode description	Aug 19, 2025
DINOv3 Read the full episode description	Aug 19, 2025
Thyme: Think Beyond Images Read the full episode description	Aug 19, 2025
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining Read the full episode description	Aug 19, 2025
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization Read the full episode description	Aug 19, 2025
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning Read the full episode description	Aug 16, 2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale Read the full episode description	Aug 16, 2025
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts Read the full episode description	Aug 16, 2025
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing Read the full episode description	Aug 16, 2025
Story2Board: A Training-Free Approach for Expressive Storyboard Generation Read the full episode description	Aug 15, 2025
Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery Read the full episode description	Aug 15, 2025
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation Read the full episode description	Aug 15, 2025
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing Read the full episode description	Aug 15, 2025
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory Read the full episode description	Aug 15, 2025
AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving Read the full episode description	Aug 15, 2025
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL Read the full episode description	Aug 14, 2025
Complex Logical Instruction Generation Read the full episode description	Aug 14, 2025
Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models Read the full episode description	Aug 14, 2025
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches Read the full episode description	Aug 14, 2025
ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability Read the full episode description	Aug 13, 2025
WideSearch: Benchmarking Agentic Broad Info-Seeking Read the full episode description	Aug 13, 2025
Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation Read the full episode description	Aug 13, 2025
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems Read the full episode description	Aug 13, 2025
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent Read the full episode description	Aug 13, 2025
SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens Read the full episode description	Aug 13, 2025
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization Read the full episode description	Aug 13, 2025
MolmoAct: Action Reasoning Models that can Reason in Space Read the full episode description	Aug 13, 2025
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models Read the full episode description	Aug 12, 2025
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off Read the full episode description	Aug 12, 2025
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens Read the full episode description	Aug 08, 2025
VeriGUI: Verifiable Long-Chain GUI Dataset Read the full episode description	Aug 08, 2025
Efficient Agents: Building Effective Agents While Reducing Cost Read the full episode description	Aug 08, 2025
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience Read the full episode description	Aug 08, 2025
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning Read the full episode description	Aug 08, 2025
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success Read the full episode description	Aug 08, 2025
Agent Lightning: Train ANY AI Agents with Reinforcement Learning Read the full episode description	Aug 08, 2025
Qwen-Image Technical Report Read the full episode description	Aug 06, 2025
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension Read the full episode description	Aug 06, 2025
CellForge: Agentic Design of Virtual Cell Models Read the full episode description	Aug 06, 2025
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following Read the full episode description	Aug 06, 2025
Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report Read the full episode description	Aug 06, 2025
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models Read the full episode description	Aug 05, 2025
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training Read the full episode description	Aug 05, 2025
PixNerd: Pixel Neural Field Diffusion Read the full episode description	Aug 05, 2025
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving Read the full episode description	Aug 02, 2025
Phi-Ground Tech Report: Advancing Perception in GUI Grounding Read the full episode description	Aug 02, 2025
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents Read the full episode description	Aug 01, 2025
BANG: Dividing 3D Assets via Generative Exploded Dynamics Read the full episode description	Aug 01, 2025
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning Read the full episode description	Aug 01, 2025
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels Read the full episode description	Jul 31, 2025
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again Read the full episode description	Jul 31, 2025
ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge Read the full episode description	Jul 31, 2025
Agentic Reinforced Policy Optimization Read the full episode description	Jul 30, 2025
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts Read the full episode description	Jul 30, 2025
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence Read the full episode description	Jul 30, 2025
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning Read the full episode description	Jul 30, 2025
SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment Read the full episode description	Jul 30, 2025
Reconstructing 4D Spatial Intelligence: A Survey Read the full episode description	Jul 30, 2025
Deep Researcher with Test-Time Diffusion Read the full episode description	Jul 29, 2025
∇NABLA: Neighborhood Adaptive Block-Level Attention Read the full episode description	Jul 26, 2025
Group Sequence Policy Optimization Read the full episode description	Jul 26, 2025
MUR: Momentum Uncertainty guided Reasoning for Large Language Models Read the full episode description	Jul 26, 2025
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization Read the full episode description	Jul 26, 2025
Pixels, Patterns, but No Poetry: To See The World like Humans Read the full episode description	Jul 25, 2025
Yume: An Interactive World Generation Model Read the full episode description	Jul 25, 2025
DesignLab: Designing Slides Through Iterative Detection and Correction Read the full episode description	Jul 25, 2025
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning Read the full episode description	Jul 25, 2025
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning Read the full episode description	Jul 24, 2025
Step-Audio 2 Technical Report Read the full episode description	Jul 24, 2025
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning Read the full episode description	Jul 24, 2025
Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers Read the full episode description	Jul 24, 2025
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Read the full episode description	Jul 24, 2025
GUI-G²: Gaussian Reward Modeling for GUI Grounding Read the full episode description	Jul 23, 2025
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization Read the full episode description	Jul 23, 2025
The Invisible Leash: Why RLVR May Not Escape Its Origin Read the full episode description	Jul 23, 2025
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining Read the full episode description	Jul 23, 2025
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization Read the full episode description	Jul 23, 2025
GR-3 Technical Report Read the full episode description	Jul 23, 2025
Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling Read the full episode description	Jul 23, 2025
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction Read the full episode description	Jul 23, 2025
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos Read the full episode description	Jul 23, 2025
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs Read the full episode description	Jul 22, 2025
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Read the full episode description	Jul 22, 2025
A Survey of Context Engineering for Large Language Models Read the full episode description	Jul 19, 2025
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning Read the full episode description	Jul 19, 2025
π³: Scalable Permutation-Equivariant Visual Geometry Learning Read the full episode description	Jul 19, 2025
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner Read the full episode description	Jul 19, 2025
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Read the full episode description	Jul 19, 2025
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Read the full episode description	Jul 19, 2025
RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization Read the full episode description	Jul 19, 2025
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs Read the full episode description	Jul 18, 2025
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models Read the full episode description	Jul 17, 2025
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes Read the full episode description	Jul 17, 2025
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination Read the full episode description	Jul 16, 2025
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation Read the full episode description	Jul 16, 2025
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Read the full episode description	Jul 16, 2025
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments Read the full episode description	Jul 16, 2025
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once Read the full episode description	Jul 16, 2025
Test-Time Scaling with Reflective Generative Model Read the full episode description	Jul 15, 2025
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning Read the full episode description	Jul 15, 2025
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models Read the full episode description	Jul 15, 2025
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering Read the full episode description	Jul 15, 2025
KV Cache Steering for Inducing Reasoning in Small Language Models Read the full episode description	Jul 15, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Read the full episode description	Jul 15, 2025
Neural-Driven Image Editing Read the full episode description	Jul 15, 2025
Scaling RL to Long Videos Read the full episode description	Jul 12, 2025
T-LoRA: Single Image Diffusion Model Customization Without Overfitting Read the full episode description	Jul 12, 2025
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Read the full episode description	Jul 12, 2025
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding Read the full episode description	Jul 12, 2025
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Read the full episode description	Jul 12, 2025
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling Read the full episode description	Jul 12, 2025
PyVision: Agentic Vision with Dynamic Tooling Read the full episode description	Jul 12, 2025
4KAgent: Agentic Any Image to 4K Super-Resolution Read the full episode description	Jul 11, 2025
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data Read the full episode description	Jul 11, 2025
Perception-Aware Policy Optimization for Multimodal Reasoning Read the full episode description	Jul 11, 2025
MIRIX: Multi-Agent Memory System for LLM-Based Agents Read the full episode description	Jul 11, 2025
Rethinking Verification for LLM Code Generation: From Generation to Testing Read the full episode description	Jul 11, 2025
SingLoRA: Low Rank Adaptation Using a Single Matrix Read the full episode description	Jul 10, 2025
A Survey on Latent Reasoning Read the full episode description	Jul 10, 2025
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion Read the full episode description	Jul 10, 2025
How to Train Your LLM Web Agent: A Statistical Diagnosis Read the full episode description	Jul 10, 2025
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling Read the full episode description	Jul 10, 2025
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization Read the full episode description	Jul 10, 2025
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents Read the full episode description	Jul 10, 2025
MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos Read the full episode description	Jul 10, 2025
MemOS: A Memory OS for AI System Read the full episode description	Jul 09, 2025
Should We Still Pretrain Encoders with Masked Language Modeling? Read the full episode description	Jul 09, 2025
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving Read the full episode description	Jul 09, 2025
4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture Read the full episode description	Jul 09, 2025
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge Read the full episode description	Jul 09, 2025
Pre-Trained Policy Discriminators are General Reward Models Read the full episode description	Jul 09, 2025
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset Read the full episode description	Jul 09, 2025
WebSailor: Navigating Super-human Reasoning for Web Agent Read the full episode description	Jul 05, 2025
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion Read the full episode description	Jul 05, 2025
Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback Read the full episode description	Jul 05, 2025
IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction Read the full episode description	Jul 05, 2025
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy Read the full episode description	Jul 05, 2025
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers Read the full episode description	Jul 05, 2025
Kwai Keye-VL Technical Report Read the full episode description	Jul 04, 2025
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory Read the full episode description	Jul 04, 2025
Depth Anything at Any Condition Read the full episode description	Jul 04, 2025
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective Read the full episode description	Jul 04, 2025
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Read the full episode description	Jul 03, 2025
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Read the full episode description	Jul 03, 2025
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks Read the full episode description	Jul 03, 2025
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings Read the full episode description	Jul 03, 2025
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation Read the full episode description	Jul 03, 2025
Ovis-U1 Technical Report Read the full episode description	Jul 02, 2025
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning Read the full episode description	Jul 02, 2025
VMoBA: Mixture-of-Block Attention for Video Diffusion Models Read the full episode description	Jul 02, 2025
Calligrapher: Freestyle Text Image Customization Read the full episode description	Jul 02, 2025
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing Read the full episode description	Jul 01, 2025
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Read the full episode description	Jul 01, 2025
XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation Read the full episode description	Jul 01, 2025
Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback Read the full episode description	Jun 17, 2025
Effective Red-Teaming of Policy-Adherent Agents Read the full episode description	Jun 17, 2025
Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation Read the full episode description	Jun 17, 2025
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning Read the full episode description	Jun 14, 2025
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks Read the full episode description	Jun 14, 2025
Text-Aware Image Restoration with Diffusion Models Read the full episode description	Jun 14, 2025
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation Read the full episode description	Jun 14, 2025
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos Read the full episode description	Jun 14, 2025
Discrete Audio Tokens: More Than a Survey! Read the full episode description	Jun 14, 2025
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models Read the full episode description	Jun 13, 2025
Seedance 1.0: Exploring the Boundaries of Video Generation Models Read the full episode description	Jun 13, 2025
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation Read the full episode description	Jun 13, 2025
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation Read the full episode description	Jun 13, 2025
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation Read the full episode description	Jun 13, 2025
PlayerOne: Egocentric World Simulator Read the full episode description	Jun 13, 2025
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation Read the full episode description	Jun 13, 2025
Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models Read the full episode description	Jun 12, 2025
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Read the full episode description	Jun 12, 2025
RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling Read the full episode description	Jun 12, 2025
Reinforcement Pre-Training Read the full episode description	Jun 11, 2025
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance Read the full episode description	Jun 11, 2025
MiniCPM4: Ultra-Efficient LLMs on End Devices Read the full episode description	Jun 11, 2025
SpatialLM: Training Large Language Models for Structured Indoor Modeling Read the full episode description	Jun 11, 2025
Image Reconstruction as a Tool for Feature Analysis Read the full episode description	Jun 11, 2025
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning Read the full episode description	Jun 11, 2025
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA Read the full episode description	Jun 10, 2025
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion Read the full episode description	Jun 10, 2025
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Read the full episode description	Jun 10, 2025
Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs Read the full episode description	Jun 10, 2025
SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training Read the full episode description	Jun 07, 2025
ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development Read the full episode description	Jun 07, 2025
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts Read the full episode description	Jun 07, 2025
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics Read the full episode description	Jun 07, 2025
Video World Models with Long-term Spatial Memory Read the full episode description	Jun 07, 2025
Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights Read the full episode description	Jun 07, 2025
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Read the full episode description	Jun 07, 2025
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models Read the full episode description	Jun 07, 2025
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Read the full episode description	Jun 07, 2025
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos Read the full episode description	Jun 07, 2025
MiMo-VL Technical Report Read the full episode description	Jun 06, 2025
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning Read the full episode description	Jun 06, 2025
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment Read the full episode description	Jun 06, 2025
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark Read the full episode description	Jun 06, 2025
A Controllable Examination for Long-Context Language Models Read the full episode description	Jun 06, 2025
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos Read the full episode description	Jun 06, 2025
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis Read the full episode description	Jun 06, 2025
SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models Read the full episode description	Jun 06, 2025
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning Read the full episode description	Jun 05, 2025
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments Read the full episode description	Jun 05, 2025
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation Read the full episode description	Jun 05, 2025
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Read the full episode description	Jun 05, 2025
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs Read the full episode description	Jun 05, 2025
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents Read the full episode description	Jun 05, 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces Read the full episode description	Jun 05, 2025
OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation Read the full episode description	Jun 05, 2025
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning Read the full episode description	Jun 04, 2025
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards Read the full episode description	Jun 04, 2025
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Read the full episode description	Jun 04, 2025
Taming LLMs by Scaling Learning Rates with Gradient Grouping Read the full episode description	Jun 04, 2025
ARIA: Training Language Agents with Intention-Driven Reward Aggregation Read the full episode description	Jun 04, 2025
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models Read the full episode description	Jun 04, 2025
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks Read the full episode description	Jun 04, 2025
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles Read the full episode description	Jun 04, 2025
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding Read the full episode description	Jun 04, 2025
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning Read the full episode description	Jun 04, 2025
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models Read the full episode description	Jun 03, 2025
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time Read the full episode description	Jun 03, 2025
Time Blindness: Why Video-Language Models Can't See What Humans Can? Read the full episode description	Jun 03, 2025
HardTests: Synthesizing High-Quality Test Cases for LLM Coding Read the full episode description	Jun 03, 2025
Large Language Models for Data Synthesis Read the full episode description	Jun 03, 2025
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation Read the full episode description	Jun 03, 2025
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization Read the full episode description	Jun 03, 2025
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models Read the full episode description	Jun 03, 2025
Table-R1: Inference-Time Scaling for Table Reasoning Read the full episode description	May 31, 2025
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Read the full episode description	May 31, 2025
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos Read the full episode description	May 31, 2025
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason Read the full episode description	May 31, 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost Read the full episode description	May 31, 2025
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? Read the full episode description	May 31, 2025
Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering Read the full episode description	May 31, 2025
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models Read the full episode description	May 30, 2025
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents Read the full episode description	May 30, 2025
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing Read the full episode description	May 30, 2025
Skywork Open Reasoner 1 Technical Report Read the full episode description	May 30, 2025
Sherlock: Self-Correcting Reasoning in Vision-Language Models Read the full episode description	May 30, 2025
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO Read the full episode description	May 30, 2025
SageAttention2++: A More Efficient Implementation of SageAttention2 Read the full episode description	May 30, 2025
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start Read the full episode description	May 30, 2025
Fostering Video Reasoning via Next-Event Prediction Read the full episode description	May 30, 2025
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination Read the full episode description	May 30, 2025
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Read the full episode description	May 29, 2025
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs Read the full episode description	May 29, 2025
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Read the full episode description	May 29, 2025
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data Read the full episode description	May 29, 2025
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation Read the full episode description	May 29, 2025
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond Read the full episode description	May 29, 2025
Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning Read the full episode description	May 29, 2025
Exploring the Latent Capacity of LLMs for One-Step Text Generation Read the full episode description	May 29, 2025
Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence Read the full episode description	May 29, 2025
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization Read the full episode description	May 29, 2025
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model Read the full episode description	May 28, 2025
Shifting AI Efficiency From Model-Centric to Data-Centric Compression Read the full episode description	May 28, 2025
Alchemist: Turning Public Text-to-Image Data into Generative Gold Read the full episode description	May 28, 2025
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs Read the full episode description	May 28, 2025
PATS: Process-Level Adaptive Thinking Mode Switching Read the full episode description	May 28, 2025
Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance Read the full episode description	May 28, 2025
ARM: Adaptive Reasoning Model Read the full episode description	May 28, 2025
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles Read the full episode description	May 28, 2025
Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective Read the full episode description	May 28, 2025
B-score: Detecting biases in large language models using response history Read the full episode description	May 28, 2025
TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations Read the full episode description	May 27, 2025
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning Read the full episode description	May 27, 2025
Quartet: Native FP4 Training Can Be Optimal for Large Language Models Read the full episode description	May 27, 2025
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models Read the full episode description	May 27, 2025
One RL to See Them All: Visual Triple Unified Reinforcement Learning Read the full episode description	May 27, 2025
Distilling LLM Agent into Small Models with Retrieval and Code Tools Read the full episode description	May 27, 2025
QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization Read the full episode description	May 27, 2025
PhyX: Does Your Model Have the "Wits" for Physical Reasoning? Read the full episode description	May 27, 2025
Scaling Image and Video Generation via Test-Time Evolutionary Search Read the full episode description	May 27, 2025
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback Read the full episode description	May 27, 2025
NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification Read the full episode description	May 24, 2025
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models Read the full episode description	May 24, 2025
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning Read the full episode description	May 24, 2025
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Read the full episode description	May 24, 2025
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models Read the full episode description	May 24, 2025
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design Read the full episode description	May 24, 2025
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning Read the full episode description	May 24, 2025
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning Read the full episode description	May 24, 2025
Scaling Diffusion Transformers Efficiently via $μ$P Read the full episode description	May 24, 2025
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents Read the full episode description	May 23, 2025
MMaDA: Multimodal Large Diffusion Language Models Read the full episode description	May 23, 2025
Scaling Law for Quantization-Aware Training Read the full episode description	May 23, 2025
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning Read the full episode description	May 23, 2025
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective Read the full episode description	May 23, 2025
Efficient Agent Training for Computer Use Read the full episode description	May 23, 2025
This Time is Different: An Observability Perspective on Time Series Foundation Models Read the full episode description	May 23, 2025
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping Read the full episode description	May 23, 2025
Emerging Properties in Unified Multimodal Pretraining Read the full episode description	May 22, 2025
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Read the full episode description	May 22, 2025
Optimizing Anytime Reasoning via Budget Relative Policy Optimization Read the full episode description	May 22, 2025
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank Read the full episode description	May 22, 2025
Visual Agentic Reinforcement Fine-Tuning Read the full episode description	May 22, 2025
Neurosymbolic Diffusion Models Read the full episode description	May 22, 2025
Chain-of-Model Learning for Language Model Read the full episode description	May 21, 2025
AdaptThink: Reasoning Models Can Learn When to Think Read the full episode description	May 21, 2025
AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning Read the full episode description	May 21, 2025
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction Read the full episode description	May 21, 2025
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Read the full episode description	May 21, 2025
Faster Video Diffusion with Trainable Sparse Attention Read the full episode description	May 21, 2025
Thinkless: LLM Learns When to Think Read the full episode description	May 21, 2025
Model Merging in Pre-training of Large Language Models Read the full episode description	May 21, 2025
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space Read the full episode description	May 21, 2025
Qwen3 Technical Report Read the full episode description	May 20, 2025
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Read the full episode description	May 20, 2025
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly Read the full episode description	May 20, 2025
Visual Planning: Let's Think Only with Images Read the full episode description	May 20, 2025
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Read the full episode description	May 17, 2025
System Prompt Optimization with Meta-Learning Read the full episode description	May 17, 2025
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Read the full episode description	May 16, 2025
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception Read the full episode description	May 16, 2025
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures Read the full episode description	May 16, 2025
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Read the full episode description	May 15, 2025
Seed1.5-VL Technical Report Read the full episode description	May 14, 2025
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining Read the full episode description	May 14, 2025
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets Read the full episode description	May 14, 2025
Learning from Peers in Reasoning Models Read the full episode description	May 14, 2025
Unified Continuous Generative Models Read the full episode description	May 14, 2025
REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback Read the full episode description	May 14, 2025
Bielik v3 Small: Technical Report Read the full episode description	May 13, 2025
Bielik 11B v2 Technical Report Read the full episode description	May 13, 2025
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Read the full episode description	May 10, 2025
On Path to Multimodal Generalist: General-Level and General-Bench Read the full episode description	May 10, 2025
Flow-GRPO: Training Flow Matching Models via Online RL Read the full episode description	May 10, 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities Read the full episode description	May 09, 2025
ZeroSearch: Incentivize the Search Capability of LLMs without Searching Read the full episode description	May 09, 2025
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning Read the full episode description	May 08, 2025
Absolute Zero: Reinforced Self-play Reasoning with Zero Data Read the full episode description	May 08, 2025
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale Read the full episode description	May 08, 2025
FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios Read the full episode description	May 08, 2025
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Read the full episode description	May 07, 2025
RM-R1: Reward Modeling as Reasoning Read the full episode description	May 07, 2025
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers Read the full episode description	May 07, 2025
Practical Efficiency of Muon for Pretraining Read the full episode description	May 07, 2025
PixelHacker: Image Inpainting with Structural and Semantic Consistency Read the full episode description	May 06, 2025
A Survey of Interactive Generative Video Read the full episode description	May 03, 2025
DeepCritic: Deliberate Critique with Large Language Models Read the full episode description	May 03, 2025
Sadeed: Advancing Arabic Diacritization Through Small Language Model Read the full episode description	May 02, 2025
WebThinker: Empowering Large Reasoning Models with Deep Research Capability Read the full episode description	May 02, 2025
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math Read the full episode description	May 02, 2025
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning Read the full episode description	May 02, 2025
Reinforcement Learning for Reasoning in Large Language Models with One Training Example Read the full episode description	May 01, 2025
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities Read the full episode description	May 01, 2025
ReasonIR: Training Retrievers for Reasoning Tasks Read the full episode description	May 01, 2025
The Leaderboard Illusion Read the full episode description	May 01, 2025
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models Read the full episode description	May 01, 2025
RepText: Rendering Visual Text via Replicating Read the full episode description	Apr 30, 2025
Towards Understanding Camera Motions in Any Video Read the full episode description	Apr 29, 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning Read the full episode description	Apr 29, 2025
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs Read the full episode description	Apr 29, 2025
Step1X-Edit: A Practical Framework for General Image Editing Read the full episode description	Apr 26, 2025
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Read the full episode description	Apr 26, 2025
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation Read the full episode description	Apr 26, 2025
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs Read the full episode description	Apr 26, 2025
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning Read the full episode description	Apr 25, 2025
Trillion 7B Technical Report Read the full episode description	Apr 25, 2025
Tina: Tiny Reasoning Models via LoRA Read the full episode description	Apr 25, 2025
I-Con: A Unifying Framework for Representation Learning Read the full episode description	Apr 25, 2025
Kuwain 1.5B: An Arabic SLM via Language Injection Read the full episode description	Apr 24, 2025
TTRL: Test-Time Reinforcement Learning Read the full episode description	Apr 24, 2025
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks Read the full episode description	Apr 24, 2025
Describe Anything: Detailed Localized Image and Video Captioning Read the full episode description	Apr 24, 2025
Learning Adaptive Parallel Reasoning with Language Models Read the full episode description	Apr 24, 2025
Learning to Reason under Off-Policy Guidance Read the full episode description	Apr 23, 2025
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models Read the full episode description	Apr 23, 2025
FlowReasoner: Reinforcing Query-Level Meta-Agents Read the full episode description	Apr 23, 2025
ToolRL: Reward is All Tool Learning Needs Read the full episode description	Apr 23, 2025
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents Read the full episode description	Apr 23, 2025
StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians Read the full episode description	Apr 23, 2025
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? Read the full episode description	Apr 22, 2025
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space Read the full episode description	Apr 22, 2025
NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes Read the full episode description	Apr 22, 2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Read the full episode description	Apr 19, 2025
Antidistillation Sampling Read the full episode description	Apr 19, 2025
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling Read the full episode description	Apr 19, 2025
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation Read the full episode description	Apr 19, 2025
WORLDMEM: Long-term Consistent World Simulation with Memory Read the full episode description	Apr 19, 2025
A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis Read the full episode description	Apr 19, 2025
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness Read the full episode description	Apr 18, 2025
BitNet b1.58 2B4T Technical Report Read the full episode description	Apr 18, 2025
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs Read the full episode description	Apr 18, 2025
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations Read the full episode description	Apr 17, 2025
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning Read the full episode description	Apr 17, 2025
How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients Read the full episode description	Apr 17, 2025
Heimdall: test-time scaling on the generative verification Read the full episode description	Apr 17, 2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding Read the full episode description	Apr 17, 2025
TextArena Read the full episode description	Apr 17, 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Read the full episode description	Apr 16, 2025
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters Read the full episode description	Apr 16, 2025
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Read the full episode description	Apr 16, 2025
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding Read the full episode description	Apr 16, 2025
Iterative Self-Training for Code Generation via Reinforced Re-Ranking Read the full episode description	Apr 16, 2025
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model Read the full episode description	Apr 15, 2025
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation Read the full episode description	Apr 15, 2025
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft Read the full episode description	Apr 15, 2025
Kimi-VL Technical Report Read the full episode description	Apr 12, 2025
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing Read the full episode description	Apr 12, 2025
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning Read the full episode description	Apr 12, 2025
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning Read the full episode description	Apr 12, 2025
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Read the full episode description	Apr 12, 2025
MM-IFEngine: Towards Multimodal Instruction Following Read the full episode description	Apr 12, 2025
HoloPart: Generative 3D Part Amodal Segmentation Read the full episode description	Apr 12, 2025
DDT: Decoupled Diffusion Transformer Read the full episode description	Apr 11, 2025
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens Read the full episode description	Apr 11, 2025
A Unified Agentic Framework for Evaluating Conditional Image Generation Read the full episode description	Apr 11, 2025
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? Read the full episode description	Apr 11, 2025
OmniSVG: A Unified Scalable Vector Graphics Generation Model Read the full episode description	Apr 10, 2025
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention Read the full episode description	Apr 10, 2025
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought Read the full episode description	Apr 10, 2025
An Empirical Study of GPT-4o Image Generation Capabilities Read the full episode description	Apr 10, 2025
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values Read the full episode description	Apr 10, 2025
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation Read the full episode description	Apr 10, 2025
SmolVLM: Redefining small and efficient multimodal models Read the full episode description	Apr 09, 2025
One-Minute Video Generation with Test-Time Training Read the full episode description	Apr 09, 2025
Rethinking Reflection in Pre-Training Read the full episode description	Apr 09, 2025
URECA: Unique Region Caption Anything Read the full episode description	Apr 09, 2025
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models Read the full episode description	Apr 09, 2025
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving Read the full episode description	Apr 08, 2025
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems Read the full episode description	Apr 05, 2025
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing Read the full episode description	Apr 05, 2025
ZClip: Adaptive Spike Mitigation for LLM Pre-Training Read the full episode description	Apr 05, 2025
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation Read the full episode description	Apr 05, 2025
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme Read the full episode description	Apr 05, 2025
WikiVideo: Article Generation from Multiple Videos Read the full episode description	Apr 05, 2025
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization Read the full episode description	Apr 04, 2025
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction Read the full episode description	Apr 04, 2025
Understanding R1-Zero-Like Training: A Critical Perspective Read the full episode description	Apr 04, 2025
Towards Physically Plausible Video Generation via VLM Planning Read the full episode description	Apr 04, 2025
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance Read the full episode description	Apr 04, 2025
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step Read the full episode description	Apr 04, 2025
START: Self-taught Reasoner with Tools Read the full episode description	Mar 08, 2025
Token-Efficient Long Video Understanding for Multimodal LLMs Read the full episode description	Mar 08, 2025
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Read the full episode description	Mar 08, 2025
EgoLife: Towards Egocentric Life Assistant Read the full episode description	Mar 08, 2025
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers Read the full episode description	Mar 07, 2025
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs Read the full episode description	Mar 07, 2025
Process-based Self-Rewarding Language Models Read the full episode description	Mar 07, 2025
Visual-RFT: Visual Reinforcement Fine-Tuning Read the full episode description	Mar 05, 2025
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs Read the full episode description	Mar 05, 2025
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models Read the full episode description	Mar 05, 2025
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking Read the full episode description	Mar 04, 2025
Chain of Draft: Thinking Faster by Writing Less Read the full episode description	Mar 04, 2025
Multi-Turn Code Generation Through Single-Step Rewards Read the full episode description	Mar 04, 2025
Self-rewarding correction for mathematical reasoning Read the full episode description	Mar 01, 2025
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning Read the full episode description	Mar 01, 2025
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts Read the full episode description	Mar 01, 2025
LongRoPE2: Near-Lossless LLM Context Window Scaling Read the full episode description	Mar 01, 2025
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving Read the full episode description	Mar 01, 2025
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale Read the full episode description	Mar 01, 2025
UniTok: A Unified Tokenizer for Visual Generation and Understanding Read the full episode description	Mar 01, 2025
NeoBERT: A Next-Generation BERT Read the full episode description	Mar 01, 2025
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance Read the full episode description	Mar 01, 2025
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think Read the full episode description	Mar 01, 2025
GHOST 2.0: generative high-fidelity one shot transfer of heads Read the full episode description	Feb 28, 2025
Kanana: Compute-efficient Bilingual Language Models Read the full episode description	Feb 28, 2025
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding Read the full episode description	Feb 28, 2025
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance Read the full episode description	Feb 28, 2025
Language Models' Factuality Depends on the Language of Inquiry Read the full episode description	Feb 28, 2025
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? Read the full episode description	Feb 28, 2025
Towards an AI co-scientist Read the full episode description	Feb 28, 2025
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems Read the full episode description	Feb 28, 2025
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation Read the full episode description	Feb 28, 2025
Rank1: Test-Time Compute for Reranking in Information Retrieval Read the full episode description	Feb 28, 2025
MLGym: A New Framework and Benchmark for Advancing AI Research Agents Read the full episode description	Feb 22, 2025
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Read the full episode description	Feb 22, 2025
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Read the full episode description	Feb 22, 2025
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? Read the full episode description	Feb 22, 2025
S*: Test Time Scaling for Code Generation Read the full episode description	Feb 22, 2025
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning Read the full episode description	Feb 22, 2025
Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning Read the full episode description	Feb 22, 2025
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models Read the full episode description	Feb 22, 2025
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information Read the full episode description	Feb 22, 2025
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning Read the full episode description	Feb 22, 2025
Qwen2.5-VL Technical Report Read the full episode description	Feb 21, 2025
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning Read the full episode description	Feb 21, 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation Read the full episode description	Feb 21, 2025
MoM: Linear Sequence Modeling with Mixture-of-Memories Read the full episode description	Feb 21, 2025
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering Read the full episode description	Feb 21, 2025
Craw4LLM: Efficient Web Crawling for LLM Pretraining Read the full episode description	Feb 21, 2025
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization Read the full episode description	Feb 21, 2025
Small Models Struggle to Learn from Strong Reasoners Read the full episode description	Feb 21, 2025
Autellix: An Efficient Serving Engine for LLM Agents as General Programs Read the full episode description	Feb 21, 2025
SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering? Read the full episode description	Feb 21, 2025
Soundwave: Less is More for Speech-Text Alignment in LLMs Read the full episode description	Feb 20, 2025
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity Read the full episode description	Feb 20, 2025
Continuous Diffusion Model for Language Modeling Read the full episode description	Feb 20, 2025
Phantom: Subject-consistent video generation via cross-modal alignment Read the full episode description	Feb 20, 2025
Rethinking Diverse Human Preference Learning through Principal Component Analysis Read the full episode description	Feb 20, 2025
Magma: A Foundation Model for Multimodal AI Agents Read the full episode description	Feb 20, 2025
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation Read the full episode description	Feb 20, 2025
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation Read the full episode description	Feb 20, 2025
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models Read the full episode description	Feb 20, 2025
You Do Not Fully Utilize Transformer's Representation Capacity Read the full episode description	Feb 20, 2025
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Read the full episode description	Feb 19, 2025
Learning Getting-Up Policies for Real-World Humanoid Robots Read the full episode description	Feb 19, 2025
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? Read the full episode description	Feb 19, 2025
CRANE: Reasoning with constrained LLM generation Read the full episode description	Feb 19, 2025
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training Read the full episode description	Feb 19, 2025
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation Read the full episode description	Feb 19, 2025
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models Read the full episode description	Feb 19, 2025
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors Read the full episode description	Feb 19, 2025
Region-Adaptive Sampling for Diffusion Transformers Read the full episode description	Feb 18, 2025
Large Language Diffusion Models Read the full episode description	Feb 18, 2025
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks Read the full episode description	Feb 18, 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model Read the full episode description	Feb 18, 2025
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models Read the full episode description	Feb 18, 2025
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Read the full episode description	Feb 18, 2025
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation Read the full episode description	Feb 18, 2025
Diverse Inference and Verification for Advanced Reasoning Read the full episode description	Feb 18, 2025
Precise Parameter Localization for Textual Generation in Diffusion Models Read the full episode description	Feb 18, 2025
DarwinLM: Evolutionary Structured Pruning of Large Language Models Read the full episode description	Feb 18, 2025
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU Read the full episode description	Feb 15, 2025
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding Read the full episode description	Feb 15, 2025
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation Read the full episode description	Feb 15, 2025
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models Read the full episode description	Feb 15, 2025
Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights Read the full episode description	Feb 15, 2025
An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging Read the full episode description	Feb 15, 2025
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents Read the full episode description	Feb 15, 2025
Exploring the Potential of Encoder-free Architectures in 3D LMMs Read the full episode description	Feb 15, 2025
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles Read the full episode description	Feb 15, 2025
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models Read the full episode description	Feb 15, 2025
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance Read the full episode description	Feb 14, 2025
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Read the full episode description	Feb 14, 2025
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models Read the full episode description	Feb 14, 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation Read the full episode description	Feb 14, 2025
Distillation Scaling Laws Read the full episode description	Feb 14, 2025
TransMLA: Multi-Head Latent Attention Is All You Need Read the full episode description	Feb 14, 2025
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation Read the full episode description	Feb 14, 2025
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid Read the full episode description	Feb 14, 2025
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning Read the full episode description	Feb 14, 2025
Expect the Unexpected: FailSafe Long Context QA for Finance Read the full episode description	Feb 13, 2025
Competitive Programming with Large Reasoning Models Read the full episode description	Feb 13, 2025
Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models Read the full episode description	Feb 13, 2025
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction Read the full episode description	Feb 13, 2025
Magic 1-For-1: Generating One Minute Video Clips within One Minute Read the full episode description	Feb 13, 2025
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! Read the full episode description	Feb 13, 2025
Teaching Language Models to Critique via Reinforcement Learning Read the full episode description	Feb 13, 2025
Scaling Pre-training to One Hundred Billion Data for Vision Language Models Read the full episode description	Feb 13, 2025
Enhance-A-Video: Better Generated Video for Free Read the full episode description	Feb 13, 2025
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling Read the full episode description	Feb 12, 2025
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators Read the full episode description	Feb 12, 2025
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning Read the full episode description	Feb 12, 2025
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning Read the full episode description	Feb 12, 2025
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging Read the full episode description	Feb 12, 2025
LM2: Large Memory Models Read the full episode description	Feb 12, 2025
Matryoshka Quantization Read the full episode description	Feb 12, 2025
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation Read the full episode description	Feb 12, 2025
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding Read the full episode description	Feb 12, 2025
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates Read the full episode description	Feb 12, 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding? Read the full episode description	Feb 11, 2025
Fast Video Generation with Sliding Tile Attention Read the full episode description	Feb 11, 2025
Goku: Flow Based Video Generative Foundation Models Read the full episode description	Feb 11, 2025
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations Read the full episode description	Feb 11, 2025
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach Read the full episode description	Feb 11, 2025
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting Read the full episode description	Feb 11, 2025
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails Read the full episode description	Feb 11, 2025
Agency Is Frame-Dependent Read the full episode description	Feb 11, 2025
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation Read the full episode description	Feb 11, 2025
Generating Symbolic World Models via Test-time Scaling of Large Language Models Read the full episode description	Feb 11, 2025
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models Read the full episode description	Feb 08, 2025
UltraIF: Advancing Instruction Following from the Wild Read the full episode description	Feb 08, 2025
Great Models Think Alike and this Undermines AI Oversight Read the full episode description	Feb 08, 2025
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 Read the full episode description	Feb 08, 2025
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Read the full episode description	Feb 08, 2025
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm Read the full episode description	Feb 08, 2025
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion Read the full episode description	Feb 08, 2025
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization Read the full episode description	Feb 08, 2025
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis Read the full episode description	Feb 08, 2025
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Read the full episode description	Feb 07, 2025
TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets Read the full episode description	Feb 07, 2025
Demystifying Long Chain-of-Thought Reasoning in LLMs Read the full episode description	Feb 07, 2025
LIMO: Less is More for Reasoning Read the full episode description	Feb 07, 2025
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking Read the full episode description	Feb 07, 2025
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer Read the full episode description	Feb 07, 2025
On Teacher Hacking in Language Model Distillation Read the full episode description	Feb 07, 2025
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods Read the full episode description	Feb 07, 2025
Jailbreaking with Universal Multi-Prompts Read the full episode description	Feb 07, 2025
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models Read the full episode description	Feb 06, 2025
Inverse Bridge Matching Distillation Read the full episode description	Feb 06, 2025
ACECODER: Acing Coder RL via Automated Test-Case Synthesis Read the full episode description	Feb 06, 2025
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search Read the full episode description	Feb 06, 2025
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search Read the full episode description	Feb 06, 2025
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? Read the full episode description	Feb 06, 2025
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation Read the full episode description	Feb 06, 2025
The Differences Between Direct Alignment Algorithms are a Blur Read the full episode description	Feb 05, 2025
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models Read the full episode description	Feb 05, 2025
Process Reinforcement through Implicit Rewards Read the full episode description	Feb 05, 2025
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding Read the full episode description	Feb 05, 2025
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model Read the full episode description	Feb 05, 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge Read the full episode description	Feb 05, 2025
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models Read the full episode description	Feb 05, 2025
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models Read the full episode description	Feb 05, 2025
AIN: The Arabic INclusive Large Multimodal Model Read the full episode description	Feb 05, 2025
s1: Simple test-time scaling Read the full episode description	Feb 04, 2025
Reward-Guided Speculative Decoding for Efficient LLM Reasoning Read the full episode description	Feb 04, 2025
Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models Read the full episode description	Feb 04, 2025
PixelWorld: Towards Perceiving Everything as Pixels Read the full episode description	Feb 04, 2025
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning Read the full episode description	Feb 04, 2025
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming Read the full episode description	Feb 04, 2025
Scalable-Softmax Is Superior for Attention Read the full episode description	Feb 04, 2025
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training Read the full episode description	Feb 04, 2025
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders Read the full episode description	Feb 04, 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards Read the full episode description	Feb 01, 2025
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs Read the full episode description	Feb 01, 2025
Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch Read the full episode description	Feb 01, 2025
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding Read the full episode description	Feb 01, 2025
Large Language Models Think Too Fast To Explore Effectively Read the full episode description	Feb 01, 2025
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training Read the full episode description	Feb 01, 2025
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding Read the full episode description	Feb 01, 2025
o3-mini vs DeepSeek-R1: Which One is Safer? Read the full episode description	Feb 01, 2025
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation Read the full episode description	Feb 01, 2025
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate Read the full episode description	Jan 31, 2025
Atla Selene Mini: A General Purpose Evaluation Model Read the full episode description	Jan 31, 2025
Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts Read the full episode description	Jan 31, 2025
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation Read the full episode description	Jan 31, 2025
Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks Read the full episode description	Jan 31, 2025
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation Read the full episode description	Jan 31, 2025
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text Read the full episode description	Jan 31, 2025
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training Read the full episode description	Jan 30, 2025
Optimizing Large Language Model Training Using FP4 Quantization Read the full episode description	Jan 30, 2025
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation Read the full episode description	Jan 30, 2025
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling Read the full episode description	Jan 30, 2025
Open Problems in Mechanistic Interpretability Read the full episode description	Jan 30, 2025
Low-Rank Adapters Meet Neural Architecture Search for LLM Compression Read the full episode description	Jan 30, 2025
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding Read the full episode description	Jan 30, 2025
Histoires Morales: A French Dataset for Assessing Moral Alignment Read the full episode description	Jan 30, 2025
Qwen2.5-1M Technical Report Read the full episode description	Jan 29, 2025
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer Read the full episode description	Jan 29, 2025
Towards General-Purpose Model-Free Reinforcement Learning Read the full episode description	Jan 29, 2025
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Read the full episode description	Jan 29, 2025
iFormer: Integrating ConvNet and Transformer for Mobile Application Read the full episode description	Jan 29, 2025
Are Vision Language Models Texture or Shape Biased and Can We Steer Them? Read the full episode description	Jan 29, 2025
CodeMonkeys: Scaling Test-Time Compute for Software Engineering Read the full episode description	Jan 29, 2025
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models Read the full episode description	Jan 29, 2025
Humanity's Last Exam Read the full episode description	Jan 28, 2025
Chain-of-Retrieval Augmented Generation Read the full episode description	Jan 28, 2025
Redundancy Principles for MLLMs Benchmarks Read the full episode description	Jan 28, 2025
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques Read the full episode description	Jan 28, 2025
RL + Transformer = A General-Purpose Problem Solver Read the full episode description	Jan 28, 2025
Relightable Full-Body Gaussian Codec Avatars Read the full episode description	Jan 28, 2025
Question Answering on Patient Medical Records with Private Fine-Tuned LLMs Read the full episode description	Jan 28, 2025
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing Read the full episode description	Jan 28, 2025
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation Read the full episode description	Jan 28, 2025
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning Read the full episode description	Jan 28, 2025
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding Read the full episode description	Jan 25, 2025
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models Read the full episode description	Jan 25, 2025
Improving Video Generation with Human Feedback Read the full episode description	Jan 25, 2025
Temporal Preference Optimization for Long-Form Video Understanding Read the full episode description	Jan 25, 2025
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Read the full episode description	Jan 25, 2025
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Read the full episode description	Jan 25, 2025
DiffuEraser: A Diffusion Model for Video Inpainting Read the full episode description	Jan 25, 2025
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models Read the full episode description	Jan 25, 2025
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback Read the full episode description	Jan 25, 2025
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt Read the full episode description	Jan 25, 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Read the full episode description	Jan 24, 2025
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Read the full episode description	Jan 24, 2025
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Read the full episode description	Jan 24, 2025
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback Read the full episode description	Jan 24, 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs Read the full episode description	Jan 24, 2025
Autonomy-of-Experts Models Read the full episode description	Jan 24, 2025
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning Read the full episode description	Jan 24, 2025
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament Read the full episode description	Jan 24, 2025
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems Read the full episode description	Jan 24, 2025
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass Read the full episode description	Jan 24, 2025
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training Read the full episode description	Jan 23, 2025
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Read the full episode description	Jan 23, 2025
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models Read the full episode description	Jan 23, 2025
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space Read the full episode description	Jan 23, 2025
UI-TARS: Pioneering Automated GUI Interaction with Native Agents Read the full episode description	Jan 23, 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Read the full episode description	Jan 23, 2025
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Read the full episode description	Jan 23, 2025
Reasoning Language Models: A Blueprint Read the full episode description	Jan 23, 2025
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation Read the full episode description	Jan 23, 2025
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments Read the full episode description	Jan 23, 2025
GameFactory: Creating New Games with Generative Interactive Videos Read the full episode description	Jan 22, 2025
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos Read the full episode description	Jan 22, 2025
SEAL: Entangled White-box Watermarks on Low-Rank Adaptation Read the full episode description	Jan 22, 2025
The Lessons of Developing Process Reward Models in Mathematical Reasoning Read the full episode description	Jan 15, 2025
Tensor Product Attention Is All You Need Read the full episode description	Jan 15, 2025
Transformer²: Self-adaptive LLMs Read the full episode description	Jan 15, 2025
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction Read the full episode description	Jan 15, 2025
VideoAuteur: Towards Long Narrative Video Generation Read the full episode description	Jan 15, 2025
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning Read the full episode description	Jan 15, 2025
WebWalker: Benchmarking LLMs in Web Traversal Read the full episode description	Jan 15, 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training Read the full episode description	Jan 15, 2025
UnCommon Objects in 3D Read the full episode description	Jan 15, 2025
VideoRAG: Retrieval-Augmented Generation over Video Corpus Read the full episode description	Jan 14, 2025
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Read the full episode description	Jan 14, 2025
Enabling Scalable Oversight via Self-Evolving Critic Read the full episode description	Jan 14, 2025
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models Read the full episode description	Jan 14, 2025
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding Read the full episode description	Jan 14, 2025
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning Read the full episode description	Jan 14, 2025
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains Read the full episode description	Jan 14, 2025
The GAN is dead; long live the GAN! A Modern GAN Baseline Read the full episode description	Jan 11, 2025
An Empirical Study of Autoregressive Pre-training from Videos Read the full episode description	Jan 11, 2025
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Read the full episode description	Jan 11, 2025
Entropy-Guided Attention for Private LLMs Read the full episode description	Jan 11, 2025
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis Read the full episode description	Jan 11, 2025
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model Read the full episode description	Jan 11, 2025
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution Read the full episode description	Jan 11, 2025
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models Read the full episode description	Jan 11, 2025
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Read the full episode description	Jan 10, 2025
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought Read the full episode description	Jan 10, 2025
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics Read the full episode description	Jan 10, 2025
Agent Laboratory: Using LLM Agents as Research Assistants Read the full episode description	Jan 10, 2025
LLM4SR: A Survey on Large Language Models for Scientific Research Read the full episode description	Jan 10, 2025
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection Read the full episode description	Jan 10, 2025
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images Read the full episode description	Jan 10, 2025
GeAR: Generation Augmented Retrieval Read the full episode description	Jan 10, 2025
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation Read the full episode description	Jan 10, 2025
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization Read the full episode description	Jan 10, 2025
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models Read the full episode description	Jan 09, 2025
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Read the full episode description	Jan 09, 2025
Cosmos World Foundation Model Platform for Physical AI Read the full episode description	Jan 09, 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Read the full episode description	Jan 09, 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Read the full episode description	Jan 09, 2025
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control Read the full episode description	Jan 09, 2025
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis Read the full episode description	Jan 09, 2025
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides Read the full episode description	Jan 09, 2025
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model Read the full episode description	Jan 09, 2025
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting Read the full episode description	Jan 09, 2025
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution Read the full episode description	Jan 08, 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction Read the full episode description	Jan 08, 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning Read the full episode description	Jan 08, 2025
Personalized Graph-Based Retrieval for Large Language Models Read the full episode description	Jan 08, 2025
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring Read the full episode description	Jan 08, 2025
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking Read the full episode description	Jan 08, 2025
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation Read the full episode description	Jan 08, 2025
TransPixar: Advancing Text-to-Video Generation with Transparency Read the full episode description	Jan 08, 2025
AutoPresent: Designing Structured Visuals from Scratch Read the full episode description	Jan 08, 2025
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation Read the full episode description	Jan 07, 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Read the full episode description	Jan 07, 2025
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation Read the full episode description	Jan 07, 2025
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM Read the full episode description	Jan 07, 2025
SDPO: Segment-Level Direct Preference Optimization for Social Agents Read the full episode description	Jan 07, 2025
Graph Generative Pre-trained Transformer Read the full episode description	Jan 07, 2025
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models Read the full episode description	Jan 07, 2025
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery Read the full episode description	Jan 07, 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Read the full episode description	Jan 04, 2025
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings Read the full episode description	Jan 04, 2025
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control Read the full episode description	Jan 04, 2025
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models Read the full episode description	Jan 04, 2025
ProgCo: Program Helps Self-Correction of Large Language Models Read the full episode description	Jan 04, 2025
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models Read the full episode description	Jan 04, 2025
A3: Android Agent Arena for Mobile GUI Agents Read the full episode description	Jan 04, 2025
MLLM-as-a-Judge for Image Safety without Human Labeling Read the full episode description	Jan 04, 2025
Dynamic Scaling of Unit Tests for Code Reward Modeling Read the full episode description	Jan 04, 2025
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Read the full episode description	Jan 03, 2025
Xmodel-2 Technical Report Read the full episode description	Jan 03, 2025
Are Vision-Language Models Truly Understanding Multi-vision Sensor? Read the full episode description	Jan 03, 2025
HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving Read the full episode description	Jan 03, 2025
VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control Read the full episode description	Jan 03, 2025
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs Read the full episode description	Jan 02, 2025
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System Read the full episode description	Jan 02, 2025
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization Read the full episode description	Jan 01, 2025
On the Compositional Generalization of Multimodal LLMs for Medical Imaging Read the full episode description	Jan 01, 2025
Bringing Objects to Life: 4D generation from 3D objects Read the full episode description	Jan 01, 2025
Efficiently Serving LLM Reasoning Programs with Certaindex Read the full episode description	Jan 01, 2025
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization Read the full episode description	Jan 01, 2025
Edicho: Consistent Image Editing in the Wild Read the full episode description	Jan 01, 2025
Facilitating large language model Russian adaptation with Learned Embedding Propagation Read the full episode description	Jan 01, 2025
Training Software Engineering Agents and Verifiers with SWE-Gym Read the full episode description	Jan 01, 2025
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation	Jan 01, 2025
Slow Perception: Let's Perceive Geometric Figures Step-by-step	Jan 01, 2025
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs	Dec 31, 2024
1.58-bit FLUX	Dec 31, 2024
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey	Dec 31, 2024
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models	Dec 31, 2024
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment	Dec 31, 2024
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition	Dec 31, 2024
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models	Dec 31, 2024
The Superposition of Diffusion Models Using the Itô Density Estimator	Dec 31, 2024
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging	Dec 31, 2024
CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era	Dec 31, 2024
YuLan-Mini: An Open Data-efficient Language Model	Dec 28, 2024
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression	Dec 28, 2024
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks	Dec 28, 2024
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation	Dec 28, 2024
DepthLab: From Partial to Complete	Dec 26, 2024
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization	Dec 26, 2024
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation	Dec 26, 2024
In Case You Missed It: ARC 'Challenge' Is Not That Challenging	Dec 26, 2024
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing	Dec 26, 2024
SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval	Dec 26, 2024
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models	Dec 26, 2024
MotiF: Making Text Count in Image Animation with Motion Focal Loss	Dec 26, 2024
Bridging the Data Provenance Gap Across Text, Speech and Video	Dec 26, 2024
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response	Dec 25, 2024
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners	Dec 25, 2024
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching	Dec 25, 2024
Diving into Self-Evolving Training for Multimodal Reasoning	Dec 25, 2024
Deliberation in Latent Space via Differentiable Cache Augmentation	Dec 25, 2024
Large Motion Video Autoencoding with Cross-modal Video VAE	Dec 25, 2024
OpenAI o1 System Card	Dec 25, 2024
Revisiting In-Context Learning with Long Context Language Models	Dec 25, 2024
Outcome-Refining Process Supervision for Code Generation	Dec 25, 2024
LearnLM: Improving Gemini for Learning	Dec 25, 2024
Parallelized Autoregressive Visual Generation	Dec 24, 2024
Offline Reinforcement Learning for LLM Multi-Step Reasoning	Dec 24, 2024
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation	Dec 24, 2024
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up	Dec 24, 2024
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis	Dec 24, 2024
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage	Dec 24, 2024
Sequence Matters: Harnessing Video Models in 3D Super-Resolution	Dec 24, 2024
TRecViT: A Recurrent Video Transformer	Dec 24, 2024
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design	Dec 24, 2024
Multi-LLM Text Summarization	Dec 24, 2024
Qwen2.5 Technical Report	Dec 21, 2024
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval	Dec 21, 2024
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks	Dec 21, 2024
How to Synthesize Text Data without Model Collapse?	Dec 21, 2024
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution	Dec 21, 2024
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion	Dec 21, 2024
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis	Dec 21, 2024
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation	Dec 21, 2024
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling	Dec 21, 2024
No More Adam: Learning Rate Scaling at Initialization is All You Need	Dec 20, 2024
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference	Dec 20, 2024
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks	Dec 20, 2024
AniDoc: Animation Creation Made Easier	Dec 20, 2024
FashionComposer: Compositional Fashion Image Generation	Dec 20, 2024
GUI Agents: A Survey	Dec 20, 2024
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning	Dec 20, 2024
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation	Dec 20, 2024
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces	Dec 20, 2024
Are Your LLMs Capable of Stable Reasoning?	Dec 19, 2024
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models	Dec 19, 2024
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain	Dec 19, 2024
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations	Dec 19, 2024
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers	Dec 19, 2024
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration	Dec 19, 2024
Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents	Dec 19, 2024
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation	Dec 19, 2024
SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner	Dec 19, 2024
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion	Dec 19, 2024
Byte Latent Transformer: Patches Scale Better Than Tokens	Dec 18, 2024
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation	Dec 18, 2024
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models	Dec 18, 2024
BrushEdit: All-In-One Image Inpainting and Editing	Dec 18, 2024
ColorFlow: Retrieval-Augmented Image Sequence Colorization	Dec 18, 2024
Smaller Language Models Are Better Instruction Evolvers	Dec 18, 2024
Causal Diffusion Transformers for Generative Modeling	Dec 18, 2024
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models	Dec 18, 2024
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations	Dec 18, 2024
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs	Dec 18, 2024
Apollo: An Exploration of Video Understanding in Large Multimodal Models	Dec 17, 2024
GenEx: Generating an Explorable World	Dec 17, 2024
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding	Dec 17, 2024
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities	Dec 17, 2024
Large Action Models: From Inception to Implementation	Dec 17, 2024
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption	Dec 17, 2024
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion	Dec 17, 2024
ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation	Dec 17, 2024
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing	Dec 17, 2024
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers	Dec 17, 2024
Phi-4 Technical Report	Dec 14, 2024
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions	Dec 14, 2024
Multimodal Latent Language Modeling with Next-Token Diffusion	Dec 14, 2024
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM	Dec 14, 2024
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials	Dec 14, 2024
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training	Dec 14, 2024
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion	Dec 14, 2024
JuStRank: Benchmarking LLM Judges for System Ranking	Dec 14, 2024
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints	Dec 13, 2024
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations	Dec 13, 2024
POINTS1.5: Building a Vision-Language Model towards Real World Applications	Dec 13, 2024
Learning Flow Fields in Attention for Controllable Person Image Generation	Dec 13, 2024
StyleMaster: Stylize Your Video with Artistic Generation and Translation	Dec 13, 2024
StreamChat: Chatting with Streaming Video	Dec 13, 2024
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark	Dec 13, 2024
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction	Dec 13, 2024
The BrowserGym Ecosystem for Web Agent Research	Dec 13, 2024
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation	Dec 12, 2024
Hidden in the Noise: Two-Stage Robust Watermarking for Images	Dec 12, 2024
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models	Dec 12, 2024
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics	Dec 12, 2024
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation	Dec 12, 2024
Mobile Video Diffusion	Dec 12, 2024
Granite Guardian	Dec 12, 2024
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation	Dec 11, 2024
ProcessBench: Identifying Process Errors in Mathematical Reasoning	Dec 11, 2024
Training Large Language Models to Reason in a Continuous Latent Space	Dec 11, 2024
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation	Dec 11, 2024
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation	Dec 11, 2024
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models	Dec 11, 2024
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale	Dec 11, 2024
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations	Dec 11, 2024
Robust Multi-bit Text Watermark with LLM-based Paraphrasers	Dec 11, 2024
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views	Dec 11, 2024
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment	Dec 10, 2024
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases	Dec 10, 2024
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale	Dec 10, 2024
APOLLO: SGD-like Memory, AdamW-level Performance	Dec 10, 2024
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion	Dec 10, 2024
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation	Dec 10, 2024
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration	Dec 10, 2024
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction	Dec 10, 2024
CompCap: Improving Multimodal Large Language Models with Composite Captions	Dec 10, 2024
VisionZip: Longer is Better but Not Necessary in Vision Language Models	Dec 08, 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion	Dec 08, 2024
NVILA: Efficient Frontier Visual Language Models	Dec 08, 2024
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction	Dec 08, 2024
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection	Dec 08, 2024
Evaluating Language Models as Synthetic Data Generators	Dec 08, 2024
A Noise is Worth Diffusion Guidance	Dec 08, 2024
Structured 3D Latents for Scalable and Versatile 3D Generation	Dec 08, 2024
Negative Token Merging: Image-based Adversarial Feature Guidance	Dec 08, 2024
MV-Adapter: Multi-view Consistent Image Generation Made Easy	Dec 08, 2024
ShowUI: One Vision-Language-Action Model for GUI Visual Agent	Nov 28, 2024
Star Attention: Efficient LLM Inference over Long Sequences	Nov 28, 2024
Pathways on the Image Manifold: Image Editing via Video Generation	Nov 28, 2024
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs	Nov 28, 2024
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration	Nov 28, 2024
SketchAgent: Language-Driven Sequential Sketch Generation	Nov 28, 2024
TEXGen: a Generative Diffusion Model for Mesh Textures	Nov 28, 2024
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models	Nov 28, 2024
Learning 3D Representations from Procedural 3D Programs	Nov 28, 2024
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE	Nov 28, 2024
Material Anything: Generating Materials for Any 3D Object via Diffusion	Nov 27, 2024
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator	Nov 27, 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge	Nov 27, 2024
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?	Nov 27, 2024
MH-MoE: Multi-Head Mixture-of-Experts	Nov 27, 2024
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI	Nov 27, 2024
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation	Nov 27, 2024
Knowledge Transfer Across Modalities with Natural Language Supervision	Nov 27, 2024
One Diffusion to Generate Them All	Nov 27, 2024
VisualLens: Personalization through Visual History	Nov 27, 2024
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training	Nov 26, 2024
Style-Friendly SNR Sampler for Style-Driven Generation	Nov 26, 2024
OminiControl: Minimal and Universal Control for Diffusion Transformer	Nov 26, 2024
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection	Nov 26, 2024
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games	Nov 26, 2024
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models	Nov 26, 2024
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection	Nov 26, 2024
Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction	Nov 26, 2024
MyTimeMachine: Personalized Facial Age Transformation	Nov 26, 2024
Novel View Extrapolation with Video Diffusion Priors	Nov 26, 2024
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization	Nov 23, 2024
Multimodal Autoregressive Pre-training of Large Vision Encoders	Nov 23, 2024
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions	Nov 23, 2024
Hymba: A Hybrid-head Architecture for Small Language Models	Nov 23, 2024
Natural Language Reinforcement Learning	Nov 23, 2024
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs	Nov 23, 2024
Ultra-Sparse Memory Network	Nov 23, 2024
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models	Nov 23, 2024
Stable Flow: Vital Layers for Training-Free Image Editing	Nov 23, 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models	Nov 23, 2024
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration	Nov 22, 2024
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models	Nov 22, 2024
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation	Nov 22, 2024
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory	Nov 22, 2024
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents	Nov 22, 2024
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training	Nov 22, 2024
Stylecodes: Encoding Stylistic Information For Image Generation	Nov 22, 2024
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models	Nov 22, 2024
Loss-to-Loss Prediction: Scaling Laws for All Datasets	Nov 22, 2024
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation	Nov 22, 2024
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization	Nov 21, 2024
Continuous Speculative Decoding for Autoregressive Image Generation	Nov 21, 2024
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements	Nov 21, 2024
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations	Nov 21, 2024
Soft Robotic Dynamic In-Hand Pen Spinning	Nov 21, 2024
Building Trust: Foundations of Security, Safety and Transparency in AI	Nov 21, 2024
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning	Nov 21, 2024
Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages	Nov 21, 2024
Generative World Explorer	Nov 20, 2024
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices	Nov 20, 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering	Nov 20, 2024
AnimateAnything: Consistent and Controllable Animation for Video Generation	Nov 20, 2024
Top-$nσ$: Not All Logits Are You Need	Nov 20, 2024
Drowning in Documents: Consequences of Scaling Reranker Inference	Nov 20, 2024
SlimLM: An Efficient Small Language Model for On-Device Document Assistance	Nov 20, 2024
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts	Nov 20, 2024
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers	Nov 20, 2024
LLäMmlein: Compact and Competitive German-Only Language Models from Scratch	Nov 20, 2024
LLaVA-o1: Let Vision Language Models Reason Step-by-Step	Nov 19, 2024
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation	Nov 19, 2024
Xmodel-1.5: An 1B-scale Multilingual LLM	Nov 19, 2024
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models	Nov 16, 2024
MagicQuill: An Intelligent Interactive Image Editing System	Nov 16, 2024
Cut Your Losses in Large-Vocabulary Language Models	Nov 16, 2024
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?	Nov 16, 2024
Sharingan: Extract User Action Sequence from Desktop Recordings	Nov 16, 2024
Hermes: A Large Language Model Framework on the Journey to Autonomous Networks	Nov 16, 2024
Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples	Nov 16, 2024
Direct Preference Optimization Using Sparse Feature-Level Constraints	Nov 15, 2024
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection	Nov 15, 2024
Can sparse autoencoders be used to decompose and interpret steering vectors?	Nov 15, 2024
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation	Nov 15, 2024
SAMPart3D: Segment Any Part in 3D Objects	Nov 14, 2024
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation	Nov 14, 2024
Stronger Models are NOT Stronger Teachers for Instruction Tuning	Nov 14, 2024
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions	Nov 14, 2024
Scaling Properties of Diffusion Models for Perceptual Tasks	Nov 14, 2024
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings	Nov 14, 2024
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models	Nov 13, 2024
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision	Nov 13, 2024
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models	Nov 13, 2024
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework	Nov 13, 2024
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models	Nov 13, 2024
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models	Nov 13, 2024
Watermark Anything with Localized Messages	Nov 13, 2024
Autoregressive Models in Vision: A Survey	Nov 13, 2024
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation	Nov 12, 2024
Balancing Pipeline Parallelism with Vocabulary Parallelism	Nov 12, 2024
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images	Nov 12, 2024
DELIFT: Data Efficient Language model Instruction Fine Tuning	Nov 12, 2024
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study	Nov 12, 2024
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models	Nov 12, 2024
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities	Nov 12, 2024
Improving the detection of technical debt in Java source code with an enriched dataset	Nov 12, 2024
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models	Nov 09, 2024
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning	Nov 09, 2024
BitNet a4.8: 4-bit Activations for 1-bit LLMs	Nov 09, 2024
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion	Nov 09, 2024
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models	Nov 09, 2024
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation	Nov 09, 2024
Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model	Nov 09, 2024
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?	Nov 09, 2024
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation	Nov 09, 2024
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos	Nov 09, 2024
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination	Nov 08, 2024
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level	Nov 08, 2024
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models	Nov 08, 2024
Self-Consistency Preference Optimization	Nov 08, 2024
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond	Nov 08, 2024
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems	Nov 07, 2024
LLaMo: Large Language Model-based Molecular Graph Assistant	Nov 07, 2024
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution	Nov 07, 2024
Controlling Language and Diffusion Models by Transporting Activations	Nov 07, 2024
Sample-Efficient Alignment for LLMs	Nov 07, 2024
DreamPolish: Domain Score Distillation With Progressive Geometry Generation	Nov 07, 2024
Adaptive Length Image Tokenization via Recurrent Allocation	Nov 07, 2024
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details	Nov 07, 2024
Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge	Nov 07, 2024
Inference Optimal VLMs Need Only One Visual Token but Larger Models	Nov 07, 2024
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	Nov 06, 2024
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization	Nov 06, 2024
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning	Nov 06, 2024
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D	Nov 06, 2024
Training-free Regional Prompting for Diffusion Transformers	Nov 06, 2024
How Far is Video Generation from World Model: A Physical Law Perspective	Nov 06, 2024
Survey of Cultural Awareness in Language Models: Text and Beyond	Nov 06, 2024
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent	Nov 06, 2024
GenXD: Generating Any 3D and 4D Scenes	Nov 06, 2024
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models	Nov 06, 2024
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	Nov 05, 2024
Personalization of Large Language Models: A Survey	Nov 05, 2024
Constant Acceleration Flow	Nov 05, 2024
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models	Nov 05, 2024
Randomized Autoregressive Visual Generation	Nov 05, 2024
Survey of User Interface Design and Interaction Techniques in Generative AI Applications	Nov 05, 2024
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation	Nov 05, 2024
In-Context LoRA for Diffusion Transformers	Nov 05, 2024
Physics in Next-token Prediction	Nov 05, 2024
CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes	Nov 05, 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders	Nov 03, 2024
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective	Nov 03, 2024
A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents	Nov 03, 2024
Language Models can Self-Lengthen to Generate Long Texts Read the full episode description	Nov 03, 2024
Constraint Back-translation Improves Complex Instruction Following of Large Language Models Read the full episode description	Nov 03, 2024
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments Read the full episode description	Nov 03, 2024
SelfCodeAlign: Self-Alignment for Code Generation Read the full episode description	Nov 03, 2024
Learning Video Representations without Natural Videos Read the full episode description	Nov 03, 2024
AAAR-1.0: Assessing AI's Potential to Assist Research Read the full episode description	Nov 03, 2024
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays Read the full episode description	Nov 03, 2024
