Daily Paper Cast

By Jingwen Liang, Gengyu Wang

Category: Science

Subscribers: 0
Reviews: 0
Episodes: 1610

Description

We update every weekday to discuss the highest-voted papers from Hugging Face Daily Papers (https://huggingface.co/papers). Both the podcast scripts and audio are generated by AI. Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, LLM ML, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover image by Kawen Kuang, https://kawen.art

Episodes (each title is followed by its release date)
Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Jan 17, 2026
STEP3-VL-10B Technical Report
Jan 17, 2026
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Jan 17, 2026
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Jan 17, 2026
Controlled Self-Evolution for Algorithmic Code Optimization
Jan 16, 2026
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Jan 16, 2026
MAXS: Meta-Adaptive Exploration with LLM Agents
Jan 16, 2026
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
Jan 16, 2026
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Jan 16, 2026
SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
Jan 16, 2026
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG
Jan 16, 2026
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Jan 16, 2026
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences
Jan 15, 2026
Solar Open Technical Report
Jan 15, 2026
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
Jan 15, 2026
User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale
Jan 15, 2026
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
Jan 15, 2026
ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking
Jan 15, 2026
MemoBrain: Executive Memory as an Agentic Brain for Reasoning
Jan 15, 2026
Motion Attribution for Video Generation
Jan 15, 2026
3AM: Segment Anything with Geometric Consistency in Videos
Jan 15, 2026
BabyVision: Visual Reasoning Beyond Language
Jan 14, 2026
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
Jan 14, 2026
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Jan 14, 2026
X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
Jan 14, 2026
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
Jan 14, 2026
Lost in the Noise: How Reasoning Models Fail with Contextual Distractors
Jan 14, 2026
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
Jan 14, 2026
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
Jan 13, 2026
MMFormalizer: Multimodal Autoformalization in the Wild
Jan 13, 2026
CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature
Jan 13, 2026
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Jan 13, 2026
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
Jan 13, 2026
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Jan 13, 2026
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Jan 13, 2026
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Jan 10, 2026
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Jan 10, 2026
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Jan 10, 2026
Token-Level LLM Collaboration via FusionRoute
Jan 10, 2026
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
Jan 09, 2026
Evolving Programmatic Skill Networks
Jan 09, 2026
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Jan 09, 2026
Benchmark^2: Systematic Evaluation of LLM Benchmarks
Jan 09, 2026
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Jan 08, 2026
LTX-2: Efficient Joint Audio-Visual Foundation Model
Jan 08, 2026
MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
Jan 08, 2026
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Jan 08, 2026
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Jan 08, 2026
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Jan 07, 2026
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Jan 07, 2026
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
Jan 07, 2026
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Jan 07, 2026
GARDO: Reinforcing Diffusion Models without Reward Hacking
Jan 07, 2026
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
Jan 07, 2026
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Jan 07, 2026
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
Jan 06, 2026
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Jan 06, 2026
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Jan 06, 2026
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Jan 06, 2026
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Jan 06, 2026
Deep Delta Learning
Jan 06, 2026
AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Jan 06, 2026
Nested Learning: The Illusion of Deep Learning Architectures
Jan 06, 2026
Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
Jan 03, 2026
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Jan 03, 2026
mHC: Manifold-Constrained Hyper-Connections
Jan 02, 2026
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
Jan 02, 2026
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
Jan 02, 2026
GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Jan 02, 2026
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Dec 31, 2025
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Dec 31, 2025
Yume-1.5: A Text-Controlled Interactive World Generation Model
Dec 31, 2025
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
Dec 31, 2025
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Dec 31, 2025
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Dec 31, 2025
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
Dec 31, 2025
SpotEdit: Selective Region Editing in Diffusion Transformers
Dec 31, 2025
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Dec 31, 2025
InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
Dec 30, 2025
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
Dec 30, 2025
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Dec 30, 2025
Latent Implicit Visual Reasoning
Dec 27, 2025
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Dec 27, 2025
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Dec 26, 2025
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Dec 26, 2025
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Dec 26, 2025
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Dec 26, 2025
SemanticGen: Video Generation in Semantic Space
Dec 25, 2025
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Dec 25, 2025
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Dec 25, 2025
SpatialTree: How Spatial Abilities Branch Out in MLLMs
Dec 25, 2025
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Dec 24, 2025
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Dec 24, 2025
Region-Constraint In-Context Generation for Instructional Video Editing
Dec 24, 2025
QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
Dec 24, 2025
Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
Dec 24, 2025
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
Dec 24, 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Dec 23, 2025
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Dec 23, 2025
When Reasoning Meets Its Laws
Dec 23, 2025
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
Dec 23, 2025
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Dec 23, 2025
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Dec 23, 2025
Are We on the Right Way to Assessing LLM-as-a-Judge?
Dec 23, 2025
Kling-Omni Technical Report
Dec 20, 2025
Adaptation of Agentic AI
Dec 20, 2025
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Dec 20, 2025
Next-Embedding Prediction Makes Strong Vision Learners
Dec 20, 2025
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Dec 20, 2025
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Dec 20, 2025
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Dec 20, 2025
Generative Refocusing: Flexible Defocus Control from a Single Image
Dec 20, 2025
DeContext as Defense: Safe Image Editing in Diffusion Transformers
Dec 20, 2025
Step-GUI Technical Report
Dec 19, 2025
DEER: Draft with Diffusion, Verify with Autoregressive Models
Dec 19, 2025
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Dec 19, 2025
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Dec 19, 2025
Puzzle Curriculum GRPO for Vision-Centric Reasoning
Dec 19, 2025
MMGR: Multi-Modal Generative Reasoning
Dec 18, 2025
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Dec 18, 2025
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Dec 18, 2025
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Dec 18, 2025
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Dec 18, 2025
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Dec 18, 2025
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Dec 17, 2025
Towards Scalable Pre-training of Visual Tokenizers for Generation
Dec 17, 2025
Memory in the Age of AI Agents
Dec 17, 2025
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Dec 17, 2025
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Dec 17, 2025
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Dec 17, 2025
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Dec 17, 2025
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Dec 17, 2025
KlingAvatar 2.0 Technical Report
Dec 17, 2025
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment
Dec 17, 2025
EgoX: Egocentric Video Generation from a Single Exocentric Video
Dec 16, 2025
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
Dec 16, 2025
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Dec 16, 2025
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Dec 16, 2025
T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
Dec 13, 2025
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Dec 13, 2025
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Dec 13, 2025
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Dec 13, 2025
Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
Dec 13, 2025
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
Dec 12, 2025
BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
Dec 12, 2025
OmniPSD: Layered PSD Generation with Diffusion Transformer
Dec 12, 2025
Composing Concepts from Images and Videos via Concept-prompt Binding
Dec 12, 2025
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Dec 11, 2025
Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
Dec 11, 2025
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Dec 11, 2025
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Dec 11, 2025
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
Dec 10, 2025
Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Dec 10, 2025
Unified Video Editing with Temporal Reasoner
Dec 10, 2025
Voxify3D: Pixel Art Meets Volumetric Rendering
Dec 10, 2025
Scaling Zero-Shot Reference-to-Video Generation
Dec 10, 2025
DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
Dec 10, 2025
TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows
Dec 09, 2025
EditThinker: Unlocking Iterative Reasoning for Any Image Editor
Dec 09, 2025
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
Dec 09, 2025
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
Dec 09, 2025
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Dec 06, 2025
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Dec 06, 2025
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
Dec 06, 2025
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Dec 06, 2025
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Dec 06, 2025
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Dec 06, 2025
PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing
Dec 06, 2025
Qwen3-VL Technical Report
Dec 05, 2025
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Dec 05, 2025
PretrainZero: Reinforcement Active Pretraining
Dec 05, 2025
ViDiC: Video Difference Captioning
Dec 05, 2025
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Dec 04, 2025
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Dec 04, 2025
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Dec 04, 2025
MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory
Dec 04, 2025
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Dec 04, 2025
DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
Dec 04, 2025
Guided Self-Evolving LLMs with Minimal Human Supervision
Dec 04, 2025
SimScale: Learning to Drive via Real-World Simulation at Scale
Dec 04, 2025
InnoGym: Benchmarking the Innovation Potential of AI Agents
Dec 04, 2025
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Dec 03, 2025
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Dec 03, 2025
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
Dec 03, 2025
How Far Are We from Genuinely Useful Deep Research Agents?
Dec 03, 2025
What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
Dec 03, 2025
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Dec 03, 2025
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Dec 03, 2025
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Dec 03, 2025
LFM2 Technical Report
Dec 03, 2025
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Dec 02, 2025
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Dec 02, 2025
Vision Bridge Transformer at Scale
Dec 02, 2025
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Dec 02, 2025
Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Dec 02, 2025
Multimodal Evaluation of Russian-language Architectures
Nov 28, 2025
Latent Collaboration in Multi-Agent Systems
Nov 28, 2025
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Nov 28, 2025
GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
Nov 27, 2025
MedSAM3: Delving into Segment Anything with Medical Concepts
Nov 27, 2025
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
Nov 27, 2025
SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
Nov 27, 2025
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Nov 27, 2025
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Nov 27, 2025
GigaWorld-0: World Models as Data Engine to Empower Embodied AI
Nov 27, 2025
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Nov 27, 2025
Soft Adaptive Policy Optimization
Nov 27, 2025
General Agentic Memory Via Deep Research
Nov 26, 2025
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Nov 26, 2025
Computer-Use Agents as Judges for Generative User Interface
Nov 26, 2025
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Nov 26, 2025
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Nov 26, 2025
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Nov 26, 2025
In-Video Instructions: Visual Signals as Generative Control
Nov 26, 2025
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Nov 25, 2025
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Nov 25, 2025
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Nov 25, 2025
SAM 3: Segment Anything with Concepts
Nov 25, 2025
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Nov 21, 2025
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Nov 21, 2025
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
Nov 21, 2025
VisPlay: Self-Evolving Vision-Language Models from Images
Nov 21, 2025
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Nov 21, 2025
VIDEOP2R: Video Understanding from Perception to Reasoning
Nov 20, 2025
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
Nov 20, 2025
AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Nov 20, 2025
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Nov 20, 2025
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Nov 20, 2025
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Nov 20, 2025
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Nov 20, 2025
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Nov 19, 2025
P1: Mastering Physics Olympiads with Reinforcement Learning
Nov 19, 2025
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
Nov 19, 2025
Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
Nov 19, 2025
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Nov 19, 2025
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Nov 19, 2025
GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning
Nov 19, 2025
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Nov 19, 2025
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
Nov 19, 2025
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Nov 18, 2025
DoPE: Denoising Rotary Position Embedding
Nov 18, 2025
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Nov 18, 2025
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Nov 18, 2025
AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery
Nov 18, 2025
LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
Nov 18, 2025
Virtual Width Networks
Nov 18, 2025
One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
Nov 15, 2025
PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
Nov 15, 2025
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Nov 15, 2025
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Nov 11, 2025
DeepEyesV2: Toward Agentic Multimodal Model
Nov 11, 2025
Visual Spatial Tuning
Nov 11, 2025
VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
Nov 11, 2025
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Nov 08, 2025
V-Thinker: Interactive Thinking with Images
Nov 08, 2025
Scaling Agent Learning via Experience Synthesis
Nov 08, 2025
Diffusion Language Models are Super Data Learners
Nov 07, 2025
LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
Nov 07, 2025
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Nov 07, 2025
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Nov 06, 2025
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Nov 06, 2025
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Nov 06, 2025
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
Nov 05, 2025
Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
Nov 05, 2025
The Underappreciated Power of Vision Models for Graph Structural Understanding
Nov 05, 2025
UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
Nov 05, 2025
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Nov 05, 2025
PHUMA: Physically-Grounded Humanoid Locomotion Dataset
Nov 05, 2025
UniREditBench: A Unified Reasoning-based Image Editing Benchmark
Nov 05, 2025
World Simulation with Video Foundation Models for Physical AI
Nov 05, 2025
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Nov 04, 2025
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Nov 04, 2025
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Nov 04, 2025
The End of Manual Decoding: Towards Truly End-to-End Language Models
Nov 01, 2025
Kimi Linear: An Expressive, Efficient Attention Architecture
Nov 01, 2025
Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
Nov 01, 2025
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Nov 01, 2025
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Nov 01, 2025
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
Oct 29, 2025
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Oct 24, 2025
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
Oct 24, 2025
LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Oct 24, 2025
Language Models are Injective and Hence Invertible
Oct 24, 2025
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Oct 24, 2025
LightMem: Lightweight and Efficient Memory-Augmented Generation
Oct 23, 2025
Efficient Long-context Language Model Training by Core Attention Disaggregation
Oct 23, 2025
World-in-World: World Models in a Closed-Loop World
Oct 23, 2025
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Oct 23, 2025
Chem-R: Learning to Reason as a Chemist
Oct 23, 2025
MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Oct 23, 2025
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Oct 23, 2025
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Oct 23, 2025
IF-VidCap: Can Video Caption Models Follow Instructions?
Oct 23, 2025
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
Oct 22, 2025
PICABench: How Far Are We from Physically Realistic Image Editing?
Oct 22, 2025
Glyph: Scaling Context Windows via Visual-Text Compression
Oct 22, 2025
FineVision: Open Data Is All You Need
Oct 22, 2025
TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
Oct 22, 2025
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
Oct 22, 2025
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Oct 22, 2025
A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
Oct 21, 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Oct 21, 2025
NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks
Oct 21, 2025
Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Oct 21, 2025
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Oct 21, 2025
Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
Oct 21, 2025
Latent Diffusion Model without Variational Autoencoder
Oct 21, 2025
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
Oct 18, 2025
Agentic Entropy-Balanced Policy Optimization
Oct 18, 2025
WithAnyone: Towards Controllable and ID Consistent Image Generation
Oct 18, 2025
AI for Service: Proactive Assistance with AI Glasses
Oct 18, 2025
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Oct 18, 2025
ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Oct 18, 2025
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
Oct 18, 2025
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
Oct 18, 2025
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
Oct 18, 2025
BitNet Distillation
Oct 18, 2025
Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
Oct 16, 2025
Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
Oct 16, 2025
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
Oct 16, 2025
Scaling Language-Centric Omnimodal Representation Learning
Oct 16, 2025
Robot Learning: A Tutorial
Oct 16, 2025
Detect Anything via Next Point Prediction
Oct 16, 2025
A Survey of Vibe Coding with Large Language Models
Oct 16, 2025
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
Oct 16, 2025
Dr.LLM: Dynamic Layer Routing in LLMs
Oct 16, 2025
Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models
Oct 16, 2025
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Oct 15, 2025
Diffusion Transformers with Representation Autoencoders
Oct 15, 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Oct 15, 2025
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Oct 15, 2025
Spotlight on Token Perception for Multimodal Reinforcement Learning
Oct 15, 2025
RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
Oct 15, 2025
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Oct 15, 2025
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Oct 15, 2025
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Oct 15, 2025
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
Oct 15, 2025
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Oct 14, 2025
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Oct 14, 2025
TAG: Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling
Oct 14, 2025
AutoPR: Let's Automate Your Academic Promotion!
Oct 14, 2025
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Oct 14, 2025
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
Oct 14, 2025
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Oct 14, 2025
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
Oct 14, 2025
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
Oct 14, 2025
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Oct 14, 2025
Agent Learning via Early Experience
Oct 11, 2025
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
Oct 11, 2025
MemMamba: Rethinking Memory Patterns in State Space Model
Oct 11, 2025
UniVideo: Unified Understanding, Generation, and Editing for Videos
Oct 11, 2025
From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
Oct 11, 2025
When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
Oct 11, 2025
Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning
Oct 11, 2025
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
Oct 11, 2025
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Oct 11, 2025
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Oct 11, 2025
Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Oct 10, 2025
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
Oct 10, 2025
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
Oct 10, 2025
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Oct 10, 2025
MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Oct 10, 2025
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
Oct 10, 2025
Vibe Checker: Aligning Code Evaluation with Human Preference
Oct 10, 2025
Less is More: Recursive Reasoning with Tiny Networks
Oct 09, 2025
TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Oct 09, 2025
Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
Oct 09, 2025
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Oct 09, 2025
Fast-dLLM v2: Efficient Block-Diffusion LLM
Oct 09, 2025
CoDA: Coding LM via Diffusion Adaptation
Oct 09, 2025
Drax: Speech Recognition with Discrete Flow Matching
Oct 09, 2025
Paper2Video: Automatic Video Generation from Scientific Papers
Oct 08, 2025
MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information
Oct 08, 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Oct 08, 2025
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Oct 08, 2025
Imperceptible Jailbreaking against Large Language Models
Oct 08, 2025
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Oct 08, 2025
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Oct 08, 2025
Optimal Scaling Needs Optimal Norm
Oct 08, 2025
Apriel-1.5-15b-Thinker
Oct 07, 2025
Large Reasoning Models Learn Better Alignment from Flawed Thinking
Oct 07, 2025
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
Oct 07, 2025
LongCodeZip: Compress Long Context for Code Language Models
Oct 04, 2025
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Oct 04, 2025
ExGRPO: Learning to Reason from Experience
Oct 04, 2025
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
Oct 04, 2025
Interactive Training: Feedback-Driven Neural Network Optimization
Oct 04, 2025
ModernVBERT: Towards Smaller Visual Document Retrievers
Oct 04, 2025
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Oct 04, 2025
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Oct 03, 2025
GEM: A Gym for Agentic LLMs
Oct 03, 2025
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
Oct 03, 2025
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Oct 03, 2025
PIPer: On-Device Environment Setup via Online Reinforcement Learning
Oct 03, 2025
SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
Oct 03, 2025
ACON: Optimizing Context Compression for Long-horizon LLM Agents
Oct 03, 2025
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Oct 02, 2025
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Oct 02, 2025
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Oct 02, 2025
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
Oct 02, 2025
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
Oct 02, 2025
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Oct 02, 2025
OceanGym: A Benchmark Environment for Underwater Embodied Agents
Oct 02, 2025
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Oct 02, 2025
Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
Oct 02, 2025
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
Oct 02, 2025
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
Oct 01, 2025
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Oct 01, 2025
Multiplayer Nash Preference Optimization
Oct 01, 2025
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
Oct 01, 2025
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
Oct 01, 2025
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
Oct 01, 2025
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Oct 01, 2025
Democratizing AI scientists using ToolUniverse
Oct 01, 2025
Visual Jigsaw Post-Training Improves MLLMs
Oct 01, 2025
When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
Oct 01, 2025
LongLive: Real-time Interactive Long Video Generation
Sep 30, 2025
Quantile Advantage Estimation for Entropy-Safe Reasoning
Sep 30, 2025
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
Sep 30, 2025
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Sep 30, 2025
ReviewScore: Misinformed Peer Review Detection with Large Language Models
Sep 30, 2025
Variational Reasoning for Language Models
Sep 30, 2025
Language Models Can Learn from Verbal Feedback Without Scalar Rewards
Sep 30, 2025
MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
Sep 30, 2025
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Sep 30, 2025
No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
Sep 30, 2025
VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
Sep 27, 2025
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Sep 27, 2025
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Sep 27, 2025
Tree Search for LLM Agent Reinforcement Learning
Sep 27, 2025
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Sep 27, 2025
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
Sep 27, 2025
AutoIntent: AutoML for Text Classification
Sep 27, 2025
Video models are zero-shot learners and reasoners
Sep 26, 2025
SIM-CoT: Supervised Implicit Chain-of-Thought
Sep 26, 2025
Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
Sep 25, 2025
Reinforcement Learning on Pre-Training Data
Sep 25, 2025
Do You Need Proprioceptive States in Visuomotor Policies?
Sep 25, 2025
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Sep 25, 2025
LIMI: Less is More for Agency
Sep 24, 2025
Qwen3-Omni Technical Report
Sep 24, 2025
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
Sep 24, 2025
OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
Sep 24, 2025
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Sep 24, 2025
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
Sep 23, 2025
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Sep 23, 2025
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
Sep 23, 2025
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Sep 20, 2025
FlowRL: Matching Reward Distributions for LLM Reasoning
Sep 20, 2025
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
Sep 20, 2025
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Sep 20, 2025
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
Sep 20, 2025
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Sep 20, 2025
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Sep 19, 2025
SAIL-VL2 Technical Report
Sep 19, 2025
PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
Sep 19, 2025
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Sep 18, 2025
Scaling Agents via Continual Pre-training
Sep 18, 2025
WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Sep 18, 2025
Towards General Agentic Intelligence via Environment Scaling
Sep 18, 2025
WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
Sep 18, 2025
ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
Sep 18, 2025
Single-stream Policy Optimization
Sep 18, 2025
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Sep 17, 2025
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
Sep 17, 2025
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
Sep 17, 2025
IntrEx: A Dataset for Modeling Engagement in Educational Conversations
Sep 16, 2025
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Sep 16, 2025
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
Sep 13, 2025
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Sep 13, 2025
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Sep 13, 2025
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Sep 13, 2025
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Sep 13, 2025
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
Sep 13, 2025
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Sep 13, 2025
Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Sep 13, 2025
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Sep 13, 2025
A Survey of Reinforcement Learning for Large Reasoning Models
Sep 12, 2025
RewardDance: Reward Scaling in Visual Generation
Sep 12, 2025
3D and 4D World Modeling: A Survey
Sep 12, 2025
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
Sep 12, 2025
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Sep 11, 2025
Visual Representation Alignment for Multimodal Large Language Models
Sep 11, 2025
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Sep 11, 2025
Reconstruction Alignment Improves Unified Multimodal Models
Sep 11, 2025
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Sep 11, 2025
Reverse-Engineered Reasoning for Open-Ended Generation
Sep 10, 2025
Does DINOv3 Set a New Medical Vision Standard?
Sep 10, 2025
Symbolic Graphics Programming with Large Language Models
Sep 09, 2025
Set Block Decoding is a Language Model Inference Accelerator
Sep 09, 2025
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Sep 06, 2025
From Editor to Dense Geometry Estimator
Sep 06, 2025
Towards a Unified View of Large Language Model Post-Training
Sep 06, 2025
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Sep 06, 2025
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Sep 06, 2025
Open Data Synthesis For Deep Research
Sep 05, 2025
Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Sep 05, 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Sep 04, 2025
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
Sep 04, 2025
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Sep 04, 2025
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Sep 04, 2025
Baichuan-M2: Scaling Medical Capability with Large Verifier System
Sep 04, 2025
Kwai Keye-VL 1.5 Technical Report
Sep 04, 2025
Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
Sep 04, 2025
PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
Sep 03, 2025
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning
Sep 02, 2025
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Sep 02, 2025
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
Aug 28, 2025
VibeVoice Technical Report
Aug 28, 2025
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Aug 28, 2025
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Aug 28, 2025
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
Aug 28, 2025
Spacer: Towards Engineered Scientific Inspiration
Aug 28, 2025
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
Aug 28, 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Aug 27, 2025
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
Aug 27, 2025
MV-RAG: Retrieval Augmented Multiview Diffusion
Aug 27, 2025
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
Aug 26, 2025
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
Aug 26, 2025
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
Aug 26, 2025
Intern-S1: A Scientific Multimodal Foundation Model
Aug 23, 2025
Mobile-Agent-v3: Fundamental Agents for GUI Automation
Aug 23, 2025
Deep Think with Confidence
Aug 23, 2025
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Aug 23, 2025
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
Aug 22, 2025
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
Aug 22, 2025
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
Aug 22, 2025
MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
Aug 22, 2025
Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
Aug 22, 2025
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
Aug 21, 2025
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Aug 21, 2025
Prompt Orchestration Markup Language
Aug 21, 2025
Ovis2.5 Technical Report
Aug 20, 2025
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
Aug 20, 2025
4DNeX: Feed-Forward 4D Generative Modeling Made Easy
Aug 20, 2025
Next Visual Granularity Generation
Aug 20, 2025
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Aug 20, 2025
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
Aug 20, 2025
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
Aug 20, 2025
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
Aug 20, 2025
SSRL: Self-Search Reinforcement Learning
Aug 19, 2025
DINOv3
Aug 19, 2025
Thyme: Think Beyond Images
Aug 19, 2025
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Aug 19, 2025
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Aug 19, 2025
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
Aug 16, 2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
Aug 16, 2025
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
Aug 16, 2025
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
Aug 16, 2025
Story2Board: A Training-Free Approach for Expressive Storyboard Generation
Aug 15, 2025
Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
Aug 15, 2025
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Aug 15, 2025
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Aug 15, 2025
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Aug 15, 2025
AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving
Aug 15, 2025
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Aug 14, 2025
Complex Logical Instruction Generation
Aug 14, 2025
Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
Aug 14, 2025
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
Aug 14, 2025
ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
Aug 13, 2025
WideSearch: Benchmarking Agentic Broad Info-Seeking
Aug 13, 2025
Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Aug 13, 2025
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
Aug 13, 2025
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
Aug 13, 2025
SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
Aug 13, 2025
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
Aug 13, 2025
MolmoAct: Action Reasoning Models that can Reason in Space
Aug 13, 2025
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Aug 12, 2025
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Aug 12, 2025
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Aug 08, 2025
VeriGUI: Verifiable Long-Chain GUI Dataset
Aug 08, 2025
Efficient Agents: Building Effective Agents While Reducing Cost
Aug 08, 2025
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Aug 08, 2025
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Aug 08, 2025
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
Aug 08, 2025
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
Aug 08, 2025
Qwen-Image Technical Report
Aug 06, 2025
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Aug 06, 2025
CellForge: Agentic Design of Virtual Cell Models
Aug 06, 2025
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
Aug 06, 2025
Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
Aug 06, 2025
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
Aug 05, 2025
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
Aug 05, 2025
PixNerd: Pixel Neural Field Diffusion
Aug 05, 2025
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
Aug 02, 2025
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Aug 02, 2025
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Aug 01, 2025
BANG: Dividing 3D Assets via Generative Exploded Dynamics
Aug 01, 2025
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
Aug 01, 2025
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
Jul 31, 2025
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
Jul 31, 2025
ChemDFM-R: A Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge
Jul 31, 2025
Agentic Reinforced Policy Optimization
Jul 30, 2025
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
Jul 30, 2025
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Jul 30, 2025
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
Jul 30, 2025
SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
Jul 30, 2025
Reconstructing 4D Spatial Intelligence: A Survey
Jul 30, 2025
Deep Researcher with Test-Time Diffusion
Jul 29, 2025
∇NABLA: Neighborhood Adaptive Block-Level Attention
Jul 26, 2025
Group Sequence Policy Optimization
Jul 26, 2025
MUR: Momentum Uncertainty guided Reasoning for Large Language Models
Jul 26, 2025
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
Jul 26, 2025
Pixels, Patterns, but No Poetry: To See The World like Humans
Jul 25, 2025
Yume: An Interactive World Generation Model
Jul 25, 2025
DesignLab: Designing Slides Through Iterative Detection and Correction
Jul 25, 2025
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
Jul 25, 2025
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
Jul 24, 2025
Step-Audio 2 Technical Report
Jul 24, 2025
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
Jul 24, 2025
Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
Jul 24, 2025
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Jul 24, 2025
GUI-G²: Gaussian Reward Modeling for GUI Grounding
Jul 23, 2025
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
Jul 23, 2025
The Invisible Leash: Why RLVR May Not Escape Its Origin
Jul 23, 2025
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
Jul 23, 2025
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
Jul 23, 2025
GR-3 Technical Report
Jul 23, 2025
Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
Jul 23, 2025
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
Jul 23, 2025
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Jul 23, 2025
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
Jul 22, 2025
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Jul 22, 2025
A Survey of Context Engineering for Large Language Models
Jul 19, 2025
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Jul 19, 2025
π³: Scalable Permutation-Equivariant Visual Geometry Learning
Jul 19, 2025
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
Jul 19, 2025
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Jul 19, 2025
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Jul 19, 2025
RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
Jul 19, 2025
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
Jul 18, 2025
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Jul 17, 2025
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes
Jul 17, 2025
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
Jul 16, 2025
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Jul 16, 2025
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Jul 16, 2025
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Jul 16, 2025
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Jul 16, 2025
Test-Time Scaling with Reflective Generative Model
Jul 15, 2025
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Jul 15, 2025
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
Jul 15, 2025
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
Jul 15, 2025
KV Cache Steering for Inducing Reasoning in Small Language Models
Jul 15, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Jul 15, 2025
Neural-Driven Image Editing
Jul 15, 2025
Scaling RL to Long Videos
Jul 12, 2025
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
Jul 12, 2025
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Jul 12, 2025
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
Jul 12, 2025
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jul 12, 2025
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Jul 12, 2025
PyVision: Agentic Vision with Dynamic Tooling
Jul 12, 2025
4KAgent: Agentic Any Image to 4K Super-Resolution
Jul 11, 2025
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Jul 11, 2025
Perception-Aware Policy Optimization for Multimodal Reasoning
Jul 11, 2025
MIRIX: Multi-Agent Memory System for LLM-Based Agents
Jul 11, 2025
Rethinking Verification for LLM Code Generation: From Generation to Testing
Jul 11, 2025
SingLoRA: Low Rank Adaptation Using a Single Matrix
Jul 10, 2025
A Survey on Latent Reasoning
Jul 10, 2025
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
Jul 10, 2025
How to Train Your LLM Web Agent: A Statistical Diagnosis
Jul 10, 2025
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Jul 10, 2025
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
Jul 10, 2025
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
Jul 10, 2025
MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
Jul 10, 2025
MemOS: A Memory OS for AI System
Jul 09, 2025
Should We Still Pretrain Encoders with Masked Language Modeling?
Jul 09, 2025
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
Jul 09, 2025
4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
Jul 09, 2025
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Jul 09, 2025
Pre-Trained Policy Discriminators are General Reward Models
Jul 09, 2025
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
Jul 09, 2025
WebSailor: Navigating Super-human Reasoning for Web Agent
Jul 05, 2025
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
Jul 05, 2025
Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
Jul 05, 2025
IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction
Jul 05, 2025
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Jul 05, 2025
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Jul 05, 2025
Kwai Keye-VL Technical Report
Jul 04, 2025
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
Jul 04, 2025
Depth Anything at Any Condition
Jul 04, 2025
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Jul 04, 2025
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Jul 03, 2025
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Jul 03, 2025
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
Jul 03, 2025
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
Jul 03, 2025
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
Jul 03, 2025
Ovis-U1 Technical Report
Jul 02, 2025
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Jul 02, 2025
VMoBA: Mixture-of-Block Attention for Video Diffusion Models
Jul 02, 2025
Calligrapher: Freestyle Text Image Customization
Jul 02, 2025
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
Jul 01, 2025
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Jul 01, 2025
XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Jul 01, 2025
Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback
Jun 17, 2025
Effective Red-Teaming of Policy-Adherent Agents
Jun 17, 2025
Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
Jun 17, 2025
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Jun 14, 2025
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Jun 14, 2025
Text-Aware Image Restoration with Diffusion Models
Jun 14, 2025
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
Jun 14, 2025
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jun 14, 2025
Discrete Audio Tokens: More Than a Survey!
Jun 14, 2025
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
Jun 13, 2025
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Jun 13, 2025
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
Jun 13, 2025
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Jun 13, 2025
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Jun 13, 2025
PlayerOne: Egocentric World Simulator
Jun 13, 2025
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Jun 13, 2025
Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models
Jun 12, 2025
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Jun 12, 2025
RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
Jun 12, 2025
Reinforcement Pre-Training
Jun 11, 2025
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Jun 11, 2025
MiniCPM4: Ultra-Efficient LLMs on End Devices
Jun 11, 2025
SpatialLM: Training Large Language Models for Structured Indoor Modeling
Jun 11, 2025
Image Reconstruction as a Tool for Feature Analysis
Jun 11, 2025
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
Jun 11, 2025
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
Jun 10, 2025
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
Jun 10, 2025
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
Jun 10, 2025
Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
Jun 10, 2025
SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
Jun 07, 2025
ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
Jun 07, 2025
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
Jun 07, 2025
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Jun 07, 2025
Video World Models with Long-term Spatial Memory
Jun 07, 2025
Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights
Jun 07, 2025
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Jun 07, 2025
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Jun 07, 2025
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Jun 07, 2025
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
Jun 07, 2025
MiMo-VL Technical Report
Jun 06, 2025
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
Jun 06, 2025
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
Jun 06, 2025
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Jun 06, 2025
A Controllable Examination for Long-Context Language Models
Jun 06, 2025
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
Jun 06, 2025
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Jun 06, 2025
SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
Jun 06, 2025
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Jun 05, 2025
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
Jun 05, 2025
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Jun 05, 2025
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Jun 05, 2025
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
Jun 05, 2025
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Jun 05, 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Jun 05, 2025
OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation
Jun 05, 2025
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Jun 04, 2025
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Jun 04, 2025
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Jun 04, 2025
Taming LLMs by Scaling Learning Rates with Gradient Grouping
Jun 04, 2025
ARIA: Training Language Agents with Intention-Driven Reward Aggregation
Jun 04, 2025
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models
Jun 04, 2025
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
Jun 04, 2025
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Jun 04, 2025
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding
Jun 04, 2025
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
Jun 04, 2025
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Jun 03, 2025
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Jun 03, 2025
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Jun 03, 2025
HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Jun 03, 2025
Large Language Models for Data Synthesis
Jun 03, 2025
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
Jun 03, 2025
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Jun 03, 2025
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Jun 03, 2025
Table-R1: Inference-Time Scaling for Table Reasoning
May 31, 2025
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
May 31, 2025
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
May 31, 2025
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
May 31, 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
May 31, 2025
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
May 31, 2025
Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
May 31, 2025
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
May 30, 2025
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
May 30, 2025
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
May 30, 2025
Skywork Open Reasoner 1 Technical Report
May 30, 2025
Sherlock: Self-Correcting Reasoning in Vision-Language Models
May 30, 2025
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
May 30, 2025
SageAttention2++: A More Efficient Implementation of SageAttention2
May 30, 2025
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
May 30, 2025
Fostering Video Reasoning via Next-Event Prediction
May 30, 2025
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
May 30, 2025
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
May 29, 2025
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
May 29, 2025
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
May 29, 2025
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data
May 29, 2025
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
May 29, 2025
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
May 29, 2025
Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
May 29, 2025
Exploring the Latent Capacity of LLMs for One-Step Text Generation
May 29, 2025
Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence
May 29, 2025
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
May 29, 2025
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
May 28, 2025
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
May 28, 2025
Alchemist: Turning Public Text-to-Image Data into Generative Gold
May 28, 2025
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
May 28, 2025
PATS: Process-Level Adaptive Thinking Mode Switching
May 28, 2025
Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance
May 28, 2025
ARM: Adaptive Reasoning Model
May 28, 2025
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
May 28, 2025
Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
May 28, 2025
B-score: Detecting biases in large language models using response history
May 28, 2025
TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
May 27, 2025
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
May 27, 2025
Quartet: Native FP4 Training Can Be Optimal for Large Language Models
May 27, 2025
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models
May 27, 2025
One RL to See Them All: Visual Triple Unified Reinforcement Learning
May 27, 2025
Distilling LLM Agent into Small Models with Retrieval and Code Tools
May 27, 2025
QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
May 27, 2025
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
May 27, 2025
Scaling Image and Video Generation via Test-Time Evolutionary Search
May 27, 2025
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
May 27, 2025
NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
May 24, 2025
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
May 24, 2025
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
May 24, 2025
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
May 24, 2025
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
May 24, 2025
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
May 24, 2025
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
May 24, 2025
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
May 24, 2025
Scaling Diffusion Transformers Efficiently via $μ$P
May 24, 2025
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
May 23, 2025
MMaDA: Multimodal Large Diffusion Language Models
May 23, 2025
Scaling Law for Quantization-Aware Training
May 23, 2025
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
May 23, 2025
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
May 23, 2025
Efficient Agent Training for Computer Use
May 23, 2025
This Time is Different: An Observability Perspective on Time Series Foundation Models
May 23, 2025
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
May 23, 2025
Emerging Properties in Unified Multimodal Pretraining
May 22, 2025
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
May 22, 2025
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
May 22, 2025
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
May 22, 2025
Visual Agentic Reinforcement Fine-Tuning
May 22, 2025
Neurosymbolic Diffusion Models
May 22, 2025
Chain-of-Model Learning for Language Model
May 21, 2025
AdaptThink: Reasoning Models Can Learn When to Think
May 21, 2025
AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
May 21, 2025
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
May 21, 2025
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
May 21, 2025
Faster Video Diffusion with Trainable Sparse Attention
May 21, 2025
Thinkless: LLM Learns When to Think
May 21, 2025
Model Merging in Pre-training of Large Language Models
May 21, 2025
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
May 21, 2025
Qwen3 Technical Report
May 20, 2025
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
May 20, 2025
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
May 20, 2025
Visual Planning: Let's Think Only with Images
May 20, 2025
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
May 17, 2025
System Prompt Optimization with Meta-Learning
May 17, 2025
BLIP3-o: A Family of Fully Open Unified Multimodal Models -- Architecture, Training and Dataset
May 16, 2025
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
May 16, 2025
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
May 16, 2025
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
May 15, 2025
Seed1.5-VL Technical Report
May 14, 2025
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
May 14, 2025
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
May 14, 2025
Learning from Peers in Reasoning Models
May 14, 2025
Unified Continuous Generative Models
May 14, 2025
REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback
May 14, 2025
Bielik v3 Small: Technical Report
May 13, 2025
Bielik 11B v2 Technical Report
May 13, 2025
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
May 10, 2025
On Path to Multimodal Generalist: General-Level and General-Bench
May 10, 2025
Flow-GRPO: Training Flow Matching Models via Online RL
May 10, 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
May 09, 2025
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
May 09, 2025
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
May 08, 2025
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
May 08, 2025
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
May 08, 2025
FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
May 08, 2025
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
May 07, 2025
RM-R1: Reward Modeling as Reasoning
May 07, 2025
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
May 07, 2025
Practical Efficiency of Muon for Pretraining
May 07, 2025
PixelHacker: Image Inpainting with Structural and Semantic Consistency
May 06, 2025
A Survey of Interactive Generative Video
May 03, 2025
DeepCritic: Deliberate Critique with Large Language Models
May 03, 2025
Sadeed: Advancing Arabic Diacritization Through Small Language Model
May 02, 2025
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
May 02, 2025
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
May 02, 2025
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
May 02, 2025
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
May 01, 2025
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
May 01, 2025
ReasonIR: Training Retrievers for Reasoning Tasks
May 01, 2025
The Leaderboard Illusion
May 01, 2025
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
May 01, 2025
RepText: Rendering Visual Text via Replicating
Apr 30, 2025
Towards Understanding Camera Motions in Any Video
Apr 29, 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Apr 29, 2025
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
Apr 29, 2025
Step1X-Edit: A Practical Framework for General Image Editing
Apr 26, 2025
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Apr 26, 2025
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Apr 26, 2025
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Apr 26, 2025
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
Apr 25, 2025
Trillion 7B Technical Report
Apr 25, 2025
Tina: Tiny Reasoning Models via LoRA
Apr 25, 2025
I-Con: A Unifying Framework for Representation Learning
Apr 25, 2025
Kuwain 1.5B: An Arabic SLM via Language Injection
Apr 24, 2025
TTRL: Test-Time Reinforcement Learning
Apr 24, 2025
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Apr 24, 2025
Describe Anything: Detailed Localized Image and Video Captioning
Apr 24, 2025
Learning Adaptive Parallel Reasoning with Language Models
Apr 24, 2025
Learning to Reason under Off-Policy Guidance
Apr 23, 2025
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Apr 23, 2025
FlowReasoner: Reinforcing Query-Level Meta-Agents
Apr 23, 2025
ToolRL: Reward is All Tool Learning Needs
Apr 23, 2025
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Apr 23, 2025
StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
Apr 23, 2025
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Apr 22, 2025
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Apr 22, 2025
NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
Apr 22, 2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Apr 19, 2025
Antidistillation Sampling
Apr 19, 2025
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Apr 19, 2025
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
Apr 19, 2025
WORLDMEM: Long-term Consistent World Simulation with Memory
Apr 19, 2025
A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
Apr 19, 2025
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Apr 18, 2025
BitNet b1.58 2B4T Technical Report
Apr 18, 2025
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Apr 18, 2025
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Apr 17, 2025
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
Apr 17, 2025
How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
Apr 17, 2025
Heimdall: test-time scaling on the generative verification
Apr 17, 2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Apr 17, 2025
TextArena
Apr 17, 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Apr 16, 2025
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
Apr 16, 2025
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Apr 16, 2025
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Apr 16, 2025
Iterative Self-Training for Code Generation via Reinforced Re-Ranking
Apr 16, 2025
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Apr 15, 2025
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Apr 15, 2025
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
Apr 15, 2025
Kimi-VL Technical Report
Apr 12, 2025
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
Apr 12, 2025
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
Apr 12, 2025
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Apr 12, 2025
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Apr 12, 2025
MM-IFEngine: Towards Multimodal Instruction Following
Apr 12, 2025
HoloPart: Generative 3D Part Amodal Segmentation
Apr 12, 2025
DDT: Decoupled Diffusion Transformer
Apr 11, 2025
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Apr 11, 2025
A Unified Agentic Framework for Evaluating Conditional Image Generation
Apr 11, 2025
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
Apr 11, 2025
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Apr 10, 2025
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Apr 10, 2025
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Apr 10, 2025
An Empirical Study of GPT-4o Image Generation Capabilities
Apr 10, 2025
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Apr 10, 2025
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
Apr 10, 2025
SmolVLM: Redefining small and efficient multimodal models
Apr 09, 2025
One-Minute Video Generation with Test-Time Training
Apr 09, 2025
Rethinking Reflection in Pre-Training
Apr 09, 2025
URECA: Unique Region Caption Anything
Apr 09, 2025
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
Apr 09, 2025
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Apr 08, 2025
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Apr 05, 2025
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Apr 05, 2025
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
Apr 05, 2025
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Apr 05, 2025
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
Apr 05, 2025
WikiVideo: Article Generation from Multiple Videos
Apr 05, 2025
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
Apr 04, 2025
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Apr 04, 2025
Understanding R1-Zero-Like Training: A Critical Perspective
Apr 04, 2025
Towards Physically Plausible Video Generation via VLM Planning
Apr 04, 2025
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
Apr 04, 2025
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Apr 04, 2025
START: Self-taught Reasoner with Tools
Mar 08, 2025
Token-Efficient Long Video Understanding for Multimodal LLMs
Mar 08, 2025
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Mar 08, 2025
EgoLife: Towards Egocentric Life Assistant
Mar 08, 2025
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
Mar 07, 2025
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Mar 07, 2025
Process-based Self-Rewarding Language Models
Mar 07, 2025
Visual-RFT: Visual Reinforcement Fine-Tuning
Mar 05, 2025
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Mar 05, 2025
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Mar 05, 2025
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
Mar 04, 2025
Chain of Draft: Thinking Faster by Writing Less
Mar 04, 2025
Multi-Turn Code Generation Through Single-Step Rewards
Mar 04, 2025
Self-rewarding correction for mathematical reasoning
Mar 01, 2025
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Mar 01, 2025
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Mar 01, 2025
LongRoPE2: Near-Lossless LLM Context Window Scaling
Mar 01, 2025
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
Mar 01, 2025
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
Mar 01, 2025
UniTok: A Unified Tokenizer for Visual Generation and Understanding
Mar 01, 2025
NeoBERT: A Next-Generation BERT
Mar 01, 2025
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
Mar 01, 2025
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Mar 01, 2025
GHOST 2.0: generative high-fidelity one shot transfer of heads
Feb 28, 2025
Kanana: Compute-efficient Bilingual Language Models
Feb 28, 2025
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
Feb 28, 2025
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
Feb 28, 2025
Language Models' Factuality Depends on the Language of Inquiry
Feb 28, 2025
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Feb 28, 2025
Towards an AI co-scientist
Feb 28, 2025
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Feb 28, 2025
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Feb 28, 2025
Rank1: Test-Time Compute for Reranking in Information Retrieval
Feb 28, 2025
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Feb 22, 2025
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Feb 22, 2025
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Feb 22, 2025
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Feb 22, 2025
S*: Test Time Scaling for Code Generation
Feb 22, 2025
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Feb 22, 2025
Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning
Feb 22, 2025
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Feb 22, 2025
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information
Feb 22, 2025
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Feb 22, 2025
Qwen2.5-VL Technical Report
Feb 21, 2025
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Feb 21, 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Feb 21, 2025
MoM: Linear Sequence Modeling with Mixture-of-Memories
Feb 21, 2025
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
Feb 21, 2025
Craw4LLM: Efficient Web Crawling for LLM Pretraining
Feb 21, 2025
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
Feb 21, 2025
Small Models Struggle to Learn from Strong Reasoners
Feb 21, 2025
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Feb 21, 2025
SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
Feb 21, 2025
Soundwave: Less is More for Speech-Text Alignment in LLMs
Feb 20, 2025
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
Feb 20, 2025
Continuous Diffusion Model for Language Modeling
Feb 20, 2025
Phantom: Subject-consistent video generation via cross-modal alignment
Feb 20, 2025
Rethinking Diverse Human Preference Learning through Principal Component Analysis
Feb 20, 2025
Magma: A Foundation Model for Multimodal AI Agents
Feb 20, 2025
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
Feb 20, 2025
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Feb 20, 2025
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Feb 20, 2025
You Do Not Fully Utilize Transformer's Representation Capacity
Feb 20, 2025
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Feb 19, 2025
Learning Getting-Up Policies for Real-World Humanoid Robots
Feb 19, 2025
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Feb 19, 2025
CRANE: Reasoning with constrained LLM generation
Feb 19, 2025
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Feb 19, 2025
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Feb 19, 2025
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Feb 19, 2025
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Feb 19, 2025
Region-Adaptive Sampling for Diffusion Transformers
Feb 18, 2025
Large Language Diffusion Models
Feb 18, 2025
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
Feb 18, 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Feb 18, 2025
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Feb 18, 2025
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Feb 18, 2025
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
Feb 18, 2025
Diverse Inference and Verification for Advanced Reasoning
Feb 18, 2025
Precise Parameter Localization for Textual Generation in Diffusion Models
Feb 18, 2025
DarwinLM: Evolutionary Structured Pruning of Large Language Models
Feb 18, 2025
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
Feb 15, 2025
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
Feb 15, 2025
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
Feb 15, 2025
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Feb 15, 2025
Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights
Feb 15, 2025
An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
Feb 15, 2025
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Feb 15, 2025
Exploring the Potential of Encoder-free Architectures in 3D LMMs
Feb 15, 2025
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Feb 15, 2025
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Feb 15, 2025
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
Feb 14, 2025
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Feb 14, 2025
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Feb 14, 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Feb 14, 2025
Distillation Scaling Laws
Feb 14, 2025
TransMLA: Multi-Head Latent Attention Is All You Need
Feb 14, 2025
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Feb 14, 2025
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
Feb 14, 2025
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
Feb 14, 2025
Expect the Unexpected: FailSafe Long Context QA for Finance
Feb 13, 2025
Competitive Programming with Large Reasoning Models
Feb 13, 2025
Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models
Feb 13, 2025
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Feb 13, 2025
Magic 1-For-1: Generating One Minute Video Clips within One Minute
Feb 13, 2025
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
Feb 13, 2025
Teaching Language Models to Critique via Reinforcement Learning
Feb 13, 2025
Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Feb 13, 2025
Enhance-A-Video: Better Generated Video for Free
Feb 13, 2025
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Feb 12, 2025
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Feb 12, 2025
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Feb 12, 2025
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
Feb 12, 2025
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Feb 12, 2025
LM2: Large Memory Models
Feb 12, 2025
Matryoshka Quantization
Feb 12, 2025
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
Feb 12, 2025
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Feb 12, 2025
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
Feb 12, 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Feb 11, 2025
Fast Video Generation with Sliding Tile Attention
Feb 11, 2025
Goku: Flow Based Video Generative Foundation Models
Feb 11, 2025
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Feb 11, 2025
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Feb 11, 2025
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
Feb 11, 2025
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
Feb 11, 2025
Agency Is Frame-Dependent
Feb 11, 2025
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
Feb 11, 2025
Generating Symbolic World Models via Test-time Scaling of Large Language Models
Feb 11, 2025
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
Feb 08, 2025
UltraIF: Advancing Instruction Following from the Wild
Feb 08, 2025
Great Models Think Alike and this Undermines AI Oversight
Feb 08, 2025
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
Feb 08, 2025
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
Feb 08, 2025
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
Feb 08, 2025
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
Feb 08, 2025
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
Feb 08, 2025
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
Feb 08, 2025
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Feb 07, 2025
TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets
Feb 07, 2025
Demystifying Long Chain-of-Thought Reasoning in LLMs
Feb 07, 2025
LIMO: Less is More for Reasoning
Feb 07, 2025
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Feb 07, 2025
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
Feb 07, 2025
On Teacher Hacking in Language Model Distillation
Feb 07, 2025
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
Feb 07, 2025
Jailbreaking with Universal Multi-Prompts
Feb 07, 2025
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Feb 06, 2025
Inverse Bridge Matching Distillation
Feb 06, 2025
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
Feb 06, 2025
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
Feb 06, 2025
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Feb 06, 2025
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
Feb 06, 2025
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
Feb 06, 2025
The Differences Between Direct Alignment Algorithms are a Blur
Feb 05, 2025
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Feb 05, 2025
Process Reinforcement through Implicit Rewards
Feb 05, 2025
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Feb 05, 2025
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
Feb 05, 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Feb 05, 2025
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
Feb 05, 2025
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
Feb 05, 2025
AIN: The Arabic INclusive Large Multimodal Model
Feb 05, 2025
s1: Simple test-time scaling
Feb 04, 2025
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Feb 04, 2025
Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
Feb 04, 2025
PixelWorld: Towards Perceiving Everything as Pixels
Feb 04, 2025
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
Feb 04, 2025
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Feb 04, 2025
Scalable-Softmax Is Superior for Attention
Feb 04, 2025
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Feb 04, 2025
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Feb 04, 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Feb 01, 2025
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Feb 01, 2025
Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
Feb 01, 2025
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Feb 01, 2025
Large Language Models Think Too Fast To Explore Effectively
Feb 01, 2025
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Feb 01, 2025
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
Feb 01, 2025
o3-mini vs DeepSeek-R1: Which One is Safer?
Feb 01, 2025
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Feb 01, 2025
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Jan 31, 2025
Atla Selene Mini: A General Purpose Evaluation Model
Jan 31, 2025
Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts
Jan 31, 2025
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
Jan 31, 2025
Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Jan 31, 2025
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Jan 31, 2025
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
Jan 31, 2025
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Jan 30, 2025
Optimizing Large Language Model Training Using FP4 Quantization
Jan 30, 2025
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Jan 30, 2025
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Jan 30, 2025
Open Problems in Mechanistic Interpretability
Jan 30, 2025
Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
Jan 30, 2025
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
Jan 30, 2025
Histoires Morales: A French Dataset for Assessing Moral Alignment
Jan 30, 2025
Qwen2.5-1M Technical Report
Jan 29, 2025
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
Jan 29, 2025
Towards General-Purpose Model-Free Reinforcement Learning
Jan 29, 2025
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Jan 29, 2025
iFormer: Integrating ConvNet and Transformer for Mobile Application
Jan 29, 2025
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
Jan 29, 2025
CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Jan 29, 2025
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Jan 29, 2025
Humanity's Last Exam
Jan 28, 2025
Chain-of-Retrieval Augmented Generation
Jan 28, 2025
Redundancy Principles for MLLMs Benchmarks
Jan 28, 2025
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
Jan 28, 2025
RL + Transformer = A General-Purpose Problem Solver
Jan 28, 2025
Relightable Full-Body Gaussian Codec Avatars
Jan 28, 2025
Question Answering on Patient Medical Records with Private Fine-Tuned LLMs
Jan 28, 2025
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
Jan 28, 2025
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
Jan 28, 2025
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
Jan 28, 2025
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Jan 25, 2025
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
Jan 25, 2025
Improving Video Generation with Human Feedback
Jan 25, 2025
Temporal Preference Optimization for Long-Form Video Understanding
Jan 25, 2025
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Jan 25, 2025
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Jan 25, 2025
DiffuEraser: A Diffusion Model for Video Inpainting
Jan 25, 2025
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
Jan 25, 2025
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Jan 25, 2025
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
Jan 25, 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Jan 24, 2025
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Jan 24, 2025
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
Jan 24, 2025
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Jan 24, 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Jan 24, 2025
Autonomy-of-Experts Models
Jan 24, 2025
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
Jan 24, 2025
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
Jan 24, 2025
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Jan 24, 2025
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Jan 24, 2025
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
Jan 23, 2025
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Jan 23, 2025
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Jan 23, 2025
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
Jan 23, 2025
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Jan 23, 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Jan 23, 2025
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Jan 23, 2025
Reasoning Language Models: A Blueprint
Jan 23, 2025
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Jan 23, 2025
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
Jan 23, 2025
GameFactory: Creating New Games with Generative Interactive Videos
Jan 22, 2025
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Jan 22, 2025
SEAL: Entangled White-box Watermarks on Low-Rank Adaptation
Jan 22, 2025
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Jan 15, 2025
Tensor Product Attention Is All You Need
Jan 15, 2025
Transformer²: Self-adaptive LLMs
Jan 15, 2025
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Jan 15, 2025
VideoAuteur: Towards Long Narrative Video Generation
Jan 15, 2025
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
Jan 15, 2025
WebWalker: Benchmarking LLMs in Web Traversal
Jan 15, 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Jan 15, 2025
UnCommon Objects in 3D
Jan 15, 2025
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Jan 14, 2025
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Jan 14, 2025
Enabling Scalable Oversight via Self-Evolving Critic
Jan 14, 2025
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Jan 14, 2025
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Jan 14, 2025
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Jan 14, 2025
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Jan 14, 2025
The GAN is dead; long live the GAN! A Modern GAN Baseline
Jan 11, 2025
An Empirical Study of Autoregressive Pre-training from Videos
Jan 11, 2025
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Jan 11, 2025
Entropy-Guided Attention for Private LLMs
Jan 11, 2025
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
Jan 11, 2025
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
Jan 11, 2025
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
Jan 11, 2025
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
Jan 11, 2025
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Jan 10, 2025
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Jan 10, 2025
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Jan 10, 2025
Agent Laboratory: Using LLM Agents as Research Assistants
Jan 10, 2025
LLM4SR: A Survey on Large Language Models for Scientific Research
Jan 10, 2025
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Jan 10, 2025
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
Jan 10, 2025
GeAR: Generation Augmented Retrieval
Jan 10, 2025
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
Jan 10, 2025
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
Jan 10, 2025
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
Jan 09, 2025
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Jan 09, 2025
Cosmos World Foundation Model Platform for Physical AI
Jan 09, 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Jan 09, 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Jan 09, 2025
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Jan 09, 2025
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Jan 09, 2025
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Jan 09, 2025
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Jan 09, 2025
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
Jan 09, 2025
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Jan 08, 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Jan 08, 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Jan 08, 2025
Personalized Graph-Based Retrieval for Large Language Models
Jan 08, 2025
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
Jan 08, 2025
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
Jan 08, 2025
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
Jan 08, 2025
TransPixar: Advancing Text-to-Video Generation with Transparency
Jan 08, 2025
AutoPresent: Designing Structured Visuals from Scratch
Jan 08, 2025
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
Jan 07, 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Jan 07, 2025
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
Jan 07, 2025
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Jan 07, 2025
SDPO: Segment-Level Direct Preference Optimization for Social Agents
Jan 07, 2025
Graph Generative Pre-trained Transformer
Jan 07, 2025
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
Jan 07, 2025
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
Jan 07, 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Jan 04, 2025
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Jan 04, 2025
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Jan 04, 2025
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Jan 04, 2025
ProgCo: Program Helps Self-Correction of Large Language Models
Jan 04, 2025
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
Jan 04, 2025
A3: Android Agent Arena for Mobile GUI Agents
Jan 04, 2025
MLLM-as-a-Judge for Image Safety without Human Labeling
Jan 04, 2025
Dynamic Scaling of Unit Tests for Code Reward Modeling
Jan 04, 2025
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Jan 03, 2025
Xmodel-2 Technical Report
Jan 03, 2025
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
Jan 03, 2025
HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving
Jan 03, 2025
VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
Jan 03, 2025
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Jan 02, 2025
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
Jan 02, 2025
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
Jan 01, 2025
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
Jan 01, 2025
Bringing Objects to Life: 4D generation from 3D objects
Jan 01, 2025
Efficiently Serving LLM Reasoning Programs with Certaindex
Jan 01, 2025
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Jan 01, 2025
Edicho: Consistent Image Editing in the Wild
Jan 01, 2025
Facilitating large language model Russian adaptation with Learned Embedding Propagation
Jan 01, 2025
Training Software Engineering Agents and Verifiers with SWE-Gym
Jan 01, 2025
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Jan 01, 2025
Slow Perception: Let's Perceive Geometric Figures Step-by-step
Jan 01, 2025
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Dec 31, 2024
1.58-bit FLUX
Dec 31, 2024
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Dec 31, 2024
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
Dec 31, 2024
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Dec 31, 2024
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
Dec 31, 2024
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
Dec 31, 2024
The Superposition of Diffusion Models Using the Itô Density Estimator
Dec 31, 2024
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Dec 31, 2024
CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
Dec 31, 2024
YuLan-Mini: An Open Data-efficient Language Model
Dec 28, 2024
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
Dec 28, 2024
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Dec 28, 2024
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
Dec 28, 2024
DepthLab: From Partial to Complete
Dec 26, 2024
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
Dec 26, 2024
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Dec 26, 2024
In Case You Missed It: ARC 'Challenge' Is Not That Challenging
Dec 26, 2024
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
Dec 26, 2024
SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
Dec 26, 2024
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
Dec 26, 2024
MotiF: Making Text Count in Image Animation with Motion Focal Loss
Dec 26, 2024
Bridging the Data Provenance Gap Across Text, Speech and Video
Dec 26, 2024
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
Dec 25, 2024
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Dec 25, 2024
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
Dec 25, 2024
Diving into Self-Evolving Training for Multimodal Reasoning
Dec 25, 2024
Deliberation in Latent Space via Differentiable Cache Augmentation
Dec 25, 2024
Large Motion Video Autoencoding with Cross-modal Video VAE
Dec 25, 2024
OpenAI o1 System Card
Dec 25, 2024
Revisiting In-Context Learning with Long Context Language Models
Dec 25, 2024
Outcome-Refining Process Supervision for Code Generation
Dec 25, 2024
LearnLM: Improving Gemini for Learning
Dec 25, 2024
Parallelized Autoregressive Visual Generation
Dec 24, 2024
Offline Reinforcement Learning for LLM Multi-Step Reasoning
Dec 24, 2024
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
Dec 24, 2024
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
Dec 24, 2024
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Dec 24, 2024
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
Dec 24, 2024
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
Dec 24, 2024
TRecViT: A Recurrent Video Transformer
Dec 24, 2024
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Dec 24, 2024
Multi-LLM Text Summarization
Dec 24, 2024
Qwen2.5 Technical Report
Dec 21, 2024
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Dec 21, 2024
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Dec 21, 2024
How to Synthesize Text Data without Model Collapse?
Dec 21, 2024
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Dec 21, 2024
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
Dec 21, 2024
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
Dec 21, 2024
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
Dec 21, 2024
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
Dec 21, 2024
No More Adam: Learning Rate Scaling at Initialization is All You Need
Dec 20, 2024
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Dec 20, 2024
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Dec 20, 2024
AniDoc: Animation Creation Made Easier
Dec 20, 2024
FashionComposer: Compositional Fashion Image Generation
Dec 20, 2024
GUI Agents: A Survey
Dec 20, 2024
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
Dec 20, 2024
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
Dec 20, 2024
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Dec 20, 2024
Are Your LLMs Capable of Stable Reasoning?
Dec 19, 2024
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
Dec 19, 2024
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
Dec 19, 2024
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
Dec 19, 2024
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
Dec 19, 2024
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Dec 19, 2024
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
Dec 19, 2024
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
Dec 19, 2024
SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
Dec 19, 2024
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
Dec 19, 2024
Byte Latent Transformer: Patches Scale Better Than Tokens
Dec 18, 2024
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
Dec 18, 2024
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
Dec 18, 2024
BrushEdit: All-In-One Image Inpainting and Editing
Dec 18, 2024
ColorFlow: Retrieval-Augmented Image Sequence Colorization
Dec 18, 2024
Smaller Language Models Are Better Instruction Evolvers
Dec 18, 2024
Causal Diffusion Transformers for Generative Modeling
Dec 18, 2024
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Dec 18, 2024
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
Dec 18, 2024
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Dec 18, 2024
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Dec 17, 2024
GenEx: Generating an Explorable World
Dec 17, 2024
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Dec 17, 2024
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Dec 17, 2024
Large Action Models: From Inception to Implementation
Dec 17, 2024
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Dec 17, 2024
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Dec 17, 2024
ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
Dec 17, 2024
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Dec 17, 2024
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
Dec 17, 2024
Phi-4 Technical Report
Dec 14, 2024
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Dec 14, 2024
Multimodal Latent Language Modeling with Next-Token Diffusion
Dec 14, 2024
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
Dec 14, 2024
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Dec 14, 2024
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Dec 14, 2024
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
Dec 14, 2024
JuStRank: Benchmarking LLM Judges for System Ranking
Dec 14, 2024
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Dec 13, 2024
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
Dec 13, 2024
POINTS1.5: Building a Vision-Language Model towards Real World Applications
Dec 13, 2024
Learning Flow Fields in Attention for Controllable Person Image Generation
Dec 13, 2024
StyleMaster: Stylize Your Video with Artistic Generation and Translation
Dec 13, 2024
StreamChat: Chatting with Streaming Video
Dec 13, 2024
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Dec 13, 2024
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
Dec 13, 2024
The BrowserGym Ecosystem for Web Agent Research
Dec 13, 2024
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Dec 12, 2024
Hidden in the Noise: Two-Stage Robust Watermarking for Images
Dec 12, 2024
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
Dec 12, 2024
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Dec 12, 2024
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
Dec 12, 2024
Mobile Video Diffusion
Dec 12, 2024
Granite Guardian
Dec 12, 2024
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
Dec 11, 2024
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Dec 11, 2024
Training Large Language Models to Reason in a Continuous Latent Space
Dec 11, 2024
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Dec 11, 2024
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Dec 11, 2024
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Dec 11, 2024
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Dec 11, 2024
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Dec 11, 2024
Robust Multi-bit Text Watermark with LLM-based Paraphrasers
Dec 11, 2024
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
Dec 11, 2024
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Dec 10, 2024
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
Dec 10, 2024
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Dec 10, 2024
APOLLO: SGD-like Memory, AdamW-level Performance
Dec 10, 2024
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
Dec 10, 2024
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
Dec 10, 2024
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
Dec 10, 2024
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
Dec 10, 2024
CompCap: Improving Multimodal Large Language Models with Composite Captions
Dec 10, 2024
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Dec 08, 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Dec 08, 2024
NVILA: Efficient Frontier Visual Language Models
Dec 08, 2024
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Dec 08, 2024
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Dec 08, 2024
Evaluating Language Models as Synthetic Data Generators
Dec 08, 2024
A Noise is Worth Diffusion Guidance
Dec 08, 2024
Structured 3D Latents for Scalable and Versatile 3D Generation
Dec 08, 2024
Negative Token Merging: Image-based Adversarial Feature Guidance
Dec 08, 2024
MV-Adapter: Multi-view Consistent Image Generation Made Easy
Dec 08, 2024
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Nov 28, 2024
Star Attention: Efficient LLM Inference over Long Sequences
Nov 28, 2024
Pathways on the Image Manifold: Image Editing via Video Generation
Nov 28, 2024
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Nov 28, 2024
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Nov 28, 2024
SketchAgent: Language-Driven Sequential Sketch Generation
Nov 28, 2024
TEXGen: a Generative Diffusion Model for Mesh Textures
Nov 28, 2024
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Nov 28, 2024
Learning 3D Representations from Procedural 3D Programs
Nov 28, 2024
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Nov 28, 2024
Material Anything: Generating Materials for Any 3D Object via Diffusion
Nov 27, 2024
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Nov 27, 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Nov 27, 2024
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Nov 27, 2024
MH-MoE: Multi-Head Mixture-of-Experts
Nov 27, 2024
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Nov 27, 2024
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Nov 27, 2024
Knowledge Transfer Across Modalities with Natural Language Supervision
Nov 27, 2024
One Diffusion to Generate Them All
Nov 27, 2024
VisualLens: Personalization through Visual History
Nov 27, 2024
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
Nov 26, 2024
Style-Friendly SNR Sampler for Style-Driven Generation
Nov 26, 2024
OminiControl: Minimal and Universal Control for Diffusion Transformer
Nov 26, 2024
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Nov 26, 2024
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Nov 26, 2024
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Nov 26, 2024
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Nov 26, 2024
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
Nov 26, 2024
MyTimeMachine: Personalized Facial Age Transformation
Nov 26, 2024
Novel View Extrapolation with Video Diffusion Priors
Nov 26, 2024
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Nov 23, 2024
Multimodal Autoregressive Pre-training of Large Vision Encoders
Nov 23, 2024
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Nov 23, 2024
Hymba: A Hybrid-head Architecture for Small Language Models
Nov 23, 2024
Natural Language Reinforcement Learning
Nov 23, 2024
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Nov 23, 2024
Ultra-Sparse Memory Network
Nov 23, 2024
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Nov 23, 2024
Stable Flow: Vital Layers for Training-Free Image Editing
Nov 23, 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Nov 23, 2024
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Nov 22, 2024
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
Nov 22, 2024
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Nov 22, 2024
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Nov 22, 2024
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
Nov 22, 2024
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Nov 22, 2024
Stylecodes: Encoding Stylistic Information For Image Generation
Nov 22, 2024
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Nov 22, 2024
Loss-to-Loss Prediction: Scaling Laws for All Datasets
Nov 22, 2024
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
Nov 22, 2024
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Nov 21, 2024
Continuous Speculative Decoding for Autoregressive Image Generation
Nov 21, 2024
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
Nov 21, 2024
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Nov 21, 2024
Soft Robotic Dynamic In-Hand Pen Spinning
Nov 21, 2024
Building Trust: Foundations of Security, Safety and Transparency in AI
Nov 21, 2024
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
Nov 21, 2024
Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
Nov 21, 2024
Generative World Explorer
Nov 20, 2024
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Nov 20, 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Nov 20, 2024
AnimateAnything: Consistent and Controllable Animation for Video Generation
Nov 20, 2024
Top-nσ: Not All Logits Are You Need
Nov 20, 2024
Drowning in Documents: Consequences of Scaling Reranker Inference
Nov 20, 2024
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Nov 20, 2024
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
Nov 20, 2024
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
Nov 20, 2024
LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
Nov 20, 2024
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Nov 19, 2024
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
Nov 19, 2024
Xmodel-1.5: An 1B-scale Multilingual LLM
Nov 19, 2024
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
Nov 16, 2024
MagicQuill: An Intelligent Interactive Image Editing System
Nov 16, 2024
Cut Your Losses in Large-Vocabulary Language Models
Nov 16, 2024
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
Nov 16, 2024
Sharingan: Extract User Action Sequence from Desktop Recordings
Nov 16, 2024
Hermes: A Large Language Model Framework on the Journey to Autonomous Networks
Nov 16, 2024
Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples
Nov 16, 2024
Direct Preference Optimization Using Sparse Feature-Level Constraints
Nov 15, 2024
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
Nov 15, 2024
Can sparse autoencoders be used to decompose and interpret steering vectors?
Nov 15, 2024
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation
Nov 15, 2024
SAMPart3D: Segment Any Part in 3D Objects
Nov 14, 2024
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Nov 14, 2024
Stronger Models are NOT Stronger Teachers for Instruction Tuning
Nov 14, 2024
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Nov 14, 2024
Scaling Properties of Diffusion Models for Perceptual Tasks
Nov 14, 2024
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
Nov 14, 2024
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Nov 13, 2024
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
Nov 13, 2024
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Nov 13, 2024
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Nov 13, 2024
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
Nov 13, 2024
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
Nov 13, 2024
Watermark Anything with Localized Messages
Nov 13, 2024
Autoregressive Models in Vision: A Survey
Nov 13, 2024
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Nov 12, 2024
Balancing Pipeline Parallelism with Vocabulary Parallelism
Nov 12, 2024
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
Nov 12, 2024
DELIFT: Data Efficient Language model Instruction Fine Tuning
Nov 12, 2024
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study
Nov 12, 2024
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
Nov 12, 2024
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Nov 12, 2024
Improving the detection of technical debt in Java source code with an enriched dataset
Nov 12, 2024
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Nov 09, 2024
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Nov 09, 2024
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Nov 09, 2024
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
Nov 09, 2024
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Nov 09, 2024
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Nov 09, 2024
Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model
Nov 09, 2024
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
Nov 09, 2024
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
Nov 09, 2024
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Nov 09, 2024
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
Nov 08, 2024
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Nov 08, 2024
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Nov 08, 2024
Self-Consistency Preference Optimization
Nov 08, 2024
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Nov 08, 2024
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
Nov 07, 2024
LLaMo: Large Language Model-based Molecular Graph Assistant
Nov 07, 2024
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
Nov 07, 2024
Controlling Language and Diffusion Models by Transporting Activations
Nov 07, 2024
Sample-Efficient Alignment for LLMs
Nov 07, 2024
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
Nov 07, 2024
Adaptive Length Image Tokenization via Recurrent Allocation
Nov 07, 2024
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
Nov 07, 2024
Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge
Nov 07, 2024
Inference Optimal VLMs Need Only One Visual Token but Larger Models
Nov 07, 2024
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Nov 06, 2024
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
Nov 06, 2024
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Nov 06, 2024
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
Nov 06, 2024
Training-free Regional Prompting for Diffusion Transformers
Nov 06, 2024
How Far is Video Generation from World Model: A Physical Law Perspective
Nov 06, 2024
Survey of Cultural Awareness in Language Models: Text and Beyond
Nov 06, 2024
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
Nov 06, 2024
GenXD: Generating Any 3D and 4D Scenes
Nov 06, 2024
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Nov 06, 2024
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Nov 05, 2024
Personalization of Large Language Models: A Survey
Nov 05, 2024
Constant Acceleration Flow
Nov 05, 2024
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Nov 05, 2024
Randomized Autoregressive Visual Generation
Nov 05, 2024
Survey of User Interface Design and Interaction Techniques in Generative AI Applications
Nov 05, 2024
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Nov 05, 2024
In-Context LoRA for Diffusion Transformers
Nov 05, 2024
Physics in Next-token Prediction
Nov 05, 2024
CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes
Nov 05, 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Nov 03, 2024
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
Nov 03, 2024
A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents
Nov 03, 2024
Language Models can Self-Lengthen to Generate Long Texts
Nov 03, 2024
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
Nov 03, 2024
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
Nov 03, 2024
SelfCodeAlign: Self-Alignment for Code Generation
Nov 03, 2024
Learning Video Representations without Natural Videos
Nov 03, 2024
AAAR-1.0: Assessing AI's Potential to Assist Research
Nov 03, 2024
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
Nov 03, 2024