| Episode | Date |
|---|---|
| Urban Socio-Semantic Segmentation with Vision-Language Reasoning | Jan 17, 2026 |
| STEP3-VL-10B Technical Report | Jan 17, 2026 |
| Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs | Jan 17, 2026 |
| Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning | Jan 17, 2026 |
| Controlled Self-Evolution for Algorithmic Code Optimization | Jan 16, 2026 |
| DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation | Jan 16, 2026 |
| MAXS: Meta-Adaptive Exploration with LLM Agents | Jan 16, 2026 |
| Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning | Jan 16, 2026 |
| Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning | Jan 16, 2026 |
| SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL | Jan 16, 2026 |
| OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG | Jan 16, 2026 |
| OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding | Jan 16, 2026 |
| MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences | Jan 15, 2026 |
| Solar Open Technical Report | Jan 15, 2026 |
| KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions | Jan 15, 2026 |
| User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale | Jan 15, 2026 |
| ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands | Jan 15, 2026 |
| ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking | Jan 15, 2026 |
| MemoBrain: Executive Memory as an Agentic Brain for Reasoning | Jan 15, 2026 |
| Motion Attribution for Video Generation | Jan 15, 2026 |
| 3AM: Segment Anything with Geometric Consistency in Videos | Jan 15, 2026 |
| BabyVision: Visual Reasoning Beyond Language | Jan 14, 2026 |
| PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning | Jan 14, 2026 |
| MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head | Jan 14, 2026 |
| X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests | Jan 14, 2026 |
| GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts | Jan 14, 2026 |
| Lost in the Noise: How Reasoning Models Fail with Contextual Distractors | Jan 14, 2026 |
| OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent | Jan 14, 2026 |
| Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization | Jan 13, 2026 |
| MMFormalizer: Multimodal Autoformalization in the Wild | Jan 13, 2026 |
| CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature | Jan 13, 2026 |
| The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning | Jan 13, 2026 |
| Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards | Jan 13, 2026 |
| EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis | Jan 13, 2026 |
| Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking | Jan 13, 2026 |
| GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization | Jan 10, 2026 |
| Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers | Jan 10, 2026 |
| RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes | Jan 10, 2026 |
| Token-Level LLM Collaboration via FusionRoute | Jan 10, 2026 |
| Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting | Jan 09, 2026 |
| Evolving Programmatic Skill Networks | Jan 09, 2026 |
| Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning | Jan 09, 2026 |
| Benchmark^2: Systematic Evaluation of LLM Benchmarks | Jan 09, 2026 |
| InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields | Jan 08, 2026 |
| LTX-2: Efficient Joint Audio-Visual Foundation Model | Jan 08, 2026 |
| MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization | Jan 08, 2026 |
| SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence | Jan 08, 2026 |
| NitroGen: An Open Foundation Model for Generalist Gaming Agents | Jan 08, 2026 |
| Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits | Jan 07, 2026 |
| NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation | Jan 07, 2026 |
| DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer | Jan 07, 2026 |
| VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation | Jan 07, 2026 |
| GARDO: Reinforcing Diffusion Models without Reward Hacking | Jan 07, 2026 |
| InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams | Jan 07, 2026 |
| VINO: A Unified Visual Generator with Interleaved OmniModal Context | Jan 07, 2026 |
| Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization | Jan 06, 2026 |
| NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos | Jan 06, 2026 |
| Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation | Jan 06, 2026 |
| Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation | Jan 06, 2026 |
| SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning | Jan 06, 2026 |
| Deep Delta Learning | Jan 06, 2026 |
| AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction | Jan 06, 2026 |
| Nested Learning: The Illusion of Deep Learning Architectures | Jan 06, 2026 |
| Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling | Jan 03, 2026 |
| Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space | Jan 03, 2026 |
| mHC: Manifold-Constrained Hyper-Connections | Jan 02, 2026 |
| Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models | Jan 02, 2026 |
| Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem | Jan 02, 2026 |
| GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction | Jan 02, 2026 |
| Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss | Dec 31, 2025 |
| LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation | Dec 31, 2025 |
| Yume-1.5: A Text-Controlled Interactive World Generation Model | Dec 31, 2025 |
| SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents | Dec 31, 2025 |
| Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation | Dec 31, 2025 |
| Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion | Dec 31, 2025 |
| Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone | Dec 31, 2025 |
| SpotEdit: Selective Region Editing in Diffusion Transformers | Dec 31, 2025 |
| GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models | Dec 31, 2025 |
| InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion | Dec 30, 2025 |
| Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding | Dec 30, 2025 |
| MAI-UI Technical Report: Real-World Centric Foundation GUI Agents | Dec 30, 2025 |
| Latent Implicit Visual Reasoning | Dec 27, 2025 |
| Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning | Dec 27, 2025 |
| TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times | Dec 26, 2025 |
| Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models | Dec 26, 2025 |
| DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation | Dec 26, 2025 |
| T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation | Dec 26, 2025 |
| SemanticGen: Video Generation in Semantic Space | Dec 25, 2025 |
| Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies | Dec 25, 2025 |
| LongVideoAgent: Multi-Agent Reasoning with Long Videos | Dec 25, 2025 |
| SpatialTree: How Spatial Abilities Branch Out in MLLMs | Dec 25, 2025 |
| DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI | Dec 24, 2025 |
| The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding | Dec 24, 2025 |
| Region-Constraint In-Context Generation for Instructional Video Editing | Dec 24, 2025 |
| QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation | Dec 24, 2025 |
| Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation | Dec 24, 2025 |
| Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction | Dec 24, 2025 |
| Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows | Dec 23, 2025 |
| PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence | Dec 23, 2025 |
| When Reasoning Meets Its Laws | Dec 23, 2025 |
| Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience | Dec 23, 2025 |
| 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation | Dec 23, 2025 |
| Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing | Dec 23, 2025 |
| Are We on the Right Way to Assessing LLM-as-a-Judge? | Dec 23, 2025 |
| Kling-Omni Technical Report | Dec 20, 2025 |
| Adaptation of Agentic AI | Dec 20, 2025 |
| LLaDA2.0: Scaling Up Diffusion Language Models to 100B | Dec 20, 2025 |
| Next-Embedding Prediction Makes Strong Vision Learners | Dec 20, 2025 |
| StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors | Dec 20, 2025 |
| Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model | Dec 20, 2025 |
| Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation | Dec 20, 2025 |
| Generative Refocusing: Flexible Defocus Control from a Single Image | Dec 20, 2025 |
| DeContext as Defense: Safe Image Editing in Diffusion Transformers | Dec 20, 2025 |
| Step-GUI Technical Report | Dec 19, 2025 |
| DEER: Draft with Diffusion, Verify with Autoregressive Models | Dec 19, 2025 |
| Fast and Accurate Causal Parallel Decoding using Jacobi Forcing | Dec 19, 2025 |
| HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices | Dec 19, 2025 |
| Puzzle Curriculum GRPO for Vision-Centric Reasoning | Dec 19, 2025 |
| MMGR: Multi-Modal Generative Reasoning | Dec 18, 2025 |
| Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans? | Dec 18, 2025 |
| WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling | Dec 18, 2025 |
| Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling | Dec 18, 2025 |
| RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics | Dec 18, 2025 |
| OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value | Dec 18, 2025 |
| ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding | Dec 17, 2025 |
| Towards Scalable Pre-training of Visual Tokenizers for Generation | Dec 17, 2025 |
| Memory in the Age of AI Agents | Dec 17, 2025 |
| QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management | Dec 17, 2025 |
| LongVie 2: Multimodal Controllable Ultra-Long Video World Model | Dec 17, 2025 |
| Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows | Dec 17, 2025 |
| NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents | Dec 17, 2025 |
| Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics | Dec 17, 2025 |
| KlingAvatar 2.0 Technical Report | Dec 17, 2025 |
| MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment | Dec 17, 2025 |
| EgoX: Egocentric Video Generation from a Single Exocentric Video | Dec 16, 2025 |
| DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry | Dec 16, 2025 |
| SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder | Dec 16, 2025 |
| V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties | Dec 16, 2025 |
| T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground | Dec 13, 2025 |
| Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving | Dec 13, 2025 |
| Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation | Dec 13, 2025 |
| OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification | Dec 13, 2025 |
| Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning | Dec 13, 2025 |
| StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation | Dec 12, 2025 |
| BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain | Dec 12, 2025 |
| OmniPSD: Layered PSD Generation with Diffusion Transformer | Dec 12, 2025 |
| Composing Concepts from Images and Videos via Concept-prompt Binding | Dec 12, 2025 |
| Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance | Dec 11, 2025 |
| Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform | Dec 11, 2025 |
| Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality | Dec 11, 2025 |
| OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory | Dec 11, 2025 |
| Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning | Dec 10, 2025 |
| Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs | Dec 10, 2025 |
| Unified Video Editing with Temporal Reasoner | Dec 10, 2025 |
| Voxify3D: Pixel Art Meets Volumetric Rendering | Dec 10, 2025 |
| Scaling Zero-Shot Reference-to-Video Generation | Dec 10, 2025 |
| DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems | Dec 10, 2025 |
| TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows | Dec 09, 2025 |
| EditThinker: Unlocking Iterative Reasoning for Any Image Editor | Dec 09, 2025 |
| From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks | Dec 09, 2025 |
| EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture | Dec 09, 2025 |
| DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle | Dec 06, 2025 |
| Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length | Dec 06, 2025 |
| Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction | Dec 06, 2025 |
| ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning | Dec 06, 2025 |
| Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation | Dec 06, 2025 |
| Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion | Dec 06, 2025 |
| PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing | Dec 06, 2025 |
| Qwen3-VL Technical Report | Dec 05, 2025 |
| Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach | Dec 05, 2025 |
| PretrainZero: Reinforcement Active Pretraining | Dec 05, 2025 |
| ViDiC: Video Difference Captioning | Dec 05, 2025 |
| DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models | Dec 04, 2025 |
| ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration | Dec 04, 2025 |
| MultiShotMaster: A Controllable Multi-Shot Video Generation Framework | Dec 04, 2025 |
| MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory | Dec 04, 2025 |
| Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch | Dec 04, 2025 |
| DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation | Dec 04, 2025 |
| Guided Self-Evolving LLMs with Minimal Human Supervision | Dec 04, 2025 |
| SimScale: Learning to Drive via Real-World Simulation at Scale | Dec 04, 2025 |
| InnoGym: Benchmarking the Innovation Potential of AI Agents | Dec 04, 2025 |
| LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling | Dec 03, 2025 |
| Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights | Dec 03, 2025 |
| Stabilizing Reinforcement Learning with LLMs: Formulation and Practices | Dec 03, 2025 |
| How Far Are We from Genuinely Useful Deep Research Agents? | Dec 03, 2025 |
| What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards | Dec 03, 2025 |
| Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout | Dec 03, 2025 |
| The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment | Dec 03, 2025 |
| TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models | Dec 03, 2025 |
| LFM2 Technical Report | Dec 03, 2025 |
| Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer | Dec 02, 2025 |
| REASONEDIT: Towards Reasoning-Enhanced Image Editing Models | Dec 02, 2025 |
| Vision Bridge Transformer at Scale | Dec 02, 2025 |
| DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning | Dec 02, 2025 |
| Architecture Decoupling Is Not All You Need For Unified Multimodal Model | Dec 02, 2025 |
| Multimodal Evaluation of Russian-language Architectures | Nov 28, 2025 |
| Latent Collaboration in Multi-Agent Systems | Nov 28, 2025 |
| Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation | Nov 28, 2025 |
| GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms | Nov 27, 2025 |
| MedSAM3: Delving into Segment Anything with Medical Concepts | Nov 27, 2025 |
| Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning | Nov 27, 2025 |
| SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation | Nov 27, 2025 |
| iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation | Nov 27, 2025 |
| Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward | Nov 27, 2025 |
| GigaWorld-0: World Models as Data Engine to Empower Embodied AI | Nov 27, 2025 |
| SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space | Nov 27, 2025 |
| Soft Adaptive Policy Optimization | Nov 27, 2025 |
| General Agentic Memory Via Deep Research | Nov 26, 2025 |
| AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning | Nov 26, 2025 |
| Computer-Use Agents as Judges for Generative User Interface | Nov 26, 2025 |
| DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation | Nov 26, 2025 |
| DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research | Nov 26, 2025 |
| UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios | Nov 26, 2025 |
| In-Video Instructions: Visual Signals as Generative Control | Nov 26, 2025 |
| OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe | Nov 25, 2025 |
| Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story | Nov 25, 2025 |
| GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization | Nov 25, 2025 |
| SAM 3: Segment Anything with Concepts | Nov 25, 2025 |
| Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks | Nov 21, 2025 |
| Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation | Nov 21, 2025 |
| What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity | Nov 21, 2025 |
| VisPlay: Self-Evolving Vision-Language Models from Images | Nov 21, 2025 |
| Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset | Nov 21, 2025 |
| VIDEOP2R: Video Understanding from Perception to Reasoning | Nov 20, 2025 |
| Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models | Nov 20, 2025 |
| AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models | Nov 20, 2025 |
| A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space | Nov 20, 2025 |
| Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark | Nov 20, 2025 |
| MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs | Nov 20, 2025 |
| REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding | Nov 20, 2025 |
| Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Nov 19, 2025 |
| P1: Mastering Physics Olympiads with Reinforcement Learning | Nov 19, 2025 |
| MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling | Nov 19, 2025 |
| Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance | Nov 19, 2025 |
| Part-X-MLLM: Part-aware 3D Multimodal Large Language Model | Nov 19, 2025 |
| MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation | Nov 19, 2025 |
| GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning | Nov 19, 2025 |
| TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | Nov 19, 2025 |
| PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image | Nov 19, 2025 |
| GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models | Nov 18, 2025 |
| DoPE: Denoising Rotary Position Embedding | Nov 18, 2025 |
| WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation | Nov 18, 2025 |
| UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation | Nov 18, 2025 |
| AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery | Nov 18, 2025 |
| LiteAttention: A Temporal Sparse Attention for Diffusion Transformers | Nov 18, 2025 |
| Virtual Width Networks | Nov 18, 2025 |
| One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models | Nov 15, 2025 |
| PAN: A World Model for General, Interactable, and Long-Horizon World Simulation | Nov 15, 2025 |
| UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist | Nov 15, 2025 |
| Too Good to be Bad: On the Failure of LLMs to Role-Play Villains | Nov 11, 2025 |
| DeepEyesV2: Toward Agentic Multimodal Model | Nov 11, 2025 |
| Visual Spatial Tuning | Nov 11, 2025 |
| VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks | Nov 11, 2025 |
| Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm | Nov 08, 2025 |
| V-Thinker: Interactive Thinking with Images | Nov 08, 2025 |
| Scaling Agent Learning via Experience Synthesis | Nov 08, 2025 |
| Diffusion Language Models are Super Data Learners | Nov 07, 2025 |
| LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation | Nov 07, 2025 |
| UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions | Nov 07, 2025 |
| Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization | Nov 06, 2025 |
| VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation | Nov 06, 2025 |
| When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought | Nov 06, 2025 |
| Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation | Nov 05, 2025 |
| Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph | Nov 05, 2025 |
| The Underappreciated Power of Vision Models for Graph Structural Understanding | Nov 05, 2025 |
| UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback | Nov 05, 2025 |
| ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation | Nov 05, 2025 |
| PHUMA: Physically-Grounded Humanoid Locomotion Dataset | Nov 05, 2025 |
| UniREditBench: A Unified Reasoning-based Image Editing Benchmark | Nov 05, 2025 |
| World Simulation with Video Foundation Models for Physical AI | Nov 05, 2025 |
| ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning | Nov 04, 2025 |
| INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats | Nov 04, 2025 |
| Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning | Nov 04, 2025 |
| The End of Manual Decoding: Towards Truly End-to-End Language Models | Nov 01, 2025 |
| Kimi Linear: An Expressive, Efficient Attention Architecture | Nov 01, 2025 |
| Surfer 2: The Next Generation of Cross-Platform Computer Use Agents | Nov 01, 2025 |
| Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | Nov 01, 2025 |
| The Quest for Generalizable Motion Generation: Data, Model, and Evaluation | Nov 01, 2025 |
| Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations | Oct 29, 2025 |
| Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning | Oct 24, 2025 |
| BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping | Oct 24, 2025 |
| LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts | Oct 24, 2025 |
| Language Models are Injective and Hence Invertible | Oct 24, 2025 |
| GigaBrain-0: A World Model-Powered Vision-Language-Action Model | Oct 24, 2025 |
| LightMem: Lightweight and Efficient Memory-Augmented Generation | Oct 23, 2025 |
| Efficient Long-context Language Model Training by Core Attention Disaggregation | Oct 23, 2025 |
| World-in-World: World Models in a Closed-Loop World | Oct 23, 2025 |
| UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation | Oct 23, 2025 |
| Chem-R: Learning to Reason as a Chemist | Oct 23, 2025 |
| MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation | Oct 23, 2025 |
| Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs | Oct 23, 2025 |
| Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model | Oct 23, 2025 |
| IF-VidCap: Can Video Caption Models Follow Instructions? | Oct 23, 2025 |
| DeepAnalyze: Agentic Large Language Models for Autonomous Data Science | Oct 22, 2025 |
| PICABench: How Far Are We from Physically Realistic Image Editing? | Oct 22, 2025 |
| Glyph: Scaling Context Windows via Visual-Text Compression | Oct 22, 2025 |
| FineVision: Open Data Is All You Need | Oct 22, 2025 |
| TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model | Oct 22, 2025 |
| Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation | Oct 22, 2025 |
| When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling | Oct 22, 2025 |
| A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning | Oct 21, 2025 |
| OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | Oct 21, 2025 |
| NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks | Oct 21, 2025 |
| Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs | Oct 21, 2025 |
| Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset | Oct 21, 2025 |
| Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery | Oct 21, 2025 |
| Latent Diffusion Model without Variational Autoencoder | Oct 21, 2025 |
| When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA | Oct 18, 2025 |
| Agentic Entropy-Balanced Policy Optimization | Oct 18, 2025 |
| WithAnyone: Towards Controllable and ID Consistent Image Generation | Oct 18, 2025 |
| AI for Service: Proactive Assistance with AI Glasses | Oct 18, 2025 |
| From Pixels to Words -- Towards Native Vision-Language Primitives at Scale | Oct 18, 2025 |
| ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints | Oct 18, 2025 |
| Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents | Oct 18, 2025 |
| LaSeR: Reinforcement Learning with Last-Token Self-Rewarding | Oct 18, 2025 |
| TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar | Oct 18, 2025 |
| BitNet Distillation | Oct 18, 2025 |
| Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model | Oct 16, 2025 |
| Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training | Oct 16, 2025 |
| DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation | Oct 16, 2025 |
| Scaling Language-Centric Omnimodal Representation Learning | Oct 16, 2025 |
| Robot Learning: A Tutorial | Oct 16, 2025 |
| Detect Anything via Next Point Prediction | Oct 16, 2025 |
| A Survey of Vibe Coding with Large Language Models | Oct 16, 2025 |
| FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution | Oct 16, 2025 |
| Dr.LLM: Dynamic Layer Routing in LLMs | Oct 16, 2025 |
| Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models | Oct 16, 2025 |
| QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs | Oct 15, 2025 |
| Diffusion Transformers with Representation Autoencoders | Oct 15, 2025 |
| OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs | Oct 15, 2025 |
| Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States | Oct 15, 2025 |
| Spotlight on Token Perception for Multimodal Reinforcement Learning | Oct 15, 2025 |
| RLFR: Extending Reinforcement Learning for LLMs with Flow Environment | Oct 15, 2025 |
| DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training | Oct 15, 2025 |
| AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration | Oct 15, 2025 |
| InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models | Oct 15, 2025 |
| BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions | Oct 15, 2025 |
| D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI | Oct 14, 2025 |
| Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation | Oct 14, 2025 |
| TAG: Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling | Oct 14, 2025 |
| AutoPR: Let's Automate Your Academic Promotion! | Oct 14, 2025 |
| Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs | Oct 14, 2025 |
| BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities | Oct 14, 2025 |
| StreamingVLM: Real-Time Understanding for Infinite Video Streams | Oct 14, 2025 |
| Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels | Oct 14, 2025 |
| BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution | Oct 14, 2025 |
| R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? | Oct 14, 2025 |
| Agent Learning via Early Experience | Oct 11, 2025 |
| MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization | Oct 11, 2025 |
| MemMamba: Rethinking Memory Patterns in State Space Model | Oct 11, 2025 |
| UniVideo: Unified Understanding, Generation, and Editing for Videos | Oct 11, 2025 |
| From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning | Oct 11, 2025 |
| When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs | Oct 11, 2025 |
| Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning | Oct 11, 2025 |
| VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning | Oct 11, 2025 |
| The Alignment Waltz: Jointly Training Agents to Collaborate for Safety | Oct 11, 2025 |
| Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense | Oct 11, 2025 |
| Cache-to-Cache: Direct Semantic Communication Between Large Language Models | Oct 10, 2025 |
| Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer | Oct 10, 2025 |
| Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding | Oct 10, 2025 |
|
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
|
Oct 10, 2025 |
|
MATRIX: Mask Track Alignment for Interaction-aware Video Generation
|
Oct 10, 2025 |
|
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
|
Oct 10, 2025 |
|
Vibe Checker: Aligning Code Evaluation with Human Preference
|
Oct 10, 2025 |
|
Less is More: Recursive Reasoning with Tiny Networks
|
Oct 09, 2025 |
|
TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
|
Oct 09, 2025 |
|
Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
|
Oct 09, 2025 |
|
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
|
Oct 09, 2025 |
|
Fast-dLLM v2: Efficient Block-Diffusion LLM
|
Oct 09, 2025 |
|
CoDA: Coding LM via Diffusion Adaptation
|
Oct 09, 2025 |
|
Drax: Speech Recognition with Discrete Flow Matching
|
Oct 09, 2025 |
|
Paper2Video: Automatic Video Generation from Scientific Papers
|
Oct 08, 2025 |
|
MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information
|
Oct 08, 2025 |
|
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
|
Oct 08, 2025 |
|
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
|
Oct 08, 2025 |
|
Imperceptible Jailbreaking against Large Language Models
|
Oct 08, 2025 |
|
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
|
Oct 08, 2025 |
|
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
|
Oct 08, 2025 |
|
Optimal Scaling Needs Optimal Norm
|
Oct 08, 2025 |
|
Apriel-1.5-15b-Thinker
|
Oct 07, 2025 |
|
Large Reasoning Models Learn Better Alignment from Flawed Thinking
|
Oct 07, 2025 |
|
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
|
Oct 07, 2025 |
|
LongCodeZip: Compress Long Context for Code Language Models
|
Oct 04, 2025 |
|
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
|
Oct 04, 2025 |
|
ExGRPO: Learning to Reason from Experience
|
Oct 04, 2025 |
|
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
|
Oct 04, 2025 |
|
Interactive Training: Feedback-Driven Neural Network Optimization
|
Oct 04, 2025 |
|
ModernVBERT: Towards Smaller Visual Document Retrievers
|
Oct 04, 2025 |
|
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
|
Oct 04, 2025 |
|
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
|
Oct 03, 2025 |
|
GEM: A Gym for Agentic LLMs
|
Oct 03, 2025 |
|
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
|
Oct 03, 2025 |
|
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
|
Oct 03, 2025 |
|
PIPer: On-Device Environment Setup via Online Reinforcement Learning
|
Oct 03, 2025 |
|
SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
|
Oct 03, 2025 |
|
ACON: Optimizing Context Compression for Long-horizon LLM Agents
|
Oct 03, 2025 |
|
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
|
Oct 02, 2025 |
|
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
|
Oct 02, 2025 |
|
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
|
Oct 02, 2025 |
|
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
|
Oct 02, 2025 |
|
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
|
Oct 02, 2025 |
|
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
|
Oct 02, 2025 |
|
OceanGym: A Benchmark Environment for Underwater Embodied Agents
|
Oct 02, 2025 |
|
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
|
Oct 02, 2025 |
|
Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
|
Oct 02, 2025 |
|
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
|
Oct 02, 2025 |
|
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
|
Oct 01, 2025 |
|
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
|
Oct 01, 2025 |
|
Multiplayer Nash Preference Optimization
|
Oct 01, 2025 |
|
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
|
Oct 01, 2025 |
|
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
|
Oct 01, 2025 |
|
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
|
Oct 01, 2025 |
|
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
|
Oct 01, 2025 |
|
Democratizing AI scientists using ToolUniverse
|
Oct 01, 2025 |
|
Visual Jigsaw Post-Training Improves MLLMs
|
Oct 01, 2025 |
|
When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
|
Oct 01, 2025 |
|
LongLive: Real-time Interactive Long Video Generation
|
Sep 30, 2025 |
|
Quantile Advantage Estimation for Entropy-Safe Reasoning
|
Sep 30, 2025 |
|
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
|
Sep 30, 2025 |
|
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
|
Sep 30, 2025 |
|
ReviewScore: Misinformed Peer Review Detection with Large Language Models
|
Sep 30, 2025 |
|
Variational Reasoning for Language Models
|
Sep 30, 2025 |
|
Language Models Can Learn from Verbal Feedback Without Scalar Rewards
|
Sep 30, 2025 |
|
MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
|
Sep 30, 2025 |
|
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
|
Sep 30, 2025 |
|
No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
|
Sep 30, 2025 |
|
VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
|
Sep 27, 2025 |
|
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
|
Sep 27, 2025 |
|
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
|
Sep 27, 2025 |
|
Tree Search for LLM Agent Reinforcement Learning
|
Sep 27, 2025 |
|
Seedream 4.0: Toward Next-generation Multimodal Image Generation
|
Sep 27, 2025 |
|
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
|
Sep 27, 2025 |
|
AutoIntent: AutoML for Text Classification
|
Sep 27, 2025 |
|
Video models are zero-shot learners and reasoners
|
Sep 26, 2025 |
|
SIM-CoT: Supervised Implicit Chain-of-Thought
|
Sep 26, 2025 |
|
Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
|
Sep 25, 2025 |
|
Reinforcement Learning on Pre-Training Data
|
Sep 25, 2025 |
|
Do You Need Proprioceptive States in Visuomotor Policies?
|
Sep 25, 2025 |
|
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
|
Sep 25, 2025 |
|
LIMI: Less is More for Agency
|
Sep 24, 2025 |
|
Qwen3-Omni Technical Report
|
Sep 24, 2025 |
|
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
|
Sep 24, 2025 |
|
OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
|
Sep 24, 2025 |
|
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
|
Sep 24, 2025 |
|
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
|
Sep 23, 2025 |
|
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
|
Sep 23, 2025 |
|
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
|
Sep 23, 2025 |
|
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
|
Sep 20, 2025 |
|
FlowRL: Matching Reward Distributions for LLM Reasoning
|
Sep 20, 2025 |
|
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
|
Sep 20, 2025 |
|
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
|
Sep 20, 2025 |
|
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
|
Sep 20, 2025 |
|
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
|
Sep 20, 2025 |
|
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
|
Sep 19, 2025 |
|
SAIL-VL2 Technical Report
|
Sep 19, 2025 |
|
PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
|
Sep 19, 2025 |
|
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
|
Sep 18, 2025 |
|
Scaling Agents via Continual Pre-training
|
Sep 18, 2025 |
|
WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
|
Sep 18, 2025 |
|
Towards General Agentic Intelligence via Environment Scaling
|
Sep 18, 2025 |
|
WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
|
Sep 18, 2025 |
|
ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
|
Sep 18, 2025 |
|
Single-stream Policy Optimization
|
Sep 18, 2025 |
|
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
|
Sep 17, 2025 |
|
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
|
Sep 17, 2025 |
|
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
|
Sep 17, 2025 |
|
IntrEx: A Dataset for Modeling Engagement in Educational Conversations
|
Sep 16, 2025 |
|
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
|
Sep 16, 2025 |
|
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
|
Sep 13, 2025 |
|
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
|
Sep 13, 2025 |
|
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
|
Sep 13, 2025 |
|
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
|
Sep 13, 2025 |
|
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
|
Sep 13, 2025 |
|
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
|
Sep 13, 2025 |
|
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
|
Sep 13, 2025 |
|
Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
|
Sep 13, 2025 |
|
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
|
Sep 13, 2025 |
|
A Survey of Reinforcement Learning for Large Reasoning Models
|
Sep 12, 2025 |
|
RewardDance: Reward Scaling in Visual Generation
|
Sep 12, 2025 |
|
3D and 4D World Modeling: A Survey
|
Sep 12, 2025 |
|
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
|
Sep 12, 2025 |
|
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
|
Sep 11, 2025 |
|
Visual Representation Alignment for Multimodal Large Language Models
|
Sep 11, 2025 |
|
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
|
Sep 11, 2025 |
|
Reconstruction Alignment Improves Unified Multimodal Models
|
Sep 11, 2025 |
|
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
|
Sep 11, 2025 |
|
Reverse-Engineered Reasoning for Open-Ended Generation
|
Sep 10, 2025 |
|
Does DINOv3 Set a New Medical Vision Standard?
|
Sep 10, 2025 |
|
Symbolic Graphics Programming with Large Language Models
|
Sep 09, 2025 |
|
Set Block Decoding is a Language Model Inference Accelerator
|
Sep 09, 2025 |
|
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
|
Sep 06, 2025 |
|
From Editor to Dense Geometry Estimator
|
Sep 06, 2025 |
|
Towards a Unified View of Large Language Model Post-Training
|
Sep 06, 2025 |
|
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
|
Sep 06, 2025 |
|
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
|
Sep 06, 2025 |
|
Open Data Synthesis For Deep Research
|
Sep 05, 2025 |
|
Robix: A Unified Model for Robot Interaction, Reasoning and Planning
|
Sep 05, 2025 |
|
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
|
Sep 04, 2025 |
|
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
|
Sep 04, 2025 |
|
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
|
Sep 04, 2025 |
|
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
|
Sep 04, 2025 |
|
Baichuan-M2: Scaling Medical Capability with Large Verifier System
|
Sep 04, 2025 |
|
Kwai Keye-VL 1.5 Technical Report
|
Sep 04, 2025 |
|
Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
|
Sep 04, 2025 |
|
PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
|
Sep 03, 2025 |
|
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
|
Sep 02, 2025 |
|
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
|
Sep 02, 2025 |
|
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
|
Aug 28, 2025 |
|
VibeVoice Technical Report
|
Aug 28, 2025 |
|
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
|
Aug 28, 2025 |
|
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
|
Aug 28, 2025 |
|
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
|
Aug 28, 2025 |
|
Spacer: Towards Engineered Scientific Inspiration
|
Aug 28, 2025 |
|
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
|
Aug 28, 2025 |
|
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
|
Aug 27, 2025 |
|
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
|
Aug 27, 2025 |
|
MV-RAG: Retrieval Augmented Multiview Diffusion
|
Aug 27, 2025 |
|
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
|
Aug 26, 2025 |
|
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
|
Aug 26, 2025 |
|
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
|
Aug 26, 2025 |
|
Intern-S1: A Scientific Multimodal Foundation Model
|
Aug 23, 2025 |
|
Mobile-Agent-v3: Fundamental Agents for GUI Automation
|
Aug 23, 2025 |
|
Deep Think with Confidence
|
Aug 23, 2025 |
|
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
|
Aug 23, 2025 |
|
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
|
Aug 22, 2025 |
|
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
|
Aug 22, 2025 |
|
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
|
Aug 22, 2025 |
|
MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
|
Aug 22, 2025 |
|
Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
|
Aug 22, 2025 |
|
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
|
Aug 21, 2025 |
|
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
|
Aug 21, 2025 |
|
Prompt Orchestration Markup Language
|
Aug 21, 2025 |
|
Ovis2.5 Technical Report
|
Aug 20, 2025 |
|
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
|
Aug 20, 2025 |
|
4DNeX: Feed-Forward 4D Generative Modeling Made Easy
|
Aug 20, 2025 |
|
Next Visual Granularity Generation
|
Aug 20, 2025 |
|
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
|
Aug 20, 2025 |
|
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
|
Aug 20, 2025 |
|
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
|
Aug 20, 2025 |
|
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
|
Aug 20, 2025 |
|
SSRL: Self-Search Reinforcement Learning
|
Aug 19, 2025 |
|
DINOv3
|
Aug 19, 2025 |
|
Thyme: Think Beyond Images
|
Aug 19, 2025 |
|
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
|
Aug 19, 2025 |
|
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
|
Aug 19, 2025 |
|
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
|
Aug 16, 2025 |
|
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
|
Aug 16, 2025 |
|
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
|
Aug 16, 2025 |
|
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
|
Aug 16, 2025 |
|
Story2Board: A Training-Free Approach for Expressive Storyboard Generation
|
Aug 15, 2025 |
|
Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
|
Aug 15, 2025 |
|
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
|
Aug 15, 2025 |
|
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
|
Aug 15, 2025 |
|
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
|
Aug 15, 2025 |
|
AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving
|
Aug 15, 2025 |
|
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
|
Aug 14, 2025 |
|
Complex Logical Instruction Generation
|
Aug 14, 2025 |
|
Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
|
Aug 14, 2025 |
|
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
|
Aug 14, 2025 |
|
ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
|
Aug 13, 2025 |
|
WideSearch: Benchmarking Agentic Broad Info-Seeking
|
Aug 13, 2025 |
|
Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
|
Aug 13, 2025 |
|
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
|
Aug 13, 2025 |
|
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
|
Aug 13, 2025 |
|
SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
|
Aug 13, 2025 |
|
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
|
Aug 13, 2025 |
|
MolmoAct: Action Reasoning Models that can Reason in Space
|
Aug 13, 2025 |
|
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
|
Aug 12, 2025 |
|
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
|
Aug 12, 2025 |
|
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
|
Aug 08, 2025 |
|
VeriGUI: Verifiable Long-Chain GUI Dataset
|
Aug 08, 2025 |
|
Efficient Agents: Building Effective Agents While Reducing Cost
|
Aug 08, 2025 |
|
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
|
Aug 08, 2025 |
|
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
|
Aug 08, 2025 |
|
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
|
Aug 08, 2025 |
|
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
|
Aug 08, 2025 |
|
Qwen-Image Technical Report
|
Aug 06, 2025 |
|
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
|
Aug 06, 2025 |
|
CellForge: Agentic Design of Virtual Cell Models
|
Aug 06, 2025 |
|
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
|
Aug 06, 2025 |
|
Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
|
Aug 06, 2025 |
|
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
|
Aug 05, 2025 |
|
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
|
Aug 05, 2025 |
|
PixNerd: Pixel Neural Field Diffusion
|
Aug 05, 2025 |
|
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
|
Aug 02, 2025 |
|
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
|
Aug 02, 2025 |
|
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
|
Aug 01, 2025 |
|
BANG: Dividing 3D Assets via Generative Exploded Dynamics
|
Aug 01, 2025 |
|
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
|
Aug 01, 2025 |
|
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
|
Jul 31, 2025 |
|
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
|
Jul 31, 2025 |
|
ChemDFM-R: A Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge
|
Jul 31, 2025 |
|
Agentic Reinforced Policy Optimization
|
Jul 30, 2025 |
|
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
|
Jul 30, 2025 |
|
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
|
Jul 30, 2025 |
|
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
|
Jul 30, 2025 |
|
SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
|
Jul 30, 2025 |
|
Reconstructing 4D Spatial Intelligence: A Survey
|
Jul 30, 2025 |
|
Deep Researcher with Test-Time Diffusion
|
Jul 29, 2025 |
|
$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention
|
Jul 26, 2025 |
|
Group Sequence Policy Optimization
|
Jul 26, 2025 |
|
MUR: Momentum Uncertainty guided Reasoning for Large Language Models
|
Jul 26, 2025 |
|
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
|
Jul 26, 2025 |
|
Pixels, Patterns, but No Poetry: To See The World like Humans
|
Jul 25, 2025 |
|
Yume: An Interactive World Generation Model
|
Jul 25, 2025 |
|
DesignLab: Designing Slides Through Iterative Detection and Correction
|
Jul 25, 2025 |
|
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
|
Jul 25, 2025 |
|
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
|
Jul 24, 2025 |
|
Step-Audio 2 Technical Report
|
Jul 24, 2025 |
|
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
|
Jul 24, 2025 |
|
Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
|
Jul 24, 2025 |
|
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
|
Jul 24, 2025 |
|
GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding
|
Jul 23, 2025 |
|
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
|
Jul 23, 2025 |
|
The Invisible Leash: Why RLVR May Not Escape Its Origin
|
Jul 23, 2025 |
|
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
|
Jul 23, 2025 |
|
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
|
Jul 23, 2025 |
|
GR-3 Technical Report
|
Jul 23, 2025 |
|
Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
|
Jul 23, 2025 |
|
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
|
Jul 23, 2025 |
|
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
|
Jul 23, 2025 |
|
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
|
Jul 22, 2025 |
|
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
|
Jul 22, 2025 |
|
A Survey of Context Engineering for Large Language Models
|
Jul 19, 2025 |
|
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
|
Jul 19, 2025 |
|
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning
|
Jul 19, 2025 |
|
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
|
Jul 19, 2025 |
|
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
|
Jul 19, 2025 |
|
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
|
Jul 19, 2025 |
|
RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
|
Jul 19, 2025 |
|
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
|
Jul 18, 2025 |
|
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
|
Jul 17, 2025 |
|
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes
|
Jul 17, 2025 |
|
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
|
Jul 16, 2025 |
|
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
|
Jul 16, 2025 |
|
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
|
Jul 16, 2025 |
|
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
|
Jul 16, 2025 |
|
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
|
Jul 16, 2025 |
|
Test-Time Scaling with Reflective Generative Model
|
Jul 15, 2025 |
|
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
|
Jul 15, 2025 |
|
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
|
Jul 15, 2025 |
|
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
|
Jul 15, 2025 |
|
KV Cache Steering for Inducing Reasoning in Small Language Models
|
Jul 15, 2025 |
|
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
|
Jul 15, 2025 |
|
Neural-Driven Image Editing
|
Jul 15, 2025 |
|
Scaling RL to Long Videos
|
Jul 12, 2025 |
|
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
|
Jul 12, 2025 |
|
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
|
Jul 12, 2025 |
|
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
|
Jul 12, 2025 |
|
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
|
Jul 12, 2025 |
|
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
|
Jul 12, 2025 |
|
PyVision: Agentic Vision with Dynamic Tooling
|
Jul 12, 2025 |
|
4KAgent: Agentic Any Image to 4K Super-Resolution
|
Jul 11, 2025 |
|
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
|
Jul 11, 2025 |
|
Perception-Aware Policy Optimization for Multimodal Reasoning
|
Jul 11, 2025 |
|
MIRIX: Multi-Agent Memory System for LLM-Based Agents
|
Jul 11, 2025 |
|
Rethinking Verification for LLM Code Generation: From Generation to Testing
|
Jul 11, 2025 |
|
SingLoRA: Low Rank Adaptation Using a Single Matrix
|
Jul 10, 2025 |
|
A Survey on Latent Reasoning
|
Jul 10, 2025 |
|
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
|
Jul 10, 2025 |
|
How to Train Your LLM Web Agent: A Statistical Diagnosis
|
Jul 10, 2025 |
|
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
|
Jul 10, 2025 |
|
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
|
Jul 10, 2025 |
|
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
|
Jul 10, 2025 |
|
MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
|
Jul 10, 2025 |
|
MemOS: A Memory OS for AI System
|
Jul 09, 2025 |
|
Should We Still Pretrain Encoders with Masked Language Modeling?
|
Jul 09, 2025 |
|
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
|
Jul 09, 2025 |
|
4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
|
Jul 09, 2025 |
|
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
|
Jul 09, 2025 |
|
Pre-Trained Policy Discriminators are General Reward Models
|
Jul 09, 2025 |
|
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
|
Jul 09, 2025 |
|
WebSailor: Navigating Super-human Reasoning for Web Agent
|
Jul 05, 2025 |
|
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
|
Jul 05, 2025 |
|
Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
|
Jul 05, 2025 |
|
IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction
|
Jul 05, 2025 |
|
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
|
Jul 05, 2025 |
|
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
|
Jul 05, 2025 |
|
Kwai Keye-VL Technical Report
|
Jul 04, 2025 |
|
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
|
Jul 04, 2025 |
|
Depth Anything at Any Condition
|
Jul 04, 2025 |
|
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
|
Jul 04, 2025 |
|
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
|
Jul 03, 2025 |
|
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
|
Jul 03, 2025 |
|
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
|
Jul 03, 2025 |
|
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
|
Jul 03, 2025 |
|
Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
|
Jul 03, 2025 |
|
Ovis-U1 Technical Report
|
Jul 02, 2025 |
|
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
|
Jul 02, 2025 |
|
VMoBA: Mixture-of-Block Attention for Video Diffusion Models
|
Jul 02, 2025 |
|
Calligrapher: Freestyle Text Image Customization
|
Jul 02, 2025 |
|
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
|
Jul 01, 2025 |
|
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
|
Jul 01, 2025 |
|
XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
|
Jul 01, 2025 |
|
Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback
|
Jun 17, 2025 |
|
Effective Red-Teaming of Policy-Adherent Agents
|
Jun 17, 2025 |
|
Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
|
Jun 17, 2025 |
|
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
|
Jun 14, 2025 |
|
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
|
Jun 14, 2025 |
|
Text-Aware Image Restoration with Diffusion Models
|
Jun 14, 2025 |
|
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
|
Jun 14, 2025 |
|
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
|
Jun 14, 2025 |
|
Discrete Audio Tokens: More Than a Survey!
|
Jun 14, 2025 |
|
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
|
Jun 13, 2025 |
|
Seedance 1.0: Exploring the Boundaries of Video Generation Models
|
Jun 13, 2025 |
|
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
|
Jun 13, 2025 |
|
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
|
Jun 13, 2025 |
|
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
|
Jun 13, 2025 |
|
PlayerOne: Egocentric World Simulator
|
Jun 13, 2025 |
|
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
|
Jun 13, 2025 |
|
Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models
|
Jun 12, 2025 |
|
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
|
Jun 12, 2025 |
|
RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
|
Jun 12, 2025 |
|
Reinforcement Pre-Training
|
Jun 11, 2025 |
|
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
|
Jun 11, 2025 |
|
MiniCPM4: Ultra-Efficient LLMs on End Devices
|
Jun 11, 2025 |
|
SpatialLM: Training Large Language Models for Structured Indoor Modeling
|
Jun 11, 2025 |
|
Image Reconstruction as a Tool for Feature Analysis
|
Jun 11, 2025 |
|
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
|
Jun 11, 2025 |
|
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
|
Jun 10, 2025 |
|
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
|
Jun 10, 2025 |
|
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
|
Jun 10, 2025 |
|
Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
|
Jun 10, 2025 |
|
SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
|
Jun 07, 2025 |
|
ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
|
Jun 07, 2025 |
|
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
|
Jun 07, 2025 |
|
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
|
Jun 07, 2025 |
|
Video World Models with Long-term Spatial Memory
|
Jun 07, 2025 |
|
Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights
|
Jun 07, 2025 |
|
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
|
Jun 07, 2025 |
|
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
|
Jun 07, 2025 |
|
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
|
Jun 07, 2025 |
|
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
|
Jun 07, 2025 |
|
MiMo-VL Technical Report
|
Jun 06, 2025 |
|
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
|
Jun 06, 2025 |
|
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
|
Jun 06, 2025 |
|
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
|
Jun 06, 2025 |
|
A Controllable Examination for Long-Context Language Models
|
Jun 06, 2025 |
|
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
|
Jun 06, 2025 |
|
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
|
Jun 06, 2025 |
|
SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
|
Jun 06, 2025 |
|
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
|
Jun 05, 2025 |
|
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
|
Jun 05, 2025 |
|
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
|
Jun 05, 2025 |
|
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
|
Jun 05, 2025 |
|
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
|
Jun 05, 2025 |
|
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
|
Jun 05, 2025 |
|
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
|
Jun 05, 2025 |
|
OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation
|
Jun 05, 2025 |
|
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
|
Jun 04, 2025 |
|
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
|
Jun 04, 2025 |
|
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
|
Jun 04, 2025 |
|
Taming LLMs by Scaling Learning Rates with Gradient Grouping
|
Jun 04, 2025 |
|
ARIA: Training Language Agents with Intention-Driven Reward Aggregation
|
Jun 04, 2025 |
|
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models
|
Jun 04, 2025 |
|
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
|
Jun 04, 2025 |
|
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
|
Jun 04, 2025 |
|
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding
|
Jun 04, 2025 |
|
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
|
Jun 04, 2025 |
|
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
|
Jun 03, 2025 |
|
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
|
Jun 03, 2025 |
|
Time Blindness: Why Video-Language Models Can't See What Humans Can?
|
Jun 03, 2025 |
|
HardTests: Synthesizing High-Quality Test Cases for LLM Coding
|
Jun 03, 2025 |
|
Large Language Models for Data Synthesis
|
Jun 03, 2025 |
|
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
|
Jun 03, 2025 |
|
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
|
Jun 03, 2025 |
|
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
|
Jun 03, 2025 |
|
Table-R1: Inference-Time Scaling for Table Reasoning
|
May 31, 2025 |
|
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
|
May 31, 2025 |
|
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
|
May 31, 2025 |
|
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
|
May 31, 2025 |
|
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
|
May 31, 2025 |
|
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
|
May 31, 2025 |
|
Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
|
May 31, 2025 |
|
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
|
May 30, 2025 |
|
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
|
May 30, 2025 |
|
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
|
May 30, 2025 |
|
Skywork Open Reasoner 1 Technical Report
|
May 30, 2025 |
|
Sherlock: Self-Correcting Reasoning in Vision-Language Models
|
May 30, 2025 |
|
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
|
May 30, 2025 |
|
SageAttention2++: A More Efficient Implementation of SageAttention2
|
May 30, 2025 |
|
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
|
May 30, 2025 |
|
Fostering Video Reasoning via Next-Event Prediction
|
May 30, 2025 |
|
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
|
May 30, 2025 |
|
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
|
May 29, 2025 |
|
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
|
May 29, 2025 |
|
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
|
May 29, 2025 |
|
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data
|
May 29, 2025 |
|
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
|
May 29, 2025 |
|
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
|
May 29, 2025 |
|
Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
|
May 29, 2025 |
|
Exploring the Latent Capacity of LLMs for One-Step Text Generation
|
May 29, 2025 |
|
Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence
|
May 29, 2025 |
|
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
|
May 29, 2025 |
|
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
|
May 28, 2025 |
|
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
|
May 28, 2025 |
|
Alchemist: Turning Public Text-to-Image Data into Generative Gold
|
May 28, 2025 |
|
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
|
May 28, 2025 |
|
PATS: Process-Level Adaptive Thinking Mode Switching
|
May 28, 2025 |
|
Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance
|
May 28, 2025 |
|
ARM: Adaptive Reasoning Model
|
May 28, 2025 |
|
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
|
May 28, 2025 |
|
Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
|
May 28, 2025 |
|
B-score: Detecting biases in large language models using response history
|
May 28, 2025 |
|
TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
|
May 27, 2025 |
|
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
|
May 27, 2025 |
|
Quartet: Native FP4 Training Can Be Optimal for Large Language Models
|
May 27, 2025 |
|
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models
|
May 27, 2025 |
|
One RL to See Them All: Visual Triple Unified Reinforcement Learning
|
May 27, 2025 |
|
Distilling LLM Agent into Small Models with Retrieval and Code Tools
|
May 27, 2025 |
|
QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
|
May 27, 2025 |
|
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
|
May 27, 2025 |
|
Scaling Image and Video Generation via Test-Time Evolutionary Search
|
May 27, 2025 |
|
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
|
May 27, 2025 |
|
NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
|
May 24, 2025 |
|
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
|
May 24, 2025 |
|
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
|
May 24, 2025 |
|
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
|
May 24, 2025 |
|
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
|
May 24, 2025 |
|
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
|
May 24, 2025 |
|
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
|
May 24, 2025 |
|
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
|
May 24, 2025 |
|
Scaling Diffusion Transformers Efficiently via $μ$P
|
May 24, 2025 |
|
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
|
May 23, 2025 |
|
MMaDA: Multimodal Large Diffusion Language Models
|
May 23, 2025 |
|
Scaling Law for Quantization-Aware Training
|
May 23, 2025 |
|
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
|
May 23, 2025 |
|
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
|
May 23, 2025 |
|
Efficient Agent Training for Computer Use
|
May 23, 2025 |
|
This Time is Different: An Observability Perspective on Time Series Foundation Models
|
May 23, 2025 |
|
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
|
May 23, 2025 |
|
Emerging Properties in Unified Multimodal Pretraining
|
May 22, 2025 |
|
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
|
May 22, 2025 |
|
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
|
May 22, 2025 |
|
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
|
May 22, 2025 |
|
Visual Agentic Reinforcement Fine-Tuning
|
May 22, 2025 |
|
Neurosymbolic Diffusion Models
|
May 22, 2025 |
|
Chain-of-Model Learning for Language Model
|
May 21, 2025 |
|
AdaptThink: Reasoning Models Can Learn When to Think
|
May 21, 2025 |
|
AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
|
May 21, 2025 |
|
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
|
May 21, 2025 |
|
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
|
May 21, 2025 |
|
Faster Video Diffusion with Trainable Sparse Attention
|
May 21, 2025 |
|
Thinkless: LLM Learns When to Think
|
May 21, 2025 |
|
Model Merging in Pre-training of Large Language Models
|
May 21, 2025 |
|
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
|
May 21, 2025 |
|
Qwen3 Technical Report
|
May 20, 2025 |
|
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
|
May 20, 2025 |
|
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
|
May 20, 2025 |
|
Visual Planning: Let's Think Only with Images
|
May 20, 2025 |
|
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
|
May 17, 2025 |
|
System Prompt Optimization with Meta-Learning
|
May 17, 2025 |
|
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
|
May 16, 2025 |
|
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
|
May 16, 2025 |
|
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
|
May 16, 2025 |
|
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
|
May 15, 2025 |
|
Seed1.5-VL Technical Report
|
May 14, 2025 |
|
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
|
May 14, 2025 |
|
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
|
May 14, 2025 |
|
Learning from Peers in Reasoning Models
|
May 14, 2025 |
|
Unified Continuous Generative Models
|
May 14, 2025 |
|
REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback
|
May 14, 2025 |
|
Bielik v3 Small: Technical Report
|
May 13, 2025 |
|
Bielik 11B v2 Technical Report
|
May 13, 2025 |
|
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
|
May 10, 2025 |
|
On Path to Multimodal Generalist: General-Level and General-Bench
|
May 10, 2025 |
|
Flow-GRPO: Training Flow Matching Models via Online RL
|
May 10, 2025 |
|
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
|
May 09, 2025 |
|
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
|
May 09, 2025 |
|
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
|
May 08, 2025 |
|
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
|
May 08, 2025 |
|
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
|
May 08, 2025 |
|
FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
|
May 08, 2025 |
|
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
|
May 07, 2025 |
|
RM-R1: Reward Modeling as Reasoning
|
May 07, 2025 |
|
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
|
May 07, 2025 |
|
Practical Efficiency of Muon for Pretraining
|
May 07, 2025 |
|
PixelHacker: Image Inpainting with Structural and Semantic Consistency
|
May 06, 2025 |
|
A Survey of Interactive Generative Video
|
May 03, 2025 |
|
DeepCritic: Deliberate Critique with Large Language Models
|
May 03, 2025 |
|
Sadeed: Advancing Arabic Diacritization Through Small Language Model
|
May 02, 2025 |
|
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
|
May 02, 2025 |
|
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
|
May 02, 2025 |
|
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
|
May 02, 2025 |
|
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
|
May 01, 2025 |
|
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
|
May 01, 2025 |
|
ReasonIR: Training Retrievers for Reasoning Tasks
|
May 01, 2025 |
|
The Leaderboard Illusion
|
May 01, 2025 |
|
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
|
May 01, 2025 |
|
RepText: Rendering Visual Text via Replicating
|
Apr 30, 2025 |
|
Towards Understanding Camera Motions in Any Video
|
Apr 29, 2025 |
|
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
|
Apr 29, 2025 |
|
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
|
Apr 29, 2025 |
|
Step1X-Edit: A Practical Framework for General Image Editing
|
Apr 26, 2025 |
|
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
|
Apr 26, 2025 |
|
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
|
Apr 26, 2025 |
|
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
|
Apr 26, 2025 |
|
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
|
Apr 25, 2025 |
|
Trillion 7B Technical Report
|
Apr 25, 2025 |
|
Tina: Tiny Reasoning Models via LoRA
|
Apr 25, 2025 |
|
I-Con: A Unifying Framework for Representation Learning
|
Apr 25, 2025 |
|
Kuwain 1.5B: An Arabic SLM via Language Injection
|
Apr 24, 2025 |
|
TTRL: Test-Time Reinforcement Learning
|
Apr 24, 2025 |
|
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
|
Apr 24, 2025 |
|
Describe Anything: Detailed Localized Image and Video Captioning
|
Apr 24, 2025 |
|
Learning Adaptive Parallel Reasoning with Language Models
|
Apr 24, 2025 |
|
Learning to Reason under Off-Policy Guidance
|
Apr 23, 2025 |
|
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
|
Apr 23, 2025 |
|
FlowReasoner: Reinforcing Query-Level Meta-Agents
|
Apr 23, 2025 |
|
ToolRL: Reward is All Tool Learning Needs
|
Apr 23, 2025 |
|
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
|
Apr 23, 2025 |
|
StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
|
Apr 23, 2025 |
|
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
|
Apr 22, 2025 |
|
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
|
Apr 22, 2025 |
|
NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
|
Apr 22, 2025 |
|
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
|
Apr 19, 2025 |
|
Antidistillation Sampling
|
Apr 19, 2025 |
|
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
|
Apr 19, 2025 |
|
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
|
Apr 19, 2025 |
|
WORLDMEM: Long-term Consistent World Simulation with Memory
|
Apr 19, 2025 |
|
A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
|
Apr 19, 2025 |
|
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
|
Apr 18, 2025 |
|
BitNet b1.58 2B4T Technical Report
|
Apr 18, 2025 |
|
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
|
Apr 18, 2025 |
|
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
|
Apr 17, 2025 |
|
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
|
Apr 17, 2025 |
|
How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
|
Apr 17, 2025 |
|
Heimdall: test-time scaling on the generative verification
|
Apr 17, 2025 |
|
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
|
Apr 17, 2025 |
|
TextArena
|
Apr 17, 2025 |
|
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
|
Apr 16, 2025 |
|
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
|
Apr 16, 2025 |
|
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
|
Apr 16, 2025 |
|
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
|
Apr 16, 2025 |
|
Iterative Self-Training for Code Generation via Reinforced Re-Ranking
|
Apr 16, 2025 |
|
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
|
Apr 15, 2025 |
|
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
|
Apr 15, 2025 |
|
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
|
Apr 15, 2025 |
|
Kimi-VL Technical Report
|
Apr 12, 2025 |
|
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
|
Apr 12, 2025 |
|
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
|
Apr 12, 2025 |
|
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
|
Apr 12, 2025 |
|
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
|
Apr 12, 2025 |
|
MM-IFEngine: Towards Multimodal Instruction Following
|
Apr 12, 2025 |
|
HoloPart: Generative 3D Part Amodal Segmentation
|
Apr 12, 2025 |
|
DDT: Decoupled Diffusion Transformer
|
Apr 11, 2025 |
|
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
|
Apr 11, 2025 |
|
A Unified Agentic Framework for Evaluating Conditional Image Generation
|
Apr 11, 2025 |
|
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
|
Apr 11, 2025 |
|
OmniSVG: A Unified Scalable Vector Graphics Generation Model
|
Apr 10, 2025 |
|
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
|
Apr 10, 2025 |
|
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
|
Apr 10, 2025 |
|
An Empirical Study of GPT-4o Image Generation Capabilities
|
Apr 10, 2025 |
|
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
|
Apr 10, 2025 |
|
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
|
Apr 10, 2025 |
|
SmolVLM: Redefining small and efficient multimodal models
|
Apr 09, 2025 |
|
One-Minute Video Generation with Test-Time Training
|
Apr 09, 2025 |
|
Rethinking Reflection in Pre-Training
|
Apr 09, 2025 |
|
URECA: Unique Region Caption Anything
|
Apr 09, 2025 |
|
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
|
Apr 09, 2025 |
|
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
|
Apr 08, 2025 |
|
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
|
Apr 05, 2025 |
|
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
|
Apr 05, 2025 |
|
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
|
Apr 05, 2025 |
|
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
|
Apr 05, 2025 |
|
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
|
Apr 05, 2025 |
|
WikiVideo: Article Generation from Multiple Videos
|
Apr 05, 2025 |
|
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
|
Apr 04, 2025 |
|
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
|
Apr 04, 2025 |
|
Understanding R1-Zero-Like Training: A Critical Perspective
|
Apr 04, 2025 |
|
Towards Physically Plausible Video Generation via VLM Planning
|
Apr 04, 2025 |
|
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
|
Apr 04, 2025 |
|
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
|
Apr 04, 2025 |
|
START: Self-taught Reasoner with Tools
|
Mar 08, 2025 |
|
Token-Efficient Long Video Understanding for Multimodal LLMs
|
Mar 08, 2025 |
|
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
|
Mar 08, 2025 |
|
EgoLife: Towards Egocentric Life Assistant
|
Mar 08, 2025 |
|
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
|
Mar 07, 2025 |
|
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
|
Mar 07, 2025 |
|
Process-based Self-Rewarding Language Models
|
Mar 07, 2025 |
|
Visual-RFT: Visual Reinforcement Fine-Tuning
|
Mar 05, 2025 |
|
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
|
Mar 05, 2025 |
|
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
|
Mar 05, 2025 |
|
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
|
Mar 04, 2025 |
|
Chain of Draft: Thinking Faster by Writing Less
|
Mar 04, 2025 |
|
Multi-Turn Code Generation Through Single-Step Rewards
|
Mar 04, 2025 |
|
Self-rewarding correction for mathematical reasoning
|
Mar 01, 2025 |
|
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
|
Mar 01, 2025 |
|
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
|
Mar 01, 2025 |
|
LongRoPE2: Near-Lossless LLM Context Window Scaling
|
Mar 01, 2025 |
|
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
|
Mar 01, 2025 |
|
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
|
Mar 01, 2025 |
|
UniTok: A Unified Tokenizer for Visual Generation and Understanding
|
Mar 01, 2025 |
|
NeoBERT: A Next-Generation BERT
|
Mar 01, 2025 |
|
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
|
Mar 01, 2025 |
|
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
|
Mar 01, 2025 |
|
GHOST 2.0: generative high-fidelity one shot transfer of heads
|
Feb 28, 2025 |
|
Kanana: Compute-efficient Bilingual Language Models
|
Feb 28, 2025 |
|
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
|
Feb 28, 2025 |
|
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
|
Feb 28, 2025 |
|
Language Models' Factuality Depends on the Language of Inquiry
|
Feb 28, 2025 |
|
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
|
Feb 28, 2025 |
|
Towards an AI co-scientist
|
Feb 28, 2025 |
|
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
|
Feb 28, 2025 |
|
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
|
Feb 28, 2025 |
|
Rank1: Test-Time Compute for Reranking in Information Retrieval
|
Feb 28, 2025 |
|
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
|
Feb 22, 2025 |
|
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
|
Feb 22, 2025 |
|
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
|
Feb 22, 2025 |
|
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
|
Feb 22, 2025 |
|
S*: Test Time Scaling for Code Generation
|
Feb 22, 2025 |
|
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
|
Feb 22, 2025 |
|
Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning
|
Feb 22, 2025 |
|
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
|
Feb 22, 2025 |
|
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information
|
Feb 22, 2025 |
|
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
|
Feb 22, 2025 |
|
Qwen2.5-VL Technical Report
|
Feb 21, 2025 |
|
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
|
Feb 21, 2025 |
|
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
|
Feb 21, 2025 |
|
MoM: Linear Sequence Modeling with Mixture-of-Memories
|
Feb 21, 2025 |
|
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
|
Feb 21, 2025 |
|
Craw4LLM: Efficient Web Crawling for LLM Pretraining
|
Feb 21, 2025 |
|
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
|
Feb 21, 2025 |
|
Small Models Struggle to Learn from Strong Reasoners
|
Feb 21, 2025 |
|
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
|
Feb 21, 2025 |
|
SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
|
Feb 21, 2025 |
|
Soundwave: Less is More for Speech-Text Alignment in LLMs
|
Feb 20, 2025 |
|
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
|
Feb 20, 2025 |
|
Continuous Diffusion Model for Language Modeling
|
Feb 20, 2025 |
|
Phantom: Subject-consistent video generation via cross-modal alignment
|
Feb 20, 2025 |
|
Rethinking Diverse Human Preference Learning through Principal Component Analysis
|
Feb 20, 2025 |
|
Magma: A Foundation Model for Multimodal AI Agents
|
Feb 20, 2025 |
|
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
|
Feb 20, 2025 |
|
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
|
Feb 20, 2025 |
|
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
|
Feb 20, 2025 |
|
You Do Not Fully Utilize Transformer's Representation Capacity
|
Feb 20, 2025 |
|
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
|
Feb 19, 2025 |
|
Learning Getting-Up Policies for Real-World Humanoid Robots
|
Feb 19, 2025 |
|
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
|
Feb 19, 2025 |
|
CRANE: Reasoning with constrained LLM generation
|
Feb 19, 2025 |
|
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
|
Feb 19, 2025 |
|
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
|
Feb 19, 2025 |
|
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
|
Feb 19, 2025 |
|
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
|
Feb 19, 2025 |
|
Region-Adaptive Sampling for Diffusion Transformers
|
Feb 18, 2025 |
|
Large Language Diffusion Models
|
Feb 18, 2025 |
|
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
|
Feb 18, 2025 |
|
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
|
Feb 18, 2025 |
|
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
|
Feb 18, 2025 |
|
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
|
Feb 18, 2025 |
|
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
|
Feb 18, 2025 |
|
Diverse Inference and Verification for Advanced Reasoning
|
Feb 18, 2025 |
|
Precise Parameter Localization for Textual Generation in Diffusion Models
|
Feb 18, 2025 |
|
DarwinLM: Evolutionary Structured Pruning of Large Language Models
|
Feb 18, 2025 |
|
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
|
Feb 15, 2025 |
|
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
|
Feb 15, 2025 |
|
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
|
Feb 15, 2025 |
|
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
|
Feb 15, 2025 |
|
Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights
|
Feb 15, 2025 |
|
An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
|
Feb 15, 2025 |
|
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
|
Feb 15, 2025 |
|
Exploring the Potential of Encoder-free Architectures in 3D LMMs
|
Feb 15, 2025 |
|
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
|
Feb 15, 2025 |
|
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
|
Feb 15, 2025 |
|
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
|
Feb 14, 2025 |
|
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
|
Feb 14, 2025 |
|
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
|
Feb 14, 2025 |
|
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
|
Feb 14, 2025 |
|
Distillation Scaling Laws
|
Feb 14, 2025 |
|
TransMLA: Multi-Head Latent Attention Is All You Need
|
Feb 14, 2025 |
|
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
|
Feb 14, 2025 |
|
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
|
Feb 14, 2025 |
|
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
|
Feb 14, 2025 |
|
Expect the Unexpected: FailSafe Long Context QA for Finance
|
Feb 13, 2025 |
|
Competitive Programming with Large Reasoning Models
|
Feb 13, 2025 |
|
Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models
|
Feb 13, 2025 |
|
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
|
Feb 13, 2025 |
|
Magic 1-For-1: Generating One Minute Video Clips within One Minute
|
Feb 13, 2025 |
|
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
|
Feb 13, 2025 |
|
Teaching Language Models to Critique via Reinforcement Learning
|
Feb 13, 2025 |
|
Scaling Pre-training to One Hundred Billion Data for Vision Language Models
|
Feb 13, 2025 |
|
Enhance-A-Video: Better Generated Video for Free
|
Feb 13, 2025 |
|
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
|
Feb 12, 2025 |
|
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
|
Feb 12, 2025 |
|
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
|
Feb 12, 2025 |
|
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
|
Feb 12, 2025 |
|
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
|
Feb 12, 2025 |
|
LM2: Large Memory Models
|
Feb 12, 2025 |
|
Matryoshka Quantization
|
Feb 12, 2025 |
|
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
|
Feb 12, 2025 |
|
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
|
Feb 12, 2025 |
|
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
|
Feb 12, 2025 |
|
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
|
Feb 11, 2025 |
|
Fast Video Generation with Sliding Tile Attention
|
Feb 11, 2025 |
|
Goku: Flow Based Video Generative Foundation Models
|
Feb 11, 2025 |
|
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
|
Feb 11, 2025 |
|
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
|
Feb 11, 2025 |
|
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
|
Feb 11, 2025 |
|
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
|
Feb 11, 2025 |
|
Agency Is Frame-Dependent
|
Feb 11, 2025 |
|
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
|
Feb 11, 2025 |
|
Generating Symbolic World Models via Test-time Scaling of Large Language Models
|
Feb 11, 2025 |
|
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
|
Feb 08, 2025 |
|
UltraIF: Advancing Instruction Following from the Wild
|
Feb 08, 2025 |
|
Great Models Think Alike and this Undermines AI Oversight
|
Feb 08, 2025 |
|
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
|
Feb 08, 2025 |
|
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
|
Feb 08, 2025 |
|
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
|
Feb 08, 2025 |
|
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
|
Feb 08, 2025 |
|
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
|
Feb 08, 2025 |
|
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
|
Feb 08, 2025 |
|
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
|
Feb 07, 2025 |
|
TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets
|
Feb 07, 2025 |
|
Demystifying Long Chain-of-Thought Reasoning in LLMs
|
Feb 07, 2025 |
|
LIMO: Less is More for Reasoning
|
Feb 07, 2025 |
|
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
|
Feb 07, 2025 |
|
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
|
Feb 07, 2025 |
|
On Teacher Hacking in Language Model Distillation
|
Feb 07, 2025 |
|
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
|
Feb 07, 2025 |
|
Jailbreaking with Universal Multi-Prompts
|
Feb 07, 2025 |
|
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
|
Feb 06, 2025 |
|
Inverse Bridge Matching Distillation
|
Feb 06, 2025 |
|
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
|
Feb 06, 2025 |
|
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
|
Feb 06, 2025 |
|
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
|
Feb 06, 2025 |
|
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
|
Feb 06, 2025 |
|
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
|
Feb 06, 2025 |
|
The Differences Between Direct Alignment Algorithms are a Blur
|
Feb 05, 2025 |
|
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
|
Feb 05, 2025 |
|
Process Reinforcement through Implicit Rewards
|
Feb 05, 2025 |
|
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
|
Feb 05, 2025 |
|
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
|
Feb 05, 2025 |
|
Preference Leakage: A Contamination Problem in LLM-as-a-judge
|
Feb 05, 2025 |
|
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
|
Feb 05, 2025 |
|
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
|
Feb 05, 2025 |
|
AIN: The Arabic INclusive Large Multimodal Model
|
Feb 05, 2025 |
|
s1: Simple test-time scaling
|
Feb 04, 2025 |
|
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
|
Feb 04, 2025 |
|
Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
|
Feb 04, 2025 |
|
PixelWorld: Towards Perceiving Everything as Pixels
|
Feb 04, 2025 |
|
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
|
Feb 04, 2025 |
|
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
|
Feb 04, 2025 |
|
Scalable-Softmax Is Superior for Attention
|
Feb 04, 2025 |
|
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
|
Feb 04, 2025 |
|
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
|
Feb 04, 2025 |
|
GuardReasoner: Towards Reasoning-based LLM Safeguards
|
Feb 01, 2025 |
|
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
|
Feb 01, 2025 |
|
Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
|
Feb 01, 2025 |
|
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
|
Feb 01, 2025 |
|
Large Language Models Think Too Fast To Explore Effectively
|
Feb 01, 2025 |
|
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
|
Feb 01, 2025 |
|
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
|
Feb 01, 2025 |
|
o3-mini vs DeepSeek-R1: Which One is Safer?
|
Feb 01, 2025 |
|
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
|
Feb 01, 2025 |
|
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
|
Jan 31, 2025 |
|
Atla Selene Mini: A General Purpose Evaluation Model
|
Jan 31, 2025 |
|
Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts
|
Jan 31, 2025 |
|
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
|
Jan 31, 2025 |
|
Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
|
Jan 31, 2025 |
|
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
|
Jan 31, 2025 |
|
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
|
Jan 31, 2025 |
|
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
|
Jan 30, 2025 |
|
Optimizing Large Language Model Training Using FP4 Quantization
|
Jan 30, 2025 |
|
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
|
Jan 30, 2025 |
|
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
|
Jan 30, 2025 |
|
Open Problems in Mechanistic Interpretability
|
Jan 30, 2025 |
|
Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
|
Jan 30, 2025 |
|
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
|
Jan 30, 2025 |
|
Histoires Morales: A French Dataset for Assessing Moral Alignment
|
Jan 30, 2025 |
|
Qwen2.5-1M Technical Report
|
Jan 29, 2025 |
|
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
|
Jan 29, 2025 |
|
Towards General-Purpose Model-Free Reinforcement Learning
|
Jan 29, 2025 |
|
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
|
Jan 29, 2025 |
|
iFormer: Integrating ConvNet and Transformer for Mobile Application
|
Jan 29, 2025 |
|
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
|
Jan 29, 2025 |
|
CodeMonkeys: Scaling Test-Time Compute for Software Engineering
|
Jan 29, 2025 |
|
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
|
Jan 29, 2025 |
|
Humanity's Last Exam
|
Jan 28, 2025 |
|
Chain-of-Retrieval Augmented Generation
|
Jan 28, 2025 |
|
Redundancy Principles for MLLMs Benchmarks
|
Jan 28, 2025 |
|
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
|
Jan 28, 2025 |
|
RL + Transformer = A General-Purpose Problem Solver
|
Jan 28, 2025 |
|
Relightable Full-Body Gaussian Codec Avatars
|
Jan 28, 2025 |
|
Question Answering on Patient Medical Records with Private Fine-Tuned LLMs
|
Jan 28, 2025 |
|
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
|
Jan 28, 2025 |
|
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
|
Jan 28, 2025 |
|
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
|
Jan 28, 2025 |
|
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
|
Jan 25, 2025 |
|
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
|
Jan 25, 2025 |
|
Improving Video Generation with Human Feedback
|
Jan 25, 2025 |
|
Temporal Preference Optimization for Long-Form Video Understanding
|
Jan 25, 2025 |
|
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
|
Jan 25, 2025 |
|
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
|
Jan 25, 2025 |
|
DiffuEraser: A Diffusion Model for Video Inpainting
|
Jan 25, 2025 |
|
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
|
Jan 25, 2025 |
|
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
|
Jan 25, 2025 |
|
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
|
Jan 25, 2025 |
|
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
|
Jan 24, 2025 |
|
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
|
Jan 24, 2025 |
|
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
|
Jan 24, 2025 |
|
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
|
Jan 24, 2025 |
|
Kimi k1.5: Scaling Reinforcement Learning with LLMs
|
Jan 24, 2025 |
|
Autonomy-of-Experts Models
|
Jan 24, 2025 |
|
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
|
Jan 24, 2025 |
|
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
|
Jan 24, 2025 |
|
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
|
Jan 24, 2025 |
|
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
|
Jan 24, 2025 |
|
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
|
Jan 23, 2025 |
|
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
|
Jan 23, 2025 |
|
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
|
Jan 23, 2025 |
|
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
|
Jan 23, 2025 |
|
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
|
Jan 23, 2025 |
|
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
|
Jan 23, 2025 |
|
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
|
Jan 23, 2025 |
|
Reasoning Language Models: A Blueprint
|
Jan 23, 2025 |
|
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
|
Jan 23, 2025 |
|
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
|
Jan 23, 2025 |
|
GameFactory: Creating New Games with Generative Interactive Videos
|
Jan 22, 2025 |
|
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
|
Jan 22, 2025 |
|
SEAL: Entangled White-box Watermarks on Low-Rank Adaptation
|
Jan 22, 2025 |
|
The Lessons of Developing Process Reward Models in Mathematical Reasoning
|
Jan 15, 2025 |
|
Tensor Product Attention Is All You Need
|
Jan 15, 2025 |
|
$\text{Transformer}^2$: Self-adaptive LLMs
|
Jan 15, 2025 |
|
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
|
Jan 15, 2025 |
|
VideoAuteur: Towards Long Narrative Video Generation
|
Jan 15, 2025 |
|
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
|
Jan 15, 2025 |
|
WebWalker: Benchmarking LLMs in Web Traversal
|
Jan 15, 2025 |
|
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
|
Jan 15, 2025 |
|
UnCommon Objects in 3D
|
Jan 15, 2025 |
|
VideoRAG: Retrieval-Augmented Generation over Video Corpus
|
Jan 14, 2025 |
|
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
|
Jan 14, 2025 |
|
Enabling Scalable Oversight via Self-Evolving Critic
|
Jan 14, 2025 |
|
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
|
Jan 14, 2025 |
|
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
|
Jan 14, 2025 |
|
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
|
Jan 14, 2025 |
|
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
|
Jan 14, 2025 |
|
The GAN is dead; long live the GAN! A Modern GAN Baseline
|
Jan 11, 2025 |
|
An Empirical Study of Autoregressive Pre-training from Videos
|
Jan 11, 2025 |
|
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
|
Jan 11, 2025 |
|
Entropy-Guided Attention for Private LLMs
|
Jan 11, 2025 |
|
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
|
Jan 11, 2025 |
|
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
|
Jan 11, 2025 |
|
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
|
Jan 11, 2025 |
|
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
|
Jan 11, 2025 |
|
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
|
Jan 10, 2025 |
|
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
|
Jan 10, 2025 |
|
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
|
Jan 10, 2025 |
|
Agent Laboratory: Using LLM Agents as Research Assistants
|
Jan 10, 2025 |
|
LLM4SR: A Survey on Large Language Models for Scientific Research
|
Jan 10, 2025 |
|
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
|
Jan 10, 2025 |
|
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
|
Jan 10, 2025 |
|
GeAR: Generation Augmented Retrieval
|
Jan 10, 2025 |
|
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
|
Jan 10, 2025 |
|
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
|
Jan 10, 2025 |
|
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
|
Jan 09, 2025 |
|
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
|
Jan 09, 2025 |
|
Cosmos World Foundation Model Platform for Physical AI
|
Jan 09, 2025 |
|
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
|
Jan 09, 2025 |
|
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
|
Jan 09, 2025 |
|
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
|
Jan 09, 2025 |
|
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
|
Jan 09, 2025 |
|
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
|
Jan 09, 2025 |
|
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
|
Jan 09, 2025 |
|
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
|
Jan 09, 2025 |
|
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
|
Jan 08, 2025 |
|
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
|
Jan 08, 2025 |
|
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
|
Jan 08, 2025 |
|
Personalized Graph-Based Retrieval for Large Language Models
|
Jan 08, 2025 |
|
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
|
Jan 08, 2025 |
|
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
|
Jan 08, 2025 |
|
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
|
Jan 08, 2025 |
|
TransPixar: Advancing Text-to-Video Generation with Transparency
|
Jan 08, 2025 |
|
AutoPresent: Designing Structured Visuals from Scratch
|
Jan 08, 2025 |
|
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
|
Jan 07, 2025 |
|
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
|
Jan 07, 2025 |
|
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
|
Jan 07, 2025 |
|
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
|
Jan 07, 2025 |
|
SDPO: Segment-Level Direct Preference Optimization for Social Agents
|
Jan 07, 2025 |
|
Graph Generative Pre-trained Transformer
|
Jan 07, 2025 |
|
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
|
Jan 07, 2025 |
|
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
|
Jan 07, 2025 |
|
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
|
Jan 04, 2025 |
|
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
|
Jan 04, 2025 |
|
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
|
Jan 04, 2025 |
|
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
|
Jan 04, 2025 |
|
ProgCo: Program Helps Self-Correction of Large Language Models
|
Jan 04, 2025 |
|
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
|
Jan 04, 2025 |
|
A3: Android Agent Arena for Mobile GUI Agents
|
Jan 04, 2025 |
|
MLLM-as-a-Judge for Image Safety without Human Labeling
|
Jan 04, 2025 |
|
Dynamic Scaling of Unit Tests for Code Reward Modeling
|
Jan 04, 2025 |
|
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
|
Jan 03, 2025 |
|
Xmodel-2 Technical Report
|
Jan 03, 2025 |
|
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
|
Jan 03, 2025 |
|
HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving
|
Jan 03, 2025 |
|
VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
|
Jan 03, 2025 |
|
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
|
Jan 02, 2025 |
|
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
|
Jan 02, 2025 |
|
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
|
Jan 01, 2025 |
|
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
|
Jan 01, 2025 |
|
Bringing Objects to Life: 4D generation from 3D objects
|
Jan 01, 2025 |
|
Efficiently Serving LLM Reasoning Programs with Certaindex
|
Jan 01, 2025 |
|
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
|
Jan 01, 2025 |
|
Edicho: Consistent Image Editing in the Wild
|
Jan 01, 2025 |
|
Facilitating large language model Russian adaptation with Learned Embedding Propagation
|
Jan 01, 2025 |
|
Training Software Engineering Agents and Verifiers with SWE-Gym
|
Jan 01, 2025 |
|
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
|
Jan 01, 2025 |
|
Slow Perception: Let's Perceive Geometric Figures Step-by-step
|
Jan 01, 2025 |
|
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
|
Dec 31, 2024 |
|
1.58-bit FLUX
|
Dec 31, 2024 |
|
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
|
Dec 31, 2024 |
|
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
|
Dec 31, 2024 |
|
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
|
Dec 31, 2024 |
|
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
|
Dec 31, 2024 |
|
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
|
Dec 31, 2024 |
|
The Superposition of Diffusion Models Using the Itô Density Estimator
|
Dec 31, 2024 |
|
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
|
Dec 31, 2024 |
|
CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
|
Dec 31, 2024 |
|
YuLan-Mini: An Open Data-efficient Language Model
|
Dec 28, 2024 |
|
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
|
Dec 28, 2024 |
|
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
|
Dec 28, 2024 |
|
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
|
Dec 28, 2024 |
|
DepthLab: From Partial to Complete
|
Dec 26, 2024 |
|
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
|
Dec 26, 2024 |
|
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
|
Dec 26, 2024 |
|
In Case You Missed It: ARC 'Challenge' Is Not That Challenging
|
Dec 26, 2024 |
|
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
|
Dec 26, 2024 |
|
SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
|
Dec 26, 2024 |
|
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
|
Dec 26, 2024 |
|
MotiF: Making Text Count in Image Animation with Motion Focal Loss
|
Dec 26, 2024 |
|
Bridging the Data Provenance Gap Across Text, Speech and Video
|
Dec 26, 2024 |
|
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
|
Dec 25, 2024 |
|
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
|
Dec 25, 2024 |
|
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
|
Dec 25, 2024 |
|
Diving into Self-Evolving Training for Multimodal Reasoning
|
Dec 25, 2024 |
|
Deliberation in Latent Space via Differentiable Cache Augmentation
|
Dec 25, 2024 |
|
Large Motion Video Autoencoding with Cross-modal Video VAE
|
Dec 25, 2024 |
|
OpenAI o1 System Card
|
Dec 25, 2024 |
|
Revisiting In-Context Learning with Long Context Language Models
|
Dec 25, 2024 |
|
Outcome-Refining Process Supervision for Code Generation
|
Dec 25, 2024 |
|
LearnLM: Improving Gemini for Learning
|
Dec 25, 2024 |
|
Parallelized Autoregressive Visual Generation
|
Dec 24, 2024 |
|
Offline Reinforcement Learning for LLM Multi-Step Reasoning
|
Dec 24, 2024 |
|
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
|
Dec 24, 2024 |
|
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
|
Dec 24, 2024 |
|
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
|
Dec 24, 2024 |
|
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
|
Dec 24, 2024 |
|
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
|
Dec 24, 2024 |
|
TRecViT: A Recurrent Video Transformer
|
Dec 24, 2024 |
|
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
|
Dec 24, 2024 |
|
Multi-LLM Text Summarization
|
Dec 24, 2024 |
|
Qwen2.5 Technical Report
|
Dec 21, 2024 |
|
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
|
Dec 21, 2024 |
|
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
|
Dec 21, 2024 |
|
How to Synthesize Text Data without Model Collapse?
|
Dec 21, 2024 |
|
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
|
Dec 21, 2024 |
|
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
|
Dec 21, 2024 |
|
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
|
Dec 21, 2024 |
|
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
|
Dec 21, 2024 |
|
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
|
Dec 21, 2024 |
|
No More Adam: Learning Rate Scaling at Initialization is All You Need
|
Dec 20, 2024 |
|
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
|
Dec 20, 2024 |
|
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
|
Dec 20, 2024 |
|
AniDoc: Animation Creation Made Easier
|
Dec 20, 2024 |
|
FashionComposer: Compositional Fashion Image Generation
|
Dec 20, 2024 |
|
GUI Agents: A Survey
|
Dec 20, 2024 |
|
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
|
Dec 20, 2024 |
|
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
|
Dec 20, 2024 |
|
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
|
Dec 20, 2024 |
|
Are Your LLMs Capable of Stable Reasoning?
|
Dec 19, 2024 |
|
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
|
Dec 19, 2024 |
|
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
|
Dec 19, 2024 |
|
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
|
Dec 19, 2024 |
|
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
|
Dec 19, 2024 |
|
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
|
Dec 19, 2024 |
|
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
|
Dec 19, 2024 |
|
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
|
Dec 19, 2024 |
|
SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
|
Dec 19, 2024 |
|
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
|
Dec 19, 2024 |
|
Byte Latent Transformer: Patches Scale Better Than Tokens
|
Dec 18, 2024 |
|
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
|
Dec 18, 2024 |
|
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
|
Dec 18, 2024 |
|
BrushEdit: All-In-One Image Inpainting and Editing
|
Dec 18, 2024 |
|
ColorFlow: Retrieval-Augmented Image Sequence Colorization
|
Dec 18, 2024 |
|
Smaller Language Models Are Better Instruction Evolvers
|
Dec 18, 2024 |
|
Causal Diffusion Transformers for Generative Modeling
|
Dec 18, 2024 |
|
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
|
Dec 18, 2024 |
|
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
|
Dec 18, 2024 |
|
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
|
Dec 18, 2024 |
|
Apollo: An Exploration of Video Understanding in Large Multimodal Models
|
Dec 17, 2024 |
|
GenEx: Generating an Explorable World
|
Dec 17, 2024 |
|
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
|
Dec 17, 2024 |
|
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
|
Dec 17, 2024 |
|
Large Action Models: From Inception to Implementation
|
Dec 17, 2024 |
|
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
|
Dec 17, 2024 |
|
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
|
Dec 17, 2024 |
|
ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
|
Dec 17, 2024 |
|
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
|
Dec 17, 2024 |
|
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
|
Dec 17, 2024 |
|
Phi-4 Technical Report
|
Dec 14, 2024 |
|
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
|
Dec 14, 2024 |
|
Multimodal Latent Language Modeling with Next-Token Diffusion
|
Dec 14, 2024 |
|
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
|
Dec 14, 2024 |
|
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
|
Dec 14, 2024 |
|
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
|
Dec 14, 2024 |
|
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
|
Dec 14, 2024 |
|
JuStRank: Benchmarking LLM Judges for System Ranking
|
Dec 14, 2024 |
|
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
|
Dec 13, 2024 |
|
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
|
Dec 13, 2024 |
|
POINTS1.5: Building a Vision-Language Model towards Real World Applications
|
Dec 13, 2024 |
|
Learning Flow Fields in Attention for Controllable Person Image Generation
|
Dec 13, 2024 |
|
StyleMaster: Stylize Your Video with Artistic Generation and Translation
|
Dec 13, 2024 |
|
StreamChat: Chatting with Streaming Video
|
Dec 13, 2024 |
|
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
|
Dec 13, 2024 |
|
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
|
Dec 13, 2024 |
|
The BrowserGym Ecosystem for Web Agent Research
|
Dec 13, 2024 |
|
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
|
Dec 12, 2024 |
|
Hidden in the Noise: Two-Stage Robust Watermarking for Images
|
Dec 12, 2024 |
|
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
|
Dec 12, 2024 |
|
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
|
Dec 12, 2024 |
|
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
|
Dec 12, 2024 |
|
Mobile Video Diffusion
|
Dec 12, 2024 |
|
Granite Guardian
|
Dec 12, 2024 |
|
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
|
Dec 11, 2024 |
|
ProcessBench: Identifying Process Errors in Mathematical Reasoning
|
Dec 11, 2024 |
|
Training Large Language Models to Reason in a Continuous Latent Space
|
Dec 11, 2024 |
|
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
|
Dec 11, 2024 |
|
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
|
Dec 11, 2024 |
|
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
|
Dec 11, 2024 |
|
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
|
Dec 11, 2024 |
|
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
|
Dec 11, 2024 |
|
Robust Multi-bit Text Watermark with LLM-based Paraphrasers
|
Dec 11, 2024 |
|
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
|
Dec 11, 2024 |
|
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
|
Dec 10, 2024 |
|
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
|
Dec 10, 2024 |
|
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
|
Dec 10, 2024 |
|
APOLLO: SGD-like Memory, AdamW-level Performance
|
Dec 10, 2024 |
|
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
|
Dec 10, 2024 |
|
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
|
Dec 10, 2024 |
|
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
|
Dec 10, 2024 |
|
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
|
Dec 10, 2024 |
|
CompCap: Improving Multimodal Large Language Models with Composite Captions
|
Dec 10, 2024 |
|
VisionZip: Longer is Better but Not Necessary in Vision Language Models
|
Dec 08, 2024 |
|
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
|
Dec 08, 2024 |
|
NVILA: Efficient Frontier Visual Language Models
|
Dec 08, 2024 |
|
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
|
Dec 08, 2024 |
|
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
|
Dec 08, 2024 |
|
Evaluating Language Models as Synthetic Data Generators
|
Dec 08, 2024 |
|
A Noise is Worth Diffusion Guidance
|
Dec 08, 2024 |
|
Structured 3D Latents for Scalable and Versatile 3D Generation
|
Dec 08, 2024 |
|
Negative Token Merging: Image-based Adversarial Feature Guidance
|
Dec 08, 2024 |
|
MV-Adapter: Multi-view Consistent Image Generation Made Easy
|
Dec 08, 2024 |
|
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
|
Nov 28, 2024 |
|
Star Attention: Efficient LLM Inference over Long Sequences
|
Nov 28, 2024 |
|
Pathways on the Image Manifold: Image Editing via Video Generation
|
Nov 28, 2024 |
|
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
|
Nov 28, 2024 |
|
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
|
Nov 28, 2024 |
|
SketchAgent: Language-Driven Sequential Sketch Generation
|
Nov 28, 2024 |
|
TEXGen: a Generative Diffusion Model for Mesh Textures
|
Nov 28, 2024 |
|
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
|
Nov 28, 2024 |
|
Learning 3D Representations from Procedural 3D Programs
|
Nov 28, 2024 |
|
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
|
Nov 28, 2024 |
|
Material Anything: Generating Materials for Any 3D Object via Diffusion
|
Nov 27, 2024 |
|
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
|
Nov 27, 2024 |
|
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
|
Nov 27, 2024 |
|
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
|
Nov 27, 2024 |
|
MH-MoE: Multi-Head Mixture-of-Experts
|
Nov 27, 2024 |
|
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
|
Nov 27, 2024 |
|
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
|
Nov 27, 2024 |
|
Knowledge Transfer Across Modalities with Natural Language Supervision
|
Nov 27, 2024 |
|
One Diffusion to Generate Them All
|
Nov 27, 2024 |
|
VisualLens: Personalization through Visual History
|
Nov 27, 2024 |
|
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
|
Nov 26, 2024 |
|
Style-Friendly SNR Sampler for Style-Driven Generation
|
Nov 26, 2024 |
|
OminiControl: Minimal and Universal Control for Diffusion Transformer
|
Nov 26, 2024 |
|
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
|
Nov 26, 2024 |
|
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
|
Nov 26, 2024 |
|
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
|
Nov 26, 2024 |
|
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
|
Nov 26, 2024 |
|
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
|
Nov 26, 2024 |
|
MyTimeMachine: Personalized Facial Age Transformation
|
Nov 26, 2024 |
|
Novel View Extrapolation with Video Diffusion Priors
|
Nov 26, 2024 |
|
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
|
Nov 23, 2024 |
|
Multimodal Autoregressive Pre-training of Large Vision Encoders
|
Nov 23, 2024 |
|
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
|
Nov 23, 2024 |
|
Hymba: A Hybrid-head Architecture for Small Language Models
|
Nov 23, 2024 |
|
Natural Language Reinforcement Learning
|
Nov 23, 2024 |
|
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
|
Nov 23, 2024 |
|
Ultra-Sparse Memory Network
|
Nov 23, 2024 |
|
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
|
Nov 23, 2024 |
|
Stable Flow: Vital Layers for Training-Free Image Editing
|
Nov 23, 2024 |
|
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
|
Nov 23, 2024 |
|
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
|
Nov 22, 2024 |
|
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
|
Nov 22, 2024 |
|
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
|
Nov 22, 2024 |
|
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
|
Nov 22, 2024 |
|
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
|
Nov 22, 2024 |
|
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
|
Nov 22, 2024 |
|
Stylecodes: Encoding Stylistic Information For Image Generation
|
Nov 22, 2024 |
|
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
|
Nov 22, 2024 |
|
Loss-to-Loss Prediction: Scaling Laws for All Datasets
|
Nov 22, 2024 |
|
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
|
Nov 22, 2024 |
|
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
|
Nov 21, 2024 |
|
Continuous Speculative Decoding for Autoregressive Image Generation
|
Nov 21, 2024 |
|
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
|
Nov 21, 2024 |
|
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
|
Nov 21, 2024 |
|
Soft Robotic Dynamic In-Hand Pen Spinning
|
Nov 21, 2024 |
|
Building Trust: Foundations of Security, Safety and Transparency in AI
|
Nov 21, 2024 |
|
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
|
Nov 21, 2024 |
|
Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
|
Nov 21, 2024 |
|
Generative World Explorer
|
Nov 20, 2024 |
|
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
|
Nov 20, 2024 |
|
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
|
Nov 20, 2024 |
|
AnimateAnything: Consistent and Controllable Animation for Video Generation
|
Nov 20, 2024 |
|
Top-$nσ$: Not All Logits Are You Need
|
Nov 20, 2024 |
|
Drowning in Documents: Consequences of Scaling Reranker Inference
|
Nov 20, 2024 |
|
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
|
Nov 20, 2024 |
|
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
|
Nov 20, 2024 |
|
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
|
Nov 20, 2024 |
|
LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
|
Nov 20, 2024 |
|
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
|
Nov 19, 2024 |
|
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
|
Nov 19, 2024 |
|
Xmodel-1.5: An 1B-scale Multilingual LLM
|
Nov 19, 2024 |
|
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
|
Nov 16, 2024 |
|
MagicQuill: An Intelligent Interactive Image Editing System
|
Nov 16, 2024 |
|
Cut Your Losses in Large-Vocabulary Language Models
|
Nov 16, 2024 |
|
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
|
Nov 16, 2024 |
|
Sharingan: Extract User Action Sequence from Desktop Recordings
|
Nov 16, 2024 |
|
Hermes: A Large Language Model Framework on the Journey to Autonomous Networks
|
Nov 16, 2024 |
|
Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples
|
Nov 16, 2024 |
|
Direct Preference Optimization Using Sparse Feature-Level Constraints
|
Nov 15, 2024 |
|
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
|
Nov 15, 2024 |
|
Can sparse autoencoders be used to decompose and interpret steering vectors?
|
Nov 15, 2024 |
|
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation
|
Nov 15, 2024 |
|
SAMPart3D: Segment Any Part in 3D Objects
|
Nov 14, 2024 |
|
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
|
Nov 14, 2024 |
|
Stronger Models are NOT Stronger Teachers for Instruction Tuning
|
Nov 14, 2024 |
|
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
|
Nov 14, 2024 |
|
Scaling Properties of Diffusion Models for Perceptual Tasks
|
Nov 14, 2024 |
|
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
|
Nov 14, 2024 |
|
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
|
Nov 13, 2024 |
|
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
|
Nov 13, 2024 |
|
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
|
Nov 13, 2024 |
|
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
|
Nov 13, 2024 |
|
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
|
Nov 13, 2024 |
|
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
|
Nov 13, 2024 |
|
Watermark Anything with Localized Messages
|
Nov 13, 2024 |
|
Autoregressive Models in Vision: A Survey
|
Nov 13, 2024 |
|
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
|
Nov 12, 2024 |
|
Balancing Pipeline Parallelism with Vocabulary Parallelism
|
Nov 12, 2024 |
|
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
|
Nov 12, 2024 |
|
DELIFT: Data Efficient Language model Instruction Fine Tuning
|
Nov 12, 2024 |
|
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study
|
Nov 12, 2024 |
|
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
|
Nov 12, 2024 |
|
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
|
Nov 12, 2024 |
|
Improving the detection of technical debt in Java source code with an enriched dataset
|
Nov 12, 2024 |
|
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
|
Nov 09, 2024 |
|
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
|
Nov 09, 2024 |
|
BitNet a4.8: 4-bit Activations for 1-bit LLMs
|
Nov 09, 2024 |
|
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
|
Nov 09, 2024 |
|
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
|
Nov 09, 2024 |
|
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
|
Nov 09, 2024 |
|
Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model
|
Nov 09, 2024 |
|
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
|
Nov 09, 2024 |
|
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
|
Nov 09, 2024 |
|
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
|
Nov 09, 2024 |
|
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
|
Nov 08, 2024 |
|
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
|
Nov 08, 2024 |
|
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
|
Nov 08, 2024 |
|
Self-Consistency Preference Optimization
|
Nov 08, 2024 |
|
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
|
Nov 08, 2024 |
|
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
|
Nov 07, 2024 |
|
LLaMo: Large Language Model-based Molecular Graph Assistant
|
Nov 07, 2024 |
|
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
|
Nov 07, 2024 |
|
Controlling Language and Diffusion Models by Transporting Activations
|
Nov 07, 2024 |
|
Sample-Efficient Alignment for LLMs
|
Nov 07, 2024 |
|
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
|
Nov 07, 2024 |
|
Adaptive Length Image Tokenization via Recurrent Allocation
|
Nov 07, 2024 |
|
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
|
Nov 07, 2024 |
|
Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge
|
Nov 07, 2024 |
|
Inference Optimal VLMs Need Only One Visual Token but Larger Models
|
Nov 07, 2024 |
|
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
|
Nov 06, 2024 |
|
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
|
Nov 06, 2024 |
|
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
|
Nov 06, 2024 |
|
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
|
Nov 06, 2024 |
|
Training-free Regional Prompting for Diffusion Transformers
|
Nov 06, 2024 |
|
How Far is Video Generation from World Model: A Physical Law Perspective
|
Nov 06, 2024 |
|
Survey of Cultural Awareness in Language Models: Text and Beyond
|
Nov 06, 2024 |
|
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
|
Nov 06, 2024 |
|
GenXD: Generating Any 3D and 4D Scenes
|
Nov 06, 2024 |
|
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
|
Nov 06, 2024 |
|
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
|
Nov 05, 2024 |
|
Personalization of Large Language Models: A Survey
|
Nov 05, 2024 |
|
Constant Acceleration Flow
|
Nov 05, 2024 |
|
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
|
Nov 05, 2024 |
|
Randomized Autoregressive Visual Generation
|
Nov 05, 2024 |
|
Survey of User Interface Design and Interaction Techniques in Generative AI Applications
|
Nov 05, 2024 |
|
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
|
Nov 05, 2024 |
|
In-Context LoRA for Diffusion Transformers
|
Nov 05, 2024 |
|
Physics in Next-token Prediction
|
Nov 05, 2024 |
|
CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes
|
Nov 05, 2024 |
|
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
|
Nov 03, 2024 |
|
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
|
Nov 03, 2024 |
|
A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents
|
Nov 03, 2024 |
|
Language Models can Self-Lengthen to Generate Long Texts
|
Nov 03, 2024 |
|
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
|
Nov 03, 2024 |
|
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
|
Nov 03, 2024 |
|
SelfCodeAlign: Self-Alignment for Code Generation
|
Nov 03, 2024 |
|
Learning Video Representations without Natural Videos
|
Nov 03, 2024 |
|
AAAR-1.0: Assessing AI's Potential to Assist Research
|
Nov 03, 2024 |
|
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
|
Nov 03, 2024 |