DreamDojo introduces a foundation world model that learns to simulate diverse, dexterous robot tasks by training on an unprecedented 44,711 hours of egocentric human video data. The model achieves real-time inference at 10.81 FPS and demonstrates a strong correlation (Pearson r=0.995) between simulated and real-world policy evaluation success rates.
Drifting Models propose a new generative modeling paradigm that shifts iterative distribution matching to training time, enabling high-quality sample generation in a single forward pass. The method achieves an FID of 1.54 on ImageNet 256x256 using a single neural function evaluation in latent space and 1.61 in pixel space, outperforming previous single-step approaches and showing effectiveness in robotic control tasks.
The "First Proof" project introduced a new methodology with ten unique, unpublished research-level mathematics questions to assess AI's autonomous problem-solving capabilities. Preliminary tests indicate that current frontier AI systems struggle to generate correct proofs for these complex problems independently.
Researchers from Stanford University reveal that general-purpose Transformers can learn fundamental physical laws, rather than merely making accurate predictions, by incorporating three minimal inductive biases: spatial smoothness, spatial stability, and temporal locality. Their approach demonstrates a transition from 'Keplerian' curve-fitting to 'Newtonian' mechanistic understanding, particularly when the model's context length is constrained.
Researchers from UC Berkeley and independent collaborators developed Generative Latent Prior (GLP), a diffusion model trained on LLM activations to learn their complex, intrinsic distribution. This approach enhances activation steering by projecting off-manifold activations back to a coherent state, and improves concept isolation, achieving an average AUC of 0.87 for 1-D probing tasks on Llama8B, surpassing linear methods.
Meta FAIR researchers developed TinyLoRA, a parameter-efficient finetuning method that enables large language models to acquire complex mathematical reasoning skills by training as few as 13 parameters. This approach, when combined with reinforcement learning, achieved 91% accuracy on GSM8K, demonstrating a significant advancement in ultra-low-capacity model adaptation.
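TinyLoRA's exact parameterization is not described in the summary. As a point of reference, the sketch below (a generic rank-1 LoRA adapter in NumPy, all names hypothetical) shows that even rank 1 already requires d_in + d_out trainable parameters per layer, which illustrates how much additional sharing or tying a 13-parameter adapter must involve:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen pretrained weight W plus a low-rank update: y = x @ (W + alpha * A @ B)."""
    return x @ (W + alpha * (A @ B))

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 1
W = rng.standard_normal((d_in, d_out))   # frozen pretrained weight (not trained)
A = np.zeros((d_in, r))                  # trainable, zero-init so the adapter starts as a no-op
B = rng.standard_normal((r, d_out))      # trainable
trainable = A.size + B.size
print(trainable)  # 1024: rank-1 on a 512x512 layer still adds d_in + d_out parameters
```

Because A starts at zero, the adapted layer initially reproduces the frozen model exactly; only A and B receive gradients during finetuning.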
InftyThink+ presents a reinforcement learning framework to optimize iterative reasoning in large language models, addressing limitations in long-horizon tasks. This approach enhances reasoning accuracy on complex problems while simultaneously improving inference efficiency and accelerating RL training, achieving an average accuracy boost of 9.89 points and reducing latency by 29.20 seconds with an efficiency reward.
Researchers at the University of California, Merced, and the University of Waterloo developed Context Forcing, a framework for autoregressive video generation that achieves effective context lengths over 20 seconds. This method, leveraging a novel Slow-Fast Memory architecture and a two-stage distillation process, enables consistent minute-long video generation, showing a 2-10x improvement in effective context over prior methods.
Baichuan-M3 advances medical large language models beyond traditional question-answering, transforming them into active clinical decision-support partners. The model achieves state-of-the-art performance in interactive clinical reasoning, hallucination suppression, and diagnostic accuracy on benchmarks like ScanBench and HealthBench, often surpassing leading proprietary models and human baselines.
The paper "World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy" from the Show Lab at the National University of Singapore introduces a framework for efficient reinforcement learning (RL) of robotic policies by co-evolving a video world model and a VLA policy. This method enables substantial improvements in VLA policy success rates, achieving up to 50% in real-world tasks and over 97% in some simulated benchmarks, by minimizing the need for costly real-world interactions.
AIRS-Bench, a new benchmark suite, evaluates AI Research Agents on machine learning tasks by simulating the full scientific discovery workflow, free from data contamination. It reveals that current agents, while occasionally surpassing human SOTA on specific tasks, generally operate far below human expert levels, demonstrating significant headroom for improvement in autonomous scientific reasoning.
HuMI, a Humanoid Manipulation Interface, introduces a system for efficiently collecting robot-free whole-body demonstrations and a hierarchical policy learning framework. The approach enables humanoid robots to perform diverse, complex tasks involving full-body coordination, demonstrating a 3x increase in data collection efficiency and robust generalization to unseen environments and objects.
ByteDance AML introduces TokenMixer-Large, an advanced deep learning ranking model for industrial recommender systems, designed to scale efficiently up to multi-billion parameters. Through architectural innovations and comprehensive optimizations, the model achieves consistent online performance gains across Douyin's ad, e-commerce, and live streaming platforms, including a 2.98% increase in GMV and a 2.0% improvement in Advertiser Satisfaction Score.
Reinforced Attention Learning (RAL) is a post-training framework for Multimodal Large Language Models (MLLMs) that directly optimizes internal attention distributions instead of just output tokens. This approach improves visual grounding and multimodal alignment, achieving superior performance on diverse image and video question-answering benchmarks, including gains of +5.8% on V*Bench and +3.4% on NExTQA.
Researchers from Caltech, Stanford, and Carleton College systematically surveyed Large Language Model reasoning failures, proposing a two-axis taxonomy to classify shortcomings in informal, formal, and embodied reasoning. Their analysis attributes many failures to architectural constraints, training data biases, and insufficient real-world grounding, and discusses approaches for mitigating them.
This paper develops a rigorous framework that reconciles the second law of thermodynamics with the microscopic dynamics of closed quantum many-body systems. It introduces new quantum-mechanical definitions of thermal equilibrium, adiabatic operations, and entropy, rigorously demonstrating the emergence of Planck's principle and the law of increasing entropy.
CINESCENE introduces a framework for cinematic video generation that leverages implicit 3D scene representations to decouple static environments from dynamic subjects. This approach enables the synthesis of high-quality, scene-consistent videos featuring novel dynamic elements under user-specified camera trajectories, addressing challenges in traditional and AI-driven video production.
Maximum Likelihood Reinforcement Learning (MaxRL), developed by researchers primarily at Carnegie Mellon University, introduces a sampling-based framework that approximates maximum likelihood optimization for tasks with binary outcome feedback. MaxRL demonstrates superior scaling with increased computational resources, mitigates issues like pass@k degradation in Large Language Models (LLMs), and achieves up to 20x test-time scaling efficiency on mathematical reasoning benchmarks by generating more correct solutions.
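The summary does not spell out MaxRL's estimator. As a textbook illustration only (not the paper's algorithm), the difference between maximizing expected reward and maximizing the log-likelihood of success shows up in how sampled solutions are weighted: the gradient of log E[r] renormalizes credit over the successful samples, whereas plain REINFORCE weights each sample by its raw reward.

```python
def ml_weights(rewards):
    """Monte-Carlo weights induced by a log-likelihood-of-success objective:
    grad log E[r] ≈ sum_i (r_i / sum_j r_j) * grad log p(y_i),
    versus REINFORCE, which weights sample i by r_i / n.
    For binary rewards this spreads all credit evenly over correct samples."""
    total = sum(rewards)
    if total == 0:
        return [0.0] * len(rewards)  # no successes: zero gradient signal this batch
    return [r / total for r in rewards]

print(ml_weights([1, 0, 0, 1]))  # [0.5, 0.0, 0.0, 0.5]
```

With many samples and few successes, this renormalization keeps the per-success weight large, which is one generic way a likelihood-style objective can resist the distribution sharpening that degrades pass@k.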
F-GRPO, developed by T-Tech and Saint Petersburg Electrotechnical University "LETI", introduces a difficulty-aware scaling method to address distribution sharpening in Reinforcement Learning with Verifiable Rewards for Large Language Models. This approach significantly improves solution diversity (pass@k) and out-of-domain generalization while maintaining or improving single-attempt accuracy (pass@1), achieving performance comparable to using 4x more computational resources.
Researchers from Sun Yat-sen University, MBZUAI, and Yinwang Intelligent Technology developed GeoThinker, an active geometry integration framework for Multimodal Large Language Models (MLLMs). This framework allows MLLMs to selectively retrieve task-relevant geometric cues, achieving a new state-of-the-art average score of 62.23 on a diverse set of spatial intelligence benchmarks and demonstrating robust generalization to downstream tasks.
The SWIRL framework allows Large Language Models and Vision-Language Models to develop intrinsic world models by learning from unlabelled state-only sequences, treating actions as latent variables. This method achieves up to a 28% improvement in visual dynamics tasks and a 4.03 BLEU increase in textual tool-calling performance, demonstrating enhanced data efficiency and generalization across modalities.
DriveWorld-VLA introduces a unified latent-space world modeling framework that integrates Vision-Language-Action (VLA) models, enabling autonomous driving systems to perform action-conditioned "what-if" reasoning for proactive planning. The system demonstrates state-of-the-art performance on NAVSIMv1, NAVSIMv2, and nuScenes benchmarks, achieving high safety compliance and low collision rates.
WorldCompass is a reinforcement learning post-training framework that significantly improves interactive video-based world models' ability to accurately follow explicit actions and maintain visual quality over long sequences. It boosts composite action accuracy from approximately 20% to 55% and enhances overall visual fidelity across various generative tasks.
Researchers at Baidu developed ERNIE 5.0, a trillion-parameter unified autoregressive foundation model that natively supports multimodal understanding and generation across text, image, video, and audio. It achieves balanced performance comparable to or surpassing specialized baselines on a wide range of perception, reasoning, and generative tasks.
Researchers at Johns Hopkins University developed "Share," a Parameter-Efficient Continual Finetuning (PaCT) framework that dynamically updates a single, shared low-rank subspace for large pre-trained models. Operating without data replay or increased model size, Share achieves up to 100x parameter reduction and 281x memory savings while maintaining performance comparable to non-continual LoRA and demonstrating backward knowledge transfer in various tasks.
Moonshot AI's Kimi K2.5 advances multimodal agentic intelligence by jointly optimizing text and vision from early pre-training, demonstrating state-of-the-art performance across diverse benchmarks. Its novel Agent Swarm framework enables parallel task execution, reducing inference latency by up to 4.5x on complex agentic workloads.
A reinforcement learning algorithm called SeeUPO, developed by Tongyi Lab at Alibaba Group, offers convergence guarantees for training large language model (LLM) agents in multi-turn interactions. It delivers up to 54.6% relative performance improvements over baselines with stable training, operating as a critic-free method.
AI agents consistently overestimate their probability of success on complex, multi-step tasks, exhibiting a pervasive overconfidence ranging from a 38 to 55 percentage point gap between predicted and actual success rates. Explicitly prompting agents to seek flaws (adversarial framing) significantly improves the calibration of these self-assessments.
Researchers at the Technical University of Munich introduced DynaRetarget, a pipeline featuring a sampling-based trajectory optimization (SBTO) framework, to produce dynamically feasible humanoid loco-manipulation trajectories from human demonstrations. This method yields high-quality data that enhances reinforcement learning and facilitates zero-shot sim-to-real transfer to physical robots.
A speculative decoding framework utilizing a lightweight block diffusion model for parallel token drafting demonstrates over 6x lossless acceleration for large language model inference. This approach achieves up to 2.5x higher speedup than existing state-of-the-art methods by conditioning the drafter on target model context features.
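The block-diffusion drafter itself is not detailed in the summary. The sketch below shows only the generic verification step that lossless speculative decoding schemes share, in a simplified greedy-acceptance form (all names hypothetical): the target model checks the drafted tokens in one parallel pass, keeps the longest agreeing prefix, and substitutes its own token at the first mismatch, so the final output is identical to decoding with the target model alone.

```python
def speculative_step(draft_tokens, target_greedy):
    """Greedy-verification variant of speculative decoding.

    draft_tokens:  k tokens proposed by the cheap drafter.
    target_greedy: the target model's own greedy choice at each of those
                   k positions, computed in a single parallel forward pass.
    Returns the tokens actually emitted this step (lossless by construction).
    """
    accepted = []
    for d, t in zip(draft_tokens, target_greedy):
        if d == t:
            accepted.append(d)       # drafter agreed with the target: keep it
        else:
            accepted.append(t)       # first mismatch: emit the target's token and stop
            break
    return accepted

print(speculative_step([5, 7, 9, 2], [5, 7, 1, 2]))  # [5, 7, 1]
```

A full implementation also samples a bonus token when the entire draft is accepted, and uses a rejection-sampling rule rather than exact greedy match when decoding stochastically; both refinements are omitted here for brevity.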
The MBZUAI team developed MedMO, an open-source multimodal large language model, specifically post-trained for medical image understanding and robust visual grounding across diverse modalities. It achieves state-of-the-art performance in medical question answering, report generation, and spatial localization, leveraging a comprehensive dataset of over 26 million medical and biomedical samples.
The OneVision-Encoder framework introduces a codec-aligned sparsity principle for visual representation learning, drawing inspiration from video compression, to selectively process information-rich visual regions. This approach delivers superior performance across image, video, and document understanding benchmarks, requiring substantially fewer pretraining tokens while reducing patch processing by 75.0%-96.9% compared to dense baselines.
Google DeepMind researchers leveraged the AlphaEvolve framework, powered by frontier Large Language Models, to discover novel activation functions that explicitly enhance out-of-distribution generalization. The study found that integrating periodic components into activation functions improved performance on complex reasoning benchmarks and molecular property prediction tasks.
WorldArena introduces a unified benchmark for evaluating embodied world models, systematically assessing both perceptual fidelity and functional utility across 14 diverse models. This framework reveals a consistent perception-functionality gap, where high visual quality does not reliably translate to robust performance in practical embodied tasks, and provides a public resource for standardized evaluation.
MotionCrafter presents a video diffusion framework for joint dense 4D geometry and motion reconstruction from monocular video in a feed-forward manner. It achieves an average 38.64% improvement in geometry and 25.0% in motion reconstruction over prior methods by employing a novel canonical normalization strategy for its 4D VAE.
Ordered Action Tokenization (OAT) provides a learned autoencoder framework to discretize continuous robot actions into sequences that are highly compressed, fully decodable, and causally ordered. This approach from Harvard and Stanford enables superior performance and a flexible "anytime" action generation capability for autoregressive robot policies across various manipulation tasks.
Researchers from the University of Oxford introduce the Categorical Probability Preservation (CPP) Theorem, demonstrating how non-invertible generalized symmetries, when described by unitary fusion categories, preserve quantum probabilities. These symmetries are shown to act as linear isometries mapping states from an initial Hilbert space to an enlarged Hilbert space encompassing all possible outgoing twisted sectors, thereby functioning as trace-preserving quantum channels.