DreamDojo introduces a foundation world model that learns to simulate diverse, dexterous robot tasks by training on an unprecedented 44,711 hours of egocentric human video data. The model achieves real-time inference at 10.81 FPS and demonstrates a strong correlation (Pearson r=0.995) between simulated and real-world policy evaluation success rates.
Drifting Models propose a new generative modeling paradigm that shifts iterative distribution matching to training time, enabling high-quality sample generation in a single forward pass. The method achieves an FID of 1.54 on ImageNet 256x256 using a single neural function evaluation in latent space and 1.61 in pixel space, outperforming previous single-step approaches and showing effectiveness in robotic control tasks.
The "First Proof" project introduced a new methodology with ten unique, unpublished research-level mathematics questions to assess AI's autonomous problem-solving capabilities. Preliminary tests indicate that current frontier AI systems struggle to generate correct proofs for these complex problems independently.
Researchers from Stanford University reveal that general-purpose Transformers can learn fundamental physical laws, rather than merely making accurate predictions, by incorporating three minimal inductive biases: spatial smoothness, spatial stability, and temporal locality. Their approach demonstrates a transition from 'Keplerian' curve-fitting to 'Newtonian' mechanistic understanding, particularly when the model's context length is constrained.
Researchers from UC Berkeley and independent collaborators developed Generative Latent Prior (GLP), a diffusion model trained on LLM activations to learn their complex, intrinsic distribution. This approach enhances activation steering by projecting off-manifold activations back to a coherent state, and improves concept isolation, achieving an average AUC of 0.87 for 1-D probing tasks on Llama8B, surpassing linear methods.
Meta FAIR researchers developed TinyLoRA, a parameter-efficient finetuning method that enables large language models to acquire complex mathematical reasoning skills by training as few as 13 parameters. This approach, when combined with reinforcement learning, achieved 91% accuracy on GSM8K, demonstrating a significant advancement in ultra-low-capacity model adaptation.
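TinyLoRA's exact parameterization is not described in the summary. As a point of reference, the sketch below (a generic rank-1 LoRA adapter in NumPy, all names hypothetical) shows that even rank 1 already requires d_in + d_out trainable parameters per layer, which illustrates how much additional sharing or tying a 13-parameter adapter must involve:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen pretrained weight W plus a low-rank update: y = x @ (W + alpha * A @ B)."""
    return x @ (W + alpha * (A @ B))

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 1
W = rng.standard_normal((d_in, d_out))   # frozen pretrained weight (not trained)
A = np.zeros((d_in, r))                  # trainable, zero-init so the adapter starts as a no-op
B = rng.standard_normal((r, d_out))      # trainable
trainable = A.size + B.size
print(trainable)  # 1024: rank-1 on a 512x512 layer still adds d_in + d_out parameters
```

Because A starts at zero, the adapted layer initially reproduces the frozen model exactly; only A and B receive gradients during finetuning.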
InftyThink+ presents a reinforcement learning framework to optimize iterative reasoning in large language models, addressing limitations in long-horizon tasks. This approach enhances reasoning accuracy on complex problems while simultaneously improving inference efficiency and accelerating RL training, achieving an average accuracy boost of 9.89 points and reducing latency by 29.20 seconds with an efficiency reward.
Researchers at the University of California, Merced, and the University of Waterloo developed Context Forcing, a framework for autoregressive video generation that achieves effective context lengths over 20 seconds. This method, leveraging a novel Slow-Fast Memory architecture and a two-stage distillation process, enables consistent minute-long video generation, showing a 2-10x improvement in effective context over prior methods.
Baichuan-M3 advances medical large language models beyond traditional question-answering, transforming them into active clinical decision-support partners. The model achieves state-of-the-art performance in interactive clinical reasoning, hallucination suppression, and diagnostic accuracy on benchmarks like ScanBench and HealthBench, often surpassing leading proprietary models and human baselines.
The paper "World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy" from the Show Lab at the National University of Singapore introduces a framework for efficient reinforcement learning (RL) of robotic policies by co-evolving a video world model and a VLA policy. This method enables substantial improvements in VLA policy success rates, achieving up to 50% in real-world tasks and over 97% in some simulated benchmarks, by minimizing the need for costly real-world interactions.
AIRS-Bench, a new benchmark suite, evaluates AI Research Agents on machine learning tasks by simulating the full scientific discovery workflow, free from data contamination. It reveals that current agents, while occasionally surpassing human SOTA on specific tasks, generally operate far below human expert levels, demonstrating significant headroom for improvement in autonomous scientific reasoning.
HuMI, a Humanoid Manipulation Interface, introduces a system for efficiently collecting robot-free whole-body demonstrations and a hierarchical policy learning framework. The approach enables humanoid robots to perform diverse, complex tasks involving full-body coordination, demonstrating a 3x increase in data collection efficiency and robust generalization to unseen environments and objects.
ByteDance AML introduces TokenMixer-Large, an advanced deep learning ranking model for industrial recommender systems, designed to scale efficiently up to multi-billion parameters. Through architectural innovations and comprehensive optimizations, the model achieves consistent online performance gains across Douyin's ad, e-commerce, and live streaming platforms, including a 2.98% increase in GMV and a 2.0% improvement in Advertiser Satisfaction Score.
Reinforced Attention Learning (RAL) is a post-training framework for Multimodal Large Language Models (MLLMs) that directly optimizes internal attention distributions instead of just output tokens. This approach improves visual grounding and multimodal alignment, achieving superior performance on diverse image and video question-answering benchmarks, including gains of +5.8% on V*Bench and +3.4% on NExTQA.
Researchers from Caltech, Stanford, and Carleton College systematically surveyed Large Language Model reasoning failures, proposing a two-axis taxonomy to classify shortcomings in informal, formal, and embodied reasoning. Their analysis attributes many failures to architectural constraints, training data biases, and insufficient real-world grounding, and discusses approaches for mitigating them.
This paper develops a rigorous framework that reconciles the second law of thermodynamics with the microscopic dynamics of closed quantum many-body systems. It introduces new quantum-mechanical definitions of thermal equilibrium, adiabatic operations, and entropy, rigorously demonstrating the emergence of Planck's principle and the law of increasing entropy.
CINESCENE introduces a framework for cinematic video generation that leverages implicit 3D scene representations to decouple static environments from dynamic subjects. This approach enables the synthesis of high-quality, scene-consistent videos featuring novel dynamic elements under user-specified camera trajectories, addressing challenges in traditional and AI-driven video production.
Maximum Likelihood Reinforcement Learning (MaxRL), developed by researchers primarily at Carnegie Mellon University, introduces a sampling-based framework that approximates maximum likelihood optimization for tasks with binary outcome feedback. MaxRL demonstrates superior scaling with increased computational resources, mitigates issues like pass@k degradation in Large Language Models (LLMs), and achieves up to 20x test-time scaling efficiency on mathematical reasoning benchmarks by generating more correct solutions.
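The summary does not spell out MaxRL's estimator. As a textbook illustration only (not the paper's algorithm), the difference between maximizing expected reward and maximizing the log-likelihood of success shows up in how sampled solutions are weighted: the gradient of log E[r] renormalizes credit over the successful samples, whereas plain REINFORCE weights each sample by its raw reward.

```python
def ml_weights(rewards):
    """Monte-Carlo weights induced by a log-likelihood-of-success objective:
    grad log E[r] ≈ sum_i (r_i / sum_j r_j) * grad log p(y_i),
    versus REINFORCE, which weights sample i by r_i / n.
    For binary rewards this spreads all credit evenly over correct samples."""
    total = sum(rewards)
    if total == 0:
        return [0.0] * len(rewards)  # no successes: zero gradient signal this batch
    return [r / total for r in rewards]

print(ml_weights([1, 0, 0, 1]))  # [0.5, 0.0, 0.0, 0.5]
```

With many samples and few successes, this renormalization keeps the per-success weight large, which is one generic way a likelihood-style objective can resist the distribution sharpening that degrades pass@k.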
F-GRPO, developed by T-Tech and Saint Petersburg Electrotechnical University "LETI", introduces a difficulty-aware scaling method to address distribution sharpening in Reinforcement Learning with Verifiable Rewards for Large Language Models. This approach significantly improves solution diversity (pass@k) and out-of-domain generalization while maintaining or improving single-attempt accuracy (pass@1), achieving performance comparable to using 4x more computational resources.
Researchers from Sun Yat-sen University, MBZUAI, and Yinwang Intelligent Technology developed GeoThinker, an active geometry integration framework for Multimodal Large Language Models (MLLMs). This framework allows MLLMs to selectively retrieve task-relevant geometric cues, achieving a new state-of-the-art average score of 62.23 on a diverse set of spatial intelligence benchmarks and demonstrating robust generalization to downstream tasks.
The SWIRL framework allows Large Language Models and Vision-Language Models to develop intrinsic world models by learning from unlabelled state-only sequences, treating actions as latent variables. This method achieves up to a 28% improvement in visual dynamics tasks and a 4.03 BLEU increase in textual tool-calling performance, demonstrating enhanced data efficiency and generalization across modalities.
DriveWorld-VLA introduces a unified latent-space world modeling framework that integrates Vision-Language-Action (VLA) models, enabling autonomous driving systems to perform action-conditioned "what-if" reasoning for proactive planning. The system demonstrates state-of-the-art performance on NAVSIMv1, NAVSIMv2, and nuScenes benchmarks, achieving high safety compliance and low collision rates.
WorldCompass is a reinforcement learning post-training framework that significantly improves interactive video-based world models' ability to accurately follow explicit actions and maintain visual quality over long sequences. It boosts composite action accuracy from approximately 20% to 55% and enhances overall visual fidelity across various generative tasks.
Researchers at Baidu developed ERNIE 5.0, a trillion-parameter unified autoregressive foundation model that natively supports multimodal understanding and generation across text, image, video, and audio. It achieves balanced performance comparable to or surpassing specialized baselines on a wide range of perception, reasoning, and generative tasks.
Researchers at Johns Hopkins University developed "Share," a Parameter-Efficient Continual Finetuning (PaCT) framework that dynamically updates a single, shared low-rank subspace for large pre-trained models. Operating without data replay or increased model size, Share achieves up to 100x parameter reduction and 281x memory savings while maintaining performance comparable to non-continual LoRA and demonstrating backward knowledge transfer in various tasks.
Moonshot AI's Kimi K2.5 advances multimodal agentic intelligence by jointly optimizing text and vision from early pre-training, demonstrating state-of-the-art performance across diverse benchmarks. Its novel Agent Swarm framework enables parallel task execution, reducing inference latency by up to 4.5x on complex agentic workloads.
A reinforcement learning algorithm called SeeUPO, developed by Tongyi Lab at Alibaba Group, offers convergence guarantees for training large language model (LLM) agents in multi-turn interactions. It delivers up to 54.6% relative performance improvements over baselines with stable training, operating as a critic-free method.
AI agents consistently overestimate their probability of success on complex, multi-step tasks, exhibiting a pervasive overconfidence ranging from a 38 to 55 percentage point gap between predicted and actual success rates. Explicitly prompting agents to seek flaws (adversarial framing) significantly improves the calibration of these self-assessments.
Researchers at the Technical University of Munich introduced DynaRetarget, a pipeline featuring a sampling-based trajectory optimization (SBTO) framework, to produce dynamically feasible humanoid loco-manipulation trajectories from human demonstrations. This method yields high-quality data that enhances reinforcement learning and facilitates zero-shot sim-to-real transfer to physical robots.
A speculative decoding framework utilizing a lightweight block diffusion model for parallel token drafting demonstrates over 6x lossless acceleration for large language model inference. This approach achieves up to 2.5x higher speedup than existing state-of-the-art methods by conditioning the drafter on target model context features.
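The block-diffusion drafter itself is not detailed in the summary. The sketch below shows only the generic verification step that lossless speculative decoding schemes share, in a simplified greedy-acceptance form (all names hypothetical): the target model checks the drafted tokens in one parallel pass, keeps the longest agreeing prefix, and substitutes its own token at the first mismatch, so the final output is identical to decoding with the target model alone.

```python
def speculative_step(draft_tokens, target_greedy):
    """Greedy-verification variant of speculative decoding.

    draft_tokens:  k tokens proposed by the cheap drafter.
    target_greedy: the target model's own greedy choice at each of those
                   k positions, computed in a single parallel forward pass.
    Returns the tokens actually emitted this step (lossless by construction).
    """
    accepted = []
    for d, t in zip(draft_tokens, target_greedy):
        if d == t:
            accepted.append(d)       # drafter agreed with the target: keep it
        else:
            accepted.append(t)       # first mismatch: emit the target's token and stop
            break
    return accepted

print(speculative_step([5, 7, 9, 2], [5, 7, 1, 2]))  # [5, 7, 1]
```

A full implementation also samples a bonus token when the entire draft is accepted, and uses a rejection-sampling rule rather than exact greedy match when decoding stochastically; both refinements are omitted here for brevity.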
The MBZUAI team developed MedMO, an open-source multimodal large language model, specifically post-trained for medical image understanding and robust visual grounding across diverse modalities. It achieves state-of-the-art performance in medical question answering, report generation, and spatial localization, leveraging a comprehensive dataset of over 26 million medical and biomedical samples.
The OneVision-Encoder framework introduces a codec-aligned sparsity principle for visual representation learning, drawing inspiration from video compression, to selectively process information-rich visual regions. This approach delivers superior performance across image, video, and document understanding benchmarks, requiring substantially fewer pretraining tokens while reducing patch processing by 75.0%-96.9% compared to dense baselines.
Google DeepMind researchers leveraged the AlphaEvolve framework, powered by frontier Large Language Models, to discover novel activation functions that explicitly enhance out-of-distribution generalization. The study found that integrating periodic components into activation functions improved performance on complex reasoning benchmarks and molecular property prediction tasks.
WorldArena introduces a unified benchmark for evaluating embodied world models, systematically assessing both perceptual fidelity and functional utility across 14 diverse models. This framework reveals a consistent perception-functionality gap, where high visual quality does not reliably translate to robust performance in practical embodied tasks, and provides a public resource for standardized evaluation.
MotionCrafter presents a video diffusion framework for joint dense 4D geometry and motion reconstruction from monocular video in a feed-forward manner. It achieves an average 38.64% improvement in geometry and 25.0% in motion reconstruction over prior methods by employing a novel canonical normalization strategy for its 4D VAE.
Ordered Action Tokenization (OAT) provides a learned autoencoder framework to discretize continuous robot actions into sequences that are highly compressed, fully decodable, and causally ordered. This approach from Harvard and Stanford enables superior performance and a flexible "anytime" action generation capability for autoregressive robot policies across various manipulation tasks.
Researchers from the University of Oxford introduce the Categorical Probability Preservation (CPP) Theorem, demonstrating how non-invertible generalized symmetries, when described by unitary fusion categories, preserve quantum probabilities. These symmetries are shown to act as linear isometries mapping states from an initial Hilbert space to an enlarged Hilbert space encompassing all possible outgoing twisted sectors, thereby functioning as trace-preserving quantum channels.