MolmoAct2, developed by the Allen Institute for AI and the University of Washington, is a fully open-source action reasoning model designed for real-world robot deployment, enhancing generalizability and efficiency. It incorporates an embodied reasoning VLM backbone and a continuous action expert, achieving up to 87.1% success on real-world DROID tasks with unseen objects and a 2.42x speedup in control rate compared to unoptimized inference.
This work introduces on-policy distillation, a post-training method for large language models that combines the on-policy relevance of reinforcement learning with dense, token-level feedback from a teacher model. The approach achieved 70% on the AIME'24 mathematical reasoning benchmark with Qwen3-8B and demonstrated a 30x cost reduction compared to off-policy distillation, while also recovering instruction-following abilities in personalized models.
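The core loop of on-policy distillation can be sketched in a few lines. This is a minimal toy sketch, not the paper's implementation: the function names, the use of reverse KL as the per-token signal, and the callable interfaces for the student and teacher are all assumptions made for illustration.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    # KL(student || teacher): penalises student mass on tokens the
    # teacher considers unlikely; zero when the distributions match.
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

def on_policy_distill_step(student, teacher, sample_from_student, prompt):
    """One on-policy distillation step: sample a completion from the
    *student* (on-policy data, as in RL), then score every token against
    the teacher's next-token distribution (dense, token-level feedback)."""
    tokens = sample_from_student(prompt)      # student's own trajectory
    losses = []
    for i in range(len(tokens)):
        s = student(prompt, tokens[:i])       # student distribution at step i
        t = teacher(prompt, tokens[:i])       # teacher distribution at step i
        losses.append(reverse_kl(s, t))       # per-token training signal
    return losses
```

The key contrast with off-policy distillation is the first line of the loop: the trajectory comes from the student itself, so every token-level loss is computed on states the student actually visits.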
Model Spec Midtraining (MSM) enhances the generalization of large language models by integrating an intermediate training phase that instills a deep understanding of a Model Spec's principles and values. This approach reduces agentic misalignment in out-of-distribution scenarios and improves the compute efficiency of alignment fine-tuning by up to 60x in low-sample regimes.
RLDX-1, developed by RLWRLD and KAIST, is a unified Vision-Language-Action (VLA) system for dexterous robotic manipulation that integrates motion awareness, long-term memory, and physical sensing. It demonstrates superior performance over existing frontier VLA models across diverse simulation benchmarks and real-world tasks, including complex humanoid operations and contact-rich scenarios, with a notable 97.8% success rate on LIBERO and significant improvements in functional capabilities.
ProgramBench introduces a new benchmark to evaluate large language models' capacity for holistic software development from scratch, providing only a compiled executable and its documentation. The evaluation revealed that current leading models achieve a 0.0% full resolution rate, struggling with high-level architectural design and system decomposition.
OpenSeeker-v2, developed by Shanghai Jiao Tong University, achieves state-of-the-art performance among 30B-scale search agents by exclusively using Supervised Fine-Tuning (SFT) on a meticulously curated dataset of high-difficulty trajectories. This approach demonstrates that sophisticated data quality can enable competitive results against models trained with more resource-intensive multi-stage pipelines.
ByteDance's Mamoda2.5 unifies multimodal understanding, image generation, and video generation/editing within a single Autoregressive–Diffusion framework, leveraging a fine-grained Mixture-of-Experts Diffusion Transformer (DiT-MoE) for computational efficiency. The model demonstrates competitive performance on benchmarks like VBench 2.0 (61.64) and OpenVE-Bench (3.86), while its distilled version achieves up to 95.9 times faster video editing inference.
HEAVYSKILL formalizes a two-phase 'heavy thinking' process as an intrinsic LLM skill, combining parallel reasoning with sequential deliberation. This framework consistently improves performance on complex reasoning tasks, outperforming traditional strategies like Best-of-N and demonstrating that deliberation can synthesize correct solutions not present in individual initial attempts.
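The two-phase structure, and how it differs from Best-of-N, can be sketched abstractly. This is a hypothetical skeleton under assumed interfaces (`generate`, `deliberate`), not the paper's method: the point is only that the deliberation phase sees all drafts jointly, whereas Best-of-N can only select one of them.

```python
def heavy_think(generate, deliberate, prompt, n=4):
    """Two-phase 'heavy thinking' sketch: phase one drafts n candidate
    solutions in parallel; phase two deliberates over all drafts jointly,
    so the final answer can combine partial insights that no single
    draft contains (Best-of-N, by contrast, only picks one draft)."""
    drafts = [generate(prompt, seed=i) for i in range(n)]  # parallel phase
    return deliberate(prompt, drafts)                      # sequential phase
```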
Researchers at Shanghai Jiao Tong University developed ARIS, an autonomous research harness for machine learning that uses adversarial multi-agent collaboration to mitigate reliability issues in LLM-driven scientific discovery. The system demonstrated its capability to complete multiple review-revise rounds, improving an internal manuscript score from 5.0 to 7.5/10 and initiating over 20 GPU experiments while pruning unsubstantiated claims.
OpenSearch-VL introduces an open-source recipe for training multimodal deep search agents, providing high-quality, openly available training data, a diverse tool environment, and an improved reinforcement learning algorithm. It demonstrates substantial performance improvements across multiple multimodal knowledge-intensive QA and web-search benchmarks, achieving an average score of 61.6 with its 30B-A3B model, a 13.8-point gain over a strong baseline.
Researchers at Stanford University and collaborators revealed that the intrinsic geometry of neural network representations is tightly coupled with model behavior, proposing "manifold steering" as a geometry-aware intervention that yields more natural and coherent behavioral changes than conventional linear steering. This approach demonstrated a 2.8x average improvement in naturalness across language and vision tasks and generalized to multi-dimensional concept control.
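The contrast between linear and geometry-aware steering can be illustrated with a toy example. This is a speculative sketch, not the paper's intervention: the norm-preserving correction below merely approximates "staying on the manifold" for a hypersphere-like geometry, and both function names are invented for illustration.

```python
import math

def linear_steer(x, v, alpha):
    # Conventional linear steering: add a fixed concept direction
    # to the activation, which can push it off the data manifold.
    return [xi + alpha * vi for xi, vi in zip(x, v)]

def manifold_steer(x, v, alpha):
    """Hypothetical geometry-aware variant: take the same linear step,
    then rescale back to the activation's original norm, approximating
    a move constrained to a hypersphere-like representation manifold."""
    y = linear_steer(x, v, alpha)
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return [yi * norm_x / norm_y for yi in y]
```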
Researchers from Zhejiang University and Tencent introduce Uni-OPD, a unified on-policy distillation framework that enhances student exploration and teacher supervision reliability in LLMs and MLLMs. The framework consistently improves performance across multi-teacher, strong-to-weak, and cross-modal distillation settings, achieving average gains of up to 3.4% in code generation for LLMs and 3.1% in math reasoning for MLLMs over standard methods, while demonstrating faster convergence.
Syn4D introduces a comprehensive multiview synthetic 4D dataset, providing dense ground-truth annotations for dynamic scenes, including 3D tracking and parametric human pose. Training with Syn4D improves state-of-the-art models, enhancing 3D tracking performance (e.g., APD from 79.07 to 88.79 on "Warehouse") and achieving better geometric consistency in novel-view synthesis.
JoyAI-Image, a unified multimodal foundation model from Joy Future Academy, integrates spatial intelligence into multimodal understanding, text-to-image generation, and instruction-guided image editing. It achieves performance improvements in spatial reasoning benchmarks, accurate long-text rendering, and fine-grained spatial image editing, enabling applications like novel view reasoning and 3D reconstruction.
The GenLIP framework introduces a minimalist generative pre-training approach for Vision Transformers, enabling them to directly predict language tokens from visual inputs. It achieves competitive or superior performance on 14 diverse multimodal benchmarks with 8 billion pretraining samples, outperforming baselines that use up to 40 billion samples.
Distribution-Guided Policy Optimization (DGPO) introduces a critic-free reinforcement learning framework that resolves coarse-grained credit assignment and gradient instability in large language model alignment. It uses Hellinger distance and an entropy gating mechanism for token-level credit allocation, leading to improved performance on mathematical reasoning tasks while maintaining computational efficiency.
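A minimal sketch of how Hellinger distance and an entropy gate could combine for token-level credit, assuming (this is not the paper's formulation) that credit is the distance between current and reference next-token distributions, zeroed for near-deterministic tokens. The function names and the gate threshold are illustrative choices.

```python
import math

def hellinger(p, q):
    # Hellinger distance between discrete distributions, bounded in [0, 1],
    # which avoids the unbounded ratios that destabilise KL-based signals.
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def token_credits(policy_dists, ref_dists, entropy_gate=0.5):
    """Hypothetical token-level credit assignment: each token's credit is
    the Hellinger distance between the current and reference next-token
    distributions; an entropy gate zeroes out near-deterministic tokens so
    updates concentrate on genuinely uncertain decisions."""
    credits = []
    for p, q in zip(policy_dists, ref_dists):
        entropy = -sum(x * math.log(x) for x in p if x > 0)
        credits.append(hellinger(p, q) if entropy >= entropy_gate else 0.0)
    return credits
```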
Researchers at KAIST introduced the Neural Theorizer (NEO), an AI system that learns to construct explicit, executable programs (theories) directly from raw sensory observations, enabling explanation-driven generalization to novel compositions and longer sequences. NEO demonstrated superior transferability and successfully discovered latent primitive operations compared to monolithic models across tasks like GridWorld and Image Editing.
Mix3R introduces a method that integrates feed-forward 3D reconstruction with generative 3D models to achieve improved multi-view aligned 3D reconstruction and camera pose estimation. The system generates complete 3D geometries and accurate camera poses, outperforming existing methods on Toys4k with a Chamfer Distance of 0.7419 x10^-3 and a Relative Rotation Accuracy of 95.49 for camera poses.
The Self-Induced Outcome Potential (SIOP) framework provides a method for assigning turn-level credit to LLM agents in interactive environments without requiring explicit gold answers or task-specific verifiers. By clustering sampled final answers into semantic outcome modes and using reliability-calibrated potential differences to reward intermediate steps, SIOP improves performance over existing verifier-free methods and narrows the gap with gold-supervised techniques, particularly on multi-hop reasoning benchmarks.
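The potential-difference idea can be sketched in a few lines. This is a simplified toy, not the SIOP implementation: exact-string grouping stands in for semantic clustering, and self-consistency of the dominant mode stands in for the paper's reliability calibration.

```python
from collections import Counter

def outcome_potential(sampled_answers):
    """Potential of a dialogue state: agreement mass of the dominant
    outcome mode among rollouts sampled to completion from that state.
    Exact-string grouping stands in for semantic clustering here."""
    _, count = Counter(sampled_answers).most_common(1)[0]
    return count / len(sampled_answers)   # self-consistency as reliability

def turn_reward(answers_before, answers_after):
    # Verifier-free credit for one agent turn: the change in outcome
    # potential that the turn induced, with no gold answer required.
    return outcome_potential(answers_after) - outcome_potential(answers_before)
```

A turn that moves sampled rollouts from a split vote toward one dominant answer receives positive credit, even though no external verifier ever checks that answer.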
A comprehensive survey details the landscape of Audio-Visual Intelligence (AVI) in large foundation models, offering a unified taxonomy, synthesizing core methodologies, and outlining open challenges across perception, generation, and interaction tasks.