MolmoAct2, developed by the Allen Institute for AI and the University of Washington, is a fully open-source action reasoning model designed for real-world robot deployment, enhancing generalizability and efficiency. It incorporates an embodied reasoning VLM backbone and a continuous action expert, achieving up to 87.1% success on real-world DROID tasks with unseen objects and a 2.42x speedup in control rate compared to unoptimized inference.
This work introduces on-policy distillation, a post-training method for large language models that combines the on-policy relevance of reinforcement learning with dense, token-level feedback from a teacher model. The approach achieved 70% on the AIME'24 mathematical reasoning benchmark with Qwen3-8B and demonstrated a 30x cost reduction compared to off-policy distillation, while also recovering instruction-following abilities in personalized models.
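The core loop of on-policy distillation can be sketched in a few lines. This is a minimal toy sketch, not the paper's implementation: the function names, the use of reverse KL as the per-token signal, and the callable interfaces for the student and teacher are all assumptions made for illustration.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    # KL(student || teacher): penalises student mass on tokens the
    # teacher considers unlikely; zero when the distributions match.
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

def on_policy_distill_step(student, teacher, sample_from_student, prompt):
    """One on-policy distillation step: sample a completion from the
    *student* (on-policy data, as in RL), then score every token against
    the teacher's next-token distribution (dense, token-level feedback)."""
    tokens = sample_from_student(prompt)      # student's own trajectory
    losses = []
    for i in range(len(tokens)):
        s = student(prompt, tokens[:i])       # student distribution at step i
        t = teacher(prompt, tokens[:i])       # teacher distribution at step i
        losses.append(reverse_kl(s, t))       # per-token training signal
    return losses
```

The key contrast with off-policy distillation is the first line of the loop: the trajectory comes from the student itself, so every token-level loss is computed on states the student actually visits.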
Model Spec Midtraining (MSM) enhances the generalization of large language models by integrating an intermediate training phase that instills a deep understanding of a Model Spec's principles and values. This approach reduces agentic misalignment in out-of-distribution scenarios and improves the compute efficiency of alignment fine-tuning by up to 60x in low-sample regimes.
RLDX-1, developed by RLWRLD and KAIST, is a unified Vision-Language-Action (VLA) system for dexterous robotic manipulation that integrates motion awareness, long-term memory, and physical sensing. It demonstrates superior performance over existing frontier VLA models across diverse simulation benchmarks and real-world tasks, including complex humanoid operations and contact-rich scenarios, with a notable 97.8% success rate on LIBERO and significant improvements in functional capabilities.
ProgramBench introduces a new benchmark to evaluate large language models' capacity for holistic software development from scratch, providing only a compiled executable and its documentation. The evaluation revealed that current leading models achieve a 0.0% full resolution rate, struggling with high-level architectural design and system decomposition.
OpenSeeker-v2, developed by Shanghai Jiao Tong University, achieves state-of-the-art performance among 30B-scale search agents by exclusively using Supervised Fine-Tuning (SFT) on a meticulously curated dataset of high-difficulty trajectories. This approach demonstrates that sophisticated data quality can enable competitive results against models trained with more resource-intensive multi-stage pipelines.
ByteDance's Mamoda2.5 unifies multimodal understanding, image generation, and video generation/editing within a single Autoregressive–Diffusion framework, leveraging a fine-grained Mixture-of-Experts Diffusion Transformer (DiT-MoE) for computational efficiency. The model demonstrates competitive performance on benchmarks like VBench 2.0 (61.64) and OpenVE-Bench (3.86), while its distilled version achieves up to 95.9 times faster video editing inference.
HEAVYSKILL formalizes a two-phase 'heavy thinking' process as an intrinsic LLM skill, combining parallel reasoning with sequential deliberation. This framework consistently improves performance on complex reasoning tasks, outperforming traditional strategies like Best-of-N and demonstrating that deliberation can synthesize correct solutions not present in individual initial attempts.
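The two-phase structure, and how it differs from Best-of-N, can be sketched abstractly. This is a hypothetical skeleton under assumed interfaces (`generate`, `deliberate`), not the paper's method: the point is only that the deliberation phase sees all drafts jointly, whereas Best-of-N can only select one of them.

```python
def heavy_think(generate, deliberate, prompt, n=4):
    """Two-phase 'heavy thinking' sketch: phase one drafts n candidate
    solutions in parallel; phase two deliberates over all drafts jointly,
    so the final answer can combine partial insights that no single
    draft contains (Best-of-N, by contrast, only picks one draft)."""
    drafts = [generate(prompt, seed=i) for i in range(n)]  # parallel phase
    return deliberate(prompt, drafts)                      # sequential phase
```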
Researchers at Shanghai Jiao Tong University developed ARIS, an autonomous research harness for machine learning that uses adversarial multi-agent collaboration to mitigate reliability issues in LLM-driven scientific discovery. The system demonstrated its capability to complete multiple review-revise rounds, improving an internal manuscript score from 5.0 to 7.5/10 and initiating over 20 GPU experiments while pruning unsubstantiated claims.
OpenSearch-VL introduces an open-source recipe for training multimodal deep search agents, providing high-quality, openly available training data, a diverse tool environment, and an improved reinforcement learning algorithm. It demonstrates substantial performance improvements across multiple multimodal knowledge-intensive QA and web-search benchmarks, achieving an average score of 61.6 with its 30B-A3B model, a 13.8-point gain over a strong baseline.
Researchers at Stanford University and collaborators revealed that the intrinsic geometry of neural network representations is tightly coupled with model behavior, proposing "manifold steering" as a geometry-aware intervention that yields more natural and coherent behavioral changes than conventional linear steering. This approach demonstrated a 2.8x average improvement in naturalness across language and vision tasks and generalized to multi-dimensional concept control.
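The contrast between linear and geometry-aware steering can be illustrated with a toy example. This is a speculative sketch, not the paper's intervention: the norm-preserving correction below merely approximates "staying on the manifold" for a hypersphere-like geometry, and both function names are invented for illustration.

```python
import math

def linear_steer(x, v, alpha):
    # Conventional linear steering: add a fixed concept direction
    # to the activation, which can push it off the data manifold.
    return [xi + alpha * vi for xi, vi in zip(x, v)]

def manifold_steer(x, v, alpha):
    """Hypothetical geometry-aware variant: take the same linear step,
    then rescale back to the activation's original norm, approximating
    a move constrained to a hypersphere-like representation manifold."""
    y = linear_steer(x, v, alpha)
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return [yi * norm_x / norm_y for yi in y]
```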
Researchers from Zhejiang University and Tencent introduce Uni-OPD, a unified on-policy distillation framework that enhances student exploration and teacher supervision reliability in LLMs and MLLMs. The framework consistently improves performance across multi-teacher, strong-to-weak, and cross-modal distillation settings, achieving average gains of up to 3.4% in code generation for LLMs and 3.1% in math reasoning for MLLMs over standard methods, while demonstrating faster convergence.
Syn4D introduces a comprehensive multiview synthetic 4D dataset, providing dense ground-truth annotations for dynamic scenes, including 3D tracking and parametric human pose. Training with Syn4D improves state-of-the-art models, enhancing 3D tracking performance (e.g., APD from 79.07 to 88.79 on "Warehouse") and achieving better geometric consistency in novel-view synthesis.
JoyAI-Image, a unified multimodal foundation model from Joy Future Academy, integrates spatial intelligence into multimodal understanding, text-to-image generation, and instruction-guided image editing. It achieves performance improvements in spatial reasoning benchmarks, accurate long-text rendering, and fine-grained spatial image editing, enabling applications like novel view reasoning and 3D reconstruction.
The GenLIP framework introduces a minimalist generative pre-training approach for Vision Transformers, enabling them to directly predict language tokens from visual inputs. It achieves competitive or superior performance on 14 diverse multimodal benchmarks with 8 billion pretraining samples, outperforming baselines that use up to 40 billion samples.
Distribution-Guided Policy Optimization (DGPO) introduces a critic-free reinforcement learning framework that resolves coarse-grained credit assignment and gradient instability in large language model alignment. It uses Hellinger distance and an entropy gating mechanism for token-level credit allocation, leading to improved performance on mathematical reasoning tasks while maintaining computational efficiency.
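A minimal sketch of how Hellinger distance and an entropy gate could combine for token-level credit, assuming (this is not the paper's formulation) that credit is the distance between current and reference next-token distributions, zeroed for near-deterministic tokens. The function names and the gate threshold are illustrative choices.

```python
import math

def hellinger(p, q):
    # Hellinger distance between discrete distributions, bounded in [0, 1],
    # which avoids the unbounded ratios that destabilise KL-based signals.
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def token_credits(policy_dists, ref_dists, entropy_gate=0.5):
    """Hypothetical token-level credit assignment: each token's credit is
    the Hellinger distance between the current and reference next-token
    distributions; an entropy gate zeroes out near-deterministic tokens so
    updates concentrate on genuinely uncertain decisions."""
    credits = []
    for p, q in zip(policy_dists, ref_dists):
        entropy = -sum(x * math.log(x) for x in p if x > 0)
        credits.append(hellinger(p, q) if entropy >= entropy_gate else 0.0)
    return credits
```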
Researchers at KAIST introduced the Neural Theorizer (NEO), an AI system that learns to construct explicit, executable programs (theories) directly from raw sensory observations, enabling explanation-driven generalization to novel compositions and longer sequences. NEO demonstrated superior transferability and successfully discovered latent primitive operations compared to monolithic models across tasks like GridWorld and Image Editing.
Mix3R introduces a method that integrates feed-forward 3D reconstruction with generative 3D models to achieve improved multi-view aligned 3D reconstruction and camera pose estimation. The system generates complete 3D geometries and accurate camera poses, outperforming existing methods on Toys4k with a Chamfer Distance of 0.7419 x10^-3 and a Relative Rotation Accuracy of 95.49 for camera poses.
The Self-Induced Outcome Potential (SIOP) framework provides a method for assigning turn-level credit to LLM agents in interactive environments without requiring explicit gold answers or task-specific verifiers. By clustering sampled final answers into semantic outcome modes and using reliability-calibrated potential differences to reward intermediate steps, SIOP improves performance over existing verifier-free methods and narrows the gap with gold-supervised techniques, particularly on multi-hop reasoning benchmarks.
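The potential-difference idea can be sketched in a few lines. This is a simplified toy, not the SIOP implementation: exact-string grouping stands in for semantic clustering, and self-consistency of the dominant mode stands in for the paper's reliability calibration.

```python
from collections import Counter

def outcome_potential(sampled_answers):
    """Potential of a dialogue state: agreement mass of the dominant
    outcome mode among rollouts sampled to completion from that state.
    Exact-string grouping stands in for semantic clustering here."""
    _, count = Counter(sampled_answers).most_common(1)[0]
    return count / len(sampled_answers)   # self-consistency as reliability

def turn_reward(answers_before, answers_after):
    # Verifier-free credit for one agent turn: the change in outcome
    # potential that the turn induced, with no gold answer required.
    return outcome_potential(answers_after) - outcome_potential(answers_before)
```

A turn that moves sampled rollouts from a split vote toward one dominant answer receives positive credit, even though no external verifier ever checks that answer.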
A comprehensive survey details the landscape of Audio-Visual Intelligence (AVI) in large foundation models, offering a unified taxonomy, synthesizing core methodologies, and outlining open challenges across perception, generation, and interaction tasks.