RLDX-1: A Dexterity-First Foundation Model for Robot Hands
1. Introduction
VIDEO 1: RLWRLD introduces RLDX-1.
The past five years of large language models trained one deep intuition into the field: make the model bigger, feed it more data, and new capabilities emerge. That intuition has carried directly into robotics. The prevailing bet is that if you train generalist robot policies on enough data, dexterous manipulation will emerge the way unexpected capabilities emerged in LLMs at scale [1, 2, 3, 4, 5].
That bet is promising, and scale will remain a powerful lever. But when we deployed the best available VLAs (Vision-Language-Action models) on the tasks our customers actually need (pouring, in-hand rotation, reactive grasping), they failed consistently, not because they lacked intelligence, but because the task lived in modalities the model was never given: high-DoF hand-object geometry, contact forces, temporal dynamics. Scale cannot recover a modality that was absent from the start.
Recent works have explored leveraging video generation (or world) models trained on large-scale human data for policy learning, demonstrating strong generalization to unseen environments and tasks [6, 7, 8, 9]. While such approaches capture diverse behaviors from web-scale data, generalization alone is insufficient for precise physical interaction. Real-world interaction requires recognizing what to do, maintaining relevant state over time, and grounding decisions in physically meaningful signals.
RLDX-1 is a foundation model designed from the ground up for dexterous robot hands. Every component exists because a specific failure mode on a real task required it. The result is a single model that can see, feel, remember, and adapt, deployable across single-arm, dual-arm, and humanoid embodiments with high-DoF hands.
2. The Five Regimes of Dexterity
The last mile of industrial automation is dexterity, and today's robots still cannot reliably pour coffee as the pot grows lighter, pick a moving object off a conveyor, or rotate a hex nut with fingertips. We distilled these recurring customer needs into DexBench, a benchmark that organizes them along five regimes of dexterity, where each regime is a specific failure mode of today's robots and each one shapes a specific piece of RLDX-1. These tasks live in a world built for human hands. A robot that operates in that world has to meet it on its own terms.
A five-finger hand is the natural form factor, but form factor is not dexterity. Dexterity begins when that hand is driven by a model that can reason about space, time, and contact in real time. The five regimes of dexterity, and the piece of RLDX-1 each one shapes:
Grasp Diversity: the hardware premise. Five-finger hands are the prerequisite every regime below assumes; we have run 10+ of them in-house (a separate post will unpack this). Two data pipelines diversify grasping: Synthetic Robot Data (§4-1) expands a small teleoperation seed set into a ~5× larger dataset, while Human Data (§4-2) covers the high-DoF in-hand dexterity that teleoperation cannot reach.
Spatial Precision: geometric reasoning about the pre-contact approach. The policy must capture sufficient scene structure to place contact correctly before contact is made. RLDX-1 strengthens this capability with a robot-specialized VLM (Vision Language Model) fine-tuned on robot VQA (Visual Question Answering), where the questions explicitly target the geometric relationship between the robot end-effector and the target object (§3-2). This training encourages the VLM to better ground object locations and spatial relations that are critical for precise contact placement.
Temporal Precision: acting at the right moment in a world that does not wait. A single-frame policy commits to where objects were; by the time the hand arrives, the conveyor object has moved. To address this, the Motion Module (§3-3) extracts motion features from space-time visual correspondences and amortizes multi-frame context into a compact representation. It lets the policy see where and how fast objects are going.
Contact Precision: reasoning about contact using torque and tactile signals. A coffee pot growing lighter is visually invariant; the signal is in wrist torque. The Physics Module (§3-4) gives tactile and torque their own streams and predicts future contact states alongside actions, so the policy anticipates contact transitions before they happen.
Context Awareness: task-level reasoning that wraps around the three precisions. Without it, even a perfectly executed motion is stranded at the single step it was planned for. The policy needs three capabilities:
- Memory. A sliding cache of past cognition features (§3-5) tracks what already happened.
- Recovery. DAgger (§5-3) turns out-of-distribution drift into new training data, narrowing the failure distribution each iteration.
- Progress-Awareness. A progress-estimator VLM (§5-3) gives RL a dense visual reward, so the policy knows which moves actually bring the task closer to done.
Each regime enters the model as a fundamentally different modality: torque is a high-rate continuous stream, video is sparse high-dimensional frames, memory is stateful. In a single conventional transformer, whichever modality dominates the gradient absorbs all the capacity while the rest become decorative. The architectural answer is MSAT (Multi-Stream Action Transformer, §3-1): each modality gets its own processing stream, and cognition tokens (§3-5) compress the VLM output into a fixed-size interface. What follows unpacks each layer: the architecture that holds these modalities together (§3), the data engine that trains it (§4), and the post-training that makes it deployable (§5).
3. RLDX-1
VIDEO 2: Architecture.
RLDX is built on MSAT (Multi-Stream Action Transformer), an architecture where each modality gets its own processing stream and joint self-attention lets them interact.
3-1. MSAT: Integrating Heterogeneous Signals
Existing VLAs [3, 4, 5] fuse modalities inside a single transformer stream, where whichever modality dominates the gradient absorbs all the capacity. MSAT gives each modality its own dedicated processing stream [10, 11], then lets the streams communicate through joint self-attention without being forced into a shared representation prematurely. Early blocks keep modalities in parallel streams; later blocks fuse them for action decoding.
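To make the stream/fusion split concrete, here is a minimal sketch of one multi-stream block in PyTorch. The dimensions, modality names, and the specific joint-attention-plus-private-MLP layout are our illustrative assumptions, not the released MSAT implementation:

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """One multi-stream block (illustrative): joint self-attention across
    all modality tokens, followed by a private feed-forward per modality."""

    def __init__(self, dim: int, modalities: list[str], n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Each modality keeps its own parameters instead of sharing one MLP,
        # so no single modality's gradient can absorb all the capacity.
        self.streams = nn.ModuleDict({
            m: nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                             nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        names = list(tokens)
        lengths = [tokens[m].shape[1] for m in names]
        # Joint self-attention: streams communicate without being forced
        # into a shared representation prematurely.
        joint = torch.cat([tokens[m] for m in names], dim=1)
        x = self.norm(joint)
        joint = joint + self.attn(x, x, x)[0]
        out, start = {}, 0
        for m, n in zip(names, lengths):
            chunk = joint[:, start:start + n]
            out[m] = chunk + self.streams[m](chunk)  # private per-modality path
            start += n
        return out

# Hypothetical token counts for three of the streams.
block = MultiStreamBlock(dim=256, modalities=["vision", "physics", "cognition"])
tokens = {"vision": torch.randn(2, 300, 256),
          "physics": torch.randn(2, 32, 256),
          "cognition": torch.randn(2, 64, 256)}
fused = block(tokens)
```

A real MSAT stack would chain several such blocks, keeping modalities parallel in early blocks and fusing them for action decoding later; the sketch only shows the per-modality-parameters-plus-joint-attention pattern.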
3-2. Robot-Specialized VLM
General-purpose VLMs are strong at visual reasoning, but they do not automatically understand what matters for robot control. To close this gap, RLDX-1 fine-tunes Qwen3-VL 8B [12] on a robot-trajectory VQA dataset targeting three action-relevant abilities. First, spatial reasoning about the geometric relationship between the end-effector and target objects. Second, task understanding that identifies the intermediate subtask implied by the current observation. Third, action grounding that reasons about the low-level action associated with the current frame. The fine-tuned model, RLDX-1-VLM, serves as the visual reasoning backbone for action generation: +3.42%p over the vanilla VLM on RoboCasa [13]. A sketch of the three question families follows Table 1.
Table 1: VLM Fine-tuning Ablation (RoboCasa)
| Model | RoboCasa Score | Δ |
|---|---|---|
| Qwen3-VL 8B | 57.50 | — |
| RLDX-1-VLM 8B | 60.92 | +3.42 |
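For concreteness, here is a hypothetical shape for the three question families; the actual dataset schema, prompts, and answers are not published in this post:

```python
# Hypothetical robot-VQA samples illustrating the three question families
# (spatial reasoning, task understanding, action grounding).
vqa_samples = [
    {   # Spatial reasoning: end-effector vs. target geometry.
        "image": "frame_0412.jpg",
        "question": "Where is the mug relative to the right gripper?",
        "answer": "About 12 cm left of the gripper and 5 cm below it.",
    },
    {   # Task understanding: the intermediate subtask implied by the frame.
        "image": "frame_0412.jpg",
        "question": "What subtask should the robot perform next?",
        "answer": "Move the right hand above the mug handle and pre-shape the fingers.",
    },
    {   # Action grounding: the low-level action behind this frame.
        "image": "frame_0412.jpg",
        "question": "Describe the end-effector motion in this frame.",
        "answer": "The wrist translates forward about 3 cm while the fingers begin to close.",
    },
]
```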
3-3. Seeing Physics
VIDEO 3: Motion Module.
A single-frame policy is always one step behind the scene. By the time the hand arrives, the conveyor object has moved. The Motion Module has two complementary pieces. A video token compression layer feeds multi-frame observations through the VLM, compressing past frames into motion tokens via average pooling [14], so the model efficiently sees where things are going. A motion learning layer in the vision encoder models spatio-temporal self-similarities (STSS) [15], capturing rotation, velocity, and interaction dynamics directly from visual features. Together: +37.5%p over GR00T N1.6 and π₀.₅ on the conveyor-belt pick-and-place task.
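A minimal sketch of the token-compression idea, assuming past-frame tokens are pooled uniformly over time; the actual pooling granularity and placement in [14] may differ:

```python
import torch

def compress_past_frames(frame_tokens: torch.Tensor, pool: int = 4) -> torch.Tensor:
    """Amortize multi-frame context by temporal average pooling (sketch).

    frame_tokens: (B, T, N, D) visual tokens for T past frames.
    Returns (B, T // pool, N, D): each motion token averages `pool` frames,
    so the VLM sees multi-frame context at a fraction of the token cost.
    """
    B, T, N, D = frame_tokens.shape
    assert T % pool == 0, "pad or crop the history to a multiple of `pool`"
    return frame_tokens.view(B, T // pool, pool, N, D).mean(dim=2)

# Hypothetical usage: 8 past frames, 196 tokens each, pooled 4x.
motion_tokens = compress_past_frames(torch.randn(1, 8, 196, 256), pool=4)
```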
3-4. Feeling Physics
VIDEO 4: Physics Module.
The Physics Module [16] integrates tactile and torque feedback into RLDX as native modalities. These physical signals are crucial for tasks that require contact-rich object manipulation, primarily serving two key functionalities: weight estimation and contact detection. For weight estimation, when a robot pours coffee, the module captures weight shifts across both hands to inform RLDX precisely when to stop. For contact detection, a robot needs to identify the exact moment of contact to transition from approaching to picking (as shown in Figure 6). While joint angles provide ambiguous information regarding contact timing, torque signals offer distinct, sharp changes at the point of contact.
To fully leverage this, RLDX employs a dedicated stream that not only processes these signals but also predicts future torque states, allowing the policy to possess informative physical embeddings. Furthermore, when such sensors are unavailable, the sensory stream automatically deactivates for graceful degradation to vision-only, allowing a single model to support various hardware setups.
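A sketch of what such a stream could look like, with assumed sensor dimensions, an assumed GRU encoder, and a learned null token standing in for the deactivation path; the released Physics Module [16] differs in detail:

```python
import torch
import torch.nn as nn

class SensoryStream(nn.Module):
    """Physics-Module-style stream (assumed layout): encodes torque/tactile
    history, predicts the next torque state as an auxiliary target, and
    degrades gracefully to vision-only when the sensors are absent."""

    def __init__(self, torque_dim: int, tactile_dim: int, dim: int = 256):
        super().__init__()
        self.encoder = nn.GRU(torque_dim + tactile_dim, dim, batch_first=True)
        self.future_head = nn.Linear(dim, torque_dim)  # auxiliary prediction
        self.null_token = nn.Parameter(torch.zeros(1, 1, dim))  # sensor-less path

    def forward(self, torque=None, tactile=None):
        if torque is None or tactile is None:
            # Graceful degradation: emit a learned placeholder so the rest of
            # the model sees a fixed interface on sensor-less hardware.
            return self.null_token, None
        h, _ = self.encoder(torch.cat([torque, tactile], dim=-1))
        return h, self.future_head(h[:, -1])  # embeddings + predicted next torque

stream = SensoryStream(torque_dim=7, tactile_dim=12)
emb, next_torque = stream(torch.randn(2, 50, 7), torch.randn(2, 50, 12))
vision_only, _ = stream()  # same model, hardware without tactile/torque
```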
3-5. Cognition Interface & Memory Module
VIDEO 5: Memory Module.
The VLM produces a rich scene understanding, but passing all of its tokens to the action model is slow and wasteful. The Cognition Interface appends 64 learnable cognition tokens to the VLM's input; through attention they compress the full sequence into a fixed-size representation that carries exactly the information the action model needs. The speed win: a 35% inference speedup (16.3→22.1 Hz).
But these tokens do double duty. The same 64-token representation becomes the unit of long-horizon memory. A FIFO sliding cache stores past cognition features across the rollout, and the Memory Module [17] attends over this cache to track task progress. Pack a box, assemble a product, count ten apples into an opaque bag: each step depends on knowing what already happened. Compression and memory are the same mechanism, reused.
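A compact sketch of the shared mechanism, with assumed head counts and cache length: cross-attention compresses the VLM tokens into 64 cognition tokens, and those same tokens are what the FIFO cache stores for the Memory Module:

```python
import collections
import torch
import torch.nn as nn

class CognitionInterface(nn.Module):
    """Sketch (assumed shapes): 64 learnable tokens compress the VLM output
    via cross-attention; a FIFO cache of past cognition features doubles as
    the Memory Module's input."""

    def __init__(self, dim: int = 256, n_tokens: int = 64, cache_len: int = 16):
        super().__init__()
        self.cognition_tokens = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        self.compress = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.memory_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cache = collections.deque(maxlen=cache_len)  # sliding FIFO

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        q = self.cognition_tokens.expand(vlm_tokens.shape[0], -1, -1)
        cognition, _ = self.compress(q, vlm_tokens, vlm_tokens)  # (B, 64, D)
        if self.cache:  # attend over past steps to track what already happened
            past = torch.cat(list(self.cache), dim=1)
            mem, _ = self.memory_attn(cognition, past, past)
            cognition = cognition + mem
        self.cache.append(cognition.detach())
        return cognition  # fixed-size interface to the action model
```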
4. Data
VIDEO 6: Data.
4-1. Synthetic Robot Data: Generating What You Can't Collect
Real teleoperation alone cannot populate the space a five-finger hand must cover. Our synthetic data pipeline amplifies a small seed set of real demonstrations using video generation models (e.g., Cosmos-Predict2 [9]). A fine-tuned video model synthesizes new trajectories at scale by varying scene factors (e.g., lighting, surfaces, positions, and backgrounds). An inverse dynamics model then annotates the generated videos with action labels, followed by a video quality and motion-consistency filter [18] that retains only instruction-following and physically plausible synthetic data. The result is video-action consistent synthetic data that is beneficial for VLA training, rather than merely plausible-looking outputs: a ~5× increase in data scale, and a 9.2% gain in average success rate on the GR-1 Tabletop benchmark.
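The pipeline's control flow, with hypothetical callables standing in for the fine-tuned video model, the inverse dynamics model, and the consistency filter of [18]:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Demo:
    video: object      # seed teleoperation recording
    instruction: str

def generate_synthetic(seed_demos: list[Demo],
                       vary: Callable,   # video model: (video, instr) -> new video
                       idm: Callable,    # inverse dynamics: video -> action labels
                       keep: Callable,   # quality / motion-consistency filter
                       n_variations: int = 5):
    """Sketch of the §4-1 pipeline (interfaces are illustrative)."""
    out = []
    for demo in seed_demos:
        for _ in range(n_variations):                    # ~5x data scale
            video = vary(demo.video, demo.instruction)   # vary lighting, layout, ...
            actions = idm(video)                         # annotate with action labels
            if keep(video, actions, demo.instruction):   # retain consistent clips only
                out.append((video, actions, demo.instruction))
    return out
```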
4-2. Human Data: Learning from Human Hands
There is no better teacher for a dexterous robot hand than a human hand. Teleoperation is often too slow and imprecise for five-finger manipulation, as conventional controllers fail to capture the high-speed reflexes required for dynamic tasks like catching or rapid regrasping. The most adopted alternative, UMI [19], fits the robot's end-effector onto a human, but only for grippers; DexUMI [20] ported the recipe to five-finger hands and has not held up in practice: poor ergonomics, constrained hand motion, and a device that must be redesigned for every new robot hand. RLDX takes the opposite route: record from the bare human hand and close the kinematic and morphological gap in software, with a retargeting framework built for five-finger dexterity.
The pipeline has four stages: (1) track the human hand and object, (2) reconstruct the workspace with 3D Gaussian Splatting [22], (3) retarget onto the robot hand, (4) roll out in simulation to produce VLA training data. This yields 200+ demonstrations per hour and scales further with automated augmentations.
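Stage (3) reduces to an optimization: find robot joint angles whose fingertips match the tracked human fingertips. Below is a toy, runnable version for a single two-joint planar finger; the real framework handles full five-finger kinematics and closes the morphological gap as well:

```python
import numpy as np
from scipy.optimize import least_squares

LINK = np.array([0.045, 0.030])  # assumed robot phalanx lengths (m)

def fingertip(q):
    """Forward kinematics of a 2-joint planar finger (toy stand-in)."""
    return np.array([LINK[0] * np.cos(q[0]) + LINK[1] * np.cos(q[0] + q[1]),
                     LINK[0] * np.sin(q[0]) + LINK[1] * np.sin(q[0] + q[1])])

def retarget(human_tip: np.ndarray, q0=np.full(2, 0.3)) -> np.ndarray:
    """Solve for joint angles whose fingertip best matches the tracked
    human fingertip; joint limits keep the solution feasible."""
    sol = least_squares(lambda q: fingertip(q) - human_tip, q0,
                        bounds=(np.zeros(2), np.full(2, np.pi / 2)))
    return sol.x

q = retarget(np.array([0.05, 0.03]))  # tracked human fingertip, robot frame
```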
VIDEO 7: Human Data Capture Pipeline
5. Training Pipeline
RLDX is trained through a three-stage pipeline, each stage building on the previous checkpoint.
5-1. Pre-Training for General Manipulation
The model learns general manipulation knowledge across single-arm, dual-arm, and humanoid embodiments (many equipped with dexterous five-finger hands) through a shared MSAT core with per-embodiment encoders/decoders. The pre-training mix contains trajectories from diverse real-world datasets and our synthetic robot data. We randomly drop embodiment tags so that the model learns both an embodiment-conditioned policy and an embodiment-agnostic one in a single backbone.
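The tag-dropout trick is small enough to show in full; the drop probability and tag vocabulary here are illustrative assumptions:

```python
import random

def embodiment_tag(batch_tag: str, drop_prob: float = 0.3) -> str:
    """Randomly replace the embodiment tag with a shared null tag, forcing
    the backbone to also learn an embodiment-agnostic policy (sketch;
    drop_prob and tag names are assumed)."""
    return "<null_embodiment>" if random.random() < drop_prob else batch_tag

tag = embodiment_tag("<allex_humanoid>")  # hypothetical tag vocabulary
```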
5-2. Mid-Training for Target Embodiments
Starting from the pre-trained checkpoint, the Memory Module and Physics Module are added and initialized from scratch, with existing weights preserved. Embodiment-specific dexterity data builds the temporal and sensory capabilities. Pre-training data is partially reused to prevent catastrophic forgetting, and synthetic robot data fills in for data-scarce embodiments.
5-3. Post-Training for Deployment
VIDEO 8: Deployment.
Imitation learning alone leaves headroom in both success rate and motion quality. Two mechanisms close the gap, corresponding to the two remaining faces of Context Awareness:
DAgger (Recovery) [23] focuses training data on the failures the model actually makes. The model is deployed and corrected when it goes out-of-distribution, and those corrections become new training data. Each iteration narrows the failure distribution until the mistake pattern disappears.
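Schematically, one DAgger round looks like the following, with hypothetical `env`, `expert`, and `train` interfaces; the key property is that the policy, not the expert, chooses which states get labeled:

```python
def dagger_round(policy, env, expert, dataset, train, n_rollouts: int = 50):
    """One DAgger [23] iteration (sketch, hypothetical interfaces): roll out
    the current policy, let the expert relabel the visited states, aggregate,
    and retrain, so data concentrates on the failures the model actually makes."""
    for _ in range(n_rollouts):
        obs = env.reset()
        done = False
        while not done:
            action = policy(obs)                # the policy picks the states...
            dataset.append((obs, expert(obs)))  # ...the expert labels them
            obs, done = env.step(action)
    return train(policy, dataset)
```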
Progress-Aware RL (Progress-Awareness). A separate VLM is post-trained as a learned progress estimator: given a trajectory, it predicts how close the policy is to completing the task. This provides RL (Reinforcement Learning) with a dense, visually-grounded reward signal that drives the policy toward task progress [24, 25] without hand-engineered, task-specific goals. Reusing batched on-policy data [26, 27] lets every rollout be exploited across multiple updates, making real-robot RL more tractable and affordable. The result: the final policy completes tasks ~3× faster than imitation learning alone.
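The reward construction can be stated in a few lines. Assuming a hypothetical `progress_vlm(frame) -> float in [0, 1]` interface, differencing consecutive progress estimates yields a dense, potential-shaped reward:

```python
def progress_rewards(frames, progress_vlm) -> list[float]:
    """Turn a rollout into dense rewards via a learned progress estimator
    [24, 25] (sketch; the estimator interface is an assumption). Positive
    reward accrues only for moves that bring the task closer to done."""
    p = [progress_vlm(f) for f in frames]
    return [p[t + 1] - p[t] for t in range(len(p) - 1)]
```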
6. Deployment & Benchmarks
RLDX-1 ships as three checkpoints: RLDX-1-PT (the pre-trained checkpoint), plus RLDX-1-MT-ALLEX and RLDX-1-MT-DROID (8.1B parameters each, mid-trained for their target platforms). Serving an 8.1B policy in a real-robot control loop in real time is a graph and memory problem more than a FLOPs one (§6-1). We benchmark RLDX-1-PT in simulation against GR00T N1.5/N1.6 [5] and π₀ [4] / π₀.₅ [28] / π₀-FAST [29] (§6-2), and also evaluate it on the OpenArm real-world benchmark without any platform-specific mid-training (§6-3). The mid-trained checkpoints (RLDX-1-MT-ALLEX, RLDX-1-MT-DROID) are then evaluated on their target platforms (§6-3).
6-1. Inference Optimization
Real-time control is a serving problem, not a FLOPs problem. Under a typical VLA input setup (a short instruction of ~30–40 words and 4 image frames, totaling ~300 tokens per step), MSAT inference is bottlenecked by CPU-side kernel dispatch and HBM (High Bandwidth Memory) round-trips between operators, not by GPU compute. We attack both.
Static Graph. Eager mode dispatches hundreds of kernels from CPU every forward call, and shape-dependent control flow (RoPE layouts, token masks) breaks torch.compile's trace. We pre-compute RoPE frequencies, token layouts, and other shape-derived buffers ahead of time, collapsing the full MSAT forward pass into one uninterrupted trace that compiles end-to-end and captures under CUDA graphs.
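The capture pattern itself is standard PyTorch; here it is on a stand-in module (RLDX-1 applies the same recipe to the full MSAT forward once RoPE tables and token layouts are pre-computed so every shape is static):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()  # stand-in for the policy
static_in = torch.randn(1, 1024, device="cuda")    # fixed-shape input buffer

# Warm up on a side stream, then capture one fixed-shape forward pass.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# At serve time: copy new inputs into the captured buffer and replay the
# graph, paying one launch instead of hundreds of per-kernel dispatches.
static_in.copy_(torch.randn(1, 1024, device="cuda"))
g.replay()
```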
Operator Fusion. Although inductor performs automatic fusion, its predefined patterns do not always cover model-specific fusion opportunities across operators. We profile its Triton kernels, find groups where algebra, memory I/O, and parallelization can be jointly rewritten, and replace them with hand-fused kernels, e.g., RMSNorm + RoPE + attention → one kernel, cutting HBM access from six passes to two (one load, one store).
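As a much-reduced illustration of the one-load/one-store pattern, here is a standalone Triton RMSNorm row kernel; the actual hand-fused kernel additionally folds RoPE and attention into the same pass, which we do not reproduce here:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    # One row per program: a single load and a single store per element,
    # with the reduction kept in registers (no intermediate HBM round-trip).
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=0.0).to(tl.float32)
    inv_rms = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + offs, mask=mask, other=1.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + offs, x * inv_rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    assert x.ndim == 2 and x.is_contiguous()
    out = torch.empty_like(x)
    rmsnorm_kernel[(x.shape[0],)](x, w, out, x.shape[1], eps,
                                  BLOCK=triton.next_power_of_2(x.shape[1]))
    return out
```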
End-to-end: up to 1.63× over PyTorch Eager on an RTX 5090 + Intel Core Ultra 7 265K desktop (ALLEX, dual-view 192×256, 4 frames, 40 action chunks, 4 denoising steps). Torch Compile alone provides only modest gains because it optimizes the model in fragments rather than as a whole. To capture the full forward pass, we pair static graph conversion with CUDA Graph, eliminating launch overhead end-to-end. On top of that, kernel optimization removes the remaining memory traffic.
Table 3: Inference Latency, p50 ms (speed-up vs. PyTorch Eager)
| Model | PyTorch Eager | Torch Compile | Static Graph Conversion + CUDA Graph | Static Graph Conversion + CUDA Graph + Kernel Opt. |
|---|---|---|---|---|
| RLDX-1 (w/o physics & memory modules) | 67.02 | 56.87 (1.18×) | 46.16 (1.45×) | 41.59 (1.61×) |
| RLDX-1 (all-modality) | 71.22 | 59.64 (1.19×) | 48.91 (1.46×) | 43.70 (1.63×) |
6-2. Simulation Benchmarks
We evaluate the pre-trained checkpoint RLDX-1-PT on 8 simulation benchmarks that collectively span single-arm gripper, dual-arm bimanual, and humanoid embodiments, against recent frontier VLAs (π₀, π₀.₅, π₀-FAST, GR00T N1.5/N1.6).
Classical suites are saturated. Recent frontier VLAs have crowded LIBERO and SIMPLER near their ceilings, and we treat them as a sanity check. RLDX-1 still holds SOTA on all four splits (LIBERO 97.8, SIMPLER Google-VM 81.5, Google-VA 77.4, WidowX 71.9), but the margin to the next-best is small and the headline numbers no longer separate methods.
The signal lives in the RoboCasa family: long-horizon, contact-rich, kitchen-scale tasks that current VLAs are visibly not yet good at, and where every component of RLDX-1 was designed to pay rent.
RoboCasa Kitchen. Every published VLA baseline sits between 62.1 (π₀.₅) and 66.2 (GR00T N1.6). RLDX-1 reaches 70.6, the first and only VLA to break the 70% mark on this suite, a +4.4%p jump over the next best.
RoboCasa GR-1 Tabletop. A humanoid-specific suite where every baseline plateaus below 50 (GR00T N1.5: 48.0, N1.6: 47.6). RLDX-1 hits 58.7 (+10.7%p), the regime our humanoid-first design (high-DoF hand encoders, embodiment-conditioned MSAT streams) was built for.
RoboCasa 365. The hardest of the RoboCasa suites: 365 long-horizon, multi-stage tasks engineered to break policies that only memorize short clips. Among baselines, GR00T N1.6 tops out at 26.9; RLDX-1 reaches 32.1 (+5.2%p, ~19% relative). The margin widens on the composite tasks, which chain several skills under a single instruction: RLDX-1 is roughly 2× the prior SOTA on seen composites and 3× on unseen composites.
These margins are especially notable given the data asymmetry: GR00T N1.6 includes RoboCasa simulation data in its pre-training mix, while RLDX-1 is trained on real data only. The RoboCasa gap is not a sim-pretraining advantage; it is an architectural and data-engine one.
Beyond nominal success rate, we also evaluate robustness to camera, robot, language, lighting, background, noise, and layout perturbations (LIBERO-Plus [21]). RLDX-1 reaches 86.7% total robustness vs GR00T N1.6 72.6% and π₀-FAST 64.2%.
And we got here on a fraction of the compute. Tracing the iteration trajectory (Internal RLDX 0.2 → 0.25 → 0.3 → 1), RLDX-1 crosses every baseline well before its final checkpoint and reaches SOTA at roughly 20% of the training compute used by GR00T N1.5. The gains above are not a scale story; they come from architecture and data engine design for dexterous manipulation.
6-3. Real-World Benchmark Tasks
Each benchmark task is designed to isolate a specific axis of dexterity. We compare RLDX-1 against strong baseline VLA models, including π₀.₅ and GR00T N1.6.
OpenArm: Cross-Embodiment Generalization
On the OpenArm + Inspire 6-DoF hand platform, we evaluate RLDX-1-PT, without OpenArm-specific mid-training, to probe how well the embodiment-agnostic pre-trained policy generalizes to a platform it was not specialized for. The benchmark targets versatile intelligence, including object grounding, instruction understanding, and generalization to unseen environments.
RLDX-1 consistently outperforms the baselines on the OpenArm benchmark for versatile intelligence. In particular, π₀.₅ performs better than GR00T N1.6 on in-domain tasks, but its performance drops below GR00T N1.6 on out-of-domain tasks, indicating limited generalization to unseen settings. GR00T N1.6 shows a different limitation. It completely fails on the object identification task, suggesting that it struggles with fine-grained instance-level object grounding. In contrast, RLDX-1 maintains balanced performance across different task types without collapsing on any specific capability. These results indicate that RLDX-1 is not only stronger in average success rate, but also more reliable across the diverse capabilities required for real-world humanoid manipulation.
ALLEX: Specialized Functional Capabilities (RLDX-1-MT-ALLEX)
Using the ALLEX humanoid, we construct tasks focused on motion awareness, history awareness, and physical signal awareness, evaluated with RLDX-1-MT-ALLEX specialized for the platform.
Table 4: RLDX-1 Real-World Benchmark, ALLEX
| Method | Conveyor Pick-and-Place | Object-in-Box Selection | Pot-to-Cup Pouring |
|---|---|---|---|
| π₀.₅ | 29.2 | 33.3 | 38.7 |
| GR00T N1.6 | 50.0 | 29.1 | 37.5 |
| RLDX-1 (Ours) | 87.5 | 91.7 | 70.8 |
The results show a large performance gap between RLDX-1 and existing VLAs. On tasks that require specialized functional capabilities, each major baseline falls below 30% success on at least one task, while RLDX-1 reaches roughly 90% on two of the three. This suggests that existing VLA models still struggle when a task requires more than generic visual-language understanding, such as tracking motion, using history, or interpreting physical signals. In contrast, RLDX-1 handles these capability-specific challenges far more reliably.
DROID (Franka Research 3): Memory and Sensory Tasks (RLDX-1-MT-DROID)
RLDX-1-MT-DROID specializes the pre-trained checkpoint to a single-arm Franka Research 3 platform with AnySkin tactile and joint torque sensing. We evaluate two memory-dependent tasks (Swap Cup, Shell Game) and two sensory-dependent tasks (Plug Insertion, Egg Pick & Place) that exercise the Memory Module and Physics Module on a non-humanoid embodiment.
Table 5: RLDX-1 Real-World Benchmark, DROID
| Model | Memory: Swap Cup | Memory: Shell Game | Sensory: Plug Insertion | Sensory: Egg Pick & Place |
|---|---|---|---|---|
| π₀.₅ | 25.0 | 45.8 | 20.8 | 45.8 |
| GR00T N1.6 | 12.5 | 54.2 | 16.7 | 37.8 |
| Ours (Pre-Trained) | 29.2 | 58.3 | 20.8 | 45.8 |
| Ours (Mid-Trained) | 45.8 | 91.7 | 33.3 | 61.1 |
6-4. Future Direction
- Per-task data requirements vary. Some tasks converge quickly with few demonstrations; others need relatively extensive post-training.
- Long-horizon planning ability. Our current experiments demonstrate memory-dependent decision-making over short-to-medium interaction horizons. Extending this capability to substantially longer temporal contexts, such as hour-long interactions, remains an important direction for future work.
- Zero-shot ability. RLDX-1 achieves strong instruction understanding under our current training and adaptation setting compared to other frontier VLAs, but its zero-shot generalization as a pre-trained policy remains an open direction.
- Extending RLDX-1 to the video/world model. RLDX-1 can be naturally extended toward video/world modeling, where the model learns to predict future visual observations conditioned on language instructions and actions. Such an extension could provide a stronger basis for long-horizon planning and action-conditioned imagination in embodied environments, and represents a promising direction for future work.
7. Closing
The prevailing assumption in the field is "intelligence first, dexterity follows." Our experience points the other way. Dexterity is not what comes after intelligence; it is the path intelligence takes when it has to act in the physical world. The deciding signal can live in torque, tactile, depth, or modalities with no human analogue, and scaling vision pre-training does not conjure what was never in the pixels. What scales for embodied intelligence is the number of modalities a model can absorb and let complement each other, not the size of any single one.
RLDX-1 is the first checkpoint along this direction, not the last. If this is the kind of problem you want to work on, we'd love to hear from you.
VIDEO 9: Launching RLDX-1.
Resources
- Tech Report: arxiv.org/abs/2605.03269
- Checkpoints: huggingface.co/collections/RLWRLD/rldx-1
  - RLDX-1-PT: Pre-trained checkpoint
  - RLDX-1-MT-ALLEX: ALLEX humanoid mid-trained checkpoint
  - RLDX-1-MT-DROID: DROID single-arm mid-trained checkpoint
  - and many more fine-tuned models for benchmark results
- Deployment & partnerships: [email protected]
References
- [1] Generalist AI. (2025). GEN-0. Nov 2025.
- [2] Generalist AI. (2026). GEN-1. Apr 2026.
- [3] Kim, M. J., Pertsch, K., Karamcheti, S., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.
- [4] Black, K., Brown, N., Driess, D., et al. (2024). π₀: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164.
- [5] NVIDIA, Bjorck, J., et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv:2503.14734.
- [6] Pai, J., Achenbach, L., Montesinos, V., Forrai, B., Mees, O., & Nava, E. (2025). mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs. arXiv:2512.15692.
- [7] Ye, S., Ge, Y., Zheng, K., et al. (2026). DreamZero: World Action Models are Zero-shot Policies. arXiv:2602.15922.
- [8] Jang, J., Ye, S., Lin, Z., et al. (2025). DreamGen: Unlocking Generalization in Robot Learning through Video World Models. arXiv:2505.12705.
- [9] NVIDIA, et al. (2025). Cosmos: World Simulation with Video Foundation Models for Physical AI. arXiv:2511.00062.
- [10] Esser, P., Kulal, S., Blattmann, A., et al. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML. arXiv:2403.03206.
- [11] Black Forest Labs. (2024). FLUX. https://github.com/black-forest-labs/flux.
- [12] Qwen Team, Alibaba. (2025). Qwen3-VL Technical Report. arXiv:2511.21631.
- [13] Nasiriany, S., Maddukuri, A., Zhang, L., et al. (2024). RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. RSS. arXiv:2406.02523.
- [14] Jang, H., Yu, S., Kwon, H., Jeon, H., Seo, Y., & Shin, J. (2025). ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context. arXiv:2510.04246.
- [15] Kim, M., Kwon, H., Alahari, K., & Cho, M. (2026). Exploring High-Order Self-Similarity for Video Understanding. arXiv:2604.20760.
- [16] Lee, J., Jang, H., Koo, M., Park, J., & Shin, J. (2026). MoSS: Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models. arXiv:2604.23272.
- [17] Koo, M., Choi, D., Kim, T., Lee, K., Kim, C., Seo, Y., & Shin, J. (2025). HAMLET: Switch your VLA Model into a History-Aware Policy. arXiv:2510.00695.
- [18] Kim, S., Jang, S., Yoon, B., Kim, D., Won, J., & Shin, J. (2026). RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning. arXiv:2602.18742.
- [19] Chi, C., et al. (2024). Universal Manipulation Interface. arXiv:2402.10329.
- [20] Xu, M., et al. (2025). DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation. arXiv:2505.21864.
- [21] Fei, S., Wang, S., Shi, J., et al. (2025). LIBERO-Plus: In-Depth Robustness Analysis of Vision-Language-Action Models. arXiv:2510.13626.
- [22] Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH.
- [23] Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (DAgger). AISTATS.
- [24] Ma, Y., et al. (2025). Vision Language Models are In-Context Value Learners. ICLR. arXiv:2411.04549.
- [25] Ma, Y., et al. (2023). VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training. ICLR. arXiv:2210.00030.
- [26] Physical Intelligence, et al. (2025). π₀.₆: A VLA That Learns From Experience. arXiv:2511.14759.
- [27] Ye, C., et al. (2024). Online Iterative Reinforcement Learning from Human Feedback with General Preference Model. NeurIPS.
- [28] Physical Intelligence, et al. (2025). π₀.₅: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054.
- [29] Pertsch, K., Stachowicz, K., Ichter, B., et al. (2025). FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv:2501.09747.
Citation
@article{rldx2026,
  title         = {RLDX-1 Technical Report},
  author        = {Dongyoung Kim and Huiwon Jang and Myungkyu Koo and Suhyeok Jang and Taeyoung Kim and others},
  year          = {2026},
  journal       = {arXiv preprint arXiv:2605.03269},
  eprint        = {2605.03269},
  archivePrefix = {arXiv}
}
}