DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Shuyao Shang1*, Bing Zhan1*, Yunfei Yan1, Yuqi Wang1, Yingyan Li1, Yasong An2, Xiaoman Wang2, Jierui Liu2, Lu Hou2, Lue Fan1†, Zhaoxiang Zhang1†, Tieniu Tan1
1 NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)
2 Yinwang Intelligent Technology Co. Ltd.

Abstract

We propose DynVLA, a driving vision-language-action (VLA) model that introduces a new chain-of-thought (CoT) paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Because interaction-intensive driving scenarios exhibit rich environment dynamics, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT and Visual CoT, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

Comparison of CoT paradigms teaser
Comparison of different CoT paradigms in autonomous driving VLA models. (a) Textual CoT suffers from limited spatiotemporal understanding and high inference latency due to long textual reasoning traces. (b) Visual CoT introduces substantial redundancy and computational overhead from pixel-level generation. (c) Dynamics CoT compresses future dynamics into a small set of tokens, achieving latency-efficient inference with compact reasoning and accurate spatiotemporal modeling.

Dynamics Tokenizer

Dynamics tokenizer diagram

Given adjacent image observations, a dynamics encoder extracts ego-centric and environment-centric dynamics, which are discretized via separate VQ codebooks. The ego-centric dynamics are regularized by the ground-truth ego action, and the combined dynamics are decoded, conditioned on the current state, to reconstruct the future image and BEV map.
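
As a concrete illustration, the PyTorch sketch below mirrors this two-codebook design under assumed shapes: the toy encoder and decoder, token counts, and codebook sizes are placeholders rather than the released architecture, and the codebook/commitment losses and the BEV branch are omitted for brevity.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor VQ with a straight-through gradient estimator.
    Codebook/commitment losses are omitted for brevity."""
    def __init__(self, num_codes, dim):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, N, D)
        w = self.codebook.weight                            # (K, D)
        dist = (z.pow(2).sum(-1, keepdim=True)              # squared L2 distances
                - 2 * z @ w.T + w.pow(2).sum(-1))           # (B, N, K)
        idx = dist.argmin(-1)                               # discrete dynamics tokens
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx                  # straight-through

class DynamicsTokenizer(nn.Module):
    def __init__(self, dim=256, n_ego=4, n_env=12, codes=512):
        super().__init__()
        self.n_ego = n_ego
        # Toy encoder: two stacked RGB frames -> (n_ego + n_env) latent vectors.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, dim, kernel_size=8, stride=8), nn.GELU(),
            nn.AdaptiveAvgPool2d((1, n_ego + n_env)))
        self.vq_ego = VectorQuantizer(codes, dim)           # ego-centric codebook
        self.vq_env = VectorQuantizer(codes, dim)           # environment-centric codebook
        self.ego_head = nn.Linear(dim, 3)                   # regularized by GT ego action
        # Toy decoder: current frame + pooled dynamics -> future frame.
        self.decoder = nn.Conv2d(3 + dim, 3, kernel_size=3, padding=1)

    def forward(self, frame_t, frame_t1):                   # frames: (B, 3, H, W)
        z = self.encoder(torch.cat([frame_t, frame_t1], 1)) # (B, D, 1, N)
        z = z.flatten(2).transpose(1, 2)                    # (B, N, D)
        q_ego, ego_idx = self.vq_ego(z[:, :self.n_ego])     # ego dynamics
        q_env, env_idx = self.vq_env(z[:, self.n_ego:])     # environment dynamics
        ego_action = self.ego_head(q_ego.mean(1))           # supervised by GT ego action
        dyn = torch.cat([q_ego, q_env], 1).mean(1)          # pooled condition (B, D)
        cond = dyn[:, :, None, None].expand(-1, -1, *frame_t.shape[-2:])
        future = self.decoder(torch.cat([frame_t, cond], 1))
        return future, ego_action, (ego_idx, env_idx)
```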

DynVLA Overview

DynVLA overview diagram

DynVLA is supervised to first generate discrete dynamics tokens and then action tokens, forming a structured Dynamics CoT.
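
A minimal sketch of this supervision follows, assuming a Hugging Face-style causal-LM interface and an illustrative token layout (observation prompt, then dynamics tokens, then action tokens); the loss is applied only to the Dynamics CoT and action spans, not the prompt.

```python
import torch
import torch.nn.functional as F

def dynamics_cot_sft_loss(model, obs_tokens, dyn_tokens, act_tokens):
    """obs_tokens: (B, No) observation/instruction prompt; dyn_tokens: (B, Nd)
    discrete dynamics tokens; act_tokens: (B, Na) action tokens."""
    inputs = torch.cat([obs_tokens, dyn_tokens, act_tokens], dim=1)
    logits = model(inputs).logits                    # (B, T, V) causal-LM logits
    shift_logits = logits[:, :-1]                    # position t predicts token t+1
    shift_labels = inputs[:, 1:].clone()
    # Supervise only the dynamics and action spans, mask the observation prompt.
    shift_labels[:, : obs_tokens.size(1) - 1] = -100
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)
```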

Training Pipeline

DynVLA training pipeline diagram

DynVLA first learns a Dynamics Tokenizer by reconstructing future states from adjacent frames, producing discrete dynamics tokens. It then performs SFT on Dynamics CoT, training the model to generate dynamics tokens before action tokens. Finally, the policy is optimized via RFT with a trajectory-level reward and KL regularization.
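
The RFT stage can be sketched as a policy-gradient update with a trajectory-level reward and a KL penalty toward the frozen SFT policy. The mean-reward baseline, the reward definition, and the KL weight beta below are placeholder assumptions, not the paper's exact objective.

```python
import torch

def rft_loss(logp_policy, logp_ref, reward, beta=0.1):
    """logp_policy / logp_ref: (B, T) per-token log-probs of a sampled
    dynamics+action sequence under the current policy and the frozen SFT
    reference; reward: (B,) trajectory-level reward, e.g. a planning score."""
    seq_logp = logp_policy.sum(dim=1)                     # log-prob of the sequence
    advantage = reward - reward.mean()                    # simple in-batch baseline
    pg_term = -(advantage.detach() * seq_logp).mean()     # REINFORCE objective
    kl_term = (logp_policy - logp_ref).sum(dim=1).mean()  # MC estimate of KL(pi||ref)
    return pg_term + beta * kl_term                       # KL keeps policy near SFT
```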

Transferability of Learned Dynamics

We extract discrete dynamics tokens from one scene, inject them into another, and decode the resulting future states. The decoded futures follow the injected dynamics, illustrating that the learned dynamics tokens transfer across scenes.
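
A sketch of this probe, reusing the assumed DynamicsTokenizer interface from the tokenizer sketch above: scene A's discrete tokens condition the decoding of scene B's current frame.

```python
import torch

@torch.no_grad()
def transfer_dynamics(tokenizer, scene_a, scene_b):
    """scene_a / scene_b: dicts holding adjacent frames 'frame_t' / 'frame_t1'."""
    # Encode adjacent frames of scene A into discrete dynamics tokens.
    _, _, (ego_idx, env_idx) = tokenizer(scene_a["frame_t"], scene_a["frame_t1"])
    # Look the tokens up in the codebooks and pool them into a condition vector.
    q = torch.cat([tokenizer.vq_ego.codebook(ego_idx),
                   tokenizer.vq_env.codebook(env_idx)], dim=1).mean(1)
    cond = q[:, :, None, None].expand(-1, -1, *scene_b["frame_t"].shape[-2:])
    # Decode scene B's future under scene A's injected dynamics.
    return tokenizer.decoder(torch.cat([scene_b["frame_t"], cond], dim=1))
```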

Performance Comparison

NAVSIM benchmark comparison
Comparison on real-world NAVSIM Benchmark.
Bench2Drive benchmark comparison
Comparison on the closed-loop Bench2Drive Benchmark.
In-house dataset comparison
Comparison on a large-scale In-house Dataset.

Ablation Studies

CoT design ablation
Analysis on CoT Design and Latency.
Training stages ablation
Ablation on Training Stages.
VQ code activation analysis
Number of Activated VQ Codes during Dynamics Tokenizer Training.
Tokenizer design ablation
Ablation on Dynamics Tokenizer Designs.

Qualitative Analysis

Qualitative analysis results

Dynamics CoT improves planning by reasoning over future dynamics. The first two columns show the current observation and the future decoded from the reasoned dynamics; the last two columns compare planning results with and without Dynamics CoT. Compared to direct action prediction, Dynamics CoT provides intent-aware, foresighted, and constraint-compliant future dynamics, enabling safer and more feasible planning in challenging scenarios.