DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Shuyao Shang1*, Bing Zhan1*, Yunfei Yan1, Yuqi Wang1, Yingyan Li1, Yasong An2, Xiaoman Wang2, Jierui Liu2, Lu Hou2, Lue Fan1†, Zhaoxiang Zhang1†, Tieniu Tan1
1 NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)
2 Yinwang Intelligent Technology Co. Ltd.

Abstract

We propose DynVLA, a driving vision-language-action (VLA) model that introduces a new chain-of-thought (CoT) paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Because interaction-intensive driving scenarios exhibit rich environment dynamics, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT and Visual CoT, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

Comparison of CoT paradigms teaser
Comparison of different CoT paradigms in autonomous driving VLA models. (a) Textual CoT suffers from limited spatiotemporal understanding and high inference latency due to long textual reasoning traces. (b) Visual CoT introduces substantial redundancy and computational overhead from pixel-level generation. (c) Dynamics CoT compresses future dynamics into a small set of tokens, achieving latency-efficient inference with compact reasoning and accurate spatiotemporal modeling.

Dynamics Tokenizer

Dynamics tokenizer diagram

Given adjacent image observations, a dynamics encoder extracts ego-centric and environment-centric dynamics, which are discretized via separate VQ codebooks. The ego-centric dynamics are regularized by the ground-truth ego action, and the combined dynamics are decoded, conditioned on the current state, to reconstruct the future image and BEV map.
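
As a concrete illustration, the PyTorch sketch below mirrors this two-codebook design under assumed shapes: the toy encoder and decoder, token counts, and codebook sizes are placeholders rather than the released architecture, and the codebook/commitment losses and the BEV branch are omitted for brevity.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor VQ with a straight-through gradient estimator.
    Codebook/commitment losses are omitted for brevity."""
    def __init__(self, num_codes, dim):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, N, D)
        w = self.codebook.weight                            # (K, D)
        dist = (z.pow(2).sum(-1, keepdim=True)              # squared L2 distances
                - 2 * z @ w.T + w.pow(2).sum(-1))           # (B, N, K)
        idx = dist.argmin(-1)                               # discrete dynamics tokens
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx                  # straight-through

class DynamicsTokenizer(nn.Module):
    def __init__(self, dim=256, n_ego=4, n_env=12, codes=512):
        super().__init__()
        self.n_ego = n_ego
        # Toy encoder: two stacked RGB frames -> (n_ego + n_env) latent vectors.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, dim, kernel_size=8, stride=8), nn.GELU(),
            nn.AdaptiveAvgPool2d((1, n_ego + n_env)))
        self.vq_ego = VectorQuantizer(codes, dim)           # ego-centric codebook
        self.vq_env = VectorQuantizer(codes, dim)           # environment-centric codebook
        self.ego_head = nn.Linear(dim, 3)                   # regularized by GT ego action
        # Toy decoder: current frame + pooled dynamics -> future frame.
        self.decoder = nn.Conv2d(3 + dim, 3, kernel_size=3, padding=1)

    def forward(self, frame_t, frame_t1):                   # frames: (B, 3, H, W)
        z = self.encoder(torch.cat([frame_t, frame_t1], 1)) # (B, D, 1, N)
        z = z.flatten(2).transpose(1, 2)                    # (B, N, D)
        q_ego, ego_idx = self.vq_ego(z[:, :self.n_ego])     # ego dynamics
        q_env, env_idx = self.vq_env(z[:, self.n_ego:])     # environment dynamics
        ego_action = self.ego_head(q_ego.mean(1))           # supervised by GT ego action
        dyn = torch.cat([q_ego, q_env], 1).mean(1)          # pooled condition (B, D)
        cond = dyn[:, :, None, None].expand(-1, -1, *frame_t.shape[-2:])
        future = self.decoder(torch.cat([frame_t, cond], 1))
        return future, ego_action, (ego_idx, env_idx)
```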

DynVLA Overview

DynVLA overview diagram

DynVLA is supervised to first generate discrete dynamics tokens and then action tokens, forming a structured Dynamics CoT.
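
A minimal sketch of this supervision follows, assuming a Hugging Face-style causal-LM interface and an illustrative token layout (observation prompt, then dynamics tokens, then action tokens); the loss is applied only to the Dynamics CoT and action spans, not the prompt.

```python
import torch
import torch.nn.functional as F

def dynamics_cot_sft_loss(model, obs_tokens, dyn_tokens, act_tokens):
    """obs_tokens: (B, No) observation/instruction prompt; dyn_tokens: (B, Nd)
    discrete dynamics tokens; act_tokens: (B, Na) action tokens."""
    inputs = torch.cat([obs_tokens, dyn_tokens, act_tokens], dim=1)
    logits = model(inputs).logits                    # (B, T, V) causal-LM logits
    shift_logits = logits[:, :-1]                    # position t predicts token t+1
    shift_labels = inputs[:, 1:].clone()
    # Supervise only the dynamics and action spans, mask the observation prompt.
    shift_labels[:, : obs_tokens.size(1) - 1] = -100
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)
```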

Training Pipeline

DynVLA training pipeline diagram

DynVLA first learns a Dynamics Tokenizer by reconstructing future states from adjacent frames, producing discrete dynamics tokens. It then performs SFT on Dynamics CoT, training the model to generate dynamics tokens before action tokens. Finally, the policy is optimized via RFT with a trajectory-level reward and KL regularization.
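
The RFT stage can be sketched as a policy-gradient update with a trajectory-level reward and a KL penalty toward the frozen SFT policy. The mean-reward baseline, the reward definition, and the KL weight beta below are placeholder assumptions, not the paper's exact objective.

```python
import torch

def rft_loss(logp_policy, logp_ref, reward, beta=0.1):
    """logp_policy / logp_ref: (B, T) per-token log-probs of a sampled
    dynamics+action sequence under the current policy and the frozen SFT
    reference; reward: (B,) trajectory-level reward, e.g. a planning score."""
    seq_logp = logp_policy.sum(dim=1)                     # log-prob of the sequence
    advantage = reward - reward.mean()                    # simple in-batch baseline
    pg_term = -(advantage.detach() * seq_logp).mean()     # REINFORCE objective
    kl_term = (logp_policy - logp_ref).sum(dim=1).mean()  # MC estimate of KL(pi||ref)
    return pg_term + beta * kl_term                       # KL keeps policy near SFT
```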

Transferability of Learned Dynamics

We extract discrete dynamics tokens from one scene, inject them into another, and decode the resulting future states. The decoded futures follow the injected dynamics, illustrating that the learned dynamics tokens transfer across scenes.
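
A sketch of this probe, reusing the assumed DynamicsTokenizer interface from the tokenizer sketch above: scene A's discrete tokens condition the decoding of scene B's current frame.

```python
import torch

@torch.no_grad()
def transfer_dynamics(tokenizer, scene_a, scene_b):
    """scene_a / scene_b: dicts holding adjacent frames 'frame_t' / 'frame_t1'."""
    # Encode adjacent frames of scene A into discrete dynamics tokens.
    _, _, (ego_idx, env_idx) = tokenizer(scene_a["frame_t"], scene_a["frame_t1"])
    # Look the tokens up in the codebooks and pool them into a condition vector.
    q = torch.cat([tokenizer.vq_ego.codebook(ego_idx),
                   tokenizer.vq_env.codebook(env_idx)], dim=1).mean(1)
    cond = q[:, :, None, None].expand(-1, -1, *scene_b["frame_t"].shape[-2:])
    # Decode scene B's future under scene A's injected dynamics.
    return tokenizer.decoder(torch.cat([scene_b["frame_t"], cond], dim=1))
```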

Performance Comparison

NAVSIM benchmark comparison
Comparison on real-world NAVSIM Benchmark.
Bench2Drive benchmark comparison
Comparison on the closed-loop Bench2Drive Benchmark.
In-house dataset comparison
Comparison on a large-scale In-house Dataset.

Ablation Studies

CoT design ablation
Analysis on CoT Design and Latency.
Training stages ablation
Ablation on Training Stages.
VQ code activation analysis
Number of Activated VQ Codes during Dynamics Tokenizer Training.
Tokenizer design ablation
Ablation on Dynamics Tokenizer Designs.

Qualitative Analysis

Qualitative analysis results

Dynamics CoT improves planning by reasoning over future dynamics. The first two columns show the current observation and the future decoded from the reasoned dynamics; the last two columns compare planning results with and without Dynamics CoT. Compared to direct action prediction, Dynamics CoT provides intent-aware, foresighted, and constraint-compliant future dynamics, enabling safer and more feasible planning in challenging scenarios.