Diffusion Transformer Policy: Scaling Diffusion Transformer for Generalist Vision-Language-Action Learning

1Shanghai AI Lab 2College of Computer Science and Technology, Zhejiang University 3MMLab, The Chinese University of Hong Kong 4Peking University 5SenseTime Research 6Tsinghua University 7Center for Artificial Intelligence and Robotics, HKISI, CAS

[Teaser figure]

Simulation Benchmarks Comparison

Abstract

Recent large vision-language-action models pretrained on diverse robot datasets have demonstrated the potential to generalize to new environments with a small amount of in-domain data. However, these approaches typically predict individual discretized or continuous actions with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model continuous action sequences with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, which directly denoises action chunks with a large transformer model rather than a small action head over action embeddings. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets and achieve better generalization. Extensive experiments demonstrate the effectiveness and generalization of Diffusion Transformer Policy on ManiSkill2, LIBERO, CALVIN, and SimplerEnv, as well as on a real-world Franka arm: it consistently outperforms OpenVLA and Octo on the real-to-sim benchmark SimplerEnv, the real-world Franka arm, and LIBERO. Notably, without bells and whistles, the proposed approach achieves state-of-the-art performance with only a single third-view camera stream on the CALVIN ABC->D task, improving the average number of tasks completed in a row (out of 5) to 3.6, and the pretraining stage improves the average success sequence length on CALVIN by over 1.2.
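To make the core idea concrete, below is a minimal PyTorch sketch of denoising a continuous action chunk with a transformer conditioned on vision-language tokens and a diffusion timestep. The class name ActionDenoiserTransformer, the hyperparameters, and the stand-in noise schedule are illustrative assumptions, not the paper's exact architecture or released code.

import torch
import torch.nn as nn

class ActionDenoiserTransformer(nn.Module):
    """Denoise a chunk of continuous actions with a transformer backbone."""
    def __init__(self, action_dim=7, n_steps=1000, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.action_in = nn.Linear(action_dim, d_model)   # embed noisy actions as tokens
        self.time_emb = nn.Embedding(n_steps, d_model)    # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_out = nn.Linear(d_model, action_dim)  # per-step noise prediction

    def forward(self, noisy_actions, t, cond_tokens):
        # noisy_actions: (B, chunk_len, action_dim); cond_tokens: (B, n_ctx, d_model)
        x = self.action_in(noisy_actions) + self.time_emb(t)[:, None, :]
        x = torch.cat([cond_tokens, x], dim=1)            # prepend vision-language context
        x = self.backbone(x)
        return self.action_out(x[:, -noisy_actions.shape[1]:])

# One DDPM-style training step: corrupt a ground-truth action chunk with
# noise and regress that noise with the transformer itself.
model = ActionDenoiserTransformer()
actions = torch.randn(2, 16, 7)                           # ground-truth action chunk
cond = torch.randn(2, 32, 768)                            # vision-language tokens
t = torch.randint(0, 1000, (2,))
alpha_bar = torch.rand(2, 1, 1)                           # stand-in for a real noise schedule
noise = torch.randn_like(actions)
noisy = alpha_bar.sqrt() * actions + (1.0 - alpha_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t, cond), noise)

The point this sketch illustrates is the design choice highlighted in the abstract: the large transformer backbone itself performs the denoising over the whole action chunk, rather than delegating action prediction to a small head on top of the representation.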

Video

Real Franka Demonstration

Here, we show the input frames that are fed into the model; the video has been accelerated by around 5x. All experiments use 10-shot fine-tuning, and the training samples are provided in the video above.

More tasks are coming.

Real-to-Sim benchmark SimplerEnv Demonstration

Here, we demonstrate the robustness of the proposed model on SimplerEnv. We pretrain on the OXE datasets and evaluate on SimplerEnv.

Simulation benchmark LIBERO Demonstration

Here, we present evaluation examples of the proposed model on the LIBERO benchmark. We pretrain on the OXE datasets and fine-tune on LIBERO.

LIBERO-Long

LIBERO-Spatial

LIBERO-Object

LIBERO-Goal

BibTeX

@article{hou2024diffusion,
  title={Diffusion Transformer Policy: Scaling Diffusion Transformer for Generalist Vision-Language-Action Learning},
  author={Hou, Zhi and Zhang, Tianyi and Xiong, Yuwen and Pu, Hengjun and Zhao, Chengyang and Tong, Ronglei and Qiao, Yu and Dai, Jifeng and Chen, Yuntao},
  journal={arXiv preprint arXiv:2410.15959},
  year={2024}
}