Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
Published in Transactions on Machine Learning Research (TMLR), 2026
This survey examines the data infrastructure supporting Vision-Language-Action (VLA) models in robotics. It argues that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols.
Key Contributions:
- Systematic analysis of VLA datasets categorized by embodiment diversity and modality composition
- Comprehensive review of benchmarks analyzing task complexity and environment structure
- Examination of data engines including simulation and automated generation paradigms
- Identification of four critical challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation
Research Area: Vision-Language-Action Models, Robotics, Machine Learning
Status: Accepted by TMLR after peer review
Recommended citation: Z. Wang, B. Wang, H. Zhang, T. Du, T. Chen, G. Sun, Y. He, Z. Shen, W. Ye, A. Li (2026). "Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines." Transactions on Machine Learning Research (TMLR).
