Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
Published in Transactions on Machine Learning Research (TMLR), 2026
This survey examines the data infrastructure supporting Vision-Language-Action (VLA) models in robotics. It argues that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols.
Key Contributions:
- Systematic analysis of VLA datasets categorized by embodiment diversity and modality composition
- Comprehensive review of benchmarks analyzing task complexity and environment structure
- Examination of data engines including simulation and automated generation paradigms
- Identification of four critical challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation
Research Area: Vision-Language-Action Models, Robotics, Machine Learning
Status: Accepted by TMLR after peer review
Recommended citation: Z. Wang, B. Wang, H. Zhang, T. Du, T. Chen, G. Sun, Y. He, Z. Shen, W. Ye, A. Li (2026). "Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines." Transactions on Machine Learning Research (TMLR).
