ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
Published in arXiv preprint arXiv:2602.17951, 2026
This work presents ROCKET, a residual-oriented multi-layer alignment framework that improves spatial reasoning in Vision-Language-Action (VLA) models. The research was conducted at the University of Maryland, College Park.
Key Contributions:
- Developed a novel multi-layer alignment framework for VLA models
- Enhanced spatial awareness through residual-oriented techniques
- Bridged 2D and 3D representations in vision-language-action systems
Research Area: Vision-Language-Action Models, Robotics, Multimodal Learning
Status: Manuscript in preparation for submission to ICML 2026
Recommended citation: G. Sun, T. Du, K. Feng, C. Luo, X. Ding, Z. Shen, Z. Wang, Y. He, A. Li (2026). "ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models." arXiv preprint arXiv:2602.17951.
