ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

Published as arXiv preprint arXiv:2602.17951, 2026

This work presents ROCKET, a residual-oriented multi-layer alignment framework for improving spatial reasoning in Vision-Language-Action (VLA) models. The research was conducted at the University of Maryland, College Park.

Key Contributions:

Developed a novel multi-layer alignment framework for VLA models
Enhanced spatial awareness through residual-oriented techniques
Bridged 2D and 3D representations in vision-language-action systems

Research Area: Vision-Language-Action Models, Robotics, Multimodal Learning

Status: Manuscript in preparation for submission to ICML 2026

Recommended citation: G. Sun, T. Du, K. Feng, C. Luo, X. Ding, Z. Shen, Z. Wang, Y. He, A. Li. (2026). "ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models." arXiv preprint arXiv:2602.17951.


Tingting Du