3D-CAVLA: Leveraging Depth and 3D Context to Generalize Vision–Language Action Models for Unseen Tasks

New York University

Abstract

Robotic manipulation in 3D requires learning an N degree-of-freedom joint-space trajectory for a robot manipulator. Robots must possess semantic and visual perception abilities to translate their view of the workspace into the low-level control required for object manipulation. Recent work has demonstrated that large Vision-Language Models (VLMs) can be fine-tuned to learn the mapping between RGB images, language instructions, and joint-space control. These models typically take RGB images of the workspace and language instructions as input, and are trained on large datasets of teleoperated robot demonstrations. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model by integrating chain-of-thought reasoning, depth perception, and task-oriented region-of-interest detection. Our experiments in the LIBERO simulation environment show that our proposed model, 3D-CAVLA, improves the success rate across the LIBERO task suites, achieving an average success rate of 98.1%. We also evaluate the zero-shot capabilities of our method, demonstrating that 3D scene awareness leads to robust learning and adaptation on completely unseen tasks, where 3D-CAVLA achieves an absolute improvement of 8.8%. We open-source our code and the unseen tasks dataset we created to promote community-driven research:

Architecture of 3D-CAVLA

Our proposed model, 3D-CAVLA, integrates chain-of-thought style narrative task descriptions, depth embeddings, and Region of Interest (ROI) pooling to improve the scene awareness of vision-language-action modeling. While GPT-4 and the ROI detection module are frozen components, our depth encoder is a lightweight, PointNet-inspired trainable network with a spatial-invariance transformation, convolution blocks, and linear projections that map the depth embeddings to the input dimensions of LLaMA2-7B.
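As a rough illustration of this depth pathway, the sketch below shows a minimal PointNet-style encoder in PyTorch. The layer widths, the single-token projection, the 4096-dimensional LLaMA2-7B hidden size, and all module names are assumptions made for illustration, not the released implementation.

# Minimal PyTorch sketch of a PointNet-style depth encoder (illustrative only).
# Layer sizes, the 4096-dim LLM hidden size, and module names are assumptions.
import torch
import torch.nn as nn


class TNet(nn.Module):
    """Learns a 3x3 spatial transform for input-point invariance (PointNet-style)."""

    def __init__(self, k: int = 3):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Conv1d(k, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, k * k),
        )

    def forward(self, x):                      # x: (B, k, N) points
        feat = self.mlp(x).max(dim=2).values   # global max pool -> (B, 1024)
        mat = self.fc(feat).view(-1, self.k, self.k)
        mat = mat + torch.eye(self.k, device=x.device)  # bias toward identity
        return torch.bmm(mat, x)               # apply the learned transform


class DepthEncoder(nn.Module):
    """Projects a depth-derived point cloud into the LLM token embedding space."""

    def __init__(self, llm_dim: int = 4096, num_tokens: int = 1):
        super().__init__()
        self.tnet = TNet(3)
        self.conv = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 1024, 1), nn.ReLU(),
        )
        self.proj = nn.Linear(1024, llm_dim * num_tokens)
        self.num_tokens, self.llm_dim = num_tokens, llm_dim

    def forward(self, points):                 # points: (B, N, 3) from the depth map
        x = self.tnet(points.transpose(1, 2))  # (B, 3, N), spatially aligned
        x = self.conv(x).max(dim=2).values     # permutation-invariant global feature
        return self.proj(x).view(-1, self.num_tokens, self.llm_dim)


# Usage: embed a back-projected depth map and prepend it to the language tokens.
depth_tokens = DepthEncoder()(torch.randn(2, 2048, 3))   # shape (2, 1, 4096)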

Our Task-Aware Framework for Region of Interest Detection via Entity Recognition and Object Tracking.
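The sketch below illustrates one plausible reading of this pipeline: task-relevant entities are extracted from the instruction, detected in each frame, and kept associated over time by IoU overlap before their regions are pooled. The functions extract_entities and detect_objects, and the IoU threshold, are hypothetical placeholders for the frozen entity-recognition and detection components; this is not the paper's exact implementation.

# Illustrative sketch of task-aware ROI detection via entity recognition and tracking.
from dataclasses import dataclass


@dataclass
class Box:
    x1: float; y1: float; x2: float; y2: float

    def iou(self, other: "Box") -> float:
        ix1, iy1 = max(self.x1, other.x1), max(self.y1, other.y1)
        ix2, iy2 = min(self.x2, other.x2), min(self.y2, other.y2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        a = (self.x2 - self.x1) * (self.y2 - self.y1)
        b = (other.x2 - other.x1) * (other.y2 - other.y1)
        return inter / (a + b - inter + 1e-8)


def track_task_rois(frames, instruction, extract_entities, detect_objects,
                    iou_keep=0.3):
    """Detect task-relevant objects, then keep them associated across frames."""
    targets = extract_entities(instruction)       # e.g. ["chocolate pudding", "plate"]
    tracks = {name: None for name in targets}     # last accepted box per entity
    rois_per_frame = []
    for frame in frames:
        detections = detect_objects(frame, targets)   # hypothetical: {name: Box}
        for name, box in detections.items():
            prev = tracks[name]
            # Accept a new box if it is the first sighting or overlaps the track.
            if prev is None or prev.iou(box) >= iou_keep:
                tracks[name] = box
        rois_per_frame.append({n: b for n, b in tracks.items() if b is not None})
    return rois_per_frame   # image features would then be pooled inside these boxes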

Experiments

Results on the LIBERO Benchmark

3D-CAVLA shows consistent improvements across all task suites in the dual-camera setup. Most baselines overfit to the tasks, so the margins are narrow. The largest gains appear on long-horizon tasks (column 5), where chain-of-thought instructions help the policy focus on one sub-task at a time. All scores are reported as success rate (%).


Performance on unseen tasks

Success rate (in %) of OpenVLA-OFT and 3D-CAVLA on 10 unseen tasks. Neither model matches its performance on seen tasks. 3D-CAVLA decomposes unseen tasks into seen sub-steps and applies task-aware region-of-interest detection, enabling better generalization.
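For intuition, the following sketch shows how an unseen instruction could be decomposed into seen sub-steps with a frozen LLM. The prompt wording, the "gpt-4" model name, and the OpenAI Python SDK usage are assumptions made for illustration; the paper only states that GPT-4 is used as a frozen component for chain-of-thought style task descriptions.

# Hedged sketch: chain-of-thought task decomposition with a frozen LLM.
# Prompt wording and SDK usage are illustrative assumptions, not the paper's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def decompose_task(instruction: str) -> list[str]:
    """Ask the frozen LLM to rewrite one unseen instruction as ordered sub-steps."""
    prompt = (
        "Decompose the following robot manipulation task into short, ordered "
        "sub-steps, one per line, using only primitive actions such as pick, "
        f"place, open, close, and turn:\n{instruction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [s.strip("- ").strip() for s in lines if s.strip()]


# Example: decompose_task("turn on the stove and put the bowl on it") might yield
# ["turn on the stove", "pick up the bowl", "place the bowl on the stove"].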


Robot demonstrations on unseen tasks

Qualitative comparisons of OpenVLA-OFT and 3D-CAVLA on unseen LIBERO tasks. We show the first, middle, and last frames of each inference. The final two rows depict failures where both models misidentify the target object or get distracted by previously seen objects.


Rollout Videos: Qualitative Comparisons of OpenVLA-OFT and 3D-CAVLA on unseen LIBERO tasks

OpenVLA-OFT:
Put the chocolate pudding on the plate
3D-CAVLA:
Put the chocolate pudding on the plate
OpenVLA-OFT:
Place the white and yellow mug on the plate
3D-CAVLA:
Place the white and yellow mug on the plate
OpenVLA-OFT:
Turn on the stove and put the bowl on it
3D-CAVLA:
Turn on the stove and put the bowl on it

BibTeX