X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model • Paper • 2510.10274 • Published Oct 11
Enhancing Vision-Language Model with Unmasked Token Alignment • Paper • 2405.19009 • Published May 29, 2024
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment • Paper • 2406.19736 • Published Jun 28, 2024