Underwater activities like scuba diving enable millions of people each year to explore marine environments for recreation and scientific research. Maintaining situational awareness and communicating effectively are essential for diver safety. Traditional underwater communication systems are often bulky and expensive, limiting their accessibility to divers of all levels. While recent systems leverage lightweight smartphones and support text messaging, their messages are predefined and thus restrict context-specific communication.
In this project, we present AquaVLM, a tap-and-send underwater communication system that automatically generates context-aware messages and transmits them using ubiquitous smartphones. Our system features a mobile vision-language model (VLM) fine-tuned on an auto-generated underwater conversation dataset and employs a hierarchical message generation pipeline. We co-design the VLM and transmission, incorporating error-resilient fine-tuning to improve the system's robustness to transmission errors. We develop a VR simulator to enable users to experience AquaVLM in a realistic underwater environment and create a fully functional prototype on the iOS platform for real-world experiments. Both subjective and objective evaluations validate the effectiveness of AquaVLM and highlight its potential for personal underwater communication as well as broader mobile VLM applications.
Alice mounts her mobile phone on her arm, captures an image of a nearby shark, and selects the "SOS" purpose on the user interface. The phone forms a prompt from multimodal data: the image, recent sensor readings (e.g., compass orientation and depth), and her SOS purpose. The mobile VLM generates two message candidates, and Alice selects one to send.
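The exact prompt template is not shown here; the Python sketch below illustrates one way the selected purpose and recent sensor readings could be assembled into a text prompt, with the image passed to the VLM separately as visual input. The template wording, field names, and sensor values are illustrative assumptions, not the system's actual format.

```python
# Illustrative sketch (not AquaVLM's actual template): combine the selected
# purpose and recent sensor readings into a text prompt for the mobile VLM.

def build_prompt(purpose: str, sensors: dict) -> str:
    """Assemble the purpose and sensor data into a prompt string.
    The captured image is supplied to the VLM as a separate visual input."""
    sensor_summary = ", ".join(f"{k}: {v}" for k, v in sensors.items())
    return (
        "You are assisting a scuba diver.\n"
        f"Purpose: {purpose}\n"
        f"Sensor data: {sensor_summary}\n"
        "Based on the attached underwater image, generate two short, "
        "context-aware message candidates the diver could send."
    )

# Example usage with hypothetical sensor readings.
prompt = build_prompt(
    purpose="SOS",
    sensors={"compass": "NE (45 deg)", "depth": "12 m", "water_temp": "18 C"},
)
print(prompt)
```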
The message is modulated into an acoustic signal by the signal processing module and transmitted via the phone’s speaker. At the receiving end, the acoustic signal is demodulated and converted back into a human-readable message. In case of errors (e.g., altered characters), the mobile VLM attempts to recover the original message. The received message is displayed on Bob’s phone screen, and two reply candidates are generated based on his selected purpose.
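The specific modulation scheme is not detailed in this summary; as a rough illustration of text-over-acoustics, the sketch below encodes a message with simple binary frequency-shift keying (FSK) and recovers it by per-symbol correlation. The carrier frequencies, bit duration, and sample rate are assumptions chosen only to make the example self-contained.

```python
# Minimal FSK sketch for acoustic text transmission. The actual AquaVLM
# modulation scheme may differ; all parameters below are assumptions.
import numpy as np

SAMPLE_RATE = 44100          # Hz, typical phone audio rate
BIT_DURATION = 0.02          # seconds per bit (assumed)
FREQ_0, FREQ_1 = 6000, 8000  # tone frequencies for bits 0 and 1 (assumed)

def modulate(message: str) -> np.ndarray:
    """Map each bit of the UTF-8 encoded message to a sine tone."""
    bits = np.unpackbits(np.frombuffer(message.encode("utf-8"), dtype=np.uint8))
    t = np.arange(int(SAMPLE_RATE * BIT_DURATION)) / SAMPLE_RATE
    tones = [np.sin(2 * np.pi * (FREQ_1 if b else FREQ_0) * t) for b in bits]
    return np.concatenate(tones).astype(np.float32)

def demodulate(signal: np.ndarray) -> str:
    """Recover bits by correlating each symbol against the two reference tones."""
    n = int(SAMPLE_RATE * BIT_DURATION)
    t = np.arange(n) / SAMPLE_RATE
    ref0, ref1 = np.sin(2 * np.pi * FREQ_0 * t), np.sin(2 * np.pi * FREQ_1 * t)
    bits = []
    for i in range(0, len(signal) - n + 1, n):
        chunk = signal[i:i + n]
        bits.append(1 if abs(chunk @ ref1) > abs(chunk @ ref0) else 0)
    data = np.packbits(np.array(bits, dtype=np.uint8))
    return data.tobytes().decode("utf-8", errors="replace")

audio = modulate("SOS: shark sighted, depth 12 m")
print(demodulate(audio))  # round-trips in the absence of channel noise
```

In a real underwater channel, noise and multipath would corrupt some symbols, which is where the VLM-based message recovery described above comes into play.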
The instruction tuning pipeline consists of four stages: (a) We self-collect images from five scuba diving videos and use nine types of critical sensor data typically available on mobile devices or diving watches. (b) We generate underwater conversations with a commercial VLM, ChatGPT-4o, leveraging the self-collected multimodal data and carefully designed prompts. (c) We identify three tasks for instruction tuning: sender message generation, reply generation, and message recovery. (d) We fine-tune MobileVLM2 using LoRA.
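As a hedged illustration of stage (d), the sketch below attaches LoRA adapters to a causal language model with Hugging Face PEFT. The checkpoint identifier, target modules, and hyperparameters are assumptions; the actual MobileVLM2 fine-tuning setup (including its vision tower and custom loading code) may differ.

```python
# Illustrative LoRA setup with Hugging Face PEFT; not AquaVLM's actual recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "mtgv/MobileVLM_V2-1.7B"  # checkpoint id assumed; the real
# model may require its own loading utilities rather than AutoModelForCausalLM.
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are updated

# The instruction-tuning data would cover the three tasks above:
# sender message generation, reply generation, and message recovery.
```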
To evaluate the effectiveness of AquaVLM, we design and build a virtual reality (VR)-based simulation platform that enables users to wear a headset to explore an underwater world, experience various events, and communicate with virtual divers using AquaVLM at any point during the simulation.
Figure: VR simulation platform (testbed, different events, and a simulation snapshot).
We develop a prototype using an iPhone 12 Pro and an Apple Watch Ultra and conduct tests in a lake with a maximum range of 20 m and an average depth of 3 m.
Figure: iOS prototype and lake test environment.
@misc{tian2025aquavlmimprovingunderwatersituation,
      title={AquaVLM: Improving Underwater Situation Awareness with Mobile Vision Language Models},
      author={Beitong Tian and Lingzhi Zhao and Bo Chen and Haozhen Zheng and Jingcheng Yang and Mingyuan Wu and Deepak Vasisht and Klara Nahrstedt},
      year={2025},
      eprint={2510.21722},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2510.21722},
}