Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation method for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation of each video frame/text feature using a set of basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, Ego-VPA models context fusion and cross-modal transfer efficiently. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), substantially improving over baselines and reaching the performance of full fine-tuning.
We adapt SOTA prompt-tuning methods (e.g., TPT, VPT, VoPF+C) to Ego-VFMs as baselines and propose Ego-VPA, which leverages cross-modal prompt synthesis with basis prompts.
Cross-modal Prompt Synthesis with Basis Prompts
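To make the idea concrete, below is a minimal PyTorch sketch (not our exact implementation) of prompt synthesis from a shared basis: each frame/text feature sequence is sparsely approximated by its top-k most similar basis prompts, whose weighted combination forms the prompts prepended to that sequence. Class and parameter names (PromptSynthesizer, n_basis, top_k) are illustrative placeholders; please refer to the paper for the exact formulation.

```python
# Minimal sketch of cross-modal prompt synthesis with a shared prompt basis.
# Assumes d-dimensional frame/text token features; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptSynthesizer(nn.Module):
    """Synthesizes prompts for a feature sequence from a shared prompt basis."""

    def __init__(self, dim: int = 512, n_basis: int = 16, top_k: int = 4):
        super().__init__()
        # Basis prompts shared across all frames and both modalities.
        self.basis = nn.Parameter(torch.randn(n_basis, dim) * 0.02)
        self.top_k = top_k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """
        feats: (B, N, D) frame or text token features.
        returns: (B, top_k, D) synthesized prompts to prepend to the sequence.
        """
        # Cosine similarity between each feature and each basis prompt.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.basis, dim=-1).t()  # (B, N, K)
        # Aggregate over the sequence, then keep only the top-k basis prompts
        # (local sparse approximation: few basis prompts per input).
        scores = sim.mean(dim=1)                      # (B, K)
        topv, topi = scores.topk(self.top_k, dim=-1)  # (B, k)
        weights = topv.softmax(dim=-1).unsqueeze(-1)  # (B, k, 1)
        selected = self.basis[topi]                   # (B, k, D)
        return weights * selected                     # (B, k, D)


if __name__ == "__main__":
    synth = PromptSynthesizer(dim=512, n_basis=16, top_k=4)
    video_feats = torch.randn(2, 8, 512)   # 2 clips, 8 frame tokens each
    text_feats = torch.randn(2, 12, 512)   # 2 captions, 12 text tokens each
    video_prompts = synth(video_feats)     # same basis serves both modalities
    text_prompts = synth(text_feats)
    video_in = torch.cat([video_prompts, video_feats], dim=1)  # prepend prompts
    text_in = torch.cat([text_prompts, text_feats], dim=1)
    print(video_in.shape, text_in.shape)
```

Because the same basis is shared across frames and modalities, only the basis prompts (plus any lightweight projection) add learnable parameters, which is what keeps the adaptation budget small.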
Results on Charades-Ego and EGTEA compared to SOTA prompt-tuning methods. Ego-VPA achieves superior performance while using only 0.84% learnable parameters.
Ego-VPA generalizes to the EPIC-Kitchens-100 multi-instance retrieval task, achieving a better parameter-performance trade-off.
Please refer to our paper for more ablations.
This work was partially funded by NSF award IIS-2303153, a gift from Qualcomm, and NVIDIA GPU donations.