Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation method for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation of each video frame/text feature using a set of basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, Ego-VPA models context fusion and cross-modal transfer efficiently. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), substantially improving over baselines and reaching the performance of full fine-tuning.
We adapt SOTA prompt-tuning methods (e.g., TPT, VPT, VoPF+C) to Ego-VFMs as baselines and propose Ego-VPA, which leverages cross-modal prompt synthesis with basis prompts.
Cross-modal Prompt Synthesis with Basis Prompts
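To make the idea concrete, below is a minimal PyTorch sketch (not our exact implementation) of prompt synthesis from a shared basis: each frame/text feature sequence is sparsely approximated by its top-k most similar basis prompts, whose weighted combination forms the prompts prepended to that sequence. Class and parameter names (PromptSynthesizer, n_basis, top_k) are illustrative placeholders; please refer to the paper for the exact formulation.

```python
# Minimal sketch of cross-modal prompt synthesis with a shared prompt basis.
# Assumes d-dimensional frame/text token features; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptSynthesizer(nn.Module):
    """Synthesizes prompts for a feature sequence from a shared prompt basis."""

    def __init__(self, dim: int = 512, n_basis: int = 16, top_k: int = 4):
        super().__init__()
        # Basis prompts shared across all frames and both modalities.
        self.basis = nn.Parameter(torch.randn(n_basis, dim) * 0.02)
        self.top_k = top_k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """
        feats: (B, N, D) frame or text token features.
        returns: (B, top_k, D) synthesized prompts to prepend to the sequence.
        """
        # Cosine similarity between each feature and each basis prompt.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.basis, dim=-1).t()  # (B, N, K)
        # Aggregate over the sequence, then keep only the top-k basis prompts
        # (local sparse approximation: few basis prompts per input).
        scores = sim.mean(dim=1)                      # (B, K)
        topv, topi = scores.topk(self.top_k, dim=-1)  # (B, k)
        weights = topv.softmax(dim=-1).unsqueeze(-1)  # (B, k, 1)
        selected = self.basis[topi]                   # (B, k, D)
        return weights * selected                     # (B, k, D)


if __name__ == "__main__":
    synth = PromptSynthesizer(dim=512, n_basis=16, top_k=4)
    video_feats = torch.randn(2, 8, 512)   # 2 clips, 8 frame tokens each
    text_feats = torch.randn(2, 12, 512)   # 2 captions, 12 text tokens each
    video_prompts = synth(video_feats)     # same basis serves both modalities
    text_prompts = synth(text_feats)
    video_in = torch.cat([video_prompts, video_feats], dim=1)  # prepend prompts
    text_in = torch.cat([text_prompts, text_feats], dim=1)
    print(video_in.shape, text_in.shape)
```

Because the same basis is shared across frames and modalities, only the basis prompts (plus any lightweight projection) add learnable parameters, which is what keeps the adaptation budget small.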
Results on Charades-Ego and EGTEA compared to SOTA prompt-tuning methods. Ego-VPA achieves superior performance while using only 0.84% learnable parameters.
Ego-VPA generalizes to the EPIC-Kitchens-100 multi-instance retrieval task, achieving a better parameter-performance trade-off.
Please refer to our paper for more ablations.
This work was partially funded by NSF award IIS-2303153, a gift from Qualcomm, and NVIDIA GPU donations.