Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation



¹Intel Labs            ²UC San Diego

Overview


Video understanding typically requires fine-tuning a large backbone when adapting to new domains. In this paper, we leverage egocentric video foundation models (Ego-VFMs) built on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, named Ego-VPA. It employs a local sparse approximation of each video frame/text feature using a set of basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, Ego-VPA models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels at lightweight adaptation (with only 0.84% learnable parameters), improving substantially over baselines and matching the performance of full fine-tuning.
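To make the notion of parameter-efficient adaptation concrete, the sketch below (a minimal PyTorch example, not the released implementation) freezes a small stand-in backbone and learns only a pool of prompt vectors, then reports the fraction of parameters that remain trainable. The backbone, dimensions, and prompt count are placeholders, not the actual Ego-VFM or Ego-VPA configuration.

```python
# Minimal sketch: freeze the backbone, learn only a small pool of prompt
# vectors. The backbone here is a small stand-in transformer, not an Ego-VFM.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=12,
)
for p in backbone.parameters():                       # backbone weights stay frozen
    p.requires_grad = False

prompts = nn.Parameter(torch.randn(16, 512) * 0.02)   # learnable prompt pool (hypothetical size)

trainable = prompts.numel()
total = sum(p.numel() for p in backbone.parameters()) + trainable
print(f"trainable fraction: {100 * trainable / total:.3f}%")
```

Only the prompt parameters would be passed to the optimizer; the pre-trained backbone is never updated.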


Published in the IEEE Winter Conference on Applications of Computer Vision (WACV), United States, 2025.

Paper · Repository · Bibtex

Models


We adapt SOTA prompt-tuning methods (e.g., TPT, VPT, VoPF+C) to Ego-VFMs as baselines and propose Ego-VPA, which leverages cross-modal prompt synthesis with basis prompts.
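For context, the sketch below shows the generic prompt-tuning recipe these baselines build on: learnable prompt tokens prepended to the input of a frozen encoder. `PromptedEncoder` is a hypothetical module for illustration only; the individual baselines (and Ego-VPA itself) differ in where and how prompts are injected.

```python
# Generic prompt-tuning sketch: prepend learnable prompt tokens to the input
# of a frozen encoder. Illustrative only; not any specific baseline.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, d_model: int = 512, n_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():            # keep the pre-trained encoder frozen
            p.requires_grad = False
        # learnable prompt tokens, shared across all inputs
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) frame or text token embeddings
        batch = tokens.size(0)
        prompt_tokens = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompt_tokens, tokens], dim=1))

# Example with a small stand-in transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2
)
model = PromptedEncoder(encoder)
out = model(torch.randn(4, 32, 512))                   # output shape: (4, 8 + 32, 512)
```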


Cross-modal Prompt Synthesis with Basis Prompts

The basis prompts are shared across frames and modalities, while a different mapping function is used for each modality to synthesize the prompts.
  • We assume the prompt information lies in a lower-dimensional latent space, onto which frame features are mapped by an encoder h.
  • We seek a local sparse approximation of the projected frame features using an orthogonal prompt basis.
  • Given a frame feature, we identify the k basis prompts that best reconstruct the projected feature.
  • The selected basis prompts are then mapped by a latent-space decoder g into frame-specific prompts.
  • Since the synthesized frame-specific prompts encapsulate the context of their frame, cross-attention between these prompts can summarize knowledge across frames without requiring additional heavy modules.
  • While text descriptions and video frames come from different modalities, the underlying semantic context should be shared. As with video prompt synthesis, we synthesize text prompts with the same mechanism, using the shared prompt basis but a different latent-space mapping (a minimal sketch of the full pipeline follows this list).
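The sketch below illustrates this synthesis pipeline in PyTorch. It is illustrative only, not the released implementation: the names h and g, the basis size, latent dimension, and k are placeholder choices; the basis is simply initialized orthogonally rather than constrained during training; and top-k selection uses the magnitude of the projection coefficients, which for an orthogonal basis yields the best k-term reconstruction of the projected feature.

```python
# Illustrative sketch of prompt synthesis with a shared basis (not the
# official code). `h`/`g` stand in for the latent-space encoder/decoder;
# the text branch would reuse `basis` with its own h/g pair.
import torch
import torch.nn as nn

class BasisPromptSynthesizer(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=64, n_basis=32, k=4):
        super().__init__()
        self.k = k
        self.h = nn.Linear(feat_dim, latent_dim)     # latent-space encoder h
        self.g = nn.Linear(latent_dim, feat_dim)     # latent-space decoder g
        basis = torch.empty(n_basis, latent_dim)
        nn.init.orthogonal_(basis)                   # orthogonal prompt basis (init only)
        self.basis = nn.Parameter(basis)             # shared across frames and modalities

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_frames, feat_dim) frame features (or text features)
        z = self.h(feats)                            # project onto the latent space
        coeff = z @ self.basis.t()                   # coefficients w.r.t. each basis prompt
        idx = coeff.abs().topk(self.k, dim=-1).indices   # k best-reconstructing prompts
        selected = self.basis[idx]                   # (batch, n_frames, k, latent_dim)
        return self.g(selected)                      # frame-specific prompts in feature space

synth = BasisPromptSynthesizer()
video_feats = torch.randn(2, 8, 512)                 # 2 clips, 8 frames each
frame_prompts = synth(video_feats)                   # shape (2, 8, 4, 512)
```

In a full model, the synthesized per-frame prompts would then be fused by a light cross-attention over frames, and the text branch would reuse the same basis with its own encoder/decoder pair, which is what enables cross-modal transfer without heavy additional modules.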

Results



Results on Charades-Ego and EGTEA compared to SOTA prompt-tuning methods. Ego-VPA achieves superior performance while using only 0.84% learnable parameters.


Ego-VPA generalizes to the EPIC-KITCHENS-100 multi-instance retrieval task, achieving a better parameter-performance trade-off.

Please refer to our paper for more ablations.

Acknowledgements

This work was partially funded by NSF award IIS-2303153, a gift from Qualcomm, and NVIDIA GPU donations.