Speaking Skills Talk - Victor Akinwande

Time: 2:00pm

Location:
In Person - Newell-Simon 4305

Speaker:
VICTOR AKINWANDE, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://home.victorakinwande.com/


HyperCLIP: Adapting Vision-Language Models with Hypernetworks

Self-supervised vision-language models trained with contrastive objectives perform better as their scale increases. Typically, the image encoder in such models is larger than the text encoder, and while the inference cost of the text encoder can often be amortized by precomputing embeddings for a fixed set of text prompts, no such amortization is possible for the image encoder, which must run on every input image. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments.
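This asymmetry is easy to see in code. The sketch below (PyTorch) shows a standard zero-shot classification setup, assuming hypothetical placeholder encoder and tokenizer objects: the text encoder runs once offline over the label prompts and its outputs are cached, while the image encoder must still run on every query image.

```python
# A minimal sketch, assuming placeholder image_encoder, text_encoder, and
# tokenize objects; this is an illustration of the amortization argument,
# not code from the talk.
import torch

class ZeroShotClassifier(torch.nn.Module):
    def __init__(self, image_encoder, text_encoder, class_prompts, tokenize):
        super().__init__()
        self.image_encoder = image_encoder
        # Text embeddings for the fixed label set are computed ONCE and
        # cached; the text encoder never runs again at inference time.
        with torch.no_grad():
            text_emb = text_encoder(tokenize(class_prompts))
        self.register_buffer(
            "text_emb", torch.nn.functional.normalize(text_emb, dim=-1)
        )

    def forward(self, images):
        # The image encoder, by contrast, must run on every new image,
        # so its size dominates the deployment cost.
        img_emb = torch.nn.functional.normalize(
            self.image_encoder(images), dim=-1
        )
        return img_emb @ self.text_emb.T  # cosine-similarity logits
```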

In this talk, I will present HyperCLIP, a vision-language architecture that dynamically adapts a small image encoder using a hypernetwork. The hypernetwork learns to produce a subset of the image encoder's parameters conditioned on the text embedding, and the entire model (hypernetwork, image encoder, and text encoder) is trained jointly end to end. HyperCLIP increases the zero-shot accuracy of SigLIP models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100, with minimal training throughput overhead.
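To make the architecture concrete, here is a hedged sketch of the core mechanism as described in the abstract, not the actual HyperCLIP implementation: a hypernetwork maps a text embedding to the weights of one image-encoder layer. Which parameter subset is adapted (here, a final linear projection) and all dimensions are illustrative assumptions.

```python
# A sketch of the hypernetwork idea from the abstract. The choice of
# adapted layer, the MLP shape, and all dimensions are assumptions made
# for illustration only.
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    def __init__(self, text_dim, target_in, target_out, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, target_in * target_out),
        )
        self.target_shape = (target_out, target_in)

    def forward(self, text_emb):
        # text_emb: a single text embedding of shape (text_dim,).
        # Predict one image-encoder layer's weights from it; gradients
        # flow through this module, so the hypernetwork and both encoders
        # can be trained jointly end to end.
        return self.net(text_emb).view(*self.target_shape)

class AdaptedImageEncoder(nn.Module):
    def __init__(self, backbone, hypernet):
        super().__init__()
        self.backbone = backbone   # small image trunk (shared parameters)
        self.hypernet = hypernet   # generates the projection weights

    def forward(self, images, text_emb):
        w = self.hypernet(text_emb)      # (out_dim, feat_dim)
        feats = self.backbone(images)    # (batch, feat_dim)
        return feats @ w.T               # text-conditioned projection
```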

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

Event Website:
https://csd.cmu.edu/calendar/speaking-skills-talk-victor-akinwande

