Lilian Weng 6/10/2022

Generalized Visual Language Models


This technical article analyzes approaches for building Generalized Visual Language Models (VLMs). It details four main strategies: jointly training image and text embeddings, feeding image embeddings as a prefix to a frozen language model, integrating visual information via cross-attention, and stitching frozen models together with no additional training. It uses VisualBERT as a case study, explaining its architecture and its training on datasets such as MS COCO for tasks like image captioning.
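The prefix strategy above can be sketched in a few lines: a frozen vision encoder produces image features, a small trained projection maps them into the frozen language model's embedding space, and the projected features are prepended to the token embeddings. The dimensions, the random "encoders", and the name `W_proj` below are illustrative assumptions, not details from the article:

```python
import numpy as np

# Hypothetical dimensions (assumptions for illustration only).
d_img, d_txt, n_patches, n_tokens = 512, 768, 4, 6

rng = np.random.default_rng(0)
# Stand-ins for outputs of a frozen vision encoder and the frozen
# LM's token embedding table.
image_features = rng.normal(size=(n_patches, d_img))
token_embeddings = rng.normal(size=(n_tokens, d_txt))

# The only trained component in this setup: a linear projection from
# image-feature space into the language model's embedding space.
W_proj = rng.normal(size=(d_img, d_txt)) * 0.02

visual_prefix = image_features @ W_proj                    # (n_patches, d_txt)
lm_input = np.concatenate([visual_prefix, token_embeddings], axis=0)

print(lm_input.shape)  # (10, 768) — visual prefix followed by text tokens
```

The frozen LM then attends over this concatenated sequence, so gradients only need to flow into `W_proj`.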


