TIL: Vision-Language Models Read Worse (or Better) Than You Think
The article presents ReadBench, a new benchmark designed to test the often-overlooked ability of Vision-Language Models (VLMs) to read and reason over text within images. It explains that while VLMs excel at visual understanding, their performance degrades significantly when processing long, text-heavy images, which impacts Visual RAG pipelines. The benchmark converts existing text-based QA datasets into image format and is publicly available on HuggingFace, GitHub, and arXiv for community use.
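The core idea, converting text-based QA datasets into images so a VLM must read rather than parse tokens, can be sketched in a few lines. The snippet below is a hedged illustration of how such a conversion might look using Pillow to rasterize a passage; it is not ReadBench's actual rendering pipeline, and names like `render_passage`, `chars_per_line`, and `line_height` are invented for this example.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_passage(passage: str, width: int = 800, margin: int = 20,
                   chars_per_line: int = 90, line_height: int = 18) -> Image.Image:
    """Render a plain-text passage onto a white image, like a screenshot of a page.

    Hypothetical helper: ReadBench's real rendering parameters are not shown here.
    """
    font = ImageFont.load_default()  # swap in a TrueType font for more realistic text
    lines = []
    for paragraph in passage.split("\n"):
        # Character-based wrapping is a rough stand-in for proper text layout.
        lines.extend(textwrap.wrap(paragraph, width=chars_per_line) or [""])
    height = 2 * margin + line_height * max(len(lines), 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return img

# Example: rasterize a QA context, then send (image, question) to a VLM instead of raw text.
context = "The Eiffel Tower was completed in 1889. Q: When was the Eiffel Tower completed?"
render_passage(context).save("context.png")
```

Comparing a model's accuracy on the rendered image against its accuracy on the original text input is what surfaces the reading degradation the article describes.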