Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 is a new multimodal AI model with visual understanding and a self-directed agent swarm for complex, parallel task execution.
December's major AI model releases from companies like Mistral, Amazon, and OpenAI focused on open licensing, long-context efficiency, and multimodal capabilities.
Technical report on Qwen3-VL's video processing capabilities, which achieve near-perfect accuracy in long-context needle-in-a-haystack evaluations.
Best practices and structural patterns for effectively prompting the Gemini 3 AI model, focusing on directness, logic, and clear instruction.
A hands-on review of Google's new Gemini 3 Pro AI model, covering its features, benchmarks, pricing, and testing its multimodal capabilities.
Explores color sensitivity in AI models when reading text from a canvas, noting issues with red text on dark backgrounds.
A 10-step guide for e-commerce teams to generate consistent product images using Google's Gemini 2.5 Flash AI model for text-to-image and editing tasks.
Introduces ReadBench, a benchmark for evaluating how well Vision-Language Models (VLMs) can read and extract information from images of text.
An introduction to real-time AI, exploring the fundamentals of low-latency, fluid conversational applications built with the OpenAI Realtime API.
Explores GPT-4o, OpenAI's new multimodal AI model that processes text, images, and audio, now available in preview on Azure AI.
Explores using multimodal vision AI models like LLaVA for advanced UI/UX test automation, moving beyond traditional methods.
A developer experiments with Llamafile and LLaVA 1.5 to extract structured data from comedy show posters, testing its accuracy and JSON output capabilities.
A technical guide on deploying Hugging Face's IDEFICS visual language models (9B & 80B parameters) to Amazon SageMaker using the LLM DLC.
An in-depth exploration of Large Multimodal Models (LMMs), covering their fundamentals, key architectures like CLIP and Flamingo, and current research directions.
Explores methods for extending pre-trained language models to process visual information, focusing on four approaches for vision-language tasks.