Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 is a new multimodal AI model with visual understanding and a self-directed agent swarm for complex, parallel task execution.
A technical guide on generating transparent PNG stickers using the Gemini API with chromakey green and HSV color detection for clean background removal.
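As a minimal sketch of the HSV-based background removal described there, assuming OpenCV and NumPy; the filenames and the green threshold range are illustrative, not the article's exact values:

```python
import cv2
import numpy as np

# Load the generated sticker image (hypothetical filename).
img = cv2.imread("sticker_green_bg.png")

# Convert to HSV and mask the chromakey-green background.
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
lower_green = np.array([40, 80, 80])    # illustrative lower bound (H, S, V)
upper_green = np.array([80, 255, 255])  # illustrative upper bound
background = cv2.inRange(hsv, lower_green, upper_green)

# Alpha channel: opaque everywhere except the detected green background.
alpha = cv2.bitwise_not(background)

# Attach the alpha channel and write out a transparent PNG.
b, g, r = cv2.split(img)
cv2.imwrite("sticker_transparent.png", cv2.merge([b, g, r, alpha]))
```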
Case study on building an edge AI food monitoring system through AI-led development with the BMAD framework, which achieved rapid delivery with minimal human oversight.
A guide to building an offline, edge AI food monitoring system using a Raspberry Pi, YOLO11, and a local-first architecture for privacy.
A technical comparison of YOLO-based food detection from 2018 to 2026, showing the evolution of deep learning tooling and ease of use.
A developer builds a gesture-controlled flight tracker using MediaPipe, TanStack Start, and OpenSky API for the Advent of AI 2025 challenge.
Technical report on Qwen3-VL's video processing capabilities, including near-perfect accuracy in long-context needle-in-a-haystack evaluations.
Experiment testing if AI vision models improve SVG drawings of a pelican on a bicycle through iterative, agentic feedback loops.
An analysis of AI video generation using a specific, complex prompt to test the capabilities and limitations of models like Sora 2.
A tutorial on using Python, Ultralytics YOLO, and Supervision for computer vision tasks like object detection and image annotation.
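A rough sketch of that detection-and-annotation workflow, assuming the ultralytics and supervision packages; the model checkpoint and image filenames are illustrative:

```python
import cv2
import supervision as sv
from ultralytics import YOLO

# Load a pretrained detection model (illustrative checkpoint).
model = YOLO("yolo11n.pt")

# Run inference on a hypothetical input image.
image = cv2.imread("street_scene.jpg")
result = model(image)[0]

# Convert results to Supervision detections and draw boxes plus labels.
detections = sv.Detections.from_ultralytics(result)
annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(scene=annotated, detections=detections)

cv2.imwrite("annotated.jpg", annotated)
```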
Introduces Peekaboo MCP, a macOS tool that enables AI agents to capture screenshots and perform visual question answering using local or cloud vision models.
A review of Building Regulariser, a Python package that improves AI-generated building footprints from satellite imagery by making outlines more regular and plausible.
A technical guide applying the Depth Anything V2 AI model to analyze high-resolution Maxar satellite imagery of Bangkok for depth estimation.
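A minimal sketch of monocular depth estimation with Depth Anything V2 via the Hugging Face transformers pipeline; the checkpoint size and tile filename are illustrative, and the article's tiling of full Maxar scenes is more involved:

```python
from PIL import Image
from transformers import pipeline

# Depth-estimation pipeline with a Depth Anything V2 checkpoint (illustrative size).
depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

# Run on a single tile (hypothetical filename) cut from the satellite scene.
tile = Image.open("bangkok_tile.png").convert("RGB")
prediction = depth(tile)

# The pipeline returns a PIL depth map alongside the raw tensor.
prediction["depth"].save("bangkok_tile_depth.png")
```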
Explores 3D Gaussian Splatting, a technique for creating real-time 3D worlds from videos, comparing different generation methods and web-based tools.
A technical guide exploring IBM's Granite 3.1 AI models, covering their reasoning and vision capabilities with a demo and local setup instructions.
Explains how multimodal LLMs work, compares recent models like Llama 3.2, and outlines two main architectural approaches for building them.
Analyzing the Global Streetscapes dataset, a massive collection of AI-labeled street view imagery, using Python, DuckDB, and a high-performance workstation.
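A hedged sketch of that style of analysis, assuming the dataset's label tables are available locally as Parquet; the file path and column names below are hypothetical placeholders:

```python
import duckdb

# Aggregate AI-assigned labels across the street-view images directly from Parquet.
counts = duckdb.sql("""
    SELECT city, count(*) AS images
    FROM read_parquet('streetscapes/*.parquet')
    GROUP BY city
    ORDER BY images DESC
    LIMIT 10
""").df()

print(counts)
```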
Analysis of a research paper detailing an AI model that extracted 281 million building footprints from satellite imagery across East Asia.
A technical guide on training a ship detection model using YOLOv5 on Umbra's high-resolution Synthetic Aperture Radar (SAR) satellite imagery.