Simon Willison • 4/23/2026

Extract PDF text in your browser with LiteParse for the web

This article describes how the author built a browser-based version of LiteParse, an open-source PDF text extraction tool that LlamaIndex originally released as a Node.js CLI. The browser version runs entirely client-side on PDF.js and Tesseract.js. It avoids AI models, relying instead on traditional PDF parsing, with optional OCR for text that exists only as images. Spatial text parsing lets it handle complex layouts such as multi-column text, and it supports visual citations with bounding boxes for RAG-style Q&A. The author built the tool with Claude Code and Opus 4.7, starting from an experiment on a mobile phone. The project is hosted on GitHub and is available for anyone to try online.
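To make the idea of spatial text parsing concrete, here is a minimal sketch of one common step: grouping positioned text fragments into lines by their vertical position, then ordering each line left to right. It is not LiteParse's actual algorithm. The `TextItem` shape, the `groupIntoLines` function, and the `tolerance` parameter are hypothetical names introduced for illustration, and multi-column detection is deliberately left out.

```typescript
// Hypothetical shape for a positioned text fragment (an assumption,
// not a LiteParse or PDF.js type). Coordinates use a top-left origin.
interface TextItem {
  text: string;
  x: number;      // left edge of the bounding box
  y: number;      // top edge of the bounding box, increasing downward
  width: number;
  height: number;
}

// Group fragments into lines. A fragment joins the current line when its
// vertical center is within `tolerance` units of the center of the first
// fragment placed on that line; otherwise it starts a new line.
function groupIntoLines(items: TextItem[], tolerance = 3): string[] {
  const centerOf = (item: TextItem): number => item.y + item.height / 2;

  // Visit fragments from top to bottom.
  const sorted = [...items].sort((a, b) => centerOf(a) - centerOf(b));

  const lines: { centerY: number; items: TextItem[] }[] = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(centerOf(item) - last.centerY) <= tolerance) {
      last.items.push(item);
    } else {
      lines.push({ centerY: centerOf(item), items: [item] });
    }
  }

  // Within each line, order fragments left to right and join with spaces.
  return lines.map((line) =>
    [...line.items]
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(" ")
  );
}

// Fragments arrive out of order and with slightly uneven baselines.
const sample: TextItem[] = [
  { text: "world", x: 60, y: 10, width: 40, height: 12 },
  { text: "Hello", x: 10, y: 11, width: 40, height: 12 },
  { text: "Second", x: 10, y: 30, width: 50, height: 12 },
  { text: "line", x: 70, y: 29, width: 30, height: 12 },
];

console.log(groupIntoLines(sample)); // [ 'Hello world', 'Second line' ]
```

A real implementation would also need to detect column gaps before joining fragments, so that text from adjacent columns is not merged into a single line. Keeping each fragment's bounding box is also what makes visual citations possible later.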

#JavaScript #ocr #Browser