Shreya Shankar • 4/8/2024

Comparing LLMs on "Real-World" Retrieval

The article describes a personal evaluation of eight instruction-tuned LLMs (including GPT-4, Claude, Gemini, and open-source models) on a custom "real-world" retrieval task. The author tests the models on three questions of varying difficulty over ~85 doctor-patient transcripts, moving beyond standard benchmarks to assess reasoning over unstructured data that is likely absent from training sets.

#LLM Evaluation #Model Comparison #Instruction Tuning