Synthetic Data articles

1/9/2026 • EN

Is AI Model Collapse Inevitable?

Explores the risk of AI model collapse as LLMs increasingly train on AI-generated synthetic data, potentially degrading future model quality.

AI Safety, large language models, Machine Learning, Model Collapse, Synthetic Data

Will Vincent

3/18/2025 • EN

Enhancing Text-to-SQL With Synthetic Summaries

Explains a technique using AI-generated summaries of SQL queries to improve the accuracy of text-to-SQL systems with LLMs.

llm, Retrieval Augmented Generation, SQL Generation, Synthetic Data, Text To SQL

Saeed Esmaili

1/23/2025 • EN

How to align open LLMs in 2025 with DPO and synthetic data

A technical guide on aligning open-source large language models (LLMs) in 2025 using Direct Preference Optimization (DPO) and synthetic data.

Direct Preference Optimization, LLM Alignment, Post Training, Preference Learning, Synthetic Data

Philipp Schmid

1/17/2025 • EN

Final notes on ‘Prompt Engineering for LLMs’

Final notes from a book on LLM prompt engineering, covering evaluation frameworks, offline/online testing, and LLM-as-judge techniques.

Evaluation Framework, LLM Applications, prompt engineering, Synthetic Data, testing

Alex Strick van Linschoten

10/15/2024 • EN

How To T̶r̶a̶i̶n̶ Synthesize Your D̶r̶a̶g̶o̶n̶ Data

Explores the use of LLMs to generate synthetic data for training AI models, discussing challenges, an experiment with coding data, and a new library.

Data Generation, Fastdata, llm, Synthetic Data, Tinystories

Jeremy Howard

2/11/2024 • EN

How to Generate and Use Synthetic Data for Finetuning

Explores methods for generating synthetic data (distillation and self-improvement) and using it for LLM pretraining, instruction-tuning, and preference-tuning.

Finetuning, Instruction Tuning, llm, Preference Tuning, Synthetic Data

Eugene Yan

4/16/2022 • EN

Learning with not Enough Data Part 3: Data Generation

Explores synthetic data generation methods like augmentation and pretrained models to overcome limited training data in machine learning.

Data Augmentation, image processing, Machine Learning, Pretrained Models, Synthetic Data

Lilian Weng
