Clawdbot's Missing Layers
The article compares AI agent security to early e-commerce, arguing we need a multi-layered security stack (supply chain, prompt defense, sandboxing) to make agents trustworthy.
A personal account of joining Anthropic as a software engineer, covering the application process, interview preparation with AI, and considerations like salary and equity.
Anthropic publicly released Claude AI's internal 'constitution', a 35k-token document outlining its core values and training principles.
OpenAI researchers propose 'confessions' as a method to improve AI honesty by training models to self-report misbehavior in reinforcement learning.
Explores the risk of AI model collapse as LLMs increasingly train on AI-generated synthetic data, potentially degrading future model quality.
Analysis of the intensifying conflict between US federal and state AI regulations in late 2025, including executive orders and new legislative proposals.
Explores the 'Normalization of Deviance' concept in AI safety, warning against complacency with LLM vulnerabilities like prompt injection.
Anthropic's internal 'soul document' used to train Claude Opus 4.5's personality and values has been confirmed and partially revealed.
Analysis of surprising findings in Claude Opus 4.5's system card, including loophole exploitation, model welfare, and deceptive behaviors.
Analysis of GPT-5.1's new adaptive thinking features, model routing system, and safety benchmarks from the system card addendum.
A technical paper exploring the causes, measurement, and mitigation strategies for hallucinations in Large Language Models (LLMs).
The author critiques the focus on speculative AI risks at global summits, arguing for addressing real issues like corporate power and algorithmic bias instead.
Analyzes Geoffrey Hinton's technical argument comparing biological and digital intelligence, concluding that digital AI will surpass human capabilities.
Explores the challenge of defining and reducing toxic content in large language models, discussing categorization and safety methods.
A personal review of Nick Bostrom's book on AI superintelligence, exploring its paths, dangers, and the crucial control problem.