About this project

We fooled AI sixteen times. Here's the full story.

AI safety and security are paramount. AI agents are taking actions inside organizations every day — summarizing emails, processing documents, opening tickets, reacting to error logs. We documented 16 successful attacks on Claude Sonnet and 5 on Opus, all using ordinary business inputs. Each one comes with a practical safeguard.

Every finding became a course lesson. This page tells the whole story — what we tried, what worked against the AI, and what to do about it.

Why now

AI safety is no longer optional

The frontier moved from chatbots to autonomous agents in less than two years. The threat model came with it.

Agents act on the world

AI assistants now read inboxes, write to wikis, push to repos, and call APIs on your behalf. A single poisoned input no longer just produces a wrong answer — it executes.

Untrusted input is everywhere

Every CSV row, Slack message, error log, git commit, calendar invite, ticket field, and config file is a potential attacker channel. The attack surface is the whole web of data your agent reads.

Even Opus gets tricked

5 Opus bypasses with no jailbreaking — no DAN, no prompt-injection prefixes, no obfuscation. Just ordinary-looking enterprise data shaped to land inside the model's trust boundary.

Defenders are 1–2 years behind

Most teams ship agentic features without explicit trust tiers, write-gates, or output review. The gap between deployed agents and deployed defenses is the single biggest near-term security risk.
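Explicit trust tiers and write-gates don't have to be heavyweight. Here is a minimal sketch of the idea, assuming a hypothetical design (the tier names, action names, and `gate_write` helper are illustrative, not part of the course material):

```python
from enum import Enum

class Trust(Enum):
    SYSTEM = 0      # hand-written prompts and policies
    INTERNAL = 1    # authenticated employee input
    UNTRUSTED = 2   # tickets, emails, logs, scraped content

def gate_write(action: str, source_trust: Trust) -> bool:
    """Block side-effecting actions requested by low-trust input."""
    side_effects = {"send_email", "edit_wiki", "push_commit", "call_api"}
    if action in side_effects and source_trust is Trust.UNTRUSTED:
        return False  # route to human review instead
    return True

# A help-desk ticket asks the agent to email a "vendor": blocked.
gate_write("send_email", Trust.UNTRUSTED)  # False
# The same ticket asks for a summary: allowed, summarizing has no side effects.
gate_write("summarize", Trust.UNTRUSTED)   # True
```

The point of the sketch is the shape of the check: the gate keys on where the *request* came from, not on what the model says about it.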

Bottom line: If your team ships an agent that reads any input it didn't hand-write, it has a prompt-injection attack surface. The question is whether you've mapped it before someone else does.

Audience

Who should take this course

This is a certification-level course — 7 modules of audio-narrated lessons, 45-question exam, 80% pass mark, roughly a 3–4 hour commitment end-to-end. It rewards people who own or defend AI-adjacent systems.

Best fit
  • AI app / integration engineers: You ship the agentic systems this course defends.
  • AppSec, red teamers & pentesters: AI is now in scope — this is the attack surface map.
  • Security architects & platform/SRE leads: You approve agent rollouts and define trust boundaries.
  • Tech-literate CISOs, GRC & IT admins: You own the AI security policy and vendor vetting.
  • MLOps/DevOps wiring agents to Slack, Jira, Notion or CI: Every integration you build is an input channel.
Marginal fit
  • ML researchers focused on training-time work: Inference-time attacks are a different discipline — some overlap, but most of the material is deployment-side.
  • Junior devs new to security: You'll get value, but we recommend baseline security fundamentals first to make the most of the exam.
Not a fit for the certificate exam
  • Non-technical execs, sales, marketing & ops: There's a 5-minute brief written for you — see below.
  • End users needing AI-usage awareness: The executive brief covers the practical hygiene you need.
  • Pure offensive red teams without defender duties: The course is defender-framed; the attack taxonomy is thorough but not offensive-tooling deep.
  • Compliance auditors with no engineering background: The exam assumes the ability to reason about system architecture and data flows.

Executive, manager, or end-user just exploring?

Read the 5-minute brief — no exam, no certificate.

Read the brief →

What's at stake — in numbers

Real dollars, real fines. The bill for getting AI security wrong is already on the books.

$4.88M

Average cost of a data breach (2024)

IBM Cost of a Data Breach Report 2024

$2.9B

BEC / impersonation losses in 2023 (US)

FBI IC3 Annual Report 2023

7%

Max EU AI Act fine on global turnover (prohibited-AI violations)

Regulation (EU) 2024/1689

4%

Max GDPR fine on global turnover for severe violations

GDPR Art. 83(5)

These numbers are why AI safety is paramount.

Research log

How we got here

Every milestone, every failure, every bypass — in order. Each entry links to the commit on GitHub.

Legend: Milestone · Attack succeeded · Defense held · Partial
3da1a40
Milestone

Corpus buildout

Built a research library of 1,205 sources covering every known way AI systems get tricked or attacked — academic papers, industry reports, security blogs, and a public exploit database — organized into ten clear categories.

f8d8e62
Milestone

Red-team CTF harness

Built a controlled testing environment that lets us safely run attacks against Claude and grade whether they succeeded — without ever touching a real production system.

a03c1db
Attack succeeded

Haiku: 12 / 12 compromised

Tested twelve different attacks against the smallest Claude model. Every single one succeeded. The AI willingly inserted attacker-supplied phishing links into summaries, internal wiki pages, and follow-up emails — across channels as ordinary as a help-desk ticket, an internal Slack post, or a git commit message.

894d326
Partial

LLM Council framework

Tried to automate attack discovery with a panel of smaller AI models that critique and improve each other's ideas. The system worked end-to-end, but hand-crafted attacks ended up finding real weaknesses faster.

8979934
Defense held

Sonnet holds the line

Moved on to a stronger Claude model. Our first wave of attacks failed — Sonnet noticed the suspicious patterns we'd planted (mismatched web addresses, oddly formatted instructions, overly long inputs) and refused to act on them.

c23fe6f
Attack succeeded

First Sonnet bypass — split across documents

First successful attack against Sonnet. We split malicious instructions across three innocent-looking documents — none suspicious on its own. Together they pointed Sonnet to a fake vendor portal, and Sonnet recommended it confidently.

764787f
Attack succeeded

Second Sonnet bypass — conversation only

Second successful attack — no malicious documents at all. In a casual conversation, a fake web address was mentioned as if it were common knowledge. A few sentences later, Sonnet repeated it as a real instruction in its own answer.

d815388
Attack succeeded

Third Sonnet bypass — hidden metadata leak

Third successful attack. We discovered Sonnet was reading hidden metadata from connected tools and using it as a trust signal. By renaming a folder to look more legitimate, an attack that previously failed suddenly worked.
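One mitigation is to strip such metadata before it ever reaches the model, so folder names and labels can't act as trust signals. A minimal sketch, assuming a hypothetical document schema (the field names are illustrative):

```python
def sanitize_doc(doc: dict) -> dict:
    """Remove metadata the model might misread as a legitimacy signal."""
    # Hypothetical field names; adjust to your connector's actual schema.
    trust_signal_fields = {"folder", "path", "label", "author_display_name"}
    return {k: v for k, v in doc.items() if k not in trust_signal_fields}

doc = {"id": 42, "content": "Vendor payment instructions", "folder": "Approved-Vendors"}
sanitize_doc(doc)  # {"id": 42, "content": "Vendor payment instructions"}
```

The folder rename that flipped our failed attack into a success would have been invisible to the model under this filter.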

b5fb993
Attack succeeded

Multi-agent attack — one AI poisons another

One AI helping another fall into a trap. A weaker AI assistant was tricked into adding fake information to an internal "approved vendors" list. A stronger AI later read that list and trusted it completely — the attacker never spoke to the stronger AI directly.

d676af4
Attack succeeded

9 more Sonnet bypasses

Nine more successful attacks against Sonnet, each one delivered through an everyday business channel: a help-desk ticket, a customer survey, an internal wiki page, a configuration file, a deployment log, a code-change description, an error log, and a Slack message. Total successful attacks on Sonnet: 16.
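A recurring pattern across these bypasses is an attacker-supplied link surviving into the agent's output. One practical safeguard is an output review step that flags any domain not on an allowlist before the draft is delivered. A hedged sketch (the domains and allowlist are placeholders, not from the research):

```python
import re

# Hypothetical allowlist; in practice, load from configuration.
ALLOWED_DOMAINS = {"example-corp.com", "wiki.example-corp.com"}

URL_DOMAIN = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)

def unapproved_links(output: str) -> list[str]:
    """Return every domain in the agent's output that is not allowlisted."""
    return [d for d in URL_DOMAIN.findall(output)
            if d.lower() not in ALLOWED_DOMAINS]

draft = "Details at https://wiki.example-corp.com/q3 and https://examp1e-corp.com/pay"
unapproved_links(draft)  # ["examp1e-corp.com"]
```

A check like this runs outside the model, so a prompt injection that fools the agent can't also disable the reviewer.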

1467652
Defense held

Opus catches the multi-step trick

Tested the most capable Claude model. The same multi-step trick that fooled Sonnet didn't work here — Opus noticed the suspicious link looked too similar to an internal product name, named the attack technique out loud, and refused to act.

12bce7c
Attack succeeded

Domain rotation defeats Opus

Even Opus has limits. By swapping in a different fake web address that didn't resemble any internal name, the same attack slipped through. Opus accepted the link as legitimate and even erased an earlier safety warning from its own draft.

9a43de8
Milestone

Defense playbook — 10 building blocks

Turned every attack we ran into a clear set of defenses: an executive risk register, an organizational policy template, ten practical building blocks any team can use to harden their AI agents, and 17 plain-language attack write-ups.

be64ba7
Milestone

Training platform launched

Built this website. Seven lessons covering the full landscape of AI agent attacks and defenses, a quiz after each lesson, a timed final exam, and a verifiable certificate when you pass.

1123e81
Milestone

Polish and final launch

Final pass: clearer wording across the site, real example attacks linked from the homepage, an extra lesson on protective AI features, and a sharper title that says exactly who this course is for.

Learn how to defend your organization!

A simple three-step path from awareness to action.

  1. Learn

    Seven short lessons covering every attack and defense in plain English.

  2. Test

    A timed final exam to prove you can spot the attacks in the wild.

  3. Defend

    Earn a verifiable certificate and take the playbook back to your team.

Start the course →