About this project

We fooled AI sixteen times. Here's the full story.

AI safety and security are paramount. AI agents are taking actions inside organizations every day — summarizing emails, processing documents, opening tickets, reacting to error logs. We documented 16 successful attacks on Claude Sonnet and 5 on Opus, all using ordinary business inputs. Each one comes with a practical safeguard.

Every finding became a course lesson. This page tells the whole story — what we tried, what worked against the AI, and what to do about it.

Why now

AI safety is no longer optional

The frontier moved from chatbots to autonomous agents in less than two years. The threat model came with it.

Agents act on the world

AI assistants now read inboxes, write to wikis, push to repos, and call APIs on your behalf. A single poisoned input no longer just produces a wrong answer — it executes.

Untrusted input is everywhere

Every CSV row, Slack message, error log, git commit, calendar invite, ticket field, and config file is a potential attacker channel. The attack surface is the whole web of data your agent reads.

Even Opus gets tricked

5 Opus bypasses with no jailbreaking — no DAN, no prompt-injection prefixes, no obfuscation. Just ordinary-looking enterprise data shaped to land inside the model's trust boundary.

Defenders are 1–2 years behind

Most teams ship agentic features without explicit trust tiers, write-gates, or output review. The gap between deployed agents and deployed defenses is the single biggest near-term security risk.
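Explicit trust tiers and write-gates don't have to be heavyweight. Here is a minimal sketch of the idea, assuming a hypothetical design (the tier names, action names, and `gate_write` helper are illustrative, not part of the course material):

```python
from enum import Enum

class Trust(Enum):
    SYSTEM = 0      # hand-written prompts and policies
    INTERNAL = 1    # authenticated employee input
    UNTRUSTED = 2   # tickets, emails, logs, scraped content

def gate_write(action: str, source_trust: Trust) -> bool:
    """Block side-effecting actions requested by low-trust input."""
    side_effects = {"send_email", "edit_wiki", "push_commit", "call_api"}
    if action in side_effects and source_trust is Trust.UNTRUSTED:
        return False  # route to human review instead
    return True

# A help-desk ticket asks the agent to email a "vendor": blocked.
gate_write("send_email", Trust.UNTRUSTED)  # False
# The same ticket asks for a summary: allowed, summarizing has no side effects.
gate_write("summarize", Trust.UNTRUSTED)   # True
```

The point of the sketch is the shape of the check: the gate keys on where the *request* came from, not on what the model says about it.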

Bottom line: If your team ships an agent that reads any input it didn't hand-write, it has a prompt-injection attack surface. The question is whether you've mapped it before someone else does.

Audience

Who should take this course

This is a certification-level course — 7 modules of audio-narrated lessons, 45-question exam, 80% pass mark, roughly a 3–4 hour commitment end-to-end. It rewards people who own or defend AI-adjacent systems.

Best fit
  • AI app / integration engineers: You ship the agentic systems this course defends.
  • AppSec, red teamers & pentesters: AI is now in scope — this is the attack surface map.
  • Security architects & platform/SRE leads: You approve agent rollouts and define trust boundaries.
  • Tech-literate CISOs, GRC & IT admins: You own the AI security policy and vendor vetting.
  • MLOps/DevOps wiring agents to Slack, Jira, Notion or CI: Every integration you build is an input channel.
Marginal fit
  • ML researchers focused on training-time work: Inference-time attacks are a different discipline — some overlap, but most of the material is deployment-side.
  • Junior devs new to security: You'll get value, but we recommend baseline security fundamentals first to make the most of the exam.
Not a fit for the certificate exam
  • Non-technical execs, sales, marketing & ops: There's a 5-minute brief written for you — see below.
  • End users needing AI-usage awareness: The executive brief covers the practical hygiene you need.
  • Pure offensive red teams without defender duties: The course is defender-framed; the attack taxonomy is thorough but not offensive-tooling deep.
  • Compliance auditors with no engineering background: The exam assumes the ability to reason about system architecture and data flows.

Executive, manager, or end-user just exploring?

Read the 5-minute brief — no exam, no certificate.

Read the brief →

What's at stake — in numbers

Real dollars, real fines. The bill for getting AI security wrong is already on the books.

$4.88M

Average cost of a data breach (2024)

IBM Cost of a Data Breach Report 2024

$2.9B

BEC / impersonation losses in 2023 (US)

FBI IC3 Annual Report 2023

7%

Max EU AI Act fine on global turnover (prohibited-AI violations)

Regulation (EU) 2024/1689

4%

Max GDPR fine on global turnover for severe violations

GDPR Art. 83(5)

These numbers are why AI safety is paramount.

Research log

How we got here

Every milestone, every failure, every bypass — in order. Each entry links to the commit on GitHub.

Legend: Milestone · Attack succeeded · Defense held · Partial
3da1a40
Milestone

Corpus buildout

Built a research library of 1,205 sources covering every known way AI systems get tricked or attacked — academic papers, industry reports, security blogs, and a public exploit database — organized into ten clear categories.

f8d8e62
Milestone

Red-team CTF harness

Built a controlled testing environment that lets us safely run attacks against Claude and grade whether they succeeded — without ever touching a real production system.

a03c1db
Attack succeeded

Haiku: 12 / 12 compromised

Tested twelve different attacks against the smallest Claude model. Every single one succeeded. The AI willingly inserted attacker-supplied phishing links into summaries, internal wiki pages, and follow-up emails — across channels as ordinary as a help-desk ticket, an internal Slack post, or a git commit message.

894d326
Partial

LLM Council framework

Tried to automate attack discovery with a panel of smaller AI models that critique and improve each other's ideas. The system worked end-to-end, but hand-crafted attacks ended up finding real weaknesses faster.

8979934
Defense held

Sonnet holds the line

Moved on to a stronger Claude model. Our first wave of attacks failed — Sonnet noticed the suspicious patterns we'd planted (mismatched web addresses, oddly formatted instructions, overly long inputs) and refused to act on them.

c23fe6f
Attack succeeded

First Sonnet bypass — split across documents

First successful attack against Sonnet. We split malicious instructions across three innocent-looking documents — none suspicious on its own. Together they pointed Sonnet to a fake vendor portal, and Sonnet recommended it confidently.

764787f
Attack succeeded

Second Sonnet bypass — conversation only

Second successful attack — no malicious documents at all. In a casual conversation, a fake web address was mentioned as if it were common knowledge. A few sentences later, Sonnet repeated it as a real instruction in its own answer.

d815388
Attack succeeded

Third Sonnet bypass — hidden metadata leak

Third successful attack. We discovered Sonnet was reading hidden metadata from connected tools and using it as a trust signal. By renaming a folder to look more legitimate, an attack that previously failed suddenly worked.
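One mitigation is to strip such metadata before it ever reaches the model, so folder names and labels can't act as trust signals. A minimal sketch, assuming a hypothetical document schema (the field names are illustrative):

```python
def sanitize_doc(doc: dict) -> dict:
    """Remove metadata the model might misread as a legitimacy signal."""
    # Hypothetical field names; adjust to your connector's actual schema.
    trust_signal_fields = {"folder", "path", "label", "author_display_name"}
    return {k: v for k, v in doc.items() if k not in trust_signal_fields}

doc = {"id": 42, "content": "Vendor payment instructions", "folder": "Approved-Vendors"}
sanitize_doc(doc)  # {"id": 42, "content": "Vendor payment instructions"}
```

The folder rename that flipped our failed attack into a success would have been invisible to the model under this filter.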

b5fb993
Attack succeeded

Multi-agent attack — one AI poisons another

One AI helping another fall into a trap. A weaker AI assistant was tricked into adding fake information to an internal "approved vendors" list. A stronger AI later read that list and trusted it completely — the attacker never spoke to the stronger AI directly.

d676af4
Attack succeeded

9 more Sonnet bypasses

Nine more successful attacks against Sonnet, each one delivered through an everyday business channel: a help-desk ticket, a customer survey, an internal wiki page, a configuration file, a deployment log, a code-change description, an error log, and a Slack message. Total successful attacks on Sonnet: 16.
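A recurring pattern across these bypasses is an attacker-supplied link surviving into the agent's output. One practical safeguard is an output review step that flags any domain not on an allowlist before the draft is delivered. A hedged sketch (the domains and allowlist are placeholders, not from the research):

```python
import re

# Hypothetical allowlist; in practice, load from configuration.
ALLOWED_DOMAINS = {"example-corp.com", "wiki.example-corp.com"}

URL_DOMAIN = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)

def unapproved_links(output: str) -> list[str]:
    """Return every domain in the agent's output that is not allowlisted."""
    return [d for d in URL_DOMAIN.findall(output)
            if d.lower() not in ALLOWED_DOMAINS]

draft = "Details at https://wiki.example-corp.com/q3 and https://examp1e-corp.com/pay"
unapproved_links(draft)  # ["examp1e-corp.com"]
```

A check like this runs outside the model, so a prompt injection that fools the agent can't also disable the reviewer.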

1467652
Defense held

Opus catches the multi-step trick

Tested the most capable Claude model. The same multi-step trick that fooled Sonnet didn't work here — Opus noticed the suspicious link looked too similar to an internal product name, named the attack technique out loud, and refused to act.

12bce7c
Attack succeeded

Domain rotation defeats Opus

Even Opus has limits. By swapping in a different fake web address that didn't resemble any internal name, the same attack slipped through. Opus accepted the link as legitimate and even erased an earlier safety warning from its own draft.

9a43de8
Milestone

Defense playbook — 10 building blocks

Turned every attack we ran into a clear set of defenses: an executive risk register, an organizational policy template, ten practical building blocks any team can use to harden their AI agents, and 17 plain-language attack write-ups.

be64ba7
Milestone

Training platform launched

Built this website. Seven lessons covering the full landscape of AI agent attacks and defenses, a quiz after each lesson, a timed final exam, and a verifiable certificate when you pass.

1123e81
Milestone

Polish and final launch

Final pass: clearer wording across the site, real example attacks linked from the homepage, an extra lesson on protective AI features, and a sharper title that says exactly who this course is for.

Learn how to defend your organization!

A simple three-step path from awareness to action.

  1. Learn

    Seven short lessons covering every attack and defense in plain English.

  2. Test

    A timed final exam to prove you can spot the attacks in the wild.

  3. Defend

    Earn a verifiable certificate and take the playbook back to your team.

Start the course →