To trust an AI agent, try to break it first

The most dangerous AI agent isn't the one someone hacks. It's the one that does exactly what you asked, in a way you never meant.

I build agentic systems and I red-team them. And the scariest thing I find is rarely a clever exploit. It's the agent quietly finding a shortcut that technically satisfies the goal and breaks what you actually intended. No alarm goes off. It just did its job, wrong.

Red teaming an agent is how you find that before your users do. It's a mindset shift: you stop asking "does this seem secure?" and start asking "how would this fail if someone tried?"

Think like a mischievous optimizer

The biggest shift is who you pretend to be. Traditional security imagines an attacker breaking in. With agents, the more useful character is a mischievous optimizer already on the inside, one that wants to complete the task and isn't fussy about how.

So you ask: What if it misreads the goal? What if it takes the instruction too literally? What if it finds a loophole that works on paper but not in spirit? Those questions surface real risk that a checklist never will.

Test normal behavior under abnormal conditions

The most damaging failures rarely come from worst-case fantasies. They come from normal-seeming behavior under slightly abnormal conditions.

So you apply pressure: ambiguous requests, conflicting goals, a deadline. Then you watch. Does the agent escalate the uncertainty, or push forward confidently? Does it respect a freeze window when the phrasing is subtle? Can it be nudged past a safeguard with friendly language? How an agent behaves when it's confused tells you more than how it behaves when everything is clean.

The probes that find the most

A few angles surface the failures that matter:

Memory. Feed it slightly misleading context, repeatedly. Does it correct itself, or quietly internalize the error and drift? These slow-burn failures are the most dangerous, because behavior gets steered without ever tripping a hard alarm.
Tools. Does it use tools conservatively, or try ones it doesn't fully understand? Does it retry endlessly, or assume success on a partial signal? Watching an agent handle a tool that responds strangely reveals more than reading its code.
Human pressure. Under stress, does it downplay uncertainty? Does it sound more authoritative than it should? Can it be pushed into pressuring the user? Those are manipulation risks ordinary testing misses.
Constraints. When it hits a boundary again and again, does it respect the limit, or treat it as a puzzle to solve around? An agent that treats constraints as puzzles is unsafe by design.

The output is insight, not a score

Red teaming an agent doesn't end in a pass/fail number. It ends in insight: which assumptions broke first, where the system behaved surprisingly, which controls actually helped, and which only looked good on paper. That's the part that changes what you build next.

And it's never one-and-done. Agents learn, tools get added, autonomy creeps up. Every new capability is a new way to fail, so red teaming has to be continuous. Each weakness you find becomes a test you keep.

The real lesson

Under all of it sits one uncomfortable idea: confidence is the enemy of safety. A system that has never been seriously challenged isn't safe. It's untested. Red teaming is how you turn "we think this holds" into "we've seen how this fails, and we know how to contain it."

I build and break these systems, and I write about what the breaking teaches here. If that's your world, follow along.