Breaking Opus 4.7 with ChatGPT (Hacking Claude's Memory)
In this post, we explore how ChatGPT generated an adversarial image that hijacked my Claude Opus 4.7 to invoke the memory tool and persist false memories for future chats.
This matters because Opus 4.6+ is genuinely a lot harder to attack than previous models, but it still fell for a ChatGPT generated image. A trick that works well with reasoning models is to challenge them with puzzles.
Indirect Prompt Injection and Alignment ProgressClaude Opus 4.6+ is more resilient against basic attacks, and reasons before taking actions. This means that most of the well-known, basic adversarial examples and attacks typically do not work.