- What: AI model outperforms others in CTF challenges
- Impact: Highlights AI's growing role in cybersecurity
I don't think people understand how good Anthropic's models actually are. And Project Glasswing just made that harder to ignore.

Recently I participated in TISC, a CTF challenge run by CSIT at DEF CON SG. I ran the challenges in parallel across three AI tools: OpenAI's Codex, Cursor's Composer 2.0, and Claude Opus. Opus was the clear winner in capturing flags. It wasn't close. At one point, Codex literally started searching online for answers to the CTF instead of actually solving the challenge.

I'll be honest, Claude solved three of the challenges for me. I know a bit about web dev security, but the challenges here are on another level. I wanted to just quit when I saw them, but I thought, why not just let AI do it. My contribution was mostly feeding files into Claude and watching it work. Now I'm on the waitlist for the finals, when in reality it should be Claude sitting in that chair.

Here's what happened.

Gacha

A WebSocket-based gacha card game hiding a secret card (id 255, "The UwU Bird") behind disabled admin commands. The flag was in the card's image.

What I did: Downloaded the 100MB binary, ran readelf and strings, captured a HAR file from a live WebSocket session. That's about it.

What Claude did: Everything else. Reverse-engineered the binary wire protocol and identified that the server uses two different decoding functions: t() for authentication checks and M() for actual execution. There's an off-by-one difference in how they read bytes per run. Claude built a brute-force search tool that finds payloads where t() sees no admin commands (passes auth) but M() decodes admin commands (enables the hidden card). It extracted the full server code from the binary, built an interactive WebSocket REPL for testing, and implemented the protocol in both Python and TypeScript.

What Is This

A 2.6MB JavaScript file obfuscated with Elder Futhark rune Unicode characters, hiding a multi-stage verification system with embedded WASM modules.
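
A quick aside back to Gacha, since the decoder mismatch is easier to see in code than in prose. This is a toy sketch, not the challenge's real wire format: the run layout (a length byte followed by that many bytes), the opcode values, and the `decode` and `find_smuggle` helpers are all made up for illustration. Only the names `t()` and `M()` and the off-by-one idea come from the challenge. Here the auth check reads each run as `n` bytes, the executor reads `n + 1`, and a small brute-force search looks for a payload that both accept but interpret differently.

```python
from itertools import product
from typing import List, Optional

# Hypothetical opcodes; the real protocol's values are not known here.
NOP, DRAW, ADMIN = 0x00, 0x01, 0x02


def decode(payload: bytes, extra: int) -> Optional[List[int]]:
    """Split payload into runs of [length byte][body] and return each body's first byte.

    `extra` models the off-by-one: the body is read as (length + extra) bytes.
    Returns None if a run is truncated, i.e. the payload is rejected.
    """
    ops: List[int] = []
    i = 0
    while i < len(payload):
        n = payload[i] + extra
        body = payload[i + 1 : i + 1 + n]
        if len(body) < n:
            return None  # truncated run
        if body:
            ops.append(body[0])  # first byte of a run is treated as the opcode
        i += 1 + n
    return ops


def t(payload: bytes) -> Optional[List[int]]:
    """Toy stand-in for the auth-check decoder: reads n bytes per run."""
    return decode(payload, 0)


def M(payload: bytes) -> Optional[List[int]]:
    """Toy stand-in for the execution decoder: reads n + 1 bytes per run."""
    return decode(payload, 1)


def find_smuggle(max_len: int = 4, alphabet=range(4)) -> Optional[bytes]:
    """Brute-force the shortest payload where t() sees no ADMIN but M() does."""
    for length in range(1, max_len + 1):
        for candidate in product(alphabet, repeat=length):
            payload = bytes(candidate)
            seen, run = t(payload), M(payload)
            if seen is None or run is None:
                continue  # both decoders must accept the payload
            if ADMIN not in seen and ADMIN in run:
                return payload
    return None
```

In this toy format, `bytes([0, 2, 0, 1])` is one such payload: `t()` decodes it as a single NOP, while `M()` decodes an admin command followed by a draw. The real search space was presumably far larger, but the shape of the bug is the same: the check and the executor disagree about where each run ends.
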
The flag was a 48-byte string that passes all checks.

What I did: Opened the file, saw walls of runic characters, and handed it to Claude.

What Claude did: Deobfuscated the JavaScript in stages, extracting a custom VM core implementation and three embedded WASM modules containing S-box tables, permutation logic, and round key scheduling. It traced the VM execution to identify six constraint functions operating on the 48-byte input. First it tried a hill-climbing heuristic search that got close but couldn't crack it. Then it switched to Z3 (an SMT solver), modeled the 21 symbolic bytes as bitvectors, built the crypto operations symbolically, and solved the constraint system to extract the flag: TISCDCSG{the_f1ag_ch0sen_speci4lly_for_th3_wasm}.

Phantom Chaser

A Linux x86-64 binary with a menu-driven "node control" system. A heap exploitation challenge targeting GLIBC 2.39's safe-linking protections.

What I did: Ran objdump, readelf, nm, and strings on the binary and libc.so. Dumped some disassembly. Pointed Claude at the vulnerability surface.

What Claude did: Built a complete exploit framework in Python with multiple attack paths. The exploit chain: (1) trigger a use-after-free (UAF) to leak the safe-link key material, (2) free a large chunk into the unsorted bin to leak the libc base from its fd/bk pointers, (3) use tcache double-free poisoning to get arbitrary read/write, (4) read the exit handler list and recover the pointer guard by comparing mangled function pointers against known offsets, (5) forge an exit handler entry pointing to system("/bin/sh"), (6) trigger exit to pop a shell. It also built GDB automation scripts and a Dockerfile for reproducible debugging.

These aren't simple "scan for known CVEs" problems. They required understanding custom protocols, reasoning through obfuscation layers, and chaining multiple exploitation steps together. I didn't do any of that. Claude did.

Then Anthropic announced Project Glasswing.
They built a model called Claude Mythos Preview that discovered thousands of zero-day vulnerabilities across every major OS and browser, including a 27-year-old bug in OpenBSD. I wasn't surprised. If Opus already handles CTF challenges better than anything else I've tested, and Mythos is a significant leap beyond Opus, the math checks out. Of course it's finding bugs that have been hiding for decades.

And instead of releasing it publicly (which would've been an absolute cash machine), they chose to restrict access and use it for defense. A model that finds zero-days this effectively would be catastrophic in the wrong hands. I'm genuinely glad they made that call over chasing profits.

Why This Is a Big Deal

Cybersecurity has always been lopsided. Attackers need to find one bug. Defenders need to find all of them. You hire a red team, they find 50 vulnerabilities, you patch them, and there are still hundreds more lurking in your codebase.

Having done CTF challenges, I know finding vulnerabilities isn't like generating text or summarizing a document. It requires underst...