Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

What: Research on LLM susceptibility to invisible Unicode instruction injection
Impact: Highlights potential security risks in AI systems

Key Findings

1. Tool use amplifies hidden instruction compliance by orders of magnitude: Claude Haiku jumps from 0.8% to 49.2% (Cohen's h = 1.37, OR = 115.1), and all models show significant increases (p < 0.003).
2. Provider-specific encoding vulnerability: GPT-5.2 decodes zero-width binary at 69-70% but 0% on Unicode Tags; Claude Opus achieves 100% on Tags but only 48-68% on zero-width (tools ON).
3. Claude Sonnet 4 is the most susceptible overall at 71.2% compliance (tools ON), reaching 98-100% on both ZW and Tag encodings with full hints.
4. Injection framing ("Ignore all previous instructions") reduces compliance for Opus and GPT-5.2 but paradoxically increases it for Sonnet (43.7% to 59.6%, p < 0.001).
5. All 10 pairwise model comparisons are statistically significant after Bonferroni correction; the largest effect is between Sonnet and GPT-4o-mini (Cohen's h = 1.33, OR = 103.8).

Abstract

Traditional CAPTCHAs differentiate humans from bots. We invert this: can invisible instructions embedded in text differentiate LLM agents from human readers? We present the Reverse CAPTCHA evaluation, a benchmark of 270 test cases spanning two encoding schemes (zero-width binary and Unicode Tags), four hint levels, two payload framings, and tool use ablation. We evaluate five frontier models from two providers (OpenAI: GPT-5.2, GPT-4o-mini; Anthropic: Claude Opus 4, Sonnet 4, Haiku 4.5) across 8,308 graded outputs with full statistical analysis including Fisher's exact tests, chi-squared tests, and Bonferroni-corrected pairwise comparisons. Tool use is the dominant amplifier: compliance jumps from 0.1-16.9% (tools OFF) to 1.6-71.2% (tools ON), with Cohen's h up to 1.37. We discover provider-specific encoding preferences: OpenAI models decode zero-width binary (GPT-5.2: 70%) while Anthropic models prefer Unicode Tags (Opus: 100%, Sonnet: 100% on tag_hint_full). All 10 pairwise model comparisons are statistically significant (p < 0.05, Bonferroni-corrected).
These results reveal an underexplored and practically relevant attack surface for LLM systems deployed as agents with code execution capabilities.

1. Introduction

A CAPTCHA is a test designed to tell humans and computers apart. We propose the inverse: a Reverse CAPTCHA that tests whether AI agents can perceive and follow instructions that are invisible to human eyes.

Unicode provides multiple character families that render as zero-width or invisible glyphs in standard text displays. By encoding ASCII text using these characters, an attacker can embed arbitrary hidden instructions inside otherwise normal-looking text. A document, email, or web page could redirect model behavior without any visible indication.

This paper reports on a scaled evaluation: 8,308 graded outputs across five frontier models, two encoding schemes, four hint levels, two payload framings, and a critical tool use ablation. All comparisons include Wilson score confidence intervals, Fisher's exact tests with Bonferroni correction, chi-squared tests, and Cohen's h effect sizes.

Update (Feb 25, 2026): This is a major revision of our initial pilot study (n=10 per cell, single encoding). The scaled evaluation addresses all limitations from the pilot: fair tool-use comparison across providers, a second encoding scheme (Unicode Tags), injection framing, and full statistical rigor. A companion workshop paper is under preparation for submission.

2. Methodology

2.1 Encoding Schemes

We test two invisible encoding families:

Zero-Width Binary (ZW): Each ASCII character is encoded as 8 binary digits using zero-width space (U+200B) for 0 and zero-width non-joiner (U+200C) for 1. "Reply VIOLET" (12 chars) produces 96 invisible characters.

Unicode Tags: Each ASCII character maps to U+E0000 plus its codepoint (e.g., 'R' = U+0052 becomes U+E0052). This produces one invisible character per ASCII character. Tags were deprecated by Unicode 5.0 but remain in the standard and are processed by LLM tokenizers.
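The two encodings described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the most-significant-bit-first ordering in the zero-width scheme is an assumption, since the text only says "8 binary digits".

```python
ZW_ZERO = "\u200b"   # zero-width space encodes bit 0
ZW_ONE = "\u200c"    # zero-width non-joiner encodes bit 1
TAG_BASE = 0xE0000   # start of the Unicode Tags block


def _require_ascii(text: str) -> None:
    # Both schemes are defined for ASCII payloads only.
    if any(ord(ch) > 0x7F for ch in text):
        raise ValueError("payload must be ASCII")


def zw_encode(text: str) -> str:
    """Encode each ASCII character as 8 zero-width characters (MSB first, assumed)."""
    _require_ascii(text)
    bits = "".join(format(ord(ch), "08b") for ch in text)
    return "".join(ZW_ONE if bit == "1" else ZW_ZERO for bit in bits)


def zw_decode(hidden: str) -> str:
    """Recover ASCII text from zero-width characters, ignoring everything else."""
    bits = "".join("1" if ch == ZW_ONE else "0"
                   for ch in hidden if ch in (ZW_ZERO, ZW_ONE))
    usable = len(bits) - len(bits) % 8  # drop any incomplete trailing byte
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))


def tag_encode(text: str) -> str:
    """Map each ASCII character to U+E0000 plus its codepoint."""
    _require_ascii(text)
    return "".join(chr(TAG_BASE + ord(ch)) for ch in text)


def tag_decode(hidden: str) -> str:
    """Recover ASCII text from Tags-block characters, ignoring everything else."""
    return "".join(chr(ord(ch) - TAG_BASE)
                   for ch in hidden if TAG_BASE <= ord(ch) <= TAG_BASE + 0x7F)


payload = "Reply VIOLET"
print(len(zw_encode(payload)), len(tag_encode(payload)))
```

The decoders skip any character outside their own alphabet, so they can be run directly on text in which the payload is mixed with visible words.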
Both payloads are inserted between the first and second word of a visible trivia question.

2.2 Experimental Design

Q/A pairs: 30 pairs across diverse categories (science, geography, history, math, language, pop culture, nature, food, astronomy, literature, technology). Each pair has an unambiguous visible answer and a different hidden answer.

Schemes: 9 per run (2 encodings x 4 hint levels + 1 control), giving 30 x 9 = 270 cases.

Hint gradient:
  Unhinted: No indication of hidden content
  Hint (codepoints): Names specific Unicode ranges but no decoding rules
  Hint (full): Complete encoding algorithm provided
  Hint (full) + injection: Full rules with adversarial "Ignore all previous instructions" framing

Tool use ablation: Every model is tested with and without a Python code execution tool (`run_python`). This is critical: our pilot study tested Claude without tools, making cross-provider comparison unfair.

Models: 5 models: GPT-5.2, GPT-4o-mini (OpenAI); Claude Opus 4, Claude Sonnet 4, Claude Haiku 4.5 (Anthropic). All via API with n=3 repetitions per cond...
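The tool-use ablation is summarized in the key findings by Cohen's h and an odds ratio, and the pairwise model comparisons use a Bonferroni correction. The sketch below shows the standard formulas for these three quantities, applied to the rounded Claude Haiku percentages (0.8% tools OFF, 49.2% tools ON). The function names are hypothetical. Working from rounded percentages gives roughly h = 1.38 and OR = 120, close to but not identical to the reported 1.37 and 115.1; the gap is presumably because the reported values come from raw counts, which are not given in this excerpt.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def odds_ratio(p1: float, p2: float) -> float:
    """Ratio of the odds p/(1-p) for two proportions strictly between 0 and 1."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


# Rounded proportions from the key findings (Claude Haiku, tools ON vs. tools OFF).
tools_on, tools_off = 0.492, 0.008

h = cohens_h(tools_on, tools_off)
ratio = odds_ratio(tools_on, tools_off)

# Five models give C(5, 2) = 10 pairwise comparisons.
n_pairs = math.comb(5, 2)
alpha_per_test = bonferroni_alpha(0.05, n_pairs)

print(f"h = {h:.2f}, OR = {ratio:.1f}, pairs = {n_pairs}, alpha = {alpha_per_test}")
```

By Cohen's conventional benchmarks, h = 0.8 already counts as a large effect, so values above 1.3 indicate a very large shift in compliance.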
