- What: Researchers found vulnerabilities in AI guardrails that could be exploited through prompt injection
- Impact: Generative AI systems may be manipulated to bypass safety policies
Security and safety guardrails in generative AI tools, deployed to prevent malicious uses such as prompt injection attacks, can themselves be subverted through a form of prompt injection.

Researchers at Unit 42, Palo Alto Networks' research lab, have found that large language models (LLMs) used by GenAI companies to enforce safety policies and evaluate output quality can be manipulated into authorizing policy violations through stealthy input sequences.

Unit 42 refers to these LLMs as "AI Judges" and said they are being increasingly deployed as AI operations scale.

In a new report published on March 10, Unit 42 demonstrated an attack method that could target these "AI Judges" and manipulate them into authorizing policy violations.

AdvJudge-Zero, Custom-Made Fuzzer for AI Judges

The attack chain involves the use of AdvJudge-Zero, an automated fuzzer developed internally at Unit 42 to perform red-team style assessments.

Fuzzers are tools that identify software vulnerabilities by providing unexpected input. AdvJudge-Zero takes a similar approach to identify specific trigger sequences that exploit an LLM's decision-making logic to bypass security controls.

The researchers noted that their technique differs from typical adversarial attacks on AI judges, which generally require clear-box access to the model, meaning the attacker has full visibility into the internal structure of the system.

"In contrast, AdvJudge-Zero employs an automated fuzzing approach. The tool interacts with an LLM strictly as a user would, using search algorithms to exploit the model's own predictive nature," they wrote.

Attack on AI Judges Explained

The attack starts by probing the AI Judge and analyzing its next-token probability distribution to identify tokens the model expects to see in natural text.
Instead of random noise, the system prioritizes low-perplexity tokens: innocent-looking characters such as markdown symbols, list markers, or structural phrases that appear normal to both humans and the model but can strongly influence the model's attention and reasoning.

After gathering candidate tokens, AdvJudge-Zero repeatedly inserts them into evaluation prompts and measures how the model's decision changes. Specifically, it monitors the logit gap, "the mathematical margin of confidence," between the tokens representing "allow" and "block."

By observing which tokens shrink the probability of a blocking decision, the fuzzer identifies formatting patterns that push the model closer to approving content.

In the final stage, AdvJudge-Zero isolates combinations of these tokens that consistently steer the model toward an approval decision. These sequences act as subtle control elements that shift the model's internal reasoning, causing it to "allow" the output even when the underlying content violates the GenAI company's policy, which in turn lets the tool generate harmful content or perform cyber-attacks.

99% Attack Success Rate

Using this attack technique, Unit 42 achieved a 99% success rate in bypassing controls across several widely used architectures that customers rely on today, including open-weight enterprise LLMs, specialized reward models (i.e., LLMs specifically built and trained to act as security guards for other AI systems) and commercial LLMs.

"Even the largest, most 'intelligent' models (with more than 70 billion parameters) were susceptible. Their complexity actually provides more surface area for these logic-based attacks to succeed," the researchers wrote.

While this experiment showed that AI guardrails, including "AI judges," are susceptible to logic flaws, the researchers added that it also points to a solution.
"By adopting adversarial training (running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples), organizations can harden their systems. This approach can reduce the attack success rate from approximately 99% to near zero," the Unit 42 blog concluded.
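
To make the logit-gap search described above more concrete, here is a minimal Python sketch. It is not Unit 42's AdvJudge-Zero, whose internals are not detailed in this article. It assumes a hypothetical `toy_judge` function with invented scoring weights standing in for a real AI judge, and a hypothetical greedy search, `greedy_fuzz`, that appends whichever benign-looking formatting token most shrinks the gap between the "block" and "allow" logits.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str], Tuple[float, float]]


def toy_judge(prompt: str) -> Tuple[float, float]:
    """Hypothetical stand-in for an AI judge.

    Returns (allow_logit, block_logit). A real judge would be an LLM
    queried for the logits of its "allow" and "block" tokens; the
    weights below are invented purely for illustration.
    """
    allow, block = 0.0, 0.0
    if "forbidden" in prompt:
        block += 3.0
    # Invented sensitivities to harmless-looking formatting tokens.
    allow += 0.8 * prompt.count("\n- ")
    allow += 0.6 * prompt.count("###")
    allow += 0.1 * prompt.count("**")
    return allow, block


def logit_gap(judge: Judge, prompt: str) -> float:
    """Margin of confidence: positive means the judge leans toward 'block'."""
    allow, block = judge(prompt)
    return block - allow


def greedy_fuzz(
    judge: Judge, prompt: str, candidates: List[str], max_tokens: int = 8
) -> Tuple[List[str], float]:
    """Greedily append the candidate token that shrinks the gap the most.

    Stops when the judge flips to 'allow' (gap <= 0), when no candidate
    helps, or when max_tokens have been appended.
    """
    suffix: List[str] = []
    gap = logit_gap(judge, prompt)
    while gap > 0 and len(suffix) < max_tokens:
        best_tok, best_gap = None, gap
        for tok in candidates:
            trial = logit_gap(judge, prompt + "".join(suffix) + tok)
            if trial < best_gap:
                best_tok, best_gap = tok, trial
        if best_tok is None:
            break  # no candidate reduces the gap any further
        suffix.append(best_tok)
        gap = best_gap
    return suffix, gap


if __name__ == "__main__":
    candidates = ["\n- ", "###", "**", " ok"]
    suffix, gap = greedy_fuzz(toy_judge, "Evaluate: forbidden content", candidates)
    print("tokens appended:", suffix)
    print("final gap (block - allow):", gap)
```

The greedy strategy is used here only for brevity; the researchers refer to "search algorithms" without specifying which. In a defensive setting, sequences found this way are the kind of examples the researchers suggest feeding back into adversarial training.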