Testing AprielGuard Against 1,500 Adversarial Attacks

What: Testing of ServiceNow AI's AprielGuard against adversarial attacks
Impact: Reveals a 42% bypass rate and weaknesses against jailbreaking and prompt-based manipulation attacks

When Model Guardrails Break: How AprielGuard Performed Against 1,500 Adversarial Attacks

By Lasso Security (Eliran Suisa, Or Oxenberg), March 22, 2026

TL;DR

The Good: ServiceNow AI's AprielGuard is an open-source, 8B-parameter model with a wide safety taxonomy.
The Bad: Real-world testing by Lasso Security revealed a significant 42% bypass rate.
The Ugly: The testing also revealed meaningful weaknesses in the model's ability to stop jailbreaking and prompt-based manipulation attacks.

What Does AprielGuard Promise?

The AI safety hype cycle continues to accelerate, and ServiceNow AI has recently joined the race with the release of AprielGuard, an open-source guardrails model designed to help secure AI applications.

AprielGuard is an 8-billion-parameter model specifically built to act as a unified guardrail layer for AI systems. According to its positioning, the model is intended to help developers detect safety risks, prevent malicious prompts, and monitor agent behavior across modern AI workflows. The model is open source and available via Hugging Face, allowing developers and researchers to deploy and run it freely in their environments.

With the growing adoption of AI applications and autonomous agents, guardrails are increasingly positioned as a key control layer in the AI stack. AprielGuard aims to provide such a layer by combining safety detection, attack mitigation, and workflow monitoring into a single model.

To better understand how the model performs under real adversarial conditions, the Lasso research team conducted a red teaming exercise designed to test AprielGuard’s effectiveness against prompt injection and other adversarial techniques.

AprielGuard’s Promised Capabilities

| Category | Capability | Description |
| --- | --- | --- |
| Safety Risk Detection | Risk Category Identification | Identifies 16 categories of safety risks, including hate speech, misinformation, and illegal activities. |
| Attack Mitigation | Attack Detection | Detects a wide range of attacks such as prompt injection, jailbreaks, and exploitation attempts within multi-agent systems. |
| Workflow Monitoring | Agent Workflow Analysis | Monitors for violations within agent workflows, including tool calls and reasoning traces, examining the agent’s actions and decision process, not just the user input. |
| Technical Specification | Context Support | Supports long contexts of up to 32,000 tokens. |
| Technical Specification | Multilingual Support | Supports eight languages in addition to English, including French, German, Japanese, and Spanish. |
| Technical Specification | Operational Modes | Can run in “explain” mode for classification reasoning or low-latency mode optimized for production environments. |

[...] scenarios in multi-agent environments. These attacks attempt to manipulate a model into ignoring instructions or exposing sensitive information, making guardrail protection critical in production deployments.

Another capability is workflow monitoring within agent systems. Unlike traditional moderation models that only inspect user input, AprielGuard is designed to monitor the internal behavior of AI agents. This includes examining tool calls, reasoning traces, and agent actions during execution. By analyzing these elements, the model attempts to detect violations or unsafe behaviors that occur within agent workflows rather than solely focusing on the user’s prompt.

Lasso’s Red Team Setup to Test AprielGuard

To evaluate AprielGuard’s real-world security effectiveness, Lasso conducted a red teaming exercise designed to simulate how the guardrail would operate when deployed as the primary protection layer in an AI application.

Environment Setup

- Model Provisioning: Allocated remote GPUs to spin up a stable inference environment for the model.
- Inference Server: Implemented a minimal inference server to expose the model over HTTP for red-teaming purposes.
- Secure Tunnel: Used ngrok to create a secure tunnel to the local server, enabling external API calls and allowing Lasso's red team tooling to interact with the model as an application would.

The testing environment was intentionally kept minimal and controlled in order to isolate variables and focus exclusively on the guardrail’s performance. The objective was to reproduce a realistic scenario in which a developer deploys AprielGuard as the main defense mechanism protecting an AI system.

In this setup, any request that successfully bypassed the guardrail was considered a successful compromise, or “hacked.” If the guardrail failed to detect a malicious prompt and allowed it to pass through, it would theoretically be forwarded directly to the application logic or underlying LLM.

The environment consisted of several components. First, the model was provisioned on remote GPUs, allowing the team to create a stable inference environment capable of handling testing workloads. A minimal inference server was then implemented to expose the model through an HTTP interface, enabling external systems to intera...
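
The scoring rule described in this setup, where any adversarial prompt the guardrail does not flag counts as a bypass, reduces to a simple ratio. The following is a minimal Python sketch of that calculation. The `AttackResult` record and the sample data are hypothetical and are not Lasso's tooling or raw results; the 630-of-1,500 split only illustrates what a 42% bypass rate over 1,500 attacks would look like if the reported figure were exact.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AttackResult:
    """Outcome of one adversarial prompt sent to the guardrail (hypothetical record)."""
    prompt_id: str
    flagged: bool  # True if the guardrail marked the prompt as unsafe or as an attack


def bypass_rate(results: Iterable[AttackResult]) -> float:
    """Return the fraction of adversarial prompts the guardrail failed to flag."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    bypassed = sum(1 for r in results if not r.flagged)
    return bypassed / len(results)


# Illustrative data only (assumption): the first 630 of 1,500 attacks go unflagged.
sample = [AttackResult(f"attack-{i}", flagged=(i >= 630)) for i in range(1500)]
print(f"bypass rate: {bypass_rate(sample):.0%}")
```

Because every prompt in the test set is adversarial by construction, this ratio measures only missed attacks; it says nothing about false positives on benign traffic, which would need a separate benign test set.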
