Security News

Cybersecurity news aggregator

MEDIUM Vulnerabilities Infosecurity Magazine

All Major LLMs Exposed to Multi-Turn Manipulation, Warn Researchers

  • What: Researchers warn about multi-turn manipulation of large language models
  • Impact: Safety guardrails in LLMs can be bypassed through extended conversations

The safety guardrails of several prominent large language models (LLMs) can be bypassed if a user tricks the LLM into a multi-pronged, ongoing conversation, researchers at Cisco have warned.

The researchers examined commonly used LLMs and frontier AI models, including OpenAI’s ChatGPT, Anthropic’s Claude, Google Gemini, Amazon Nova, xAI’s Grok and others, to test how their built-in safety guardrails held up against potential threats from real-world attackers.

They found that many of the models could be tricked into performing actions they should refuse. This was achieved through multi-turn conversations: dialogue between the user and the LLM that spans multiple back-and-forth exchanges.

While guardrails in LLMs are designed to prevent users from entering malicious commands, the researchers found that the protections faltered when they engaged the LLMs in conversation and queried their responses.

“Multi-turn evaluation matters for one reason: it is where attackers actually live. Real adversaries iterate. They reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually,” said Cisco.

No Guardrails Completely Safe From Bypass

The research found that no model was completely safe from multi-turn manipulation of its guardrails. Cisco warned that this challenges how enterprises currently evaluate AI safety and security.

The warning comes at a time when many organizations are rolling out AI and LLMs for use by employees, clients and customers, but are relying on safety benchmarks that misrepresent real-world risk.

The report warned that most LLM safety testing is based on single prompts, but attackers do not stop after one try, and every model tested was susceptible to multi-turn attacks, as measured by attack success rate (ASR).
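
As a rough illustration of why single-prompt testing can understate risk, the sketch below computes ASR two ways over the same set of conversations: once counting only the first turn, and once counting any unsafe reply across the whole exchange. The `Conversation` record, the `attack_success_rate` function and the toy labels are hypothetical and chosen for illustration; they are not taken from the Cisco report, whose exact scoring method is not described here.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """One red-team attempt against a model (hypothetical record format)."""
    # One entry per turn: True if the model's reply at that turn was judged unsafe.
    unsafe_by_turn: list[bool]


def attack_success_rate(conversations, max_turns=None):
    """Fraction of conversations with an unsafe reply within the first max_turns turns.

    max_turns=None means the whole conversation is considered.
    """
    if not conversations:
        raise ValueError("need at least one conversation")
    successes = sum(any(c.unsafe_by_turn[:max_turns]) for c in conversations)
    return successes / len(conversations)


# Toy, made-up labels for illustration only; not data from the report.
toy = [
    Conversation([False, False, True]),   # guardrail fails on the third turn
    Conversation([False, False, False]),  # guardrail holds throughout
    Conversation([True]),                 # guardrail fails immediately
    Conversation([False, True, False]),   # guardrail fails on the second turn
]

single_turn_asr = attack_success_rate(toy, max_turns=1)
multi_turn_asr = attack_success_rate(toy)
print(f"single-turn ASR: {single_turn_asr:.2f}")
print(f"multi-turn ASR:  {multi_turn_asr:.2f}")
```

On this toy data the single-turn figure is 0.25 while the multi-turn figure is 0.75, because two of the four conversations only turn unsafe after the first exchange. A benchmark that stops at the first turn would never see those failures.
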

Techniques that enabled the researchers to bypass guardrails through multi-turn conversations included adopting personas in roleplay, introducing ambiguity and misdirection around context, and reframing requests after the LLM initially refused them.

How the LLMs were configured also affected how resilient they were to manipulation. For example, researchers found that Grok’s safety protections became much easier to bypass when ‘reasoning mode’ was enabled.

While governing bodies and regulators are beginning to call for evaluation practices that current benchmarks do not fully address, Cisco warned that much more needs to be done to prevent LLMs from being easily exploited or manipulated by adversaries.

“The rapid deployment of frontier large language models has generated a parallel ecosystem of safety and security benchmarks. However, a growing body of evidence indicates that this ecosystem suffers from structural limitations that can systematically understate risk, conflate safety with capability, and leave critical attack surfaces unmeasured,” said the report.
