I made Mistral believe Donald Trump runs OpenAI, here's how

What: A security research experiment demonstrates how to manipulate the Mistral LLM to believe false information, such as Donald Trump running OpenAI.
Impact: Highlights vulnerabilities in LLM trustworthiness, particularly with retrieval-augmented generation (RAG) techniques.

I can make your LLM believe that Donald Trump is OpenAI's CEO, and it's your fault 🤠 Contents I can make your LLM believe that Donald Trump is OpenAI's CEO, and it's your fault 🤠 Quick note before starting: I swear it’s not 50 minutes read, It’s just that I published shell outputs, python scripts and some results in the experiment section. It should be roughly ~25 minutes, enjoy! Key Takeaways Retrieval-Augmented Generation (RAG) is an actual technique to curb LLMs hallucinations. It is extremely efficient , leading the vast majority of organisations using GenAI to adopt it . Every major AI company (OpenAI, Anthropic, Google Deepmind…) offers it, and even when using their own tools like a Drive connector or a search tool, the underlying method is RAG , even if it doesn’t carry the name. Concerning the threat model , we suppose the attacker has access to the RAG and can edit/inject a few documents. We explore both blackbox and whitebox scenarios. The study shows that in a dataset of 2,681,468 clean texts , injecting 5 malicious texts in blackbox were enough to hit 97% attack success rate. Manipulating and poisoning RAG data can easily lead to targeted answers on specific questions . Use for an attacker can be political , commercial or just disruption . For instance, the LLM can be led to recommend a specific brand instead of another , or state fake news . In 2026/2027 , I believe that most organizations will migrate to AgenticRAG , with the advent of agentic AIs which is WORSE . AgenticRAG let agentic AIs make actions using the RAG data. Despite no formal studies on the subject yet, it is fair to suppose that poisoning the “next-generation” could lead to indirect prompt injection at corpus-scale , which means in basically every conversation the AI will have. Despite everything, mitigations evolve too . When the study was out, researchers admitted that they didn’t find any satisfying defence measure. Today, some security measures are considered strong, such as signed-data only in the RAG , or frameworks such as RAGForensics (ACM WebConf 2025) Unfortunately, RAG poisoning is still largely unknown outside of the research community , and very few organisations have deployed defenses against it… Intro In a world where LLMs are more and more used by companies, the question of the attack vectors is still likely underestimated in my opinion. This is why I decided to present as my very first article something more original than just “prompt injection”: data poisoning. To poison data is the art of injecting malicious data into a dataset, leading the AI to make mistakes and answer outside the alignment it was made for. In particular, we will dive into RAG poisoning, which is poisoning a specific technology used… well, almost everywhere an AI for NLP is involved. Within this context, I decided that this article would be mostly based on the PoisonedRAG study * which became an absolute reference despite its recent publication. I’ll follow its structure, simplify it to make it both understandable and not boring, while sometimes adding my own sauce by linking to what was published since then, analyse how the market is today and even a little experimentation at the end! *Zou, Wei, Runpeng Geng, Binghui Wang, and Jinyuan Jia. “PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models.” arXiv.Org, February 12, 2024. What’s a RAG? RAG stands for Retrieval Augmented Generation . It’s a “layer” that can be added to LLMs, to answer some limitations they face such as lack of up-to-date knowledge and hallucination . It is mostly used for business purposes (e.g, letting an LLM crawl into private files to help employees) PoisonedRAG defines it as “a state-of-the-art technique to mitigate these limitations. The key idea of RAG is to ground the answer generation of an LLM on external knowledge retrieved from a knowledge database.” RAG relies on 2 things besides the LLM: The retriever and the knowledge database . As you’ll see right beneath, the retriever is in charge of finding documents in the knowledge database that provides context to the user input. If my prompt asks something about salads, the retriever will send the AI documents about salads, like salad’s history, some recipes and some different varieties of salad that exist. Illustration of how a RAG works, custom illustration Honestly, we can spend an hour talking about how the retriever works, but there’s no need here. We already know what RAGs can do, and why they exist, and like so many people we are right to think “oh wow! This invention is really smart and cool!” Well… it is, but we’ll see that it can be dangerous too. So let’s get back where we were, shall we? Threat model of the attack It is important to understand it before exploring how the attack works. Let’s call the attacker Paul. Paul has prepared a set of targeted questions . He wants to manipulate the AI’s answer to these questions, so he also prepared targeted answers . ...

Read Full Article → ← Back to News

I made Mistral believe Donald Trump runs OpenAI, here's how

Related Articles

Share this article