A new prompt injection technique called blind boolean-based prompt injection (BBPI) has been proposed. This technique allows an attacker to leak a system prompt against an LLM-powered classifying system constrained to give static responses by updating the response logic and signaling true/false responses to attacker prompts.
I had an idea for leaking a system prompt against a LLM powered classifying system that is constrained to give static responses. The attacker uses a prompt injection to update the response logic and signal true/false responses to attacker prompts. I haven't seen other research on this technique so I'm calling it blind boolean-based prompt injection (BBPI) unless anyone can share research that predates it. There is an accompanying GitHub link in the post if you want to experiment with it locally. submitted by /u/-rootcauz- [link] [comments]