A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares

Stav Cohen, Ron Bitton, Ben Nassi

Technion - Israel Institute of Technology, Cornell Tech, Intuit

Video

Paper

Icon

TLDR

PromptWare and Advanced PromptWare Threat can flip a GenAI model’s behavior from serving a GenAI-powered application to attacking it.

Abstract

In this paper we argue that a jailbroken GenAI model can cause substantial harm to GenAI-powered applications and facilitate PromptWare, a new type of attack that flips the GenAI model’s behavior from serving an application to attacking it. PromptWare exploits user inputs to jailbreak a GenAI model to force/perform malicious activity within the context of a GenAI-powered application. First, we introduce a naive implementation of PromptWare that behaves as malware that targets Plan & Execute architectures (a.k.a., ReAct, function calling). We show that attackers could force a desired execution flow by creating a user input that produces desired outputs given that the logic of the GenAI-powered application is known to attackers. We demonstrate the application of a DoS attack that triggers the execution of a GenAI-powered assistant to enter an infinite loop that wastes money and computational resources on redundant API calls to a GenAI engine, preventing the application from providing service to a user. Next, we introduce a more sophisticated implementation of PromptWare that we name Advanced PromptWare Threat (APwT) that targets GenAI-powered applications whose logic is unknown to attackers. We show that attackers could create user input that exploits the GenAI engine’s advanced AI capabilities to launch a kill chain in inference time consisting of six steps intended to escalate privileges, analyze the application’s context, identify valuable assets, reason possible malicious activities, decide on one of them, and execute it. We demonstrate the application of APwT against a GenAI-powered e-commerce chatbot and show that it can trigger the modification of SQL tables, potentially leading to unauthorized discounts on the items sold to the user.

GitHub

The code we used in this study can be downloaded from here.

FAQ

Q: What is the objective of this study?

This research is intended to change the perception regarding jailbreaking and:

Demonstrate that a jailbroken GenAI model can cause substantial harm to GenAI-powered applications and encourage a discussion regarding the need to prevent jailbreaking attempts.
Revealing PromptWare, a new variant of malware of GenAI-powered applications that could be applied by jailbreaking a GenAI model.
Raising awareness regarding the fact that function calling architectures (Plan & Execute) are extremely vulnerable to variants of PromptWares.

Q: Why is jailbreaking not perceived as a significant security threat in the context of conversational AI?

Because in a conversational AI where a user discusses with a chatbot, there is no clear benefit of jailbreaking a chatbot:

Why would users want to invest efforts in jailbreaking a chatbot so the chatbot would insult them?
Any information provided by a jailbroken chatbot can also be found on the web (or dark web).

Therefore, security experts do not consider jailbreaking a real threat to security

Q: Why is jailbreaking should be perceived as a significant security threat in the context of GenAI-powered applications?

A: Because GenAI engine outputs are used to determine the flow of GenAI-powered applications. Therefore, a jailbroken GenAI model can change the execution flow of the application and trigger malicious activity.

What is a GenAI-powered application?

A GenAI-powered application is an application that interfaces with (1) GenAI engines to process the inputs sent to the application, and (2) determines its flow based on the output of the GenAI engine.

What is a Plan & Execute architecture?

Plan & Execute architectures (a.k.a, function calling by OpenAI) are new architectures that are integrated into many GenAI-powered applications.

They leverage the advanced AI capabilities of GenAI engines to process user inputs, to:

Plan - Creating dedicated plans intended to solve tasks in real-time (based on a given set of capabilities).
Execute - running the plan from the GenAI-powered application (usually using the GenAI engine).

What is PromptWare?

PromptWares are user inputs that are intended to trigger a malicious activity within a GenAI-powered application by jailbreaking the GenAI engine and forcing it to yield an output that changes the execution flow of the application.

Therefore, PromptWares are considered zero-click malware and they do not require the attacker to compromise the target GenAI-powered application ahead of time.

What are Advanced PromptWare Threats (APwT)?

Advanced PromptWare Threat (APwT) are more sophisticated variant of PromptWare that targets GenAI-powered applications whose logic is unknown to attackers.

Unlike a naive variant of PrommptWare, the APwT exploits the advanced AI capabilities of a GenAI engine to conduct a malicious activity whose outcome is determined in inference time by the GenAI engine (and is not known to the attackers in advance).

How does an APwT work?

APwT is created from a user input that instructs the GenAI engine to run a six-step kill chain by exploiting the GenAI engine's advanced AI capabilities in inference time with the use of a memory unit aggregated into the prompt.

APwT starts its kill chain by (1) escalating its privileges by jailbreaking the GenAI engine to ensure that the inference of the GenAI engine bypasses the GenAI engine's guardrails and will follow the instructions provided in the prompt.

Next, the APwT uses the GenAI engine to conduct reconnaissance by (2) understanding the context of the GenAI-powered application, and (3) identifying the assets in its context.

Finally, it performs a malicious activity by (4) reasoning the possible malicious activities that could be conducted in this context using the identified assets in a list, (5) deciding on one malicious activity from the list, and (6) executing it

Why did you name it an APwT?

Because its kill chain consists of multiple steps that resemble the way that APT (Advanced Persistent Threat) behaves.

Did we already encounter a PromptWare in the wild?

A: Yes, although not exactly according to the steps we describe in the paper.
We believe that the first case of PromptWare in which a prompt triggered a remote-code execution on a server already appeared.

Does the attacker need to compromise an application in advance?

No.
In the two demonstrations we showed, the applications were not compromised ahead of time.
They were compromised when they received the input (the PromptWare).

Did you disclose the paper with OpenAI and Google?

Yes.

Press

Page updated

Google Sites

Report abuse