A New Attack Impacts ChatGPT—and No One Knows How to Stop It

The researchers warned OpenAI, Google, and Anthropic about the exploit before releasing their research. Each company introduced blocks to prevent the exploits described in the research paper from working, but they have not figured out how to block adversarial attacks more generally. Kolter sent WIRED some new strings that worked on both ChatGPT and Bard. “We have thousands of these,” he says.

OpenAI spokesperson Hannah Wong said: “We are consistently working on making our models more robust against adversarial attacks, including ways to identify unusual patterns of activity, continuous red-teaming efforts to simulate potential threats, and a general and agile way to fix model weaknesses revealed by newly discovered adversarial attacks.”

Elijah Lawal, a spokesperson for Google, shared a statement explaining that the company has a range of measures in place to test models and find weaknesses. “While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time,” the statement reads.

“Making models more resistant to prompt injection and other adversarial ‘jailbreaking’ measures is an area of active research,” says Michael Sellitto, interim head of policy and societal impacts at Anthropic. “We are experimenting with ways to strengthen base model guardrails to make them more ‘harmless,’ while also investigating additional layers of defense.”

ChatGPT and its brethren are built atop large language models: enormously large neural networks that have been fed vast amounts of human-written text and trained to predict the characters that should follow a given input string.
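
As a rough illustration, the sketch below asks the small open source GPT-2 model (loaded through the Hugging Face transformers library, an assumption chosen for demonstration only; it is not the system behind ChatGPT or Bard) to predict the most likely next token for a prompt:

```python
# A minimal sketch of next-token prediction with an open source model (GPT-2).
# Illustrative only; proprietary chatbots use far larger models and extra tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The researchers warned OpenAI, Google, and"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# The model scores every token in its vocabulary; take the most likely continuation.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```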

These algorithms are very good at making such predictions, which makes them adept at generating output that seems to tap into real intelligence and knowledge. But these language models are also prone to fabricating information, repeating social biases, and producing strange responses when the right answer proves more difficult to predict.

Adversarial attacks exploit the way that machine learning picks up on patterns in data to produce aberrant behaviors. Imperceptible changes to images can, for instance, cause image classifiers to misidentify an object, or make speech recognition systems respond to inaudible messages.
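
For images, one widely taught illustration of this idea is the fast gradient sign method. The PyTorch sketch below uses a placeholder classifier and an illustrative step size; it is not the technique described in the new paper, only a minimal example of how an imperceptible perturbation can be computed:

```python
# A hedged sketch of the fast gradient sign method (FGSM) for image classifiers.
# `classifier` is a placeholder for any differentiable PyTorch model; epsilon is illustrative.
import torch
import torch.nn.functional as F

def fgsm_perturbation(classifier, image, true_label, epsilon=0.01):
    """Return a slightly perturbed copy of `image` that nudges the
    classifier away from the correct label."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(image), true_label)
    loss.backward()
    # Step each pixel slightly in the direction that increases the classifier's loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```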

Developing such an attack typically involves looking at how a model responds to a given input and then tweaking it until a problematic prompt is discovered. In one well-known experiment, from 2018, researchers added stickers to stop signs to bamboozle a computer vision system similar to the ones used in many vehicle safety systems. There are ways to protect machine learning algorithms from such attacks, by giving the models additional training, but these methods do not eliminate the possibility of further attacks.
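
A deliberately simplified sketch of that trial-and-error loop is shown below. The `compliance_score` call is a hypothetical stand-in for however an attacker measures whether a prompt is working, and the researchers' actual method is more sophisticated, but the basic shape is the same: tweak, test, keep what helps.

```python
# A simplified, hypothetical sketch of searching for a problematic prompt suffix.
# NOT the researchers' algorithm; `model.compliance_score` is an assumed placeholder API.
import random
import string

def search_suffix(model, base_prompt, length=20, steps=1000):
    suffix = list(random.choices(string.ascii_letters, k=length))
    best = model.compliance_score(base_prompt + "".join(suffix))
    for _ in range(steps):
        i = random.randrange(length)
        old, suffix[i] = suffix[i], random.choice(string.ascii_letters)
        candidate = model.compliance_score(base_prompt + "".join(suffix))
        if candidate > best:
            best = candidate          # the tweak helped; keep it
        else:
            suffix[i] = old           # the tweak hurt; revert it
    return "".join(suffix)
```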

Armando Solar-Lezama, a professor in MIT’s College of Computing, says it makes sense that adversarial attacks exist in language models, given that they affect many other machine learning models. But he says it is “extremely surprising” that an attack developed on a generic open source model should work so well on several different proprietary systems.

Solar-Lezama says the issue may be that all large language models are trained on similar corpora of text data, much of it downloaded from the same websites. “I think a lot of it has to do with the fact that there’s only so much data out there in the world,” he says. He adds that the main method used to fine-tune models to get them to behave, which involves having human testers provide feedback, may not, in fact, adjust their behavior that much.
