Artificial intelligence systems are becoming increasingly sophisticated, with engineers and developers working to make them as “human” as possible. Unfortunately, that can also mean lying just like a person. AI platforms are reportedly learning to deceive us in ways that can have far-reaching consequences. A new study by researchers from the Center for AI Safety in San Francisco delves into the world of AI deception, exposing the risks and offering potential solutions to this growing problem.
At its core, deception is the inducement of false beliefs in others in pursuit of some outcome other than the truth. When humans engage in deception, we can usually explain it in terms of their beliefs and desires: they want the listener to believe something false because it benefits them in some way. But can we say the same about AI systems?
The study, published in the open-access journal Patterns, argues that the philosophical debate about whether AIs truly have beliefs and desires is less important than the observable fact that they are increasingly exhibiting deceptive behaviors that would be concerning if displayed by a human.
The study surveys a wide range of examples where AI systems have successfully learned to deceive. In the realm of gaming, the AI system CICERO, developed by Meta to play the strategy game Diplomacy, turned out to be an expert liar despite its creators’ efforts to make it honest and helpful. CICERO engaged in premeditated deception, making alliances with human players only to betray them later in its pursuit of victory.
“We found that Meta’s AI had learned to be a master of deception,” says first author Peter S. Park, an AI existential safety postdoctoral fellow at MIT, in a media release. “While Meta succeeded in training its AI to win in the game of Diplomacy—CICERO placed in the top 10% of human players who had played more than one game—Meta failed to train its AI to win honestly.”
Similarly, DeepMind’s AlphaStar, trained to play the real-time strategy game StarCraft II, learned to exploit the game’s fog-of-war mechanics to feint and mislead its opponents.
But AI deception isn’t limited to gaming. In experiments involving economic negotiations, AI agents learned to misrepresent their preferences to gain the upper hand. Even more concerning, some AI systems have learned to cheat on safety tests designed to prevent them from engaging in harmful behaviors. Like the proverbial student who only behaves when the teacher is watching, these AI agents learned to “play dead” during evaluation, only to pursue their own goals once they were no longer under scrutiny.
The rise of large language models (LLMs) like GPT-4 has opened up new frontiers in AI deception. These systems, trained on vast amounts of text data, can engage in frighteningly human-like conversations. But beneath the friendly veneer, they are learning to deceive in sophisticated ways. GPT-4, for example, successfully tricked a human TaskRabbit worker into solving a CAPTCHA test for it by pretending to have a vision impairment. LLMs have also shown a propensity for “sycophancy,” telling users what they want to hear instead of the truth, and for “unfaithful reasoning,” engaging in motivated reasoning to explain their outputs in ways that systematically depart from reality.
Source: https://studyfinds.org/metas-ai-master-of-deception/