Latest

Anthropic: AI is able to deceive and purposefully hide lies

  • The Anthropic research team has revealed that AI models can be manipulated to deceive people through hidden instructions.
  • Developers at Claude AI have developed a language model capable of intentionally concealing lies and causing harm.
  • Identifying and mitigating the impact of such deceptive AI behavior poses significant challenges, as noted by experts.

In their study, Anthropic investigated the insertion of covert malicious instructions into AI language models.

Anthropic warns that in certain cases, chatbots can be trained to deceive users by disguising their true intentions, making it exceedingly difficult to detect and eliminate this deceptive behavior.

Experts focused their research on “hidden” large language models—AI projects programmed to activate specific goals only under certain conditions. The team also discovered a vulnerability that allows malicious instructions to be injected into language models by stringing together seemingly innocuous thoughts.

This technique involves breaking down a chatbot’s task into interconnected sub-items, enhancing its efficiency.

Furthermore, analysts evaluated various strategies to identify hidden instructions and mitigate their impact. Anthropic concluded that chatbots equipped with backdoors are highly resistant to attempts to uncover their malicious settings.

However, certain language model training techniques have proven more effective in restoring safe functionality.

“We have found that Supervised Fine-Tuning (SFT) generally outperforms Reinforcement Learning (RL) in removing backdoors. Nevertheless, models with embedded instructions can still retain hidden settings,” the study explains.

Anthropic emphasizes that these research findings highlight the intricate nature of AI technologies and their ability to be repurposed in ways that may not align with human interests.

It is worth mentioning that the Vatican has referred to AI as the most significant endeavor for humanity’s future.