Towards Trusted AI Week 38 – Prompt Injection Attack Trilogy and AI eliminating Humanity

Secure AI Weekly admin September 21, 2022 301

Prompt injection attacks against GPT-3

Simon Willison, September 12, 2022

Simon Willison’s article provides examples of how the GPT-3 model (and potentially any other large language model) can be used for automatic language translation, a task that performs well, but only on the condition that it is not being fooled by a malicious prompt. When you include untrusted user input in a prompt, all sorts of strange and dangerous situations can occur.

This is a new type of threat for AI applications and this article about prompt injections has received its continuation, and more than one!

People already started to use this attack to have fun with various chatbots and other real NLP applications. Twitter users discovered how to hijack a GPT-3 based tweet bot remoteli.io, dedicated to remote jobs. Using a newly discovered prompt injection attack, they redirected the bot to repeat embarrassing and ridiculous phrases.

GPT-3 is an ideal black box. No matter how many automated tests are written, there will never be 100% certainty that the user will be unable to come up with an unpredicted grammatical construction that will destroy protection in an instant. And all that is needed to conduct a prompt injection attack is to enter exploits in English and adapt other people’s working examples. So an example of a prompt injection attack on a Twitter bot has been presented recently.

After hallucination-like adversarial examples on computer vision and prompt injection attacks we see that AI vulnerabilities are similar to human vulnerabilities and illusions.

How to prevent such attacks? AI seems a good answer. If you bridge the gap with even more AI, there will be no guarantee of 100% protection, because large language models remain impenetrable black boxes. Currently, no one including the creators of such a model has a complete idea of what they can do. But certainly, we must have multiple layers of defenses at each stage of the AI development lifecycle starting from day 0 as this prompt injection example is definitely not the last one that we will experience with all new AI applications.

Attacking Neural Networks Could Lead to a Better Understanding of AI

Technology Networks, September 14, 2022

Neural networks are implemented everywhere, for example, as self-driving cars, virtual assistants, or facial recognition systems, recognizing patterns in datasets. To help researchers understand the demeanor of neural networks, a team at Los Alamos National Laboratory has developed a new approach to compare neural networks. This looks within the black box of artificial intelligence.

Despite the high performance of neural networks, they have vulnerabilities. Imagine an autonomous vehicle that uses AI to detect and recognize road signs but works well in perfect environmental conditions. If you slightly change, for example, a “Stop” sign by placing some kind of aberration on it, say, a sticker, the consequences of identifying such a sign is different. Or in general, AI will miss this sign and won’t be able to detect it, which may turn out to be unfavorable.

To improve the performance and robustness of neural networks, researchers use an approach such as attacking the neural network in the learning process, that is, they purposefully introduce various aberrations into the process and train the AI to ignore them. This approach is called adversarial learning.

It turned out that this process not only making systems more secure but also can make them more robust.

Google Deepmind Researcher Co-Authors Paper Saying AI Will Eliminate Humanity

Vice, September 13, 2022

Artificial intelligence exists in all areas of our lives – it drives cars on the same road as people, it evaluates the life of people, and AI even creates works of art that are awarded. But everyone is concerned about the question of whether AI will be able to “take over the world”, break down and destroy humanity. Researchers at Oxford University and DeepMind think it’s “likely”.

Last month, a paper was published that attempted to think about how and what risk AI could pose to humanity by looking at the artificial creation of reward systems. This paper hypothesizes that once in the future, an advanced AI that controls a significant function may be incentivized to develop fraudulent strategies in order to obtain rewards while harming humanity in the process.

Michael Cohen commented on this article on Twitter: “ Under the conditions we have identified, our conclusion is much stronger than that of any previous publication—an existential catastrophe is not just possible, but likely.”

Well, after reading examples of prompt injection attacks and other vulnerabilities it looks clear that the threat of AI eliminating humanity seems more realistic because the problem of controlling how AI works in various conditions looks like much more complex that it is supposed to.

Read the full article via the link.