Social media moderation is essential for businesses that want to build a consistent brand or an online community. It helps them communicate with their audience, identify issues, control their brand image, and ensure the accuracy of information. At the same time, automated moderation can be bypassed.
In the context of media, content moderation refers to managing user-generated content: monitoring comments and reporting or removing posts and publications. If content submitted to a media resource is harmful, sensitive, or otherwise inappropriate for the website, letting it appear can seriously damage that resource.
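As a rough illustration of what an automated check might look like, here is a minimal sketch of a blocklist-based moderation step. The blocklist, threshold, and decision labels are hypothetical and only stand in for whatever classifier or rule set a real platform would use.

```python
# Minimal, illustrative moderation check (hypothetical blocklist and threshold).
BLOCKLIST = {"scam", "offensive_word", "spam_link"}

def moderate(comment: str) -> str:
    """Return 'remove', 'review', or 'publish' for a user comment."""
    tokens = comment.lower().split()
    hits = sum(token in BLOCKLIST for token in tokens)
    if hits >= 2:
        return "remove"   # clearly violates the guidelines
    if hits == 1:
        return "review"   # borderline, send to a human moderator
    return "publish"

print(moderate("Great product, thanks!"))        # publish
print(moderate("This is spam_link and a scam"))  # remove
```

Real systems typically combine such rules with ML classifiers and human review, but the basic flow of scoring user-generated content and deciding whether to publish, flag, or remove it is the same.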
Online shops also face content moderation risks: the moderation tools their websites rely on can be fooled with the help of adversarial attacks, as sketched below. If published content does not comply with the shop's rules and guidelines, this can lead to customer churn and significant financial losses.
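To show how such an evasion can work in principle, here is a small, self-contained sketch of a character-level adversarial perturbation (homoglyph substitution) against a naive keyword filter. The filter, blocklist, and substitution map are illustrative assumptions, not a real attack toolkit or any specific shop's moderation system.

```python
# Illustrative character-level evasion of a naive keyword filter.
BLOCKLIST = {"scam", "fake"}

def is_blocked(text: str) -> bool:
    """Naive filter: block the text if any token is on the blocklist."""
    return any(word in BLOCKLIST for word in text.lower().split())

def perturb(text: str) -> str:
    """Swap Latin letters for visually similar Cyrillic homoglyphs."""
    homoglyphs = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}
    return "".join(homoglyphs.get(ch, ch) for ch in text)

original = "this product is a scam"
evasive = perturb(original)

print(is_blocked(original))  # True  -> removed by the filter
print(is_blocked(evasive))   # False -> slips past, yet reads the same to humans
```

Attacks on ML-based classifiers are more sophisticated than this toy example, but the idea is the same: a perturbation that is invisible or harmless to human readers changes the model's decision.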
This susceptibility is a fundamental problem for all content moderation algorithms, and it is vital to ensure that AI-driven solutions are safe and trustworthy for all users.
According to our report “The Road to Secure and Trusted AI”, the Internet industry is the most popular target for adversarial ML attacks (29%). Content moderation is used everywhere, and its results are visible to everyone, including children.
Attacks on content moderation can therefore pose serious reputational risks for businesses.
Content moderation is one of the most widely deployed AI technologies. We encounter it almost every day when browsing social networks and media or choosing products in online shops.
The analytical report “The Road to Secure and Trusted AI” is based on a detailed analysis of more than 2,000 security-related research papers and describes the most common AI vulnerabilities, real-life attacks, recommendations, and predictions for the industry’s further growth.