Marijan Hassan - Tech Journalist

AI companies reportedly still scraping websites despite blocking protocols

Major AI companies are being accused of ignoring established protocols designed to protect website content, raising concerns about data ethics and copyright infringement.

Background

Web scraping, the process of extracting data from websites using automated bots, has long been a contentious issue. While it is a common practice for gathering large datasets to train AI models, it raises significant legal and ethical questions, particularly when done without the consent of website owners.

Over the past few years, many websites have implemented measures to curb the practice, including the use of a robots.txt file, which essentially functions as a "do not enter" sign for automated crawlers.
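To illustrate how that "do not enter" sign works, here is a minimal sketch using Python's standard-library robots.txt parser. The crawler name "ExampleAIBot" and the example.com URL are hypothetical and do not refer to any company named in this article. The sketch shows only how a well-behaved crawler would read the rules; nothing in the file technically prevents a bot from ignoring them, which is the behavior the reports below describe.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
# "ExampleAIBot" is an invented user agent used purely for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/story"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))  # blocked crawler
    print(is_allowed(ROBOTS_TXT, "SomeBrowser", url))   # any other agent
```

A compliant crawler would run a check like this before each request and skip any URL for which it returns False. Compliance is voluntary, so the file expresses the site owner's wishes rather than enforcing them.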

Despite these efforts, recent reports indicate that some AI companies are circumventing these barriers to continue harvesting data.

The allegations

Forbes was the first to come forward, accusing the AI startup Perplexity of stealing its story and republishing it across multiple platforms.

Wired then reported that Perplexity was bypassing the robots.txt rules on its website and those of other Condé Nast publications in order to scrape them. The technology website The Shortcut is another publication that has singled out Perplexity.

More culprits

A recent investigation by Reuters revealed that several AI companies are also bypassing these instructions. The news agency says it has seen a letter from TollBit, a startup that connects publishers with AI firms, warning publishers that "AI agents from multiple sources" are disregarding robots.txt limitations.

OpenAI and Anthropic – the creators of the ChatGPT and Claude chatbots – are believed to be the other culprits. Both companies previously claimed to respect robots.txt instructions.

The scraping of website content raises several concerns. First, it disregards the wishes of website owners. Second, it raises copyright infringement issues if the scraped content is used to train AI models or republished without permission. Finally, the lack of transparency about data collection practices is concerning.

The incident highlights the need for stricter regulations and clearer ethical guidelines in the rapidly evolving field of AI. It remains to be seen how these companies will respond to the allegations and what steps will be taken to ensure responsible data collection practices in the future.
