Cloudflare has introduced a new feature that blocks web traffic from the bots AI developers use to scrape data for training their models. Research has shown that up to 85% of customers want to block these AI scrapers, yet many websites do not fully block them in their robots.txt files and at most exclude well-known bots such as OpenAI’s GPTBot.
The new feature is a one-stop option for blocking all of these bots at once. Cloudflare will track where each bot comes from and apply blocking measures automatically. By scraping volume, GPTBot is not the top scraper: Bytespider, Amazonbot, and ClaudeBot all rank ahead of it, with GPTBot in fourth place. Bytespider does not clearly state that it is used for training AI, although there have been reports of its data being used for LLM training. ClaudeBot, from Anthropic, on the other hand, is explicitly used to scrape data for training purposes.
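To illustrate the robots.txt gap described above, here is a minimal Python sketch that checks which of the four named crawlers a given robots.txt still lets through. It is not Cloudflare's method: the sample file and the `unblocked_bots` helper are hypothetical and exist only for illustration, and robots.txt is advisory, so it only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens of the crawlers named in this post.
AI_BOTS = ["Bytespider", "Amazonbot", "ClaudeBot", "GPTBot"]

# Hypothetical robots.txt that excludes only the best-known bot.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def unblocked_bots(robots_txt, bots=AI_BOTS, path="/"):
    """Return the bots that this robots.txt still allows to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if parser.can_fetch(bot, path)]


print(unblocked_bots(SAMPLE_ROBOTS_TXT))
# prints ['Bytespider', 'Amazonbot', 'ClaudeBot']
```

With this sample file, only GPTBot is excluded, which is the partial-blocking situation the post describes.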
Source: Cloudflare Blog
TLDR: Cloudflare has rolled out a feature to block AI data-scraping bots, tracking their origins and automatically implementing blocking measures. Top scrapers include Bytespider, Amazonbot, ClaudeBot, and GPTBot, with varying levels of transparency regarding their use for AI training.