Baidu Blocks Google and Bing from Scraping Content as AI Data Demand Grows

Chinese search giant Baidu has taken steps to block Google and Microsoft’s Bing from indexing content on its Wikipedia-style platform, Baidu Baike. The change, uncovered in a review of the site by the South China Morning Post, marks a significant effort by Baidu to protect its online assets as demand for data to train artificial intelligence (AI) models continues to rise.

On August 8, Baidu updated the robots.txt file of Baidu Baike, the file that tells search engine crawlers which parts of a website they may access. The update specifically bars Googlebot and Bingbot from indexing content on the platform, a reversal from earlier the same day, when the two crawlers could still access most of the site’s nearly 30 million entries.
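
For illustration only, the snippet below sketches the kind of robots.txt directives such a change might involve and checks them with Python's standard-library parser. The user-agent names Googlebot, Bingbot and Baiduspider are the crawlers' standard identifiers, but the specific rules and the example URL are assumptions, not the actual contents of Baidu Baike's file.

from urllib.robotparser import RobotFileParser

# Hypothetical rules of the kind described above: block Google's and
# Microsoft's crawlers while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Illustrative entry URL; the path is made up for the example.
url = "https://baike.baidu.com/item/example"
for agent in ("Googlebot", "Bingbot", "Baiduspider"):
    print(agent, parser.can_fetch(agent, url))
# Prints False for Googlebot and Bingbot, True for Baiduspider,
# showing how a robots.txt update can shut out specific crawlers.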

This action highlights Baidu’s intention to safeguard its vast data repository as AI developers seek ever larger datasets to train and improve their models and applications. The move follows similar steps by other platforms: in July, Reddit blocked all search engines except Google from indexing its content to protect data used for AI training.

The growing demand for high-quality data for generative AI (GenAI) projects has led to an increase in deals between AI developers and content publishers. For instance, OpenAI recently secured a deal with Time magazine, granting it access to over 100 years of archived content for its AI services.

Despite the update to Baidu Baike’s robots.txt, checks on Friday found that some older cached content from the platform still appears in Google and Bing search results. This suggests that while new crawling is blocked, remnants of previously indexed data remain accessible for now.

Representatives from Baidu, Google, and Microsoft did not immediately respond to requests for comment on this development.