AI Data Shortage Intensifies: MIT Report Indicates Decreasing Availability of Public Web Data

Open web data is becoming steadily less accessible: information that was once easy to obtain is now increasingly hard to reach.

01 Research Methods

Generally speaking, there are two types of measures to restrict web crawlers:

  • Robots Exclusion Protocol (REP)
  • Website Terms of Service (ToS)

The birth of REP can be traced back to 1995, well before the AI era. The protocol works through a robots.txt file placed in a website's root directory, which manages the activity of web crawlers and other robots, for example by naming specific user agents and stating which files they may access.

You can think of robots.txt as a "code of conduct" sign posted on the wall of a gym, bar, or community center: it has no enforcement power of its own. Well-behaved robots follow the rules, but bad robots can simply ignore them.
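To make the "well-behaved robots follow the rules" point concrete, here is a minimal sketch, using Python's standard urllib.robotparser, of how a compliant crawler would consult robots.txt before fetching a page. The robots.txt content, the example domain, and the choice of agents are illustrative, not taken from the paper.

```python
# Minimal sketch (not from the paper): a well-behaved crawler consults
# robots.txt before fetching, using Python's standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block OpenAI's GPTBot entirely, keep Common
# Crawl's CCBot out of /private/, and leave all other agents unrestricted.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/post-1")
    print(f"{agent:>9} may fetch /articles/post-1: {allowed}")

# Nothing here is enforced by the site itself: a crawler that skips this
# check can simply ignore the file.
```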

The paper surveyed the website sources of three datasets, listed in Table 1. All are widely influential open-source datasets, with download counts ranging from 100k to over 1M.

For each data source, the top 2k website domains by total token count were taken; their union yields 3.95k domains, labeled HEAD_All. The portion drawn from the C4 dataset alone is labeled HEAD_C4. These head domains can be regarded as the largest, most frequently maintained, and most critical sources of AI training data.

A further 10k domains were randomly sampled (RANDOM_10k), of which 2k were randomly selected for manual annotation (RANDOM_2k). RANDOM_10k was drawn only from domains that appear in all three datasets, so these sites are more likely to be high-quality web pages. A sketch of this sampling scheme follows.
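For illustration, here is a rough sketch of how such a sampling scheme could be implemented, assuming per-corpus token counts by domain are already available; the function names and exact selection logic are my own reading of the setup, not the authors' code.

```python
# Illustrative sketch of the sampling setup (my reading, not the authors'
# code): build the HEAD_All union of each corpus's top-2k domains by token
# count, then draw RANDOM_10k / RANDOM_2k from domains shared by all corpora.
import random

def top_k_domains(token_counts: dict[str, int], k: int = 2000) -> set[str]:
    """Return the k domains with the highest total token count."""
    return set(sorted(token_counts, key=token_counts.get, reverse=True)[:k])

def build_samples(corpora: dict[str, dict[str, int]], seed: int = 0):
    """`corpora` maps a corpus name (e.g. 'C4') to {domain: token_count}."""
    head_all = set().union(*(top_k_domains(tc) for tc in corpora.values()))
    shared = set.intersection(*(set(tc) for tc in corpora.values()))

    rng = random.Random(seed)
    random_10k = rng.sample(sorted(shared), k=min(10_000, len(shared)))
    random_2k = rng.sample(random_10k, k=min(2_000, len(random_10k)))
    return head_all, random_10k, random_2k
```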

As shown in Table 2, the manual annotation of RANDOM_2k covered many dimensions, including various content attributes and access permissions. To enable longitudinal comparisons over time, the authors drew on historical page snapshots archived by the Wayback Machine.
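As a rough sketch of how such historical snapshots can be retrieved, the example below queries the Wayback Machine's public CDX API for archived copies of a domain's robots.txt. The query parameters and the example domain are illustrative; this is not necessarily how the authors collected their data.

```python
# Rough sketch: list archived robots.txt snapshots for a domain via the
# Wayback Machine's public CDX API (parameters and domain are illustrative).
import requests

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def robots_snapshots(domain: str, year_from: int, year_to: int) -> list[str]:
    """Return Wayback URLs of a domain's robots.txt captures between two years."""
    params = {
        "url": f"{domain}/robots.txt",
        "from": str(year_from),
        "to": str(year_to),
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:6",  # keep at most one capture per month
    }
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=30).json()
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"http://web.archive.org/web/{row[ts]}/{row[orig]}" for row in captures]

# Example (hypothetical domain):
# for snapshot in robots_snapshots("example.com", 2016, 2024):
#     print(snapshot)
```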

The manually annotated data used in the study has been released publicly to make it easier for future work to replicate the research.

02 Overview of Results

Increase in Data Restrictions

In addition to collecting historical data, the paper also used a SARIMA (Seasonal Autoregressive Integrated Moving Average) model to forecast future trends.
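For readers unfamiliar with SARIMA, the sketch below shows the general shape of such a forecast using statsmodels; the synthetic monthly series and the model orders are placeholders, not the paper's fitted model.

```python
# Sketch of a SARIMA forecast with statsmodels; the synthetic monthly series
# and the (p, d, q)(P, D, Q, s) orders are placeholders, not the paper's model.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical monthly share of fully restricted tokens over three years.
months = pd.date_range("2021-01-01", periods=36, freq="MS")
share = np.linspace(0.01, 0.25, 36) + 0.01 * np.sin(np.arange(36) / 2)
restricted_share = pd.Series(share.clip(min=0.0), index=months)

model = SARIMAX(
    restricted_share,
    order=(1, 1, 1),               # non-seasonal AR / differencing / MA
    seasonal_order=(0, 1, 1, 12),  # yearly seasonality on monthly data
)
fitted = model.fit(disp=False)

# Forecast the next six months, with confidence intervals.
forecast = fitted.get_forecast(steps=6)
print(forecast.predicted_mean)
print(forecast.conf_int())
```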

Looking at robots.txt, the number of websites imposing full restrictions surged after GPTBot appeared in mid-2023, whereas ToS restrictions grew more steadily and are aimed more at commercial use.

According to the SARIMA model predictions, this trend of increasing restrictions will continue for both robots.txt and ToS.

The chart below shows the proportion of websites that restrict crawlers from specific organizations or companies. OpenAI's crawlers are blocked by far the most, followed by Anthropic, Google, and Common Crawl, whose crawler supplies open-source datasets.

A similar trend can be observed when measured by token count rather than by the number of websites, as the toy example below illustrates.
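To illustrate the difference between the two views, the toy aggregation below summarizes the same hypothetical robots.txt verdicts both as the share of websites blocking an agent and as the share of tokens those websites contribute; the records are invented for the example.

```python
# Toy aggregation over hypothetical records: the same robots.txt verdicts
# summarised as the share of *websites* blocking an agent vs. the share of
# *tokens* those websites contribute to the corpus.
from collections import defaultdict

# (domain, tokens contributed, agents disallowed in robots.txt)
records = [
    ("news-site.com",  9_000_000, {"GPTBot", "CCBot", "anthropic-ai"}),
    ("wiki-like.org",  6_000_000, {"GPTBot"}),
    ("small-blog.net",   200_000, set()),
]

site_hits, token_hits = defaultdict(int), defaultdict(int)
total_tokens = sum(tokens for _, tokens, _ in records)

for _, tokens, blocked in records:
    for agent in blocked:
        site_hits[agent] += 1
        token_hits[agent] += tokens

for agent in sorted(site_hits):
    print(f"{agent:>13}: {site_hits[agent] / len(records):.0%} of sites, "
          f"{token_hits[agent] / total_tokens:.0%} of tokens")
```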

Inconsistent and Ineffective AI Permissions

Websites differ widely in how much access they grant to AI agents from different organizations.

OpenAI, Anthropic, and Common Crawl have the three highest restriction rates, all above 80%, while site owners remain noticeably more tolerant of non-AI crawlers such as the Internet Archive or Google Search.

Robots.txt is mainly used to regulate crawler behavior, while a website's ToS is a legal agreement with its users. The former is mechanical, structured, and readily machine-actionable; the latter can express richer and more nuanced policies.

The two should complement each other, but in practice robots.txt often fails to capture the intent of the ToS, and the two frequently contradict each other (Figure 3).
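As a toy illustration of such mismatches, the sketch below cross-checks a robots.txt verdict against a manually annotated ToS stance for a few hypothetical domains; the domain names and labels are invented for the example and do not follow the paper's exact annotation scheme.

```python
# Toy cross-check with invented labels: compare what robots.txt tells an AI
# crawler with what the site's ToS says, and note where the two diverge.
robots_verdicts = {            # does robots.txt block the AI agent?
    "news-site.com":  "blocked",
    "wiki-like.org":  "allowed",
    "small-blog.net": "allowed",
}
tos_policies = {               # manually annotated ToS stance (hypothetical)
    "news-site.com":  "no restriction stated",
    "wiki-like.org":  "prohibits AI training",
    "small-blog.net": "no restriction stated",
}

for domain, robots in robots_verdicts.items():
    tos = tos_policies[domain]
    if robots == "allowed" and tos == "prohibits AI training":
        note = "ToS forbids what robots.txt permits"
    elif robots == "blocked" and tos == "no restriction stated":
        note = "robots.txt stricter than the ToS"
    else:
        note = "consistent"
    print(f"{domain:>15}: robots.txt={robots:<8} ToS='{tos}' -> {note}")
```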

Mismatch Between Real-world Use Cases and Web Data

The paper compares web content with the distribution of questions in the WildChat dataset, a recently collected corpus of roughly 1M real user conversations with ChatGPT.

As Figure 4 shows, the two distributions differ sharply. News and encyclopedia content, which accounts for the largest share of web data, is almost negligible in user queries, while the creative and fictional writing that users frequently request is rarely found in web data.
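The kind of mismatch Figure 4 describes can be made concrete with a toy comparison of category shares; the numbers below are invented for illustration, not figures from the paper.

```python
# Invented category shares to make the mismatch concrete (not the paper's
# numbers): web data over-represents news/encyclopedias, while user queries
# lean heavily toward creative writing and code.
web_share  = {"news": 0.35, "encyclopedia": 0.20, "fiction": 0.02, "code": 0.05}
user_share = {"news": 0.03, "encyclopedia": 0.02, "fiction": 0.30, "code": 0.15}

print(f"{'category':<14}{'web data':>10}{'user queries':>14}{'gap':>8}")
for category in web_share:
    gap = user_share[category] - web_share[category]
    print(f"{category:<14}{web_share[category]:>10.0%}"
          f"{user_share[category]:>14.0%}{gap:>+8.0%}")
```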