AI Data Controversy: Why Do Silicon Valley Leaders Support Companies "Borrowing" Information?

AI companies are facing a data shortage: by some estimates, all high-quality text data on the internet will be exhausted by 2028. This has become a hot topic in the AI industry recently, and obtaining more data now troubles AI companies as much as obtaining more computing power. Against this backdrop, former Google CEO Eric Schmidt made a shocking statement during a speech at Stanford University on August 14th. He suggested that AI startups could first steal intellectual property using AI tools and then hire lawyers to deal with the legal disputes.

Eric Schmidt used the controversial TikTok as an example: "If TikTok is banned, I suggest each of you make a copy of TikTok, steal all the users, steal all the music, put the preferences in, make this program in the next 30 seconds, and release it." He further explained, "If you're a Silicon Valley entrepreneur, what you'd do is if the product takes off, then hire a bunch of lawyers to clean up the mess, but if no one uses your product, it doesn't matter even if you stole all the content."

Coming from the former CEO of Google, Eric Schmidt's prescription is indeed quite characteristic of the "Silicon Valley spirit." Just a few weeks earlier, The Economist magazine pointed out in an article titled "AI companies will soon run out of most internet data" that all high-quality text data on the internet will be exhausted by 2028, and that machine learning datasets may run out of "high-quality language data" as early as 2026.

The industry previously considered synthetic data an effective solution. Since human-generated data can't keep up with the needs of AI model iterations, why not use AI-generated data directly? However, a paper published in Nature at the end of July found that training large models on AI-generated datasets contaminates their output and leads to the problem of "model collapse." With this paper out, AI companies will inevitably be more cautious about using synthetic data.

Open-source datasets like Common Crawl and The Pile corpus have already nurtured many large models, from well-known ones such as GPT-4 and Gemini to obscure ones. The current situation is that free, open-source, quality-assured datasets have been nearly exhausted, while paid data is readily available: X, Reddit, and various news outlets are apparently very willing to sell their data.

Around the same time that Eric Schmidt suggested AI startups steal data, Nature reported another piece of big news: a large group of academic publishers, represented by Taylor & Francis and Wiley, have already provided paid access to their papers to companies like Microsoft, allowing them to use the scientific papers to train large models. The problem is that AI startups, reluctant to spend a penny more than necessary, are often unwilling to pay for data.

For an AI startup, operating costs consist mainly of computing power, human resources, and data. Until AGI is truly realized, hiring AI scientists and programmers to train AI remains essential, and purchasing computing cards from NVIDIA is also an unavoidable expense, since AI startups can't steal chips from TSMC's factories. Data is the one cost left to cut. Seen this way, Eric Schmidt's suggestion that AI startups first steal data and then use lawyers to solve the problems shows that he was indeed an important driver of Google's growth into a tech giant, and that he is a true Silicon Valley figure.

There's a classic saying in Silicon Valley, "Fake it until you make it." From Steve Jobs founding Apple in the last century to Zuckerberg creating the social network, to Musk establishing Tesla, generation after generation of Silicon Valley people have built their enormous businesses under the guidance of this motto.

Boast about the idea first, sell investors a good story, attract capital and talent, then work hard to catch up with the stated goals and finally achieve them: this is the secret recipe of Silicon Valley entrepreneurs. Exaggerating the future, covering up failures, fabricating data, and ignoring common sense are commonplace in Silicon Valley. For example, the "pirate spirit" that Jobs used to mention is about focusing on goals, using any means necessary, breaking conventions, and even throwing morality aside.

Currently, the biggest challenge for AI entrepreneurs is survival. As the AI investment boom recedes and talk of an AI bubble grows, investors have not only cooled on AI startups but become increasingly cautious, making it more difficult for these startups to obtain financing. In this situation, only startups that can produce better-performing large models can raise the funds they need to stay alive.

If they don't break conventions and instead keep to the usual path, they will be overtaken by competitors who dare to take the road less traveled. So Eric Schmidt's words are "golden advice" for AI startups. If the product fails, the company will naturally have to close down, and no one will seek compensation for infringement; but once it takes off, a company with money can use "plea bargaining" to solve its problems.

In fact, before Eric Schmidt made these shocking statements, many AI startups were already practicing the approach of "stealing" data. The "chaotic times" have already arrived, and Eric Schmidt, as a Silicon Valley tycoon, is now publicly acknowledging this reality. After all, it's almost inevitable that AI startups with an endless thirst for data will use technical means to break through the defenses of data owners, who in turn will build "fortresses."

Turbulent times have actually already arrived; this time it's just a public acknowledgment of this fact.