NVIDIA AI Data Collection Controversy: Company Denies Violating Copyright Law

The company reportedly aims to generate, every day, as much training data as the visual information a person receives in an entire lifetime.

According to leaked internal documents, NVIDIA is developing a video AI model codenamed Cosmos, led by research VP Ming-Yu Liu. The project aims to build a state-of-the-art video foundation model that combines simulation of light transport, physics and intelligence for a range of downstream applications.

Leaked emails reveal NVIDIA's goal of creating a "video data factory" that can produce, each day, training data equivalent to a human lifetime of visual experience. The company is allegedly scraping large amounts of data without authorization from sources such as YouTube and Netflix to train the model.

NVIDIA employees are said to download videos with tools such as yt-dlp, running them on virtual machines to avoid detection. When asked for comment, NVIDIA said its practices are legal and comply with copyright law, stating that copyright does not protect facts, ideas or information that can be freely learned from other sources.

However, YouTube's CEO has previously stated that using YouTube videos to train AI models such as OpenAI's Sora would violate the platform's terms of service. Netflix also said it has no content extraction agreement with NVIDIA and that its terms prohibit scraping.

The report comes as YouTube creators are pursuing a class-action lawsuit against OpenAI for allegedly using millions of YouTube videos to train AI models without permission or compensation.

Although the practice is controversial, high-quality training data from original internet sources has proven valuable for AI model development. Recent research suggests that models trained on early internet data may have advantages over those trained on later, AI-generated data.

The ethics and legality of scraping online data for AI training remain contentious issues in the industry.