A new investigation says major tech firms scraped more than 15.8 million YouTube videos to train AI models. The report names Meta, Microsoft, and Nvidia, and ties the data to over 2 million channels. It highlights rising anger from creators and growing legal fights over consent and copyright. According to WinBuzzer, the findings come from The Atlantic’s reporting.
Scale, consent, and YouTube’s position
The investigation points to at least 13 datasets used by Amazon, ByteDance, Snap, Tencent, and others. It follows earlier reports about scraping by Apple and Anthropic. WinBuzzer writes that the mass downloads violate YouTube’s terms of service.
Creators describe shock and loss of control. Woodworker Jon Peters asked if he should quit or keep making videos. He worries that AI tools trained on his work may replace him.
YouTube tools and a conflict of interest
YouTube has added steps that stress creator choice and transparency. In September 2024 it began enhancing Content ID to detect AI faces and voices. In October it added a “captured with a camera” label for authentic footage. In December 2024 it launched a setting to opt in to third-party AI training, which is off by default.
WinBuzzer notes these moves do not solve a core tension. Google continues to use YouTube content to train its own models, including Veo 3. The platform must balance creator interests with its parent company’s AI plans.
Lawsuits test “fair use” and industry tactics
Creators are suing. David Millette filed actions against Nvidia and OpenAI for unjust enrichment and unfair competition. Large rights holders are also in court. Disney and Universal sued Midjourney, alleging its models rely on stolen intellectual property.
The biggest test centers on Anthropic. WinBuzzer reports a record $1.5 billion settlement with book authors is at risk. U.S. District Judge William Alsup called the proposal “nowhere close to complete.” He said training may be transformative, but using pirated books from shadow libraries was an “original sin.”
These cases could shape rules for data use in AI training. Companies race to build video models, and need vast, high-quality data. Google rolled out Veo 3 with synchronized audio in paid tiers. Microsoft offered access to OpenAI’s Sora for free. Meta moved to license Midjourney’s image and video tech.
WinBuzzer frames this as an AI arms race powered by creator content. The legal and business stakes keep rising as courts probe where transformation ends and theft begins.