OpenAI's Controversial YouTube Scraping

AI-created, human-edited.

The relentless hunger of artificial intelligence models for training data has been an open secret in the industry. But a recent New York Times report revealing that OpenAI secretly transcribed and ingested over a million hours of YouTube videos to train GPT-4 has reignited the debate around data rights and AI development. On this week's episode of This Week in Tech, host Leo Laporte and guests Lisa Schmeiser, Mikah Sargent, and Harry McCracken dug into the controversy, exchanging perspectives that highlighted the complex tensions between driving AI progress and protecting content creators.

At the center of the debate is whether scraping copyrighted videos constitutes fair use or outright theft. Actress and author Justine Bateman pulled no punches, calling it "the largest theft in the United States period." Her position underscores growing concerns from artists and authors over having their work commercially exploited without consent or compensation by powerful tech giants.

However, Laporte took a decidedly contrarian stance, arguing that "the last thing you want is an AI that's trained only on public domain information." In his view, restricting AIs from fully ingesting the world's knowledge could hamper their transformative potential for solving major challenges like curing diseases or combating climate change.

Schmeiser pushed back, asserting that true breakthroughs will come not from AIs alone, but from large language models working in symbiosis with human researchers, amplifying their work by identifying patterns in massive datasets. She also warned of the risk of iterative "data decay" if models are trained solely on each other's outputs.

The panel seemed to find common ground on the need for balanced solutions that give AIs access to high-quality training data while still respecting creator rights – perhaps through licensing models or other frameworks to compensate rights holders. There was also a shared desire to see truly open-source AI efforts to prevent any single entity from monopolizing the technology.

As the era of generative AI takes flight, this debate will only intensify. OpenAI's alleged transgressions may have turbocharged a much-needed dialogue on how to govern the ethical acquisition of training data, one of AI's most precious and potent resources.
