MIT Technology Review
Web Credibility Rating
4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: MIT Technology Review
Relevant to AI governance discussions around data rights, training data provenance, and the legal constraints that may shape future AI development trajectories and compute/data scaling dynamics.
Metadata
Importance: 42/100 | Type: news article | Tags: news
Summary
This MIT Technology Review article examines how OpenAI's aggressive data collection practices for training large language models are creating legal and ethical problems, including copyright disputes and questions about consent. It explores the tension between the massive data needs of frontier AI systems and emerging regulatory and legal constraints on data use.
Key Points
- OpenAI and other AI developers scraped vast amounts of internet data to train models like GPT-4, raising copyright and consent concerns.
- Multiple lawsuits and regulatory investigations are emerging over unauthorized use of copyrighted content in AI training datasets.
- The data scarcity problem may become a bottleneck for future AI development as legal restrictions tighten.
- The article highlights a growing conflict between AI capabilities development and intellectual property law.
- European and other regulators are scrutinizing whether large-scale data scraping violates privacy and copyright laws.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Development Racing Dynamics | Risk | 72.0 |
Cached Content Preview
HTTP 200 | Fetched Mar 20, 2026 | 12 KB
OpenAI has just over a week to comply with European data protection laws following a temporary ban in Italy and a slew of investigations in other EU countries. If it fails, it could face hefty fines, be forced to delete data, or even be banned.
But experts have told MIT Technology Review that it will be next to impossible for OpenAI to comply with the rules. That’s because of the way data used to train its AI models has been collected: by hoovering up content off the internet.
In AI development, the dominant paradigm is that the more training data, the better. OpenAI’s GPT-2 model had a data set consisting of 40 gigabytes of text. GPT-3, which ChatGPT is based on, was trained on 570 GB of data. OpenAI has not shared how big the data set for its latest model, GPT-4, is.
But that hunger for larger models is now coming back to bite the company. In the past few weeks, several Western data protection authorities have started investigations into how OpenAI collects and processes the data powering ChatGPT. They believe it has scraped people’s personal data, such as names or email addresses, and used it without their consent.
The Italian authority has blocked the use of ChatGPT as a precautionary measure, and French, German, Irish, and Canadian data regulators are also investigating how the OpenAI system collects and uses data. The European Data Protection Board, the umbrella organization for data protection authorities, is also setting up an [EU-wide task force](https://edpb.europa.eu/news/news/2023/edpb-resolves-dispute-transfers-meta-and-creates-task-force-chat-gpt_en) to coordinate investigations and enforcement around ChatGPT.
Italy has given OpenAI [until April 30](https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/9874751#english) to comply with the law. This would mean OpenAI would have to ask people for consent to have their data scraped, or prove that it has a “legitimate interest” in collecting it. OpenAI will also have to explain to people how ChatGPT uses their data and give them the power to correct any mistakes about them that the chatbot spits out, to have their data erased if they want, and to object to letting the computer program use it.
If OpenAI cannot convince the authorities its data use practices are legal, it could be banned in specific countries or even the entire European Union. It could also face hefty fines and might even be forced to delete models and the data used to train them, says Alexis Leautier, an AI expert at the French data protection agency CNIL.
OpenAI’s violations are so flagrant that it’s likely that this case will end up in the Court of Justice of the European Union, the EU’s highest court, says Lilian Edwards, an internet law professor at Newcastle University. It could take years before we see an answer to the questions posed by the Italian data regulator.
... (truncated, 12 KB total)
Resource ID: a4839ede7cd91713 | Stable ID: YjUyMDAwNz