MIT Technology Review
Web Credibility Rating
4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: MIT Technology Review
Relevant to AI governance discussions around data rights, training data provenance, and the legal constraints that may shape future AI development trajectories and compute/data scaling dynamics.
Metadata
Importance: 42/100 | Type: news article | Tags: news
Summary
This MIT Technology Review article examines how OpenAI's aggressive data collection practices for training large language models are creating legal and ethical problems, including copyright disputes and questions about consent. It explores the tension between the massive data needs of frontier AI systems and emerging regulatory and legal constraints on data use.
Key Points
- OpenAI and other AI developers scraped vast amounts of internet data to train models like GPT-4, raising copyright and consent concerns.
- Multiple lawsuits and regulatory investigations are emerging over unauthorized use of copyrighted content in AI training datasets.
- The data scarcity problem may become a bottleneck for future AI development as legal restrictions tighten.
- The article highlights a growing conflict between AI capabilities development and intellectual property law.
- European and other regulators are scrutinizing whether large-scale data scraping violates privacy and copyright laws.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Development Racing Dynamics | Risk | 72.0 |
Cached Content Preview
HTTP 200 | Fetched Mar 20, 2026 | 12 KB
OpenAI has just over a week to comply with European data protection laws following a temporary ban in Italy and a slew of investigations in other EU countries. If it fails, it could face hefty fines, be forced to delete data, or even be banned.
But experts have told MIT Technology Review that it will be next to impossible for OpenAI to comply with the rules. That’s because of the way data used to train its AI models has been collected: by hoovering up content off the internet.
In AI development, the dominant paradigm is that the more training data, the better. OpenAI’s GPT-2 model had a data set consisting of 40 gigabytes of text. GPT-3, which ChatGPT is based on, was trained on 570 GB of data. OpenAI has not shared how big the data set for its latest model, GPT-4, is.
But that hunger for larger models is now coming back to bite the company. In the past few weeks, several Western data protection authorities have started investigations into how OpenAI collects and processes the data powering ChatGPT. They believe it has scraped people’s personal data, such as names or email addresses, and used it without their consent.
The Italian authority has blocked the use of ChatGPT as a precautionary measure, and French, German, Irish, and Canadian data regulators are also investigating how the OpenAI system collects and uses data. The European Data Protection Board, the umbrella organization for data protection authorities, is also setting up an [EU-wide task force](https://edpb.europa.eu/news/news/2023/edpb-resolves-dispute-transfers-meta-and-creates-task-force-chat-gpt_en) to coordinate investigations and enforcement around ChatGPT.
Italy has given OpenAI [until April 30](https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/9874751#english) to comply with the law. This would mean OpenAI would have to ask people for consent to have their data scraped, or prove that it has a “legitimate interest” in collecting it. OpenAI will also have to explain to people how ChatGPT uses their data and give them the power to correct any mistakes about them that the chatbot spits out, to have their data erased if they want, and to object to letting the computer program use it.
If OpenAI cannot convince the authorities its data use practices are legal, it could be banned in specific countries or even the entire European Union. It could also face hefty fines and might even be forced to delete models and the data used to train them, says Alexis Leautier, an AI expert at the French data protection agency CNIL.
OpenAI’s violations are so flagrant that it’s likely that this case will end up in the Court of Justice of the European Union, the EU’s highest court, says Lilian Edwards, an internet law professor at Newcastle University. It could take years before we see an answer to the questions posed by the Italian data regulator.
... (truncated, 12 KB total)
Resource ID: a4839ede7cd91713 | Stable ID: YjUyMDAwNz