From Fair Use to Data Moats: The Shifting Value of Data in AI
Why yesterday’s “scrapable web” is today’s tightly guarded gold—and what it means for the next AI frontier
In the last few years, AI research labs and startups have used internet-scale data to train large language models (LLMs). This led to the creation of large foundation models like GPT-3, Llama, and Claude, all trained on a mix of web-scraped content, copyrighted material, and publicly licensed datasets.
The journey can be seen in three phases.
Phase 1: Open Data Gold Rush
Companies trained on anything they could scrape, overlooking legal ambiguity in favor of capability gains. Data was treated as a public good, and companies operated under the doctrine of fair use, relying on ambiguous (or, some would say, controversial) interpretations.
This phase has produced many notable lawsuits in which AI training data practices and fair use claims are being contested.
Kadrey et al. v. Meta Platforms: the allegation is that Meta used 81.7 TB of pirated books from LibGen, a known repository of copyrighted works, to train its Llama models
New York Times v. OpenAI & Microsoft: the Times alleges unauthorized use of its articles to train ChatGPT, creating a competing information source
Disney & Universal v. Midjourney: the allegation is that Midjourney scraped copyrighted characters (e.g., Spider-Man, Shrek) for AI image generation. Damages sought run up to $150,000 per infringed work, potentially totaling billions
Kelly McKernan et al. v. Midjourney: artists claim Midjourney replicates their styles without consent, profiting from unpaid labor
Ghibli-style images by OpenAI: while there is currently no lawsuit filed by Studio Ghibli against OpenAI, the possibility of legal action has been widely discussed due to the viral trend of generating “Ghibli-style” images using OpenAI's tools
In my prior article The Data Dilemma, I estimated the cost of (copyrighted) data in LLM training, taking Meta's LibGen corpus as an example. If these data costs were actually accounted for, the training cost economics of foundation model companies would be far more challenging, and profitability timelines would be severely impacted.
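To give a flavor of that arithmetic, here is a back-of-envelope sketch in Python. Only the corpus size comes from the Kadrey allegation; the average book size and per-book license fee are hypothetical placeholders, not the estimates from The Data Dilemma.

```python
# Back-of-envelope licensing cost for a LibGen-scale corpus.
# Only CORPUS_TB comes from the lawsuit; the other inputs are
# hypothetical placeholders for illustration.

CORPUS_TB = 81.7          # size alleged in Kadrey et al. v. Meta
AVG_BOOK_MB = 5.0         # hypothetical: average file size per book
FEE_PER_BOOK_USD = 1_000  # hypothetical: negotiated license fee

num_books = CORPUS_TB * 1_000_000 / AVG_BOOK_MB  # TB -> MB
total_cost = num_books * FEE_PER_BOOK_USD

print(f"Implied number of books: {num_books:,.0f}")        # ~16 million
print(f"Hypothetical licensing bill: ${total_cost:,.0f}")  # ~$16 billion
```

Even with these rough placeholder assumptions, the bill lands in the billions, which is why honestly accounting for data costs changes the economics so dramatically.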
Phase 2: Data as Moat
But now, as models mature and high-quality human-generated data dries up (I discussed this data saturation in my article AGI meets Data Wall), differentiation is becoming harder, and companies are flipping the narrative: the very data they claimed was freely usable is now being ringfenced as a proprietary moat.
Data is now being treated as proprietary capital. Data quality, exclusivity, and proprietary access are core to market power.
Some recent examples of this data lockdown include:
Meta is investing ~$15B in Scale AI to get exclusive access to high-quality, labeled datasets. This suggests Meta now treats clean, structured, high-signal data as a strategic advantage rather than a shared resource
Salesforce has implemented new restrictions that block rival AI companies from storing or indexing Slack messages, even with customer permission. Third-party AI apps can only access Slack data temporarily and must delete it after use, making it impossible for competitors to train models or build long-term analytics on Slack data
OpenAI’s recent deals with Reddit, News Corp, and Stack Overflow are all about exclusive access to the kind of high-quality data that was previously simply “scraped”
Phase 3: Regulatory & Legal Reckoning?
Lawsuits from publishers, creators, and governments are pushing the ecosystem toward licensed or synthetic data. But even synthetic data often requires real, human-created content for grounding. Moreover, as I discussed in my prior article on AI citation issues, synthetic data could compound the difficulty of accurately attributing and citing original sources, further complicating data citation and legal issues going forward.
A few more legal precedents:
Thomson Reuters v. Ross sets a precedent: the court found that training on copyrighted works without a license was not fair use, especially when the use was commercial and directly competed with the copyright holder
The Anthropic (Claude) settlement with music publishers is a notable recent example of a generative AI company agreeing to restrict training on copyrighted content
International lawsuits are growing, such as Indian publishers suing OpenAI for using their literary works without permission
The P.M. v. OpenAI lawsuit also alleges privacy and property rights violations from scraping personal data for training
Implications for the market: A Balkanized Data Future?
Rising barriers: Barriers to entry are rising. New entrants will find it hard to access Reddit, Slack, YouTube, or book corpora, so only a few players can afford to train competitive LLMs. Startups will struggle to match model performance unless they can pay for licenses, create their own data pipelines, or partner with data-rich platforms
Vertical integration: Foundation model providers are turning into platforms. They want vertical integration and control over the model, the data, the interface, and even the use cases (e.g., agents, tools)
Regulatory lag / inconsistency: Lack of consistent legal frameworks creates uncertainty for builders and new entrants. Legal clarity may not arrive fast enough. Until regulation catches up, companies will continue both exploiting and protecting data in ways that suit their business models
Federated data: There may be increased interest in federated domain-specific data (e.g., science, health), but general web-scale data may become locked down
Conflicting narratives: Companies want fair use when building, but exclusivity when defending
The Common Pile v0.1
The Common Pile v0.1 represents a major effort to address growing copyright controversies in LLM training: an 8 TB dataset of text built entirely from public domain and openly licensed sources. Developed by researchers including EleutherAI collaborators, it aims to provide a legally compliant alternative to datasets containing unlicensed copyrighted material.
While this is a commendable effort, 8 TB, though large, may not match the scale or semantic diversity of copyrighted internet data. So far, no top-tier foundation model has emerged solely from Common Pile–type sources, though academic efforts are ongoing.
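For readers who want to explore the corpus, it is distributed through the Hugging Face hub. Below is a minimal sketch using the datasets library; the repository id and the “text” field are assumptions for illustration, so check the hub's common-pile organization for the exact dataset names.

```python
# Minimal sketch: streaming a slice of an openly licensed corpus from the
# Hugging Face hub. The repo id below is illustrative and unverified; see
# the "common-pile" organization on the hub for actual dataset names.
from datasets import load_dataset

ds = load_dataset(
    "common-pile/stackexchange",  # hypothetical repo id
    split="train",
    streaming=True,  # stream records instead of downloading the full corpus
)

# Peek at the first few records; a "text" field is assumed, as is
# typical for raw-text pretraining corpora.
for i, record in enumerate(ds):
    print(record["text"][:200])
    if i >= 2:
        break
```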
Conclusion
The very web data that helped create the modern LLM revolution is now being walled off. Data has gone from a wide-open resource to a core strategic moat. Proprietary access to data, not just models or compute, is now the battleground.
But what does this mean for innovation? Smaller players will find it harder to break through.
If everyone builds behind closed doors, who gets left out?
Who decides who gets to shape the future?
And as those whose data powered this revolution are excluded from its economic upside, many may turn to the courts to claim their share.