The Data Dilemma: Navigating the Economics and Complexities of AI Training
Exploring the Intersection of Data Privacy, Cost, and Innovation in the Age of LLMs
The recent lawsuit against Meta for using copyrighted data to train its AI models puts a spotlight on how we should think about our data in the age of AI and LLMs.
Meta allegedly downloaded 81.7TB of pirated books from LibGen, a known repository of copyrighted works, to train its Llama AI models. I do not want to focus on the legality and merits of the case, or the "fair use" doctrine that Meta is claiming. Rather, it got me thinking about the economics of LLMs and the implications from a data security perspective.
Training cost economics
To think about the economics of training data, I asked AI to help me put together a cursory cost estimate for 81.7TB of data (a back-of-the-envelope calculation, also sketched in code after the estimate below).
Average cost of a copyrighted academic/professional book: $100 (the range is roughly $50-$150, so $100 is a reasonable midpoint)
Average size of a digital book: 2MB (typical range is 1MB to 5MB)
1TB would therefore contain approx. 500,000 books (1,000,000MB ÷ 2MB)
Total number of books in 81.7TB: 81.7 x 500,000 = 40,850,000 books
Cost of the data: 40,850,000 x $100 = approx. $4.08B
Recurring costs such as usage-based licensing, outcome-based royalties, etc. would increase the data costs further
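Here is the same arithmetic in a few lines of Python, so the assumptions are easy to tweak. All figures are the rough averages above, not actual licensing data:

```python
# Back-of-the-envelope estimate using the assumptions stated above.
avg_book_price_usd = 100   # midpoint of the $50-$150 range
avg_book_size_mb = 2       # typical digital book is 1-5MB
corpus_tb = 81.7           # size of the alleged LibGen download

books_per_tb = 1_000_000 / avg_book_size_mb   # 1TB = 1,000,000MB -> 500,000 books
total_books = corpus_tb * books_per_tb        # ~40.85 million books
total_cost_usd = total_books * avg_book_price_usd

print(f"{total_books:,.0f} books, roughly ${total_cost_usd:,.0f} at list price")
# -> 40,850,000 books, roughly $4,085,000,000 at list price
```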
If we assume all foundational models are trained on a similar corpus of data from the internet, we may not be too far off the mark. In this context, it was interesting to note that data accounted for $500M of OpenAI's projected 2024 costs. That is smaller than the compute costs of $3B (training) + $2B (inference), and significantly lower than the estimated value of the copyrighted data calculated above. OpenAI expected to lose about $5B after operational costs in 2024. Without speculating on specific training practices, it is worth noting that if full market-rate data costs were factored in for any AI company, the break-even point and profitability timeline could shift significantly further into the future than currently projected.
Enter Distillation
Distillation has been a hot topic since DeepSeek claimed to have trained its model on a shoestring budget of $6M, an order of magnitude lower than the training costs of rival foundational models.
Distillation techniques have a significant impact on LLM training and its associated costs (a minimal training-loss sketch follows the points below).
Distillation can significantly reduce the corpus size, and therefore the cost, of the training data needed to train LLMs
It may reduce the need for copyrighted material for the smaller models, since distillation transfers knowledge (the teacher's output distributions) rather than the exact data
However, the larger original teacher models may still (need to) be trained on copyrighted data.
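To make the "knowledge vs. exact data" point concrete, here is a minimal sketch of the classic distillation loss, assuming PyTorch; the function name and toy shapes are illustrative, not any particular lab's implementation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both output distributions with a temperature (Hinton et al., 2015):
    # the student learns from the teacher's probability distributions,
    # never from the teacher's original training documents.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 examples over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)                      # frozen teacher outputs
student_logits = torch.randn(4, 10, requires_grad=True)  # trainable student
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow into the student only
```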
Implications for enterprises
Enterprises would need to grapple with:
Data privacy concerns: inadvertent exposure of sensitive information, PII, etc. (see the redaction sketch after this list)
Risk of proprietary information being incorporated into public models
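One common mitigation is to scrub prompts before they leave the enterprise boundary. The sketch below uses a few illustrative regex patterns; a production system would rely on a vetted PII-detection library rather than hand-rolled rules:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace each detected PII span with a labeled placeholder before the
    # prompt is sent to a public, cloud-hosted model.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-867-5309."))
# -> Reach Jane at [EMAIL] or [PHONE].
```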
AI privacy and security are an ongoing endeavor, and enterprises will adopt new technologies and architectures to address these concerns. I see them embracing private models (or a combination of public and private models) as a practical approach.
Among the many important lessons from DeepSeek, the one relevant in this context is that companies can now train their own private models at much lower cost: start from open-source models, distill from larger foundational models, and fine-tune with their private data (a parameter-efficient fine-tuning sketch follows).
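For the fine-tuning leg, parameter-efficient methods such as LoRA keep costs low by training small adapter matrices instead of all of the base model's weights. A minimal sketch, assuming the Hugging Face transformers and peft libraries; the checkpoint name is a placeholder, not a real model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint name; substitute any open-source causal LM.
base = AutoModelForCausalLM.from_pretrained("open-model-7b")
tokenizer = AutoTokenizer.from_pretrained("open-model-7b")

# LoRA injects low-rank adapters into the attention projections, so only a
# tiny fraction of parameters is trained on the enterprise's private data.
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```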
Recent developments in AI research have shown significant progress in replicating and deploying LLMs at much lower costs and on smaller devices (an on-device sketch follows the examples below).
Researchers at UC Berkeley, led by PhD candidate Jiayi Pan, claim to have reproduced the core technology of DeepSeek R1-Zero for just $30, using a 3B-parameter model that acquired self-verification and search capabilities through reinforcement learning
Several efforts have demonstrated the feasibility of running LLMs on devices as small as a Raspberry Pi
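On-device inference is now roughly this simple. A minimal sketch, assuming the llama-cpp-python bindings and a quantized GGUF model small enough for a Raspberry Pi's memory; the model path is illustrative:

```python
from llama_cpp import Llama

# A 4-bit quantized ~1B-parameter model fits comfortably in a Pi's RAM.
llm = Llama(model_path="models/tinyllama-1.1b-q4.gguf", n_ctx=512)

out = llm("Summarize the key risks of training on copyrighted data:",
          max_tokens=128)
print(out["choices"][0]["text"])
```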
Cloud-hosted models offer economies of scale, and maintaining private infrastructure has historically been prohibitively expensive for many organizations. These developments, however, indicate a shift towards more accessible and private AI deployments, challenging the dominance of large-scale, cloud-based AI services.
Conclusion
While the concerns around copyright, data privacy, and model control are significant, they are unlikely to lead to a wholesale shift to privately hosted models. Instead, enterprises may use a hybrid approach: cloud-hosted public LLMs for general tasks, private models fine-tuned on sensitive data for domain-specific expertise, and a layer that synthesizes results from both for the task at hand (a minimal routing sketch follows). In addition, cloud providers will likely develop more robust privacy and security features to address enterprise concerns going forward.
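In code, the hybrid pattern reduces to a routing decision. Everything below is an illustrative skeleton; the function names and the sensitivity flag are assumptions, not any vendor's API:

```python
def query_private_model(prompt: str) -> str:
    # Placeholder: call the locally hosted, fine-tuned model here.
    return f"[private model] {prompt}"

def query_cloud_model(prompt: str) -> str:
    # Placeholder: call a cloud-hosted public LLM API here.
    return f"[cloud model] {prompt}"

def route(prompt: str, contains_sensitive_data: bool) -> str:
    # Sensitive prompts stay inside the enterprise boundary;
    # everything else benefits from the cloud's economies of scale.
    if contains_sensitive_data:
        return query_private_model(prompt)
    return query_cloud_model(prompt)

print(route("Draft a press release about our product launch", False))
print(route("Summarize this internal incident report", True))
```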
The use of copyrighted data has potential long-term impacts on authors and the publishing industry. The debate about the legality, “fair use” doctrine, and ethics of using copyrighted material for AI training will remain part of the discourse, and regulations may well arise from these controversies.