Will the AI Gold Rush Last? Market Outlook for Investors

Artificial intelligence systems such as ChatGPT may soon run into a critical shortage of the very material that has driven their rapid improvement: the enormous volume of human-written text published online. A recent study by the research group Epoch AI warns that technology companies could exhaust the supply of publicly available text data used to train large language models sometime between 2026 and 2032.

The report compares the surge of data collection to a literal gold rush that mines finite resources. As the easily accessible reserves of human-generated writing shrink, AI developers face a growing challenge to sustain the same pace of progress without tapping into new or restricted sources of content.

AI companies race to secure high-quality training data

In the near term, major AI firms are racing to secure, and in some cases pay for, dependable sources of text. Companies developing large language models have struck deals for steady streams of user-generated content and journalism to keep feeding their training pipelines. Those negotiations reflect a broader scramble for high-quality, up-to-date material that models can learn from.

Longer-term concerns center on the possibility that new blogs, news articles and social media posts will no longer be produced at a rate sufficient to support continued, straightforward scaling of model size and performance. That pressure could push organizations toward training on sensitive or private materials—such as email or text message archives—or leaning more heavily on automatically generated “synthetic” data produced by AI systems themselves.

“There is a serious bottleneck here,” says Tamay Besiroglu, one of the study’s authors. “If you start hitting constraints on how much data you have, you can’t scale up models efficiently anymore. And scaling up has been one of the most important ways to expand capabilities and improve output quality.”

Epoch AI first published projections along these lines two years ago in a working paper that forecast an earlier cutoff, around 2026. Since then, researchers and engineers have developed techniques to make better use of existing datasets, including methods that sometimes involve "overtraining" on the same sources multiple times. Those advances have pushed back the timeline but have not removed the underlying limits.

When will public training data run out?

After further analysis, the Epoch team now estimates that public text data suitable for training large language models will be depleted within the next two to eight years. Their updated study has undergone peer review and is scheduled to be presented at an international machine learning conference this summer.

Epoch is a nonprofit research institute hosted by Rethink Priorities and supported by donors interested in mitigating the most serious risks of advanced AI. The group’s projections trace back to a key insight: for more than a decade, AI researchers have boosted system capabilities by rapidly expanding two main inputs, computing power and massive collections of internet text.

Epoch’s analysis finds that the volume of text data used to train AI language models has historically grown by a factor of roughly 2.5 per year, while computing resources have grown faster, by a factor of around four per year. As a result, models have been trained on ever-larger token counts; recent model families, for example, have been reported to train on trillions of tokens of text.
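
The compounding arithmetic behind a projection of this kind can be sketched in a few lines of Python. This is a rough illustration, not Epoch's actual model: the function name, the assumption of a constant growth factor, and the token figures in the usage example are all hypothetical placeholders chosen for illustration.

```python
import math


def years_until_exhausted(current_tokens, stock_tokens, growth_per_year=2.5):
    """Years until a training set growing by a constant yearly factor
    reaches a fixed stock of available text.

    Assumes constant exponential growth and a fixed stock, which is a
    simplification; real projections carry wide uncertainty ranges.
    """
    if current_tokens <= 0 or stock_tokens <= 0:
        raise ValueError("token counts must be positive")
    if growth_per_year <= 1:
        raise ValueError("growth factor must be greater than 1")
    if current_tokens >= stock_tokens:
        return 0.0  # the stock is already used up
    # Solve current * growth**t = stock for t.
    return math.log(stock_tokens / current_tokens) / math.log(growth_per_year)


# Hypothetical placeholder figures, not taken from the study:
# a 15-trillion-token training set and a 300-trillion-token stock.
example_years = years_until_exhausted(15e12, 300e12)
print(f"{example_years:.1f} years")
```

With these placeholder figures the stock is 20 times the current training set, so at a growth factor of 2.5 per year it would be reached in a little over three years. Changing either figure shifts the result, which is one reason such estimates are reported as a range rather than a single date.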

Are ever-larger models necessary?

How urgently to worry about the data bottleneck remains a matter of debate. Some researchers emphasize that continual progress need not rely only on training ever-larger, general-purpose models. Specialization—designing models targeted at particular tasks—can improve performance without consuming enormous new volumes of broad, general training data.

Several experts also caution against relying heavily on AI-generated content to fill the gap. Training models on their own outputs risks a phenomenon sometimes described as “model collapse,” in which quality degrades as errors and biases are amplified across generations of synthetic text. One researcher likened it to photocopying a photocopy: each pass loses information and fidelity.

The negative effects of training on synthetic text include the reinforcement of mistakes, prejudices and distortions already present in the information ecosystem. Consequently, many organizations viewed as custodians of valuable human-created data, such as Wikipedia, major forum platforms and publishers, are rethinking how their content is accessed and used.

Wikimedia’s chief product and technology officer has noted the oddity of having “natural resource” conversations about human-created data. While Wikipedia historically placed few restrictions on reuse of its volunteer-written articles, the organization expresses concern about preserving incentives for contributors and about a rising tide of low-quality, automatically generated material that could devalue the web as a training resource.

From the AI developer perspective, Epoch’s study suggests that contracting millions of people to produce the raw text AI needs would be prohibitively expensive and unlikely to scale as a technical solution. That has led some companies to experiment with generating synthetic datasets for subsequent training rounds. But even industry leaders acknowledge limits to that approach and remain cautious about relying solely on synthetic text to advance model capabilities.

Ultimately, the study highlights a tension at the heart of modern AI development: models benefit from vast amounts of human-written data, yet that supply is not infinite. As public data becomes scarcer and the web sees more automatically produced content, developers, platforms and the public will need to reconcile questions of access, compensation, data quality and the long-term sustainability of training resources.
