Rethinking the Data Foundations of AI

A recurring claim from major AI companies is that large-scale models can’t be built without unrestricted access to massive amounts of online content—including copyrighted books, news, academic publications, and creative works.


The argument goes something like this: the only way to train competitive large language models (LLMs) is to scrape everything available, sort it later and invoke fair use as legal cover. OpenAI has stated that training GPT models would be “impossible” without copyrighted material. Meta, Google, and others have made similar assertions, either explicitly or through their training practices.


This perspective has largely shaped the current landscape of AI development. And yet, it’s being challenged in increasingly compelling ways.


One of the most significant examples is the recent release of Common Pile v0.1—a large, openly licensed dataset created specifically for training foundation models. It offers a working counterpoint to the idea that ethical constraints are a barrier to capability.

What Is Common Pile v0.1?

Common Pile v0.1 is a curated 8TB corpus of text, compiled from around 30 sources including open-access research, public government data, Creative Commons–licensed books, permissively licensed code, encyclopaedias, technical documentation and task-oriented corpora. Every component of the dataset is either public domain, CC-licensed, or otherwise legally and ethically usable for model training.
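
For those who want to inspect the corpus directly, it is distributed in machine-readable form. The sketch below is illustrative only: the Hugging Face repository id, split name and "text" field are assumptions that should be checked against the actual release.

```python
# Illustrative sketch of sampling the corpus with the Hugging Face
# `datasets` library, streaming rather than downloading all 8TB.
from datasets import load_dataset

ds = load_dataset(
    "common-pile/comma_v0.1_training_dataset",  # assumed repo id; verify
    split="train",
    streaming=True,  # iterate lazily instead of fetching the full corpus
)

for record in ds.take(3):
    print(record["text"][:200])  # assumed field name for document text
```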



The project emphasizes quality, transparency, and permission. Rather than simply filtering LAION-style web scrapes for CC markers, the team behind Common Pile implemented a detailed pipeline:


  • Language and quality detection

  • Deduplication

  • Toxicity filtering

  • Boilerplate and spam removal

  • License validation per source


In short, it’s a deliberate attempt to build a high-performing training dataset without cutting legal or ethical corners.
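
As a rough illustration of how these stages compose, here is a minimal Python sketch under simplified assumptions: a license allow-list, a word-count threshold as a quality proxy, a keyword blocklist standing in for a toxicity classifier, and exact hash-based deduplication. The real pipeline uses trained classifiers, per-source license audits, language detection and more sophisticated near-duplicate detection.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Document:
    text: str
    source: str
    license: str  # e.g. "CC-BY-4.0" or "public-domain"

# Allow-list for license validation; a real pipeline audits this per source.
PERMITTED_LICENSES = {"public-domain", "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT"}

# Placeholder for a trained toxicity classifier.
BLOCKLIST = {"examplebadword"}

def license_ok(doc: Document) -> bool:
    return doc.license in PERMITTED_LICENSES

def quality_ok(doc: Document) -> bool:
    # Crude quality/boilerplate proxy: require a minimum amount of prose.
    return len(doc.text.split()) >= 50

def toxicity_ok(doc: Document) -> bool:
    return not (set(doc.text.lower().split()) & BLOCKLIST)

def curate(docs: Iterable[Document]) -> Iterator[Document]:
    seen = set()
    for doc in docs:
        if not (license_ok(doc) and quality_ok(doc) and toxicity_ok(doc)):
            continue
        digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact deduplication on document hashes
            continue
        seen.add(digest)
        yield doc
```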

Performance Without Copyright Infringement

To test whether this approach could produce competitive models, the creators trained two LLMs:


  • Comma v0.1-1T: trained on 1 trillion tokens from the Common Pile

  • Comma v0.1-2T: trained on 2 trillion tokens from the same corpus


These are 7-billion parameter models, similar in size to Meta’s LLaMA 1 and 2. According to the release benchmarks, Comma models perform on par with or better than LLaMA models across a range of tasks—including general reasoning, code generation, and knowledge-intensive benchmarks.


This is a noteworthy result. It suggests that a carefully curated, fully licensed dataset can yield models that rival or surpass those trained on unlicensed web content.

In practice, this means we may not need to compromise on data ethics in order to reach high levels of model performance—at least not for 7B-scale LLMs.
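
For readers who want to experiment, models released this way are typically usable through the standard Hugging Face transformers interface. A minimal generation sketch follows; the model identifier is an assumption for illustration and should be verified against the actual release.

```python
# Minimal generation sketch using the standard transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1-2t"  # assumed repo id; verify

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Openly licensed training data can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```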

Why This Matters

Common Pile v0.1 doesn’t just provide a high-quality dataset—it shifts the terms of the debate.


For years, the dominant narrative has been that scraping the internet is a technical necessity. That it’s unfortunate but unavoidable. That there’s no realistic way to build scalable models without crossing ethical and legal boundaries.


Common Pile shows that isn’t true. At minimum, it demonstrates that there are viable alternatives. In doing so, it disrupts the “no alternative” justification that many companies have relied on, particularly when faced with criticism from authors, publishers, artists and journalists whose work was ingested without consent.

Where This Leaves the Broader Ecosystem


For Regulators:

Common Pile sets a precedent. If performant models can be trained without infringing IP, then courts and lawmakers may be less sympathetic to the argument that fair use is required for competitiveness. It also provides a working example of due diligence, which becomes ever more important as major institutions such as the UK Parliament debate AI regulation.


For Creators:

The existence of opt-in datasets like this supports calls for licensing frameworks and compensation models. It also increases the leverage of those who wish to control how their work is used.


For Smaller Labs:

Access to a high-quality, open dataset reduces dependency on opaque web-scraped corpora. It opens the door for more responsible, transparent model development—especially in academia, public interest research and open-source communities.


For Industry:

The release adds pressure to disclose training data sources, especially for models claiming to be "open." It also invites a broader conversation about how we define responsible innovation, beyond performance metrics.

Limits and Considerations

Common Pile v0.1 is not a complete solution. It’s a v0.1 release with limited domain diversity compared to fully scraped corpora. It also doesn’t solve the challenges facing generative models in other modalities—image, video, audio—where licensed data is even harder to come by.


Additionally, the models trained on Common Pile have yet to be tested at the largest parameter scales (e.g. 30B, 65B, 100B+). It’s still an open question whether this approach will scale without hitting new limitations in expressiveness or generalisation.

But these are constraints of implementation, not principle.


The core idea holds: ethical data pipelines are not inherently incompatible with high performance.


Broader Implications


Image: Open mining in Zambia. Source: Ecohubmap

This release is part of a slow but significant shift in how AI research treats the concept of data governance. Instead of viewing the web as raw material to be mined, Common Pile treats it as a commons to be curated and respected. This reframing aligns more closely with how we think about sustainability, consent and collective infrastructure in other domains—like climate, urban planning or public health.

It also introduces the possibility of AI ecosystems grounded in accountability rather than opacity, where people can trace how models were trained, where the data came from and how those decisions shape outputs and risks.


That kind of ecosystem isn’t just legally safer. It’s also more inclusive. It invites collaboration from people and communities who have, until now, had little control over how their contributions to the internet are being repurposed.


Closing Thought

Common Pile v0.1 isn’t perfect, but it’s timely. It offers a tangible, well-executed example of what’s possible when we align capability with ethics, openness with rigour. At a moment when the stakes of AI development are only growing, that feels like a direction worth paying attention to.
