What Large Language Models Are Actually Trained On: A Comprehensive Audit of LLM Training Data

TL;DR · Key Takeaways

Only a small set of frontier models publish full source-level training data breakdowns. GPT-4, all Claude models, all Gemini models, and all Mistral models do not.
Common Crawl or filtered derivatives constitute 50–80% of pre-training tokens across documented LLM corpora.
The 2025 Bartz v. Anthropic ruling established that training on lawfully purchased books is fair use; pirated acquisition is not. A proposed $1.5 billion settlement has been reported as the largest in a U.S. copyright case — confirm final court approval before citing as settled.
Stanford's Foundation Model Transparency Index shows the industry average fell from 58/100 (May 2024) to 40.69/100 (2025, rounded to 41): transparency is regressing as model capability advances.
Synthetic data is now a dominant share of post-training (Llama 3) and a significant share of pre-training (Phi-4: 40%); model-collapse risk grows correspondingly (Shumailov et al., Nature, 2024).

Evidence note: Training data disclosures for GPT-4, Claude, Gemini, and Mistral models are largely unavailable; figures attributed to these models are drawn from leaked analyses, court filings, or academic estimates. Legal case status (Bartz v. Anthropic settlement, NYT v. OpenAI) should be verified before citing as concluded. Stanford FMTI scores are sourced from the 2025 report. EU AI Act dates: GPAI obligations entered application 2 August 2025; fines enter application 2 August 2026. Last reviewed: May 2026.

The Transparency Asymmetry: More Capable Models, Less Disclosure

The training data behind frontier large language models ranges from 300 billion to 40 trillion tokens. Most developers refuse to disclose what those tokens contain. Only a handful of model families — notably Meta's Llama, EleutherAI's open datasets, and Microsoft's Phi series — provide detailed source-level breakdowns. OpenAI, Anthropic, Google, and Mistral treat data composition as proprietary, a position that has hardened as litigation risk has grown.

Across the industry, Common Crawl web data dominates pre-training corpora at 50–80% of tokens, with code, books, scientific papers, and Wikipedia making up the remainder in varying proportions. Post-training increasingly relies on synthetic data generated by prior-generation models — a recursive pattern with unsettled implications for data provenance and quality degradation over time.

The most striking structural finding from this audit is the inverse relationship between model capability and data transparency. GPT-3 published a complete source breakdown; GPT-4 published nothing. LLaMA 1 named every dataset; Llama 4 incorporated Meta platform data without specifying proportions.

The legal landscape is shifting in parallel. Three developments are forcing an industry reckoning over data sourcing that has been opaque by design: a proposed $1.5 billion settlement in Bartz v. Anthropic (2025), reported as the largest in a U.S. copyright case; ongoing litigation in NYT v. OpenAI; and the EU AI Act's GPAI training data transparency requirements, whose obligations entered application on 2 August 2025 and whose enforcement powers (fines) enter application on 2 August 2026.

Key Definitions

The following terms are used with specific technical meanings throughout this article:

Pre-training corpus: The large-scale dataset used to train a model's base statistical representations of language, prior to any instruction tuning or alignment work.
Token: The unit of text processed by an LLM. A token is roughly 0.75 words in English; 1 trillion tokens represents approximately 750 billion words.
Post-training: The set of procedures applied after pre-training — including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimisation (DPO) — that shape model behaviour.
Synthetic data: Text generated by an AI model rather than written by humans, increasingly used in both pre-training and post-training pipelines.
Upsampling: Repeating higher-quality data sources more often than lower-quality ones over a training run, so that they carry more influence than their raw token count would give them.
Stanford FMTI: The Stanford Foundation Model Transparency Index — the primary systematic benchmark for evaluating AI developer disclosure practices across training, deployment, and governance dimensions.

Pre-Training Data: The Raw Ingredients at Scale

The core of every LLM is its pre-training corpus — the vast collection of text, and increasingly images, audio, and video, that teaches the model statistical patterns of language. Token counts have grown by two orders of magnitude in five years, from GPT-3's 300 billion tokens (2020) to Llama 4 Scout's approximately 40 trillion tokens (2025). This section documents what is publicly known about each major model family's data composition.

40T

Approximate token count for Llama 4 Scout's pre-training corpus, a 133× increase over GPT-3's 300 billion tokens in just five years. Source: Meta, April 2025

GPT-3: The Last OpenAI Model With a Published Training-Mixture Breakdown

GPT-3 remains the only major OpenAI model with a complete, published data breakdown. As documented in Brown et al. (2020), the model trained on 300 billion tokens drawn from a 499-billion-token corpus across five sources, with higher-quality sources upsampled to compensate for their smaller raw volume:

Common Crawl (filtered): 410B raw tokens, 60% training weight, 0.44 epochs
WebText2: 19B raw tokens, 22% training weight, ~3.4 epochs
Books1: 12B raw tokens, 8% training weight, ~2.0 epochs
Books2: 55B raw tokens, 8% training weight, ~0.44 epochs
Wikipedia (English): 3B raw tokens, 3% training weight, ~3.0 epochs

The upsampling strategy is instructive: Wikipedia was seen approximately three times and WebText2 approximately 3.4 times, while the larger but noisier Common Crawl data was sampled less than once. Filtering used a logistic regression classifier trained to distinguish high-quality text from raw crawl data. Fuzzy deduplication was performed at document level using MinHash LSH. The corpus was 93% English.
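The relationship between training weight and epochs can be checked with a few lines of arithmetic. The sketch below is illustrative only: it recomputes implied epochs from the raw token counts and rounded weights listed above, so its output is an approximation (the weights as listed sum to 101%) and may not match the epoch figures in the paper's own table exactly. The function and variable names are not from the paper.

```python
# Raw corpus size (billions of tokens) and training weight, as listed above.
GPT3_SOURCES = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia (English)": (3, 0.03),
}
TRAINING_TOKENS_B = 300  # total tokens seen during training, in billions


def implied_epochs(raw_tokens_b, weight, total_b=TRAINING_TOKENS_B):
    """Passes over a source = tokens drawn from it / tokens it contains."""
    return weight * total_b / raw_tokens_b


if __name__ == "__main__":
    for name, (raw_b, weight) in GPT3_SOURCES.items():
        print(f"{name}: {implied_epochs(raw_b, weight):.2f} implied epochs")
```

A value above 1 means the source was upsampled (seen more than once); a value below 1 means only part of it was ever seen.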

GPT-4 and GPT-4o: Deliberate Opacity

OpenAI's GPT-4 technical report (March 2023) explicitly stated it would contain "no further details about the architecture, hardware, training compute, dataset construction, training method, or similar." This marked a decisive shift toward deliberate non-disclosure. The report confirms only that GPT-4 used "publicly available data and data licensed from third-party providers."

Details from a widely cited SemiAnalysis report (July 2023) — unconfirmed by OpenAI but considered credible by industry analysts — suggest approximately 13 trillion total training tokens (5–6 trillion unique), a Mixture-of-Experts architecture with 1.8 trillion parameters, and code data trained for 4 epochs versus 2 for text. GPT-4o (May 2024) disclosed even less: it is described as an end-to-end multimodal model trained jointly across text, vision, and audio using "public web data, proprietary data from partnerships, and multimodal data." No token counts or source breakdowns exist for either model.

Meta's Llama Family: The Industry Transparency Benchmark

Meta has published the most detailed training data documentation among frontier model developers, making the Llama family the industry's primary reference point for understanding data composition.

LLaMA 1 (February 2023) trained on 1.0–1.4 trillion tokens exclusively from publicly available datasets, with a fully documented breakdown: Common Crawl 67%, C4 15%, GitHub 4.5%, Wikipedia 4.5%, Books (Gutenberg and Books3) 4.5%, ArXiv 2.5%, and StackExchange 2%. Wikipedia and Books were upsampled to approximately 2.2–2.5 epochs.

Llama 2 (July 2023) increased to 2 trillion tokens but omitted source-level breakdowns — a partial retreat from LLaMA 1's transparency. The paper confirmed the corpus was 89.7% English, with factual sources upsampled to reduce hallucinations.

Llama 3/3.1 (2024) scaled to 15.6 trillion tokens with the most sophisticated data curation pipeline yet publicly documented:

General web data: ~50%
Mathematical and reasoning content: ~25%
Code: ~17%
Multilingual content (30+ languages): ~8%

The pipeline introduced three levels of deduplication (URL-level, document-level MinHash, and line-level), model-based quality filtering using a DistilRoberta classifier distilled from Llama 2's quality judgments, and scaling-law experiments on small proxy models to optimise the data mix before full training. Code data increased 4× over Llama 2. Post-training used six iterative rounds of SFT, rejection sampling, and DPO, with approximately 2.7 million synthetic code dialogues and extensive synthetic math and reasoning data.
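To make the document-level MinHash step concrete, here is a minimal sketch of near-duplicate detection. It is not Meta's pipeline: the shingle size, the number of hash functions, the 0.8 similarity threshold, and all function names are assumptions chosen for illustration, and the pairwise comparison would be replaced by LSH banding at real corpus scale.

```python
import hashlib


def shingles(text, k=5):
    """Set of overlapping k-word windows from whitespace-tokenised text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one hash function of the family."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=64, k=5):
    """For each hash function, keep the minimum hash value over all shingles."""
    shingle_set = shingles(text, k)
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_near_duplicates(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any document already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

The signature is much smaller than the document, which is what makes similarity checks affordable across trillions of tokens.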

Llama 4 (April 2025) marked two firsts: Meta's first Mixture-of-Experts model and the first Llama model to incorporate Meta user data (publicly shared posts from Instagram and Facebook, plus interactions with Meta AI). Scout trained on approximately 40 trillion tokens and Maverick on approximately 22 trillion tokens. Both are natively multimodal and were pre-trained on 200 languages, with roughly 10× more multilingual tokens than Llama 3.

Google's Gemini Lineage: Scale Without Specifics

Google has never disclosed token counts or source proportions for any Gemini model. The clearest window into Google's data philosophy comes from predecessor model PaLM (2022): 780 billion tokens comprising social media conversations (~50%), filtered webpages (~27%), books (~13%), code (~5%), Wikipedia (~4%), and news (~1%). PaLM 2 reportedly scaled to 3.6 trillion tokens (per leaked CNBC documents), with a higher proportion of non-English and code data.

For Gemini 1.0 (December 2023), the technical report confirms training on "web documents, books, and code" plus image, audio, and video data, with YouTube transcripts specifically confirmed by The Information. Data mix was determined through ablations on smaller proxy models, with composition altered during training via staged approaches. Gemini 1.5 and 2.0/2.5 progressively added more languages (reaching 400+ for Gemini 2.5) and more code data, but no quantitative breakdowns exist. The open-weight Gemma 3 27B model, sharing Gemini's data infrastructure, trained on 14 trillion tokens — suggesting Gemini 2.x models likely exceed this substantially.

DeepSeek: Technical Detail Without Source Attribution

DeepSeek's technical reports provide token counts and high-level composition without naming specific datasets. DeepSeek-V2 trained on 8.1 trillion tokens, with Chinese tokens approximately 12% more numerous than English. DeepSeek-V3 scaled to 14.8 trillion tokens — described as "plain web pages and e-books, without incorporating any synthetic data" in pre-training — with enhanced ratios of mathematical and programming content and expanded multilingual coverage. The Fill-in-Middle strategy was applied to code at a rate of 0.1.
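A Fill-in-Middle rate of 0.1 means that roughly one training document in ten is rearranged so the model learns to predict a missing middle span from its surrounding context. The sketch below assumes a prefix-suffix-middle layout and character-level cut points; the sentinel strings and the function name are hypothetical placeholders, not DeepSeek's actual special tokens or code.

```python
import random

# Hypothetical sentinel strings, used here only to mark the three segments.
FIM_BEGIN, FIM_HOLE, FIM_END = "<FIM_BEGIN>", "<FIM_HOLE>", "<FIM_END>"


def maybe_apply_fim(document, rng, fim_rate=0.1):
    """With probability fim_rate, rewrite the document in prefix-suffix-middle order."""
    if len(document) < 3 or rng.random() >= fim_rate:
        return document  # left as an ordinary left-to-right example
    # Two distinct interior cut points give non-empty prefix, middle, and suffix.
    i, j = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The middle goes last, so the model generates it after seeing both sides.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(maybe_apply_fim("def add(a, b):\n    return a + b\n", rng, fim_rate=1.0))
```

Keeping the rate low leaves most documents in ordinary left-to-right form, so infilling is learned alongside normal next-token prediction rather than replacing it.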

$5.6M

DeepSeek-V3's reported training cost. The technical report cites 2.664 million H800 GPU hours for pre-training on 14.8T tokens; the broader figure of 2.788 million includes additional training phases. The reported cost is orders of magnitude below estimates for Western frontier models. Source: DeepSeek-V3 Technical Report, 2024
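The headline dollar figure follows from the GPU-hour count once an hourly price is fixed. The sketch below assumes a rental rate of $2 per H800 GPU-hour; that rate is an assumption not stated in this article, and the result covers GPU rental only, not research, data, or staff costs.

```python
PRETRAIN_GPU_HOURS = 2.664e6      # pre-training only, as cited above
TOTAL_GPU_HOURS = 2.788e6         # including the additional training phases
ASSUMED_USD_PER_GPU_HOUR = 2.0    # assumption: rental price per H800 GPU-hour


def training_cost_usd(gpu_hours, usd_per_gpu_hour=ASSUMED_USD_PER_GPU_HOUR):
    """Rental-equivalent cost: GPU hours multiplied by an hourly price."""
    return gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    print(f"Pre-training only: ${training_cost_usd(PRETRAIN_GPU_HOURS) / 1e6:.3f}M")
    print(f"All phases:        ${training_cost_usd(TOTAL_GPU_HOURS) / 1e6:.3f}M")
```

Under that assumed rate, the all-phases total rounds to the $5.6M headline; a different hourly price scales the result linearly.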

DeepSeek-R1 used the same V3 base but applied a novel pure reinforcement learning approach (R1-Zero) before a four-stage post-training pipeline producing approximately 800,000 SFT samples (600K reasoning, 200K non-reasoning).

Mistral and Anthropic: The Opacity Spectrum

Mistral AI has explicitly stated: "We do not disclose the datasets used to train our models." No token counts, source breakdowns, or filtering details exist for Mistral 7B, Mixtral 8x7B, or Mistral Large. The Stanford Foundation Model Transparency Index flagged Mistral's disclosure level as potentially below EU AI Act requirements — GPAI obligations entered application 2 August 2025; enforcement powers (fines) enter application 2 August 2026.

Anthropic is similarly opaque about Claude's pre-training data. The Claude 3 model card describes "a proprietary mix of publicly available information on the Internet, non-public data from third parties, data provided by data labeling services, and data we generate internally." No token counts, proportions, or named datasets are provided. Court filings in Bartz v. Anthropic revealed that Anthropic downloaded more than 7 million pirated books from Library Genesis and the Pirate Library Mirror. A separate program called "Project Panama" (unsealed January 2026) involved purchasing and destructively scanning 500,000 to 2 million physical used books. The Stanford FMTI awarded Anthropic a score of 0 on virtually all data disclosure indicators.

Comparative Training Data Overview: All Major Model Families

The table below synthesises the best available quantitative data on pre-training corpora across all major model families. Where figures are unconfirmed, the source is noted.

| Model (year) | Pre-training tokens | Source disclosure | Known composition |
|---|---|---|---|
| GPT-3 (2020) | 300B (from 499B pool) | Full breakdown published | CC 60%, WebText2 22%, Books 16%, Wiki 3% |
| GPT-4 (2023) | ~13T (leaked, unconfirmed via SemiAnalysis) | No breakdown | "Public + licensed data" |
| GPT-4o (2024) | Undisclosed | No breakdown | "Public web, partnerships, multimodal" |
| Claude 3 (2024) | Undisclosed | No breakdown | "Proprietary mix" of web, third-party, internal |
| Claude 4 (2025) | Undisclosed | No breakdown | Same plus opt-in user data |
| LLaMA 1 (2023) | 1.4T | Full breakdown published | CC 67%, C4 15%, GitHub 4.5%, Wiki 4.5%, Books 4.5%, ArXiv 2.5%, SE 2% |
| Llama 2 (2023) | 2T | Partial (English 89.7%) | "Publicly available"; no named sources |
| Llama 3/3.1 (2024) | 15.6T | Category-level breakdown | Web ~50%, math/reasoning ~25%, code ~17%, multilingual ~8% |
| Llama 4 Scout (2025) | ~40T | Partial | Public + licensed + Meta platform data |
| Gemini 1.0 (2023) | Undisclosed | No breakdown | Web, books, code, images, audio, video |
| Gemini 2.5 (2025) | Undisclosed | No breakdown | Same, 400+ languages |
| Mistral 7B (2023) | Undisclosed | No breakdown | "Open Web"; nothing further disclosed |
| DeepSeek-V2 (2024) | 8.1T | Partial (language split) | Chinese tokens ~12% more numerous than English (roughly 53:47 between the two) |
| DeepSeek-V3 (2024) | 14.8T | Partial | "Plain web pages and e-books" only |
| DeepSeek-R1 (2025) | 14.8T (same base) | Partial | Same base + ~800K post-training samples |
| Falcon 180B (2023) | 3.5T | Breakdown published | RefinedWeb (CC) ~80%, curated sources ~20% |
| Qwen 3 (2025) | 36T | Partial | Web, code, STEM, books; 119 languages |
| Phi-4 (2024) | ~3.3T effective | Category-level breakdown | Synthetic 40%, web rewrites 15%, filtered web 15%, code 20%, academic 10% |
| Yi-34B (2024) | 3.1T | Partial | Common Crawl, bilingual EN/ZH |
| Grok-1 (2024) | Undisclosed | No breakdown | "Public internet repositories" |

Post-Training Pipelines: Convergence on Synthetic Data

The industry's post-training approaches have converged significantly since InstructGPT introduced the SFT → Reward Model → PPO pipeline in 2022. The defining evolution since then is the explosive growth of synthetic data and the declining role of human annotation.

From 13,000 Human Labels to Zero: The Trajectory

InstructGPT (2022) used approximately 13,000 human-written SFT demonstrations, ~33,000 reward-model prompts with human rankings, and ~31,000 PPO prompts — all produced by roughly 40 contractors. This small dataset unlocked GPT-3's instruction-following capabilities, establishing that post-training is far more about data quality than raw quantity.

Llama 2 (2023) scaled human annotation substantially: 27,540 human-written SFT examples and over 1 million human preference comparisons collected across five iterative RLHF rounds with separate helpfulness and safety reward models.

By Llama 3 (2024), Meta had pivoted away from human annotation entirely. According to Meta's post-training lead, the pipeline "doesn't have any human-written answers basically, almost — it's just leveraging pure synthetic data from Llama 2." The pipeline generated approximately 2.7 million synthetic code dialogues alone, plus extensive synthetic math, reasoning, and multilingual data.

DeepSeek-R1 (2025) demonstrated the most radical approach to date: R1-Zero used no human demonstration data, applying pure reinforcement learning (GRPO) directly to the pre-trained base model with only rule-based rewards for answer accuracy and output format. The model spontaneously developed self-verification, reflection, and extended chain-of-thought reasoning. The full R1 pipeline then used approximately 800,000 SFT samples, most generated through rejection sampling from the RL-trained model itself.
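The core idea of GRPO can be sketched briefly: several answers are sampled for the same prompt, each is scored by a rule, and each answer's advantage is its reward relative to the rest of its group, so no learned value model is needed. The code below is a simplified illustration under stated assumptions: the exact-match reward, the use of population standard deviation, and the function names are choices made here, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def binary_accuracy_reward(answer, reference):
    """Rule-based reward: 1.0 for an exact match after trimming, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    samples = ["42", "41", " 42 ", "forty-two"]  # four sampled answers to one prompt
    rewards = [binary_accuracy_reward(s, "42") for s in samples]
    print(rewards, group_relative_advantages(rewards))
```

When every sample in a group is right, or every sample is wrong, all advantages are zero and the group contributes no learning signal, which is why prompts of intermediate difficulty matter most in this setup.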

Constitutional AI and Synthetic Preference Labels

Anthropic's Constitutional AI (CAI) replaced human harmlessness labeling with AI self-critique against a set of constitutional principles, generating approximately 182,000 AI preference labels for harmlessness, mixed with approximately 135,000 human labels for helpfulness. The January 2026 version of Anthropic's model constitution expanded to 23,000 words — a document written for the model itself, explaining not just behavioural rules but the reasoning behind them.

Microsoft Phi: Synthetic Data in Pre-Training

Microsoft's Phi series pushed synthetic data furthest into pre-training. Phi-1 used approximately 1 billion tokens of GPT-3.5-generated "textbook" content. Phi-4's training mixture was 40% synthetic data plus 15% synthetic web rewrites, demonstrating that small models trained on carefully constructed synthetic data can outperform much larger models trained on raw web crawls. Phi-4 surpassed its teacher model GPT-4 on STEM benchmarks — evidence that synthetic data generation has moved beyond simple knowledge distillation.

<$0.01

Cost per label for AI feedback (RLAIF), versus $1–$10+ for human preference data: the economic driver behind synthetic data's rapid displacement of human annotation. Source: industry estimates, 2024

The Humans Behind the Data Labels

Behind every aligned LLM are thousands of human annotators whose work is essential yet largely invisible in model documentation. The data labeling industry supporting AI training is valued at $3.8 billion annually (2024), with projections exceeding $17 billion within five years.

Labor Conditions and the Bifurcated Pay Structure

Scale AI (valued at approximately $14 billion) has served as OpenAI's primary data partner, operating through subsidiaries Remotasks and Outlier. Scale AI also serves Google, Microsoft, Meta, and Nvidia. Remotasks workers in Kenya and the Philippines have been documented earning as little as $0.01 per task for work requiring hours of effort. In March 2024, Remotasks abruptly shut down its Kenya operations, leaving thousands of workers without recourse.

The most extensively documented case involves Sama (formerly Samasource) and OpenAI. A January 2023 TIME investigation revealed that Kenyan workers labeling toxic content for ChatGPT's safety filter earned $1.32–$1.44 per hour after tax, while processing 150–250 passages of graphic content per nine-hour shift — including descriptions of child sexual abuse, bestiality, and murder. OpenAI paid Sama $12.50 per hour per worker; workers received approximately one-ninth of that amount. Multiple workers reported lasting psychological harm, with documented cases of PTSD, insomnia, and substance dependency.

Anthropic requires subcontractors to pay data labelers a minimum of $16 per hour — the only major AI company with a documented wage floor. For expert-level annotation in medicine, law, and advanced coding, rates have increased substantially: Scale AI's Outlier subsidiary advertises $30–$50 per hour for domain experts, and PhD-level specialists can earn $250–$1,000 per hour for specialised annotation tasks.

The bifurcation of the AI labeling industry — subsistence wages for content moderation in the Global South, premium rates for expert knowledge in the Global North — is not an accident of market dynamics. It is a structural feature of how frontier AI systems are built.

Human Annotation Data Volumes: A Declining Trend

| Dataset (year) | Volume | Source of labels |
|---|---|---|
| InstructGPT SFT (2022) | 13,000 prompts | 40 contractors via Upwork and Scale AI |
| InstructGPT reward model (2022) | 33,000 prompts | Same team |
| Anthropic HH-RLHF, human (2022) | ~135,000 helpfulness examples | Crowdworkers |
| Anthropic CAI, synthetic (2022) | ~182,000 harmlessness examples | AI-generated |
| Llama 2 preference data (2023) | >1 million binary comparisons | Internal annotators |
| DeepSeek-R1 SFT (2025) | ~800,000 samples | 600K rejection-sampled from RL model, 200K reused |

Legal Battles Rewriting the Rules of Data Sourcing

The legal framework for AI training data is being defined by a series of cases that will determine data sourcing norms for years. Three cases are currently most consequential.

Bartz v. Anthropic: Fair Use and the Piracy Distinction

In June 2025, Judge William Alsup ruled that AI training on lawfully acquired copyrighted works is "spectacularly transformative" and qualifies as fair use — the first major fair use ruling favouring an AI company in the United States. However, Alsup simultaneously held that Anthropic's downloading of more than 7 million pirated books was "inherently, irredeemably infringing." A proposed $1.5 billion settlement — reported as the largest ever in a U.S. copyright case, covering approximately 482,460 books at roughly $3,000 per work — established a critical legal distinction: how training data is acquired matters as much as how it is used. (Reuters, April 2026; update if final court approval has since been confirmed.)

$1.5B

Proposed settlement in Bartz v. Anthropic (2025), reported as the largest ever in a U.S. copyright case, covering ~482,460 books at ~$3,000 per work. Pending final court approval as of April 2026. Source: U.S. District Court, N.D. Cal., 2025

NYT v. OpenAI: Ongoing Discovery

Filed in December 2023, NYT v. OpenAI remains the industry's highest-profile active case. As of early 2026, it is in expert discovery with summary judgment briefing concluding April 2026. Key developments include a court order compelling OpenAI to produce 20 million anonymised ChatGPT conversation logs, and a ruling that ChatGPT-generated summaries could constitute "substantially similar" reproductions of copyrighted articles.

Kadrey v. Meta and the Books3 Dataset

The Kadrey v. Meta case (the Sarah Silverman lawsuit) resulted in a narrow summary judgment for Meta on fair use in June 2025, but the judge emphasised the ruling was fact-specific and that "plaintiffs will often win" with stronger evidence of market harm. A separate claim regarding Meta's distribution of pirated books via BitTorrent remains unresolved.

The Books3 dataset — 196,000 books scraped from the pirate site Bibliotik, compiled as part of EleutherAI's The Pile — has become the focal point of multiple lawsuits. Meta admitted using Books3 for LLaMA. Nvidia internal emails revealed in January 2026 showed employees contacting pirate sites seeking "high-speed access" to training material. Books3 was removed from Hugging Face in October 2023.

The Emerging Legal Framework

Three principles are taking shape across these cases:

Training on legally acquired copyrighted works likely qualifies as fair use under U.S. law, per Alsup's ruling in Bartz.
Acquiring training data through piracy creates catastrophic liability regardless of how it is subsequently used — as the proposed Anthropic $1.5 billion settlement illustrates.
The growing licensing market — OpenAI has deals worth $5–70 million annually with Axel Springer, News Corp, Condé Nast, and the Associated Press — increasingly undermines fair use arguments by demonstrating that a commercial market for this content exists.

Opt-out mechanisms remain largely ineffective. A 2025 Duke University study found many AI crawlers do not check robots.txt files. Google's VP testified that the Google-Extended opt-out is not honoured when Google uses Gemini to power AI Overviews — meaning publishers cannot block AI training use without blocking Google Search indexing entirely. The EU's TDM Reservation Protocol (TDMRep) is emerging as a more robust standard.
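For context, the check that a compliant crawler is expected to perform is short. The sketch below uses Python's standard-library `urllib.robotparser` on an in-memory robots.txt; the file contents, the URL, and the helper names are illustrative assumptions. It shows only what honouring the file looks like, and says nothing about whether a given crawler actually does so.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block two AI-related user agents, allow everyone else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser, user_agent, url):
    """True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(EXAMPLE_ROBOTS_TXT)
    for agent in ("GPTBot", "Google-Extended", "ExampleSearchBot"):
        print(agent, may_crawl(rules, agent, "https://example.com/article"))
```

Because the protocol is advisory, nothing in it prevents a crawler from skipping this check, which is the weakness the findings above describe.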

What Academic Research Reveals About Training Data

A growing body of independent research has developed tools for understanding training data composition without relying on company disclosures.

Foundational Open Datasets

The Pile (EleutherAI, 2020) was the first thoroughly documented multi-source LLM dataset: 825 GiB across 22 sub-datasets including Pile-CC (18.1%), PubMed Central (14.4%), Books3 (12.1%), OpenWebText2 (10.0%), ArXiv (9.0%), GitHub (7.6%), and 16 smaller components. Its explicit upsampling strategy — Wikipedia to 3 epochs, higher-quality sources to 2–2.5 epochs — established patterns adopted by subsequent datasets industry-wide.

RefinedWeb (Penedo et al., NeurIPS 2023) challenged prevailing assumptions by demonstrating that properly filtered and deduplicated web-only data can outperform curated multi-source corpora, and that 5 trillion tokens of such data can be obtained from Common Crawl alone. This finding, validated by Falcon models' benchmark performance, influenced data strategies across the industry.

FineWeb (HuggingFace, 2024) scaled to 15 trillion tokens from 96 Common Crawl snapshots. Its educational subset, FineWeb-Edu, was filtered using a classifier trained on annotations generated by Llama-3-70B-Instruct. It achieves comparable performance to the full dataset with 10× fewer tokens, empirical evidence that data quality filtering can substitute for scale.

RedPajama-V2 (Together AI) provides the largest openly annotated web corpus at 30 trillion raw tokens with 40+ pre-computed quality signals, designed as a filterable pool rather than a fixed training dataset.

Training Data Extraction and Memorisation

Carlini et al.'s work on training data extraction has demonstrated that LLMs memorise and can reproduce verbatim training data at scale. Their 2021 paper extracted hundreds of verbatim sequences from GPT-2, including personal information. A 2023 follow-up developed a "divergence attack" that caused ChatGPT to emit memorised training data at 150× the normal rate, extracting gigabytes of text — demonstrating that RLHF alignment does not prevent memorisation. Larger models memorise proportionally more; deduplication reduces memorisation by approximately 10× (Lee, Ippolito, Carlini et al., ACL 2022).

The Epoch Constraint: How Many Times Can Data Be Reused?

Muennighoff et al. (NeurIPS 2023) established that training with up to four epochs of repeated data yields negligible performance degradation, while returns approach zero beyond approximately 16 epochs. This finding has become a key constraint on data-limited scaling, explaining the industry-wide push toward ever-larger and more diverse corpora rather than intensive reuse of existing datasets.
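One way to picture this diminishing return is an exponential-saturation model, in the spirit of the data-constrained scaling law from that line of work: each extra pass over the same data is worth a little less than the one before. The sketch below is a simplified illustration, and the decay constant of 15 is an assumption chosen for this example rather than a value taken from this article.

```python
import math

ASSUMED_DECAY_CONSTANT = 15.0  # assumption: controls how quickly repeats lose value


def effective_epochs(epochs, decay=ASSUMED_DECAY_CONSTANT):
    """Unique-data-equivalent passes after training for `epochs` passes.

    The first pass counts fully; the (epochs - 1) repetitions are discounted
    by a saturating exponential, so their total value is capped at `decay`.
    """
    repetitions = epochs - 1
    return 1.0 + decay * (1.0 - math.exp(-repetitions / decay))


if __name__ == "__main__":
    for n in (1, 4, 16, 40):
        eff = effective_epochs(n)
        print(f"{n:>2} epochs ~ {eff:.2f} unique-equivalent passes ({eff / n:.0%} efficiency)")
```

Under these assumptions, four epochs retain most of the value of fresh data while sixteen retain far less per pass, which matches the qualitative shape of the finding above.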

The Transparency Crisis: By the Numbers

The Stanford Foundation Model Transparency Index (FMTI) provides the most systematic cross-industry assessment of disclosure practices. Scores have evolved from an average of 37/100 (October 2023) to a peak of 58/100 (May 2024), before declining to 40.69/100 (rounded to 41) in 2025 as multiple companies reduced voluntary disclosures — largely in response to escalating litigation.

Data access transparency rate in the Stanford FMTI by 2024, down from 20% in 2023: a decline driven by litigation risk, not capability constraints. Source: Stanford FMTI, 2024

IBM's Granite 3.3 achieved the highest-ever FMTI score of 95/100 in 2025. Meta dropped 29 points and Mistral dropped 37 points year-over-year. The Data Provenance Initiative found that 72% or more of popular datasets have missing or erroneous licence information, and over 80% of source content in widely-used datasets is restrictively licensed when tracing full derivation chains.

Voluntary disclosure peaked in 2024 and has since declined — a pattern that is structurally rational for companies facing litigation, even as it is epistemically harmful for every stakeholder trying to evaluate these systems.

Three Structural Forces Reshaping LLM Training Data

1. Synthetic Data Is Becoming the Dominant Post-Training Signal

DeepSeek-R1-Zero demonstrated that pure reinforcement learning with no human demonstrations can produce frontier reasoning capabilities. Phi-4 demonstrated that a 40%-synthetic pre-training mixture can outperform its teacher model on STEM benchmarks. The recursive nature of this trend — models training on outputs from previous models — raises unsettled questions about quality degradation and provenance accountability over successive generations.

2. The Legal Framework Is Crystallising Around an Acquisition Distinction

Courts have signalled that transformative training on legally obtained data likely qualifies as fair use, but piracy-to-training pipelines face catastrophic liability — as the proposed Anthropic settlement illustrates. This distinction will accelerate the growing licensing market while potentially concentrating the frontier AI industry among companies with sufficient capital to pay for data access at scale.

3. Regulatory Mandates Are Replacing Voluntary Transparency

The EU AI Act's GPAI training data summary requirements entered application on 2 August 2025; the Commission's enforcement powers — fines of up to €15 million or 3% of global revenue — enter application on 2 August 2026 (older models have until 2 August 2027 to comply). This will compel disclosure that companies have resisted for years. Whether the resulting summaries prove substantive or become compliance artefacts will determine whether the public can meaningfully evaluate the data foundations of systems that increasingly mediate access to information, education, and economic opportunity.

Last reviewed: May 2026 by Christopher Foster-McBride, Digital Human Assistants. Legal case status and transparency scores change frequently — check the blog for updates.

Frequently Asked Questions

Which large language models have published full training data breakdowns?

As of 2025, only a small number of frontier models have published complete source-level breakdowns. OpenAI's GPT-3 (2020) remains the last major OpenAI model with a full published breakdown. Meta's LLaMA 1 (2023) named every dataset with precise percentages. Microsoft's Phi-4 (2024) published category-level composition including synthetic data proportions. Falcon 180B and EleutherAI's Pile-based models also provide detailed documentation. GPT-4, all Claude models, all Gemini models, and all Mistral models have published no source-level breakdowns.

What is Common Crawl and why does it dominate LLM training data?

Common Crawl is a non-profit organisation that maintains a freely available archive of web content, updated monthly, currently exceeding 250 billion pages. It dominates LLM pre-training because it is the largest freely accessible text corpus in existence, providing the token scale required for frontier model training. Across documented LLM corpora, Common Crawl or filtered derivatives typically constitute 50–80% of pre-training tokens. Its primary limitation is quality: raw crawl data contains spam, boilerplate, and low-quality text, requiring extensive filtering. RefinedWeb (NeurIPS 2023) and FineWeb (HuggingFace, 2024) are the leading filtered Common Crawl derivatives used by researchers.

What did the Bartz v. Anthropic ruling establish for AI training data law?

The June 2025 ruling by Judge William Alsup in Bartz v. Anthropic established two distinct legal principles. First, that AI training on lawfully acquired copyrighted works is 'spectacularly transformative' and qualifies as fair use under U.S. copyright law — the first major fair use ruling favouring an AI developer. Second, and simultaneously, that acquiring training data through piracy (Anthropic downloaded more than 7 million books from Library Genesis and the Pirate Library Mirror) is 'inherently, irredeemably infringing' regardless of how the data is subsequently used. A proposed $1.5 billion settlement — reported as the largest ever in a U.S. copyright case — confirmed that acquisition method and training use are separate legal questions with separate liability. (Pending final court approval as of April 2026.)

What is the Stanford Foundation Model Transparency Index and what does it measure?

The Stanford Foundation Model Transparency Index (FMTI) is a systematic benchmark produced by the Stanford Center for Research on Foundation Models. It evaluates major AI developers across 100 indicators spanning training data disclosure, model architecture, deployment practices, and governance. On training data specifically, indicators include whether token counts, data sources, data filtering methods, and licence information are disclosed. The industry average peaked at 58/100 in May 2024 before declining to 40.69/100 (rounded to 41) by 2025. IBM's Granite 3.3 achieved the highest-ever score of 95/100 in 2025. Anthropic scored 0 on virtually all data disclosure indicators. The FMTI is widely cited in AI policy discussions and EU AI Act compliance assessments.

How much are human data labelers paid, and is that changing?

Pay varies dramatically by task type and geography. Wages for content moderation and basic labeling, work predominantly performed in Kenya, the Philippines, and other Global South countries, have been documented at $1.32–$1.44 per hour (TIME investigation, January 2023, for Sama/OpenAI workers) and as low as $0.01 per task for some Remotasks assignments. Expert annotation in medicine, law, and advanced coding commands $30–$50 per hour via Scale AI's Outlier platform, with PhD-level specialists earning $250–$1,000 per hour. Anthropic is the only major AI company with a documented minimum wage floor for subcontractors ($16/hour). The broader trend is declining demand for human annotation as synthetic data and AI feedback (RLAIF), costing below $0.01 per label, displace bulk human labeling.

What is synthetic training data and what are the risks of models training on it recursively?

Synthetic training data is text (or other content) generated by an AI model, used as input to train another model — or the next version of the same model. Its use has grown from isolated post-training applications (Anthropic's Constitutional AI, 2022) to dominant proportions of post-training pipelines (Llama 3's ~2.7 million synthetic code dialogues) and significant shares of pre-training (Phi-4's 40% synthetic pre-training mixture). The recursive risk — often called 'model collapse' in the research literature — is that errors, biases, or distributional gaps in one generation's output can be amplified in the next. Shumailov et al. (Nature, 2024) demonstrated that iterative training on model-generated data leads to progressive quality degradation, particularly affecting low-frequency content. Practical mitigation strategies include mixing synthetic data with verified human-authored or real-world data at known proportions, a discipline that becomes harder to enforce as the provenance of web-crawled data itself becomes uncertain.
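The tail-loss mechanism and the mixing mitigation described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a reproduction of the Shumailov et al. experiments: the "model" is just a token frequency table, the vocabulary follows a hypothetical Zipf-like distribution, and all function names and parameter values are invented for illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def zipf_distribution(vocab_size: int) -> Dict[int, float]:
    """Hypothetical 'real' data: token i has weight 1 / (i + 1), normalised."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return {tok: w / total for tok, w in enumerate(weights)}


def refit(dist: Dict[int, float], n_samples: int, rng: random.Random) -> Dict[int, float]:
    """Sample from the current model, then re-estimate it from those samples."""
    tokens = list(dist)
    probs = [dist[t] for t in tokens]
    counts = Counter(rng.choices(tokens, weights=probs, k=n_samples))
    return {tok: c / n_samples for tok, c in counts.items()}


def simulate(vocab_size: int = 1000, n_samples: int = 2000, generations: int = 10,
             real_fraction: float = 0.0, seed: int = 0) -> List[int]:
    """Return the number of tokens with non-zero probability after each generation."""
    rng = random.Random(seed)
    real = zipf_distribution(vocab_size)
    current = dict(real)
    support_sizes = [len(current)]
    for _ in range(generations):
        synthetic = refit(current, n_samples, rng)
        # Next generation trains on a mix of real and synthetic data.
        mixed = {
            tok: real_fraction * real[tok] + (1 - real_fraction) * synthetic.get(tok, 0.0)
            for tok in real
        }
        # A token that was never sampled drops out for good when real_fraction is 0.
        current = {tok: p for tok, p in mixed.items() if p > 0.0}
        support_sizes.append(len(current))
    return support_sizes


print("synthetic only:", simulate(real_fraction=0.0))
print("30% real data: ", simulate(real_fraction=0.3))
```

With purely synthetic data the count of surviving tokens can only fall from one generation to the next, because a rare token that is never sampled has no way back in. Any non-zero share of real data keeps the full vocabulary reachable, which is the intuition behind mixing at known proportions.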

What does the EU AI Act require regarding LLM training data disclosure?

The EU AI Act's obligations for general-purpose AI (GPAI) models entered application on 2 August 2025. The Commission's enforcement powers, including fines of up to €15 million or 3% of global annual revenue, whichever is higher, enter application on 2 August 2026. Models placed on the market before 2 August 2025 have until 2 August 2027 to comply. GPAI providers must publish a sufficiently detailed summary of training data used, including the types of data, sources, and data collection procedures. For models with systemic risk (those exceeding 10^25 FLOPs of training compute), requirements extend to adversarial testing results and cybersecurity measures. The Stanford FMTI flagged Mistral's current disclosure level as potentially non-compliant. The TDM Reservation Protocol (TDMRep), a W3C Community Group specification built around the text-and-data-mining opt-out in EU copyright law, provides a parallel mechanism for content publishers to signal opt-out from AI training use.
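The two numeric rules in this answer, the 10^25 FLOP threshold and the "whichever is higher" fine cap, can be sketched as simple arithmetic. The compute estimate below uses the common rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens FLOPs; that approximation is an assumption brought in for illustration, not the Act's own measurement method, and the model sizes are hypothetical.

```python
SYSTEMIC_RISK_FLOPS = 1e25   # threshold cited in the text
FINE_FIXED_EUR = 15_000_000  # fixed cap cited in the text
FINE_REVENUE_PERCENT = 3     # revenue-based cap cited in the text


def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate (an assumption): C ~= 6 * N * D for dense transformers."""
    return 6.0 * n_params * n_tokens


def exceeds_systemic_risk_threshold(n_params: float, n_tokens: float) -> bool:
    """True if the estimated training compute is above 10^25 FLOPs."""
    return estimate_training_flops(n_params, n_tokens) > SYSTEMIC_RISK_FLOPS


def max_fine_eur(global_annual_revenue_eur: float) -> float:
    """Upper bound on the fine: the higher of the fixed cap and 3% of revenue."""
    return max(FINE_FIXED_EUR, global_annual_revenue_eur * FINE_REVENUE_PERCENT / 100)


# Hypothetical 7B-parameter model trained on 2T tokens: about 8.4e22 FLOPs.
print(exceeds_systemic_risk_threshold(7e9, 2e12))
# Hypothetical 400B-parameter model trained on 15T tokens: about 3.6e25 FLOPs.
print(exceeds_systemic_risk_threshold(4e11, 1.5e13))
# A provider with EUR 1 billion in global revenue: the 3% cap applies.
print(max_fine_eur(1_000_000_000))
```

Under this approximation, a model crosses the threshold only at frontier scale, and the revenue-based cap overtakes the fixed €15 million figure once global annual revenue exceeds €500 million.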

About the Author

Christopher Foster-McBride is the Founder of AI Knowledge Signal and Digital Human Assistants. He works with organisations on structuring their knowledge so AI systems can accurately select, cite, and represent them in generated answers. He is the author of the AI Knowledge Signal Framework — a 6-phase methodology for AI visibility — and writes the weekly Signal newsletter on AI knowledge, GEO, and ASO.