Bottleneck — Data Wall

What it is

Running out of high-quality text to pretrain on. Model size growth (and with it the demand for training tokens) outpaces global production of novel text. Villalobos et al. (2024) estimate this happens within the current decade.
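The "demand outpaces supply" argument is just two exponentials crossing. A minimal sketch of that arithmetic follows; the function name and every number in it are illustrative placeholders, not figures from Villalobos et al.

```python
import math


def years_until_crossover(stock: float, demand: float,
                          stock_growth: float, demand_growth: float) -> float:
    """Years until demand * demand_growth**t catches up with stock * stock_growth**t."""
    if demand >= stock:
        return 0.0  # already at or past the wall
    if demand_growth <= stock_growth:
        raise ValueError("demand never catches up if it grows no faster than the stock")
    # Solve stock * stock_growth**t == demand * demand_growth**t for t.
    return math.log(stock / demand) / math.log(demand_growth / stock_growth)


if __name__ == "__main__":
    # Placeholder values (hypothetical units and growth factors per year).
    t = years_until_crossover(stock=100.0, demand=1.0,
                              stock_growth=1.05, demand_growth=2.0)
    print(f"crossover after about {t:.1f} years")
```

The crossover time depends only logarithmically on the size of the stock: if demand doubles yearly against a static stock, a tenfold larger stock delays the crossover by only log2(10) ≈ 3.3 years.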

Why it might be overcome

The paper is unusually optimistic here. Several countervailing forces:

Modalities beyond text — images, audio, video have more runway, though still bounded by human production rate
Synthetic data — generative AI's output rate is now accelerating
Self-generated training data — works for AlphaZero-style closed-domain RL; jury still out for open-ended LLM pretraining
High-fidelity simulation — agents trained in simulated worlds can collect interaction data limited only by compute
RL / interaction data — agents in real or simulated environments
Example: DeepMind's Adaptive Agent (Bauer et al. 2023), a generalist agent trained on procedurally generated multi-agent tasks, where complexity arises from the agents' increasingly complex policies

The catch about synthetic data

Naive iterated training on self-generated data → model collapse (Shumailov et al. 2024).
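A toy illustration of the collapse mechanism, assuming the "model" is nothing more than a Gaussian refit to its own samples each generation. This is a deliberately tiny stand-in, not the language-model experiments of Shumailov et al.; the function name and parameters are made up for the sketch.

```python
import random
import statistics


def iterate_self_training(generations: int, sample_size: int, seed: int = 0) -> list[float]:
    """Refit a Gaussian to its own samples repeatedly; return the std after each generation."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 plays the role of the real data distribution
    history = [std]
    for _ in range(generations):
        # Draw a finite synthetic dataset from the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and fit the next model to that synthetic data only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    stds = iterate_self_training(generations=200, sample_size=20)
    print(f"std at generation 0:   {stds[0]:.3f}")
    print(f"std at generation 200: {stds[-1]:.3f}")
```

Each refit loses a little of the tails through finite sampling, and nothing in the loop ever puts that variance back, so the fitted spread drifts towards zero.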

But some forms of test-time scaling do work: improve on the base model's generations, then iteratively distill those improvements back into the model. This is the AlphaZero pattern again (see 08 - Pathway 3 — Recursive Self-Improvement). The key is having a quality filter or verifier, like win/loss in chess.
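A minimal sketch of the filter-then-distill loop, assuming a toy "model" that is a categorical distribution over ten candidate answers and a hypothetical `verifier` that knows which answer is correct. It is closer to a cross-entropy-method update on a bandit than to anything AlphaZero-sized; all names and constants are illustrative.

```python
import random
from collections import Counter


def verifier(candidate: int, target: int = 7) -> bool:
    """Hypothetical ground-truth check, standing in for win/loss in a game."""
    return candidate == target


def filtered_self_training(rounds: int, samples_per_round: int,
                           n_actions: int = 10, seed: int = 0) -> list[float]:
    """Sample from the model, keep verified samples, distill them back; return pass rates."""
    rng = random.Random(seed)
    weights = [1.0 / n_actions] * n_actions  # uniform base model
    pass_rates = []
    for _ in range(rounds):
        samples = rng.choices(range(n_actions), weights=weights, k=samples_per_round)
        kept = [s for s in samples if verifier(s)]  # the quality filter
        pass_rates.append(len(kept) / samples_per_round)
        if not kept:
            continue  # nothing passed the filter, so keep the old model
        counts = Counter(kept)
        # Distill: move the model halfway towards the verified samples.
        weights = [0.5 * w + 0.5 * counts[a] / len(kept)
                   for a, w in enumerate(weights)]
    return pass_rates


if __name__ == "__main__":
    for i, rate in enumerate(filtered_self_training(rounds=5, samples_per_round=200)):
        print(f"round {i}: verifier pass rate {rate:.2f}")
```

The pass rate should climb round over round because the only data fed back is data the verifier accepted. Remove the filter and the loop reduces to the naive self-training that collapses.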

The fundamental question

"When is third-party experience sufficient in practice for learning to plan and act, without fuelling self-delusions?"

The question comes out of Ortega et al. (2021): training on observational data of other agents acting can be causally insufficient for learning to make decisions yourself. A model can imitate the behaviour without learning what causes what.
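A toy confounding example of why watching is not the same as acting. It assumes a hidden binary state that the expert can see and the learner cannot; this is an illustration of the general point, not the construction used by Ortega et al., and the function name is invented for the sketch.

```python
import random


def compare_observation_and_intervention(n: int, seed: int = 0) -> tuple[float, float]:
    """Return (success rate of action 1 as observed in expert data,
    success rate when the learner itself always takes action 1)."""
    rng = random.Random(seed)
    observed_rewards = []    # rewards seen in expert data when the expert chose action 1
    intervened_rewards = []  # rewards when the learner forces action 1
    for _ in range(n):
        theta = rng.randint(0, 1)  # hidden state, visible to the expert only
        expert_action = theta      # the expert always matches the hidden state
        if expert_action == 1:
            observed_rewards.append(int(expert_action == theta))
        learner_action = 1         # the learner acts without seeing theta
        intervened_rewards.append(int(learner_action == theta))
    if not observed_rewards:
        raise ValueError("no expert episodes with action 1; increase n")
    observational = sum(observed_rewards) / len(observed_rewards)
    interventional = sum(intervened_rewards) / n
    return observational, interventional


if __name__ == "__main__":
    obs, do = compare_observation_and_intervention(n=10_000)
    print(f"P(success | expert chose 1) = {obs:.2f}")
    print(f"P(success | learner does 1) = {do:.2f}")
```

In the expert's data, action 1 always succeeds, because the expert only picks it when the hidden state makes it right. A learner that reads this as "action 1 works" and then takes it blindly succeeds only about half the time. The action carried information about the hidden state when the expert chose it, and carries none when the learner does.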

This bears directly on whether you can train AGI by watching humans — vs. needing the AI to act in the world.

How this interacts with the pathways

Kills naive 06 - Pathway 1 — Scaling
Solvable by 07 - Pathway 2 — Paradigm Shifts (more data-efficient algorithms)
Solvable by 08 - Pathway 3 — Recursive Self-Improvement (AI generates its own better data)
Largely irrelevant to 09 - Pathway 4 — Multi-Agent Collectives (uses existing AGIs, no new training data needed)

The verdict

"If the progress from AGI to ASI is mainly driven by scaling compute and models, then scaling up data generation, simulation, and collection at a similar pace through more compute may be possible, leading to data availability being a friction but not a fundamental blocker."

Friction, not wall. Probably.
