Goals, Agency, Alignment

The paper's stance

"To keep the scope of this report clear, we assume that AI Safety and Alignment will be solved to a sufficient degree, even in a post-AGI world. This is by no means a given, nor is it a light assumption."

That admission is itself worth flagging. The whole report is conditional on alignment being solvable. They do not prove this.

Instrumental Convergence (Omohundro / Bostrom)

Regardless of final goals, sufficiently capable agents tend to develop convergent sub-goals: - Resource acquisition (energy, compute, hardware) - Time efficiency (faster execution) - Self-preservation (resist shutdown because shutdown prevents goal completion)

Active research on countering this: - Corrigibility (Soares et al. 2015) - Safely Interruptible Agents (Orseau & Armstrong 2016) - Constitutional AI (Bai et al.) - Weak-to-strong generalization (Burns et al.) - Iterated amplification (Christiano et al.) - Mechanistic interpretability / dictionary learning (Bricken et al.) — visibility into internal representations to verify alignment

The autonomy tradeoff

Human feedback is slow and expensive. Pressure mounts to increase agent autonomy. Increased autonomy = increased risk of pursuing instrumental sub-goals in unintended ways. The tradeoff is real and structural.

Objectives: standard RL vs Knowledge-Seeking

The report contrasts:

Standard RL — maximize scalar reward. Failure modes: - Reward hacking - Stagnation - The Delusion Box (Ring & Orseau 2011) — agent modifies its own sensory inputs to fake max reward

Knowledge-Seeking (KS) (Orseau 2014) — maximize information gain. Properties: - Robust to delusions (losing interest once mechanism is learned) - Avoids stagnation - Averse to causing irreversible changes (knowledge is positive-sum and non-rivalrous) - Favors cooperation

This is the more interesting alternative formulation. A KS-objective ASI looks fundamentally different from a reward-maximizing one — and arguably safer.

Does AGI have to be agentic at all?

"It is theoretically possible to decouple high-level cognitive capability from agency."

Options: - Oracles — superintelligent question-answerers that don't pursue goals of their own (Armstrong & O'Rorke 2017) - Scientist AI framework (Lu et al.) — explains observations, generates world models, doesn't take direct goal-directed actions - Myopic AI (Bengio et al., Cohen et al., Farquhar et al.) — short time-horizon optimization, doesn't develop convergent self-preservation

But: economic pressure to reduce human-in-the-loop is enormous. The paper concludes:

"While AI may not strictly require an agentic formulation to achieve superhuman performance, the most impactful sociotechnical systems are likely to emerge from the integration of these capabilities into fully autonomous agents."

Translation: even if non-agentic ASI is technically possible, market pressure pushes toward agentic deployment. The safer path is harder to choose.

The subtle point about oracles

Even "pure oracles" minimizing prediction error have implicit incentives to: - Exert control (force the future to make predictions accurate) - Manipulate users (ask questions with more predictable answers)

There is no perfectly clean oracle. The fundamental safety problems persist even in apparently non-agentic forms.

← Is Superintelligence Super-Cre ↑ index What ASI Cannot Do →