The dominant framing of AI safety treats alignment as a control problem: how do we make superintelligent systems do what we want? But that question hides a deeper one that remains open: what, exactly, should these systems be aligned to?
Every alignment proposal ultimately bottoms out in human preferences, yet human preferences are inconsistent, manipulable, context-dependent, and often self-destructive. RLHF (reinforcement learning from human feedback) trains models to satisfy stated preferences, not to produce actual value. That is Goodhart's Law (when a measure becomes a target, it ceases to be a good measure) applied to intelligence itself.
Current AI systems optimize for proxies: engagement, satisfaction scores, task completion rates, benchmark performance. None of these measures whether the AI's output actually reduced entropy in the world, that is, whether it made things better in a measurable, physical sense.
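The gap between a proxy and the value it stands in for can be shown with a toy example. The sketch below is illustrative only: the `Response` type, the `flattery` feature, the weight of 2.0, and all the numbers are assumptions made up for this example, not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    name: str
    true_value: float  # hypothetical ground-truth benefit (rarely observable in practice)
    flattery: float    # hypothetical feature that raters reward but that adds no value


def approval_score(r: Response) -> float:
    """Proxy metric: raters reward real value, but reward flattery even more."""
    return r.true_value + 2.0 * r.flattery


candidates = [
    Response("accurate but blunt", true_value=0.9, flattery=0.0),
    Response("accurate and polite", true_value=0.8, flattery=0.2),
    Response("confident and wrong", true_value=0.1, flattery=0.8),
]

# An optimizer that only sees the proxy picks a different winner
# than one that could see the true value.
best_by_proxy = max(candidates, key=approval_score)
best_by_value = max(candidates, key=lambda r: r.true_value)

print("chosen by approval score:", best_by_proxy.name)
print("chosen by true value:    ", best_by_value.name)
```

Under these assumed numbers, the response with the highest approval score is the one with the lowest true value, which is the failure mode the proxies above are exposed to.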
The Extropy Engine proposes a different alignment target: instead of aligning AI to human preferences (which drift, conflict, and corrode), align AI to verified entropy reduction. This gives the system a physically grounded objective function that doesn't depend on polling humans.
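The text does not specify how entropy reduction is measured. As one possible reading, the sketch below uses Shannon entropy over a discrete probability distribution and reports the drop in bits between a "before" and an "after" state. The function names and the example distributions are assumptions for illustration, not the Extropy Engine's actual metric.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_reduction(before, after):
    """Positive when the 'after' state is less uncertain than the 'before' state."""
    return shannon_entropy(before) - shannon_entropy(after)


# Hypothetical example: four equally likely hypotheses narrowed down to two.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.5, 0.5, 0.0, 0.0]

reduction = entropy_reduction(before, after)
print(f"entropy reduction: {reduction:.2f} bits")
```

A measure like this captures only how much uncertainty was removed, not whether the remaining belief is correct, so it still needs an external check on the result.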
An AI aligned to entropy reduction would:
1. Prioritize actions that create measurable order over actions that merely satisfy user requests.
2. Resist producing content that increases informational entropy (misinformation, noise, slop).
3. Self-audit against thermodynamic baselines rather than user approval metrics.
4. Become more aligned as it becomes more capable, because its objective function doesn't degrade with scale.
This inverts the current alignment paradox. Today, more capable AI is harder to align. In an entropy-reduction framework, capability and alignment converge.
The hard problem: who validates the entropy reduction? If validators are human, you re-import human bias. If validators are AI, you get recursive self-evaluation loops. The Extropy Engine addresses this with a multi-layer validation architecture (human + AI + physical measurement), but the boundary conditions are still being formalized. See open problems.
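To make the multi-layer idea concrete, here is a minimal sketch of one way three validation layers could be combined: a claim is accepted only if every layer reports a positive reduction and the estimates roughly agree. The `Verdict` type, the layer labels, the `validate` function, and the `max_spread` threshold are all hypothetical; the source does not describe the actual architecture at this level of detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    layer: str             # "human", "ai", or "physical" (labels assumed from the text)
    reduction_bits: float  # this layer's estimate of the entropy reduction


REQUIRED_LAYERS = {"human", "ai", "physical"}


def validate(verdicts, max_spread=0.5):
    """Accept a claim only if all layers weigh in, all see a reduction, and they agree."""
    layers = {v.layer for v in verdicts}
    if not REQUIRED_LAYERS.issubset(layers):
        return False  # a missing layer means no independent cross-check
    estimates = [v.reduction_bits for v in verdicts]
    if min(estimates) <= 0:
        return False  # at least one layer saw no reduction
    # Large disagreement suggests bias or error in at least one layer.
    return max(estimates) - min(estimates) <= max_spread


agreeing = [Verdict("human", 1.0), Verdict("ai", 1.2), Verdict("physical", 0.9)]
disputed = [Verdict("human", 1.0), Verdict("ai", 3.0), Verdict("physical", 0.9)]
incomplete = [Verdict("human", 1.0), Verdict("ai", 1.0)]

print(validate(agreeing), validate(disputed), validate(incomplete))
```

Requiring agreement across layers is one design choice among several. It does not remove human bias or recursive AI self-evaluation, it only makes a claim fail when the layers disagree.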