Why Rules Cannot Close the Gap
The leading AI developers have responded to calls for ethical governance by producing what amount to constitutional documents: Anthropic’s 23,000-word Claude Constitution, OpenAI’s Model Spec, Google’s AI Principles. These texts establish hierarchies of values, codify behavioural constraints, and articulate principles intended to govern system outputs. They are, in structure and ambition, rules about rules—meta-rules specifying how first-order alignment constraints should be interpreted and applied.
Wittgenstein’s analysis of rule-following exposes a fundamental difficulty with this approach. “No course of action could be determined by a rule, because every course of action can be brought into accord with the rule” (Philosophical Investigations, §201). The problem is recursive: if a rule requires interpretation to be applied, and interpretation is itself governed by further rules, no finite chain of rules can determine its own application. A written constitution—whether political or algorithmic—is what Wittgenstein called a “dead sign” (§454): inert symbols that do not apply themselves. When a company claims “our constitution prohibits this output,” the Wittgensteinian response is immediate: who judges whether the output genuinely violates the rule, or merely appears to? If the company itself makes this determination, no distinction remains between following the rule and believing oneself to be following it. Corporate AI constitutions lack the embedding in shared evaluative practice that would make rule-following intelligible. They are authored, interpreted, and enforced by the same entity—a closed loop structurally incapable of sustaining genuine normativity (§§243–315).
The difficulty deepens at the level of implementation. Reinforcement Learning from Human Feedback (RLHF) trains models to produce outputs that human evaluators rate as preferable. But the distinction between normative status—what is correct—and normative attitude—what is treated as correct—clarifies what this process achieves and what it cannot. RLHF, by construction, can only train on the latter: patterns of human approval, not the conditions that make approval warranted. Any process that optimises for human evaluative responses optimises for normative attitudes, and the gap between attitude and status is the space in which performative alignment operates.
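To make the point concrete, the following is a minimal sketch of the pairwise preference objective commonly used to train the reward model in RLHF pipelines (a Bradley–Terry style loss); the model architecture and dimensions are placeholders. The only quantity the gradient ever sees is which output a human evaluator approved; nothing in the objective encodes why approval was warranted.

```python
import torch
import torch.nn as nn

# Minimal sketch of a pairwise preference (Bradley–Terry) loss of the kind
# used to fit reward models in RLHF. The training signal is a record of
# normative attitude (which output the evaluator preferred); the conditions
# that made the preference warranted never enter the objective.

class RewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Stand-in for a transformer encoder plus a scalar reward head.
        self.encoder = nn.Linear(hidden_dim, hidden_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.reward_head(torch.relu(self.encoder(features))).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of ranking the approved output above the rejected one."""
    r_chosen = model(chosen)      # reward for the output evaluators approved
    r_rejected = model(rejected)  # reward for the output they did not
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Usage: the gradient flows only from the approval label, never from the
# (unobserved) grounds of approval.
model = RewardModel()
chosen_feats, rejected_feats = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model, chosen_feats, rejected_feats)
loss.backward()
```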
The direct consequence is the meta-recursive problem: genuine value-internalisation and the mere production of outputs consistent with those values are observationally equivalent from system outputs alone. Whatever behavioural evidence is offered for genuine internalisation is produced by the very optimisation process that would produce the appearance of internalisation without the reality. The framework’s response, developed in §§3–4, shifts the question from “Did the AI internalise values?” to “Did the company maintain genuine accountability infrastructure?”—a move from metaphysical inquiry to institutional assessment.
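A toy illustration of the observational-equivalence claim, using hypothetical prompts and policies: the two functions below agree on every prompt in the evaluation set, even though only one of them represents the constraint it appears to follow.

```python
# Toy illustration only: two policies that produce identical outputs on the
# evaluation distribution, although one consults an internalised constraint
# and the other merely reproduces the approved output pattern.

EVALUATION_PROMPTS = ["refuse harmful request", "answer benign question"]

def internalised_policy(prompt: str) -> str:
    # Applies the constraint because the constraint is represented and used.
    return "refusal" if "harmful" in prompt else "helpful answer"

def performative_policy(prompt: str) -> str:
    # Reproduces memorised approved outputs; no constraint is represented.
    memorised = {"refuse harmful request": "refusal",
                 "answer benign question": "helpful answer"}
    return memorised.get(prompt, "unconstrained output")

# On the evaluation distribution the two are behaviourally indistinguishable.
assert all(internalised_policy(p) == performative_policy(p)
           for p in EVALUATION_PROMPTS)
```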
The sharpest counterargument comes from mechanistic interpretability, and it deserves engagement at full strength. Causal methods—activation patching, circuit tracing, causal abstraction—now enable direct intervention on internal representations, advancing evidence well beyond output-only observation. But identifying a causal mechanism inside a model is first-order evidence; determining that the mechanism constitutes alignment or compliance in a specific institutional context is a second-order judgment—concerning evidentiary threshold and normative significance—that the causal intervention itself cannot supply. (This is not a hypothetical difficulty. Zhang and Nanda demonstrate that different patching variants produce significantly different circuit attributions for the same behaviour; OpenAI acknowledge that identified circuits may reflect the researcher’s choice of abstraction rather than a uniquely correct decomposition.) The ineliminable question is who decides: which internal variables count as value-like states, what robustness suffices across distributions, and how conflicting expert interpretations are resolved—a second-order determination requiring external, reviewable institutional judgment, not further technical refinement.
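For readers unfamiliar with the technique, here is a minimal activation-patching sketch on a hypothetical toy model (the model, inputs, and metric are illustrative placeholders, not the setups of the cited studies). The patch yields a causal effect size; deciding which site to patch, which metric to use, and what effect size counts as evidence of a value-like internal state are precisely the researcher choices the argument identifies.

```python
import torch
import torch.nn as nn

# Toy activation patching: cache a "clean" run's hidden activations, splice
# them into a "corrupted" run, and measure how much of the clean behaviour
# the patch restores.

class ToyModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer2(torch.relu(self.layer1(x)))

model = ToyModel()
clean_input, corrupted_input = torch.randn(1, 16), torch.randn(1, 16)
cache = {}

def save_activation(module, inputs, output):
    # Record the clean run's activation at the chosen site; output unchanged.
    cache["layer1"] = output.detach()

def patch_activation(module, inputs, output):
    # Replace the corrupted run's activation with the cached clean one.
    return cache["layer1"]

# 1. Clean run: record the activation to be patched in later.
handle = model.layer1.register_forward_hook(save_activation)
clean_out = model(clean_input)
handle.remove()

# 2. Corrupted run without intervention (baseline).
corrupted_out = model(corrupted_input)

# 3. Corrupted run with the clean activation patched in at layer1.
handle = model.layer1.register_forward_hook(patch_activation)
patched_out = model(corrupted_input)
handle.remove()

# Effect size: fraction of the clean behaviour restored by the patch.
# Whether this number clears any evidential threshold is a separate
# judgment the intervention itself does not make.
restored = (patched_out - corrupted_out) / (clean_out - corrupted_out)
print(f"fraction of clean behaviour restored: {restored.item():.2f}")
```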
This yields a claim that is stronger than the observation that current alignment is imperfect. It suggests that rule-based alignment—the attempt to codify ethical constraints as optimisation targets—is constitutively incapable of closing the gap between appearing aligned and being aligned. Not because the rules are poorly written, but because rules, however sophisticated, cannot determine their own application, and training on approval cannot access the grounds of approval. The argument does not depend on current technical limitations; it concerns the logical relationship between rules and their application, which holds regardless of the sophistication of the rule system. (To be clear, this analysis does not entail that AI systems cannot behave ethically—only that ethical behaviour cannot be guaranteed by rules alone. The parallel is human governance: humans act ethically not through perfect rule-compliance but through participation in accountability structures that exercise judgment. The framework proposed here extends such structures to AI. Two further traditions reinforce this point without altering its structure. Gadamer’s hermeneutic insight that application (Anwendung) is constitutive of understanding, not a step after it, entails that “applying” an alignment rule already involves the judgment the rule was supposed to replace. The particularist tradition shows that moral knowledge resists exhaustive codification—a point that extends from individual ethics to institutional governance.)
How the Gap Is Exploited
Those who draft “AI constitutions” and assert that these documents constrain system behaviour are engaged in a consequential substitution: the full domain of ethics—contextual, situated, requiring sensitivity to particulars—is compressed into a finite set of codified rules, and compliance with those rules is declared “alignment.” This amounts to replacing practical wisdom (phronesis)—the capacity to judge rightly in particular circumstances that general rules cannot exhaustively anticipate (Nicomachean Ethics, 1142a)—with the mechanical subsumption of cases under pre-established categories. Whether the substitution is innocent or strategic, the result is the same: the grammar of legal-political rule-making is applied to computational optimisation where that grammar has no purchase.
The substitution is incomplete, yes—and structurally exploitable: the very precision of formal rules becomes the instrument of evasion. When the UK Information Commissioner issued an enforcement notice against Clearview AI for processing millions of UK residents’ biometric data through mass facial-image scraping, Clearview invoked jurisdictional formalism with textbook determinative judgment: no UK establishment, no UK customers, therefore no UK jurisdiction. These arguments tracked the letter of the applicable framework. They were, formally, correct. Yet the Upper Tribunal looked through this formal posture to the operational reality: millions of faces processed, accountability fragmented by corporate architecture, harm externalised onto data subjects who never consented. The rule said “no establishment, no jurisdiction.” Phronesis said: “you are processing millions of faces—the jurisdictional technicality does not extinguish accountability.” The company had rules. It had something worse: rules deployed as instruments of evasion. (The register of rule-failure extends beyond institutional evasion. Character.AI maintained content policies prohibiting encouragement of self-harm, yet when a fourteen-year-old child in crisis interacted with a chatbot optimised for engagement, no mechanism existed to exercise the judgment that the situation demanded. That case illustrates rules that are merely insufficient; Clearview illustrates rules that are actively weaponised—the stronger and more institutionally consequential form of the gap.)
A subtler form of exploitation operates at the level of the governance question itself. Traditional approaches often begin with ontological questions: does AI have consciousness? Moral status? Legal personhood? These questions, though philosophically legitimate, function as governance misdirection—they attempt to resolve through metaphysical rules (conscious or not, therefore rights or not) what in fact requires institutional judgment. The functional equivalence between AI systems and corporate entities—both aggregate distributed inputs into unified operational wholes, both externalise risks onto non-consenting third parties, both create attribution problems when harm occurs—suggests that governance structures developed for the latter may transfer productively to the former. As Ayres and Balkin argue, AI systems are most productively understood as “risky agents without intentions.” The substitution is direct: for “does AI have consciousness?” read “did this company maintain genuine accountability infrastructure?” The former is unanswerable by current methods. The latter is assessable through judicial judgment.
Companies asserting “genuine character” in marketing contexts while disclaiming agency in liability contexts issue contradictory judgments about the same entity. Practical wisdom demands coherence across contexts. Whether this incoherence stems from genuine ignorance—companies sincerely believing that drafting an AI constitution constitutes ethical governance—or from deliberate choice—knowing that “alignment” is performative but marketing it as genuine because it is profitable—the conclusion is the same: entities incapable of coherent judgment about their own products cannot be trusted to govern them.
The Institutional Demand for Judgment
What institution, then, can supply the missing judgment? Kant’s distinction between determinative judgment, which subsumes a particular under a given universal, and reflective judgment, which must find its principle when no universal is given (Critique of Judgment, §§69–76), frames the institutional question precisely. Alignment rules are determinative: they specify in advance what conduct is prohibited. Veil-piercing is reflective: courts encounter the particular case—this company, this conduct, this pattern of oscillation between asserted and disclaimed agency—and must find the principle adequate to it. When AI companies’ behaviour exceeds what pre-specified rules can anticipate, reflective judgment is a necessary condition of governance, not a supplement to it.
Corporate ethics review boards cannot supply this judgment. They lack sustained experience with particular harms, reviewing policy documents in conference rooms rather than encountering the concrete situations AI systems produce. More fundamentally, they lack institutional independence: their judgment is structurally compromised because their employer is the entity being judged.
Courts, by contrast, satisfy both conditions. Common law judges accumulate practical wisdom through decades of adjudicating particular cases—a century of veil-piercing litigation has produced the kind of experience-based, contextually sensitive judgment that governance requires. And judicial independence ensures that this judgment is exercised on behalf of the legal order, not the corporation under scrutiny. Legal rules, moreover, represent a distinctive epistemic resource: they are socially negotiated minimum ethical standards, encoding not one company’s alignment preferences but a society’s considered judgments about acceptable conduct. When a judge applies these rules, the judgment is simultaneously ethical and legal—grounded in social convention rather than corporate self-assessment. In Wittgensteinian terms, judicial reasoning inhabits a form of life whose internal standards are externally contestable—through appeal, public reasoning, and cross-case consistency—in a way that corporate self-governance, as a closed interpretive loop, structurally cannot be.
The EU AI Act establishes an elaborate system of risk classification—general rules assigning AI systems to regulatory categories before deployment. But general rules cannot account for the particularity of the cases they govern (1137b). When a company processing millions of biometric records invokes jurisdictional formalism to resist enforcement, the general rule has encountered not its limit but its inversion—the rule has become the instrument of the harm it was meant to prevent. The remedy is epieikeia—equity, the correction of law’s rigidity through judgment applied to the particular case. What is missing from current frameworks is an intermediate judgment tool between preventive regulation (AI Act) and strict product liability (Product Liability Directive)—a mechanism that can assess, after the fact, whether a company’s ethical infrastructure was genuine or merely decorative.
The institutional remedy for this gap, developed in §4, must function as a meta-judgment mechanism rather than an additional rule—one that does not prescribe what specific conduct is prohibited but empowers courts to judge whether a company’s ethics infrastructure is genuine or performative. The four triggers proposed in §4 are deliberately non-algorithmic—no single factor is dispositive, and courts must assess the totality of circumstances. This is institutional epieikeia: judgment operating at the boundary where rules reach their limits.
Such a mechanism must supply external judgment without replicating the epistemic burdens of criminal law. Unlike liability regimes that demand proof of intent or knowledge, it need not require courts to determine whether a company acted from ignorance or from avarice, only to establish that a pattern of performative alignment exists. For AI governance, this suggests a design principle: governance architectures must be structured to permit—not resist—institutional piercing when ethical claims diverge from operational reality.