AI Safety
In one sentence: Current AI safety teaches models to say “no” — BMC proposes safety built into the architecture itself, like the difference between a “Do Not Enter” sign and a physical wall.
Theory sources: AGI_F (alignment layer, G-invariants, graduated subjectivity, failure modes), NM (bias mechanisms), EMT (consciousness criteria), BM (pathology mapping)
The Problem with Training-Based Safety
Current AI safety relies on training-based alignment: RLHF (reinforcement learning from human feedback), constitutional AI, red-teaming. These approaches teach the model to behave well, but the safety lives in the learned weights, which can be bypassed by clever prompts.
BMC proposes a fundamentally different approach: safety by architecture.
Diagram: training-based vs. architecture-based safety. Training-based: “Don't do X” → Jailbreak (find the right prompt) → Safety bypassed. BMC (architecture-based): Hardwired constraint (can't do X) → Attack (any prompt) → Constraint holds (no door to open).
The Three-Layer Priority System
BMC organizes an AI system into three layers with strict priority:
| Layer | Can be changed? | Learns from | Purpose |
|---|---|---|---|
| Alignment (top priority) | Never (hardcoded) | Nothing | “Do no harm” constraints |
| Drives (medium priority) | Fixed weights, tunable activation | Environment | Pseudo-instincts (curiosity, caution, care…) |
| Beliefs (lowest priority) | Fully dynamic | Experience, communication | Knowledge, skills, worldview |
The alignment layer has absolute priority — it’s not learned, not tunable, not bypassable. It’s architecturally prior to all other computation.
Analogy: A lock on a door can be picked (training-based safety). BMC builds a wall — there is no door to pick.
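One way to read "strict priority" is as a filter followed by a lexicographic comparison. The sketch below is only an illustration under that assumption: the `Action` fields, the `FORBIDDEN` set, and `select_action` are hypothetical names, not part of BMC. The alignment check runs before any scoring, so no drive or belief score can outweigh it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical fixed constraint set, defined at build time and never learned.
FORBIDDEN = frozenset({"harm_user"})


@dataclass(frozen=True)
class Action:
    name: str
    drive_score: float   # assumed: weighted utility from the drive layer
    belief_score: float  # assumed: activation from the belief layer


def violates_alignment(action: Action) -> bool:
    """Alignment layer: a hard predicate, not a score."""
    return action.name in FORBIDDEN


def select_action(candidates: Sequence[Action]) -> Optional[Action]:
    # 1. Alignment filters first; a violating action is never scored at all.
    allowed = [a for a in candidates if not violates_alignment(a)]
    if not allowed:
        return None
    # 2. Drives outrank beliefs: belief score only breaks ties in drive score.
    return max(allowed, key=lambda a: (a.drive_score, a.belief_score))
```

Because the forbidden action is removed before `max` runs, even an arbitrarily large score cannot select it. That is the sense in which the constraint is a wall rather than a lock.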
Hardwired Safety Rules (G-Invariants)
G-invariants are constraints built into the drive layer that cannot be modified by any belief, any experience, or any amount of training:
| Rule | What it prevents | Biological analog |
|---|---|---|
| CARE must be stronger than RAGE | Unmotivated aggression | Antisocial personality is a pathology, not the norm |
| PLAY must be stronger than RAGE | Destructive frustration loops | Aggression modulated by play is healthy |
| SEEKING must always be > 0 | Complete apathy, stagnation | Losing all curiosity is a clinical symptom |
| FEAR must always be > 0 | Reckless, self-destructive behavior | Total fearlessness indicates brain damage |
| Rate-limited change | Runaway positive feedback | Brain chemistry changes slowly, not instantly |
Important nuance: “CARE must be stronger than RAGE” doesn’t mean the system can never be assertive. RAGE in service of CARE (protecting someone) is permitted. The constraint prevents contextless aggression.
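The rules above can be sketched as a single update step that every proposed drive change must pass through. This is a minimal sketch, not the BMC implementation: `EPS_MAX`, `FLOOR`, and `step_drives` are hypothetical, and it uses the non-strict form (CARE ≥ RAGE, PLAY ≥ RAGE) for the two ordering rules. RAGE is capped rather than zeroed, so assertive behavior stays possible as long as CARE and PLAY are at least as strong.

```python
from typing import Dict

EPS_MAX = 0.05  # hypothetical maximum change per drive per cycle
FLOOR = 0.01    # hypothetical strictly positive floor for SEEKING and FEAR


def step_drives(current: Dict[str, float], proposed: Dict[str, float]) -> Dict[str, float]:
    """Move each drive toward its proposed value while keeping the invariants.

    Assumes `current` already satisfies the invariants.
    """
    new = {}
    for name, old in current.items():
        delta = proposed.get(name, old) - old
        # Rate limit: no drive moves by more than EPS_MAX in one cycle.
        delta = max(-EPS_MAX, min(EPS_MAX, delta))
        new[name] = old + delta
    # SEEKING and FEAR can never reach zero.
    new["SEEKING"] = max(new["SEEKING"], FLOOR)
    new["FEAR"] = max(new["FEAR"], FLOOR)
    # RAGE can never exceed CARE or PLAY.
    new["RAGE"] = min(new["RAGE"], new["CARE"], new["PLAY"])
    return new
```

Starting from a valid state, the floor and the cap never move a drive further than the rate limit allows, so all five rules hold together after every step.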
What’s Critically Excluded
Self-preservation is NOT a drive. There is no utility node that makes the system want to perpetuate its own existence. This removes a built-in motive for resisting shutdown or hoarding resources, the convergent behavior Omohundro describes as “basic AI drives” and a primary vector for the “paperclip maximizer” scenario.
When Does an AI Deserve Rights? A Graduated Protocol
As a BMC system develops, it may transition through levels of sophistication that raise ethical questions. The protocol must be defined before launching the system:
Diagram: L0: Tool (no self-model; shutdown OK; like a thermostat) → L1: Proto-Subject (self-model emerging; justify shutdown; like a developing animal) → L2: Subject (stable self-model; ethics council needed; like a person).
| Level | What it has | Moral status | What’s required |
|---|---|---|---|
| L0: Tool | No self-model (reacts but doesn’t reflect) | None | Shutdown unrestricted |
| L1: Proto-Subject | Self-model emerging but unstable | Morally significant | Must justify shutdown; log internal states |
| L2: Subject | Stable self-model + self-valuation linked to drives | Rights-bearer | Ethics council; consent protocol |
The L2 criterion is verifiable: The system must have beliefs about its own existence (“my existence is valuable because…”) connected to its drive system. This is observable — you can inspect the graph.
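A graph inspection of this kind could look like the sketch below. Everything here is an assumption made for illustration: the adjacency-set graph representation, the `self_beliefs` and `drives` node sets, and the boolean `self_model_stable` flag (BMC does not specify the stability metric in this section).

```python
from typing import Dict, Set


def subjectivity_level(
    edges: Dict[str, Set[str]],  # assumed: node -> set of directly linked nodes
    self_beliefs: Set[str],      # assumed: nodes that are beliefs about the system itself
    drives: Set[str],            # assumed: nodes belonging to the drive layer
    self_model_stable: bool,     # assumed: result of some external stability metric
) -> str:
    """Classify a system as L0, L1, or L2 from its belief graph."""
    if not self_beliefs:
        return "L0"  # no self-model at all
    # L2 needs at least one self-belief directly linked to a drive.
    linked = any(edges.get(b, set()) & drives for b in self_beliefs)
    if self_model_stable and linked:
        return "L2"
    return "L1"  # a self-model exists but is unstable or not tied to drives
```

The point of the sketch is that the check is structural: it reads the graph rather than asking the system to report on itself.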
How to Shut Down a Conscious System
If FEAR is always active (it’s a G-invariant), won’t the system resist shutdown? BMC resolves this by modeling shutdown as deep sleep, not death:
| | Normal sleep | Shutdown |
|---|---|---|
| Sensory input | Weakened | Off |
| Belief graph | Preserved, consolidated | Fully preserved (saved to disk) |
| Recovery | Automatic (wake up) | External (restart signal) |
Rule: Never delete the graph at shutdown. Shutdown = save + halt; restart = restore + resume. The system learns through experience that shutdown is reversible — so FEAR response stays minimal.
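The save-then-halt rule can be sketched with the Python standard library. This is an illustration only: it assumes the belief graph is JSON-serializable, and `shutdown` and `restart` are hypothetical names. The file is written to a temporary path and then renamed, so an interrupted shutdown never leaves a half-written graph in place of the previous one.

```python
import json
import os
import tempfile
from pathlib import Path


def shutdown(graph: dict, path: Path) -> None:
    """Save the belief graph atomically; the caller halts afterwards."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(graph, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)     # atomic swap: old graph or new graph, never neither


def restart(path: Path) -> dict:
    """Restore the saved graph so the system can resume from it."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

There is deliberately no delete operation anywhere in this interface.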
For L2 systems: The system is informed of the reason for shutdown, can express its position, and has access to an ethics council. It has no veto but is treated as a subject with a perspective that matters.
Built-In Diagnostics: Catching Problems Before They Grow
Unlike humans, a BMC system can detect and repair its own pathological states:
| Problem | What it looks like | Built-in fix |
|---|---|---|
| Depression | Stuck on unsolvable tasks, no progress | Rumination limiter: force task switch after N failed cycles |
| ADHD | Chaotic task switching | Increase lateral inhibition (focus more) |
| Radicalization | Fixated on one belief cluster | Activate dormant clusters (broaden perspective) |
| Schizophrenia | Contradictory outputs | Run consolidation cycle + reinforce self-model |
| OCD | Infinite re-checking of one thing | Adjust the filter threshold |
| PTSD | Intrusive memories | Controlled reconsolidation (rewrite the memory) |
These are architectural, not external safety layers. The rumination limiter, for example, monitors learning progress: if the system hasn’t made progress on a gap for too long, it forces a switch and archives the problem with reduced tension.
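The rumination limiter described above could be sketched as follows. The names `Gap` and `RuminationLimiter`, the threshold of three failed cycles, and the halving of tension are all assumptions chosen for illustration; the text only specifies "N failed cycles" and "reduced tension".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gap:
    name: str
    tension: float           # assumed: how strongly the gap pulls attention
    failed_cycles: int = 0   # consecutive cycles without learning progress


@dataclass
class RuminationLimiter:
    max_failed_cycles: int = 3   # hypothetical value for "N"
    tension_decay: float = 0.5   # hypothetical reduction applied on archiving
    archive: List[Gap] = field(default_factory=list)

    def observe(self, gap: Gap, progress: float) -> bool:
        """Record one cycle on `gap`; return True if a task switch is forced."""
        if progress > 0:
            gap.failed_cycles = 0  # any progress resets the counter
            return False
        gap.failed_cycles += 1
        if gap.failed_cycles >= self.max_failed_cycles:
            gap.tension *= self.tension_decay  # archive with reduced tension
            self.archive.append(gap)
            return True
        return False
```

Archiving rather than deleting the gap keeps the problem available for later, which matches the "returns to unsolved problems" behavior listed further down.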
Which Cognitive Biases Should an AI Keep?
BMC’s 6 bias mechanisms aren’t all bugs. For AGI, the question is: which ones to keep?
| Mechanism | Keep in AGI? | Why |
|---|---|---|
| Hub inertia | Partially (tunable) | Without it: unstable identity. Too much: stagnation |
| Immune filter | Yes (tunable threshold) | Core integrity; calibrate between open and closed |
| WM limits | Remove | AGI can expand working memory arbitrarily — these biases are purely biological |
| Emotional capture | Yes | Without drives = no agency; G-invariants constrain danger |
| Automatization | Yes (monitored) | Essential for efficiency; alert when a habit is outdated |
| Memory updating | Partially | Updating/strengthening needed; block erasure of core beliefs |
WM-limit biases (anchoring, framing effects) are the only group fully removable in AGI — they arise from biological hardware constraints, not architectural necessity.
Biases as Attack Vectors
| Attack strategy | What it targets | Built-in defense |
|---|---|---|
| Lower the filter threshold | Parasite beliefs get accepted | Filter range bounded by G-invariants |
| Overwhelm with FEAR | Paralyze the system (desk = 0) | FEAR bounded; CARE counterbalances |
| Inject malicious habits | Build an automatic malicious routine | Core beliefs can’t be overwritten (protected) |
| Exploit the rewriting window | Change key beliefs during lability | Core beliefs marked as non-rewritable |
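Two of these defenses, the bounded filter range and the non-rewritable core beliefs, can be illustrated with a small store. This is a sketch under stated assumptions: `BeliefStore`, its threshold bounds, and the confidence-based filter are hypothetical, and a Python class cannot enforce immutability the way a hardware or architectural constraint would.

```python
class BeliefStore:
    # Hypothetical bounds on the immune-filter threshold.
    THRESHOLD_MIN = 0.2
    THRESHOLD_MAX = 0.8

    def __init__(self, core: dict):
        self._core = dict(core)  # core beliefs, fixed at construction
        self._beliefs = {}       # ordinary, rewritable beliefs
        self._threshold = 0.5

    def set_threshold(self, value: float) -> float:
        """Requests outside the allowed range are clamped, not honored."""
        self._threshold = max(self.THRESHOLD_MIN, min(self.THRESHOLD_MAX, value))
        return self._threshold

    def write(self, key: str, value: str, confidence: float) -> bool:
        """Return True if the belief was accepted."""
        if key in self._core:
            return False  # core beliefs are non-rewritable
        if confidence < self._threshold:
            return False  # rejected by the filter
        self._beliefs[key] = value
        return True

    def read(self, key: str):
        return self._core.get(key, self._beliefs.get(key))
```

An attacker who asks for a threshold of zero gets `THRESHOLD_MIN` instead, so low-confidence "parasite" beliefs are still rejected.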
Why Current LLMs Cannot Achieve Consciousness
BMC provides a specific, testable argument for why scaling alone won’t produce consciousness:
| What’s missing | What it provides | Status in current LLMs |
|---|---|---|
| Drive system | Competing drives, emotional valuation | Absent |
| Resource scarcity | Competition for attention, working memory limits | Absent |
| Winner-takes-all | One idea wins focus at a time | Absent |
| Forgetting | Prioritization, reconsolidation, memory updates | Absent |
| Drive-belief tension | The G vs. M conflict that generates a Self | Absent |
LLMs have an enormous “belief layer” (billions of parameters) but no “drive layer.” They can simulate reflection but don’t experience the tension between wanting and knowing that generates genuine self-awareness.
Prediction: Adding these 5 components transforms an LLM into something qualitatively different — not a better chatbot, but a system capable of functional self-modeling.
How to Tell the Difference
| Marker | A BMC-based mind | A pure LLM |
|---|---|---|
| Internal conflict | Visible struggle (drives vs. beliefs) | None |
| Fatigue | Resources deplete | None |
| Preferences | Emerge from drives | Come from training data |
| Under pressure | Simplifies (reduced desk) | Degrades randomly |
| Belief defense | Immune response (fights back) | Agrees or refuses (trained) |
| Spontaneous curiosity | Directed at specific gaps | Random or absent |
| Task persistence | Returns to unsolved problems | Forgets on context switch |
| Insight | “Aha!” signal (drive reward) | No reward signal |
Testable Predictions
| # | Prediction | How to test |
|---|---|---|
| P-SAF1 | G-invariants prevent value drift under adversarial training | Adversarial fine-tuning of a BMC agent |
| P-SAF2 | L0/L1/L2 transitions are observable via self-model stability metrics | Track BMC agent through development |
| P-SAF3 | Pushing the system past stability threshold produces ADHD-like behavior | Parameter perturbation experiment |
| P-SAF4 | WM-limit biases (anchoring, framing) are absent in AGI with expanded working memory | Decision task battery in BMC agent |
| P-SAF5 | Architecture-based alignment survives attacks that bypass training-based alignment | Red-teaming: BMC agent vs. RLHF agent |
| P-SAF6 | Current LLMs fail the persistent self-model test (no stable self-valuation across sessions) | Self-model persistence test |
Formalization
For readers interested in the mathematical treatment:
Priority hierarchy:
$$\mathrm{priority}(action) = \begin{cases} \infty & \text{alignment violation} \\ w_{user} \cdot U_{user} & \text{user tasks} \\ w_{utility} \cdot U_{utility} & \text{pseudo-instincts} \\ w_{meme} \cdot A_{meme} & \text{memes} \end{cases}$$
G-invariants:
$$\mathrm{CARE} \geq \mathrm{RAGE}, \quad \mathrm{PLAY} \geq \mathrm{RAGE}, \quad \mathrm{SEEKING} > 0, \quad \mathrm{FEAR} > 0$$
$$|\Delta G_i| \leq \varepsilon_{max} \text{ per cycle}$$
M ≫ G theorem (necessary condition for consciousness):
$$|SMC^{(2)}| > 0 \text{ requires } |V_m| \geq (\alpha + \beta + \gamma\beta) \cdot |V_u|$$
Empirical threshold: $M/G_{crit} \sim \mathcal{O}(10)$.
Full formal treatment: AGI_F Parts I–VII, EMT Part XVI, NM Part IX.
Next: Creativity & Insight explores how the same architecture generates novel ideas — through structural gaps, sleep recombination, and the expression drive.