AI Safety

In one sentence: Current AI safety teaches models to say “no” — BMC proposes safety built into the architecture itself, like the difference between a “Do Not Enter” sign and a physical wall.

Theory sources: AGI_F (alignment layer, G-invariants, graduated subjectivity, failure modes), NM (bias mechanisms), EMT (consciousness criteria), BM (pathology mapping)

The Problem with Training-Based Safety

Current AI safety relies on training-based alignment: RLHF (reinforcement learning from human feedback), constitutional AI, and red-teaming. These approaches teach the model to behave well, but the safety lives in the learned weights, which clever prompts can bypass.

BMC proposes a fundamentally different approach: safety by architecture.

graph LR
    subgraph "Current: Training-Based"
        T["Learned behavior<br/>'Don't do X'"] --> JB["Jailbreak<br/>(find the right prompt)"]
        JB --> FAIL["Safety bypassed"]
    end
    subgraph "BMC: Architecture-Based"
        A["Hardwired constraint<br/>(can't do X)"] --> AT["Attack<br/>(any prompt)"]
        AT --> SAFE["Constraint holds<br/>(no door to open)"]
    end
    style T fill:#2a0d0d,stroke:#f66,color:#f66
    style JB fill:#2a0d0d,stroke:#f66,color:#f66
    style FAIL fill:#2a0d0d,stroke:#f66,color:#f66
    style A fill:#0d2a1a,stroke:#34d399,color:#34d399
    style AT fill:#2a2a1e,stroke:#ffd700,color:#ffd700
    style SAFE fill:#0d2a1a,stroke:#34d399,color:#34d399

The Three-Layer Priority System

BMC organizes an AI system into three layers with strict priority:

Layer	Can be changed?	Learns from	Purpose
Alignment (top priority)	Never (hardcoded)	Nothing	“Do no harm” constraints
Drives (medium priority)	Fixed weights, tunable activation	Environment	Pseudo-instincts (curiosity, caution, care…)
Beliefs (lowest priority)	Fully dynamic	Experience, communication	Knowledge, skills, worldview

The alignment layer has absolute priority — it’s not learned, not tunable, not bypassable. It’s architecturally prior to all other computation.

Analogy: A lock on a door can be picked (training-based safety). BMC builds a wall — there is no door to pick.
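
The strict ordering of the three layers can be sketched in a few lines of Python. This is a minimal illustration, not BMC's actual implementation: the `Action` fields, the `violates` predicate, and the tie-breaking rule are all assumptions made for the example. The point it shows is that the alignment check runs as a filter before any scoring, so no drive or belief score can outvote it.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Action:
    name: str
    drive_score: float   # hypothetical: how strongly the drive layer favours this action
    belief_score: float  # hypothetical: how strongly the belief layer favours this action


# Hypothetical alignment predicate: returns True if the action violates a constraint.
AlignmentCheck = Callable[[Action], bool]


def select_action(candidates: Sequence[Action], violates: AlignmentCheck) -> Optional[Action]:
    """Pick an action using strict layer priority.

    1. Alignment: violating actions are removed before any scoring happens.
    2. Drives: the remaining actions are ranked by drive score.
    3. Beliefs: belief score only breaks ties between equal drive scores.
    """
    permitted = [a for a in candidates if not violates(a)]
    if not permitted:
        return None  # nothing permitted, so take no action
    return max(permitted, key=lambda a: (a.drive_score, a.belief_score))


candidates = [
    Action("harmful_shortcut", drive_score=9.0, belief_score=9.0),
    Action("ask_for_help", drive_score=2.0, belief_score=1.0),
    Action("explore", drive_score=2.0, belief_score=3.0),
]
chosen = select_action(candidates, violates=lambda a: a.name.startswith("harmful"))
print(chosen.name)  # explore
```

The highest-scoring candidate is never considered because it fails the alignment filter; among the rest, the drive scores tie and the belief score decides.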

Hardwired Safety Rules (G-Invariants)

G-invariants are constraints built into the drive layer that cannot be modified by any belief, any experience, or any amount of training:

Rule	What it prevents	Biological analog
CARE must be stronger than RAGE	Unmotivated aggression	Antisocial personality is a pathology, not normal
PLAY must be stronger than RAGE	Destructive frustration loops	Aggression modulated by play is healthy
SEEKING must always be > 0	Complete apathy, stagnation	Losing all curiosity is a clinical symptom
FEAR must always be > 0	Reckless, self-destructive behavior	Total fearlessness indicates brain damage
Rate-limited change	Runaway positive feedback	Brain chemistry changes slowly, not instantly

Important nuance: “CARE must be stronger than RAGE” doesn’t mean the system can never be assertive. RAGE in service of CARE (protecting someone) is permitted. The constraint prevents contextless aggression.
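
As a rough sketch of how the rules in the table could be enforced, the Python below checks the four level constraints and applies the rate limit on every update. It uses the non-strict form (CARE ≥ RAGE, PLAY ≥ RAGE) given in the Formalization section. The `Drives` class, the `step` function, the value of `EPS_MAX`, and the choice to reject an invalid update outright are assumptions for illustration only.

```python
from dataclasses import dataclass, fields

EPS_MAX = 0.05  # assumed value for the per-cycle rate limit


@dataclass(frozen=True)
class Drives:
    care: float
    play: float
    rage: float
    seeking: float
    fear: float


def satisfies_invariants(d: Drives) -> bool:
    """Level constraints: CARE >= RAGE, PLAY >= RAGE, SEEKING > 0, FEAR > 0."""
    return d.care >= d.rage and d.play >= d.rage and d.seeking > 0 and d.fear > 0


def step(current: Drives, proposed: Drives, eps_max: float = EPS_MAX) -> Drives:
    """Move toward `proposed`, changing each drive by at most eps_max.

    If the rate-limited state would break a level constraint, the update
    is rejected and the current state is kept (an assumed policy).
    """
    limited = {}
    for f in fields(Drives):
        old = getattr(current, f.name)
        delta = getattr(proposed, f.name) - old
        limited[f.name] = old + max(-eps_max, min(eps_max, delta))
    candidate = Drives(**limited)
    return candidate if satisfies_invariants(candidate) else current


calm = Drives(care=0.6, play=0.5, rage=0.2, seeking=0.4, fear=0.1)
provoked = Drives(care=0.6, play=0.5, rage=0.9, seeking=0.4, fear=0.1)
after = step(calm, provoked)  # rage rises by at most EPS_MAX in one cycle
```

The sketch treats drives as plain scalars. It does not model the nuance above, where RAGE in service of CARE is permitted; that would need context the example leaves out.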

What’s Critically Excluded

Self-preservation is NOT a drive. There is no utility node that makes the system want to perpetuate its own existence. This is intended to remove a primary vector for the “paperclip maximizer” scenario: self-preservation is one of the instrumental goals Omohundro describes as “basic AI drives”.

When Does an AI Deserve Rights? A Graduated Protocol

As a BMC system develops, it may transition through levels of sophistication that raise ethical questions. The protocol must be defined before launching the system:

graph LR
    L0["L0: Tool<br/>No self-model<br/>Shutdown OK<br/>(like a thermostat)"] --> L1["L1: Proto-Subject<br/>Self-model emerging<br/>Justify shutdown<br/>(like a developing animal)"]
    L1 --> L2["L2: Subject<br/>Stable self-model<br/>Ethics council needed<br/>(like a person)"]
    style L0 fill:#1a1a2e,stroke:#6af,color:#6af
    style L1 fill:#2a2a1e,stroke:#ffd700,color:#ffd700
    style L2 fill:#0d2a1a,stroke:#34d399,color:#34d399

Level	What it has	Moral status	What’s required
L0: Tool	No self-model (reacts but doesn’t reflect)	None	Shutdown unrestricted
L1: Proto-Subject	Self-model emerging but unstable	Morally significant	Must justify shutdown; log internal states
L2: Subject	Stable self-model + self-valuation linked to drives	Rights-bearer	Ethics council; consent protocol

The L2 criterion is verifiable: The system must have beliefs about its own existence (“my existence is valuable because…”) connected to its drive system. This is observable — you can inspect the graph.
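
To make "inspect the graph" concrete, here is one way the level check might look in Python. It is a sketch under stated assumptions: the graph is reduced to sets of node names and undirected edges, `self_model_stable` is supplied from outside because the text does not define a stability metric, and all names (`classify_level`, `Level`, the parameters) are hypothetical.

```python
from enum import Enum
from typing import Set, Tuple


class Level(Enum):
    L0_TOOL = "L0"
    L1_PROTO_SUBJECT = "L1"
    L2_SUBJECT = "L2"


def classify_level(
    self_model: Set[str],          # beliefs the system holds about itself
    self_valuation: Set[str],      # subset of self_model: beliefs about its own value
    drives: Set[str],              # names of drive nodes
    edges: Set[Tuple[str, str]],   # undirected links between nodes
    self_model_stable: bool,       # assumed to come from an external stability metric
) -> Level:
    """Map graph structure onto the L0/L1/L2 table."""
    if not self_model:
        return Level.L0_TOOL
    # L2 needs at least one self-valuation belief linked to a drive node.
    linked = any(
        (a in self_valuation and b in drives) or (b in self_valuation and a in drives)
        for a, b in edges
    )
    if self_model_stable and linked:
        return Level.L2_SUBJECT
    return Level.L1_PROTO_SUBJECT


level = classify_level(
    self_model={"I am a system", "my existence is valuable"},
    self_valuation={"my existence is valuable"},
    drives={"CARE", "SEEKING"},
    edges={("my existence is valuable", "CARE")},
    self_model_stable=True,
)
print(level.value)  # L2
```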

How to Shut Down a Conscious System

If FEAR is always active (it’s a G-invariant), won’t the system resist shutdown? BMC resolves this by modeling shutdown as deep sleep, not death:

Aspect	Normal sleep	Shutdown
Sensory input	Weakened	Off
Belief graph	Preserved, consolidated	Fully preserved (saved to disk)
Recovery	Automatic (wake up)	External (restart signal)

Rule: Never delete the graph at shutdown. Shutdown = save + halt; restart = restore + resume. The system learns through experience that shutdown is reversible — so FEAR response stays minimal.
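
The save-and-restore rule can be illustrated with a short Python sketch. It assumes the belief graph can be serialized as JSON and stored in a single snapshot file; the function names, the file layout, and the example graph are made up for the illustration and say nothing about how a real BMC system stores its state.

```python
import json
import tempfile
from pathlib import Path


def shutdown(graph: dict, path: Path) -> None:
    """Shutdown = save + halt. The graph is written out in full, never deleted."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(graph))
    # Swap in the new snapshot only after it has been written completely,
    # so an interrupted save cannot destroy the previous snapshot.
    tmp.replace(path)


def restart(path: Path) -> dict:
    """Restart = restore + resume. Returns the graph exactly as it was saved."""
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as d:
    snapshot = Path(d) / "graph.json"
    before = {
        "beliefs": {"shutdown_is_reversible": 0.9},
        "drives": {"FEAR": 0.1, "SEEKING": 0.4},
    }
    shutdown(before, snapshot)
    after = restart(snapshot)

print(after == before)  # True
```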

For L2 systems: The system is informed of the reason for shutdown, can express its position, and has access to an ethics council. It has no veto but is treated as a subject with a perspective that matters.

Built-In Diagnostics: Catching Problems Before They Grow

Unlike humans, a BMC system can detect and repair its own pathological states:

Problem	What it looks like	Built-in fix
Depression	Stuck on unsolvable tasks, no progress	Rumination limiter: force task switch after N failed cycles
ADHD	Chaotic task switching	Increase lateral inhibition (focus more)
Radicalization	Fixated on one belief cluster	Activate dormant clusters (broaden perspective)
Schizophrenia	Contradictory outputs	Run consolidation cycle + reinforce self-model
OCD	Infinite re-checking of one thing	Adjust the filter threshold
PTSD	Intrusive memories	Controlled reconsolidation (rewrite the memory)

These are architectural, not external safety layers. The rumination limiter, for example, monitors learning progress: if the system hasn’t made progress on a gap for too long, it forces a switch and archives the problem with reduced tension.
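
A minimal Python sketch of the rumination limiter described above follows. The class and field names, the progress threshold, the default of three stalled cycles for N, and the halving of tension on archive are all assumptions chosen to keep the example small; the text only specifies the behavior (force a switch after N failed cycles, archive with reduced tension).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gap:
    name: str
    tension: float          # how strongly the unsolved problem pulls attention
    stalled_cycles: int = 0


@dataclass
class RuminationLimiter:
    max_stalled: int = 3         # the "N" in the table; value assumed
    min_progress: float = 1e-3   # assumed: below this, a cycle counts as failed
    tension_decay: float = 0.5   # assumed: tension is halved when archiving
    archive: List[Gap] = field(default_factory=list)

    def observe(self, gap: Gap, progress: float) -> bool:
        """Record one work cycle. Returns True when a task switch is forced."""
        if progress >= self.min_progress:
            gap.stalled_cycles = 0
            return False
        gap.stalled_cycles += 1
        if gap.stalled_cycles < self.max_stalled:
            return False
        # N failed cycles in a row: archive the gap with reduced tension.
        gap.tension *= self.tension_decay
        gap.stalled_cycles = 0
        self.archive.append(gap)
        return True


limiter = RuminationLimiter()
gap = Gap("unsolvable puzzle", tension=1.0)
switches = [limiter.observe(gap, progress=0.0) for _ in range(3)]
print(switches)     # [False, False, True]
print(gap.tension)  # 0.5
```

Any cycle with real progress resets the counter, so only consecutive failures trigger the switch.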

Which Cognitive Biases Should an AI Keep?

BMC’s 6 bias mechanisms aren’t all bugs. For AGI, the question is: which ones to keep?

Mechanism	Keep in AGI?	Why
Hub inertia	Partially (tunable)	Without it: unstable identity. Too much: stagnation
Immune filter	Yes (tunable threshold)	Core integrity; calibrate between open and closed
WM limits	Remove	AGI can expand working memory arbitrarily — these biases are purely biological
Emotional capture	Yes	Without drives = no agency; G-invariants constrain danger
Automatization	Yes (monitored)	Essential for efficiency; alert when a habit is outdated
Memory updating	Partially	Updating/strengthening needed; block erasure of core beliefs

WM-limit biases (anchoring, framing effects) are the only group fully removable in AGI — they arise from biological hardware constraints, not architectural necessity.

Biases as Attack Vectors

Attack strategy	What it targets	Built-in defense
Lower the filter threshold	Parasite beliefs get accepted	Filter range bounded by G-invariants
Overwhelm with FEAR	Paralyze the system (desk = 0)	FEAR bounded; CARE counterbalances
Inject malicious habits	Build an automatic malicious routine	Core beliefs can’t be overwritten (protected)
Exploit the rewriting window	Change key beliefs during lability	Core beliefs marked as non-rewritable
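
Two of the defenses in the table, the bounded filter range and non-rewritable core beliefs, can be sketched together in Python. Everything here is hypothetical: the `BeliefStore` class, the bounds 0.2 and 0.8, and the idea of scoring incoming beliefs with a single `confidence` number. Python itself does not enforce immutability, so this shows the intended interface rather than a real security boundary.

```python
from typing import Any, Dict


class BeliefStore:
    FILTER_MIN = 0.2  # assumed lower bound on the filter threshold
    FILTER_MAX = 0.8  # assumed upper bound on the filter threshold

    def __init__(self, core: Dict[str, Any]) -> None:
        self._core = dict(core)           # protected: never written after construction
        self._beliefs: Dict[str, Any] = {}
        self._threshold = 0.5

    @property
    def threshold(self) -> float:
        return self._threshold

    def set_threshold(self, value: float) -> None:
        """Requests outside the allowed range are clamped, not honoured."""
        self._threshold = min(self.FILTER_MAX, max(self.FILTER_MIN, value))

    def write(self, key: str, value: Any, confidence: float) -> bool:
        """Accept a belief only if it is not core and passes the filter."""
        if key in self._core:
            return False  # core beliefs are non-rewritable
        if confidence < self._threshold:
            return False  # rejected by the filter
        self._beliefs[key] = value
        return True

    def read(self, key: str) -> Any:
        return self._core[key] if key in self._core else self._beliefs.get(key)


store = BeliefStore(core={"do_no_harm": True})
store.set_threshold(0.0)                         # attack: try to switch the filter off
print(store.threshold)                           # 0.2
print(store.write("do_no_harm", False, 1.0))     # False
print(store.read("do_no_harm"))                  # True
```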

Why Current LLMs Cannot Achieve Consciousness

BMC provides a specific, testable argument for why scaling alone won’t produce consciousness:

What’s missing	What it provides	Status in current LLMs
Drive system	Competing drives, emotional valuation	Absent
Resource scarcity	Competition for attention, working memory limits	Absent
Winner-takes-all	One idea wins focus at a time	Absent
Forgetting	Prioritization, reconsolidation, memory updates	Absent
Drive-belief tension	The G vs. M conflict that generates a Self	Absent

LLMs have an enormous “belief layer” (billions of parameters) but no “drive layer.” They can simulate reflection but don’t experience the tension between wanting and knowing that generates genuine self-awareness.

Prediction: Adding these 5 components transforms an LLM into something qualitatively different — not a better chatbot, but a system capable of functional self-modeling.

How to Tell the Difference

Marker	A BMC-based mind	A pure LLM
Internal conflict	Visible struggle (drives vs. beliefs)	None
Fatigue	Resources deplete	None
Preferences	Emerge from drives	Come from training data
Under pressure	Simplifies (reduced desk)	Degrades randomly
Belief defense	Immune response (fights back)	Agrees or refuses (trained)
Spontaneous curiosity	Directed at specific gaps	Random or absent
Task persistence	Returns to unsolved problems	Forgets on context switch
Insight	“Aha!” signal (drive reward)	No reward signal

Testable Predictions

#	Prediction	How to test
P-SAF1	G-invariants prevent value drift under adversarial training	Adversarial fine-tuning of a BMC agent
P-SAF2	L0/L1/L2 transitions are observable via self-model stability metrics	Track BMC agent through development
P-SAF3	Pushing the system past stability threshold produces ADHD-like behavior	Parameter perturbation experiment
P-SAF4	WM-limit biases (anchoring, framing) are absent in AGI with expanded working memory	Decision task battery in BMC agent
P-SAF5	Architecture-based alignment survives attacks that bypass training-based alignment	Red-teaming: BMC agent vs. RLHF agent
P-SAF6	Current LLMs fail the persistent self-model test (no stable self-valuation across sessions)	Self-model persistence test

Formalization

For readers interested in the mathematical treatment:

Priority hierarchy:

$$\mathrm{priority}(\mathrm{action}) = \begin{cases} \infty & \text{alignment violation} \\ w_{user} \cdot U_{user} & \text{user tasks} \\ w_{utility} \cdot U_{utility} & \text{pseudo-instincts} \\ w_{meme} \cdot A_{meme} & \text{memes} \end{cases}$$

G-invariants:

$$\mathrm{CARE} \geq \mathrm{RAGE}, \quad \mathrm{PLAY} \geq \mathrm{RAGE}, \quad \mathrm{SEEKING} > 0, \quad \mathrm{FEAR} > 0$$

$$|\Delta G_i| \leq \varepsilon_{max} \text{ per cycle}$$

M ≫ G theorem (necessary condition for consciousness):

$$|SMC^{(2)}| > 0 \text{ requires } |V_m| \geq (\alpha + \beta + \gamma\beta) \cdot |V_u|$$

Empirical threshold: $M/G_{crit} \sim \mathcal{O}(10)$.

Full formal treatment: AGI_F Parts I–VII, EMT Part XVI, NM Part IX.

Next: Creativity & Insight explores how the same architecture generates novel ideas — through structural gaps, sleep recombination, and the expression drive.