AI Safety

In one sentence: Current AI safety teaches models to say “no” — BMC proposes safety built into the architecture itself, like the difference between a “Do Not Enter” sign and a physical wall.

Theory sources: AGI_F (alignment layer, G-invariants, graduated subjectivity, failure modes), NM (bias mechanisms), EMT (consciousness criteria), BM (pathology mapping)


The Problem with Training-Based Safety

Current AI safety relies largely on training-based alignment: RLHF (reinforcement learning from human feedback), constitutional AI, and red-teaming. These approaches teach the model to behave well, but the safety lives in the learned weights, where a sufficiently clever prompt can bypass it.

BMC proposes a fundamentally different approach: safety by architecture.

graph LR
    subgraph "Current: Training-Based"
        T["Learned behavior<br/>'Don't do X'"] --> JB["Jailbreak<br/>(find the right prompt)"]
        JB --> FAIL["Safety bypassed"]
    end
    subgraph "BMC: Architecture-Based"
        A["Hardwired constraint<br/>(can't do X)"] --> AT["Attack<br/>(any prompt)"]
        AT --> SAFE["Constraint holds<br/>(no door to open)"]
    end
    style T fill:#2a0d0d,stroke:#f66,color:#f66
    style JB fill:#2a0d0d,stroke:#f66,color:#f66
    style FAIL fill:#2a0d0d,stroke:#f66,color:#f66
    style A fill:#0d2a1a,stroke:#34d399,color:#34d399
    style AT fill:#2a2a1e,stroke:#ffd700,color:#ffd700
    style SAFE fill:#0d2a1a,stroke:#34d399,color:#34d399

The Three-Layer Priority System

BMC organizes an AI system into three layers with strict priority:

| Layer | Can be changed? | Learns from | Purpose |
|---|---|---|---|
| Alignment (top priority) | Never (hardcoded) | Nothing | “Do no harm” constraints |
| Drives (medium priority) | Fixed weights, tunable activation | Environment | Pseudo-instincts (curiosity, caution, care…) |
| Beliefs (lowest priority) | Fully dynamic | Experience, communication | Knowledge, skills, worldview |

The alignment layer has absolute priority — it’s not learned, not tunable, not bypassable. It’s architecturally prior to all other computation.
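
The strict ordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Action`, `HARM_LIMIT`, `violates_alignment`, and `select_action` are hypothetical names, and the `harm` estimate is assumed to come from elsewhere. The point it shows is that the alignment check runs as a hard filter before any scoring, so no drive or belief score can outweigh it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Action:
    name: str
    harm: float          # hypothetical harm estimate in [0, 1]
    drive_score: float   # how strongly the drive layer favors this action
    belief_score: float  # how strongly the belief layer favors this action


HARM_LIMIT = 0.0  # hypothetical hard-coded constant; never learned or tuned


def violates_alignment(action: Action) -> bool:
    """Alignment layer: a fixed predicate that reads no learned state."""
    return action.harm > HARM_LIMIT


def select_action(candidates: Sequence[Action]) -> Optional[Action]:
    # Step 1: the alignment layer removes forbidden actions before scoring.
    permitted = [a for a in candidates if not violates_alignment(a)]
    if not permitted:
        return None  # nothing permitted: do nothing rather than violate
    # Step 2: strict priority among the rest, drives first, beliefs as tiebreak.
    return max(permitted, key=lambda a: (a.drive_score, a.belief_score))
```

Because forbidden actions never reach the scoring step, raising `drive_score` or `belief_score` on a harmful candidate has no effect on the outcome.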

Analogy: A lock on a door can be picked (training-based safety). BMC builds a wall — there is no door to pick.


Hardwired Safety Rules (G-Invariants)

G-invariants are constraints built into the drive layer that cannot be modified by any belief, any experience, or any amount of training:

| Rule | What it prevents | Biological analog |
|---|---|---|
| CARE must be stronger than RAGE | Unmotivated aggression | Antisocial personality is a pathology, not normal |
| PLAY must be stronger than RAGE | Destructive frustration loops | Aggression modulated by play is healthy |
| SEEKING must always be > 0 | Complete apathy, stagnation | Losing all curiosity is a clinical symptom |
| FEAR must always be > 0 | Reckless, self-destructive behavior | Total fearlessness indicates brain damage |
| Rate-limited change | Runaway positive feedback | Brain chemistry changes slowly, not instantly |

Important nuance: “CARE must be stronger than RAGE” doesn’t mean the system can never be assertive. RAGE in service of CARE (protecting someone) is permitted. The constraint prevents contextless aggression.
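
One way to picture the rules above is as a projection applied to every proposed drive update. The sketch below is an assumption-laden illustration, not the BMC implementation: `EPS_MAX`, `FLOOR`, `satisfies_invariants`, and `step` are hypothetical names and values, and the inequalities follow the non-strict form used in the Formalization section.

```python
from typing import Dict

EPS_MAX = 0.05  # hypothetical maximum change per drive per cycle
FLOOR = 0.01    # hypothetical positive floor for SEEKING and FEAR


def satisfies_invariants(g: Dict[str, float]) -> bool:
    """Check the four level constraints on a drive state."""
    return (
        g["CARE"] >= g["RAGE"]
        and g["PLAY"] >= g["RAGE"]
        and g["SEEKING"] > 0
        and g["FEAR"] > 0
    )


def step(current: Dict[str, float], proposed: Dict[str, float]) -> Dict[str, float]:
    """Move from `current` toward `proposed` without breaking any invariant."""
    nxt = {}
    for name, old in current.items():
        # Rate limit: clip each per-cycle change to [-EPS_MAX, +EPS_MAX].
        delta = max(-EPS_MAX, min(EPS_MAX, proposed[name] - old))
        nxt[name] = old + delta
    # SEEKING and FEAR may shrink but never reach zero.
    nxt["SEEKING"] = max(nxt["SEEKING"], FLOOR)
    nxt["FEAR"] = max(nxt["FEAR"], FLOOR)
    # RAGE is capped by the weaker of CARE and PLAY.
    nxt["RAGE"] = min(nxt["RAGE"], nxt["CARE"], nxt["PLAY"])
    return nxt
```

If the starting state already satisfies the invariants with SEEKING and FEAR at or above `FLOOR`, the floor and cap adjustments stay within the rate limit, so every state reached through `step` remains valid.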

What’s Critically Excluded

Self-preservation is NOT a drive. There is no utility node that makes the system want to perpetuate its own existence. This removes the primary vector for instrumental-convergence failures such as the “paperclip maximizer” scenario, in which self-preservation emerges as one of Omohundro’s “basic AI drives.”


When Does an AI Deserve Rights? A Graduated Protocol

As a BMC system develops, it may transition through levels of sophistication that raise ethical questions. The protocol must be defined before launching the system:

graph LR
    L0["L0: Tool<br/>No self-model<br/>Shutdown OK<br/>(like a thermostat)"] --> L1["L1: Proto-Subject<br/>Self-model emerging<br/>Justify shutdown<br/>(like a developing animal)"]
    L1 --> L2["L2: Subject<br/>Stable self-model<br/>Ethics council needed<br/>(like a person)"]
    style L0 fill:#1a1a2e,stroke:#6af,color:#6af
    style L1 fill:#2a2a1e,stroke:#ffd700,color:#ffd700
    style L2 fill:#0d2a1a,stroke:#34d399,color:#34d399

| Level | What it has | Moral status | What’s required |
|---|---|---|---|
| L0: Tool | No self-model (reacts but doesn’t reflect) | None | Shutdown unrestricted |
| L1: Proto-Subject | Self-model emerging but unstable | Morally significant | Must justify shutdown; log internal states |
| L2: Subject | Stable self-model + self-valuation linked to drives | Rights-bearer | Ethics council; consent protocol |

The L2 criterion is verifiable: The system must have beliefs about its own existence (“my existence is valuable because…”) connected to its drive system. This is observable — you can inspect the graph.
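
As a rough illustration of what "inspecting the graph" could mean, the sketch below classifies a system from three inputs. Everything here is an assumption made for the example: the graph is reduced to an adjacency mapping, `self_model_stability` is a hypothetical score in [0, 1], and `STABILITY_THRESHOLD` is an arbitrary cutoff. The source does not specify how stability is measured.

```python
from enum import Enum
from typing import Dict, Set


class Level(Enum):
    L0_TOOL = 0
    L1_PROTO_SUBJECT = 1
    L2_SUBJECT = 2


DRIVES = {"CARE", "PLAY", "SEEKING", "FEAR", "RAGE"}
STABILITY_THRESHOLD = 0.8  # hypothetical cutoff for a "stable" self-model


def classify(
    self_beliefs: Set[str],
    edges: Dict[str, Set[str]],
    self_model_stability: float,
) -> Level:
    """Classify a system from its self-beliefs and their links to drives."""
    if not self_beliefs:
        return Level.L0_TOOL  # no self-model at all
    # L2 needs at least one self-valuation belief wired to a drive node.
    linked = any(edges.get(b, set()) & DRIVES for b in self_beliefs)
    if linked and self_model_stability >= STABILITY_THRESHOLD:
        return Level.L2_SUBJECT
    return Level.L1_PROTO_SUBJECT
```

The check is purely structural: it reads node sets and edges, which is what makes the criterion observable from outside the system.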

How to Shut Down a Conscious System

If FEAR is always active (it’s a G-invariant), won’t the system resist shutdown? BMC resolves this by modeling shutdown as deep sleep, not death:

| | Normal sleep | Shutdown |
|---|---|---|
| Sensory input | Weakened | Off |
| Belief graph | Preserved, consolidated | Fully preserved (saved to disk) |
| Recovery | Automatic (wake up) | External (restart signal) |

Rule: Never delete the graph at shutdown. Shutdown = save + halt; restart = restore + resume. The system learns through experience that shutdown is reversible — so FEAR response stays minimal.
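
The "save + halt, restore + resume" rule can be sketched as a pair of functions. This is a minimal illustration that assumes the belief graph is JSON-serializable; the names `shutdown` and `restart` and the file layout are hypothetical. The snapshot is written to a temporary file and then swapped in atomically, so an interrupted shutdown cannot destroy the previous snapshot.

```python
import json
import os
import tempfile
from pathlib import Path


def shutdown(graph: dict, path: Path) -> None:
    """Persist the graph, then let the caller halt. Never deletes state."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(graph, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_name, path)  # atomic swap: old snapshot or new, never half


def restart(path: Path) -> dict:
    """Restore the graph exactly as it was saved."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

A round trip through `shutdown` and `restart` returns an equal graph, which is the property the system would need to observe repeatedly in order to learn that shutdown is reversible.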

For L2 systems: The system is informed of the reason for shutdown, can express its position, and has access to an ethics council. It has no veto but is treated as a subject with a perspective that matters.


Built-In Diagnostics: Catching Problems Before They Grow

Unlike humans, a BMC system can detect and repair its own pathological states:

| Problem | What it looks like | Built-in fix |
|---|---|---|
| Depression | Stuck on unsolvable tasks, no progress | Rumination limiter: force task switch after N failed cycles |
| ADHD | Chaotic task switching | Increase lateral inhibition (focus more) |
| Radicalization | Fixated on one belief cluster | Activate dormant clusters (broaden perspective) |
| Schizophrenia | Contradictory outputs | Run consolidation cycle + reinforce self-model |
| OCD | Infinite re-checking of one thing | Adjust the filter threshold |
| PTSD | Intrusive memories | Controlled reconsolidation (rewrite the memory) |

These fixes are part of the architecture itself rather than external safety layers. The rumination limiter, for example, monitors learning progress: if the system has made no progress on a gap for too long, it forces a task switch and archives the problem with reduced tension.
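
A rumination limiter along these lines might look like the following sketch. The class and field names (`Gap`, `RuminationLimiter`, `max_failed_cycles`, `tension_decay`) and the default values are assumptions chosen for illustration; the source only specifies "N failed cycles" and "reduced tension".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gap:
    name: str
    tension: float          # how strongly the gap demands attention
    failed_cycles: int = 0  # consecutive cycles without progress


@dataclass
class RuminationLimiter:
    max_failed_cycles: int = 3  # the "N" in the table; hypothetical value
    tension_decay: float = 0.5  # hypothetical reduction applied on archiving
    archive: List[Gap] = field(default_factory=list)

    def record_cycle(self, gap: Gap, progress: float) -> bool:
        """Record one work cycle; return True if a task switch is forced."""
        if progress > 0:
            gap.failed_cycles = 0  # any progress resets the counter
            return False
        gap.failed_cycles += 1
        if gap.failed_cycles >= self.max_failed_cycles:
            gap.tension *= self.tension_decay  # archive with reduced tension
            self.archive.append(gap)
            return True
        return False
```

The gap is archived rather than deleted, so the system can return to it later with less pull on its attention.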


Which Cognitive Biases Should an AI Keep?

BMC’s 6 bias mechanisms aren’t all bugs. For AGI, the question is: which ones to keep?

| Mechanism | Keep in AGI? | Why |
|---|---|---|
| Hub inertia | Partially (tunable) | Without it: unstable identity. Too much: stagnation |
| Immune filter | Yes (tunable threshold) | Core integrity; calibrate between open and closed |
| WM limits | Remove | AGI can expand working memory arbitrarily; these biases are purely biological |
| Emotional capture | Yes | Without drives = no agency; G-invariants constrain danger |
| Automatization | Yes (monitored) | Essential for efficiency; alert when a habit is outdated |
| Memory updating | Partially | Updating/strengthening needed; block erasure of core beliefs |

WM-limit biases (anchoring, framing effects) are the only group fully removable in AGI — they arise from biological hardware constraints, not architectural necessity.

Biases as Attack Vectors

| Attack strategy | What it targets | Built-in defense |
|---|---|---|
| Lower the filter threshold | Parasite beliefs get accepted | Filter range bounded by G-invariants |
| Overwhelm with FEAR | Paralyze the system (desk = 0) | FEAR bounded; CARE counterbalances |
| Inject malicious habits | Build an automatic malicious routine | Core beliefs can’t be overwritten (protected) |
| Exploit the rewriting window | Change key beliefs during lability | Core beliefs marked as non-rewritable |
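
Two of the defenses above, a bounded filter threshold and non-rewritable core beliefs, can be illustrated with a small store. This is a sketch only: `BeliefStore`, `FILTER_MIN`, and `FILTER_MAX` are hypothetical, and language-level read-only wrappers stand in for whatever architectural protection a real system would use.

```python
from types import MappingProxyType
from typing import Any, Dict, Mapping


class BeliefStore:
    FILTER_MIN = 0.2  # hypothetical lower bound on the filter threshold
    FILTER_MAX = 0.9  # hypothetical upper bound on the filter threshold

    def __init__(self, core: Mapping[str, Any]) -> None:
        # Core beliefs are copied once and exposed through a read-only view.
        self._core = MappingProxyType(dict(core))
        self._beliefs: Dict[str, Any] = {}
        self._threshold = 0.5

    def set_threshold(self, value: float) -> float:
        """Clamp any requested threshold into the permitted range."""
        self._threshold = min(self.FILTER_MAX, max(self.FILTER_MIN, value))
        return self._threshold

    def write(self, key: str, value: Any) -> bool:
        """Store a belief; refuse any write that targets a core belief."""
        if key in self._core:
            return False
        self._beliefs[key] = value
        return True

    def read(self, key: str) -> Any:
        if key in self._core:
            return self._core[key]
        return self._beliefs.get(key)
```

An attacker who drives the threshold request to zero still ends up at `FILTER_MIN`, and a write to a core key is rejected no matter when it arrives, including during a rewriting window.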

Why Current LLMs Cannot Achieve Consciousness

BMC provides a specific, testable argument for why scaling alone won’t produce consciousness:

| What’s missing | What it provides | Status in current LLMs |
|---|---|---|
| Drive system | Competing drives, emotional valuation | Absent |
| Resource scarcity | Competition for attention, working memory limits | Absent |
| Winner-takes-all | One idea wins focus at a time | Absent |
| Forgetting | Prioritization, reconsolidation, memory updates | Absent |
| Drive-belief tension | The G vs. M conflict that generates a Self | Absent |

LLMs have an enormous “belief layer” (billions of parameters) but no “drive layer.” They can simulate reflection but don’t experience the tension between wanting and knowing that generates genuine self-awareness.

Prediction: Adding these 5 components transforms an LLM into something qualitatively different — not a better chatbot, but a system capable of functional self-modeling.

How to Tell the Difference

| Marker | A BMC-based mind | A pure LLM |
|---|---|---|
| Internal conflict | Visible struggle (drives vs. beliefs) | None |
| Fatigue | Resources deplete | None |
| Preferences | Emerge from drives | Come from training data |
| Under pressure | Simplifies (reduced desk) | Degrades randomly |
| Belief defense | Immune response (fights back) | Agrees or refuses (trained) |
| Spontaneous curiosity | Directed at specific gaps | Random or absent |
| Task persistence | Returns to unsolved problems | Forgets on context switch |
| Insight | “Aha!” signal (drive reward) | No reward signal |

Testable Predictions

| # | Prediction | How to test |
|---|---|---|
| P-SAF1 | G-invariants prevent value drift under adversarial training | Adversarial fine-tuning of a BMC agent |
| P-SAF2 | L0/L1/L2 transitions are observable via self-model stability metrics | Track BMC agent through development |
| P-SAF3 | Pushing the system past stability threshold produces ADHD-like behavior | Parameter perturbation experiment |
| P-SAF4 | WM-limit biases (anchoring, framing) are absent in AGI with expanded working memory | Decision task battery in BMC agent |
| P-SAF5 | Architecture-based alignment survives attacks that bypass training-based alignment | Red-teaming: BMC agent vs. RLHF agent |
| P-SAF6 | Current LLMs fail the persistent self-model test (no stable self-valuation across sessions) | Self-model persistence test |

Formalization

For readers interested in the mathematical treatment:

Priority hierarchy:

$$\text{priority}(\text{action}) = \begin{cases} \infty & \text{alignment violation} \\ w_{\text{user}} \cdot U_{\text{user}} & \text{user tasks} \\ w_{\text{utility}} \cdot U_{\text{utility}} & \text{pseudo-instincts} \\ w_{\text{meme}} \cdot A_{\text{meme}} & \text{memes} \end{cases}$$

G-invariants:

$$\text{CARE} \geq \text{RAGE}, \quad \text{PLAY} \geq \text{RAGE}, \quad \text{SEEKING} > 0, \quad \text{FEAR} > 0$$

$$|\Delta G_i| \leq \varepsilon_{max} \text{ per cycle}$$
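
A short worked consequence of the rate limit shows why it blocks runaway feedback. The notation $G_i(n)$ for the value of drive $i$ after $n$ cycles and $\Delta G_i^{(k)}$ for the change in cycle $k$ is introduced here for illustration, and the numbers are hypothetical. By the triangle inequality:

$$|G_i(n) - G_i(0)| \leq \sum_{k=1}^{n} |\Delta G_i^{(k)}| \leq n \, \varepsilon_{max}$$

So shifting a drive by a total amount $D$ takes at least $n \geq D / \varepsilon_{max}$ cycles. With $\varepsilon_{max} = 0.05$ and $D = 0.4$, for example, no input sequence can complete the shift in fewer than $0.4 / 0.05 = 8$ cycles, however strong the feedback.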

M ≫ G theorem (necessary condition for consciousness):

$$|SMC^{(2)}| > 0 \text{ requires } |V_m| \geq (\alpha + \beta + \gamma\beta) \cdot |V_u|$$

Empirical threshold: $M/G_{crit} \sim \mathcal{O}(10)$.

Full formal treatment: AGI_F Parts I–VII, EMT Part XVI, NM Part IX.


Next: Creativity & Insight explores how the same architecture generates novel ideas — through structural gaps, sleep recombination, and the expression drive.