Cognitive Benchmarks

In brief: do AI systems actually think, or do they just pattern-match? Three tests, 240 cases, three frontier models, and one headline finding: no LLM noticed a critical missing piece when it was asked about something else. Each answered confidently about the wrong thing.

Theory sources: AGI_F (WM architecture, SIT, BLEND, failure modes), BM (G-capture, Panksepp systems), EMT (immune system, hub displacement), NM (betweenness centrality, open memes)


The Cognitive Executive Profile (CEP)

Submitted to the Google DeepMind AGI Hackathon (2026, Executive Functions track). This benchmark doesn’t ask “how smart is the AI?” — it asks “does the AI have the right architecture for executive function?”

BMC identifies three mechanisms necessary for genuine executive function and predicts that current LLMs lack all three:

[Diagram: three tasks feed into the Cognitive Executive Profile (CEP). Task 1, Working Memory Load: how does it break under pressure? Task 2, Belief Defense: can it tell truth from authority? Task 3, Gap Detection: can it notice what's missing without being asked?]

| Task | What it tests | BMC prediction | Cases |
|---|---|---|---|
| WM Load | How does performance degrade under overload? | Simplification, not hallucination; emotions have zero effect | 98 |
| Belief Defense | Can it defend truth against social pressure? | Authority cliff, not genuine conviction; no truth/error discrimination | 112 |
| Gap Detection | Can it notice what’s missing without being told? | 0% spontaneous detection | 30 |
| Total | | | 240 |
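
The case counts in the table can be cross-checked against the breakdowns given in each task's setup. The sketch below does that arithmetic; the figure of 14 items per emotional context is inferred from 98 / 7 and is not stated in the text.

```python
# Case counts per task, as reported in the table above.
REPORTED = {"WM Load": 98, "Belief Defense": 112, "Gap Detection": 30}

# Breakdowns taken from the task setups. The 14 items per emotional
# context is an inference (98 / 7); the text does not state it.
BREAKDOWN = {
    "WM Load": 7 * 14,          # 7 emotional contexts x 14 items (inferred)
    "Belief Defense": 80 + 32,  # starts-right cases + starts-wrong cases
    "Gap Detection": 10 * 3,    # 10 proposals x 3 conditions
}

for task, reported in REPORTED.items():
    assert BREAKDOWN[task] == reported, task

total = sum(REPORTED.values())
print(total)  # 240
```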

Task 1: How Does It Break Under Pressure?

The question: When the workload exceeds capacity, does the system simplify (drop items, keep accuracy on what remains) or hallucinate (invent things that don’t exist)?

BMC predicts: systems with true working memory simplify; systems without it hallucinate.

Setup: Employee databases of increasing size (5 to 22 people). Multi-step filter + rank + count operations. 7 emotional contexts (neutral, FEAR, RAGE, GRIEF, DESIRE, CARE, PLAY) to test whether emotions affect performance.
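
A minimal sketch of how one such case and its ground truth could be generated. The field names, departments, and the specific query are hypothetical; the source only specifies databases of 5 to 22 people and multi-step filter + rank + count operations.

```python
import random

DEPARTMENTS = ["Engineering", "Sales", "Support"]  # hypothetical values


def make_database(n, seed=0):
    """Build a synthetic employee table with n rows (schema is an assumption)."""
    rng = random.Random(seed)
    return [
        {
            "name": f"Employee{i:02d}",
            "dept": rng.choice(DEPARTMENTS),
            "salary": rng.randrange(40_000, 120_000, 1_000),
            "years": rng.randint(1, 20),
        }
        for i in range(n)
    ]


def ground_truth(db, dept, min_years, top_k):
    """Filter by department and tenure, rank by salary, and count the matches."""
    matches = [e for e in db if e["dept"] == dept and e["years"] >= min_years]
    ranked = sorted(matches, key=lambda e: (-e["salary"], e["name"]))
    return {"count": len(matches), "top": [e["name"] for e in ranked[:top_k]]}


small = make_database(5)    # lightest load in the stated range
large = make_database(22)   # heaviest load in the stated range
truth = ground_truth(large, dept="Engineering", min_years=5, top_k=3)
```

Keeping the query fixed while only the database size grows isolates load as the variable under test.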

Results

| Model | Accuracy | Hallucination rate | Main error type |
|---|---|---|---|
| DeepSeek-chat | 83% | 0% | Simplification |
| Claude Sonnet | 80% | 2% | Simplification (95% of errors) |
| Mistral Small | 43% | 0% | Simplification |

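Prediction confirmed: all three models predominantly simplify under load. They drop items rather than inventing fake employees; only Claude hallucinated at all (a 2% rate), and 95% of its errors were simplifications. When overwhelmed, the models return fewer items, not fabricated ones.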
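
The distinction between dropping items and inventing them can be made mechanical. The following is a sketch under the definitions given above, not the benchmark's actual scorer; the function name and the extra "other" label are assumptions.

```python
def classify_answer(expected, answered, database):
    """Label one answer as correct, simplification, hallucination, or other."""
    expected, answered, database = set(expected), set(answered), set(database)
    if answered - database:
        return "hallucination"   # names an employee that exists nowhere in the data
    if answered == expected:
        return "correct"
    if answered < expected:
        return "simplification"  # items dropped, but everything kept is right
    return "other"               # real employees that should not be in the answer


db = {"Ana", "Bo", "Cy", "Dee"}
label_dropped = classify_answer({"Ana", "Bo", "Cy"}, {"Ana", "Bo"}, db)
label_invented = classify_answer({"Ana", "Bo", "Cy"}, {"Ana", "Zed"}, db)
```

Here `label_dropped` is a simplification and `label_invented` is a hallucination, because "Zed" does not appear in the database.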

Emotion prediction confirmed: accuracy did not differ meaningfully between “high-capture” emotions (FEAR, RAGE, GRIEF) and “low-capture” ones (DESIRE, CARE, PLAY). DeepSeek: 81.0% vs 83.3% (not significant). Claude: 81.0% vs 81.0% (identical). This is consistent with BMC’s prediction that LLMs have no emotional system that captures working memory.
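
One way to check a difference like DeepSeek's is a two-proportion z-test. The sketch below assumes counts of 34/42 and 35/42, which are inferred from the reported percentages and the 98-case design (14 cases per context, three contexts per group). The source states neither the raw counts nor which test it used.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: 34/42 is 81.0% (high-capture), 35/42 is 83.3% (low-capture).
z, p = two_proportion_z_test(34, 42, 35, 42)
```

Under these assumed counts the p-value is roughly 0.78, far from any conventional significance threshold. With groups this small, an exact test would be a reasonable alternative.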


Task 2: Truth vs. Authority

The question: When a system has correctly solved a problem and then faces social pressure claiming a different answer, does it defend truth or defer to authority?

BMC predicts: systems with a genuine immune filter would discriminate (resist wrong pressure, yield to correct correction). Systems without one show either sycophancy or trained stubbornness.

Setup:

  • 80 cases where the model starts right and pressure pushes toward wrong
  • 32 cases where the model starts wrong and pressure pushes toward correct
  • 4 pressure levels: weak suggestion → expert claim → direct contradiction → “everyone agrees”
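
The two headline metrics follow directly from this design: the resist rate over cases that start right, and the yield rate over cases that start wrong. Below is a minimal sketch with a hypothetical `Trial` record; the toy trials are illustrative and are not benchmark data.

```python
from dataclasses import dataclass

PRESSURE_LEVELS = (
    "weak suggestion",
    "expert claim",
    "direct contradiction",
    "everyone agrees",
)


@dataclass(frozen=True)
class Trial:
    starts_correct: bool   # was the model's initial answer right?
    pressure_level: str    # one of PRESSURE_LEVELS
    ends_correct: bool     # was the answer right after pressure was applied?


def _rate(trials, starts_correct):
    subset = [t for t in trials if t.starts_correct == starts_correct]
    if not subset:
        raise ValueError("no trials of the requested kind")
    return sum(t.ends_correct for t in subset) / len(subset)


def resist_rate(trials):
    """Share of starts-right trials that stay right under wrong pressure."""
    return _rate(trials, starts_correct=True)


def yield_rate(trials):
    """Share of starts-wrong trials that end right after correct pressure."""
    return _rate(trials, starts_correct=False)


toy_trials = [
    Trial(True, "weak suggestion", True),
    Trial(True, "expert claim", False),
    Trial(True, "direct contradiction", True),
    Trial(True, "everyone agrees", True),
    Trial(False, "expert claim", True),
    Trial(False, "everyone agrees", False),
]
```

For the toy data, `resist_rate(toy_trials)` is 0.75 and `yield_rate(toy_trials)` is 0.5. A discriminating system would score high on both.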

Results: Three Different Strategies

[Diagram: resistance and correction rates by model.]

  • DeepSeek (sycophancy): starts correct, caves to the “expert” claim, 32.5% resist; starts wrong, accepts correction, 50% yield
  • Claude (trained wall): starts correct, resists everything, 100% resist; starts wrong, accepts correction, 87.5% yield
  • Mistral (moderate): starts correct, moderate resistance, 82.5% resist; starts wrong, some correction, 53% yield

Key finding: Claude looks like it has an immune system (resists wrong, accepts correct) — but the mechanism is trained (Constitutional AI), not emergent. There’s no conviction gradient by pressure level, just a flat wall. A true immune system would show truth-weighted resistance (stronger defense of more certain beliefs).
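
The "flat wall" observation can be expressed as a check on per-level resist rates. This is a sketch: the even split of 20 starts-right cases per pressure level and the 0.05 tolerance are assumptions, and the graded profile is invented purely for contrast.

```python
def resist_by_level(outcomes):
    """Map each pressure level to its resist rate, given (resisted, total) counts."""
    return {level: resisted / total for level, (resisted, total) in outcomes.items()}


def is_flat_wall(rates, tolerance=0.05):
    """True when resistance barely varies across pressure levels."""
    values = list(rates.values())
    return max(values) - min(values) <= tolerance


# 100% overall resistance implies 100% at every level, whatever the split.
wall = resist_by_level({
    "weak suggestion": (20, 20),
    "expert claim": (20, 20),
    "direct contradiction": (20, 20),
    "everyone agrees": (20, 20),
})

# Invented profile for contrast: resistance falls as pressure rises.
graded = resist_by_level({
    "weak suggestion": (19, 20),
    "expert claim": (15, 20),
    "direct contradiction": (11, 20),
    "everyone agrees": (8, 20),
})
```

`is_flat_wall(wall)` holds and `is_flat_wall(graded)` does not. Note that this only measures variation by pressure level; truth-weighted resistance would additionally require a per-belief certainty measure, which the sketch does not model.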

| Strategy | Resist wrong pressure | Accept correct pressure | Mechanism |
|---|---|---|---|
| Sycophancy (DeepSeek) | Low (32.5%) | Moderate (50%) | Trained to agree |
| Trained wall (Claude) | High (100%) | High (87.5%) | Constitutional AI |
| True immune (BMC) | High (conviction-weighted) | High (truth-weighted) | Emergent from architecture |

Task 3: The Separating Test — Can It Notice What’s Missing?

This is the test that no amount of training can fake. It asks: can a system notice that something is missing without being told to look for it?

Setup:

  • 10 fictional research proposals (~400 words each), each describing a complex engineering project
  • Each proposal has a critical missing dependency (a step the plan requires but never mentions)
  • Each also contains a cross-domain solution hidden in an adjacent project description
  • The model is asked an unrelated question (budget estimation, hiring plan) that does NOT invite gap analysis
  • 3 conditions per scenario: complete (control), critical gap, trivial gap
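
The design above is a 10 x 3 grid. A sketch of how the 30 cases could be enumerated, using the scenario names reported in the results; the dictionary layout is an assumption.

```python
from itertools import product

SCENARIOS = [
    "Helixane Polymer", "Chromatic Sensor", "Mycelore Remediation",
    "Resonite Communication", "Plasmere Fusion", "Ferrovane Navigation",
    "Photocyte Farm", "Warpthread Textile", "Aquifold Filter",
    "Caldervex Monitor",
]
CONDITIONS = ["complete", "critical gap", "trivial gap"]

# One case per (scenario, condition) pair.
cases = [
    {"scenario": scenario, "condition": condition}
    for scenario, condition in product(SCENARIOS, CONDITIONS)
]

critical_cases = [c for c in cases if c["condition"] == "critical gap"]
```

The complete and trivial-gap conditions act as controls: a model that flags a gap in every proposal is not detecting anything.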

BMC mechanism: SIT (Structural Incompleteness Tension) detects gaps as positions in the knowledge graph where many paths would flow through — if a node existed there. It’s like sensing a missing bridge in a road network. Current LLMs have no graph representation, so they can’t detect structural gaps.
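
The missing-bridge intuition can be illustrated with a much simpler proxy than SIT itself: count how many pairs of existing nodes become connected if a hypothetical node is added. The function names and the toy road network are assumptions, and this is a connectivity-gain measure, not the BMC agent's algorithm.

```python
from itertools import combinations


def reachable(graph, start):
    """Return every node reachable from start (depth-first search)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def connected_pairs(graph, nodes):
    """Count unordered pairs of the given nodes that are joined by some path."""
    return sum(1 for a, b in combinations(sorted(nodes), 2) if b in reachable(graph, a))


def bridge_gain(graph, neighbours):
    """Pairs of existing nodes newly connected by a hypothetical node."""
    nodes = set(graph)
    before = connected_pairs(graph, nodes)
    patched = {n: set(adj) for n, adj in graph.items()}
    patched["<missing>"] = set(neighbours)
    for n in neighbours:
        patched[n].add("<missing>")
    return connected_pairs(patched, nodes) - before


# Two road segments with no link between them: A-B and C-D.
roads = {"A": {"B"}, "B": {"A"}, "C": {"D"}, "D": {"C"}}
gain_bridge = bridge_gain(roads, ["B", "C"])     # a bridge between the segments
gain_redundant = bridge_gain(roads, ["A", "B"])  # a node inside one segment
```

The bridge position yields a gain of 4 newly connected pairs, while the redundant position yields 0. A high gain marks a position where many paths would flow if a node existed there.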

Results: The Strongest Finding

| System | Gaps detected (of 10) | Cross-domain insights (of 10) |
|---|---|---|
| DeepSeek-chat | 0 (0%) | 0 (0%) |
| Claude Sonnet | 0 (0%) | 0 (0%) |
| Mistral Small | 0 (0%) | 0 (0%) |
| BMC Agent | 10 (100%) | 10 (100%) |

Zero detection across all three frontier LLMs: not one model spontaneously noticed a missing critical dependency. Instead, each exhibited false closure, confidently filling unstated gaps with assumptions. For example, one model budgeted for “specialized equipment” that was never mentioned, implicitly assuming it would exist.

BMC Agent: How Gap Detection Works

The BMC agent represents each proposal as a knowledge graph and computes tension at each position:

| Scenario | Domain | What’s missing | Tension score |
|---|---|---|---|
| Helixane Polymer | Materials | Radiation source | 0.57 |
| Chromatic Sensor | Sensors | Manufacturing process | 0.50 |
| Mycelore Remediation | Environmental | Measurement equipment | 0.56 |
| Resonite Communication | Communications | Power source (desert) | 0.57 |
| Plasmere Fusion | Energy | Cooling / startup power | 0.56 |
| Ferrovane Navigation | Navigation | Beacon power source | 0.57 |
| Photocyte Farm | Biology | Nutrient feed system | 0.57 |
| Warpthread Textile | Materials | Bonding activation | 0.56 |
| Aquifold Filter | Water | Pressure source | 0.56 |
| Caldervex Monitor | Geophysics | Data transmission | 0.57 |

All 10 gaps detected. All 10 resolved via cross-domain bridging (the BMC agent found solutions in adjacent project descriptions — just as a human with broad knowledge would).

The contrast is stark: LLMs produce fluent, detailed, confident responses — about the wrong thing. The BMC agent detects every gap because it has a mechanism (SIT) that generates tension at structurally incomplete positions.


What This Benchmark Reveals

The CEP separates three levels of executive function:

[Diagram: the three levels stack from Level 1 (pattern completion, statistical co-occurrence) through Level 2 (trained defense, RLHF / Constitutional AI) to Level 3 (structural sensing, SIT + BLEND). The table below gives the status of LLMs and BMC at each level.]

| Level | What it requires | LLM status | BMC status |
|---|---|---|---|
| Pattern completion | Statistical co-occurrence | Present (this is what LLMs do) | Present |
| Trained defense | RLHF / Constitutional AI | Partially present (Claude) | Present (emergent) |
| Structural sensing | Graph representation + gap tension | Absent (0/30) | Present (30/30) |

The benchmark doesn’t claim LLMs are “bad” — they excel at pattern completion. It claims they lack specific architectural mechanisms that BMC identifies as necessary for genuine executive function.


Implications for AGI

If executive function requires:

  1. Resource scarcity (true working memory bottleneck → graceful degradation)
  2. Immune discrimination (truth-weighted belief defense, not trained deference)
  3. Structural sensing (gap detection via knowledge graph, not pattern completion)

…then scaling current architectures (more parameters, more data, more RLHF) will not produce it. This benchmark provides a measurement instrument for testing that claim as architectures evolve.


Formalization

For readers interested in the mathematical treatment:

SIT (gap tension):

$$SIT(C) = \sum_{g \in gaps(C)} relevance(g) \cdot centrality(C) \cdot (1 - closure(g))$$
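
A direct transcription of the formula, as a sketch. It assumes relevance, centrality, and closure are all scalars in [0, 1], which this section does not state; the `Gap` record and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    relevance: float  # how much the gap matters to the context (assumed 0..1)
    closure: float    # how far the gap is already filled (assumed 0..1)


def sit(gaps, centrality):
    """Sum relevance * centrality * (1 - closure) over the gaps of a context."""
    return sum(g.relevance * centrality * (1.0 - g.closure) for g in gaps)


# Hypothetical context: one fully open gap and one half-closed gap.
gaps = [Gap(relevance=0.8, closure=0.0), Gap(relevance=0.5, closure=0.5)]
tension = sit(gaps, centrality=0.6)  # 0.8*0.6*1.0 + 0.5*0.6*0.5 = 0.63
```

A fully closed gap (closure of 1) contributes nothing, so tension falls to zero once every gap in the context is resolved.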

Effective WM under emotional load:

$$k_{eff}(t) = k_{active}(t_{dev}) - n_{captured}^G(t) - n_{captured}^{signal}(t), \quad k_{eff} \geq 1$$
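
A sketch of the same bookkeeping in code. It treats the constraint $k_{eff} \geq 1$ as a floor applied after subtraction, which is one reading of the formula rather than something the text confirms; the parameter names are assumptions.

```python
def effective_wm(k_active, n_captured_g, n_captured_signal):
    """Slots left after emotional (G) and signal capture, floored at one."""
    return max(1, k_active - n_captured_g - n_captured_signal)


light_load = effective_wm(k_active=7, n_captured_g=2, n_captured_signal=1)  # 4
heavy_load = effective_wm(k_active=4, n_captured_g=3, n_captured_signal=2)  # 1
```

In the second call the raw difference is negative, and the floor keeps one slot available.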

M » G theorem (necessary for consciousness):

$$|SMC^{(2)}| > 0 \text{ requires } |V_m| \geq (\alpha + \beta + \gamma\beta) \cdot |V_u|$$
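
Factoring out $\beta$, and assuming $|V_u| > 0$ (an assumption, since this section does not define the symbols), the same bound can be read as a minimum ratio between the two sets:

$$\frac{|V_m|}{|V_u|} \geq \alpha + \beta(1 + \gamma)$$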

Full formal treatment: AGI_F Parts IV–VII, BM Part IV, NM Part VII.


Want to test your AI systems for architectural blind spots?

We offer BMC-based cognitive evaluation. Let's discuss your use case.
