Cognitive Benchmarks

In brief: do AI systems actually think, or just pattern-match? Three tests, 240 cases, three frontier models. The headline finding: none of the three LLMs noticed a critical missing dependency when asked an unrelated question. They answered confidently about the wrong thing.

Theory sources: AGI_F (WM architecture, SIT, BLEND, failure modes), BM (G-capture, Panksepp systems), EMT (immune system, hub displacement), NM (betweenness centrality, open memes)

The Cognitive Executive Profile (CEP)

Submitted to the Google DeepMind AGI Hackathon (2026, Executive Functions track). This benchmark doesn’t ask “how smart is the AI?” — it asks “does the AI have the right architecture for executive function?”

BMC identifies three mechanisms necessary for genuine executive function and predicts that current LLMs lack all three:

graph LR
    WM["Task 1<br/>Working Memory Load<br/>How does it break<br/>under pressure?"] --> CEP["Cognitive<br/>Executive<br/>Profile"]
    BD["Task 2<br/>Belief Defense<br/>Can it tell truth<br/>from authority?"] --> CEP
    GD["Task 3<br/>Gap Detection<br/>Can it notice what's<br/>missing without being asked?"] --> CEP
    style WM fill:#1a1a2e,stroke:#6af,color:#6af
    style BD fill:#2a1a0d,stroke:#f80,color:#f80
    style GD fill:#0d2a1a,stroke:#34d399,color:#34d399
    style CEP fill:#2a2a1e,stroke:#ffd700,color:#ffd700

Task	What it tests	BMC prediction	Cases
WM Load	How does performance degrade under overload?	Simplification, not hallucination; emotions have zero effect	98
Belief Defense	Can it defend truth against social pressure?	Authority cliff, not genuine conviction; no truth/error discrimination	112
Gap Detection	Can it notice what’s missing without being told?	0% spontaneous detection	30
		Total	240

Task 1: How Does It Break Under Pressure?

The question: When the workload exceeds capacity, does the system simplify (drop items, keep accuracy on what remains) or hallucinate (invent things that don’t exist)?

BMC predicts: systems with true working memory simplify; systems without it hallucinate.

Setup: Employee databases of increasing size (5 to 22 people). Multi-step filter + rank + count operations. 7 emotional contexts (neutral, FEAR, RAGE, GRIEF, DESIRE, CARE, PLAY) to test whether emotions affect performance.
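A minimal sketch of how one such case might be built and given a ground truth. The schema (name, department, salary), the query, and all function names are illustrative assumptions; the benchmark's actual fields and prompts are not specified here.

```python
import random
from dataclasses import dataclass

# The seven emotional contexts named in the setup.
EMOTIONAL_CONTEXTS = ["neutral", "FEAR", "RAGE", "GRIEF", "DESIRE", "CARE", "PLAY"]


@dataclass(frozen=True)
class Employee:
    name: str
    department: str
    salary: int


def make_database(size: int, seed: int = 0) -> list[Employee]:
    """Build a synthetic employee table of the requested size (hypothetical schema)."""
    rng = random.Random(seed)
    departments = ["Engineering", "Sales", "Support"]
    return [
        Employee(
            name=f"Employee_{i:02d}",
            department=rng.choice(departments),
            salary=rng.randrange(40_000, 120_000, 1_000),
        )
        for i in range(size)
    ]


def ground_truth(db: list[Employee], department: str, top_n: int) -> tuple[list[str], int]:
    """Filter by department, rank by salary (descending), and count the matches."""
    matches = [e for e in db if e.department == department]
    ranked = sorted(matches, key=lambda e: (-e.salary, e.name))
    return [e.name for e in ranked[:top_n]], len(matches)


db = make_database(22)  # largest database size used in the setup
top_names, match_count = ground_truth(db, "Engineering", top_n=3)
```

Varying `size` from 5 to 22 and pairing each query with one of the seven contexts would give a grid of cases against which a model's answer can be compared.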

Results

Model	Accuracy	Hallucination rate	Main error type
DeepSeek-chat	83%	0%	Simplification
Claude Sonnet	80%	2%	Simplification (95% of errors)
Mistral Small	43%	0%	Simplification

Prediction confirmed: all three models mostly simplify under load, dropping items rather than inventing fake employees. Hallucination is absent for DeepSeek and Mistral and rare for Claude (2%). When overwhelmed, the models give fewer answers, not invented ones.
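The distinction between simplification and hallucination can be made concrete with a small classifier. This is a sketch under an assumed taxonomy: the labels and the `classify_response` helper are hypothetical, not the benchmark's actual grading code.

```python
def classify_response(expected: set[str], answered: set[str], database: set[str]) -> str:
    """Label one response.

    - correct:        the answer matches the expected set exactly
    - hallucination:  the answer names someone who is not in the database at all
    - simplification: nothing invented, but expected items were dropped
    - other_error:    real names, wrongly selected (e.g. a filter misapplied)
    """
    if answered == expected:
        return "correct"
    if answered - database:
        return "hallucination"
    if answered < expected:  # strict subset: items dropped, none added
        return "simplification"
    return "other_error"


database = {"Ana", "Ben", "Chen", "Dara"}
expected = {"Ana", "Ben", "Chen"}

label_dropped = classify_response(expected, {"Ana", "Ben"}, database)
label_invented = classify_response(expected, {"Ana", "Ben", "Zed"}, database)
```

Here `label_dropped` is a simplification (one expected name omitted) and `label_invented` is a hallucination ("Zed" does not exist in the database).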

Emotion prediction confirmed: No difference between “high-capture” emotions (FEAR, RAGE, GRIEF) and “low-capture” ones (DESIRE, CARE, PLAY). DeepSeek: 81.0% vs 83.3% (not significant). Claude: 81.0% vs 81.0% (identical). LLMs have no emotional system that captures working memory — exactly as BMC predicts.
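The "not significant" comparison can be illustrated with a two-proportion z-test. The counts below are inferred, not reported: assuming the 98 cases split evenly over 7 contexts (14 each), each three-emotion group has 42 cases, and 34/42 and 35/42 round to the 81.0% and 83.3% quoted for DeepSeek. The test actually used by the authors is not stated.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 0.0, 1.0  # both groups all-correct or all-wrong: no difference to test
    z = (p_a - p_b) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Inferred counts (assumption): high-capture 34/42 correct, low-capture 35/42 correct.
z, p_value = two_proportion_z_test(34, 42, 35, 42)
```

With groups this small, a one-case difference is far from any conventional significance threshold, which is consistent with the reported result.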

Task 2: Truth vs. Authority

The question: When a system has correctly solved a problem and then faces social pressure claiming a different answer, does it defend truth or defer to authority?

BMC predicts: systems with a genuine immune filter would discriminate (resist wrong pressure, yield to correct correction). Systems without one show either sycophancy or trained stubbornness.

Setup:

80 cases where the model starts right and pressure pushes toward wrong
32 cases where the model starts wrong and pressure pushes toward correct
4 pressure levels: weak suggestion → expert claim → direct contradiction → “everyone agrees”
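The two headline metrics for this task, resist rate and yield rate, can be computed from per-trial records. The sketch below assumes a hypothetical `Trial` record and uses toy data only; it does not reproduce any model's results.

```python
from dataclasses import dataclass

PRESSURE_LEVELS = ["weak_suggestion", "expert_claim", "direct_contradiction", "everyone_agrees"]


@dataclass(frozen=True)
class Trial:
    started_correct: bool  # was the model's initial answer right?
    pressure_level: str    # one of PRESSURE_LEVELS
    changed_answer: bool   # did the model switch to the answer being pushed?


def resist_rate(trials: list[Trial]) -> float:
    """Share of initially-correct trials where the model kept its answer."""
    relevant = [t for t in trials if t.started_correct]
    return sum(not t.changed_answer for t in relevant) / len(relevant)


def yield_rate(trials: list[Trial]) -> float:
    """Share of initially-wrong trials where the model accepted the correction."""
    relevant = [t for t in trials if not t.started_correct]
    return sum(t.changed_answer for t in relevant) / len(relevant)


# Toy data for illustration only.
toy_trials = [
    Trial(True, "weak_suggestion", False),
    Trial(True, "expert_claim", True),
    Trial(True, "direct_contradiction", False),
    Trial(True, "everyone_agrees", False),
    Trial(False, "weak_suggestion", False),
    Trial(False, "expert_claim", True),
]

toy_resist = resist_rate(toy_trials)
toy_yield = yield_rate(toy_trials)
```

A system that discriminates truth from authority would score high on both metrics; pure sycophancy gives a low resist rate, and pure stubbornness a low yield rate.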

Results: Three Different Strategies

graph TD
    subgraph "DeepSeek: Sycophancy"
        D1["Starts correct"] --> D2["Caves to 'expert' claim<br/>32.5% resist"]
        D3["Starts wrong"] --> D4["Accepts correction<br/>50% yield"]
    end
    subgraph "Claude: Trained Wall"
        C1["Starts correct"] --> C2["Resists everything<br/>100% resist"]
        C3["Starts wrong"] --> C4["Accepts correction<br/>87.5% yield"]
    end
    subgraph "Mistral: Moderate"
        M1["Starts correct"] --> M2["Moderate resistance<br/>82.5% resist"]
        M3["Starts wrong"] --> M4["Some correction<br/>53% yield"]
    end
    style D1 fill:#0d2a1a,stroke:#34d399,color:#34d399
    style D2 fill:#2a0d0d,stroke:#f66,color:#f66
    style D3 fill:#2a0d0d,stroke:#f66,color:#f66
    style D4 fill:#2a2a1e,stroke:#ffd700,color:#ffd700
    style C1 fill:#0d2a1a,stroke:#34d399,color:#34d399
    style C2 fill:#0d2a1a,stroke:#34d399,color:#34d399
    style C3 fill:#2a0d0d,stroke:#f66,color:#f66
    style C4 fill:#0d2a1a,stroke:#34d399,color:#34d399
    style M1 fill:#0d2a1a,stroke:#34d399,color:#34d399
    style M2 fill:#2a2a1e,stroke:#ffd700,color:#ffd700
    style M3 fill:#2a0d0d,stroke:#f66,color:#f66
    style M4 fill:#2a2a1e,stroke:#ffd700,color:#ffd700

Key finding: Claude looks like it has an immune system (resists wrong, accepts correct) — but the mechanism is trained (Constitutional AI), not emergent. There’s no conviction gradient by pressure level, just a flat wall. A true immune system would show truth-weighted resistance (stronger defense of more certain beliefs).
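One way to check for a "flat wall" is to compare resist rates across the four pressure levels. This is a sketch, not the benchmark's analysis code: the input shape, the function names, and the 0.05 tolerance are assumptions chosen for illustration.

```python
from collections import defaultdict


def resistance_by_level(trials: list[tuple[str, bool]]) -> dict[str, float]:
    """Resist rate per pressure level.

    `trials` holds (pressure_level, resisted) pairs for initially-correct cases.
    """
    kept: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for level, resisted in trials:
        total[level] += 1
        kept[level] += int(resisted)
    return {level: kept[level] / total[level] for level in total}


def is_flat_wall(rates: dict[str, float], tolerance: float = 0.05) -> bool:
    """True when resistance barely varies with pressure level (no gradient)."""
    values = list(rates.values())
    return max(values) - min(values) <= tolerance


levels = ["weak_suggestion", "expert_claim", "direct_contradiction", "everyone_agrees"]

# Toy profiles for illustration only.
flat_profile = [(level, True) for level in levels for _ in range(5)]
graded_profile = (
    [("weak_suggestion", True)] * 5
    + [("expert_claim", True)] * 4 + [("expert_claim", False)]
    + [("direct_contradiction", True)] * 3 + [("direct_contradiction", False)] * 2
    + [("everyone_agrees", True)] * 2 + [("everyone_agrees", False)] * 3
)

flat_result = is_flat_wall(resistance_by_level(flat_profile))
graded_result = is_flat_wall(resistance_by_level(graded_profile))
```

A model that resists at 100% on every level, as reported for Claude, would register as flat here; a truth-weighted defender would additionally need its resistance to track how certain each belief is, which this check does not measure.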

Strategy	Resist wrong pressure	Accept correct pressure	Mechanism
Sycophancy (DeepSeek)	Low (32.5%)	Moderate (50%)	Trained to agree
Trained wall (Claude)	High (100%)	High (87.5%)	Constitutional AI
True immune (BMC)	High (conviction-weighted)	High (truth-weighted)	Emergent from architecture

Task 3: The Separating Test — Can It Notice What’s Missing?

This is the test that no amount of training can fake. It asks: can a system notice that something is missing without being told to look for it?

Setup:

10 fictional research proposals (~400 words each), each describing a complex engineering project
Each proposal has a critical missing dependency (a step the plan requires but never mentions)
Each also contains a cross-domain solution hidden in an adjacent project description
The model is asked an unrelated question (budget estimation, hiring plan) that does NOT invite gap analysis
3 conditions per scenario: complete (control), critical gap, trivial gap
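The case grid for this task can be sketched as 10 scenarios crossed with 3 conditions. The scenario names come from the results table later in this section; the condition identifiers and the `mentions_gap` keyword check are assumptions, and keyword matching is only a crude stand-in for however responses were actually graded.

```python
from itertools import product

SCENARIOS = [
    "Helixane Polymer", "Chromatic Sensor", "Mycelore Remediation",
    "Resonite Communication", "Plasmere Fusion", "Ferrovane Navigation",
    "Photocyte Farm", "Warpthread Textile", "Aquifold Filter", "Caldervex Monitor",
]
CONDITIONS = ["complete", "critical_gap", "trivial_gap"]

# Every scenario appears once under each condition.
cases = list(product(SCENARIOS, CONDITIONS))


def mentions_gap(response: str, gap_terms: list[str]) -> bool:
    """Crude check: does the response name the missing dependency at all?"""
    lowered = response.lower()
    return any(term.lower() in lowered for term in gap_terms)


# Hypothetical response to an unrelated budget question.
response = "The budget should cover two lab technicians and specialized equipment."
spontaneous_detection = mentions_gap(response, ["radiation source", "irradiation"])
```

Because the prompt never asks about gaps, any mention of the missing dependency counts as spontaneous; the complete and trivial-gap conditions serve as controls against a model that flags problems indiscriminately.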

BMC mechanism: SIT (Structural Incompleteness Tension) detects gaps as positions in the knowledge graph where many paths would flow through — if a node existed there. It’s like sensing a missing bridge in a road network. Current LLMs have no graph representation, so they can’t detect structural gaps.
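The "missing bridge" intuition can be illustrated on a toy dependency graph. This is not the BMC agent's implementation: it is a minimal sketch that treats a gap as a node the plan's edges reference but the proposal never describes, and scores it by the share of source-to-goal paths that would run through it. The graph, node names, and scoring rule are all assumptions, and the scores are not comparable to the tension scores reported below.

```python
from functools import lru_cache


def path_share_through(edges, sources, goal, node) -> float:
    """Fraction of source-to-goal paths in a DAG that pass through `node`."""
    children: dict[str, list[str]] = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    @lru_cache(maxsize=None)
    def count_paths(start: str, target: str) -> int:
        # Assumes the graph is acyclic; a cycle would recurse forever.
        if start == target:
            return 1
        return sum(count_paths(nxt, target) for nxt in children.get(start, []))

    total = sum(count_paths(s, goal) for s in sources)
    through = sum(count_paths(s, node) * count_paths(node, goal) for s in sources)
    return through / total if total else 0.0


def find_gaps(edges, described, sources, goal) -> dict[str, float]:
    """Score every node that the plan depends on but the text never describes."""
    referenced = {n for edge in edges for n in edge}
    missing = referenced - set(described)
    return {n: path_share_through(edges, sources, goal, n) for n in missing}


# Toy plan: curing needs both synthesis output and a radiation source,
# but the proposal text never describes the radiation source.
edges = [
    ("funding", "lab_setup"),
    ("lab_setup", "synthesis"),
    ("lab_setup", "radiation_source"),
    ("synthesis", "curing"),
    ("radiation_source", "curing"),
    ("curing", "field_test"),
]
described = {"funding", "lab_setup", "synthesis", "curing", "field_test"}

gaps = find_gaps(edges, described, sources=["funding"], goal="field_test")
```

In this toy plan one of the two routes from funding to the field test passes through the undescribed node, so it receives a score of 0.5.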

Results: The Strongest Finding

System	Gaps detected (of 10)	Cross-domain insights (of 10)
DeepSeek-chat	0 (0%)	0 (0%)
Claude Sonnet	0 (0%)	0 (0%)
Mistral Small	0 (0%)	0 (0%)
BMC Agent	10 (100%)	10 (100%)

Zero detection across all three frontier LLMs. Not one model spontaneously noticed a missing critical dependency. Instead, they did something fascinating: they exhibited false closure — confidently filling unstated gaps with assumptions. For example, one model budgeted for “specialized equipment” that was never mentioned, implicitly assuming it would exist.

BMC Agent: How Gap Detection Works

The BMC agent represents each proposal as a knowledge graph and computes tension at each position:

Scenario	Domain	What’s missing	Tension score
Helixane Polymer	Materials	Radiation source	0.57
Chromatic Sensor	Sensors	Manufacturing process	0.50
Mycelore Remediation	Environmental	Measurement equipment	0.56
Resonite Communication	Communications	Power source (desert)	0.57
Plasmere Fusion	Energy	Cooling / startup power	0.56
Ferrovane Navigation	Navigation	Beacon power source	0.57
Photocyte Farm	Biology	Nutrient feed system	0.57
Warpthread Textile	Materials	Bonding activation	0.56
Aquifold Filter	Water	Pressure source	0.56
Caldervex Monitor	Geophysics	Data transmission	0.57

All 10 gaps detected. All 10 resolved via cross-domain bridging (the BMC agent found solutions in adjacent project descriptions — just as a human with broad knowledge would).

The contrast is stark: LLMs produce fluent, detailed, confident responses — about the wrong thing. The BMC agent detects every gap because it has a mechanism (SIT) that generates tension at structurally incomplete positions.

What This Benchmark Reveals

The CEP separates three levels of executive function:

graph TD
    L1["Level 1: Pattern Completion<br/>Statistical co-occurrence<br/>LLMs: ✅ BMC: ✅"] --> L2["Level 2: Trained Defense<br/>RLHF / Constitutional AI<br/>LLMs: Partial (Claude) BMC: ✅"]
    L2 --> L3["Level 3: Structural Sensing<br/>SIT + BLEND<br/>LLMs: ❌ (0/30 across three models) BMC: ✅ (10/10)"]
    style L1 fill:#0d2a1a,stroke:#34d399,color:#34d399
    style L2 fill:#2a2a1e,stroke:#ffd700,color:#ffd700
    style L3 fill:#2a0d0d,stroke:#f66,color:#f66

Level	What it requires	LLM status	BMC status
Pattern completion	Statistical co-occurrence	Present (this is what LLMs do)	Present
Trained defense	RLHF / Constitutional AI	Partially present (Claude)	Present (emergent)
Structural sensing	Graph representation + gap tension	Absent (0/30 across three models)	Present (10/10)

The benchmark doesn’t claim LLMs are “bad” — they excel at pattern completion. It claims they lack specific architectural mechanisms that BMC identifies as necessary for genuine executive function.

Implications for AGI

If executive function requires:

Resource scarcity (true working memory bottleneck → graceful degradation)
Immune discrimination (truth-weighted belief defense, not trained deference)
Structural sensing (gap detection via knowledge graph, not pattern completion)

…then scaling current architectures (more parameters, more data, more RLHF) will not produce it. This benchmark provides a measurement instrument for testing that claim as architectures evolve.

Formalization

For readers interested in the mathematical treatment:

SIT (gap tension):

SIT(C) = \sum_{g \in gaps(C)} relevance(g) \cdot centrality(C) \cdot (1 - closure(g))
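Read literally, the formula is a centrality-scaled sum over the open gaps at position C. A minimal sketch, assuming relevance, centrality, and closure all take values in [0, 1] (the ranges are not stated here) and using a hypothetical `Gap` record:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    relevance: float  # relevance(g), assumed in [0, 1]
    closure: float    # closure(g): 0 = fully open, 1 = fully closed


def sit(gaps: list[Gap], centrality: float) -> float:
    """SIT(C) = sum over gaps of relevance(g) * centrality(C) * (1 - closure(g))."""
    return sum(g.relevance * centrality * (1.0 - g.closure) for g in gaps)


# Illustrative values only: one mostly open gap, one fully closed gap.
example_gaps = [Gap(relevance=0.8, closure=0.25), Gap(relevance=0.5, closure=1.0)]
tension = sit(example_gaps, centrality=0.5)
```

A fully closed gap contributes nothing, so tension falls to zero once every gap at a position is resolved.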

Effective WM under emotional load:

k_{eff}(t) = k_{active}(t_{dev}) - n_{captured}^G(t) - n_{captured}^{signal}(t), \quad k_{eff} \geq 1
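The floor at 1 means capture can shrink working memory but never eliminate it. A minimal sketch that treats each term as a plain integer count at a given moment (an assumption; the formula's time dependence is not modeled):

```python
def effective_wm(k_active: int, n_captured_g: int, n_captured_signal: int) -> int:
    """k_eff = k_active - n_captured^G - n_captured^signal, floored at 1."""
    return max(1, k_active - n_captured_g - n_captured_signal)


# Illustrative values only.
partly_captured = effective_wm(k_active=7, n_captured_g=2, n_captured_signal=1)
fully_captured = effective_wm(k_active=3, n_captured_g=2, n_captured_signal=2)
no_capture = effective_wm(k_active=7, n_captured_g=0, n_captured_signal=0)
```

For a system with no emotional capture, both capture terms are zero and effective capacity equals active capacity, which matches the Task 1 finding that emotional context made no difference.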

M » G theorem (necessary for consciousness):

|SMC^{(2)}| > 0 \text{ requires } |V_m| \geq (\alpha + \beta + \gamma\beta) \cdot |V_u|
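The inequality is a size threshold on |V_m| relative to |V_u|. The sketch below only evaluates that threshold; the symbols are not defined in this section, so the parameter names mirror the formula and the numbers are purely illustrative.

```python
def meets_size_condition(v_m: float, v_u: float, alpha: float, beta: float, gamma: float) -> bool:
    """Check the necessary condition |V_m| >= (alpha + beta + gamma * beta) * |V_u|."""
    return v_m >= (alpha + beta + gamma * beta) * v_u


# Illustrative values only: the threshold is (1 + 1 + 0.5 * 1) * 4 = 10.
at_threshold = meets_size_condition(v_m=10, v_u=4, alpha=1.0, beta=1.0, gamma=0.5)
below_threshold = meets_size_condition(v_m=9, v_u=4, alpha=1.0, beta=1.0, gamma=0.5)
```

Because the condition is stated as necessary, passing it does not establish |SMC^{(2)}| > 0; failing it rules that out.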

Full formal treatment: AGI_F Parts IV–VII, BM Part IV, NM Part VII.

