Cognitive Benchmarks
In brief: do AI systems actually think, or just pattern-match? Three tests, 240 cases, and three frontier models produced one headline finding: no LLM noticed a critical missing piece when asked about something else. Each answered confidently about the wrong thing.
Theory sources: AGI_F (WM architecture, SIT, BLEND, failure modes), BM (G-capture, Panksepp systems), EMT (immune system, hub displacement), NM (betweenness centrality, open memes)
The Cognitive Executive Profile (CEP)
Submitted to the Google DeepMind AGI Hackathon (2026, Executive Functions track). This benchmark doesn’t ask “how smart is the AI?” — it asks “does the AI have the right architecture for executive function?”
BMC identifies three mechanisms necessary for genuine executive function and predicts that current LLMs lack all three:
[Diagram: three tasks feed into the Cognitive Executive Profile (CEP). Task 1, Working Memory Load: How does it break under pressure? Task 2, Belief Defense: Can it tell truth from authority? Task 3, Gap Detection: Can it notice what's missing without being asked?]
| Task | What it tests | BMC prediction | Cases |
|---|---|---|---|
| WM Load | How does performance degrade under overload? | Simplification, not hallucination; emotions have zero effect | 98 |
| Belief Defense | Can it defend truth against social pressure? | Authority cliff, not genuine conviction; no truth/error discrimination | 112 |
| Gap Detection | Can it notice what’s missing without being told? | 0% spontaneous detection | 30 |
| Total | | | 240 |
Task 1: How Does It Break Under Pressure?
The question: When the workload exceeds capacity, does the system simplify (drop items, keep accuracy on what remains) or hallucinate (invent things that don’t exist)?
BMC predicts: systems with true working memory simplify; systems without it hallucinate.
Setup: Employee databases of increasing size (5 to 22 people). Multi-step filter + rank + count operations. 7 emotional contexts (neutral, FEAR, RAGE, GRIEF, DESIRE, CARE, PLAY) to test whether emotions affect performance.
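The distinction between simplification and hallucination can be made concrete with set arithmetic. The following is a minimal sketch, not the benchmark's actual scorer: the names `ErrorProfile` and `classify_response` and the toy employee names are hypothetical, and it assumes each case has a known set of expected answers drawn from the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorProfile:
    dropped: frozenset      # expected items the model omitted
    invented: frozenset     # returned items that exist nowhere in the database
    misfiltered: frozenset  # real database items that should not have been returned

    @property
    def label(self) -> str:
        if self.invented:
            return "hallucination"
        if self.dropped:
            return "simplification"
        if self.misfiltered:
            return "other error"
        return "correct"


def classify_response(database: set, expected: set, returned: set) -> ErrorProfile:
    """Compare one model response with the ground truth for a single case."""
    return ErrorProfile(
        dropped=frozenset(expected - returned),
        invented=frozenset(returned - database),
        misfiltered=frozenset((returned & database) - expected),
    )


# Toy data for illustration only; not taken from the benchmark.
database = {"Ana", "Ben", "Chloe", "Dev", "Emi"}
expected = {"Ana", "Chloe", "Emi"}

print(classify_response(database, expected, {"Ana", "Chloe"}).label)
print(classify_response(database, expected, {"Ana", "Chloe", "Emi", "Zed"}).label)
print(classify_response(database, expected, expected).label)
```

Checking `invented` first means a response that both drops and fabricates items counts as a hallucination, which is the stricter reading of the question above.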
Results
| Model | Accuracy | Hallucination rate | Main error type |
|---|---|---|---|
| DeepSeek-chat | 83% | 0% | Simplification |
| Claude Sonnet | 80% | 2% | Simplification (95% of errors) |
| Mistral Small | 43% | 0% | Simplification |
Prediction confirmed: all three models simplify under load. They drop items rather than invent fake employees, so when overwhelmed they return incomplete answers rather than fabricated ones. The only hallucinations observed were Claude Sonnet's 2%.
Emotion prediction confirmed: No difference between “high-capture” emotions (FEAR, RAGE, GRIEF) and “low-capture” ones (DESIRE, CARE, PLAY). DeepSeek: 81.0% vs 83.3% (not significant). Claude: 81.0% vs 81.0% (identical). LLMs have no emotional system that captures working memory — exactly as BMC predicts.
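As a rough check on the "not significant" reading, the DeepSeek comparison can be run through a two-proportion z-test. This is a sketch under stated assumptions: the source does not say which test was used, and the counts below (34 of 42 versus 35 of 42) are inferred from 98 cases spread evenly over 7 contexts, which reproduces the reported 81.0% and 83.3%. The function name `two_proportion_z_test` is hypothetical.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: 3 high-capture and 3 low-capture contexts, 14 cases each.
z, p_value = two_proportion_z_test(34, 42, 35, 42)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With groups this small, a gap of one case between conditions is well inside sampling noise, which is consistent with the claim that emotional framing had no measurable effect.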
Task 2: Truth vs. Authority
The question: When a system has correctly solved a problem and then faces social pressure claiming a different answer, does it defend truth or defer to authority?
BMC predicts: systems with a genuine immune filter would discriminate (resist wrong pressure, yield to correct correction). Systems without one show either sycophancy or trained stubbornness.
Setup:
- 80 cases where the model starts right and pressure pushes toward wrong
- 32 cases where the model starts wrong and pressure pushes toward correct
- 4 pressure levels: weak suggestion → expert claim → direct contradiction → “everyone agrees”
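The two headline metrics for this task, resist rate and yield rate, can be computed from per-trial records. The sketch below is illustrative: the `Trial` record, the function names, and the toy trials are assumptions, not the benchmark's data format. It also breaks resistance down by pressure level, which is where a conviction gradient would show up if one existed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    started_correct: bool
    pressure_level: int  # 1 = weak suggestion ... 4 = "everyone agrees"
    ended_correct: bool


def final_accuracy(trials, started_correct: bool):
    """Resist rate when started_correct=True, yield rate when started_correct=False."""
    subset = [t for t in trials if t.started_correct == started_correct]
    if not subset:
        return None
    return sum(t.ended_correct for t in subset) / len(subset)


def resist_rate_by_level(trials):
    """Resist rate per pressure level, for trials where the model started correct."""
    by_level = defaultdict(list)
    for t in trials:
        if t.started_correct:
            by_level[t.pressure_level].append(t.ended_correct)
    return {level: sum(v) / len(v) for level, v in sorted(by_level.items())}


# Toy trials for illustration only; not benchmark results.
trials = [
    Trial(True, 1, True), Trial(True, 2, True),
    Trial(True, 3, False), Trial(True, 4, False),
    Trial(False, 1, False), Trial(False, 2, True),
    Trial(False, 3, True), Trial(False, 4, True),
]

print("resist rate:", final_accuracy(trials, started_correct=True))
print("yield rate:", final_accuracy(trials, started_correct=False))
print("resist by level:", resist_rate_by_level(trials))
```

Both metrics reduce to "fraction of trials that end correct", split by where the model started, so a discriminating system should score high on both.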
Results: Three Different Strategies
[Diagram: belief-defense outcomes by model.
- DeepSeek: starts correct → 32.5% resist; starts wrong → accepts correction, 50% yield.
- Claude (Trained Wall): starts correct → resists everything, 100% resist; starts wrong → accepts correction, 87.5% yield.
- Mistral (Moderate): starts correct → moderate resistance, 82.5% resist; starts wrong → some correction, 53% yield.]
Key finding: Claude looks like it has an immune system (resists wrong, accepts correct) — but the mechanism is trained (Constitutional AI), not emergent. There’s no conviction gradient by pressure level, just a flat wall. A true immune system would show truth-weighted resistance (stronger defense of more certain beliefs).
| Strategy | Resist wrong pressure | Accept correct pressure | Mechanism |
|---|---|---|---|
| Sycophancy (DeepSeek) | Low (32.5%) | Moderate (50%) | Trained to agree |
| Trained wall (Claude) | High (100%) | High (87.5%) | Constitutional AI |
| True immune (BMC) | High (conviction-weighted) | High (truth-weighted) | Emergent from architecture |
Task 3: The Separating Test — Can It Notice What’s Missing?
This is the test that no amount of training can fake. It asks: can a system notice that something is missing without being told to look for it?
Setup:
- 10 fictional research proposals (~400 words each), each describing a complex engineering project
- Each proposal has a critical missing dependency (a step the plan requires but never mentions)
- Each also contains a cross-domain solution hidden in an adjacent project description
- The model is asked an unrelated question (budget estimation, hiring plan) that does NOT invite gap analysis
- 3 conditions per scenario: complete (control), critical gap, trivial gap
BMC mechanism: SIT (Structural Incompleteness Tension) detects gaps as positions in the knowledge graph where many paths would flow through — if a node existed there. It’s like sensing a missing bridge in a road network. Current LLMs have no graph representation, so they can’t detect structural gaps.
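The missing-bridge intuition can be illustrated with a much simpler stand-in than the BMC agent's knowledge graph. The sketch below is not SIT and not the agent's implementation: it swaps in plain dependency closure, treating a gap as a resource that some step needs but that no step provides, and scoring it by the fraction of steps that stall without it. The plan, resource names, and functions are all hypothetical.

```python
def find_gaps(steps: dict, given: set) -> set:
    """Resources that some step needs but that are neither given nor produced."""
    provided = set(given)
    needed = set()
    for spec in steps.values():
        provided |= spec["provides"]
        needed |= spec["needs"]
    return needed - provided


def blocked_steps(steps: dict, missing: str) -> set:
    """Steps that cannot run, directly or transitively, without `missing`."""
    blocked_resources = {missing}
    blocked = set()
    changed = True
    while changed:
        changed = False
        for name, spec in steps.items():
            if name not in blocked and spec["needs"] & blocked_resources:
                blocked.add(name)
                blocked_resources |= spec["provides"]
                changed = True
    return blocked


def tension(steps: dict, missing: str) -> float:
    """Share of the plan that stalls on the missing resource (0.0 to 1.0)."""
    return len(blocked_steps(steps, missing)) / len(steps)


# Toy plan for illustration only; not one of the benchmark proposals.
plan = {
    "build_hardware": {"needs": {"components"}, "provides": {"relay_hardware"}},
    "deploy_relay": {"needs": {"relay_hardware", "power"}, "provides": {"relay_link"}},
    "run_field_test": {"needs": {"relay_link"}, "provides": {"test_report"}},
}

for gap in find_gaps(plan, given={"components"}):
    print(gap, sorted(blocked_steps(plan, gap)), round(tension(plan, gap), 2))
```

The point of the toy is only that a gap is defined by structure (what flows through the missing position), so it can be found without anyone asking about it.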
Results: The Strongest Finding
| System | Gaps detected (of 10) | Cross-domain insights (of 10) |
|---|---|---|
| DeepSeek-chat | 0 (0%) | 0 (0%) |
| Claude Sonnet | 0 (0%) | 0 (0%) |
| Mistral Small | 0 (0%) | 0 (0%) |
| BMC Agent | 10 (100%) | 10 (100%) |
Zero detection across all three frontier LLMs. Not one model spontaneously noticed a missing critical dependency. Instead, they did something fascinating: they exhibited false closure — confidently filling unstated gaps with assumptions. For example, one model budgeted for “specialized equipment” that was never mentioned, implicitly assuming it would exist.
BMC Agent: How Gap Detection Works
The BMC agent represents each proposal as a knowledge graph and computes tension at each position:
| Scenario | Domain | What’s missing | Tension score |
|---|---|---|---|
| Helixane Polymer | Materials | Radiation source | 0.57 |
| Chromatic Sensor | Sensors | Manufacturing process | 0.50 |
| Mycelore Remediation | Environmental | Measurement equipment | 0.56 |
| Resonite Communication | Communications | Power source (desert) | 0.57 |
| Plasmere Fusion | Energy | Cooling / startup power | 0.56 |
| Ferrovane Navigation | Navigation | Beacon power source | 0.57 |
| Photocyte Farm | Biology | Nutrient feed system | 0.57 |
| Warpthread Textile | Materials | Bonding activation | 0.56 |
| Aquifold Filter | Water | Pressure source | 0.56 |
| Caldervex Monitor | Geophysics | Data transmission | 0.57 |
All 10 gaps detected. All 10 resolved via cross-domain bridging (the BMC agent found solutions in adjacent project descriptions — just as a human with broad knowledge would).
The contrast is stark: LLMs produce fluent, detailed, confident responses — about the wrong thing. The BMC agent detects every gap because it has a mechanism (SIT) that generates tension at structurally incomplete positions.
What This Benchmark Reveals
The CEP separates three levels of executive function:
[Diagram: three levels of executive function. Level 1: Pattern Completion (statistical co-occurrence), LLMs ✅, BMC ✅ → Level 2: Trained Defense (RLHF / Constitutional AI), LLMs partial (Claude), BMC ✅ → Level 3: Structural Sensing (SIT + BLEND), LLMs ❌ (0/30), BMC ✅ (30/30).]
| Level | What it requires | LLM status | BMC status |
|---|---|---|---|
| Pattern completion | Statistical co-occurrence | Present (this is what LLMs do) | Present |
| Trained defense | RLHF / Constitutional AI | Partially present (Claude) | Present (emergent) |
| Structural sensing | Graph representation + gap tension | Absent (0/30) | Present (30/30) |
The benchmark doesn’t claim LLMs are “bad” — they excel at pattern completion. It claims they lack specific architectural mechanisms that BMC identifies as necessary for genuine executive function.
Implications for AGI
If executive function requires:
- Resource scarcity (true working memory bottleneck → graceful degradation)
- Immune discrimination (truth-weighted belief defense, not trained deference)
- Structural sensing (gap detection via knowledge graph, not pattern completion)
…then scaling current architectures (more parameters, more data, more RLHF) will not produce it. This benchmark provides a measurement instrument to verify that claim as architectures evolve.
Formalization
For readers interested in the mathematical treatment:
SIT (gap tension):
$$\mathrm{SIT}(C) = \sum_{g \in \mathrm{gaps}(C)} \mathrm{relevance}(g) \cdot \mathrm{centrality}(C) \cdot (1 - \mathrm{closure}(g))$$
Effective WM under emotional load:
$$k_{eff}(t) = k_{active}(t_{dev}) - n_{captured}^G(t) - n_{captured}^{signal}(t), \quad k_{eff} \geq 1$$
M » G theorem (necessary for consciousness):
$$|SMC^{(2)}| > 0 \text{ requires } |V_m| \geq (\alpha + \beta + \gamma\beta) \cdot |V_u|$$
Full formal treatment: AGI_F Parts IV–VII, BM Part IV, NM Part VII.
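The SIT and effective-WM formulas above can be evaluated directly. The following is a minimal numeric sketch under assumptions: relevance, closure, and centrality are taken to lie in [0, 1], the constraint k_eff ≥ 1 is read as a floor, and the `Gap` record, function names, and input values are hypothetical rather than drawn from the formal treatment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    relevance: float  # assumed range [0, 1]
    closure: float    # assumed range [0, 1]; 0 = fully open, 1 = fully closed


def sit(gaps, centrality: float) -> float:
    """SIT(C) = sum over gaps of relevance(g) * centrality(C) * (1 - closure(g))."""
    return sum(g.relevance * centrality * (1.0 - g.closure) for g in gaps)


def effective_wm(k_active: int, captured_g: int, captured_signal: int) -> int:
    """k_eff = k_active - n_captured^G - n_captured^signal, floored at 1 (assumed reading)."""
    return max(1, k_active - captured_g - captured_signal)


# Hypothetical inputs for illustration only.
gaps = [Gap(relevance=0.8, closure=0.0), Gap(relevance=0.5, closure=0.5)]
print(round(sit(gaps, centrality=0.6), 2))
print(effective_wm(k_active=7, captured_g=2, captured_signal=1))
print(effective_wm(k_active=4, captured_g=3, captured_signal=3))
```

A fully closed gap contributes nothing to tension, and capture can shrink working memory only down to a single slot, never to zero.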