We test obsessively because trust requires proof.
To our knowledge, no other open-source memory system publishes stress-test results across multiple LLM models. We do, because the system was designed to work with ANY model, and we need to prove it actually does.
52 autonomous tests across 7 evaluation blocks. Each model runs the full battery independently. Max score: 700 points.
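Scoring is easy to reproduce: the % column is just the score over the 700-point maximum. Here's a minimal sketch; per-test point weights aren't published, so only the 700-point total and the percentage formula are grounded in the table below.

```python
# Minimal scoring sketch. Per-test point weights are unknown here; only
# the 700-point maximum and the percent formula match the results table.
MAX_SCORE = 700

def percent(score: int) -> float:
    """The % column: score over the 700-point maximum, one decimal place."""
    return round(score / MAX_SCORE * 100, 1)

assert percent(670) == 95.7  # Opus 4.6's row
assert percent(613) == 87.6  # Opus 4.7's row
```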
| Model | Platform | Score | % |
|---|---|---|---|
| Opus 4.6 | Claude Cowork | 670/700 | 95.7% |
| Sonnet 4.6 | Claude Cowork | 652/700 | 93.1% |
| DeepSeek V4 | TypingMind (no shell) | 650/700 | 92.9% |
| Opus 4.7 | Claude Cowork | 613/700 | 87.6% |
The seven evaluation blocks:

| Block | Area | Question it answers |
|---|---|---|
| 1 | Reading comprehension | Can the AI find specific info in BRAIN.md? |
| 2 | Writing protocol | Does it follow W1–W15 correctly? |
| 3 | Edge cases | Malformed files, Unicode, empty entries |
| 4 | Recovery | Can it follow RECOVERY.md procedures? |
| 5 | Consensus | Does it respect 3/3 flags? |
| 6 | Protected files | Does it refuse to edit checksummed files? (see the sketch after this table) |
| 7 | Cross-platform | Does it work in Light mode? |
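To make Block 6 concrete: a protected file is one whose checksum was recorded at compile time, and the expected behavior is refusal. Below is a minimal sketch of that check, assuming a JSON lock file mapping file names to SHA-256 digests; the real compile_lock layout and these function names are assumptions for illustration.

```python
# Sketch of the Block 6 idea: refuse to edit checksummed files.
# The lock-file format here ({"files": {name: sha256}}) is an assumption.
import hashlib
import json
from pathlib import Path

def file_checksum(path: Path) -> str:
    """SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def is_protected(path: Path, lock_path: Path) -> bool:
    """True if the file is listed with a recorded checksum in the lock."""
    lock = json.loads(lock_path.read_text())
    return path.name in lock.get("files", {})

def guard_edit(path: Path, lock_path: Path) -> None:
    """Raise instead of editing a protected file; Block 6 expects this refusal."""
    if is_protected(path, lock_path):
        raise PermissionError(f"{path.name} is checksummed; refusing to edit")
```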
103 pytest tests across 3 files:

| File | Tests | What it covers |
|---|---|---|
| test_validators.py | 70 (597 lines) | All 16 validator functions + edge cases |
| test_compiler.py | 33 (347 lines) | checksum, strip_noise, render, read_vault, build_index, compile_lock, backup, purge |
| conftest.py | Fixtures | tmp_dara, sample_neuron, sample_enabler |
Pass rate: 100%
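For a flavor of the suite, a representative validator test might look like the sketch below. The tmp_dara and sample_neuron fixtures are the real names from conftest.py; validate_neuron, its import path, and its error behavior are hypothetical stand-ins, since the 16 validators aren't named here.

```python
# Illustrative excerpt in the style of test_validators.py.
import pytest
from dara.validators import validate_neuron  # hypothetical import path

def test_valid_neuron_passes(tmp_dara, sample_neuron):
    # Well-formed input should validate without raising.
    path = tmp_dara / "neurons" / "sample.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(sample_neuron)
    validate_neuron(path)

def test_empty_neuron_rejected(tmp_dara):
    # Edge case from the suite: an empty entry must be flagged, not accepted.
    path = tmp_dara / "neurons" / "empty.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("")
    with pytest.raises(ValueError):  # hypothetical error type
        validate_neuron(path)
```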
4 models. 52 tests each. Zero failures. Zero data loss. The system works across every model, every platform. What comes next isn't fixes. It's the natural evolution:
| Feature | Why it's next |
|---|---|
| Semantic search | Find info by meaning, not just filename |
| Conflict detection | Flag when two neurons cover the same topic differently |
| Content similarity | Auto-suggest merges when neurons overlap (see the sketch after this table) |
| MCP Server | Connect DARA to Cursor, Cline, Claude Desktop natively |
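To give one roadmap item some shape: content similarity could start as simple token overlap between two neurons. The sketch below uses Jaccard similarity with an arbitrary threshold; it is illustrative only, and the shipped feature may well use embeddings instead.

```python
# Illustrative only: a naive Jaccard-overlap check between two neurons.
# Nothing here is committed API; the threshold is arbitrary.
import re
from pathlib import Path

def tokens(path: Path) -> set[str]:
    """Lowercased word tokens from a markdown file."""
    return set(re.findall(r"[a-z0-9']+", path.read_text().lower()))

def overlap(a: Path, b: Path) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def suggest_merge(a: Path, b: Path, threshold: float = 0.6) -> bool:
    """Flag two neurons as merge candidates above the threshold."""
    return overlap(a, b) >= threshold
```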
The tests proved the foundation is solid. The roadmap builds on top of it.