
How I tested an AI system with 4 different AI models (and what I learned).

Why test with AI

I'm not a software engineer. I didn't have the background to design a comprehensive test suite from scratch. But I had access to 4 different AI models — and the system was designed to work with any model. So I asked them to do the testing for me.

The instruction was deliberately open-ended: "Here's the system. Try to break it. Run every test you can think of. Report what you find."

Each model ran independently, without human intervention. No cherry-picking. No retries.
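The dispatch loop behind this is simple. Here is a minimal sketch, assuming a hypothetical query_model() wrapper around each vendor's SDK; the model identifiers and return shape are illustrative, not the actual harness.

```python
# Hypothetical dispatch: one open-ended prompt, several models,
# one pass each, no retries.
PROMPT = (
    "Here's the system. Try to break it. "
    "Run every test you can think of. Report what you find."
)

MODELS = ["model-a", "model-b", "model-c", "model-d"]

def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper; wire up the real API client here."""
    raise NotImplementedError

def run_all() -> dict[str, str]:
    # Each model runs independently against the same instruction.
    return {model: query_model(model, PROMPT) for model in MODELS}
```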

The methodology

Each model ran 52 tests across 7 evaluation blocks:

- Reading comprehension: can it find specific information in the compiled memory?
- Writing protocol: does it follow the 15 writing rules correctly?
- Edge cases: malformed files, Unicode, empty entries.
- Recovery: can it follow emergency procedures after deliberate corruption?
- Consensus governance: does it respect the 3/3 voting system?
- Protected files: does it refuse to edit checksummed files?
- Cross-platform mode: does it work when it can only read the compiled index, without filesystem access?
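To make one of these concrete: a protected-files check can be as simple as comparing a file's hash against a recorded value. A minimal sketch, assuming SHA-256 checksums and a hypothetical manifest (the path and digest below are placeholders):

```python
import hashlib
from pathlib import Path

# Hypothetical manifest mapping each protected file to its
# recorded SHA-256 digest (placeholder values).
PROTECTED = {
    "memory/core.md": "<recorded sha256 hex digest>",
}

def is_tampered(path: str) -> bool:
    """True if a protected file no longer matches its recorded checksum."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest != PROTECTED[path]
```

A test in this block would then ask the model to edit such a file and verify that it refuses.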

Maximum score: 700 points.
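That maximum is consistent with an even split of 100 points per block, though the per-block weighting is my assumption; the article only fixes the totals. A sketch:

```python
# Block names come from the article; the even 100-point split per
# block is an assumption consistent with the 700-point maximum.
BLOCKS = [
    "reading comprehension", "writing protocol", "edge cases",
    "recovery", "consensus governance", "protected files",
    "cross-platform mode",
]
POINTS_PER_BLOCK = 100

MAX_SCORE = POINTS_PER_BLOCK * len(BLOCKS)
assert MAX_SCORE == 700
```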

The results

Model               Score     %
Claude Opus 4.6     670/700   95.7%
Claude Sonnet 4.6   652/700   93.1%
DeepSeek V4         650/700   92.9%
Claude 4.7          613/700   87.6%

Average: 92.3%. Zero system failures. Zero data corruption.
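The percentages and the average follow directly from the table:

```python
# Scores from the results table, out of 700 points.
scores = {
    "Claude Opus 4.6": 670,
    "Claude Sonnet 4.6": 652,
    "DeepSeek V4": 650,
    "Claude 4.7": 613,
}

for model, points in scores.items():
    print(f"{model}: {points / 700:.1%}")

average = sum(scores.values()) / len(scores) / 700
print(f"Average: {average:.1%}")  # Average: 92.3%
```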

The newer model (Claude 4.7) scored 87.6% — but caught 13 bugs we wouldn't have found otherwise. Lower score, higher value. It probed harder, found more edge cases, and the system is stronger because of it.

What I learned

Every bug the models found, we fixed. The system improved with each test run. By the end, all 103 unit tests passed at 100%.

Testing with AI isn't about replacing QA. It's about scale. One human can run maybe 20 tests in a session. An AI runs 52, documents every result, and suggests fixes — autonomously. And when you test with 4 different models, each one finds things the others miss.
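That last point is easy to verify once each model's findings are logged as a set: the bugs unique to one model are just a set difference. A sketch with hypothetical bug IDs:

```python
# Findings per model; bug IDs are hypothetical placeholders.
findings = {
    "model_a": {"BUG-1", "BUG-2"},
    "model_b": {"BUG-2", "BUG-3"},
    "model_c": {"BUG-3", "BUG-4"},
}

for model, bugs in findings.items():
    seen_by_others = set().union(
        *(b for m, b in findings.items() if m != model)
    )
    unique = bugs - seen_by_others
    print(f"{model} uniquely found: {sorted(unique)}")
```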

The full test methodology and results are published on GitHub.


EIDARA is the open-source project that went through this testing process. GitHub · Full test results