Technical

Why my AI doesn't have eyes.

Published June 16, 2026

“So it does OCR?”

When I told a friend the third sense of my memory system would read the text of whatever window I had in focus, the first question was the obvious one: “So it does OCR? Or vision?” No. Neither. And why I didn’t take that path is the most interesting design decision in the whole project.

The naïve approach would be: screenshot the foreground window every few seconds, run it through a local vision-language model, get a description, extract themes. The technology works — there are commercial products doing exactly this today. Three problems made me reject it.

Three reasons not to look

One: it costs resources I don’t have. Running a vision model locally on a normal developer machine means choosing between it and the language model that’s already doing the curation; the two don’t fit on the same machine at once. And cloud is out by hard rule: nothing in this system phones home.

Two: it produces noisy, opinionated output. A vision model looking at a screenshot doesn’t say “the user is reading about React Hooks.” It says “a webpage with a header bar, a left navigation menu, dark-theme code blocks, a comment section below.” Most of what it generates describes the interface, not the topic. Terrible signal-to-noise for a system trying to extract one real theme.

Three: it’s the wrong abstraction. The text on your screen isn’t a picture; it’s text. Treating it as a picture means flattening its structure into a bitmap and then trying to recover that structure with statistics.

One layer down

So I went one layer down. Windows has an accessibility API, which it calls UI Automation; macOS and Linux have their own in similar shape. It exists so screen readers for blind users can announce what’s on screen. Browsers, document editors, native apps, even most Electron apps expose a tree of text through it. The same API that lets a screen reader say “Heading: React Hooks. Paragraph: useState lets you add state to a function component” is the one my screen sense queries.
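To make the "tree of text" idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the project's actual code: `Node`, `iter_text` and `read_window` are hypothetical names, and the tree is built by hand. A real reader on Windows would get its nodes from UI Automation instead.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for one element of an accessibility tree."""
    role: str                      # e.g. "Heading", "Paragraph", "Pane"
    text: str = ""                 # what a screen reader would announce
    children: list[Node] = field(default_factory=list)


def iter_text(node: Node) -> Iterator[tuple[str, str]]:
    """Depth-first walk yielding (role, text) for every node that carries text."""
    if node.text.strip():
        yield node.role, node.text.strip()
    for child in node.children:
        yield from iter_text(child)


def read_window(root: Node) -> str:
    """Flatten the tree into the lines a screen reader might announce."""
    return "\n".join(f"{role}: {text}" for role, text in iter_text(root))


# A tiny hand-built tree mirroring the example in the text.
page = Node("Document", children=[
    Node("Heading", "React Hooks"),
    Node("Pane", children=[
        Node("Paragraph", "useState lets you add state to a function component"),
    ]),
])

print(read_window(page))
```

Running it prints two lines, "Heading: React Hooks" and "Paragraph: useState lets you add state to a function component". Containers with no text of their own (the document and the pane) contribute nothing, so layout never leaks into the output.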

That gives me clean, structured text, attribution to the app that produced it, and a fast read — no GPU involved. What I lose: anything in a canvas, an image, a video, or a game — apps that expose no accessibility text. Which is fine. If I wanted the memory to learn from those, the right move is to be intentional about it — paste a screenshot, write a note — not let a vision model guess.
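The trade-off above can be sketched as a small record type. This is an assumption about how such a read might be represented, not the system's real data model: `ScreenReading` and `make_reading` are hypothetical names. The point is that text stays attached to the app that produced it, and a window with no accessibility text yields nothing at all rather than a guess.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScreenReading:
    """Hypothetical record for one read of the focused window."""
    app: str    # which application produced the text (attribution)
    text: str   # the text exposed through the accessibility tree


def make_reading(app: str, fragments: list[str]) -> Optional[ScreenReading]:
    """Return a reading, or None when the window exposed no accessibility text."""
    text = "\n".join(f.strip() for f in fragments if f.strip())
    if not text:
        # Canvas, image, video or game: nothing to read, so record nothing.
        return None
    return ScreenReading(app=app, text=text)


article = make_reading("browser", ["React Hooks", "  useState lets you add state  "])
game = make_reading("game", [])
```

Here `article` carries the app name and two cleaned lines of text, while `game` is `None`. Anything worth remembering from a text-free window has to come in deliberately, through a pasted screenshot or a note.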

The privacy lives in the protocol

There’s a privacy story here too. The accessibility tree flags password fields at the API level, and a reader that respects the flag never sees their contents. A vision model has no such flag: pointed at a banking app, it takes in the rendered balance, the account number, everything on screen, as undifferentiated pixels. With the accessibility tree, what’s flagged as secret is simply never handed over.
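Respecting that flag is cheap for a reader. The sketch below is illustrative only: `Field` and `readable_text` are hypothetical, and `is_password` stands in for whatever flag the platform's accessibility API exposes on the node. A flagged node is dropped before its value is ever looked at.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical accessibility node; is_password mirrors the API-level flag."""
    label: str
    value: str = ""
    is_password: bool = False
    children: list[Field] = field(default_factory=list)


def readable_text(node: Field) -> list[str]:
    """Collect label/value lines, skipping any node flagged as a password."""
    if node.is_password:
        # Respect the flag: return before the value is read at all.
        return []
    lines = []
    parts = [p for p in (node.label, node.value) if p]
    if parts:
        lines.append(": ".join(parts))
    for child in node.children:
        lines.extend(readable_text(child))
    return lines


login = Field("Sign in", children=[
    Field("Username", "javier"),
    Field("Password", "hunter2", is_password=True),
])

lines = readable_text(login)
```

For this form, `lines` is `["Sign in", "Username: javier"]`. The password value never enters the output because the check happens on the node's flag, not on anything the text looks like.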

So: no OCR, no vision, no creepy. Just the protocol that’s let screen readers describe interfaces for thirty years, repurposed to feed a personal memory. The memory understands what I’m reading. It doesn’t see my screen.

— Javier

