Karpathy described the architecture. We already built it.

Andy Massey · 25 April 2026 · Hong Kong

On Dwarkesh Patel's podcast on 17 October 2025, Andrej Karpathy laid out what he thinks the next architecture for personal AI will look like. A small reasoner. An external memory. The two glued together by retrieval.

I listened to it on a walk. By the end I was laughing, because he was describing the system that had been running on the Mac Mini on my desk for the previous month.

This post is not a victory lap. Karpathy got there first in public, and he has earned the right to set the frame. What I want to do is walk through his argument, map it onto what Ostler already does, and be honest about what this means for anyone building in this space.

The argument, in plain English

Karpathy's claim is that roughly 95% of a frontier model's weights are doing memorisation – stock tickers, broken HTML, forum spam, autogenerated gibberish scraped from the open web. Only around 5% is doing actual reasoning. The architecture is inefficient by design, because nobody has bothered to separate the two jobs.

Split them apart, he says, and a 1-billion-parameter reasoner with a good retrieval layer beats a 1.8-trillion-parameter model that tries to do both. That is a roughly 1,800x reduction in parameter count, and the maths is defensible. Llama 3 compresses its training data at something like 0.07 bits per token. Well-structured English carries around 1.5 bits per token. The trillion-parameter model is holding a low-resolution compressed image of the internet it was trained on, and most of that image is noise.
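The arithmetic is simple enough to check on the back of an envelope. A minimal sketch, using only the figures quoted above (none of them measured here):

```python
# Back-of-envelope check of the two ratios in the argument above.
# All figures are the ones quoted in the text, not independently measured.

reasoner_params = 1e9          # small reasoner in Karpathy's sketch
frontier_params = 1.8e12       # frontier model doing memory and reasoning
print(f"parameter ratio: {frontier_params / reasoner_params:.0f}x")

llama3_bits_per_token = 0.07   # quoted training-data compression rate
english_bits_per_token = 1.5   # information content of well-structured English
print(f"resolution gap: {english_bits_per_token / llama3_bits_per_token:.0f}x")
```

The second number is the point: the weights hold their training data at roughly a twentieth of the resolution the text actually carries, which is what "low-resolution compressed image" means in figures.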

The direction of travel is already established. GPT-4o, at around 200 billion parameters, outperforms the original GPT-4 at 1.8 trillion. Inference cost for GPT-3.5-level quality fell by a factor of 280 between 2022 and 2024. Smaller, cleaner, better-architected models keep winning.

The right architecture separates the reasoner from the memory. The reasoner stays small. The memory is whatever you curate.

What this looks like in Ostler

We have been shipping the cognitive-core pattern since I founded Creative Machines in September 2025. It just was not called that yet.

The reasoner is a 9-billion-parameter open-weight model running locally on your Mac. Today that is Qwen 3.5 9B. Next quarter it will be something better. The weights are interchangeable because the reasoner has exactly one job: take a question, consult the memory, return an answer. It does not need to know what the Dow closed at in 2017. It needs to know how to think.

The memory is your Personal World Graph. Your contacts, your calendar, your messages, your browsing history, your documents, your conversations, all structured into nodes and edges and vector embeddings. A vector store for similarity search. An RDF graph for relationships. A SQLite file for everything else. No encyclopedic ballast. Nothing about anyone who is not in your life.
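To make the three-store layout concrete, here is a minimal sketch. The class names, fields, and example rows are illustrative, not Ostler's actual schema:

```python
# Minimal sketch of the three-store layout described above:
# embedded nodes (vector store), triples (RDF-style graph), SQLite for the rest.
# Names and fields are illustrative, not Ostler's actual schema.
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str                               # e.g. "person:james"
    kind: str                             # person, event, message, document...
    text: str                             # the text that was embedded
    embedding: list[float] = field(default_factory=list)

@dataclass
class Edge:
    subject: str                          # RDF-style triple
    predicate: str                        # e.g. "attended", "emailed"
    object: str

db = sqlite3.connect(":memory:")          # a file on the user's disk in practice
db.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT, text TEXT)")
db.execute("CREATE TABLE edges (s TEXT, p TEXT, o TEXT)")

n = Node("person:james", "person", "James, climbing partner", [0.12, 0.88])
e = Edge("person:james", "attended", "event:dinner-2026-04-12")
db.execute("INSERT INTO nodes VALUES (?, ?, ?)", (n.id, n.kind, n.text))
db.execute("INSERT INTO edges VALUES (?, ?, ?)", (e.subject, e.predicate, e.object))

print(db.execute("SELECT o FROM edges WHERE s=? AND p=?",
                 ("person:james", "attended")).fetchone()[0])
```

Note what is absent: there is no table of world facts. Every row exists because it came out of the user's own data.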

When you ask Ostler a question, the reasoner queries the memory, gets back a focused slice of your actual life, and answers from that. The reasoner does the reasoning. The graph does the remembering. Same pattern Karpathy described.
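The query path can be sketched in a few lines. The embedding step and the reasoner are stubbed out here (in Ostler both run locally); the shape of the loop is the point:

```python
# Sketch of the query path: retrieve a focused slice of the user's memory,
# then hand only that slice to the reasoner. Embeddings and the reasoner
# itself are stubbed; the retrieve-then-reason shape is what matters.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

memory = [  # (text, embedding) pairs from the Personal World Graph
    ("Dinner with James, 12 April", [0.9, 0.1]),
    ("Quarterly tax deadline",      [0.1, 0.9]),
]

def answer(question_embedding, k=1):
    # 1. similarity search over the graph's vector store
    ranked = sorted(memory, key=lambda m: cosine(question_embedding, m[1]),
                    reverse=True)
    context = [text for text, _ in ranked[:k]]
    # 2. the small reasoner answers from that slice alone (stubbed here)
    return f"Based on your memory: {context[0]}"

print(answer([1.0, 0.0]))   # e.g. "when did I last see James?"
```

The reasoner never sees the whole graph, only the top-k slice relevant to the question, which is why it can stay small.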

Why this matters strategically

Two consequences flow from this that I do not think are widely appreciated yet.

The first is that local models are catching up faster than anyone outside this space realises. The 280x cost-quality trend is not slowing. A 9B model running on a Mac Mini today is roughly where GPT-3.5 was two years ago. At the same trajectory, next year's 9B runs at today's 30B quality for the same energy cost. Our product gets smarter while we sleep, and we do not have to change a line of code to make that happen. We just swap the weights.

The second is that the memory is the moat, not the model. We do not train models. We do not have a ten-figure compute budget. We will never out-GPU OpenAI. But OpenAI cannot see the WhatsApp threads that matter to me this week, or the email chain with a collaborator I now want to reach out to again, or the meeting notes from six months ago that I need to refer to today. That corpus exists only on my hardware. A frontier lab cannot reproduce it because they do not have consent to extract my life. Nobody does except me.

Karpathy's architecture makes the reasoner a commodity. That is fine. We were never going to win on the reasoner. We are going to win on the memory, and the memory only works if it stays on your machine.

And the validation has kept arriving

In April 2026, OpenAI quietly released Privacy Filter as Apache-2.0 open weights – a 1.5-billion-parameter specialist model whose entire job is to scrub personally identifiable information from text before it leaves the device. The cognitive-core thesis as a concrete deliverable, shipped by the company you would least expect to ship it. (We are wiring it into Ostler's pipeline this week; separate post here.)
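Where such a filter sits in the pipeline matters more than how it works: scrubbing has to happen before anything crosses the network boundary. A sketch of that placement, with a trivial regex stand-in for the actual model (function names are illustrative, not Privacy Filter's API):

```python
# Sketch of where a PII filter sits in a pipeline: everything is scrubbed
# before it leaves the device. The regex rules below are a trivial stand-in
# for the learned filter model; names are illustrative, not a real API.
import re

def scrub_pii(text: str) -> str:
    # stand-in rules; the real filter is a model, not a pair of regexes
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)
    text = re.sub(r"\b\+?\d[\d\s-]{7,}\d\b", "[PHONE]", text)
    return text

def send_upstream(payload: str) -> str:
    # the one choke point: nothing unscrubbed crosses this boundary
    return scrub_pii(payload)

print(send_upstream("Reach me at andy@example.com or +44 20 7946 0958"))
```

The design choice is the single choke point: if every outbound call goes through one function, the privacy property is auditable in one place.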

A few days later, a peer-reviewed paper from Nanjing University and ByteDance put a benchmark behind the same argument. PersonaVLM (arXiv 2604.13074): a 7-billion-parameter reasoner with a curated personalised memory beats GPT-4o by 5.2% on long-term personalisation tasks. Their memory taxonomy – core, semantic, procedural, episodic – maps almost one-to-one onto the Personal World Graph that Ostler already builds. Their personality-evolution mechanism (a five-dimensional Big Five vector updated via exponential moving average across interactions) is a clean answer to a question I had been carrying in my own backlog since 2024.
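The personality-evolution mechanism is small enough to sketch. A minimal version, assuming a smoothing factor and trait ordering of my own choosing (the paper's exact values may differ):

```python
# Sketch of an exponential-moving-average update over a five-dimensional
# Big Five vector, as the paper describes. The alpha value and trait order
# are my assumptions, not the paper's exact parameters.

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def ema_update(profile, observation, alpha=0.1):
    """Blend one interaction's trait estimate into the running profile."""
    return [(1 - alpha) * p + alpha * o for p, o in zip(profile, observation)]

profile = [0.5] * 5                      # neutral prior
observed = [0.8, 0.5, 0.3, 0.6, 0.4]     # trait estimate from one interaction
profile = ema_update(profile, observed)
print(dict(zip(TRAITS, (round(p, 3) for p in profile))))
```

A low alpha is the sensible default here: personality should drift over hundreds of interactions, not flip on one bad day.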

Six months. Three independent endorsements – one from a founding member of OpenAI, one from OpenAI itself, one from a peer-reviewed academic team. The architecture I bet on in late 2025 is no longer mine to defend.

The contrast with cloud-routed personal AI

Apple is reportedly about to announce that Siri will route difficult queries through Google Gemini. Perplexity's "Personal Computer" product advertises a Mac Mini hub that forwards your data to their servers for processing. Poke raised $25 million to build an iMessage assistant that lives in the cloud.

Every one of these products has a trillion-parameter model doing the reasoning and a cloud database holding the memory. They are optimising the wrong half. The reasoner is already getting smaller every quarter. The memory is the part that should never have left your device in the first place.

If Karpathy is right about the architecture – and the cost-quality curve says he is – then cloud-routed personal assistants are building on a foundation that is actively shrinking underneath them. You do not need their 1.8-trillion-parameter model to answer "when did I last see James?". You need a small reasoner and your own calendar.

Where we go from here

I am not going to claim we saw this coming from the start with Karpathy-level clarity. We did not. We built it this way because the cloud route was a non-starter on privacy grounds, and because Apple Silicon made local inference cheap enough to actually try. The architecture fell out of the constraints.

What I will claim is that the constraints were the right ones. Privacy forced us into the cognitive-core pattern before it had a name. Now it has a name, and a very well-known researcher has made the case for it in public, and our job is to keep building out the memory half while the reasoner half gets handed to us for free by the open-weights community every three months.

If that is the future you want to bet on, the product is in friends beta. You can read about the architecture here.

A small reasoner. Your life as memory.

Ostler runs on your Mac. The architecture the field is moving toward.


Questions, corrections, disagreements – hello@ostler.ai.