Andrej Karpathy released autoresearch a few days ago, and after going viral it already has 25,000 GitHub stars. I must have run into media mentioning the repo and the launch tweet at least 100 times by now, and I was further inspired by Greg Isenberg’s video.
Since I can’t run it locally without an NVIDIA GPU, I went a different direction and ended up with something pretty interesting: a good way to break down and understand what the optimization loop is actually doing.
The Loop Is The Cool Part
Karpathy’s loop looks like this:
Read training code → Propose one change → Train for 5 minutes →
Measure loss → If better: commit. If worse: revert. → Repeat.
The GPU handles the “train for 5 minutes” step. However, when we look at the rest of the loop, none of it requires GPU compute. An agent reads text, proposes a change, and a metric determines whether to keep it. That pattern works for anything with a measurable outcome — not just neural network training.
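To see how little of the loop depends on the GPU, here’s a minimal sketch of the generic version in Python. None of this is Karpathy’s code; the propose and score callables are placeholders for whatever fits your artifact:

```python
from typing import Callable

def optimize(artifact: str,
             propose: Callable[[str], str],  # returns one candidate revision
             score: Callable[[str], float],  # higher is better
             iterations: int = 10) -> str:
    """Generic keep-or-revert loop: only score() needs domain knowledge."""
    best = score(artifact)                   # measure the baseline first
    for _ in range(iterations):
        candidate = propose(artifact)        # exactly one targeted change
        new = score(candidate)
        if new > best:                       # better? keep it ("commit")
            artifact, best = candidate, new
        # worse or equal? fall through and discard it ("revert")
    return artifact
```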
After watching Greg’s video which was full of applications and startup ideas, I started asking myself: What if the artifact wasn’t a training script? What if it was a document? What if “train for 5 minutes” was replaced with “call an LLM judge to score it”?
The Experiment
I put together a short Python script with Claude’s help to test the idea on a Markdown document. We only need the Anthropic API and a git repo.
Note: I opted for the Anthropic API, but any LLM could be swapped into the script with a small integration change.
Anyway, here’s the structure (Claude named the pieces), and I liked the way it’s broken down:
The artifact — artifact.md, a fake product spec document with real weaknesses: vague strategies, internal contradictions, missing details.
The agent — calls Claude, reads the document and goals, proposes exactly one targeted change with a stated hypothesis for why it will help.
The judge — a separate Claude call that scores the document 0–100 on consistency, completeness, specificity, and strategic soundness. Importantly, the judge doesn’t know what changed.
The ratchet — git. If the score goes up, the change is committed with the hypothesis as the commit message. If the score goes down, the file is reset to HEAD. The document only ever gets better.
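To make that concrete, here’s a rough sketch of the judge and the ratchet. The model name, prompt wording, and function shapes are my assumptions, not necessarily what’s in the actual script:

```python
import anthropic
from git import Repo

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(document: str) -> float:
    """Score 0-100. The judge only sees the result, never the diff."""
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any recent model works
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": "Score this document 0-100 on consistency, completeness, "
                       "specificity, and strategic soundness. Reply with only "
                       "the number.\n\n" + document,
        }],
    )
    return float(resp.content[0].text.strip())

def ratchet(repo: Repo, path: str, old: float, new: float, hypothesis: str) -> None:
    """Git as the ratchet: commit improvements, throw everything else away."""
    if new > old:
        repo.index.add([path])
        repo.index.commit(f"{hypothesis} | score: {old} -> {new}")
    else:
        repo.git.checkout("HEAD", "--", path)  # reset the file to HEAD
```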
Here’s the terminal output after five iterations:

⚙ AUTOLOOP — Self-Improving Document Optimizer
Baseline score: 62.0/100
🔄 Iteration 1/5
Hypothesis: The pricing strategy "Start low to get customers, raise prices later" is vague hand-waving with no actionable guidance.
Score: 62.0 → 68.0 ✓ KEPT
🔄 Iteration 2/5
Hypothesis: The technical architecture lacks specificity around data pipeline performance and reliability.
Score: 68.0 → 68.0 ✗ REVERTED
🔄 Iteration 3/5
Hypothesis: "Nightly" score recalculation contradicts the promise of "real-time health scores" in positioning.
Score: 68.0 → 62.0 ✗ REVERTED
🔄 Iteration 4/5
Hypothesis: Architecture section lacks security, scalability, and reliability details a developer needs to assess feasibility.
Score: 68.0 → 72.0 ✓ KEPT
🔄 Iteration 5/5
Hypothesis: Open Questions section reveals gaps that should have been resolved, weakening completeness.
Score: 72.0 → 72.0 ✗ REVERTED
✅ LOOP COMPLETE
Baseline: 62.0 → Final: 72.0 (+10 points)
Wins: 2 / Losses: 3
The Interesting Part Is the Reverts
Iteration 3 is the one worth a closer look.
The agent correctly identified a real contradiction: the document promises “real-time health scores” in the positioning section but says scores are “recalculated nightly” in the technical architecture. That’s a genuine flaw. The agent formed a hypothesis, proposed a fix, and the judge scored it lower.
The system threw the change away automatically without any human reviewing it.
Why did the score drop? Probably because the fix introduced new inconsistencies while resolving the old one. The agent’s change was locally correct, but it made the document as a whole score worse.
The git log at the end tells this story:
6935e6e Iteration 4: Expanded architecture with security,
scalability details | score: 68.0 → 72.0
80e7991 Iteration 1: Replaced vague pricing with specific
CAC recovery rationale | score: 62.0 → 68.0
ee31603 Initial: baseline artifact
That’s two winning changes, three discarded experiments, and a document that is genuinely better.
Why This Is Different From Just Asking Claude To Improve A Document
This was one of the first things I wondered: why not just tell ChatGPT or Claude to “make this doc better”? It’s possible that CoWork or Claude Code could be prompted and coaxed into executing a similar optimization process.
On the other hand, when you ask Claude to improve a document in a normal conversation, you get one pass. The model makes changes based on its best judgment in that moment. You can prompt it to optimize in a certain direction, but it doesn’t know what it tried before, and it normally has no numerical goal to optimize against when updating a document.
It can’t throw away changes that made things worse, because it doesn’t know they made things worse. You can tell it, and go back and forth, but in my experience you’ll get pretty sloppy output and still have to edit most of it.
The loop changes this. Every iteration has access to the full commit history, so it knows what was tried and kept, what was tried and discarded, and is far less likely to repeat itself. It’s always working from the best-known version under the criteria it was given.
And critically, the judge and the agent are separate — the agent can’t game its own evaluation.
The output isn’t just a better document. It’s a traceable record of what a machine reasoned about your document, what it tried, what worked, and what it threw away. That’s very different from a chat conversation.
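Pulling that history back into the agent’s prompt is cheap with gitpython. A sketch (the framing is mine; reverted hypotheses don’t live in the git log, so the loop would need to record them separately, e.g. in a plain-text file it appends to):

```python
from git import Repo

def history_context(repo: Repo, limit: int = 20) -> str:
    """Summarize kept changes so the agent doesn't re-propose them.
    Commit messages carry the hypotheses, so the log doubles as memory."""
    kept = [c.message.strip() for c in repo.iter_commits(max_count=limit)]
    return "Previously kept changes:\n" + "\n".join(f"- {m}" for m in kept)
```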
More than Autoresearch
Most of what I was reading about autoresearch frames it as an ML research tool, but there’s more to it. The pattern inside the code — a fixed artifact, a fixed metric, an agent proposing one change at a time, keep or revert, repeat — can be applied to other things.
The GPU is load-bearing for the ML training version because training neural networks requires massive parallel compute. But if your artifact is only text and your metric can be evaluated by an LLM, you don’t need to train anything. The whole computation happens on API servers you’re already paying for.
Someone on GitHub issues for autoresearch wrote: “would be really cool to formulate this repo in a more general way… make it possible to solve any kind of problem where you can define an evaluation function for.”
That’s basically what this is.
The same loop works for:
- A product spec (what was run in the demo)
- A landing page, evaluated against conversion principles
- Code, evaluated by a test suite (sketched after this list)
- An image generation prompt, evaluated by a vision model against a target description
The artifact and the metric change. The loop just iterates, trying to score as high as possible.
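Swapping domains only means swapping the score function. For the “code, evaluated by a test suite” case, a rough sketch might look like this (the pytest invocation and summary parsing are my assumptions):

```python
import re
import subprocess

def score_by_tests(test_dir: str = "tests") -> float:
    """Score 0-100 by the fraction of passing tests instead of an LLM judge."""
    result = subprocess.run(
        ["python3", "-m", "pytest", test_dir, "-q", "--tb=no"],
        capture_output=True, text=True,
    )
    # pytest's summary line looks like "3 passed, 1 failed in 0.12s"
    counts = {word: int(num)
              for num, word in re.findall(r"(\d+) (passed|failed)", result.stdout)}
    total = counts.get("passed", 0) + counts.get("failed", 0)
    return 100.0 * counts.get("passed", 0) / total if total else 0.0
```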
The Code
The full implementation is on GitHub: ~150 lines of Python. I ran it in my local WSL environment; the only dependencies are anthropic and gitpython.
pip install anthropic gitpython
export ANTHROPIC_API_KEY=your_key
python3 autoloop.py --iterations 10
To use it on your own document: replace artifact.md with your Markdown file, rewrite goals.md to describe what you’re optimizing for, and run it.
It just needs an artifact (the document) and a way to measure whether it got better (the goals).
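For reference, goals.md doesn’t need to be elaborate. A hypothetical example for a product spec:

```markdown
# Optimization goals
- Every strategy should be specific and actionable, not hand-waving.
- No internal contradictions between sections.
- Enough architectural detail for a developer to assess feasibility.
- Judge on: consistency, completeness, specificity, strategic soundness.
```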
The code is at https://github.com/anthonylatona/autorefine. If you run it on something interesting, I’d like to see what you find.
footnote (some boring details)
I started out trying to understand why Karpathy’s version requires NVIDIA hardware.
I wasn’t clear on exactly why it would be required, or why my 7900 XTX wouldn’t suffice. I can run Cyberpunk 2077 on max settings at 60 FPS… why can’t I run a little Python script?
Autoresearch basically trains a neural network from scratch on every single iteration. Every five minutes, it performs hundreds of millions of floating-point multiplications across millions of parameters, adjusts model weights, and measures prediction accuracy on held-out text.
Without the right GPU, one iteration doesn’t take five minutes. It could take hours. And the whole premise of autoresearch is running 100 iterations overnight. At ~five hours each, that’s 500 hours. The loop stops being useful.
So the NVIDIA GPU isn’t just a recommendation. It’s a requirement. Karpathy’s original breaks without it.
As for why my 7900 XTX specifically couldn’t step in: the problem isn’t the hardware. It’s the software stack.
After (too much) research, I found out that NVIDIA’s GPU compute platform is called CUDA and AMD’s equivalent is ROCm. On Linux, ROCm reportedly works well and my card is officially supported — but I’m on Windows. AMD only rolled out Windows PyTorch support in late 2025, and when they did, they drew the line at RDNA3 and newer.
The 7900 XTX is RDNA3, but WSL2 support is a different story: people with identical cards report that the GPU is detectable, PyTorch loads it, and then it just hangs. The GPU shows up; it just can’t actually be used for compute.
When I tried running it, ROCm wasn’t installed. The render node was accessible (/dev/dri/renderD128 exists), which is probably better than expected… But official ROCm support under WSL2 is still patchy enough that getting PyTorch to actually train a model on this hardware would have taken a full day of debugging driver issues — which is the entire reason I went the other direction.