Like grad-student descent, but with agents.
I asked Opus 4.6, inside Cursor's CLI agent, to write a torch nn.Module (i.e., a neural network) along with an evaluation script that fit for only 3 minutes and reported the test-set accuracy.
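For concreteness, a minimal sketch of the kind of module involved; the actual architecture the agent produced may differ (the name SmallCNN and the layer sizes here are illustrative assumptions, not from the repo):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative MNIST-sized CNN: two conv blocks, then a linear head."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 28 -> 14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 14 -> 7
        )
        self.head = nn.Linear(32 * 7 * 7, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))
```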
Then I asked it to follow the Popperian approach (KPop) to the scientific method:
- Hypothesize: Write a falsifiable statement about the problem under study.
- Falsify: Run a test that could falsify the hypothesis.
- After each hypothesis, the agent has a little more information about the problem.
- "Hallucination" is mostly irrelevant in this process. All hypotheses are welcome, as long as they can be falsified. The system grounds itself in reality by running the evaluation. The agent doesn't even need to be all that smart! It just needs to be disciplined and tenacious.
- Restricting the fitting time speeds up evaluation and makes it unambiguous when fitting is done. Compare this to "fit until converged", which is indeterminately long and hard to even define. Also, if you can reach the same loss, wouldn't you rather get there in less time?
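The time-budget idea from the last bullet can be sketched as a wall-clock-bounded loop (`step` here is a hypothetical one-epoch callable for illustration, not a function from the repo):

```python
import time

def fit_with_budget(step, budget_s=180.0):
    """Run `step` repeatedly until the wall-clock budget expires.

    Each call to `step` does one unit of work (e.g. one epoch) and
    returns a metric; we stop cleanly once the budget is spent, so
    every run takes at most ~budget_s seconds.
    """
    start = time.monotonic()
    metrics = []
    while time.monotonic() - start < budget_s:
        metrics.append(step())
    return metrics
```

A fixed budget also makes runs comparable: every hypothesis gets the same compute, so accuracy differences are attributable to the change under test.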
export PYTHONPATH=${PWD}
python experiments/mnist/fit_mnist.py
You should see something like:
epoch 1/4 train_loss=0.4003 test_acc=0.9785
epoch 2/4 train_loss=0.0643 test_acc=0.9867
epoch 3/4 train_loss=0.0280 test_acc=0.9907
timeout after 180s (during epoch 4)
final test_acc=0.9917
That's on my old Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz, so YMMV.
In fact, I get better results on my MacBook Air (still CPU, not GPU):
epoch 1/4 train_loss=0.4049 test_acc=0.9634
epoch 2/4 train_loss=0.0640 test_acc=0.9859
epoch 3/4 train_loss=0.0286 test_acc=0.9910
epoch 4/4 train_loss=0.0124 test_acc=0.9914
final test_acc=0.9914
See exp_log.md and exp_log_2.md for all hypotheses and falsification results. The agent reports:
- Architecture matters most: MLP -> CNN was the biggest single jump (0.981 -> 0.989).
- Speed is accuracy on a budget: On CPU with a time limit, anything that slows epochs (deeper nets, torch.compile overhead, data augmentation) hurts even if it would help given unlimited time.
- Diminishing returns on hyperparams: Once architecture and LR schedule are right, tuning dropout/label smoothing/weight decay yielded <0.1% changes.
- OneCycleLR schedule mismatch helps: Configuring OneCycleLR for more epochs than actually complete keeps the LR from decaying too fast.
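The schedule-mismatch trick from the last bullet can be sketched as follows; the step counts are made-up numbers for illustration, not the repo's actual settings:

```python
import torch

# Tell OneCycleLR to plan for more steps than the time budget will
# actually allow. Training then stops mid-schedule, so the LR never
# reaches the deep end of its decay phase (by default OneCycleLR
# anneals the final LR down by final_div_factor=1e4).
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.1)

planned_steps = 1000   # what we tell the scheduler
actual_steps = 300     # what the 3-minute budget actually permits

sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1, total_steps=planned_steps)

for _ in range(actual_steps):
    opt.step()    # (would normally follow loss.backward())
    sched.step()

# The LR is still near max_lr rather than fully decayed.
print(opt.param_groups[0]["lr"])
```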
The idea here is to construct a controller in simulation that could be tuned with only a few rounds of experimental Bayesian optimization. We don't expect our simulations to match reality. We just hope that they're close enough to (i) put us in the ballpark of reality, and (ii) let us know whether the controller is tunable under a reasonable variety of conditions.
I gave Opus TuRBO-ENN as a CLI tool to optimize a heuristic controller of its design. It followed the "KPop" approach (described above) and found several ways to improve its baseline controller while developing some understanding of what mattered and what didn't. Read more.