# Vibe Coding Terminal Editor

I "wrote" [a small tool](https://github.com/matklad/terminal-editor/) for myself as my biannual
routine check of where LLMs are currently at. I think I've learned a bunch from this exercise. This
is frustrating! I don't want to learn by trial and error, I'd rather read someone's blog post with
lessons learned. Sadly, _most_ of the writing on the topic that percolates to me tends to be
high-level --- easy to nod along with while reading, but hard to extract actionable lessons from. So
that is what I want to do here: list specific tricks learned.

## Terminal Editor

Let me quickly introduce the project. It's a VS Code extension that allows me to run a "shell"
inside my normal editor widget, such that the output is a normal text buffer where all the standard
motion/editing commands work. So I can "goto definition" on paths printed as part of a backtrace,
use multiple cursors to copy the compiler's suggestions, or just [PageUp]{.kbd} / [PageDown]{.kbd}
to scroll the output. If you are familiar with Emacs, it's
[Eshell](https://www.gnu.org/software/emacs/manual/html_mono/eshell.html), just worse:

![](https://github.com/user-attachments/assets/acaf653e-a170-4685-8cce-5ca8dd31b9b4){width=1398 height=1086}

I now use `terminal-editor` to launch most of my compilation commands, as it has several niceties
on top of what my normal shell provides. For example, by default only the last 50 lines of output
are shown, but I can hit [Tab]{.kbd} to fold and unfold the full output. Such a simple feature, but
such a pain to implement in a UNIX shell/terminal!

What follows is an unstructured bag of things learned:

## Plan / Reset

I originally tried to use `claude` code normally, by iteratively prompting in the terminal until I
got the output I wanted. This was frustrating, as it was too easy to miss a good place to commit a
chunk of work, or to rein in a conversation going astray. This "prompting-then-waiting" mode also
forced a pattern of mental context switches that didn't match my preferred style of work. This
article suggests a better workflow: <https://harper.blog/2025/05/08/basic-claude-code/>{.display}

Instead of writing a single prompt in the terminal, you write an entire course of action as a task
list in a `plan.md` document, and the actual prompt is then something along the lines of

> Read @plan.md, complete the next task, and mark it with `X`.
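
For concreteness, such a `plan.md` is just a check list. A made-up example (not from the actual
repository):

```markdown
# Plan

- [X] Spawn the child process and pipe its stdout into the buffer.
- [X] Render the exit code and runtime when the process finishes.
- [ ] Show only the last 50 lines of output by default.
- [ ] Fold/unfold the full output on Tab.
```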

After `claude` finishes iterating on a step, you look at the diff and interactively prompt for the
necessary corrections. When you are happy, `git commit` and `/clear` the conversation, to start the
next step from a clean slate.

The plan pattern reduces context switches because it allows you to plan several steps ahead while
you are in planning mode, even if it makes sense to do the work one step at a time. I often also
continue extending the plan while `claude` is working on the current task.

## Whiteboard / Agent Metaphor

A brilliant metaphor from another post
<https://crawshaw.io/blog/programming-with-agents>{.display}
is that prompting an LLM for some coding task and then expecting it to one-shot a working solution
is quite a bit like asking a candidate to whiteboard an algorithm during an interview.

LLMs are clearly superhuman at whiteboarding, but you can't go far without feedback. "Agentic"
programming tools like `claude` allow LLMs to iterate on a solution.

LLMs are _much_ better at whiteboarding than at iterating. My experience is that, starting with a
suboptimal solution, an LLM generally can't improve it by itself along the fuzzy aesthetic metrics
I care about. It can make valid changes, but the overall quality stays roughly the same.

However, LLMs are tenacious and can do a lot of iterations. If you _do_ have a value function, you
can use it to extract useful work from a random walk! A _bad_ value function is human judgement:
sitting in the loop with an LLM and pointing out mistakes is both frustrating and slow (you are the
bottleneck). In contrast, "make this test green" is very efficient at getting working (≠ good)
code.

## Spec Is Code Is Tests

LLMs are good at "closing the loop": they can make the ends meet. This insight, combined with the
`plan.md` pattern, gives my current workflow --- a spec ↔ code ↔ test loop. Here's the story:

I coded the first version of `terminal-editor` using just the `plan.md` pattern, but at some point
I hit a complexity wall. I realized that my original implementation strategy for syntax
highlighting was a dead end, and I needed to change it, but that was hard to do without making a
complete mess of the code. The accumulated `plan.md` reflected a bunch of historical detours, and
the tests were too brittle and coupled to the existing implementation (more on tests later). This
worked for incremental additions, but now I wanted to change something in the middle.

I realized that what I want is not an append-only `plan.md` that reflects history, but rather a
mutable `spec.md` that clearly describes how the software should behave. For normal engineering,
this would have been a "damn, I guess I need to throw one out and start afresh" moment. With
`claude`, I added `plan.md` and all the code to the context and asked it to write a `spec.md` file
in the same task list format. There are two insights here:

_First_, a mutable spec is a good way to instruct an LLM. When I want to apply a change to
`terminal-editor` now, I prompt `claude` to update the spec first (unchecking any items that need
re-doing), manually review and touch up the spec, and then use a canned prompt to align the code
and tests with the spec.
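
The canned prompt can be as simple as something along the lines of (a hypothetical wording, the
exact phrasing doesn't matter much):

> Read @spec.md and align the code and the tests with the spec, paying attention to any unchecked
> items.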

_Second_, you can think of an LLM as a machine translator, which can automatically convert between
working code, specification, and tests. You can treat _any_ of those things as an input, as if you
are coding in [miniKanren](https://minikanren.org)!

## Tests

I did have this idea of closing the loop when I started with `terminal-editor`, so I crafted the
prompts to emphasize testing. You can guess the result! `claude` wrote a lot of tests, following
all the modern "best practices" --- a deluge of unit tests that needlessly nailed down internal
APIs, a jungle of bug-hiding mocks, and a bunch of unfocused integration tests which were slow,
flaky, and contained a copious amount of sleeps to paper over synchronization bugs. Really, this
was eerily similar to a typical test suite you can find in the wild. I wonder why that is?

This is perhaps my main takeaway: if I am vibe-coding anything again, and I want to maintain it and
not just one-shot it, I will think very hard about the testing strategy. Really, to toot my own
horn, I think that perhaps [_How to Test?_](https://matklad.github.io/2021/05/31/how-to-test.html)
is the best article out there about agentic coding. Test iteration is a multiplier for humans, but
a hard requirement for LLMs. Tests must be very fast, non-flaky, and should exercise application
_features_ end-to-end, rather than code.

Concretely, I just completely wiped out all the existing tests. Then I added the testing strategy
to the spec. There are two functions:

```ts
export async function sync(): Promise<void>
export function snapshot(): string
```

The `sync` function waits for all outstanding async work (like external processes) to finish. This
requires properly threading causality throughout the code. E.g., there's a promise you can `await`
to join the currently running process. The `snapshot` function captures the entire state of the
extension as a single string. There's just one mock, for the clock (another improvement on the
usual terminal --- process runtime is always shown).
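
To make this concrete, here is a minimal sketch of the shape such a pair could take. This is an
illustration of the idea, not the actual `terminal-editor` code; the `runningProcess` promise and
the `Clock` interface are invented for the example:

```ts
// Resolves when the currently running process exits; undefined when idle.
let runningProcess: Promise<void> | undefined = undefined;

// Injectable clock: real time in production, a fixed value in tests, so
// that the rendered process runtime is deterministic.
export interface Clock {
  now(): number;
}

// Join all outstanding async work. Tests call this instead of sleeping.
export async function sync(): Promise<void> {
  while (runningProcess !== undefined) {
    const current = runningProcess;
    await current;
    // Awaiting may have scheduled more work (e.g., a re-render), so loop
    // until the world is quiescent.
    if (runningProcess === current) runningProcess = undefined;
  }
}

// Render the entire observable state of the extension as one string.
export function snapshot(): string {
  // ...serialize buffer contents, fold state, exit codes, runtimes...
  return "";
}
```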

Then, I prompted `claude` with something along the lines of

> Oops, looks like someone wiped out all the tests here, but the code and the spec look decent,
> could you re-create the test suite using the `snapshot` function as per @spec.md?

It worked. Again, "throw one away" is very cheap.
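
The re-created tests can all follow one shape: poke the extension, `await sync()`, and compare
`snapshot()` against an expected string. A hypothetical example (the `runCommand` helper, the
module path, and the snapshot format are all invented here):

```ts
import { test } from "node:test";
import * as assert from "node:assert";
// Hypothetical module path and helper, for illustration only.
import { runCommand, sync, snapshot } from "./terminal-editor";

test("shows exit code and runtime", async () => {
  // Drive a feature end-to-end through the public surface.
  runCommand("false");
  await sync(); // join the spawned process; no sleeps needed

  // The whole observable state is one string, trivially diffable.
  assert.strictEqual(
    snapshot(),
    "$ false\n" +
      "exit code: 1, runtime: 0ms\n", // deterministic via the mock clock
  );
});
```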

## Conclusions

That's it! LLMs obviously can code. You need to hold them right. In particular, you need to
engineer a feedback loop that lets the LLM iterate at its own pace. You don't want a human in the
"data plane" of the loop, only in the control plane. Learn to
[architect for testing](https://matklad.github.io/2021/05/31/how-to-test.html).

LLMs drastically reduce the activation energy for writing custom tools. I have wanted something
like `terminal-editor` forever, but it was never the most attractive yak to shave. Well, now I have
the thing, and I use it daily.

LLMs don't magically solve all software engineering problems. The biggest time sink with
`terminal-editor` was solving the `pty` problem, but LLMs are not yet at the "give me UNIX, but
without the `pty` mess" stage.

LLMs don't solve maintenance. A while ago I wrote about
[_LSP for jj_](https://matklad.github.io/2024/12/13/majjit-lsp.html). I think I can actually code
that up in a day with Claude now? Not a proof of concept, the production version with everything
_I_ would need. But I don't want to _maintain_ that. I don't want to context switch to fix a minor
bug if I am the only one using the tool. And, well, if I make this for other people, I'd definitely
be on the hook for maintaining it :D
