This doc summarizes relevant prior art on “next shell/CLI command prediction” and distills it into a concrete, lightweight online learning design aligned with zage’s goals:
- Sequence effects (e.g., after `git add`, suggest `git commit`), conditioned on workspace/cwd/exit.
- Soft semantic similarity / generalization (flag reorder, different filenames, similar subcommands).
- Higher-order context interactions (workspace_root + cwd + exit_status + recent commands; host/user/time/session as low-weight signals).
The intent is to replace “heavy offline training” with an always-on model that learns immediately from record events and stays strongly context-bound.
- Invocation: a row in `shell_history` (a single executed command).
- Command (unique): a distinct command line (represented in zage by `command_stats.command`, derived from `shell_history.expanded_command`).
- Workspace root: zage’s notion of a project/workspace root (often a git repo root, but not always).
- Context: structured fields + recent history window. Priority (high → low): `workspace_root > cwd > exit_status > host > user > time-of-day > session`.
Paper: “Predicting UNIX Command Lines: Adjusting to User Patterns” (AAAI 2000). PDF mirror:
https://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/712/paper_korvemaker00predicting.pdf
Key takeaways:
- A relatively simple adaptive model can predict next commands surprisingly often, but it is difficult to squeeze out large additional gains in an offline setting.
- The target distribution is non-stationary (goals shift; workflows start/stop).
- They explore “mixture of experts” ideas: different “experts” specialize on different slices/signals.
Distilled implication for zage:
- Prefer online adaptation and context partitioning/backoff over a single global model.
- Don’t treat “session boundaries” as hard gates; treat them as features and backoff hints.
- Favor “cheap signals that track non-stationarity” (online weights, decays, per-context statistics).
Thesis landing page (includes DOI):
https://era.library.ualberta.ca/items/25101b5c-a867-401c-adbf-21078e1c6f9a
Key takeaways:
- Reinforces the core difficulty: offline “big model” improvements tend to be marginal because user behavior shifts.
Distilled implication for zage:
- Make “learning” incremental and cheap so we can update continuously and avoid long retrains.
Paper: “Command-line prediction and error correction using generalized command-line” (1995, Elsevier/ScienceDirect):
https://www.sciencedirect.com/science/article/pii/S0921264706801930
Key takeaways:
- Predicting on raw command strings is brittle.
- A generalized command line (command name + argument symbols/templates) helps both prediction and error correction.
Distilled implication for zage:
- Model commands primarily by a stable structural signature:
  - `head:*`
  - a position-independent `flag:*` set (sorted/dedup)
  - a bounded set of normalized `arg:*` (paths/ids/branches normalized)
  - optional “template”/slot tokens where appropriate
Paper: “Seq2Seq and Joint Learning Based Unix Command Line Prediction System” (arXiv:2006.11558):
https://arxiv.org/abs/2006.11558
Key takeaways:
- Neural seq2seq can be trained for this problem, but training + inference costs are non-trivial.
Distilled implication for zage:
- We do not want a heavyweight encoder trained offline if we want “instant learning”.
- However, this supports the idea that “learned representations” can help beyond string matching.
Paper: “SimCURL: Simple Contrastive User Representation Learning from Command Sequences” (arXiv:2207.14760):
https://arxiv.org/abs/2207.14760
Key takeaways:
- Even noisy “command sequences” encode intent and can support downstream tasks.
- They introduce session dropout as augmentation (drop some steps to avoid overfitting to exact sequences).
Distilled implication for zage:
- Use windows and backoff rather than rigid dependence on exact last-N.
- Consider “drop some recent commands” as a regularizer for online learning.
Paper: “SHREC: a SRE Behaviour Knowledge Graph Model for Shell Command Recommendations” (arXiv:2408.05592):
https://arxiv.org/abs/2408.05592
Key takeaways:
- Command histories can be structured into a graph of actions/entities/workflows, and the graph can drive real-time recommendations.
Distilled implication for zage:
- Keep the sequence/transition layer (bigrams/trigrams/backoff) as a strong candidate source.
- Don’t rely on graphs alone to solve semantic generalization (flags/args/etc.).
Paper: “CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research” (arXiv:2411.01176; EMNLP 2024):
https://arxiv.org/abs/2411.01176
ACL anthology:
https://aclanthology.org/2024.emnlp-main.1126/
Key takeaways:
- Domain-specific command-line embeddings can outperform generic sentence embeddings on “similar command” tasks.
- These models are still fairly large and not ideal for “always-on online learning”.
Distilled implication for zage:
- We want CmdCaliper-like invariances but in a tiny, privacy-preserving, online-updatable model.
Paper: “ShIOEnv: A CLI Behavior-Capturing Environment …” (arXiv:2505.18374):
https://arxiv.org/abs/2505.18374
Key takeaways:
- Datasets that omit exit status/output miss important behavior signals.
- Grammar constraints help prevent invalid command constructions.
Distilled implication for zage:
- Exit status should be an explicit feature/training signal.
- Keep hard constraints (prefix match, syntax validity, dedupe) outside the model.
These are more implementation-adjacent / “what engineers actually ship” references that match our constraints (local, fast, explainable, privacy-sensitive):
- NeuroShell / Intelligent Unix Shell (2023–2024, open source): hybrid approaches combining n-grams/Markov + similarity scoring for suggestions/typo correction, emphasizing local execution and explainable scoring.
https://github.com/Anshika-111105/Intelligent-Unix-Shell
- ShellSuggest (2023–2024, open source): next-word prediction for Linux terminal commands (LM framing for shell completion).
https://thedotproduct.github.io/shell-suggest-lang-model/
- First Shell Command Prediction Model (Jesse Claven): practical blog write-up of a transformer/time-series approach for shell command prediction.
https://www.j-e-s-s-e.com/blog/first-shell-command-prediction-model
- LOTLDetector (2025, security detection): multi-view command-line embeddings (Token2Vec/Rule2Vec/Similarity2Vec) fused with CNN; evidence that multi-view representations work.
https://link.springer.com/article/10.1186/s42400-025-00531-w
- Netflix Generative Recommenders (2024–2025, recsys practice): sampled-softmax efficiency, continuous training, calibration/metrics thinking (MRR).
https://netflixtechblog.medium.com/towards-generalizable-and-efficient-large-scale-generative-recommenders-a7db648aa257
- DenseRec (2025, recsys): dual pathway embeddings (ID-based + content-based) with probabilistic mixing for cold-start.
https://www.linkedin.com/posts/singhsidhukuldeep_solving-the-cold-start-problem-in-sequential-activity-7407955404971859968-FNZ6
- Two-tower retrieval overview: good background on two-tower training objectives and ANN retrieval tradeoffs.
https://www.emergentmind.com/topics/two-tower-retrieval
- PSAG (online continual learning): projection-based stabilization for online continual learning streams.
https://openreview.net/forum?id=NvXpSvMrXS
- Continual learning overview (practical): stability/plasticity, replay sizing, embedded constraints.
https://runtimerec.com/the-enduring-challenge-of-continual-learning-preventing-catastrophic-forgetting-in-long-lived-ai-systems/
- Lifelong deep learning survey (overview): additional background and pointers across the CL landscape.
https://www.emergentmind.com/topics/lifelong-deep-learning-ldl
- Cold-start overview (auxiliary features / settings): cold-start settings and how side-info helps.
https://www.sciencedirect.com/science/article/abs/pii/S0950705126000249
Even when these aren’t “next shell command prediction” papers, the techniques directly map to our goals (next-item prediction + context + cold-start + online updates).
We want a model that:
- Learns a context vector from (workspace_root, cwd, exit_status, optional host/user/time/session) + last N commands.
- Learns a command vector from a normalized, generalized representation of the command line.
- Scores candidates by dot product (or cosine) plus small bias terms.
- Updates parameters online on every `record` event (or in micro-batches) using negative sampling.
This is intentionally “word2vec/item2vec-like” rather than a Transformer:
- Training per update is ~O((1+K)·D) where K is number of negatives and D is embedding dim.
- No expensive batch×batch logits matrix.
- No GPU required.
This section pins down defaults so we don’t get stuck in “design feels plausible” territory.
- Inference latency (p99): < 15ms end-to-end for `suggest` over UDS on a typical dev machine, including DB I/O.
- Update latency: < 2ms average per `record` (micro-batch updates allowed).
- Embedding dim D: 64 (increase to 96/128 only if needed).
- Negative samples K: 16 (increase to 32 only if needed).
- Context window N: 10 commands (tunable).
We will implement FastText-style subword hashing in v1:
- Character n-grams: 3–6
- Bucket count: 131,072 (2^17), with sign hashing
- Hash function: stable 64-bit hash with good avalanche (reuse zage’s stable hash utility)
Rationale: shell token space is huge (paths/branches/tooling). We must not require “rebuild vocab”.
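For illustration, a minimal Python sketch of the n-gram hashing (blake2b stands in here for zage’s stable hash utility; the bucket count matches the v1 default):

```python
import hashlib

N_BUCKETS = 131_072  # 2^17, matching the v1 default


def char_ngrams(token, lo=3, hi=6):
    """Yield character n-grams of length lo..hi, with boundary markers."""
    t = f"<{token}>"
    for n in range(lo, hi + 1):
        for i in range(len(t) - n + 1):
            yield t[i:i + n]


def hashed_features(token):
    """Map a token to (bucket_id, sign) pairs via a stable 64-bit hash."""
    feats = []
    for g in char_ngrams(token):
        h = int.from_bytes(
            hashlib.blake2b(g.encode(), digest_size=8).digest(), "little")
        bucket = h % N_BUCKETS
        sign = 1.0 if (h >> 17) & 1 == 0 else -1.0  # sign from spare hash bits
        feats.append((bucket, sign))
    return feats
```

A token’s embedding is then the signed sum of its bucket vectors, so unseen tokens are immediately representable without any vocab rebuild.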
We will do replay in v1. Defaults:
- Global replay: reservoir sample of 20,000 events.
- Per-workspace replay: ring buffer of 5,000 events for up to 32 most-recently-active workspaces (LRU evict workspaces).
- Each update trains on: 1 new event + 2 replay events (1 global, 1 same-workspace if present).
Store events compactly (no raw strings):
- token IDs / hashed bucket IDs
- lightweight sampling keys (workspace hash, head hash, time bucket)
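The global buffer is a classic reservoir sample; a sketch (Python for illustration, the `Reservoir` name is hypothetical):

```python
import random


class Reservoir:
    """Fixed-size reservoir sample over an unbounded event stream."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = rng or random.Random(0)

    def add(self, event):
        """Each event ends up retained with probability capacity/seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(event)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = event

    def sample(self, k):
        """Draw k replay events for the next update."""
        return self.rng.sample(self.items, min(k, len(self.items)))
```

Per-workspace buffers can be plain ring buffers (`collections.deque(maxlen=5000)`) since they only need recency, not uniform coverage.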
We will learn a small set of group scalars online (instead of hard-coding all context weights):
- Groups: `workspace_root`, `cwd`, `exit`, `host`, `user`, `timebucket`, `session`, `phase/workflow`, `recent_heads`, `recent_flags`.
- Init: 1.0; LR: 10× smaller than embeddings; L2 reg toward 1.0.
We will suppress embedding-driven suggestions when uncertain:
- Margin gate: if `top1 - top2 < 0.05`, down-weight the learned score and rely more on tier-1/frecency.
- Absolute gate: if `top1 < 0.0` (post-calibration), don’t inject embedding-based candidates.
This ensures “model on” never performs worse than “model off”.
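A sketch of the two gates (Python for illustration; the `damp` multiplier and the return-a-multiplier convention are assumptions, not decisions made above):

```python
def gate_model_score(top_scores, margin=0.05, floor=0.0, damp=0.5):
    """Down-weight or drop the learned score when the model is uncertain.

    top_scores: post-calibration scores of the model's top candidates,
    sorted descending. Returns a multiplier applied to the model score
    before blending with tier-1/frecency.
    """
    if not top_scores or top_scores[0] < floor:
        return 0.0  # absolute gate: don't inject embedding-based candidates
    if len(top_scores) > 1 and top_scores[0] - top_scores[1] < margin:
        return damp  # margin gate: lean more on tier-1/frecency
    return 1.0
```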
We will blend with priors using a log-linear model:
final = α * model_score + β * log(frecency_prior) + γ * sequence_score + δ * tier1_score
v1 calibration:
- Learn `α/β/γ/δ` online using accept-only feedback (exact suggestion accepted).
- No implicit negatives or time windows in v1; non-accepts are ignored (no-op).
- Use simple gradient descent on logistic loss; no complex bandits for v1.
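A sketch of the accept-only update (Python for illustration; the class name and learning rate are hypothetical). With positives only, the weights drift toward saturation rather than converging to a calibrated value, which is acceptable for v1 because ranking only depends on relative scores:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BlendWeights:
    """Online log-linear blend:
    final = α·model + β·log(frecency) + γ·sequence + δ·tier1."""

    def __init__(self, lr=0.01):
        self.w = [1.0, 1.0, 1.0, 1.0]  # α, β, γ, δ
        self.lr = lr

    def score(self, feats):
        """feats = [model_score, log(frecency_prior), seq_score, tier1_score]."""
        return sum(wi * fi for wi, fi in zip(self.w, feats))

    def update_on_accept(self, feats):
        # Logistic loss with label=1 (accept-only feedback):
        # gradient of -log σ(score) w.r.t. w_i is -(1 - σ(score)) · f_i
        p = sigmoid(self.score(feats))
        for i, fi in enumerate(feats):
            self.w[i] += self.lr * (1.0 - p) * fi
```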
Embeddings capture associations, not raw text. They may still leak sensitive patterns statistically, but they should not be directly invertible back into exact command lines without additional side information.
For an invocation (shellname, expanded_command):
- Tokenize using zage’s existing shell-aware tokenizer.
- Extract structured parts:
  - `head:<cmd>`
  - `flag:<flag>` (collect all flags, sort+dedup to make them position-independent)
  - `arg:<normalized>` (cap to M args; normalize paths/ids/branches, etc.)
- Optional structural tokens (if useful, but not required for v1): `shape:pipe`, `shape:redir`, `shape:assign`, `shape:subshell`, …
These tokens are used to compute a command embedding.
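For illustration, a rough sketch of the generalization step (Python; `shlex` and the regex normalizers stand in for zage’s shell-aware tokenizer and real path/id/branch normalizers):

```python
import re
import shlex


def generalize(command_line, max_args=4):
    """Turn a raw command line into position-independent structural tokens.

    Normalization here is illustrative only: anything with a slash collapses
    to <path>, hex-ish strings to <id>.
    """
    words = shlex.split(command_line)
    if not words:
        return []
    head, rest = words[0], words[1:]
    flags = sorted({w for w in rest if w.startswith("-")})  # sort+dedup
    args = []
    for w in rest:
        if w.startswith("-"):
            continue
        if "/" in w:
            w = "<path>"
        elif re.fullmatch(r"[0-9a-f]{7,40}", w):
            w = "<id>"
        args.append(w)
    tokens = [f"head:{head}"]
    tokens += [f"flag:{f}" for f in flags]
    tokens += [f"arg:{a}" for a in args[:max_args]]  # cap to M args
    return tokens
```

Note how flag reorder yields identical signatures, which is exactly the soft-similarity behavior the generalized-command-line paper motivates.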
Context tokens are:
- `ctx:workspace_root=<root>` (highest priority)
- `ctx:cwd=<cwd>`
- `ctx:exit=<exit>` (explicitly include failures)
- `ctx:host=<host>` (low weight)
- `ctx:user=<user>` (low weight)
- `ctx:timebucket=<bucket>` (very low weight)
- `ctx:session=<id>` (very low weight; soft affinity)
- Recent commands (last N invocations), expressed with the same generalized tokens but prefixed, e.g. `prev:head:*`, `prev:flag:*`, `prev:arg:*`.
We can explicitly implement backoff:
- If `workspace_root` exists: the model uses it.
- Else if `cwd` exists: use cwd.
- Else: fall back to global statistics.
This backoff is not a hard gate; it just controls which statistics/features dominate.
Two towers that share the same token vocabulary:
- Embedding lookup: `E[token] -> R^D`
- Context vector: `v_ctx = normalize(sum(E[t] * w_ctx(t)))`
- Command vector: `v_cmd = normalize(sum(E[t] * w_cmd(t)))`
- Score: `score(ctx, cmd) = v_ctx · v_cmd + b_cmd + b_head + b_ctx_bucket`
Where per-token weights start as priors, and group scalars are learned online (see “Concrete v1 Decisions”):
- `ctx:workspace_root` weight > `ctx:cwd` > `ctx:exit` > `ctx:host` > `ctx:user` > `ctx:timebucket` > `ctx:session`.
- `flag:*` token weights are non-zero and position-independent (flags must matter).
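The two towers fit in a few lines (Python for illustration; `w_ctx`/`w_cmd` are passed as functions, and only two of the bias terms are shown):

```python
import math


def tower(tokens, E, w, D):
    """v = normalize(sum(E[t] * w(t))); both towers share the table E."""
    v = [0.0] * D
    for t in tokens:
        e = E.get(t)
        if e is None:
            continue  # hashed-subword fallback would go here
        wt = w(t)
        for i in range(D):
            v[i] += wt * e[i]
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def score(ctx_tokens, cmd_tokens, E, w_ctx, w_cmd, b_cmd=0.0, b_head=0.0, D=4):
    """score(ctx, cmd) = v_ctx · v_cmd + biases."""
    v_ctx = tower(ctx_tokens, E, w_ctx, D)
    v_cmd = tower(cmd_tokens, E, w_cmd, D)
    return sum(a * b for a, b in zip(v_ctx, v_cmd)) + b_cmd + b_head
```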
On each new record event we get an example:
- Input: context (last N commands + context fields)
- Positive: the command that was actually executed next
We update embeddings with a binary logistic objective using K negatives:
- Sample negatives from a mixed pool that intentionally includes “hard negatives”:
  - same `head:*`
  - same workspace/cwd (if available)
  - globally frequent commands
  - recent commands
Loss (conceptual), for positive command `c+` and negatives `c-`:
- maximize `σ(score(ctx, c+))`
- minimize `σ(score(ctx, c-))`
This is the classic “skip-gram with negative sampling” pattern, applied to next-command prediction.
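One such update can be sketched as follows (Python for illustration; plain SGD on raw vectors, without the Adagrad/clipping discussed later):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sgns_step(ctx_vec, pos_vec, neg_vecs, lr=0.05):
    """One skip-gram-with-negative-sampling update; cost O((1+K)·D).

    Mutates the vectors in place: pushes σ(ctx·pos) toward 1 and each
    σ(ctx·neg) toward 0.
    """
    D = len(ctx_vec)
    grad_ctx = [0.0] * D
    for vec, label in [(pos_vec, 1.0)] + [(v, 0.0) for v in neg_vecs]:
        g = label - sigmoid(sum(a * b for a, b in zip(ctx_vec, vec)))
        for i in range(D):
            grad_ctx[i] += g * vec[i]
            vec[i] += lr * g * ctx_vec[i]  # command-side update
    for i in range(D):
        ctx_vec[i] += lr * grad_ctx[i]  # context-side update
```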
Frecency is still useful as:
- a baseline prior (what you do often/recently within the same workspace/cwd)
- a cold-start mechanism before the online model has enough signal
But frecency should be context-conditioned and should back off:
workspace_root frecency -> cwd frecency -> global frecency
The online model score should be the primary rank signal once warmed up.
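The frecency backoff chain is a simple lookup (Python for illustration; the scope-key format is hypothetical):

```python
def frecency_prior(cmd, workspace_root, cwd, stats):
    """Backoff: workspace frecency -> cwd frecency -> global frecency.

    `stats` maps (scope, cmd) -> frecency score.
    """
    scopes = []
    if workspace_root:
        scopes.append(f"ws:{workspace_root}")
    if cwd:
        scopes.append(f"cwd:{cwd}")
    scopes.append("global")
    for scope in scopes:
        score = stats.get((scope, cmd))
        if score is not None:
            return score
    return 0.0
```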
This section addresses the major risks called out in review so we don’t ship a design that “sounds plausible” but fails in practice.
Problem:
- Pure online SGD will overfit to the most recent workflow and forget older associations when the user switches projects.
Mitigations (pick at least one; ideally two):
- Replay buffer: keep a reservoir-sampled buffer of past `(context, next_command)` events and train each update on 1 new event + R replay events (e.g., R=1–8).
- Per-context replay: maintain separate buffers per `workspace_root` (and optionally per `cwd`), so switching projects immediately replays that project’s past.
- Decayed learning rate / Adagrad: reduces step sizes for frequently-updated parameters; helps stabilize common tokens/commands.
- Weight averaging: maintain an EMA (“shadow weights”) of the embeddings and use the EMA at inference to damp oscillation.
What we are explicitly not doing (v1):
- Heavy continual-learning methods (EWC, LwF) unless replay + Adagrad still isn’t stable enough.
Problem:
- New tools/flags/args appear continuously. A naive vocabulary table yields OOV issues and brittle behavior.
Mitigations (commit to one):
- FastText-style subword hashing:
- Represent each token using character n-grams (e.g., 3–6) hashed into fixed buckets.
- Command embedding becomes the sum of its token n-gram embeddings.
- This makes new commands partially “understandable” immediately.
- Dual pathway (DenseRec-style):
  - Maintain both:
    - an ID-based embedding for known `command_stats.command` (fast, high quality when seen)
    - a content-based embedding from tokens/subwords (works for unseen commands)
  - During training/inference, mix them (learned or heuristic) based on how much history exists for the command.
We should not require a “rebuild vocab” step.
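The dual-pathway mix can be a one-liner (Python for illustration; the λ formula and the smoothing constant `k` are assumptions):

```python
def mixed_embedding(cmd_id_vec, content_vec, history_count, k=10.0):
    """Blend ID-based and content-based embeddings by history volume.

    λ → 1 as the command accumulates history (trust the ID pathway);
    λ → 0 for unseen commands, so cold-start falls back to the
    subword/content pathway.
    """
    lam = history_count / (history_count + k)
    if cmd_id_vec is None:  # never seen: content pathway only
        lam = 0.0
        cmd_id_vec = [0.0] * len(content_vec)
    return [lam * a + (1.0 - lam) * b for a, b in zip(cmd_id_vec, content_vec)]
```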
Problem:
- If we sample “globally frequent commands” as negatives too aggressively, we suppress common-but-correct suggestions.
- If we don’t correct for the sampling distribution, scores become miscalibrated.
Mitigations:
- Use a mixed negative distribution `Q` with explicit weights, e.g.:
  - 40% from the same `workspace_root` (or cwd backoff)
  - 30% from the same `head:*` (hard negatives)
  - 20% from the recent window (recency hard negatives)
  - 10% global, frequency-weighted with exponent 0.75 (word2vec-style)
- Use a sampled-softmax / log-Q correction:
  - Train on `logits = score(ctx, cmd) - log(Q(cmd))` so training isn’t biased toward the sampler.
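A sketch of the mixed sampler with log-Q correction (Python for illustration; candidate pools are passed in precomputed, empty pools fall back to global, and the 0.75-exponent global weighting is omitted for brevity):

```python
import math
import random

# Mixture weights from the design above.
MIX = [("workspace", 0.4), ("head", 0.3), ("recent", 0.2), ("global", 0.1)]


def sample_negatives(pools, k, rng):
    """Draw k negatives; returns (cmd, q) pairs where q = Q(cmd),
    computed as the mixture over all pools containing cmd (uniform
    within each pool)."""
    out = []
    for _ in range(k):
        r, acc, chosen = rng.random(), 0.0, "global"
        for name, w in MIX:
            acc += w
            if r < acc and pools.get(name):
                chosen = name
                break
        cmd = rng.choice(pools[chosen])
        q = sum(w / len(pools[n]) for n, w in MIX
                if pools.get(n) and cmd in pools[n])
        out.append((cmd, q))
    return out


def corrected_logit(raw_score, q):
    """Sampled-softmax log-Q correction applied to training logits."""
    return raw_score - math.log(q)
```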
Related reading:
- “Correct and Weight” (implicit feedback; negative sampling bias): https://arxiv.org/abs/2601.04291
- “Correcting the LogQ Correction” (notes subtle bias in the standard log-Q correction): https://arxiv.org/abs/2507.09331
Problem:
- Real workflows are not pure bigrams and are often interleaved.
- “Context → next command” can still learn these, but only if the context representation preserves enough structure.
Mitigations:
- Include last-N commands in context, but avoid collapsing everything into a single bag:
  - Add lightweight structure tokens: `prev0:head:*`, `prev1:head:*`, … for heads (or for a small subset like heads + top flags).
  - Keep flags position-independent within a command, not across the entire window.
- Add “phase/workflow” as explicit context tokens (derived from zage phases/sequences), not as a separate heuristic scoring system.
If this still isn’t sufficient, the next step is a tiny sequence encoder (e.g., GRU over the last N command embeddings), still trained online with negative sampling.
Problem:
- Hard-coded `w_ctx` weights are brittle; different users and repos need different context importance.
Mitigations:
- Learn a small set of group scalars online:
  - one scalar per context group (`workspace_root`, `cwd`, `exit`, `host`, …)
  - optionally one per “phase/workflow” token group
- Or learn a small “gating” function (a tiny linear layer) that weights groups based on the current context.
The key is: keep it tiny enough to update online and store in DB.
Exit status should be treated as context, not “this command was wrong”.
- `exit != 0` often means “retry the same command with flags” or “run the fix step”.
- So `ctx:exit=1` should allow the model to learn patterns like:
  - `cargo test` (exit 101) → `cargo test -- --nocapture`
  - `git push` (non-zero exit) → `git pull --rebase`
Recommendations:
- Start with D=64–128 for online.
- Use Adagrad or RMSProp (per-parameter adaption) to avoid constant retuning.
- If using hashing buckets, collisions are expected; mitigate by:
  - a larger bucket count
  - sign hashing
  - keeping “ID-based” embeddings for frequent commands (dual pathway)
- Use gradient clipping to avoid exploding updates on rare events.
- If we use Adagrad, consider periodic accumulator rescaling or switch to RMSProp to avoid “learning rate goes to ~0 forever” for very frequent features.
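A sketch of Adagrad with gradient clipping and periodic accumulator rescaling (Python for illustration; a sparse dict keyed by parameter fits the per-token update pattern):

```python
import math


class Adagrad:
    """Per-parameter adaptive step sizes with gradient clipping."""

    def __init__(self, lr=0.1, clip=5.0, eps=1e-8):
        self.lr, self.clip, self.eps = lr, clip, eps
        self.acc = {}  # param key -> accumulated squared gradient

    def step(self, key, param, grad):
        """Return the updated parameter value for one (clipped) gradient."""
        g = max(-self.clip, min(self.clip, grad))
        self.acc[key] = self.acc.get(key, 0.0) + g * g
        return param - self.lr * g / (math.sqrt(self.acc[key]) + self.eps)

    def rescale(self, factor=0.5):
        """Periodic accumulator decay so steps for very frequent features
        don't shrink toward zero forever."""
        for k in self.acc:
            self.acc[k] *= factor
```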
Dot products are not probabilities; we need a stable, interpretable blend with priors.
Recommended:
- Use a log-linear blend:
final = α * model_score + β * log(frecency_prior) + γ * sequence_score + …
- Learn `α/β/γ` (or at least `α`) online using acceptance feedback, or set them conservatively and expose them as config.
Replace “single chronological split” with:
- rolling-window evaluation (prequential): train on past, evaluate on next chunk, slide forward.
- ranking metrics:
- MRR@K
- Recall@K
- Coverage@K (diversity / not always predicting `ls`)
- Context leakage rate (defined as: top-K contains commands whose dominant workspace/cwd != current)
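The ranking metrics are cheap to compute over a prequential split; a sketch of two of them (Python for illustration):

```python
def mrr_at_k(ranked_lists, truths, k=5):
    """Mean reciprocal rank of the true next command within the top-k."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        for i, cmd in enumerate(ranked[:k]):
            if cmd == truth:
                total += 1.0 / (i + 1)
                break
    return total / len(truths)


def recall_at_k(ranked_lists, truths, k=5):
    """Fraction of events whose true next command appears in the top-k."""
    hits = sum(t in r[:k] for r, t in zip(ranked_lists, truths))
    return hits / len(truths)
```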
We do not use implicit negatives in v1:
- Only accepted suggestions are logged and used for training.
- No “rejected/ignored” outcomes, and no time-window matching.
- Ctrl-C and other aborts are ignored (no-op).
If we want negatives later, it must be explicit and UX-supported; we will not infer it from timing.
If multiple hosts update the model against the same remote DB:
- Avoid last-write-wins on a single embedding row.
- Prefer one of:
- append-only updates + periodic compaction
- per-host shards (host-scoped embedding deltas) + read-time merge (or scheduled merge)
From shell_history we already capture:
- `expanded_command`
- `shellname`
- `working_directory`
- `hostname`, `username`
- `exit_status`
- timestamps
- `session_id`
And zage already computes/derives:
- `workspace_root` / workspace info (stored as JSON in `shell_history.workspace_json` today)
- sequence statistics (bigrams/trigrams)
- context statistics (tier-1)
Keep a staged pipeline approach:
- Collect candidates from:
- transitions (sequence stats)
- context stats (workspace/cwd/session)
- phase/templates
- optionally vector similarity (sqlite-vec) if we keep it as a candidate source
- Apply hard constraints:
- strict prefix / cursor constraints
- syntax validity
- dedupe
- Rank with the online model score (plus small priors/backoff).
If we keep sqlite-vec similarity search, we should avoid “two embedding spaces” that disagree.
Rule:
- The embeddings indexed in sqlite-vec should be the same embeddings used by the online model for scoring (or a strict subset of the same representation).
- Candidate source identity (transition/context/phase/vector) can still be included as a feature for blending/calibration.
Minimal online model state:
- Token embedding table (or fixed hashed buckets): `token_id -> vec[D]`
- optimizer accumulators if using Adagrad/RMSProp
- Optional bias tables:
- command bias
- head bias
If we want multi-host sharing via libsql/Turso, storing this model state in the DB makes it syncable. If we want maximum write performance, buffer updates and flush periodically.
Update the model on:
- every `record` (single-step update), or
- micro-batches (flush every N records), or
- a background worker in the daemon
This should be fast enough on CPU if D is small (e.g., 64–128) and K is modest (8–32).
Given a fixed history, prefer rolling-window evaluation (prequential), because user behavior is non-stationary:
- Evaluate on successive windows (train on [0..t], evaluate on (t..t+Δ], slide forward).
- Report ranking metrics:
- top-1 / top-5 next-command accuracy
- MRR@K
- Recall@K
- Coverage@K
- Context leakage rate
- the same metrics within workspace_root, to ensure context-bound behavior is real
Track:
- acceptance rate (did the user execute a suggested command soon after?)
- time-to-next-command reduction (optional)
- “context leakage” events (suggestions from unrelated workspace/cwd)
- Sequence effects: context vector includes last-N; negative sampling can include “same head” to make sequences discriminative.
- Soft semantic similarity: generalized tokens + learned token embeddings make reorder/different filenames map close.
- Higher-order interactions: context tokens (workspace_root/cwd/exit) are first-class; the model can learn their interactions because they participate in the same embedding space.
The key property is that the model updates immediately from new data and continuously adapts to your workflows.