Real before/after pairs showing how the Prompt Architect hook and skill improve AI/ML requests. Each example shows the domain classification, the before/after clarity scores, and the hook's clarifying questions.
Domain: ML_RESEARCH | Before score: 3/10 | After score: 9/10
Before:
help me improve my model's performance
Hook response:
Quick context check:
- What task and dataset are you working on? (e.g., image classification on 10k medical images)
- What's your current validation metric and the target you're trying to hit?
- What's your compute budget? (GPU type, training time available)
After:
My ResNet-50 (torchvision pretrained) fine-tuned on a 10k-sample pathology image dataset
is plateauing at 78% validation accuracy after epoch 8, vs. SOTA 91% on this benchmark.
Config: lr=1e-4, batch=32, AdamW, cosine LR decay, 20 epochs, single A100 40GB.
Diagnose the top 3 reasons for the gap and propose specific fixes.
Each fix: expected improvement, implementation hours, risk level.
Output: ranked list, most impactful first.
Changes: Task + dataset + current metric + target + training config + compute + ranked output format.
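For reference, the cosine LR decay named in the config follows the standard annealing formula. A pure-Python sketch using the prompt's 20-epoch horizon and 1e-4 peak (the zero floor is an assumption; in practice torch.optim.lr_scheduler.CosineAnnealingLR does this):

```python
import math

def cosine_lr(epoch: int, total_epochs: int = 20,
              peak_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate for the config in the prompt above."""
    progress = epoch / total_epochs  # 0.0 at start, 1.0 at the final epoch
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# LR at each of the 21 epoch boundaries: starts at 1e-4, decays smoothly to 0
schedule = [cosine_lr(e) for e in range(21)]
```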
Domain: MLOPS | Before score: 2/10 | After score: 9/10
Before:
set up deployment for my sentiment model
Hook response:
Quick context check:
- Deployment target? (local, AWS/GCP/Azure, edge device)
- Latency requirement? (e.g., p95 < 200ms at 50 req/s)
- Existing stack to integrate with? (K8s cluster, ECS, bare metal)
After:
Generate production deployment for a PyTorch DistilBERT sentiment classifier (INT8 quantized, 67MB).
Specs:
- Target: AWS ECS Fargate (CPU only — model quantized)
- SLA: p95 latency < 150ms at 100 req/s
- Input: JSON {text: str, max_length: 512}
- Output: JSON {label: str, score: float, latency_ms: float}
- Monitoring: Prometheus metrics + Grafana dashboard
- CI/CD: GitHub Actions — build on merge, deploy to staging, manual gate to prod
Deliver: Dockerfile (multi-stage), app.py, schemas.py, metrics.py, deploy.yml.
Mark all environment-specific values with # CONFIGURE comments.
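The input/output contract in the specs can be sketched framework-free; here `classify_stub` is a placeholder for the quantized DistilBERT call, and the FastAPI/ECS wiring from the deliverables list is omitted:

```python
import json
import time

def classify_stub(text: str) -> tuple[str, float]:
    # CONFIGURE: replace with the INT8 DistilBERT inference call
    return ("positive", 0.98) if "good" in text.lower() else ("negative", 0.87)

def handle_request(body: str) -> str:
    """Implement the spec's JSON contract: {text, max_length} -> {label, score, latency_ms}."""
    start = time.perf_counter()
    payload = json.loads(body)
    text = payload["text"][: payload.get("max_length", 512)]  # truncate to the declared limit
    label, score = classify_stub(text)
    return json.dumps({
        "label": label,
        "score": score,
        "latency_ms": round((time.perf_counter() - start) * 1000, 3),
    })
```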
Domain: AGENTS | Before score: 4/10 | After score: 10/10
Before:
research autonomous agents and tell me what's interesting
Hook response:
Quick context check:
- What's the specific angle? (coordination, tool-use reliability, memory, planning)
- When should the research stop? (N sources, specific question answered, time budget)
After:
Research LLM-based multi-agent coordination patterns (2024–present), focusing on
how agents share state and avoid duplicate work in production systems.
Scope: arXiv papers + major lab engineering blogs only.
Stop when: 6 high-quality primary sources covering at least 3 distinct coordination patterns.
Tool budget: max 15 web fetches.
Per source extract: core contribution | key technique | reported benchmark | code available?
Output: comparison table (columns: approach, coordination mechanism, scalability, production-ready?)
+ 150-word synthesis identifying the dominant pattern and one underexplored angle.
Domain: ML_RESEARCH | Before score: 5/10 | After score: 9/10
Before:
train a transformer on my text classification dataset and see how it does
Hook response:
Quick context check:
- Dataset size, class count, and imbalance ratio?
- What's the baseline to beat? (TF-IDF+LR? fine-tuned BERT? current prod model?)
After:
Fine-tune DistilBERT-base on my 15k-sample, 8-class customer intent dataset
(class imbalance: max 4:1 ratio).
Baseline to beat: TF-IDF + LogisticRegression at F1-macro 0.71.
Target: F1-macro ≥ 0.82 within 2 hours training on a single RTX 3090.
Use HuggingFace Trainer. Track with W&B. Save best checkpoint by val F1-macro.
Report: per-class F1 breakdown, confusion matrix, training curves.
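F1-macro, the metric both the baseline and the target are expressed in, averages per-class F1 without class weighting, which is why it suits the 4:1 imbalance. A dependency-free sketch (in practice sklearn's `f1_score(average="macro")` or an `evaluate` metric would be passed to the Trainer):

```python
def f1_macro(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1: a 4:1-minority class counts as much as the majority."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```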
Domain: AGENTS | Before score: 3/10 | After score: 9/10
Before:
build an agent that monitors my ML experiments and fixes issues automatically
Hook response:
Quick context check:
- What counts as an "issue"? (loss spike, OOM, NaN, stalled training?)
- What actions is the agent allowed to take automatically vs. requiring approval?
After:
Design an experiment monitoring agent that watches W&B runs and intervenes on detected issues.
Monitored conditions:
- Loss spike: val_loss increases >20% over 3 consecutive epochs → alert only
- NaN loss: detected at any step → auto-stop run, save checkpoint, notify
- OOM: CUDA OOM in logs → auto-reduce batch size by 50%, resume
- Stall: no metric improvement for 10 epochs → alert + suggest LR adjustment
Autonomy levels:
- AUTO (no approval): stop failing runs, save checkpoints, log issues
- REQUIRES APPROVAL: modify hyperparameters (except the automatic OOM batch-size halving above), restart runs, send external notifications
Tools available: W&B API (read/write runs), Bash (read logs), Write (save reports).
Hard constraint: never delete checkpoints or experiment data.
Output: Python agent class with clear action/approval boundary and W&B polling loop.
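The action/approval boundary the prompt asks for can be sketched as a small policy table plus dispatcher, mirroring the monitored-conditions list above. Condition keys and action names here are illustrative placeholders, and the W&B polling loop is omitted:

```python
from dataclasses import dataclass, field
from enum import Enum

class Autonomy(Enum):
    AUTO = "auto"                   # executed immediately, no sign-off
    REQUIRES_APPROVAL = "approval"  # queued until a human approves

# Condition -> (action, autonomy level). Real handlers would call the W&B API.
POLICY = {
    "loss_spike": ("alert", Autonomy.AUTO),
    "nan_loss":   ("stop_run_and_checkpoint", Autonomy.AUTO),
    "oom":        ("halve_batch_and_resume", Autonomy.AUTO),
    "stall":      ("alert_and_suggest_lr", Autonomy.AUTO),
    "lr_adjust":  ("apply_lr_change", Autonomy.REQUIRES_APPROVAL),
    "restart":    ("restart_run", Autonomy.REQUIRES_APPROVAL),
}

@dataclass
class ExperimentMonitorAgent:
    executed: list = field(default_factory=list)
    pending_approval: list = field(default_factory=list)

    def handle(self, condition: str) -> str:
        """Route a detected condition through the action/approval boundary."""
        action, level = POLICY[condition]
        if level is Autonomy.AUTO:
            self.executed.append(action)          # safe to run unattended
        else:
            self.pending_approval.append(action)  # wait for human sign-off
        return action
```

Note the split for stalls: the automatic action is only the suggestion, while actually applying an LR change goes through the approval queue.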