Prompt Transformation Examples

Real before/after pairs showing how the Prompt Architect hook and skill improve AI/ML requests. Each example includes the domain classification, before/after clarity scores, and a summary of what changed.


Example 1: ML Training — Vague to Actionable

Domain: ML_RESEARCH | Before score: 3/10 | After score: 9/10

Before

help me improve my model's performance

Hook response:

Quick context check:

  1. What task and dataset are you working on? (e.g., image classification on 10k medical images)
  2. What's your current validation metric and the target you're trying to hit?
  3. What's your compute budget? (GPU type, training time available)

After (user fills in context)

My ResNet-50 (torchvision pretrained) fine-tuned on a 10k-sample pathology image dataset
is plateauing at 78% validation accuracy after epoch 8, vs. SOTA 91% on this benchmark.

Config: lr=1e-4, batch=32, AdamW, cosine LR decay, 20 epochs, single A100 40GB.

Diagnose the top 3 reasons for the gap and propose specific fixes.
Each fix: expected improvement, implementation hours, risk level.
Output: ranked list, most impactful first.

Changes: Task + dataset + current metric + target + training config + compute + ranked output format.
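The rewritten prompt's config specifies cosine LR decay over 20 epochs. As a minimal illustration of what that schedule does to the stated lr=1e-4 (the function name and per-step granularity are my own, not part of the prompt):

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    # Cosine annealing: starts at base_lr, decays smoothly to min_lr.
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In practice this would come from `torch.optim.lr_scheduler.CosineAnnealingLR` rather than a hand-rolled function; the sketch just makes the decay shape concrete.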


Example 2: MLOps Deployment — Underspecified

Domain: MLOPS | Before score: 2/10 | After score: 9/10

Before

set up deployment for my sentiment model

Hook response:

Quick context check:

  1. Deployment target? (local, AWS/GCP/Azure, edge device)
  2. Latency requirement? (e.g., p95 < 200ms at 50 req/s)
  3. Existing stack to integrate with? (K8s cluster, ECS, bare metal)

After

Generate production deployment for a PyTorch DistilBERT sentiment classifier (INT8 quantized, 67MB).

Specs:
- Target: AWS ECS Fargate (CPU only — model quantized)
- SLA: p95 latency < 150ms at 100 req/s
- Input: JSON {text: str, max_length: 512}
- Output: JSON {label: str, score: float, latency_ms: float}
- Monitoring: Prometheus metrics + Grafana dashboard
- CI/CD: GitHub Actions — build on merge, deploy to staging, manual gate to prod

Deliver: Dockerfile (multi-stage), app.py, schemas.py, metrics.py, deploy.yml.
Mark all environment-specific values with # CONFIGURE comments.
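The JSON contract in the specs maps directly onto the schemas.py deliverable. A minimal sketch using stdlib dataclasses (field names follow the spec; a real service would likely use Pydantic for request validation, and the default value is an assumption):

```python
from dataclasses import dataclass

@dataclass
class SentimentRequest:
    # Spec: JSON {text: str, max_length: 512}
    text: str
    max_length: int = 512

@dataclass
class SentimentResponse:
    # Spec: JSON {label: str, score: float, latency_ms: float}
    label: str
    score: float
    latency_ms: float
```

Pinning the I/O schema in the prompt is what lets the generated app.py, metrics, and tests all agree on one contract.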

Example 3: Agent System — Missing Termination

Domain: AGENTS | Before score: 4/10 | After score: 10/10

Before

research autonomous agents and tell me what's interesting

Hook response:

Quick context check:

  1. What's the specific angle? (coordination, tool-use reliability, memory, planning)
  2. When should the research stop? (N sources, specific question answered, time budget)

After

Research LLM-based multi-agent coordination patterns (2024–present), focusing on
how agents share state and avoid duplicate work in production systems.

Scope: arXiv papers + major lab engineering blogs only.
Stop when: 6 high-quality primary sources covering at least 3 distinct coordination patterns.
Tool budget: max 15 web fetches.

Per source extract: core contribution | key technique | reported benchmark | code available?

Output: comparison table (columns: approach, coordination mechanism, scalability, production-ready?)
+ 150-word synthesis identifying the dominant pattern and one underexplored angle.
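Termination criteria like these translate directly into a loop guard the agent checks after every fetch. A minimal sketch (the source-record shape with a "pattern" key is an assumption for illustration):

```python
def should_stop(sources: list, fetches_used: int,
                max_fetches: int = 15, min_sources: int = 6,
                min_patterns: int = 3) -> bool:
    # Hard budget: stop once the tool allowance is spent.
    if fetches_used >= max_fetches:
        return True
    # Coverage: enough primary sources spanning enough distinct patterns.
    patterns = {s["pattern"] for s in sources}
    return len(sources) >= min_sources and len(patterns) >= min_patterns
```

Without an explicit guard like this, "research X and tell me what's interesting" has no natural end state, which is exactly what the hook's second question surfaces.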

Example 4: Experiment Design — No Baseline

Domain: ML_RESEARCH | Before score: 5/10 | After score: 9/10

Before

train a transformer on my text classification dataset and see how it does

Hook response:

Quick context check:

  1. Dataset size, class count, and imbalance ratio?
  2. What's the baseline to beat? (TF-IDF+LR? fine-tuned BERT? current prod model?)

After

Fine-tune DistilBERT-base on my 15k-sample, 8-class customer intent dataset
(class imbalance: max 4:1 ratio).

Baseline to beat: TF-IDF + LogisticRegression at F1-macro 0.71.
Target: F1-macro ≥ 0.82 within 2 hours training on a single RTX 3090.

Use HuggingFace Trainer. Track with W&B. Save best checkpoint by val F1-macro.
Report: per-class F1 breakdown, confusion matrix, training curves.
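F1-macro, the metric both the baseline and target are stated in, is the unweighted mean of per-class F1 scores, which is why it suits the imbalanced 8-class setup. A stdlib sketch of the computation (in practice `sklearn.metrics.f1_score(..., average="macro")` or the W&B-logged metric would be used):

```python
def f1_macro(y_true: list, y_pred: list) -> float:
    # Unweighted mean of per-class F1: every class counts equally,
    # regardless of support (hence its use under class imbalance).
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```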

Example 5: Agentic Workflow — Unsafe Scope

Domain: AGENTS | Before score: 3/10 | After score: 9/10

Before

build an agent that monitors my ML experiments and fixes issues automatically

Hook response:

Quick context check:

  1. What counts as an "issue"? (loss spike, OOM, NaN, stalled training?)
  2. What actions is the agent allowed to take automatically vs. requiring approval?

After

Design an experiment monitoring agent that watches W&B runs and intervenes on detected issues.

Monitored conditions:
- Loss spike: val_loss increases >20% over 3 consecutive epochs → alert only
- NaN loss: detected at any step → auto-stop run, save checkpoint, notify
- OOM: CUDA OOM in logs → auto-reduce batch size by 50%, resume
- Stall: no metric improvement for 10 epochs → alert + suggest LR adjustment

Autonomy levels:
- AUTO (no approval): stop failing runs, save checkpoints, log issues
- REQUIRES APPROVAL: modify hyperparameters, restart runs, send external notifications

Tools available: W&B API (read/write runs), Bash (read logs), Write (save reports).
Hard constraint: never delete checkpoints or experiment data.

Output: Python agent class with clear action/approval boundary and W&B polling loop.
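The AUTO vs. REQUIRES APPROVAL split above is essentially a policy table consulted before every action. A minimal sketch of that boundary (class and action names are illustrative; the W&B polling loop and real side effects are omitted):

```python
from enum import Enum

class Autonomy(Enum):
    AUTO = "auto"
    REQUIRES_APPROVAL = "requires_approval"

# Mirrors the autonomy levels in the prompt.
POLICY = {
    "stop_run": Autonomy.AUTO,
    "save_checkpoint": Autonomy.AUTO,
    "log_issue": Autonomy.AUTO,
    "modify_hyperparameters": Autonomy.REQUIRES_APPROVAL,
    "restart_run": Autonomy.REQUIRES_APPROVAL,
    "notify_external": Autonomy.REQUIRES_APPROVAL,
}

# Hard constraint from the prompt: never delete checkpoints or data.
FORBIDDEN = {"delete_checkpoint", "delete_experiment_data"}

class ExperimentMonitorAgent:
    def __init__(self, approver=None):
        self.approver = approver  # callable(action) -> bool; None = always deny
        self.audit_log = []

    def request(self, action: str) -> bool:
        """Return True if the action may proceed, enforcing the boundary."""
        if action in FORBIDDEN:
            self.audit_log.append((action, "blocked"))
            return False
        # Unknown actions default to the safest level.
        level = POLICY.get(action, Autonomy.REQUIRES_APPROVAL)
        if level is Autonomy.AUTO:
            self.audit_log.append((action, "auto"))
            return True
        approved = bool(self.approver and self.approver(action))
        self.audit_log.append((action, "approved" if approved else "denied"))
        return approved
```

Keeping the policy table and forbidden set separate from the detection logic makes the safety boundary auditable in one place.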