Project: Interact with the Web Through Natural Language

1. Demo-First Goal

Build a Chrome extension that opens a bottom command bar with a keyboard shortcut, accepts a natural-language instruction, and executes it on the current website through fast, human-like browser actions.

This is a hackathon MVP optimized for a live demo in 12 hours.

Primary demo scenarios:

Hacker News: “Summarize the top 5 Hacker News articles.”
Gmail: “Give me a summary of my last 5 unread emails.”

Priority order:

Human-like but extremely fast navigation (most important for demo wow factor)
Agent speed
Reliable output

2. Product Experience

Trigger

Keyboard shortcut (default): Cmd/Ctrl + Shift + Space
Opens a bottom floating command bar (Spotlight-style).

User flow

User types prompt in command bar.
Extension shows immediate state: Planning... -> Executing... -> Summarizing....
Agent performs visible browser actions (click/open/back/scroll/read).
Final answer appears in the command bar panel with:
- concise summary
- sources visited (titles/links)
- optional “show steps” toggle

UX principle

Fast enough to feel magical.
Human-like motion so actions feel understandable and trustworthy.

3. Scope (MVP vs Not Now)

In scope (must ship)

Bottom command bar UI in content script.
Keyboard shortcut to toggle UI.
Task execution loop for two domains:
- news.ycombinator.com
- mail.google.com
Claude-based planner + step generator.
Deterministic action executor (click, type, open, back, extract text).
Final summarization response.
Basic guardrails + timeout + retry.

Out of scope (skip for hackathon)

General autonomous browsing across all websites.
Multi-tab orchestration beyond what is needed for demo.
Long-term memory, user profiles, auth vaults.
Complex computer vision interaction.

4. High-Level Architecture

Components

Content Script
- Renders bottom command bar UI.
- Captures page state and DOM candidates.
- Runs action executor directly in page context.
Background Service Worker (MV3)
- Orchestrates workflow state.
- Calls Claude API.
- Handles message passing with content script.
Claude Agent Layer
- Produces strict JSON plans/actions.
- Uses a small action vocabulary for reliability.
Summarizer
- Claude pass over collected text chunks.
- Returns concise final output.

Why this architecture

Keeps execution local in content script for speed.
Keeps model calls centralized for simpler control.
Restricts model output format for robust execution.

5. Action System (Reliability + Speed)

Use a constrained action schema instead of free-form instructions.

Supported actions

LIST_ITEMS (find candidate list rows with selector hints)
CLICK
OPEN_IN_SAME_TAB
BACK
WAIT_FOR
EXTRACT_TEXT (main article/email body)
SCROLL
DONE

Execution contract

Claude outputs valid JSON only.
Executor validates each step before running.
On failure:
- quick fallback selector attempt
- one retry
- if still failing, return partial result with explicit warning

Human-like fast navigation

Use “micro-animated cursor” overlay:
- very short curved movement to click target
- 80–180ms move + 40ms click pulse
Real action still executes through direct DOM events for speed.
Tiny randomized delays (20–80ms) to avoid robotic feel while staying fast.

6. Domain Strategies for Demo

A) Hacker News strategy

Prompt example: “Summarize the top 5 hackernews articles.”

Plan:

Ensure on news.ycombinator.com (navigate if needed).
Get top 5 story links from .athing rows.
For each story:
- click/open
- wait for readable content container
- extract article text (first N chars/tokens)
- go back
Summarize all 5 with bullet points + one-line takeaways.

Fallback:

If article extraction is weak, use document.body.innerText truncation.

B) Gmail strategy

Prompt example: “Give me a summary of my last 5 unread emails.”

Plan:

Ensure on mail.google.com.
Filter/select unread threads from inbox list (is:unread behavior via UI or unread row selectors).
For each of first 5 unread:
- open thread
- extract sender + subject + visible body text
- go back to inbox
Return concise email-by-email summary + suggested priorities.

Gmail reliability notes:

Prefer robust selectors based on roles/aria-labels over fragile class names.
Keep extraction tolerant to Gmail DOM changes by using multiple fallback selectors.

7. Claude Integration Design

Model usage

Use Claude for:
1. planning next actions
2. final summarization

Prompting pattern

System instruction should enforce:

output strict JSON with allowed actions only
max steps per iteration (e.g., 3–5)
never invent data
if unsure, request additional extraction step

Iterative loop

Send task + compact page context + current progress.
Receive next action block.
Execute actions.
Feed execution results back.
Repeat until DONE.
Run final summarize pass.

Performance controls

Keep context compact:
- page title
- URL
- top candidate elements
- extracted text snippets only
Hard limits:
- max pages (5 for demo)
- max extraction chars per page
- max loops + hard timeout

8. Technical Stack

Extension: Chrome Extension Manifest V3
Language: TypeScript
UI: Lightweight HTML/CSS + minimal animation (no heavy framework needed)
LLM: Claude API (Anthropic)
Build: Vite (or simple esbuild) for fast iteration

Recommended file layout:

src/background.ts
src/content/ui.ts
src/content/executor.ts
src/content/extractors/hackernews.ts
src/content/extractors/gmail.ts
src/agent/claude.ts
src/agent/schema.ts
src/shared/types.ts

9. 12-Hour Build Plan (Demo Optimized)

Hour 0–1: Skeleton

Set up MV3 extension with background + content script.
Add keyboard shortcut and bottom bar shell.
Basic message passing.

Hour 1–3: Executor Core

Implement action schema + validator.
Implement click/open/back/wait/extract primitives.
Add visible cursor animation layer.

Hour 3–5: Hacker News Vertical Slice

Implement HN selectors/extraction.
End-to-end run for top 5 summaries.
Handle failures + retries.

Hour 5–7: Gmail Vertical Slice

Implement unread thread navigation and extraction.
Add selector fallbacks.
Stabilize back-navigation and thread indexing.

Hour 7–9: Claude Loop + Summarizer

Integrate Claude calls.
Build iterative plan/execute loop.
Add strict JSON parsing and recovery.

Hour 9–10: Speed Pass

Reduce waits/timeouts.
Parallelize where safe (pre-fetch lightweight metadata only).
Cache prompt scaffolding.

Hour 10–11: Reliability Pass

Add hard limits, timeout messaging, partial-success response.
Test both demo scenarios repeatedly.

Hour 11–12: Demo Polish

Refine UI copy/status transitions.
Add “show steps” panel.
Prepare fixed demo script and backup prompts.

10. Demo Script (What to Show Judges)

Demo 1: Hacker News

Open Hacker News front page.
Press shortcut, type: “Summarize the top 5 hackernews articles.”
Let agent visibly click/open/read/back fast.
Show final summary with 5 bullets.

Demo 2: Gmail

Open Gmail inbox.
Press shortcut, type: “Give me a summary of my last 5 unread emails.”
Agent opens threads one by one, extracts context, returns to inbox.
Show concise summary + priority suggestions.

Backup prompt options:

“Summarize top 3 instead of 5.”
“Give me only urgent unread emails summary.”

11. Risk Register + Mitigations

Gmail DOM instability

Mitigation: multi-selector fallback, role/aria-first strategy, partial result mode.

LLM outputs invalid JSON

Mitigation: schema validation + repair pass + retry with strict correction prompt.

Slow run time

Mitigation: hard cap on extracted text, short waits, limited loops, optimized selector queries.

Navigation errors during live demo

Mitigation: pre-demo smoke test checklist + fallback prompts (top 3).

12. Success Criteria for Hackathon

Must-have to win the demo:

Command bar opens instantly with shortcut.
Both HN and Gmail tasks complete end-to-end in a single run.
Agent movement looks human but feels very fast.
Final output is clear, structured, and grounded in extracted content.

Nice-to-have:

Step replay panel.
“Why this summary?” trace with snippet references.

13. Build Philosophy (for the next 12h)

Optimize for a stable, impressive demo, not for full generalization.
Prefer deterministic selectors + constrained agent actions over open-ended autonomy.
If a tradeoff appears, choose what improves:
1. perceived speed
2. visible competence
3. reliability on the two scripted scenarios

FilesExpand file tree

PROJECT.md

Latest commit

History