29 changes: 24 additions & 5 deletions README.md
@@ -576,14 +576,15 @@ flowchart TB
### Prompt Management
- **[`preprocess`](#5-preprocess)**: Preprocesses prompt files, handling includes, comments, and other directives
- **[`split`](#7-split)**: Splits large prompt files into smaller, more manageable ones
- **[`extract`](#21-extract)**: Manages cached LLM extractions for prompt context
- **[`auto-deps`](#15-auto-deps)**: Analyzes and inserts needed dependencies into a prompt file
- **[`detect`](#10-detect)**: Analyzes prompts to determine which ones need changes based on a description
- **[`conflicts`](#11-conflicts)**: Finds and suggests resolutions for conflicts between two prompt files
- **[`trace`](#13-trace)**: Finds the corresponding line number in a prompt file for a given code line

### Utility Commands
- **[`auth`](#18-auth)**: Manages authentication with PDD Cloud
- **[`sessions`](#19-pdd-sessions---manage-remote-sessions)**: Manage remote sessions for `connect`
- **[`auth`](#19-auth)**: Manages authentication with PDD Cloud
- **[`sessions`](#20-pdd-sessions---manage-remote-sessions)**: Manage remote sessions for `connect`

## Global Options

@@ -2437,7 +2438,7 @@ pdd connect --session-name "my-dev-server"

**When to use**: Use `connect` when you prefer a graphical interface for working with PDD, when demonstrating PDD to others, or when integrating PDD with other tools that can communicate via REST APIs.

### 18. auth
### 19. auth

Manages authentication with PDD Cloud. The `auth` command provides subcommands for signing in, signing out, checking status, and retrieving authentication tokens.

@@ -2498,7 +2499,7 @@ pdd auth token [OPTIONS]

**When to use**: Use `auth` commands to manage your PDD Cloud authentication state. Use `auth login` to authenticate before using cloud features, `auth status` to verify your current session, and `auth token` when you need to pass credentials to scripts or other tools.

### 19. `pdd sessions` - Manage Remote Sessions
### 20. `pdd sessions` - Manage Remote Sessions

The `sessions` command group allows you to manage remote PDD sessions registered with PDD Cloud. Remote sessions enable you to control PDD instances running on other machines through the web frontend.

@@ -2536,7 +2537,25 @@ pdd sessions cleanup --all --force

**When to use**: Use `sessions list` to discover available remote sessions, `sessions info` to check session details, and `sessions cleanup` to remove stale or orphaned sessions.

### 20. Firecrawl Web Scraping Cache
### 21. extract

Manages cached extractions generated by the `<extract>` tag in prompts.

```bash
# Refresh extractions for a specific prompt
pdd extract refresh prompts/my_module.prompt

# List all cached extractions
pdd extract list

# Show status of extractions (stale/fresh)
pdd extract status

# Preview extraction without caching
pdd extract preview docs/large_api.md --query "auth flow"
```
Comment on lines +2544 to +2556
Copilot AI Feb 24, 2026

The PR description’s “Next Steps” uses `pdd extract --refresh ...`, but the README examples here use `pdd extract refresh ...` (subcommand). Please make the PR description and docs consistent with the actual CLI syntax to avoid user confusion.


### 22. Firecrawl Web Scraping Cache

**Automatic caching** for web content scraped via `<web>` tags in prompts. Reduces API credit usage by caching results for 24 hours by default.

55 changes: 52 additions & 3 deletions docs/prompting_guide.md
@@ -193,7 +193,7 @@ Tip: Prefer small, named sections using XML‑style tags to make context scannab

### Special XML Tags: pdd, shell, web

The PDD preprocessor supports additional XML‑style tags to keep prompts clean, reproducible, and self‑contained. Processing order (per spec) is: `pdd` → `include`/`include-many` → `shell` → `web`. When `recursive=True`, `<shell>` and `<web>` are deferred until a non‑recursive pass.
The PDD preprocessor supports additional XML‑style tags to keep prompts clean, reproducible, and self‑contained. Processing order (per spec) is: `pdd` → `include` → `extract` → `include-many` → `shell` → `web`. When `recursive=True`, `<shell>`, `<web>`, `<extract>`, and `<include-many>` are deferred until a non‑recursive pass.

- `<pdd>...</pdd>`
- Purpose: human‑only comment. Removed entirely during preprocessing.
@@ -212,9 +212,10 @@ The PDD preprocessor supports additional XML‑style tags to keep prompts clean,

> ⚠️ **Warning: Non-Deterministic Tags**
>
> `<shell>` and `<web>` introduce **non-determinism**:
> `<shell>`, `<web>`, and `<extract>` introduce **non-determinism**:
> - `<shell>` output varies by environment (different machines, different results)
> - `<web>` content changes over time (same URL, different content)
> - `<extract>` relies on LLM interpretation (may vary by model or seed)
>
> **Impact:** Same prompt file → different generations on different machines/times
>
@@ -429,7 +430,55 @@ Use this pattern when:
- **Shared constraints evolve** (e.g., coding standards, security policies). A single edit to the preamble file updates all prompts.
- **Interface definitions change** (e.g., a dependency's example file). Prompts consuming that example stay current.

*Tradeoff:* Large includes consume context tokens. If only a small portion of a file is relevant, consider extracting that portion into a dedicated include file (e.g., `docs/output_conventions.md` rather than the full `README.md`).
### Selective Includes

To further optimize token usage, PDD supports **Selective Includes**, allowing you to include only specific parts of a file (e.g., a single function, class, or section).

**Syntax:**
```xml
<!-- Python: function/class extraction -->
<include path="src/utils.py" select="def:parse_user_id"/>
<include path="src/models.py" select="class:User"/>

<!-- Markdown: section under heading -->
<include path="docs/config.md" select="section:Environment Variables"/>

<!-- Generic: line range or regex -->
<include path="src/config.py" lines="10-50"/>
<include path="src/constants.py" select="pattern:/^API_.*=/"/>
```
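To make the `lines` attribute concrete, here is a minimal sketch of how a line range could be resolved. The accepted forms (`N-M`, `N-`, `-M`, `N`, 1-based) follow the selector requirements in `content_selector_python.prompt`; treating the range as inclusive and the function name `select_lines` are assumptions for illustration, not the actual `ContentSelector` implementation.

```python
def select_lines(content: str, spec: str) -> str:
    """Return the lines of `content` named by `spec`.

    Hypothetical helper. Indices are 1-based and (by assumption) inclusive.
    Accepted forms: "N-M", "N-", "-M", "N".
    """
    lines = content.splitlines()
    spec = spec.strip()
    if "-" in spec:
        start_s, end_s = spec.split("-", 1)
        # An omitted bound defaults to the first or last line.
        start = int(start_s) if start_s.strip() else 1
        end = int(end_s) if end_s.strip() else len(lines)
    else:
        start = end = int(spec)
    if start < 1 or end < start:
        raise ValueError(f"invalid line range: {spec!r}")
    # Slicing clamps an end bound that runs past the last line.
    return "\n".join(lines[start - 1:end])
```

Under these assumptions, `lines="10-50"` would correspond to `select_lines(text, "10-50")`.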

**Interface Mode:**
Use `mode="interface"` to extract only the public API (signatures, docstrings, type hints) of a module, skipping implementation details. This is ideal for large dependencies where you only need the contract.

```xml
<include path="src/billing/service.py" select="class:BillingService" mode="interface"/>
```

**Token Budgeting:**
Enforce token limits on included content using `max_tokens`. The `overflow` attribute determines what happens when the limit is exceeded: `warn` (the default), `truncate`, or `error`.

```xml
<include path="docs/api_ref.md" max_tokens="1000" overflow="error"/>
```
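The three `overflow` modes can be sketched as follows. This is a simplified illustration, not the real implementation: it uses only the `len(content) // 4` fallback estimate mentioned in the `content_selector` prompt requirements (the prompt prefers `tiktoken` when available), and the names `estimate_tokens` and `apply_budget` are hypothetical.

```python
import warnings


def estimate_tokens(text: str) -> int:
    """Rough fallback estimate: about four characters per token."""
    return len(text) // 4


def apply_budget(content: str, max_tokens: int, overflow: str = "warn") -> str:
    """Apply a token budget to `content` according to the `overflow` mode."""
    used = estimate_tokens(content)
    if used <= max_tokens:
        return content
    if overflow == "error":
        raise ValueError(f"content uses ~{used} tokens, limit is {max_tokens}")
    if overflow == "truncate":
        # Cut to the approximate character budget and flag the cut.
        return content[: max_tokens * 4] + "\n[truncated to fit token budget]"
    if overflow == "warn":
        warnings.warn(f"content uses ~{used} tokens, limit is {max_tokens}")
        return content
    raise ValueError(f"unknown overflow mode: {overflow!r}")
```

Note that `warn` still includes the full content, so it reports a budget overrun without enforcing the budget; `truncate` or `error` are the modes that actually cap token usage.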

### LLM-Powered Extraction (`<extract>`)

For large, unstructured documents where structural selectors aren't enough (e.g., "find all retry policies in this 50-page PDF"), use the `<extract>` tag. This uses an LLM to semantically extract relevant information.

**Syntax:**

```xml
<extract path="docs/large_api_reference.md">
Authentication flow, JWT token structure, and refresh token handling
</extract>
```

**Behavior:**

- **First run:** PDD asks an LLM to perform the extraction, caches the result in `.pdd/extracts/`, and includes it.
- **Subsequent runs:** PDD uses the cached file (deterministic and fast).
- **Updates:** If the source file changes, PDD warns you. Run `pdd extract refresh` to update the cache.
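
The cache-then-warn behavior described above can be pictured with the sketch below. Only the cache location `.pdd/extracts/` comes from the text; the JSON record layout, the hash-based cache key, the `run_llm` callback, and the function name `cached_extract` are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Tuple


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cached_extract(
    source: Path,
    query: str,
    cache_dir: Path,
    run_llm: Callable[[str, str], str],
) -> Tuple[str, bool]:
    """Return (extraction, is_stale); `run_llm` is called only on a cache miss."""
    source_text = source.read_text(encoding="utf-8")
    # Hypothetical key: one cache entry per (source path, query) pair.
    entry = cache_dir / f"{_sha256(f'{source}|{query}')[:16]}.json"
    if entry.exists():
        record = json.loads(entry.read_text(encoding="utf-8"))
        # A changed source hash marks the entry stale but does not refresh it.
        stale = record["source_hash"] != _sha256(source_text)
        return record["result"], stale
    result = run_llm(source_text, query)
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"source_hash": _sha256(source_text), "query": query, "result": result}
    entry.write_text(json.dumps(record), encoding="utf-8")
    return result, False
```

In this sketch a stale entry is still returned as-is, which keeps repeated runs deterministic; an explicit refresh step would be what replaces it.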

### Positive over Negative Constraints

6 changes: 6 additions & 0 deletions pdd/commands/__init__.py
@@ -12,6 +12,10 @@
from .connect import connect
from .auth import auth_group
from .misc import preprocess
try:
    from .extract import extract
except ImportError:
    extract = None
Comment on lines +17 to +18
Copilot AI Feb 24, 2026

Catching broad `ImportError` when importing `.extract` can hide real import-time failures inside `pdd/commands/extract.py` (e.g., missing dependency) and silently omit the command from the CLI. Prefer catching `ModuleNotFoundError` for the specific module, or re-raising unexpected import errors so packaging/dep issues fail loudly.

Suggested change
except ImportError:
    extract = None
except ModuleNotFoundError as exc:
    expected_name = __name__.rsplit(".", 1)[0] + ".extract"
    if getattr(exc, "name", None) == expected_name:
        extract = None
    else:
        raise
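
One detail in the suggested change looks worth double-checking: inside `pdd/commands/__init__.py`, `__name__` is already the package name (`pdd.commands`), so `__name__.rsplit(".", 1)[0] + ".extract"` evaluates to `pdd.extract` rather than `pdd.commands.extract`, and the guard would then re-raise even when only the optional module is absent. The standalone sketch below shows the same "swallow only the expected missing module" idea; the helper name `optional_import` is hypothetical and it uses `importlib` instead of the relative import in the diff.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import `module_name`; return None only if that exact module is absent.

    A ModuleNotFoundError raised for any other name (for example, a missing
    dependency imported inside the module) is re-raised so it fails loudly.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        if exc.name == module_name:
            return None
        raise
```

In the package `__init__`, the equivalent comparison would be against `__name__ + ".extract"`.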

from .sessions import sessions
from .report import report_core
from .templates import templates_group
@@ -38,6 +42,8 @@ def register_commands(cli: click.Group) -> None:
    cli.add_command(crash)
    cli.add_command(trace)
    cli.add_command(preprocess)
    if extract:
        cli.add_command(extract)
    cli.add_command(report_core)
    cli.add_command(install_completion_cmd, name="install_completion")
    cli.add_command(verify)
92 changes: 90 additions & 2 deletions pdd/preprocess.py
@@ -217,17 +217,39 @@ def process_xml_tags(text: str, recursive: bool, _seen: Optional[set] = None) ->
        _seen = set()
    text = process_pdd_tags(text)
    text = process_include_tags(text, recursive, _seen=_seen)
    text = process_extract_tags(text, recursive)
    text = process_include_many_tags(text, recursive)
    text = process_shell_tags(text, recursive)
    text = process_web_tags(text, recursive)
    return text

def _parse_attrs(attr_str: str) -> dict:
    if not attr_str:
        return {}
    attrs = {}
    # Simple attribute parser: key="value" or key='value'
    for match in re.finditer(r'(\w+)\s*=\s*["\']([^"\']*)["\']', attr_str):
        attrs[match.group(1)] = match.group(2)
    return attrs

def process_include_tags(text: str, recursive: bool, _seen: Optional[set] = None) -> str:
    if _seen is None:
        _seen = set()
    pattern = r'<include>(.*?)</include>'
    # Support both <include>path</include> and <include path="path" attrs... />
    pattern = r'<include(?P<attrs>\s+[^>]*?)?>(?P<content>.*?)</include>|<include(?P<attrs_self>\s+[^>]*?)\s*/>'

    def replace_include(match):
        file_path = match.group(1).strip()
        attrs_str = match.group('attrs') or match.group('attrs_self') or ""
        attrs = _parse_attrs(attrs_str)

        file_path = attrs.get('path')
        if file_path:
            file_path = get_file_path(file_path) or match.group('content') or ""
Comment on lines +245 to +247
Copilot AI Feb 24, 2026

`process_include_tags` currently breaks the legacy `<include>path</include>` form: when no `path` attribute is present, `file_path` stays `None` and `file_path.strip()` will raise. Also `get_file_path()` is being called before the file is opened, and then called again later, which can lead to duplicated `././...` paths. Parse `file_path` as `attrs.get('path') or match.group('content')`, and only resolve it once (right before opening) to preserve existing behavior.

Suggested change
        file_path = attrs.get('path')
        if file_path:
            file_path = get_file_path(file_path) or match.group('content') or ""
        # Support both attribute and content-based include paths
        file_path = attrs.get('path') or match.group('content') or ""
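
To see how the suggested fallback behaves for both tag forms, here is a small standalone sketch. It reuses the regular expressions from the diff, but the helper name `include_path` is hypothetical, and path resolution via `get_file_path` is deliberately left out.

```python
import re
from typing import Optional

# Patterns copied from the diff; DOTALL is an assumption about how they are applied.
INCLUDE_RE = re.compile(
    r'<include(?P<attrs>\s+[^>]*?)?>(?P<content>.*?)</include>'
    r'|<include(?P<attrs_self>\s+[^>]*?)\s*/>',
    re.DOTALL,
)
ATTR_RE = re.compile(r'(\w+)\s*=\s*["\']([^"\']*)["\']')


def include_path(tag: str) -> Optional[str]:
    """Return the unresolved include path for either tag form, or None."""
    match = INCLUDE_RE.search(tag)
    if match is None:
        return None
    attrs_str = match.group("attrs") or match.group("attrs_self") or ""
    attrs = dict(ATTR_RE.findall(attrs_str))
    # Prefer the attribute; fall back to the legacy tag body.
    path = (attrs.get("path") or match.group("content") or "").strip()
    return path or None
```

With this fallback, `<include>docs/a.md</include>` and `<include path="docs/a.md"/>` both yield `docs/a.md`. One further thing the sketch suggests checking: because the first alternative accepts any attributes before `>`, a self-closing tag followed later by a legacy `<include>...</include>` can be matched as a single span, so mixed usage in one prompt may deserve a test.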

        file_path = file_path.strip()

        if not file_path:
            return match.group(0)

        try:
            full_path = get_file_path(file_path)
            resolved = os.path.realpath(full_path)
@@ -277,6 +299,37 @@ def replace_include(match):
            console.print(f"Processing XML include: [cyan]{full_path}[/cyan]")
            with open(full_path, 'r', encoding='utf-8') as file:
                content = file.read()

            # Apply selectors if any
            selectors_str = attrs.get('select')
            lines_str = attrs.get('lines')
            mode = attrs.get('mode', 'full')
            max_tokens = attrs.get('max_tokens')
            overflow = attrs.get('overflow', 'warn')

            if selectors_str or lines_str or mode != 'full' or max_tokens:
                selectors = []
                if selectors_str:
                    selectors.extend([s.strip() for s in selectors_str.split(',')])
                if lines_str:
                    selectors.append(f"lines:{lines_str}")

                try:
                    from pdd.content_selector import ContentSelector
                    selector = ContentSelector()
                    content = selector.select(
                        content=content,
                        selectors=selectors,
                        file_path=full_path,
                        mode=mode,
                        max_tokens=int(max_tokens) if max_tokens else None,
                        overflow=overflow
                    )
                except ImportError:
                    console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
                except Exception as e:
                    console.print(f"[bold red]Error in content selection:[/bold red] {e}")
Comment on lines +329 to +331
Copilot AI Feb 24, 2026

When selectors/limits are requested (`select`, `lines`, `mode`, `max_tokens`) but `pdd.content_selector` is missing or selection errors, the code falls back to including full content. That can silently defeat token-budgeting and change prompt semantics. Consider failing fast (raise/return a visible placeholder) when selection was explicitly requested, rather than including the entire file.

Suggested change
                    console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
                except Exception as e:
                    console.print(f"[bold red]Error in content selection:[/bold red] {e}")
                    console.print("[yellow]Warning: pdd.content_selector not found.[/yellow]")
                    # When selectors/limits are requested but the content selector
                    # is unavailable, avoid silently including full content.
                    # First pass (recursive=True): leave the tag so a later run might resolve it
                    # Second pass (recursive=False): replace with a visible placeholder
                    return match.group(0) if recursive else f"[Content selector unavailable for: {file_path}]"
                except Exception as e:
                    console.print(f"[bold red]Error in content selection:[/bold red] {e}")
                    # On selection errors, do not fall back to full content.
                    # Follow the same recursive/placeholder pattern as for missing files.
                    return match.group(0) if recursive else f"[Content selection error for: {file_path}]"


            if recursive:
                child_seen = _seen | {resolved}
                content = preprocess(content, recursive=True, double_curly_brackets=False, _seen=child_seen)
@@ -314,6 +367,41 @@ def replace_include_with_spans(match):
        iterations += 1
    return current_text

def process_extract_tags(text: str, recursive: bool) -> str:
    pattern = r'<extract(?P<attrs>\s+[^>]*?)?>(?P<query>.*?)</extract>'
    def replace_extract(match):
        attrs_str = match.group('attrs') or ""
        attrs = _parse_attrs(attrs_str)
        query = match.group('query').strip()
        file_path = attrs.get('path')
        if file_path:
            file_path = get_file_path(file_path)

        if not file_path:
            console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute")
            return "[Error: <extract> tag missing 'path' attribute]"

        if recursive:
            return match.group(0)

Comment on lines +372 to +386
Copilot AI Feb 24, 2026

`process_extract_tags` validates `path` (and prints an error/returns an error string) before checking `recursive`. In the recursive pass, `<extract>` tags are supposed to be deferred unchanged; as written, a missing `path` will be replaced with an error during the recursive pass and cannot be retried later. Move the `if recursive: return match.group(0)` check to the top of `replace_extract` before any parsing/validation.
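
The reordering described in this comment could look roughly like the sketch below. It is a simplified stand-in, not the PR's code: the function name `process_extract_tags_sketch` and the injected `run_extract` callback are hypothetical, and code-span skipping and `get_file_path` resolution are omitted.

```python
import re
from typing import Callable

EXTRACT_RE = re.compile(
    r'<extract(?P<attrs>\s+[^>]*?)?>(?P<query>.*?)</extract>', re.DOTALL
)
ATTR_RE = re.compile(r'(\w+)\s*=\s*["\']([^"\']*)["\']')


def process_extract_tags_sketch(
    text: str, recursive: bool, run_extract: Callable[[str, str], str]
) -> str:
    """Replace <extract> tags, deferring every tag untouched when recursive."""

    def replace(match: re.Match) -> str:
        # Deferral comes first, so even malformed tags survive the recursive pass.
        if recursive:
            return match.group(0)
        attrs = dict(ATTR_RE.findall(match.group("attrs") or ""))
        path = attrs.get("path")
        if not path:
            return "[Error: <extract> tag missing 'path' attribute]"
        return run_extract(path, match.group("query").strip())

    return EXTRACT_RE.sub(replace, text)
```

Passing the extraction step in as a callback also makes the deferral behavior easy to unit-test without an LLM.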

        try:
            from pdd.llm_extractor import LLMExtractor
            extractor = LLMExtractor()
            return extractor.extract(file_path=file_path, query=query)
        except ImportError:
            console.print("[yellow]Warning: pdd.llm_extractor not found. Cannot perform semantic extraction.[/yellow]")
            return f"[Error: pdd.llm_extractor not found. Cannot extract from {file_path}]"
        except Exception as e:
            console.print(f"[bold red]Error in semantic extraction:[/bold red] {e}")
            return f"[Error in semantic extraction from {file_path}: {e}]"

    code_spans = _extract_code_spans(text)
    def replace_extract_with_spans(match):
        if _intersects_any_span(match.start(), match.end(), code_spans):
            return match.group(0)
        return replace_extract(match)
    return re.sub(pattern, replace_extract_with_spans, text, flags=re.DOTALL)

Comment on lines 235 to 404
Copilot AI Feb 24, 2026

New behaviors in `preprocess.py` (self-closing `<include .../>` / attribute-based includes and `<extract>` deferral/execution) don’t appear to be covered by existing preprocess tests. Add unit tests alongside `tests/test_preprocess.py` to ensure both old and new `<include>` syntaxes work and that `<extract>` is deferred on `recursive=True` and executed/placeholdered on the second pass.

def process_pdd_tags(text: str) -> str:
    pattern = r'<pdd>.*?</pdd>'
    # Replace pdd tags with an empty string first
47 changes: 47 additions & 0 deletions pdd/prompts/content_selector_python.prompt
@@ -0,0 +1,47 @@
<pdd-reason>Provides precise extraction of file content based on criteria like lines, AST, and regex for selective includes.</pdd-reason>

<pdd-interface>
{
  "type": "module",
  "module": {
    "functions": [
      {"name": "ContentSelector.select", "signature": "(content: str, selectors: list[str], ...)", "returns": "str"}
    ]
  }
}
</pdd-interface>

% You are an expert Python engineer. Your goal is to create a module for deterministic content selection from files.

% Role & Scope
The `content_selector` module provides precise extraction of file content based on various criteria (lines, AST, Markdown sections, regex). It is used by the PDD preprocessor to handle selective includes.

% Requirements
1. Implement `ContentSelector` class with a `select(content: str, selectors: list[str], file_path: str = None, mode: str = "full", max_tokens: int = None, overflow: str = "warn") -> str` method.
2. Support `lines` selector: `lines:N-M`, `lines:N-`, `lines:-M`, `lines:N` (1-based indices).
3. Support Python structural selection using `ast` for `.py` files:
   - `def:function_name`: Extracts the full function definition, including decorators.
   - `class:ClassName`: Extracts the full class definition, including decorators and all members.
   - `class:ClassName.method_name`: Extracts a specific method from a class.
4. Support Markdown structural selection for `.md` files:
   - `section:Heading`: Extracts all content under the specified heading until the next heading of the same or higher level.
5. Support regex pattern selection: `pattern:/regex/`.
6. Support `mode="interface"` for Python:
   - Extract only class/function/method signatures, docstrings, and type hints.
   - Remove function/method bodies (replace with `...`).
   - Exclude private members (starting with `_` but not `__init__`) by default.
7. Support `max_tokens` (use `tiktoken` if available, otherwise fallback to `len(content) // 4`) with `overflow` options:
   - `error`: Raise an error if limit exceeded.
   - `truncate`: Truncate content and append a warning.
   - `warn`: Include full content but issue a warning (default).
8. Handle multiple selectors (comma-separated string or list) by returning the union of selected parts, preserved in original file order.
9. Use `rich` for formatted error reporting and console status (e.g., warnings for truncation/overflow).
10. Ensure robust error handling for malformed selectors or missing content, providing descriptive error messages.

% Dependencies
<rich_example>
<include>context/core/errors_example.py</include>
</rich_example>

% Deliverables
- Code: `pdd/content_selector.py`
53 changes: 53 additions & 0 deletions pdd/prompts/extract_command_python.prompt
@@ -0,0 +1,53 @@
<pdd-reason>Provides a CLI command for managing LLM-powered semantic extractions from files.</pdd-reason>

<pdd-interface>
{
  "type": "cli",
  "cli": {
    "commands": [
      {"name": "pdd extract", "description": "Manage LLM-powered semantic extractions."}
    ]
  }
}
</pdd-interface>

% You are an expert Python engineer. Your goal is to create a PDD CLI command for managing LLM-powered extractions.

% Role & Scope
The `extract` command provides a CLI interface for the `llm_extractor` module, allowing users to refresh, list, and preview semantic extractions.

% Requirements
1. Implement `extract` command in `pdd/commands/extract.py` using `click`.
2. Support subcommands:
   - `refresh [PROMPT_FILE]`: Re-runs all `<extract>` tags found in the specified prompt file(s).
   - `list`: Lists all cached extractions in `.pdd/extracts/` with metadata (source, query, size).
   - `status`: Checks for staleness of all cached extractions relative to their source files.
   - `preview PATH --query QUERY`: Performs a one-off extraction and displays it to stdout without caching.
3. Integrate with `LLMExtractor` for extraction logic.
4. Use `rich` for beautiful table outputs and status messages.
5. Register the command in `pdd/commands/__init__.py` (Note: this is a manual integration point, but the module should be designed for it).
6. Support `--force` to skip confirmation prompts where applicable.

% Dependencies
<llm_extractor_interface>
# Expected interface for pdd.llm_extractor (yet to be generated)
class LLMExtractor:
    def refresh_extractions(self, prompt_path: str = None) -> list[dict]:
        """Refreshes extractions. Returns list of updated extraction metadata."""
        pass

    def list_extractions(self) -> list[dict]:
        """Returns list of all cached extractions with metadata."""
        pass

    def check_status(self) -> list[dict]:
        """Checks staleness. Returns list of extraction status dicts."""
        pass

    def preview_extraction(self, path: str, query: str) -> str:
        """Performs one-off extraction without caching."""
        pass
</llm_extractor_interface>

% Deliverables
- Code: `pdd/commands/extract.py`