promptdriven · niti-go · Feb 23, 2026 · Feb 23, 2026 · Feb 24, 2026 · Feb 25, 2026
diff --git a/README.md b/README.md
@@ -576,14 +576,15 @@ flowchart TB
 ### Prompt Management
 - **[`preprocess`](#5-preprocess)**: Preprocesses prompt files, handling includes, comments, and other directives
 - **[`split`](#7-split)**: Splits large prompt files into smaller, more manageable ones
+- **[`extract`](#21-extract)**: Manage cached LLM extractions for prompt context
 - **[`auto-deps`](#15-auto-deps)**: Analyzes and inserts needed dependencies into a prompt file
 - **[`detect`](#10-detect)**: Analyzes prompts to determine which ones need changes based on a description
 - **[`conflicts`](#11-conflicts)**: Finds and suggests resolutions for conflicts between two prompt files
 - **[`trace`](#13-trace)**: Finds the corresponding line number in a prompt file for a given code line
 
 ### Utility Commands
-- **[`auth`](#18-auth)**: Manages authentication with PDD Cloud
-- **[`sessions`](#19-pdd-sessions---manage-remote-sessions)**: Manage remote sessions for `connect`
+- **[`auth`](#19-auth)**: Manages authentication with PDD Cloud
+- **[`sessions`](#20-pdd-sessions---manage-remote-sessions)**: Manage remote sessions for `connect`
 
 ## Global Options
 
@@ -2437,7 +2438,7 @@ pdd connect --session-name "my-dev-server"
 
 **When to use**: Use `connect` when you prefer a graphical interface for working with PDD, when demonstrating PDD to others, or when integrating PDD with other tools that can communicate via REST APIs.
 
-### 18. auth
+### 19. auth
 
 Manages authentication with PDD Cloud. The `auth` command provides subcommands for signing in, signing out, checking status, and retrieving authentication tokens.
 
@@ -2498,7 +2499,7 @@ pdd auth token [OPTIONS]
 
 **When to use**: Use `auth` commands to manage your PDD Cloud authentication state. Use `auth login` to authenticate before using cloud features, `auth status` to verify your current session, and `auth token` when you need to pass credentials to scripts or other tools.
 
-### 19. `pdd sessions` - Manage Remote Sessions
+### 20. `pdd sessions` - Manage Remote Sessions
 
 The `sessions` command group allows you to manage remote PDD sessions registered with PDD Cloud. Remote sessions enable you to control PDD instances running on other machines through the web frontend.
 
@@ -2536,7 +2537,25 @@ pdd sessions cleanup --all --force
 
 **When to use**: Use `sessions list` to discover available remote sessions, `sessions info` to check session details, and `sessions cleanup` to remove stale or orphaned sessions.
 
-### 20. Firecrawl Web Scraping Cache
+### 21. extract
+
+Manage cached extractions generated by the `<extract>` tag in prompts.
+
+```bash
+# Refresh extractions for a specific prompt
+pdd extract refresh prompts/my_module.prompt
+
+# List all cached extractions
+pdd extract list
+
+# Show status of extractions (stale/fresh)
+pdd extract status
+
+# Preview extraction without caching
+pdd extract preview docs/large_api.md --query "auth flow"
+```
+
+### 22. Firecrawl Web Scraping Cache
 
 **Automatic caching** for web content scraped via `<web>` tags in prompts. Reduces API credit usage by caching results for 24 hours by default.
 

diff --git a/docs/prompting_guide.md b/docs/prompting_guide.md
@@ -193,7 +193,7 @@ Tip: Prefer small, named sections using XML‑style tags to make context scannab
 
 ### Special XML Tags: pdd, shell, web
 
-The PDD preprocessor supports additional XML‑style tags to keep prompts clean, reproducible, and self‑contained. Processing order (per spec) is: `pdd` → `include`/`include-many` → `shell` → `web`. When `recursive=True`, `<shell>` and `<web>` are deferred until a non‑recursive pass.
+The PDD preprocessor supports additional XML‑style tags to keep prompts clean, reproducible, and self‑contained. Processing order (per spec) is: `pdd` → `include` → `extract` → `include-many` → `shell` → `web`. When `recursive=True`, `<shell>`, `<web>`, `<extract>`, and `<include-many>` are deferred until a non‑recursive pass.
 
 - `<pdd>...</pdd>`
   - Purpose: human‑only comment. Removed entirely during preprocessing.
@@ -212,9 +212,10 @@ The PDD preprocessor supports additional XML‑style tags to keep prompts clean,
 
 > ⚠️ **Warning: Non-Deterministic Tags**
 >
-> `<shell>` and `<web>` introduce **non-determinism**:
+> `<shell>`, `<web>`, and `<extract>` introduce **non-determinism**:
 > - `<shell>` output varies by environment (different machines, different results)
 > - `<web>` content changes over time (same URL, different content)
+> - `<extract>` relies on LLM interpretation (may vary by model or seed)
 >
 > **Impact:** Same prompt file → different generations on different machines/times
 >
@@ -429,7 +430,55 @@ Use this pattern when:
 - **Shared constraints evolve** (e.g., coding standards, security policies). A single edit to the preamble file updates all prompts.
 - **Interface definitions change** (e.g., a dependency's example file). Prompts consuming that example stay current.
 
-*Tradeoff:* Large includes consume context tokens. If only a small portion of a file is relevant, consider extracting that portion into a dedicated include file (e.g., `docs/output_conventions.md` rather than the full `README.md`).
+### Selective Includes
+
+To further optimize token usage, PDD supports **Selective Includes**, allowing you to include only specific parts of a file (e.g., a single function, class, or section).
+
+**Syntax:**
+```xml
+<!-- Python: function/class extraction -->
+<include path="src/utils.py" select="def:parse_user_id"/>
+<include path="src/models.py" select="class:User"/>
+
+<!-- Markdown: section under heading -->
+<include path="docs/config.md" select="section:Environment Variables"/>
+
+<!-- Generic: line range or regex -->
+<include path="src/config.py" lines="10-50"/>
+<include path="src/constants.py" select="pattern:/^API_.*=/"/>
+```
+
+**Interface Mode:**
+Use `mode="interface"` to extract only the public API (signatures, docstrings, type hints) of a module, skipping implementation details. This is ideal for large dependencies where you only need the contract.
+
+```xml
+<include path="src/billing/service.py" select="class:BillingService" mode="interface"/>
+```
+
+**Token Budgeting:**
+Enforce token limits on included content using `max_tokens`. The `overflow` attribute determines behavior when the limit is exceeded: `warn` (default), `truncate`, or `error`.
+
+```xml
+<include path="docs/api_ref.md" max_tokens="1000" overflow="error"/>
+```
+
+### LLM-Powered Extraction (<extract>)
+
+For large, unstructured documents where structural selectors aren't enough (e.g., "find all retry policies in this 50-page PDF"), use the `<extract>` tag. This uses an LLM to semantically extract relevant information.
+
+**Syntax:**
+
+```xml
+<extract path="docs/large_api_reference.md">
+  Authentication flow, JWT token structure, and refresh token handling
+</extract>
+```
+
+**Behavior:**
+
+- **First run:** PDD asks an LLM to perform the extraction, caches the result in `.pdd/extracts/`, and includes it.
+- **Subsequent runs:** PDD uses the cached file (deterministic and fast).
+- **Updates:** If the source file changes, PDD warns you. Run `pdd extract refresh` to update the cache.
 
 ### Positive over Negative Constraints
 

diff --git a/pdd/commands/__init__.py b/pdd/commands/__init__.py
@@ -12,6 +12,10 @@
 from .connect import connect
 from .auth import auth_group
 from .misc import preprocess
+try:
+    from .extract import extract
+except ImportError:
+    extract = None
-except ImportError:
-    extract = None
+except ModuleNotFoundError as exc:
+    expected_name = __name__.rsplit(".", 1)[0] + ".extract"
+    if getattr(exc, "name", None) == expected_name:
+        extract = None
+    else:
+        raise
 from .sessions import sessions
 from .report import report_core
 from .templates import templates_group
@@ -38,6 +42,8 @@ def register_commands(cli: click.Group) -> None:
     cli.add_command(crash)
     cli.add_command(trace)
     cli.add_command(preprocess)
+    if extract:
+        cli.add_command(extract)
     cli.add_command(report_core)
     cli.add_command(install_completion_cmd, name="install_completion")
     cli.add_command(verify)
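
Note on the optional import above: the suggested `ModuleNotFoundError` handler only swallows the error when the missing module is `extract` itself, so a broken dependency inside `extract.py` would still surface. A minimal standalone sketch of that check follows; `optional_import` is a hypothetical helper used for illustration, not something this PR adds.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import a module, returning None only if that exact module is absent.

    Hypothetical helper: mirrors the name check in the suggested handler.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # exc.name is the module that could not be found. If it is a
        # different module (for example a missing dependency or parent
        # package), the failure is unexpected and should propagate.
        if getattr(exc, "name", None) == module_name:
            return None
        raise
```

With this helper, `optional_import("json")` returns the module, a missing top-level name returns `None`, and a dotted name whose parent package is missing re-raises, because `exc.name` then names the parent rather than the requested module.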

diff --git a/pdd/preprocess.py b/pdd/preprocess.py
@@ -217,17 +217,39 @@ def process_xml_tags(text: str, recursive: bool, _seen: Optional[set] = None) ->
         _seen = set()
     text = process_pdd_tags(text)
     text = process_include_tags(text, recursive, _seen=_seen)
+    text = process_extract_tags(text, recursive)
     text = process_include_many_tags(text, recursive)
     text = process_shell_tags(text, recursive)
     text = process_web_tags(text, recursive)
     return text
 
+def _parse_attrs(attr_str: str) -> dict:
+    if not attr_str:
+        return {}
+    attrs = {}
+    # Simple attribute parser: key="value" or key='value'
+    for match in re.finditer(r'(\w+)\s*=\s*["\']([^"\']*)["\']', attr_str):
+        attrs[match.group(1)] = match.group(2)
+    return attrs
+
 def process_include_tags(text: str, recursive: bool, _seen: Optional[set] = None) -> str:
     if _seen is None:
         _seen = set()
-    pattern = r'<include>(.*?)</include>'
+    # Support both <include>path</include> and <include path="path" attrs... />
+    pattern = r'<include(?P<attrs>\s+[^>]*?)?>(?P<content>.*?)</include>|<include(?P<attrs_self>\s+[^>]*?)\s*/>'
+
     def replace_include(match):
-        file_path = match.group(1).strip()
+        attrs_str = match.group('attrs') or match.group('attrs_self') or ""
+        attrs = _parse_attrs(attrs_str)
+
+        file_path = attrs.get('path')
+        if file_path:
+            file_path = get_file_path(file_path) or match.group('content') or ""
-        file_path = attrs.get('path')
-        if file_path:
-            file_path = get_file_path(file_path) or match.group('content') or ""
+        # Support both attribute and content-based include paths
+        file_path = attrs.get('path') or match.group('content') or ""
+        file_path = file_path.strip()
+
+        if not file_path:
+            return match.group(0)
+
         try:
             full_path = get_file_path(file_path)
             resolved = os.path.realpath(full_path)
@@ -277,6 +299,37 @@ def replace_include(match):
                 console.print(f"Processing XML include: [cyan]{full_path}[/cyan]")
                 with open(full_path, 'r', encoding='utf-8') as file:
                     content = file.read()
+
+                    # Apply selectors if any
+                    selectors_str = attrs.get('select')
+                    lines_str = attrs.get('lines')
+                    mode = attrs.get('mode', 'full')
+                    max_tokens = attrs.get('max_tokens')
+                    overflow = attrs.get('overflow', 'warn')
+
+                    if selectors_str or lines_str or mode != 'full' or max_tokens:
+                        selectors = []
+                        if selectors_str:
+                            selectors.extend([s.strip() for s in selectors_str.split(',')])
+                        if lines_str:
+                            selectors.append(f"lines:{lines_str}")
+
+                        try:
+                            from pdd.content_selector import ContentSelector
+                            selector = ContentSelector()
+                            content = selector.select(
+                                content=content,
+                                selectors=selectors,
+                                file_path=full_path,
+                                mode=mode,
+                                max_tokens=int(max_tokens) if max_tokens else None,
+                                overflow=overflow
+                            )
+                        except ImportError:
+                            console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
+                        except Exception as e:
+                            console.print(f"[bold red]Error in content selection:[/bold red] {e}")
-                            console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
-                        except Exception as e:
-                            console.print(f"[bold red]Error in content selection:[/bold red] {e}")
+                            console.print("[yellow]Warning: pdd.content_selector not found.[/yellow]")
+                            # When selectors/limits are requested but the content selector
+                            # is unavailable, avoid silently including full content.
+                            # First pass (recursive=True): leave the tag so a later run might resolve it
+                            # Second pass (recursive=False): replace with a visible placeholder
+                            return match.group(0) if recursive else f"[Content selector unavailable for: {file_path}]"
+                        except Exception as e:
+                            console.print(f"[bold red]Error in content selection:[/bold red] {e}")
+                            # On selection errors, do not fall back to full content.
+                            # Follow the same recursive/placeholder pattern as for missing files.
+                            return match.group(0) if recursive else f"[Content selection error for: {file_path}]"
+
                     if recursive:
                         child_seen = _seen | {resolved}
                         content = preprocess(content, recursive=True, double_curly_brackets=False, _seen=child_seen)
@@ -314,6 +367,41 @@ def replace_include_with_spans(match):
         iterations += 1
     return current_text
 
+def process_extract_tags(text: str, recursive: bool) -> str:
+    pattern = r'<extract(?P<attrs>\s+[^>]*?)?>(?P<query>.*?)</extract>'
+    def replace_extract(match):
+        attrs_str = match.group('attrs') or ""
+        attrs = _parse_attrs(attrs_str)
+        query = match.group('query').strip()
+        file_path = attrs.get('path')
+        if file_path:
+            file_path = get_file_path(file_path)
+
+        if not file_path:
+            console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute")
+            return "[Error: <extract> tag missing 'path' attribute]"
+
+        if recursive:
+            return match.group(0)
+
+        try:
+            from pdd.llm_extractor import LLMExtractor
+            extractor = LLMExtractor()
+            return extractor.extract(file_path=file_path, query=query)
+        except ImportError:
+            console.print("[yellow]Warning: pdd.llm_extractor not found. Cannot perform semantic extraction.[/yellow]")
+            return f"[Error: pdd.llm_extractor not found. Cannot extract from {file_path}]"
+        except Exception as e:
+            console.print(f"[bold red]Error in semantic extraction:[/bold red] {e}")
+            return f"[Error in semantic extraction from {file_path}: {e}]"
+
+    code_spans = _extract_code_spans(text)
+    def replace_extract_with_spans(match):
+        if _intersects_any_span(match.start(), match.end(), code_spans):
+            return match.group(0)
+        return replace_extract(match)
+    return re.sub(pattern, replace_extract_with_spans, text, flags=re.DOTALL)
+
 def process_pdd_tags(text: str) -> str:
     pattern = r'<pdd>.*?</pdd>'
     # Replace pdd tags with an empty string first
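
For reference, the two `<include>` forms and the attribute parser can be exercised in isolation. The sketch below copies the two regexes from this diff; `parse_attrs` restates `_parse_attrs`, and `resolve_include` is a hypothetical helper introduced only for illustration. It stops at path resolution and does not read any file.

```python
import re
from typing import Dict, Optional, Tuple

# Attribute regex as in the diff: key="value" or key='value'.
ATTR_RE = re.compile(r'(\w+)\s*=\s*["\']([^"\']*)["\']')

# Include regex as in the diff: paired form first, then self-closing form.
INCLUDE_RE = re.compile(
    r'<include(?P<attrs>\s+[^>]*?)?>(?P<content>.*?)</include>'
    r'|<include(?P<attrs_self>\s+[^>]*?)\s*/>'
)


def parse_attrs(attr_str: Optional[str]) -> Dict[str, str]:
    """Collect key/value pairs from an attribute string."""
    return {m.group(1): m.group(2) for m in ATTR_RE.finditer(attr_str or "")}


def resolve_include(tag: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (path, attrs) for the first include tag, or None (hypothetical)."""
    match = INCLUDE_RE.search(tag)
    if match is None:
        return None
    attrs = parse_attrs(match.group("attrs") or match.group("attrs_self"))
    # The path attribute wins; the tag body is the fallback.
    path = (attrs.get("path") or match.group("content") or "").strip()
    return path, attrs
```

One property this makes visible: the value character class excludes both quote characters, so an attribute such as `select="it's"` parses as `it`. Selectors that contain quotes may need a stricter parser.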

diff --git a/pdd/prompts/content_selector_python.prompt b/pdd/prompts/content_selector_python.prompt
@@ -0,0 +1,47 @@
+<pdd-reason>Provides precise extraction of file content based on criteria like lines, AST, and regex for selective includes.</pdd-reason>
+
+<pdd-interface>
+{
+  "type": "module",
+  "module": {
+    "functions": [
+      {"name": "ContentSelector.select", "signature": "(content: str, selectors: list[str], ...)", "returns": "str"}
+    ]
+  }
+}
+</pdd-interface>
+
+% You are an expert Python engineer. Your goal is to create a module for deterministic content selection from files.
+
+% Role & Scope
+The `content_selector` module provides precise extraction of file content based on various criteria (lines, AST, Markdown sections, regex). It is used by the PDD preprocessor to handle selective includes.
+
+% Requirements
+1. Implement `ContentSelector` class with a `select(content: str, selectors: list[str], file_path: str = None, mode: str = "full", max_tokens: int = None, overflow: str = "warn") -> str` method.
+2. Support `lines` selector: `lines:N-M`, `lines:N-`, `lines:-M`, `lines:N` (1-based indices).
+3. Support Python structural selection using `ast` for `.py` files:
+   - `def:function_name`: Extracts the full function definition, including decorators.
+   - `class:ClassName`: Extracts the full class definition, including decorators and all members.
+   - `class:ClassName.method_name`: Extracts a specific method from a class.
+4. Support Markdown structural selection for `.md` files:
+   - `section:Heading`: Extracts all content under the specified heading until the next heading of the same or higher level.
+5. Support regex pattern selection: `pattern:/regex/`.
+6. Support `mode="interface"` for Python:
+   - Extract only class/function/method signatures, docstrings, and type hints.
+   - Remove function/method bodies (replace with `...`).
+   - Exclude private members (starting with `_` but not `__init__`) by default.
+7. Support `max_tokens` (use `tiktoken` if available, otherwise fallback to `len(content) // 4`) with `overflow` options:
+   - `error`: Raise an error if limit exceeded.
+   - `truncate`: Truncate content and append a warning.
+   - `warn`: Include full content but issue a warning (default).
+8. Handle multiple selectors (comma-separated string or list) by returning the union of selected parts, preserved in original file order.
+9. Use `rich` for formatted error reporting and console status (e.g., warnings for truncation/overflow).
+10. Ensure robust error handling for malformed selectors or missing content, providing descriptive error messages.
+
+% Dependencies
+<rich_example>
+  <include>context/core/errors_example.py</include>
+</rich_example>
+
+% Deliverables
+- Code: `pdd/content_selector.py`
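
Requirements 2 and 7 in the prompt above are the most mechanical, so a rough sketch may help reviewers picture the intended behavior. It assumes 1-based inclusive ranges and the `len(content) // 4` fallback estimate named in requirement 7. The names `select_lines` and `apply_budget` and the truncation marker text are hypothetical; the generated `ContentSelector` may look quite different.

```python
import warnings
from typing import Optional


def select_lines(content: str, spec: str) -> str:
    """Select lines by spec: 'N-M', 'N-', '-M', or 'N' (1-based, inclusive)."""
    lines = content.splitlines()
    if "-" in spec:
        start_s, end_s = spec.split("-", 1)
        start = int(start_s) if start_s.strip() else 1
        end = int(end_s) if end_s.strip() else len(lines)
    else:
        start = end = int(spec)
    if start < 1 or end < start:
        raise ValueError(f"invalid line range: {spec!r}")
    # Slicing tolerates an end past the last line.
    return "\n".join(lines[start - 1:end])


def apply_budget(content: str, max_tokens: Optional[int],
                 overflow: str = "warn") -> str:
    """Apply a token budget using the rough 4-characters-per-token estimate."""
    estimated = len(content) // 4
    if max_tokens is None or estimated <= max_tokens:
        return content
    if overflow == "error":
        raise ValueError(f"~{estimated} tokens exceeds limit {max_tokens}")
    if overflow == "truncate":
        # Marker text is an assumption for this sketch.
        return content[: max_tokens * 4] + "\n[truncated]"
    # Default 'warn': keep the full content but report the overflow.
    warnings.warn(f"~{estimated} tokens exceeds limit {max_tokens}")
    return content
```

For example, `select_lines("a\nb\nc\nd", "2-3")` yields `"b\nc"`, and `apply_budget("x" * 40, 5, "truncate")` keeps the first 20 characters before the marker.
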
diff --git a/pdd/prompts/extract_command_python.prompt b/pdd/prompts/extract_command_python.prompt
@@ -0,0 +1,53 @@
+<pdd-reason>Provides a CLI command for managing LLM-powered semantic extractions from files.</pdd-reason>
+
+<pdd-interface>
+{
+  "type": "cli",
+  "cli": {
+    "commands": [
+      {"name": "pdd extract", "description": "Manage LLM-powered semantic extractions."}
+    ]
+  }
+}
+</pdd-interface>
+
+% You are an expert Python engineer. Your goal is to create a PDD CLI command for managing LLM-powered extractions.
+
+% Role & Scope
+The `extract` command provides a CLI interface for the `llm_extractor` module, allowing users to refresh, list, and preview semantic extractions.
+
+% Requirements
+1. Implement `extract` command in `pdd/commands/extract.py` using `click`.
+2. Support subcommands:
+   - `refresh [PROMPT_FILE]`: Re-runs all `<extract>` tags found in the specified prompt file(s).
+   - `list`: Lists all cached extractions in `.pdd/extracts/` with metadata (source, query, size).
+   - `status`: Checks for staleness of all cached extractions relative to their source files.
+   - `preview PATH --query QUERY`: Performs a one-off extraction and displays it to stdout without caching.
+3. Integrate with `LLMExtractor` for extraction logic.
+4. Use `rich` for beautiful table outputs and status messages.
+5. Register the command in `pdd/commands/__init__.py` (Note: this is a manual integration point, but the module should be designed for it).
+6. Support `--force` to skip confirmation prompts where applicable.
+
+% Dependencies
+<llm_extractor_interface>
+# Expected interface for pdd.llm_extractor (yet to be generated)
+class LLMExtractor:
+    def refresh_extractions(self, prompt_path: str = None) -> list[dict]:
+        """Refreshes extractions. Returns list of updated extraction metadata."""
+        pass
+
+    def list_extractions(self) -> list[dict]:
+        """Returns list of all cached extractions with metadata."""
+        pass
+
+    def check_status(self) -> list[dict]:
+        """Checks staleness. Returns list of extraction status dicts."""
+        pass
+
+    def preview_extraction(self, path: str, query: str) -> str:
+        """Performs one-off extraction without caching."""
+        pass
+</llm_extractor_interface>
+
+% Deliverables
+- Code: `pdd/commands/extract.py`
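
The `status` subcommand depends on detecting when a cached extraction no longer matches its source file. Since `pdd.llm_extractor` is not yet generated, the following is only one possible approach, assuming each cache entry is a JSON file that stores a hash of the source. The entry layout, field names, and the functions `cache_key`, `save_entry`, and `is_stale` are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, used to detect source changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cache_key(source: str, query: str) -> str:
    """Stable key for a (source, query) pair (hypothetical scheme)."""
    return hashlib.sha256(f"{source}\n{query}".encode("utf-8")).hexdigest()[:16]


def save_entry(cache_dir: Path, source: Path, query: str, result: str) -> Path:
    """Write one cache entry as JSON and return its path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = {
        "source": str(source),
        "query": query,
        "source_hash": file_digest(source),
        "result": result,
    }
    out = cache_dir / f"{cache_key(str(source), query)}.json"
    out.write_text(json.dumps(entry, indent=2), encoding="utf-8")
    return out


def is_stale(entry_path: Path) -> bool:
    """An entry is stale if its source is gone or its content hash changed."""
    entry = json.loads(entry_path.read_text(encoding="utf-8"))
    source = Path(entry["source"])
    return not source.exists() or file_digest(source) != entry["source_hash"]
```

Hashing content rather than comparing modification times keeps the check stable across checkouts and file copies, at the cost of reading each source file on every `status` run.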