feat: selective includes and LLM-powered extraction (#190) by niti-go · Pull Request #552 · promptdriven/pdd

niti-go · 2026-02-23T06:05:42Z

Summary

This PR implements selective includes and LLM-powered extraction for token-efficient prompts, allowing users to extract specific parts of files (lines, Python classes/functions, Markdown sections) and use LLMs for semantic extraction with caching.

Closes #190

Changes Made

Prompts Modified

prompts/content_selector_python.prompt - Implements logic for selecting specific parts of files based on structural selectors (lines, Python AST, Markdown sections).
prompts/llm_extractor_python.prompt - Handles semantic extraction using LLMs.
prompts/extract_command_python.prompt - Provides the CLI interface for the extract command.

Documentation Updated

README.md - Added documentation for selective includes and the extract command.
docs/prompting_guide.md - Expanded the guide to include advanced usage of selective includes and LLM extraction.

Core Changes

pdd/preprocess.py - Updated to support the new selective include syntax and extraction logic.
pdd/commands/__init__.py - Registered the new extract command.
sync_order.sh - Added a script to synchronize prompts in the correct dependency order.

Review Checklist

Prompt syntax is valid
PDD conventions followed
Documentation is up to date

Next Steps After Merge

Regenerate code from modified prompts in dependency order:
```
./sync_order.sh
```
Or manually:
```
pdd sync extract_command
```

pdd sync llm_extractor
pdd sync content_selector

2. Run tests to verify functionality
3. Deploy if applicable

---
*Created by pdd change workflow*

gltanaka · 2026-02-24T21:41:01Z

thanks @niti-go, can you please fix the unit tests?

Copilot

Pull request overview

Adds “selective includes” and an LLM-backed <extract> mechanism to reduce prompt token usage, along with CLI/docs scaffolding and prompt modules that define the intended implementations.

Changes:

Extended pdd preprocessing to support attribute-based/self-closing <include .../> plus a new <extract> tag.
Added prompt modules specifying the ContentSelector, LLMExtractor, and pdd extract command behavior.
Updated docs/README and added a helper sync_order.sh script for syncing new prompt modules in dependency order.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`sync_order.sh`	New helper script to run `pdd sync` for the new modules in dependency order.
`pdd/prompts/content_selector_python.prompt`	Specifies deterministic selection (lines/AST/markdown/regex, interface mode, token budgeting).
`pdd/prompts/llm_extractor_python.prompt`	Specifies LLM extraction with persistent caching and staleness detection.
`pdd/prompts/extract_command_python.prompt`	Specifies the `pdd extract` CLI command (refresh/list/status/preview).
`pdd/preprocess.py`	Implements parsing for include attributes/selectors and adds `<extract>` tag handling.
`pdd/commands/__init__.py`	Registers the new `extract` command (currently guarded by try/except).
`docs/prompting_guide.md`	Documents selective includes, token budgeting, and `<extract>` semantics/ordering.
`README.md`	Adds `extract` to command list and documents `pdd extract` usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-24T21:46:24Z

pdd/preprocess.py

+    def replace_extract(match):
+        attrs_str = match.group('attrs') or ""
+        attrs = _parse_attrs(attrs_str)
+        query = match.group('query').strip()
+        file_path = attrs.get('path')
+        if file_path:
+            file_path = get_file_path(file_path)
+
+        if not file_path:
+            console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute")
+            return "[Error: <extract> tag missing 'path' attribute]"
+
+        if recursive:
+            return match.group(0)
+


process_extract_tags validates path (and prints an error/returns an error string) before checking recursive. In the recursive pass, <extract> tags are supposed to be deferred unchanged; as written, a missing path will be replaced with an error during the recursive pass and cannot be retried later. Move the if recursive: return match.group(0) check to the top of replace_extract before any parsing/validation.

Copilot · 2026-02-24T21:46:25Z

pdd/commands/__init__.py

+except ImportError:
+    extract = None


Catching broad ImportError when importing .extract can hide real import-time failures inside pdd/commands/extract.py (e.g., missing dependency) and silently omit the command from the CLI. Prefer catching ModuleNotFoundError for the specific module, or re-raising unexpected import errors so packaging/dep issues fail loudly.

Suggested change

except ImportError:

extract = None

except ModuleNotFoundError as exc:

expected_name = __name__.rsplit(".", 1)[0] + ".extract"

if getattr(exc, "name", None) == expected_name:

extract = None

else:

raise

Copilot · 2026-02-24T21:46:25Z

pdd/preprocess.py

@@ -277,6 +299,37 @@ def replace_include(match):
                console.print(f"Processing XML include: [cyan]{full_path}[/cyan]")
                with open(full_path, 'r', encoding='utf-8') as file:
                    content = file.read()
+
+                    # Apply selectors if any
+                    selectors_str = attrs.get('select')
+                    lines_str = attrs.get('lines')
+                    mode = attrs.get('mode', 'full')
+                    max_tokens = attrs.get('max_tokens')
+                    overflow = attrs.get('overflow', 'warn')
+
+                    if selectors_str or lines_str or mode != 'full' or max_tokens:
+                        selectors = []
+                        if selectors_str:
+                            selectors.extend([s.strip() for s in selectors_str.split(',')])
+                        if lines_str:
+                            selectors.append(f"lines:{lines_str}")
+
+                        try:
+                            from pdd.content_selector import ContentSelector
+                            selector = ContentSelector()
+                            content = selector.select(
+                                content=content,
+                                selectors=selectors,
+                                file_path=full_path,
+                                mode=mode,
+                                max_tokens=int(max_tokens) if max_tokens else None,
+                                overflow=overflow
+                            )
+                        except ImportError:
+                            console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
+                        except Exception as e:
+                            console.print(f"[bold red]Error in content selection:[/bold red] {e}")
+
                    if recursive:
                        child_seen = _seen | {resolved}
                        content = preprocess(content, recursive=True, double_curly_brackets=False, _seen=child_seen)
@@ -314,6 +367,41 @@ def replace_include_with_spans(match):
        iterations += 1
    return current_text

+def process_extract_tags(text: str, recursive: bool) -> str:
+    pattern = r'<extract(?P<attrs>\s+[^>]*?)?>(?P<query>.*?)</extract>'
+    def replace_extract(match):
+        attrs_str = match.group('attrs') or ""
+        attrs = _parse_attrs(attrs_str)
+        query = match.group('query').strip()
+        file_path = attrs.get('path')
+        if file_path:
+            file_path = get_file_path(file_path)
+
+        if not file_path:
+            console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute")
+            return "[Error: <extract> tag missing 'path' attribute]"
+
+        if recursive:
+            return match.group(0)
+
+        try:
+            from pdd.llm_extractor import LLMExtractor
+            extractor = LLMExtractor()
+            return extractor.extract(file_path=file_path, query=query)
+        except ImportError:
+            console.print("[yellow]Warning: pdd.llm_extractor not found. Cannot perform semantic extraction.[/yellow]")
+            return f"[Error: pdd.llm_extractor not found. Cannot extract from {file_path}]"
+        except Exception as e:
+            console.print(f"[bold red]Error in semantic extraction:[/bold red] {e}")
+            return f"[Error in semantic extraction from {file_path}: {e}]"
+
+    code_spans = _extract_code_spans(text)
+    def replace_extract_with_spans(match):
+        if _intersects_any_span(match.start(), match.end(), code_spans):
+            return match.group(0)
+        return replace_extract(match)
+    return re.sub(pattern, replace_extract_with_spans, text, flags=re.DOTALL)
+


New behaviors in preprocess.py (self-closing <include .../> / attribute-based includes and <extract> deferral/execution) don’t appear to be covered by existing preprocess tests. Add unit tests alongside tests/test_preprocess.py to ensure both old and new <include> syntaxes work and that <extract> is deferred on recursive=True and executed/placeholdered on the second pass.

Copilot · 2026-02-24T21:46:25Z

README.md

+```bash
+# Refresh extractions for a specific prompt
+pdd extract refresh prompts/my_module.prompt
+
+# List all cached extractions
+pdd extract list
+
+# Show status of extractions (stale/fresh)
+pdd extract status
+
+# Preview extraction without caching
+pdd extract preview docs/large_api.md --query "auth flow"
+```


The PR description’s “Next Steps” uses pdd extract --refresh ..., but the README examples here use pdd extract refresh ... (subcommand). Please make the PR description and docs consistent with the actual CLI syntax to avoid user confusion.

Copilot · 2026-02-24T21:46:25Z

pdd/preprocess.py

+        file_path = attrs.get('path')
+        if file_path:
+            file_path = get_file_path(file_path) or match.group('content') or ""


process_include_tags currently breaks the legacy <include>path</include> form: when no path attribute is present, file_path stays None and file_path.strip() will raise. Also get_file_path() is being called before the file is opened, and then called again later, which can lead to duplicated ././... paths. Parse file_path as attrs.get('path') or match.group('content'), and only resolve it once (right before opening) to preserve existing behavior.

Suggested change

file_path = attrs.get('path')

if file_path:

file_path = get_file_path(file_path) or match.group('content') or ""

# Support both attribute and content-based include paths

file_path = attrs.get('path') or match.group('content') or ""

Copilot · 2026-02-24T21:46:25Z

pdd/preprocess.py

+                            console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
+                        except Exception as e:
+                            console.print(f"[bold red]Error in content selection:[/bold red] {e}")


When selectors/limits are requested (select, lines, mode, max_tokens) but pdd.content_selector is missing or selection errors, the code falls back to including full content. That can silently defeat token-budgeting and change prompt semantics. Consider failing fast (raise/return a visible placeholder) when selection was explicitly requested, rather than including the entire file.

Suggested change

console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")

except Exception as e:

console.print(f"[bold red]Error in content selection:[/bold red] {e}")

console.print("[yellow]Warning: pdd.content_selector not found.[/yellow]")

# When selectors/limits are requested but the content selector

# is unavailable, avoid silently including full content.

# First pass (recursive=True): leave the tag so a later run might resolve it

# Second pass (recursive=False): replace with a visible placeholder

return match.group(0) if recursive else f"[Content selector unavailable for: {file_path}]"

except Exception as e:

console.print(f"[bold red]Error in content selection:[/bold red] {e}")

# On selection errors, do not fall back to full content.

# Follow the same recursive/placeholder pattern as for missing files.

return match.group(0) if recursive else f"[Content selection error for: {file_path}]"

niti-go · 2026-02-25T00:27:54Z

This PR isn't ready yet! It was created by pdd change. I still have to iterate on it. I'm marking it as a draft.

feat: selective includes and LLM-powered extraction (promptdriven#190)

74f7056

niti-go mentioned this pull request Feb 23, 2026

Feature: Selective includes and LLM-powered extraction for token-efficient prompts #190

Open

niti-go added 2 commits February 23, 2026 10:39

Merge branch 'promptdriven:main' into change/issue-190

9d717a1

Merge branch 'promptdriven:main' into change/issue-190

94e5a97

gltanaka requested a review from Copilot February 24, 2026 21:41

Copilot started reviewing on behalf of gltanaka February 24, 2026 21:41 View session

Copilot AI reviewed Feb 24, 2026

View reviewed changes

niti-go marked this pull request as draft February 25, 2026 00:28

Merge branch 'promptdriven:main' into change/issue-190

3b1265b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: selective includes and LLM-powered extraction (#190)#552

feat: selective includes and LLM-powered extraction (#190)#552
niti-go wants to merge 4 commits intopromptdriven:mainfrom
niti-go:change/issue-190

niti-go commented Feb 23, 2026

Uh oh!

gltanaka commented Feb 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

niti-go commented Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-                            console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
-                        except Exception as e:
-                            console.print(f"[bold red]Error in content selection:[/bold red] {e}")
+                            console.print("[yellow]Warning: pdd.content_selector not found.[/yellow]")
+                            # When selectors/limits are requested but the content selector
+                            # is unavailable, avoid silently including full content.
+                            # First pass (recursive=True): leave the tag so a later run might resolve it
+                            # Second pass (recursive=False): replace with a visible placeholder
+                            return match.group(0) if recursive else f"[Content selector unavailable for: {file_path}]"
+                        except Exception as e:
+                            console.print(f"[bold red]Error in content selection:[/bold red] {e}")
+                            # On selection errors, do not fall back to full content.
+                            # Follow the same recursive/placeholder pattern as for missing files.
+                            return match.group(0) if recursive else f"[Content selection error for: {file_path}]"

Conversation

niti-go commented Feb 23, 2026

Summary

Changes Made

Prompts Modified

Documentation Updated

Core Changes

Review Checklist

Next Steps After Merge

Uh oh!

gltanaka commented Feb 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

niti-go commented Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants