feat: selective includes and LLM-powered extraction (#190)#552
feat: selective includes and LLM-powered extraction (#190)#552niti-go wants to merge 4 commits intopromptdriven:mainfrom
Conversation
|
thanks @niti-go, can you please fix the unit tests? |
There was a problem hiding this comment.
Pull request overview
Adds “selective includes” and an LLM-backed <extract> mechanism to reduce prompt token usage, along with CLI/docs scaffolding and prompt modules that define the intended implementations.
Changes:
- Extended
pddpreprocessing to support attribute-based/self-closing<include .../>plus a new<extract>tag. - Added prompt modules specifying the
ContentSelector,LLMExtractor, andpdd extractcommand behavior. - Updated docs/README and added a helper
sync_order.shscript for syncing new prompt modules in dependency order.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
sync_order.sh |
New helper script to run pdd sync for the new modules in dependency order. |
pdd/prompts/content_selector_python.prompt |
Specifies deterministic selection (lines/AST/markdown/regex, interface mode, token budgeting). |
pdd/prompts/llm_extractor_python.prompt |
Specifies LLM extraction with persistent caching and staleness detection. |
pdd/prompts/extract_command_python.prompt |
Specifies the pdd extract CLI command (refresh/list/status/preview). |
pdd/preprocess.py |
Implements parsing for include attributes/selectors and adds <extract> tag handling. |
pdd/commands/__init__.py |
Registers the new extract command (currently guarded by try/except). |
docs/prompting_guide.md |
Documents selective includes, token budgeting, and <extract> semantics/ordering. |
README.md |
Adds extract to command list and documents pdd extract usage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def replace_extract(match): | ||
| attrs_str = match.group('attrs') or "" | ||
| attrs = _parse_attrs(attrs_str) | ||
| query = match.group('query').strip() | ||
| file_path = attrs.get('path') | ||
| if file_path: | ||
| file_path = get_file_path(file_path) | ||
|
|
||
| if not file_path: | ||
| console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute") | ||
| return "[Error: <extract> tag missing 'path' attribute]" | ||
|
|
||
| if recursive: | ||
| return match.group(0) | ||
|
|
There was a problem hiding this comment.
process_extract_tags validates path (and prints an error/returns an error string) before checking recursive. In the recursive pass, <extract> tags are supposed to be deferred unchanged; as written, a missing path will be replaced with an error during the recursive pass and cannot be retried later. Move the if recursive: return match.group(0) check to the top of replace_extract before any parsing/validation.
| except ImportError: | ||
| extract = None |
There was a problem hiding this comment.
Catching broad ImportError when importing .extract can hide real import-time failures inside pdd/commands/extract.py (e.g., missing dependency) and silently omit the command from the CLI. Prefer catching ModuleNotFoundError for the specific module, or re-raising unexpected import errors so packaging/dep issues fail loudly.
| except ImportError: | |
| extract = None | |
| except ModuleNotFoundError as exc: | |
| expected_name = __name__.rsplit(".", 1)[0] + ".extract" | |
| if getattr(exc, "name", None) == expected_name: | |
| extract = None | |
| else: | |
| raise |
| @@ -277,6 +299,37 @@ def replace_include(match): | |||
| console.print(f"Processing XML include: [cyan]{full_path}[/cyan]") | |||
| with open(full_path, 'r', encoding='utf-8') as file: | |||
| content = file.read() | |||
|
|
|||
| # Apply selectors if any | |||
| selectors_str = attrs.get('select') | |||
| lines_str = attrs.get('lines') | |||
| mode = attrs.get('mode', 'full') | |||
| max_tokens = attrs.get('max_tokens') | |||
| overflow = attrs.get('overflow', 'warn') | |||
|
|
|||
| if selectors_str or lines_str or mode != 'full' or max_tokens: | |||
| selectors = [] | |||
| if selectors_str: | |||
| selectors.extend([s.strip() for s in selectors_str.split(',')]) | |||
| if lines_str: | |||
| selectors.append(f"lines:{lines_str}") | |||
|
|
|||
| try: | |||
| from pdd.content_selector import ContentSelector | |||
| selector = ContentSelector() | |||
| content = selector.select( | |||
| content=content, | |||
| selectors=selectors, | |||
| file_path=full_path, | |||
| mode=mode, | |||
| max_tokens=int(max_tokens) if max_tokens else None, | |||
| overflow=overflow | |||
| ) | |||
| except ImportError: | |||
| console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]") | |||
| except Exception as e: | |||
| console.print(f"[bold red]Error in content selection:[/bold red] {e}") | |||
|
|
|||
| if recursive: | |||
| child_seen = _seen | {resolved} | |||
| content = preprocess(content, recursive=True, double_curly_brackets=False, _seen=child_seen) | |||
| @@ -314,6 +367,41 @@ def replace_include_with_spans(match): | |||
| iterations += 1 | |||
| return current_text | |||
|
|
|||
| def process_extract_tags(text: str, recursive: bool) -> str: | |||
| pattern = r'<extract(?P<attrs>\s+[^>]*?)?>(?P<query>.*?)</extract>' | |||
| def replace_extract(match): | |||
| attrs_str = match.group('attrs') or "" | |||
| attrs = _parse_attrs(attrs_str) | |||
| query = match.group('query').strip() | |||
| file_path = attrs.get('path') | |||
| if file_path: | |||
| file_path = get_file_path(file_path) | |||
|
|
|||
| if not file_path: | |||
| console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute") | |||
| return "[Error: <extract> tag missing 'path' attribute]" | |||
|
|
|||
| if recursive: | |||
| return match.group(0) | |||
|
|
|||
| try: | |||
| from pdd.llm_extractor import LLMExtractor | |||
| extractor = LLMExtractor() | |||
| return extractor.extract(file_path=file_path, query=query) | |||
| except ImportError: | |||
| console.print("[yellow]Warning: pdd.llm_extractor not found. Cannot perform semantic extraction.[/yellow]") | |||
| return f"[Error: pdd.llm_extractor not found. Cannot extract from {file_path}]" | |||
| except Exception as e: | |||
| console.print(f"[bold red]Error in semantic extraction:[/bold red] {e}") | |||
| return f"[Error in semantic extraction from {file_path}: {e}]" | |||
|
|
|||
| code_spans = _extract_code_spans(text) | |||
| def replace_extract_with_spans(match): | |||
| if _intersects_any_span(match.start(), match.end(), code_spans): | |||
| return match.group(0) | |||
| return replace_extract(match) | |||
| return re.sub(pattern, replace_extract_with_spans, text, flags=re.DOTALL) | |||
|
|
|||
There was a problem hiding this comment.
New behaviors in preprocess.py (self-closing <include .../> / attribute-based includes and <extract> deferral/execution) don’t appear to be covered by existing preprocess tests. Add unit tests alongside tests/test_preprocess.py to ensure both old and new <include> syntaxes work and that <extract> is deferred on recursive=True and executed/placeholdered on the second pass.
| ```bash | ||
| # Refresh extractions for a specific prompt | ||
| pdd extract refresh prompts/my_module.prompt | ||
|
|
||
| # List all cached extractions | ||
| pdd extract list | ||
|
|
||
| # Show status of extractions (stale/fresh) | ||
| pdd extract status | ||
|
|
||
| # Preview extraction without caching | ||
| pdd extract preview docs/large_api.md --query "auth flow" | ||
| ``` |
There was a problem hiding this comment.
The PR description’s “Next Steps” uses pdd extract --refresh ..., but the README examples here use pdd extract refresh ... (subcommand). Please make the PR description and docs consistent with the actual CLI syntax to avoid user confusion.
| file_path = attrs.get('path') | ||
| if file_path: | ||
| file_path = get_file_path(file_path) or match.group('content') or "" |
There was a problem hiding this comment.
process_include_tags currently breaks the legacy <include>path</include> form: when no path attribute is present, file_path stays None and file_path.strip() will raise. Also get_file_path() is being called before the file is opened, and then called again later, which can lead to duplicated ././... paths. Parse file_path as attrs.get('path') or match.group('content'), and only resolve it once (right before opening) to preserve existing behavior.
| file_path = attrs.get('path') | |
| if file_path: | |
| file_path = get_file_path(file_path) or match.group('content') or "" | |
| # Support both attribute and content-based include paths | |
| file_path = attrs.get('path') or match.group('content') or "" |
| console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]") | ||
| except Exception as e: | ||
| console.print(f"[bold red]Error in content selection:[/bold red] {e}") |
There was a problem hiding this comment.
When selectors/limits are requested (select, lines, mode, max_tokens) but pdd.content_selector is missing or selection errors, the code falls back to including full content. That can silently defeat token-budgeting and change prompt semantics. Consider failing fast (raise/return a visible placeholder) when selection was explicitly requested, rather than including the entire file.
| console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]") | |
| except Exception as e: | |
| console.print(f"[bold red]Error in content selection:[/bold red] {e}") | |
| console.print("[yellow]Warning: pdd.content_selector not found.[/yellow]") | |
| # When selectors/limits are requested but the content selector | |
| # is unavailable, avoid silently including full content. | |
| # First pass (recursive=True): leave the tag so a later run might resolve it | |
| # Second pass (recursive=False): replace with a visible placeholder | |
| return match.group(0) if recursive else f"[Content selector unavailable for: {file_path}]" | |
| except Exception as e: | |
| console.print(f"[bold red]Error in content selection:[/bold red] {e}") | |
| # On selection errors, do not fall back to full content. | |
| # Follow the same recursive/placeholder pattern as for missing files. | |
| return match.group(0) if recursive else f"[Content selection error for: {file_path}]" |
|
This PR isn't ready yet! It was created by pdd change. I still have to iterate on it. I'm marking it as a draft. |
Summary
This PR implements selective includes and LLM-powered extraction for token-efficient prompts, allowing users to extract specific parts of files (lines, Python classes/functions, Markdown sections) and use LLMs for semantic extraction with caching.
Closes #190
Changes Made
Prompts Modified
prompts/content_selector_python.prompt- Implements logic for selecting specific parts of files based on structural selectors (lines, Python AST, Markdown sections).prompts/llm_extractor_python.prompt- Handles semantic extraction using LLMs.prompts/extract_command_python.prompt- Provides the CLI interface for theextractcommand.Documentation Updated
README.md- Added documentation for selective includes and theextractcommand.docs/prompting_guide.md- Expanded the guide to include advanced usage of selective includes and LLM extraction.Core Changes
pdd/preprocess.py- Updated to support the new selective include syntax and extraction logic.pdd/commands/__init__.py- Registered the newextractcommand.sync_order.sh- Added a script to synchronize prompts in the correct dependency order.Review Checklist
Next Steps After Merge
pdd sync llm_extractor
pdd sync content_selector