Skip to content

feat: selective includes and LLM-powered extraction (#190)#552

Draft
niti-go wants to merge 4 commits intopromptdriven:mainfrom
niti-go:change/issue-190
Draft

feat: selective includes and LLM-powered extraction (#190)#552
niti-go wants to merge 4 commits intopromptdriven:mainfrom
niti-go:change/issue-190

Conversation

@niti-go
Copy link
Contributor

@niti-go niti-go commented Feb 23, 2026

Summary

This PR implements selective includes and LLM-powered extraction for token-efficient prompts, allowing users to extract specific parts of files (lines, Python classes/functions, Markdown sections) and use LLMs for semantic extraction with caching.

Closes #190

Changes Made

Prompts Modified

  • prompts/content_selector_python.prompt - Implements logic for selecting specific parts of files based on structural selectors (lines, Python AST, Markdown sections).
  • prompts/llm_extractor_python.prompt - Handles semantic extraction using LLMs.
  • prompts/extract_command_python.prompt - Provides the CLI interface for the extract command.

Documentation Updated

  • README.md - Added documentation for selective includes and the extract command.
  • docs/prompting_guide.md - Expanded the guide to include advanced usage of selective includes and LLM extraction.

Core Changes

  • pdd/preprocess.py - Updated to support the new selective include syntax and extraction logic.
  • pdd/commands/__init__.py - Registered the new extract command.
  • sync_order.sh - Added a script to synchronize prompts in the correct dependency order.

Review Checklist

  • Prompt syntax is valid
  • PDD conventions followed
  • Documentation is up to date

Next Steps After Merge

  1. Regenerate code from modified prompts in dependency order:
    ./sync_order.sh
    Or manually:
    pdd sync extract_command
    

pdd sync llm_extractor
pdd sync content_selector

2. Run tests to verify functionality
3. Deploy if applicable

---
*Created by pdd change workflow*

@gltanaka
Copy link
Contributor

thanks @niti-go, can you please fix the unit tests?

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds “selective includes” and an LLM-backed <extract> mechanism to reduce prompt token usage, along with CLI/docs scaffolding and prompt modules that define the intended implementations.

Changes:

  • Extended pdd preprocessing to support attribute-based/self-closing <include .../> plus a new <extract> tag.
  • Added prompt modules specifying the ContentSelector, LLMExtractor, and pdd extract command behavior.
  • Updated docs/README and added a helper sync_order.sh script for syncing new prompt modules in dependency order.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sync_order.sh New helper script to run pdd sync for the new modules in dependency order.
pdd/prompts/content_selector_python.prompt Specifies deterministic selection (lines/AST/markdown/regex, interface mode, token budgeting).
pdd/prompts/llm_extractor_python.prompt Specifies LLM extraction with persistent caching and staleness detection.
pdd/prompts/extract_command_python.prompt Specifies the pdd extract CLI command (refresh/list/status/preview).
pdd/preprocess.py Implements parsing for include attributes/selectors and adds <extract> tag handling.
pdd/commands/__init__.py Registers the new extract command (currently guarded by try/except).
docs/prompting_guide.md Documents selective includes, token budgeting, and <extract> semantics/ordering.
README.md Adds extract to command list and documents pdd extract usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +372 to +386
def replace_extract(match):
attrs_str = match.group('attrs') or ""
attrs = _parse_attrs(attrs_str)
query = match.group('query').strip()
file_path = attrs.get('path')
if file_path:
file_path = get_file_path(file_path)

if not file_path:
console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute")
return "[Error: <extract> tag missing 'path' attribute]"

if recursive:
return match.group(0)

Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_extract_tags validates path (and prints an error/returns an error string) before checking recursive. In the recursive pass, <extract> tags are supposed to be deferred unchanged; as written, a missing path will be replaced with an error during the recursive pass and cannot be retried later. Move the if recursive: return match.group(0) check to the top of replace_extract before any parsing/validation.

Copilot uses AI. Check for mistakes.
Comment on lines +17 to +18
except ImportError:
extract = None
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching broad ImportError when importing .extract can hide real import-time failures inside pdd/commands/extract.py (e.g., missing dependency) and silently omit the command from the CLI. Prefer catching ModuleNotFoundError for the specific module, or re-raising unexpected import errors so packaging/dep issues fail loudly.

Suggested change
except ImportError:
extract = None
except ModuleNotFoundError as exc:
expected_name = __name__.rsplit(".", 1)[0] + ".extract"
if getattr(exc, "name", None) == expected_name:
extract = None
else:
raise

Copilot uses AI. Check for mistakes.
Comment on lines 235 to 404
@@ -277,6 +299,37 @@ def replace_include(match):
console.print(f"Processing XML include: [cyan]{full_path}[/cyan]")
with open(full_path, 'r', encoding='utf-8') as file:
content = file.read()

# Apply selectors if any
selectors_str = attrs.get('select')
lines_str = attrs.get('lines')
mode = attrs.get('mode', 'full')
max_tokens = attrs.get('max_tokens')
overflow = attrs.get('overflow', 'warn')

if selectors_str or lines_str or mode != 'full' or max_tokens:
selectors = []
if selectors_str:
selectors.extend([s.strip() for s in selectors_str.split(',')])
if lines_str:
selectors.append(f"lines:{lines_str}")

try:
from pdd.content_selector import ContentSelector
selector = ContentSelector()
content = selector.select(
content=content,
selectors=selectors,
file_path=full_path,
mode=mode,
max_tokens=int(max_tokens) if max_tokens else None,
overflow=overflow
)
except ImportError:
console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
except Exception as e:
console.print(f"[bold red]Error in content selection:[/bold red] {e}")

if recursive:
child_seen = _seen | {resolved}
content = preprocess(content, recursive=True, double_curly_brackets=False, _seen=child_seen)
@@ -314,6 +367,41 @@ def replace_include_with_spans(match):
iterations += 1
return current_text

def process_extract_tags(text: str, recursive: bool) -> str:
pattern = r'<extract(?P<attrs>\s+[^>]*?)?>(?P<query>.*?)</extract>'
def replace_extract(match):
attrs_str = match.group('attrs') or ""
attrs = _parse_attrs(attrs_str)
query = match.group('query').strip()
file_path = attrs.get('path')
if file_path:
file_path = get_file_path(file_path)

if not file_path:
console.print("[bold red]Error:[/bold red] <extract> tag missing 'path' attribute")
return "[Error: <extract> tag missing 'path' attribute]"

if recursive:
return match.group(0)

try:
from pdd.llm_extractor import LLMExtractor
extractor = LLMExtractor()
return extractor.extract(file_path=file_path, query=query)
except ImportError:
console.print("[yellow]Warning: pdd.llm_extractor not found. Cannot perform semantic extraction.[/yellow]")
return f"[Error: pdd.llm_extractor not found. Cannot extract from {file_path}]"
except Exception as e:
console.print(f"[bold red]Error in semantic extraction:[/bold red] {e}")
return f"[Error in semantic extraction from {file_path}: {e}]"

code_spans = _extract_code_spans(text)
def replace_extract_with_spans(match):
if _intersects_any_span(match.start(), match.end(), code_spans):
return match.group(0)
return replace_extract(match)
return re.sub(pattern, replace_extract_with_spans, text, flags=re.DOTALL)

Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behaviors in preprocess.py (self-closing <include .../> / attribute-based includes and <extract> deferral/execution) don’t appear to be covered by existing preprocess tests. Add unit tests alongside tests/test_preprocess.py to ensure both old and new <include> syntaxes work and that <extract> is deferred on recursive=True and executed/placeholdered on the second pass.

Copilot uses AI. Check for mistakes.
Comment on lines +2544 to +2556
```bash
# Refresh extractions for a specific prompt
pdd extract refresh prompts/my_module.prompt

# List all cached extractions
pdd extract list

# Show status of extractions (stale/fresh)
pdd extract status

# Preview extraction without caching
pdd extract preview docs/large_api.md --query "auth flow"
```
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description’s “Next Steps” uses pdd extract --refresh ..., but the README examples here use pdd extract refresh ... (subcommand). Please make the PR description and docs consistent with the actual CLI syntax to avoid user confusion.

Copilot uses AI. Check for mistakes.
Comment on lines +245 to +247
file_path = attrs.get('path')
if file_path:
file_path = get_file_path(file_path) or match.group('content') or ""
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_include_tags currently breaks the legacy <include>path</include> form: when no path attribute is present, file_path stays None and file_path.strip() will raise. Also get_file_path() is being called before the file is opened, and then called again later, which can lead to duplicated ././... paths. Parse file_path as attrs.get('path') or match.group('content'), and only resolve it once (right before opening) to preserve existing behavior.

Suggested change
file_path = attrs.get('path')
if file_path:
file_path = get_file_path(file_path) or match.group('content') or ""
# Support both attribute and content-based include paths
file_path = attrs.get('path') or match.group('content') or ""

Copilot uses AI. Check for mistakes.
Comment on lines +329 to +331
console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
except Exception as e:
console.print(f"[bold red]Error in content selection:[/bold red] {e}")
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When selectors/limits are requested (select, lines, mode, max_tokens) but pdd.content_selector is missing or selection errors, the code falls back to including full content. That can silently defeat token-budgeting and change prompt semantics. Consider failing fast (raise/return a visible placeholder) when selection was explicitly requested, rather than including the entire file.

Suggested change
console.print("[yellow]Warning: pdd.content_selector not found. Including full content.[/yellow]")
except Exception as e:
console.print(f"[bold red]Error in content selection:[/bold red] {e}")
console.print("[yellow]Warning: pdd.content_selector not found.[/yellow]")
# When selectors/limits are requested but the content selector
# is unavailable, avoid silently including full content.
# First pass (recursive=True): leave the tag so a later run might resolve it
# Second pass (recursive=False): replace with a visible placeholder
return match.group(0) if recursive else f"[Content selector unavailable for: {file_path}]"
except Exception as e:
console.print(f"[bold red]Error in content selection:[/bold red] {e}")
# On selection errors, do not fall back to full content.
# Follow the same recursive/placeholder pattern as for missing files.
return match.group(0) if recursive else f"[Content selection error for: {file_path}]"

Copilot uses AI. Check for mistakes.
@niti-go
Copy link
Contributor Author

niti-go commented Feb 25, 2026

This PR isn't ready yet! It was created by pdd change. I still have to iterate on it. I'm marking it as a draft.

@niti-go niti-go marked this pull request as draft February 25, 2026 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Selective includes and LLM-powered extraction for token-efficient prompts

3 participants