Commit 329928e

docs: add missing GitHub issue templates and reorder
Added three missing enhancement issue templates:
- #6 Extract Pages method (simpler alternative to split_pdf)
- #10 Convert to Office Formats (DOCX, XLSX, PPTX export)
- #13 Batch Processing (client-side bulk operations)

Reordered existing templates to maintain logical sequence. All 13 enhancements now have corresponding issue templates.
1 parent d3afe37 commit 329928e

8 files changed: +312 -0 lines changed

github_issues/06_extract_pages.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Feature: Extract Page Range Method

## Summary
Implement `extract_pages()` as a simpler alternative to `split_pdf()` for extracting a single contiguous range of pages.

## Proposed Implementation
```python
def extract_pages(
    self,
    input_file: FileInput,
    start_page: int,
    end_page: Optional[int] = None,  # None means extract through the last page
    output_path: Optional[str] = None,
) -> Optional[bytes]:
    ...  # Body omitted; this template only proposes the signature
```

## Benefits
- Simpler API than split_pdf for common use case
- More intuitive for single range extraction
- Clear intent and usage
- Memory efficient for large documents

## Implementation Details
- Use Build API with single FilePart and page range
- Support negative indexing (-1 for last page)
- Handle "to end" extraction with None
- Clear error messages for invalid ranges
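
The range handling listed above could be sketched roughly as follows. This is only an illustration: `resolve_page_range` is a hypothetical helper, and it assumes the page count is known up front, that pages are 0-indexed, that `end_page` is exclusive (as the use case examples in this template suggest), and that negative indexes count back from the end as in Python slices.

```python
from typing import Optional, Tuple


def resolve_page_range(
    start_page: int, end_page: Optional[int], page_count: int
) -> Tuple[int, int]:
    """Return a concrete (start, end) pair: 0-based start, exclusive end.

    Hypothetical helper; the semantics are assumptions, not the final API.
    """
    if page_count <= 0:
        raise ValueError("Document has no pages")

    # Negative indexes count back from the end: -1 is the last page.
    start = start_page + page_count if start_page < 0 else start_page
    if end_page is None:
        end = page_count  # None means "through the last page"
    else:
        end = end_page + page_count if end_page < 0 else end_page

    if not 0 <= start < page_count:
        raise ValueError(
            f"start_page {start_page} is out of bounds for {page_count} pages"
        )
    if not 0 < end <= page_count:
        raise ValueError(
            f"end_page {end_page} is out of bounds for {page_count} pages"
        )
    if start >= end:
        raise ValueError(
            f"Invalid range: start_page {start_page} must come before "
            f"end_page {end_page}"
        )
    return start, end
```

Raising `ValueError` with the offending values in the message is one way to meet the "clear error messages" requirement; a dedicated exception type would work equally well.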

## Testing Requirements
- [ ] Test single page extraction
- [ ] Test range extraction
- [ ] Test "to end" extraction (end_page=None)
- [ ] Test negative page indexes
- [ ] Test invalid ranges (start > end)
- [ ] Test out of bounds pages

## OpenAPI Reference
- Uses FilePart with `pages` parameter
- Page ranges use start/end format
- Build API with single part
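
As a rough sketch of what "Build API with single part" might look like on the client side, the helper below assembles an instructions payload with one part and a `pages` object in start/end form. The key names `parts` and `file`, the placeholder value `"file"`, and the helper name `build_extract_instructions` are assumptions for illustration; the actual field names, and whether `end` is inclusive, need to be checked against the OpenAPI specification.

```python
from typing import Any, Dict, Optional


def build_extract_instructions(
    start_page: int, end_page: Optional[int] = None
) -> Dict[str, Any]:
    """Assemble a single-part payload with a start/end page range.

    Hypothetical payload shape; verify field names against the OpenAPI spec.
    """
    pages: Dict[str, int] = {"start": start_page}
    if end_page is not None:
        # Omitting "end" is assumed to mean "through the last page".
        pages["end"] = end_page
    return {"parts": [{"file": "file", "pages": pages}]}
```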

## Use Case Example
```python
# Extract the first 10 pages (pages are 0-indexed; end_page is exclusive)
first_chapter = client.extract_pages(
    "book.pdf",
    start_page=0,
    end_page=10
)

# Extract from page index 50 to the end
appendix = client.extract_pages(
    "book.pdf",
    start_page=50
    # end_page=None means to end
)

# Extract a single page
cover = client.extract_pages(
    "book.pdf",
    start_page=0,
    end_page=1
)
```

## Relationship to split_pdf
- `split_pdf`: Multiple ranges, multiple outputs
- `extract_pages`: Single range, single output
- This method is essentially `split_pdf` with a single range

## Priority
🟢 Priority 2 - Core missing method

## Labels
- feature
- pdf-manipulation
- pages
- openapi-compliance
File renamed without changes.
File renamed without changes.
File renamed without changes.
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Feature: Convert to Office Formats Method

## Summary
Implement `convert_to_office()` to export PDFs to Microsoft Office formats (DOCX, XLSX, PPTX).

## Proposed Implementation
```python
def convert_to_office(
    self,
    input_file: FileInput,
    output_path: Optional[str] = None,
    format: Literal["docx", "xlsx", "pptx"] = "docx",
    ocr_language: Optional[Union[str, List[str]]] = None,  # Auto-OCR if needed
) -> Optional[bytes]:
    ...  # Body omitted; this template only proposes the signature
```

## Benefits
- Edit PDFs in familiar Office applications
- Preserve formatting and layout where possible
- Automatic OCR for scanned documents
- Workflow integration with Office 365
- Accessibility improvements

## Implementation Details
- Use Build API with output type: `docx`, `xlsx`, or `pptx`
- Automatic format detection based on content
- OCR integration for scanned PDFs
- Handle complex layouts gracefully
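
A minimal sketch of how the output type and optional OCR step might be turned into a request follows. The payload shape (`output`, `actions`, and the `ocr` action with a `language` list) and the helper name `build_office_request` are assumptions made for illustration, not confirmed Build API fields. The parameter is named `output_format` here only to avoid shadowing Python's built-in `format`.

```python
from typing import Any, Dict, List, Optional, Union

SUPPORTED_FORMATS = ("docx", "xlsx", "pptx")


def build_office_request(
    output_format: str = "docx",
    ocr_language: Optional[Union[str, List[str]]] = None,
) -> Dict[str, Any]:
    """Build a hypothetical request body for an Office conversion."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported format {output_format!r}; "
            f"expected one of {', '.join(SUPPORTED_FORMATS)}"
        )

    request: Dict[str, Any] = {"output": {"type": output_format}}
    if ocr_language is not None:
        # Accept a single language or a list and always send a list.
        languages = (
            [ocr_language] if isinstance(ocr_language, str) else list(ocr_language)
        )
        request["actions"] = [{"type": "ocr", "language": languages}]
    return request
```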

## Testing Requirements
- [ ] Test DOCX conversion (text documents)
- [ ] Test XLSX conversion (tables/data)
- [ ] Test PPTX conversion (presentations)
- [ ] Test with scanned documents (OCR)
- [ ] Test formatting preservation
- [ ] Test with complex layouts
- [ ] Test with forms and tables

## OpenAPI Reference
- Output types: `docx`, `xlsx`, `pptx`
- Part of BuildOutput options
- Supports OCR language parameter

## Use Case Example
```python
# Convert PDF to Word for editing
word_doc = client.convert_to_office(
    "report.pdf",
    format="docx",
    output_path="report.docx"
)

# Convert scanned document with OCR
editable_doc = client.convert_to_office(
    "scanned_contract.pdf",
    format="docx",
    ocr_language=["english", "spanish"]
)

# Convert data PDF to Excel
spreadsheet = client.convert_to_office(
    "financial_data.pdf",
    format="xlsx",
    output_path="data.xlsx"
)

# Convert to PowerPoint
presentation = client.convert_to_office(
    "slides.pdf",
    format="pptx"
)
```

## Format Selection Guide
- **DOCX**: Text-heavy documents, reports, contracts
- **XLSX**: Data tables, financial reports, lists
- **PPTX**: Presentations, slide decks

## Known Limitations
- Complex layouts may not convert perfectly
- Some PDF features have no Office equivalent
- Font substitution may occur
- Interactive elements may be lost

## Priority
🟡 Priority 3 - Format conversion method

## Labels
- feature
- conversion
- office
- docx
- xlsx
- pptx
- openapi-compliance
File renamed without changes.
File renamed without changes.
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Feature: Batch Processing Method

## Summary
Implement `batch_process()` to apply the same sequence of operations to multiple files efficiently.

## Proposed Implementation
```python
def batch_process(
    self,
    input_files: List[FileInput],
    operations: List[Dict[str, Any]],  # List of operations to apply
    output_dir: Optional[str] = None,
    output_format: str = "{name}_{index}{ext}",  # Naming pattern
    parallel: bool = True,
    max_workers: int = 4,
    continue_on_error: bool = True,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> BatchResult:
    ...  # Body omitted; this template only proposes the signature
```
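
The `output_format` naming pattern could be expanded with `str.format`, as in the sketch below. The placeholders `{name}`, `{index}`, and `{ext}` come from the proposed signature; reading them as the file stem, the file's position in the batch, and the suffix including the dot is an assumption, and `render_output_name` is a hypothetical helper.

```python
from pathlib import Path


def render_output_name(pattern: str, input_path: str, index: int) -> str:
    """Expand a naming pattern for one input file (hypothetical helper)."""
    path = Path(input_path)
    # stem is the file name without its suffix; suffix keeps the leading dot.
    return pattern.format(name=path.stem, index=index, ext=path.suffix)


print(render_output_name("{name}_{index}{ext}", "documents/report.pdf", 0))
# prints "report_0.pdf"
```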

## Benefits
- Process hundreds of files efficiently
- Parallel processing for performance
- Consistent operations across files
- Progress tracking and reporting
- Error recovery and partial results
- Memory-efficient streaming

## Implementation Details
- Client-side enhancement (not in OpenAPI)
- Use ThreadPoolExecutor for parallel processing
- Implement retry logic for transient failures
- Stream results to avoid memory issues
- Provide detailed error reporting
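
One possible shape for the parallel core, using `ThreadPoolExecutor` from the standard library, is sketched below. `run_batch` and `process_one` are hypothetical names: `process_one` stands in for whatever applies the configured operations to one file. Retry logic and result streaming are left out to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional, Tuple


def run_batch(
    files: List[str],
    process_one: Callable[[str], bytes],
    max_workers: int = 4,
    continue_on_error: bool = True,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> Tuple[List[Tuple[str, bytes]], List[Tuple[str, Exception]]]:
    """Process files in a thread pool and collect successes and failures."""
    successful: List[Tuple[str, bytes]] = []
    failed: List[Tuple[str, Exception]] = []

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its input file for error reporting.
        futures = {pool.submit(process_one, name): name for name in files}
        for done, future in enumerate(as_completed(futures), start=1):
            name = futures[future]
            try:
                successful.append((name, future.result()))
            except Exception as exc:
                failed.append((name, exc))
                if not continue_on_error:
                    # Cancel work that has not started yet, then stop.
                    for pending in futures:
                        pending.cancel()
                    break
            if progress_callback is not None:
                progress_callback(done, len(files))

    return successful, failed
```

Results arrive in completion order, not input order, so each entry carries its input file name. The progress callback runs in the calling thread, once per completed file.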

## BatchResult Structure
```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class BatchResult:
    successful: List[Tuple[str, Union[bytes, str]]]  # (input_file, output)
    failed: List[Tuple[str, Exception]]  # (input_file, error)
    total_processed: int
    processing_time: float

    @property
    def success_rate(self) -> float:
        # Avoid dividing by zero when the batch was empty
        if self.total_processed == 0:
            return 0.0
        return len(self.successful) / self.total_processed * 100
```

## Testing Requirements
- [ ] Test sequential processing
- [ ] Test parallel processing
- [ ] Test error handling and recovery
- [ ] Test progress callback
- [ ] Test memory usage with large batches
- [ ] Test interruption and resume
- [ ] Test various operation combinations

## Use Case Example
```python
import glob

# Add watermark to all PDFs in directory
files = glob.glob("documents/*.pdf")
result = client.batch_process(
    input_files=files,
    operations=[
        {"method": "watermark_pdf", "params": {"text": "CONFIDENTIAL"}}
    ],
    output_dir="watermarked/",
    parallel=True,
    max_workers=8
)

print(f"Processed {result.total_processed} files")
print(f"Success rate: {result.success_rate:.1f}%")

# OCR and flatten multiple documents
operations = [
    {"method": "ocr_pdf", "params": {"language": "english"}},
    {"method": "flatten_annotations", "params": {}}
]

def progress_update(current, total):
    print(f"Processing {current}/{total}...")

result = client.batch_process(
    input_files=["scan1.pdf", "scan2.pdf", "scan3.pdf"],
    operations=operations,
    output_dir="processed/",
    progress_callback=progress_update
)

# Complex workflow with error handling
# (large_file_list is assumed to be a list of input paths defined elsewhere)
result = client.batch_process(
    input_files=large_file_list,
    operations=[
        {"method": "rotate_pages", "params": {"degrees": 90, "page_indexes": [0]}},
        {"method": "ocr_pdf", "params": {"language": ["english", "spanish"]}},
        {"method": "convert_to_pdfa", "params": {"conformance": "pdfa-2b"}}
    ],
    continue_on_error=True,  # Don't stop on individual failures
    output_format="processed_{name}_{index}{ext}"
)

# Review failures
for file, error in result.failed:
    print(f"Failed to process {file}: {error}")
```

## Operation Format
```python
{
    "method": "method_name",  # Direct API method name
    "params": {  # Method parameters
        "param1": value1,
        "param2": value2
    }
}
```
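
A sketch of how operation dictionaries in this format might be dispatched follows. It assumes each named method takes the current document as its first argument and returns the processed result, which the proposal does not state; `apply_operations` is a hypothetical helper.

```python
from typing import Any, Dict, List


def apply_operations(
    client: Any, data: Any, operations: List[Dict[str, Any]]
) -> Any:
    """Apply each operation in order, feeding each output to the next step."""
    for op in operations:
        # Look up the client method by name; reject anything that is not callable.
        method = getattr(client, op["method"], None)
        if not callable(method):
            raise ValueError(f"Unknown operation: {op['method']}")
        data = method(data, **op.get("params", {}))
    return data
```

Because method names come from caller-supplied data, a real implementation would probably check them against an explicit allow-list of supported operations instead of relying on `getattr` alone.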

## Performance Considerations
- Default 4 workers balances speed and API limits
- Automatic retry with exponential backoff
- Memory streaming for large files
- Progress callback doesn't impact performance
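
The retry with exponential backoff mentioned above might look like the sketch below. The attempt count, base delay, and the choice of `ConnectionError` and `TimeoutError` as "transient" errors are illustrative assumptions, and `retry_with_backoff` is a hypothetical helper. The `sleep` parameter exists so tests can swap in a stub.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with a doubling delay."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Delay doubles each time: base, 2 * base, 4 * base, ...
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Errors outside `retry_on` propagate immediately, so permanent failures such as invalid input are not retried.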

## Error Handling
- Individual file failures don't stop batch
- Detailed error information per file
- Automatic retry for transient errors
- Optional stop-on-error mode

## Priority
🟠 Priority 4 - Advanced feature

## Labels
- feature
- performance
- batch-processing
- client-enhancement
