Commit 0637706

docs: comprehensive future enhancement plan with GitHub issue templates
Created detailed enhancement roadmap based on OpenAPI v1.9.0 analysis:

📋 Enhancement Plan:
- 13 proposed enhancements across 4 priority levels
- Detailed implementation specifications
- Testing requirements and use cases
- Recommended 4-phase implementation timeline

📁 GitHub Issue Templates:
- Individual issue template for each enhancement
- Consistent format with implementation details
- OpenAPI references and code examples
- Priority levels and labels

🎯 Goals:
- Increase API coverage from ~30% to ~80%
- Maintain backward compatibility
- Add most requested features
- Follow OpenAPI specification precisely

This provides a clear roadmap for community contributions and systematic feature development.
1 parent 4de64d0 commit 0637706

5 files changed: +458 -0 lines changed

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Feature: Convert to PDF/A Method

## Summary
Implement `convert_to_pdfa()` to convert PDFs to the PDF/A archival format for long-term preservation and compliance.

## Proposed Implementation
```python
from typing import Literal, Optional

def convert_to_pdfa(
    self,
    input_file: FileInput,
    output_path: Optional[str] = None,
    conformance: Literal["pdfa-1a", "pdfa-1b", "pdfa-2a", "pdfa-2u", "pdfa-2b", "pdfa-3a", "pdfa-3u"] = "pdfa-2b",
    vectorization: bool = True,
    rasterization: bool = True,
) -> Optional[bytes]:
    ...  # Body to be implemented
```

## Benefits
- Long-term archival compliance (ISO 19005)
- Legal and regulatory requirement fulfillment
- Guaranteed font embedding
- Self-contained documents
- Multiple conformance levels for different needs

## Implementation Details
- Use Build API with output type: `pdfa`
- Support all PDF/A conformance levels
- Provide sensible defaults (PDF/A-2b most common)
- Handle vectorization/rasterization options
- Clear error messages for conversion failures

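The implementation details above can be sketched as a small payload builder. This is a minimal sketch, not the library's actual code: only the `pdfa` output type, the conformance values, and the two boolean options come from this issue. The `parts`/`output` envelope, the `"file"` part reference, and the name `build_pdfa_instructions` are assumptions made for illustration.

```python
from typing import Any, Dict, Literal, get_args

# Conformance values copied from the proposed signature in this issue.
PdfaConformance = Literal[
    "pdfa-1a", "pdfa-1b", "pdfa-2a", "pdfa-2u", "pdfa-2b", "pdfa-3a", "pdfa-3u"
]


def build_pdfa_instructions(
    conformance: str = "pdfa-2b",
    vectorization: bool = True,
    rasterization: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a Build-style instruction payload."""
    allowed = get_args(PdfaConformance)
    if conformance not in allowed:
        # Fail early with a clear message instead of waiting for an API error.
        raise ValueError(
            f"Unsupported conformance {conformance!r}; expected one of {allowed}"
        )
    return {
        # Assumed envelope: one input part plus an output description.
        "parts": [{"file": "file"}],
        "output": {
            "type": "pdfa",
            "conformance": conformance,
            "vectorization": vectorization,
            "rasterization": rasterization,
        },
    }
```

Validating the conformance level on the client side is one way to satisfy the "clear error messages" requirement before any request is sent.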
## Testing Requirements
- [ ] Test each conformance level
- [ ] Test vectorization on/off
- [ ] Test rasterization on/off
- [ ] Test with complex PDFs (forms, multimedia)
- [ ] Verify output is valid PDF/A
- [ ] Test that conversion failures are handled gracefully

## OpenAPI Reference
- Output type: `pdfa`
- Conformance levels: pdfa-1a, pdfa-1b, pdfa-2a, pdfa-2u, pdfa-2b, pdfa-3a, pdfa-3u
- Options: vectorization (default: true), rasterization (default: true)

## Use Case Example
```python
# Convert for long-term archival (most permissive)
archived_pdf = client.convert_to_pdfa(
    "document.pdf",
    conformance="pdfa-2b"
)

# Convert for accessibility compliance (strictest)
accessible_pdf = client.convert_to_pdfa(
    "document.pdf",
    conformance="pdfa-2a",
    output_path="archived_accessible.pdf"
)
```

## Conformance Level Guide
- **PDF/A-1a**: Level A conformance, accessibility features (tagged structure) required
- **PDF/A-1b**: Level B conformance, visual appearance preservation
- **PDF/A-2a/2b**: Based on PDF 1.7, more features allowed
- **PDF/A-2u**: Unicode mapping required
- **PDF/A-3a/3u**: Allows embedded files of any format

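To make the guide above easier to use in code, a conformance string can be split into its PDF/A part number and level letter. This is only a sketch: `parse_conformance` and `LEVEL_MEANING` are hypothetical names, and the accepted values are limited to the ones listed in this issue.

```python
from typing import Tuple

# Values listed in the OpenAPI reference section of this issue.
SUPPORTED_CONFORMANCE = (
    "pdfa-1a", "pdfa-1b", "pdfa-2a", "pdfa-2u", "pdfa-2b", "pdfa-3a", "pdfa-3u",
)

# Short descriptions paraphrased from the conformance level guide.
LEVEL_MEANING = {
    "a": "accessible: tagged structure required",
    "b": "basic: visual appearance preserved",
    "u": "unicode: text must have Unicode mapping",
}


def parse_conformance(value: str) -> Tuple[int, str]:
    """Split e.g. 'pdfa-2b' into (2, 'b'); reject values not listed above."""
    if value not in SUPPORTED_CONFORMANCE:
        raise ValueError(f"Unsupported conformance level: {value!r}")
    suffix = value.split("-", 1)[1]  # 'pdfa-2b' -> '2b'
    return int(suffix[0]), suffix[1]


part, level = parse_conformance("pdfa-2u")
print(f"PDF/A part {part}, level {level} ({LEVEL_MEANING[level]})")
```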
## Priority
🟡 Priority 3 - Format conversion method

## Labels
- feature
- conversion
- compliance
- archival
- openapi-compliance
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# Feature: Convert PDF to Images Method

## Summary
Implement `convert_to_images()` to render PDF pages as image files in various formats.

## Proposed Implementation
```python
from typing import List, Literal, Optional

def convert_to_images(
    self,
    input_file: FileInput,
    output_dir: Optional[str] = None,  # Directory for multiple images
    format: Literal["png", "jpeg", "webp"] = "png",
    pages: Optional[List[int]] = None,  # None means all pages
    width: Optional[int] = None,
    height: Optional[int] = None,
    dpi: int = 150,
) -> Optional[List[bytes]]:  # Returns list of image bytes, or None if saved to output_dir
    ...  # Body to be implemented
```

## Benefits
- Generate thumbnails and previews
- Web-friendly image formats
- Flexible resolution control
- Selective page extraction
- Batch image generation

## Implementation Details
- Use Build API with output type: `image`
- Support PNG, JPEG, and WebP formats
- Handle multi-page extraction (returns list)
- Automatic file naming when saving to directory
- Resolution control via width/height/DPI

## Testing Requirements
- [ ] Test PNG format extraction
- [ ] Test JPEG format extraction
- [ ] Test WebP format extraction
- [ ] Test single page extraction
- [ ] Test multi-page extraction
- [ ] Test resolution options (width, height, DPI)
- [ ] Test file saving vs bytes return

## OpenAPI Reference
- Output type: `image`
- Formats: png, jpeg, jpg, webp
- Parameters: width, height, dpi, pages (range)

## Use Case Example
49+
```python
50+
# Extract all pages as PNG thumbnails
51+
thumbnails = client.convert_to_images(
52+
"document.pdf",
53+
format="png",
54+
width=200 # Fixed width, height auto-calculated
55+
)
56+
57+
# Extract specific pages as high-res JPEGs
58+
client.convert_to_images(
59+
"document.pdf",
60+
output_dir="./page_images",
61+
format="jpeg",
62+
pages=[0, 1, 2], # First 3 pages
63+
dpi=300 # High resolution
64+
)
65+
66+
# Generate web-optimized previews
67+
web_images = client.convert_to_images(
68+
"document.pdf",
69+
format="webp",
70+
width=800,
71+
height=600
72+
)
73+
```
74+
75+
## File Naming Convention
76+
When saving to directory:
77+
- Single page: `{original_name}.{format}`
78+
- Multiple pages: `{original_name}_page_{n}.{format}`
79+
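The naming convention above can be expressed as a small path helper. This is a sketch under assumptions: `image_output_paths` is a hypothetical name, and `n` is taken to start at 1, which the issue does not specify.

```python
from pathlib import Path
from typing import List


def image_output_paths(
    input_path: str, output_dir: str, image_format: str, page_count: int
) -> List[Path]:
    """Return output file paths following the naming convention in this issue."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    stem = Path(input_path).stem  # 'docs/report.pdf' -> 'report'
    directory = Path(output_dir)
    if page_count == 1:
        # Single page: {original_name}.{format}
        return [directory / f"{stem}.{image_format}"]
    # Multiple pages: {original_name}_page_{n}.{format}, with n assumed 1-based
    return [
        directory / f"{stem}_page_{n}.{image_format}"
        for n in range(1, page_count + 1)
    ]


for path in image_output_paths("docs/report.pdf", "page_images", "jpeg", 3):
    print(path.name)
```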
## Priority
🟡 Priority 3 - Format conversion method

## Labels
- feature
- conversion
- images
- thumbnails
- openapi-compliance
Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,107 @@
# Feature: Extract Content as JSON Method

## Summary
Implement `extract_content()` to extract text, tables, and metadata from PDFs as structured JSON data.

## Proposed Implementation
```python
from typing import Any, Dict, List, Optional, Union

def extract_content(
    self,
    input_file: FileInput,
    extract_text: bool = True,
    extract_tables: bool = True,
    extract_metadata: bool = True,
    extract_structure: bool = False,
    language: Union[str, List[str]] = "english",
    output_path: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    ...  # Body to be implemented
```

## Benefits
- Structured data extraction for analysis
- Table detection and extraction
- Metadata parsing
- Search indexing support
- Machine learning data preparation
- Multi-language text extraction

## Implementation Details
- Use Build API with output type: `json-content`
- Map parameters to OpenAPI options:
  - `plainText`: `extract_text`
  - `tables`: `extract_tables`
  - `structuredText`: `extract_structure`
- Include document metadata in response
- Support OCR for scanned documents

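The parameter mapping above could look roughly like the following. This is a sketch, not the final implementation: `build_extraction_output` is a hypothetical helper, and the `language` key is an assumption about how OCR languages are passed. Only `json-content`, `plainText`, `tables`, and `structuredText` come from this issue.

```python
from typing import Any, Dict, List, Union


def build_extraction_output(
    extract_text: bool = True,
    extract_tables: bool = True,
    extract_structure: bool = False,
    language: Union[str, List[str]] = "english",
) -> Dict[str, Any]:
    """Map Python keyword arguments to the OpenAPI option names."""
    return {
        "type": "json-content",
        "plainText": extract_text,
        "tables": extract_tables,
        "structuredText": extract_structure,
        # Assumed key name; a single string is passed through unchanged.
        "language": language if isinstance(language, str) else list(language),
    }


print(build_extraction_output(extract_structure=True, language=["english", "spanish"]))
```

Note that `extract_metadata` has no corresponding option in the mapping listed above, so it is left out of this sketch.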
## Testing Requirements
- [ ] Test plain text extraction
- [ ] Test table extraction
- [ ] Test metadata extraction
- [ ] Test structured text extraction
- [ ] Test with multi-language documents
- [ ] Test with scanned documents (OCR)
- [ ] Validate JSON structure

## OpenAPI Reference
- Output type: `json-content`
- Options: plainText, structuredText, tables, keyValuePairs
- Language support for OCR
- Returns structured JSON

## Use Case Example
```python
# Extract everything from a document
content = client.extract_content(
    "report.pdf",
    extract_text=True,
    extract_tables=True,
    extract_metadata=True
)

# Access extracted data
print(content["metadata"]["title"])
print(content["text"])
for table in content["tables"]:
    print(table["data"])

# Extract for multilingual search indexing
search_data = client.extract_content(
    "multilingual.pdf",
    language=["english", "spanish", "french"],
    extract_structure=True
)
```

## Expected JSON Structure
```json
{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "created": "2024-01-01T00:00:00Z",
    "pages": 10
  },
  "text": "Extracted plain text...",
  "structured_text": {
    "paragraphs": [...],
    "headings": [...]
  },
  "tables": [
    {
      "page": 1,
      "data": [["Header1", "Header2"], ["Row1Col1", "Row1Col2"]]
    }
  ]
}
```

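Given the expected structure above, a caller might turn each table into a list of row dictionaries. This is a sketch that assumes the first row of `data` is a header row, as in the sample; `table_to_records` is a hypothetical helper and not part of the proposed API.

```python
from typing import Any, Dict, List


def table_to_records(table: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert a table's 'data' rows into dicts keyed by the first (header) row."""
    rows = table.get("data", [])
    if not rows:
        return []
    header, *body = rows  # assumption: row 0 holds the column names
    return [dict(zip(header, row)) for row in body]


# Sample taken from the expected JSON structure in this issue.
sample_table = {
    "page": 1,
    "data": [["Header1", "Header2"], ["Row1Col1", "Row1Col2"]],
}
print(table_to_records(sample_table))
```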
## Priority
🟡 Priority 3 - Format conversion method

## Labels
- feature
- extraction
- data-processing
- json
- openapi-compliance

github_issues/09_ai_redact.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Feature: AI-Powered Redaction Method

## Summary
Implement `ai_redact()` to use Nutrient's AI capabilities for automatic detection and redaction of sensitive information.

## Proposed Implementation
```python
from typing import List, Literal, Optional

def ai_redact(
    self,
    input_file: FileInput,
    output_path: Optional[str] = None,
    sensitivity_level: Literal["low", "medium", "high"] = "medium",
    entity_types: Optional[List[str]] = None,  # ["email", "ssn", "phone", etc.]
    review_mode: bool = False,  # Create redactions without applying
    confidence_threshold: float = 0.8,
) -> Optional[bytes]:
    ...  # Body to be implemented
```

## Benefits
- Supports automated GDPR/CCPA compliance workflows
- Reduce manual review time (estimated: up to 90%)
- Consistent redaction across documents
- Multiple entity type detection
- Configurable sensitivity levels
- Review mode for human verification

## Implementation Details
- Use dedicated `/ai/redact` endpoint
- Distinct from `create_redactions`, which is rule-based
- Support confidence thresholds
- Allow entity type filtering
- Option to review before applying

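The intended semantics of `entity_types` and `confidence_threshold` can be illustrated with a local filter. This is purely a sketch of the selection rule: the `Detection` shape and `select_detections` are hypothetical, and the real `/ai/redact` endpoint may apply these filters on the server and return a different structure.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Detection:
    """Hypothetical shape of one detected entity."""
    entity_type: str
    text: str
    confidence: float


def select_detections(
    detections: Iterable[Detection],
    entity_types: Optional[List[str]] = None,
    confidence_threshold: float = 0.8,
) -> List[Detection]:
    """Keep detections at or above the threshold and of an allowed type."""
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    # None means "no type filter", matching the proposed default.
    allowed = set(entity_types) if entity_types is not None else None
    return [
        d
        for d in detections
        if d.confidence >= confidence_threshold
        and (allowed is None or d.entity_type in allowed)
    ]


found = [
    Detection("email", "a@example.com", 0.97),
    Detection("phone", "555-0100", 0.62),
    Detection("ssn", "000-00-0000", 0.91),
]
print(select_detections(found, entity_types=["email", "phone"]))
```

Validating the threshold range on the client side is one option for giving callers an early, clear error.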
## Testing Requirements
- [ ] Test sensitivity levels (low/medium/high)
- [ ] Test specific entity detection
- [ ] Test review mode
- [ ] Test confidence thresholds
- [ ] Compare with manual redaction
- [ ] Test on various document types

## OpenAPI Reference
- Endpoint: `/ai/redact`
- Separate from Build API
- AI-powered detection
- Returns processed document

## Use Case Example
```python
# Automatic GDPR compliance
gdpr_safe = client.ai_redact(
    "customer_data.pdf",
    entity_types=["email", "phone", "name", "address"],
    sensitivity_level="high"
)

# Review before applying
review_pdf = client.ai_redact(
    "contract.pdf",
    entity_types=["ssn", "bank_account", "credit_card"],
    review_mode=True,  # Creates redaction annotations only
    confidence_threshold=0.9
)

# Then manually review and apply
final = client.apply_redactions(review_pdf)
```

## Supported Entity Types
- Personal: name, email, phone, address
- Financial: ssn, credit_card, bank_account, routing_number
- Medical: medical_record, diagnosis, prescription
- Custom: (API may support additional types)

## Priority
🟠 Priority 4 - Advanced feature

## Labels
- feature
- ai
- redaction
- compliance
- gdpr
- openapi-compliance
