Bug Report: PDF Processing Returns Empty Content
Summary
DocStrange processes PDF successfully (reports "1 successful") but returns empty content for all output formats. The PDF is valid and readable by other tools.
Environment
- docstrange version: 1.1.5
- OS: macOS (Darwin)
- Python: (from mise/pipx installation)
- Authentication: Authenticated cloud mode (10k/month free calls)
- PDF details:
  - File: 2512.14012.pdf (likely arXiv paper)
  - Size: 1.4 MB
  - Pages: 11 (confirmed with the file command)
  - Format: PDF 1.7
Steps to Reproduce
- Basic markdown conversion (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --verbose
Output:
Processing: /Users/ramarivera/Downloads/2512.14012.pdf
Summary: 1 successful, 0 failed
Initialized extractor in cloud mode:
- Output format: markdown
- Auth: authenticated (10k/month) free calls
[empty output]
- JSON with field extraction (fails):
docstrange ~/Downloads/2512.14012.pdf --output json --extract-fields title abstract authors
Output:
{
  "document": {
    "raw_content": ""
  },
  "format": "json_parse_error",
  "error": "Expecting value: line 1 column 1 (char 0)"
}
- With OCR enabled (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --ocr-enabled --verbose
Still returns empty content.
- With Gemini model (fails):
docstrange ~/Downloads/2512.14012.pdf --model gemini --output markdown --verbose
Still returns empty content.
- Saving to file (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --output-file output.md
Creates output.md with 0 bytes.
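A note on the json_parse_error in the JSON attempt: the message matches what Python's standard json module raises when asked to parse an empty string, which is consistent with an empty body reaching the parser. This is only an observation about the standard library; I have not confirmed where DocStrange does its parsing.

```python
import json

# Parsing an empty string reproduces the exact message from the report.
try:
    json.loads("")
    message = None
except json.JSONDecodeError as exc:
    message = str(exc)

print(message)  # Expecting value: line 1 column 1 (char 0)
```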
Expected Behavior
- Should extract text content from the PDF
- Should return markdown/JSON with the document content
- Should not report "successful" if content extraction failed
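To make the last point concrete, here is a minimal sketch of the kind of check I would expect before a file is counted as successful. It is not DocStrange's actual code: the function name classify_result and the status strings are hypothetical, chosen only for illustration.

```python
import json


def classify_result(content: str, output_format: str) -> str:
    """Return "success" only when the extracted content is usable.

    Hypothetical helper for illustration; not part of DocStrange.
    """
    # Empty or whitespace-only output carries no document content.
    if not content or not content.strip():
        return "empty"
    # For JSON output, the content must also parse.
    if output_format == "json":
        try:
            json.loads(content)
        except json.JSONDecodeError:
            return "invalid_json"
    return "success"


print(classify_result("", "markdown"))                 # empty
print(classify_result("# Title\n\nBody", "markdown"))  # success
```

With a check like this, the summary line could count empty results as failed instead of successful.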
Actual Behavior
- Reports "1 successful" in summary
- Returns completely empty content (raw_content: "")
- No error messages or warnings about why extraction failed
- Output files are empty (0 bytes for markdown, error JSON for json format)
Additional Testing
✅ DocStrange works with simple text files:
echo "Test document" > test.txt
docstrange test.txt --output markdown
Returns:
# Text Document
Test document
❌ This specific PDF fails consistently
Tried all combinations of:
- Output formats: markdown, json, text, html
- Models: default (nanonets), gemini
- Flags: --ocr-enabled, --include-images, --preserve-layout
- All produce empty content
Possible Causes
- Silent failure in PDF text extraction (no error logged)
- Cloud API returning empty response without error
- PDF might have embedded text that's not being detected
- PDF might need OCR but OCR isn't triggering properly
Related Issues
- Issue #48 ("Error getting json data 'No content available'") mentions "No content available" for JSON extraction
- Issue #35 ("Major Accuracy Difference Between Hosted OCR (docstrange.nanonets.com) and Downloaded Local Model") mentions accuracy differences between hosted vs local
Diagnostic Commands
# Verify PDF is valid
file 2512.14012.pdf
# Output: PDF document, version 1.7, 11 pages
# Check credentials
ls -la ~/.docstrange/credentials.json
# Exists and was created during authentication
# Test with verbose mode
docstrange 2512.14012.pdf --output json --verbose
# Shows successful authentication but empty content
Request
- Could you investigate why PDFs report "successful" but return empty content?
- Should there be more detailed error logging when extraction silently fails?
- Is there a way to get detailed debug logs to see what's happening during processing?
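In the meantime, a small wrapper can at least turn the silent failure into a visible one. The sketch below is generic and makes two assumptions I have not verified for DocStrange: that the converted content is written to stdout, and that the command is run without --verbose so log lines do not mask an empty result. The helper name run_and_check is hypothetical.

```python
import subprocess
import sys


def run_and_check(cmd: list[str]) -> str:
    """Run a command and raise if it exits cleanly but prints nothing."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"command failed ({proc.returncode}): {proc.stderr.strip()}"
        )
    # Exit status 0 with blank stdout is the case described in this report.
    if not proc.stdout.strip():
        raise RuntimeError("command exited with status 0 but produced no output")
    return proc.stdout


# Demonstration with the Python interpreter itself as the command.
print(run_and_check([sys.executable, "-c", "print('hi')"]).strip())  # hi
```

The same function could be called with the docstrange command line from the steps above, for example run_and_check(["docstrange", "2512.14012.pdf", "--output", "markdown"]).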
Sample File
The PDF that fails: 2512.14012.pdf (arXiv paper, 1.4MB, 11 pages)
I can provide the file if needed for debugging.
Note: I searched for similar issues before filing. This report was created with AI assistance.