PDF processing reports success but returns empty content #51

@ramarivera

Bug Report: PDF Processing Returns Empty Content

Summary

DocStrange reports that the PDF was processed successfully ("1 successful") but returns empty content for every output format. The PDF is valid and readable by other tools.

Environment

  • docstrange version: 1.1.5
  • OS: macOS (Darwin)
  • Python: (from mise/pipx installation)
  • Authentication: Authenticated cloud mode (10k/month free calls)
  • PDF details:
    • File: 2512.14012.pdf (likely arXiv paper)
    • Size: 1.4 MB
    • Pages: 11 (confirmed with file command)
    • Format: PDF 1.7

Steps to Reproduce

  1. Basic markdown conversion (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --verbose

Output:

Processing: /Users/ramarivera/Downloads/2512.14012.pdf

Summary: 1 successful, 0 failed
Initialized extractor in cloud mode:
  - Output format: markdown
  - Auth: authenticated (10k/month) free calls

[empty output]

  2. JSON with field extraction (fails):
docstrange ~/Downloads/2512.14012.pdf --output json --extract-fields title abstract authors

Output:

{
  "document": {
    "raw_content": ""
  },
  "format": "json_parse_error",
  "error": "Expecting value: line 1 column 1 (char 0)"
}

  3. With OCR enabled (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --ocr-enabled --verbose

Still returns empty content.

  4. With Gemini model (fails):
docstrange ~/Downloads/2512.14012.pdf --model gemini --output markdown --verbose

Still returns empty content.

  5. Saving to file (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --output-file output.md

Creates output.md with 0 bytes.
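
The error string in the JSON run above is consistent with an empty string being handed to a JSON parser. As a minimal illustration using only Python's standard `json` module (this is not DocStrange's code, and how DocStrange builds its error object internally is an assumption), parsing an empty string yields the same message:

```python
import json
from typing import Optional


def parse_error_for(raw_content: str) -> Optional[str]:
    """Return the JSON decode error message for raw_content, or None if it parses."""
    try:
        json.loads(raw_content)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# An empty response body cannot be parsed as JSON.
print(parse_error_for(""))  # Expecting value: line 1 column 1 (char 0)
```

If that reading is right, the `json_parse_error` is a symptom of the empty extraction result rather than a separate bug.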

Expected Behavior

  • Should extract text content from the PDF
  • Should return markdown/JSON with the document content
  • Should not report "successful" if content extraction failed
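
The third expectation could be sketched roughly as follows. This is a hypothetical illustration, not DocStrange's actual summary logic: the `Summary` class and `summarize` function are names invented here, and the input is assumed to be a mapping from file path to extracted content.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Summary:
    successful: int = 0
    failed: int = 0


def summarize(results: Mapping[str, str]) -> Summary:
    """Count a file as successful only if its extracted content is non-blank."""
    summary = Summary()
    for _path, content in results.items():
        if content.strip():
            summary.successful += 1
        else:
            # Empty or whitespace-only content is treated as a failure.
            summary.failed += 1
    return summary
```

With this rule, the run described in this report would be summarized as "0 successful, 1 failed" instead of "1 successful, 0 failed".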

Actual Behavior

  • Reports "1 successful" in summary
  • Returns completely empty content (raw_content: "")
  • No error messages or warnings about why extraction failed
  • Output files contain no document content (0 bytes for markdown, only an error object for JSON)
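
Until the tool flags this case itself, a caller can check the output file after the fact. The helper below is a small sketch using only the standard library; `output_has_content` is a name made up for this example, and treating whitespace-only files as empty is an assumption.

```python
import tempfile
from pathlib import Path


def output_has_content(path: Path) -> bool:
    """Return True if the file exists and contains non-whitespace text."""
    if not path.is_file() or path.stat().st_size == 0:
        return False
    return bool(path.read_text(encoding="utf-8", errors="replace").strip())


# Demonstration with a 0-byte file like the output.md described above.
with tempfile.TemporaryDirectory() as tmp:
    empty_file = Path(tmp) / "output.md"
    empty_file.write_text("", encoding="utf-8")
    empty_result = output_has_content(empty_file)

    filled_file = Path(tmp) / "filled.md"
    filled_file.write_text("# Text Document\n\nTest document\n", encoding="utf-8")
    filled_result = output_has_content(filled_file)

print(empty_result, filled_result)  # False True
```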

Additional Testing

✅ DocStrange works with simple text files:

echo "Test document" > test.txt
docstrange test.txt --output markdown

Returns:

# Text Document

Test document

❌ This specific PDF fails consistently

Tried all combinations of:

  • Output formats: markdown, json, text, html
  • Models: default (nanonets), gemini
  • Flags: --ocr-enabled, --include-images, --preserve-layout
  • All produce empty content

Possible Causes

  1. Silent failure in PDF text extraction (no error logged)
  2. Cloud API returning empty response without error
  3. PDF might have embedded text that's not being detected
  4. PDF might need OCR but OCR isn't triggering properly
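
Causes 3 and 4 can be told apart locally by checking whether the PDF has a usable text layer at all. The sketch below is a heuristic under stated assumptions: `needs_ocr` is a hypothetical helper, the 20-character threshold is arbitrary, and the per-page text is assumed to come from some local extractor that is not shown here.

```python
from typing import Iterable


def needs_ocr(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF lacks a usable text layer.

    Returns True when no page yields at least min_chars_per_page
    non-whitespace characters.
    """
    for text in page_texts:
        # Collapse all whitespace before counting characters.
        if len("".join(text.split())) >= min_chars_per_page:
            return False
    return True
```

The page texts could be produced with a library such as pypdf, for example by calling `extract_text()` on each page of a `PdfReader`. If local extraction yields text while DocStrange returns nothing, that points away from the OCR explanation and toward causes 1 or 2.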

Diagnostic Commands

# Verify PDF is valid
file 2512.14012.pdf
# Output: PDF document, version 1.7, 11 pages

# Check credentials
ls -la ~/.docstrange/credentials.json
# Exists and was created during authentication

# Test with verbose mode
docstrange 2512.14012.pdf --output json --verbose
# Shows successful authentication but empty content
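
The `file` check above can also be reproduced in Python by inspecting the PDF header, which begins with `%PDF-` followed by the version. This is a minimal sketch; `pdf_version` is a name chosen for illustration, and the check confirms only the file type and declared version, not that the pages contain extractable text.

```python
import re
from typing import Optional

_PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(first_bytes: bytes) -> Optional[str]:
    """Return the version from a PDF header such as b"%PDF-1.7", else None."""
    match = _PDF_HEADER.match(first_bytes)
    return match.group(1).decode("ascii") if match else None


print(pdf_version(b"%PDF-1.7\n"))  # 1.7
```

For a real file, pass the first few bytes, for example `pdf_version(open("2512.14012.pdf", "rb").read(16))`.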

Request

  • Could you investigate why PDFs report "successful" but return empty content?
  • Should there be more detailed error logging when extraction silently fails?
  • Is there a way to get detailed debug logs to see what's happening during processing?

Sample File

The PDF that fails: 2512.14012.pdf (arXiv paper, 1.4MB, 11 pages)
I can provide the file if needed for debugging.


Note: Similar issues were searched before filing. This report was created with AI assistance.
