Failures with docx files #104

@ajkiessl

Description

Text extraction fails for some docx files. One example is FileResource 30381.

Sidekiq logs:

Faraday::TimeoutError: Net::ReadTimeout with #<TCPSocket:(closed)>

The metadata-listener pod, however, logs:

Error performing MetadataListener::Job (Job ID: ae115d50-7f77-4e5b-a100-d9a580507371) from Async(metadata) in 18486.85ms: Errno::ENOENT (No such file or directory @ rb_file_s_size - /app/20250404-9-46q9tv.txt):
/app/lib/metadata_listener/report/extracted_text.rb:43:in 'FileTest.size'
/app/lib/metadata_listener/report/extracted_text.rb:43:in 'Pathname#size'
/app/lib/metadata_listener/report/extracted_text.rb:43:in 'MetadataListener::Report::ExtractedText#params'
/app/lib/metadata_listener/report/extracted_text.rb:34:in 'block in MetadataListener::Report::ExtractedText#response'
/app/vendor/bundle/ruby/3.4.0/gems/faraday-2.10.0/lib/faraday/connection.rb:441:in 'block in Faraday::Connection#run_request'
/app/vendor/bundle/ruby/3.4.0/gems/faraday-2.10.0/lib/faraday/connection.rb:458:in 'block in Faraday::Connection#build_request'
/app/vendor/bundle/ruby/3.4.0/gems/faraday-2.10.0/lib/faraday/request.rb:41:in 'block in Faraday::Request.create'
/app/vendor/bundle/ruby/3.4.0/gems/faraday-2.10.0/lib/faraday/request.rb:40:in 'Faraday::Request.create'
/app/vendor/bundle/ruby/3.4.0/gems/faraday-2.10.0/lib/faraday/connection.rb:454:in 'Faraday::Connection#build_request'
/app/vendor/bundle/ruby/3.4.0/gems/faraday-2.10.0/lib/faraday/connection.rb:436:in 'Faraday::Connection#run_request'
/app/vendor/bundle/ruby/3.4.0/gems/faraday-2.10.0/lib/faraday/connection.rb:280:in 'Faraday::Connection#put'
/app/lib/metadata_listener/report/extracted_text.rb:33:in 'MetadataListener::Report::ExtractedText#response'

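This indicates an issue parsing the file. Oddly, the error is raised on a .txt temp file rather than on the docx itself: the trace fails in Pathname#size, called from ExtractedText#params, because /app/20250404-9-46q9tv.txt does not exist. Perhaps this file was meant to hold the text extracted from the docx.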
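For illustration only, here is a minimal sketch of how a missing temp file could be detected before its size is read. The helper name `extracted_text_params` and the shape of the returned hash are assumptions; the real `ExtractedText#params` in `extracted_text.rb` is not shown in this issue and may look quite different. The only behavior relied on is that `Pathname#size` raises `Errno::ENOENT` for a nonexistent path, which matches the trace above.

```ruby
require "pathname"

# Hypothetical helper, NOT the actual ExtractedText#params implementation.
# Builds report params for an extracted-text file, or returns nil when the
# file is missing so the caller can report the failure explicitly.
def extracted_text_params(path)
  file = Pathname.new(path)

  # Pathname#size raises Errno::ENOENT if the file does not exist,
  # so check for a regular file first instead of letting it raise.
  return nil unless file.file?

  { filename: file.basename.to_s, size: file.size }
end
```

With a guard like this, a text extraction that produced no output file would surface as a nil result that the job can log or report, rather than as an exception raised while the PUT request is being built.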

docx files are known to cause issues elsewhere (for example, with MiniMagick during thumbnail creation). This failure only affects text extraction, so it is not critical. The virus check still works.
