Skip to content

Conversation

@CyMule
Copy link
Contributor

@CyMule CyMule commented Jul 29, 2025

Problem

S3 downloads were sometimes failing with NotADirectoryError and FileExistsError when S3 buckets contained objects with conflicting naming patterns that cannot be represented in traditional filesystem hierarchies.

Example conflict:

  • S3 object: foo (file)
  • S3 object: foo/documents (file requiring foo to be a directory)

This created a race condition where download order determined success/failure

Solution

Used tempfile to create unique download paths for each S3 object:

Before:

S3: "foo" → Local: /downloads/foo
S3: "foo/documents" → Local: /downloads/foo/documents
Conflict: foo cannot be both file and directory

After:

S3: "foo" → Local: /downloads/a1b2c3d4e5f6/foo
S3: "foo/documents" → Local: /downloads/9g8h7i6j5k4l/documents
No conflicts: Each file gets unique directory

Future Work

This PR targets only the s3 downloads. I think it would make sense to use tempfiles for all downloads (as in PR #571), but that requires more extensive changes to implement cleanly. This fix provides immediate relief from the path conflict issues while we work on the more comprehensive tempfile solution.

@cmscmadd
Copy link

Does this fix the file not found errors we sometimes see as an S3 source?

@CyMule CyMule marked this pull request as ready for review July 30, 2025 22:34
@CyMule CyMule changed the title fix: prevent S3 path conflicts using hash-based directory isolation fix: prevent S3 path conflicts using tempfile Jul 30, 2025
Comment on lines 179 to 181
expected_filenames.sort()
actual_filenames.sort()
assert expected_filenames == actual_filenames, (
Copy link

@PastelStorm PastelStorm Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super important here and shouldn't be a blocker but in general I would avoid this pattern in the code.

I did some math and you should get about 10x-12x speedup if you create and compare two sets because TimSort has O(n*log n) complexity. Comparing two sets or two lists is the same O(n).

For 100k files this would be a difference of 3.5kk operations (sorted lists) vs. 300k operations (sets)

expected_filenames = {Path(s3_key).name for s3_key in s3_keys}
actual_filenames = {Path(download_file).name for download_file in download_files}

assert expected_filenames == actual_filenames

Comment on lines 275 to 280
if not file_data.source_identifiers:
return None

filename = file_data.source_identifiers.filename
if not filename:
return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define both booleans as variables, join them with an and and return None once

mkdir_concurrent_safe(self.download_dir)

temp_dir = tempfile.mkdtemp(
prefix="unstructured_",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make this a class-level constant

Copy link

@PastelStorm PastelStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits but otherwise LGTM!

@CyMule CyMule merged commit 3bfc8df into main Jul 31, 2025
31 checks passed
@CyMule CyMule deleted the fix/s3-path-conflicts-hash-isolation branch July 31, 2025 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants