fix: prevent S3 path conflicts using tempfile #569

CyMule · 2025-07-29T20:17:34Z

Problem

S3 downloads were sometimes failing with NotADirectoryError and FileExistsError when S3 buckets contained objects with conflicting naming patterns that cannot be represented in traditional filesystem hierarchies.

Example conflict:

S3 object: foo (file)
S3 object: foo/documents (file requiring foo to be a directory)

This created a race condition in which download order determined success or failure.
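A minimal sketch of how the conflict surfaces on a local filesystem, assuming each key is written naively under a shared root. The helper name `try_download_order` is hypothetical and is not part of this PR; the exact exception type depends on the write order and on the platform.

```python
import tempfile
from pathlib import Path


def try_download_order(keys):
    """Write each key under one shared root; return the first error type name, or None."""
    with tempfile.TemporaryDirectory() as root:
        for key in keys:
            target = Path(root) / key
            try:
                # Mirror the key's "directory" structure, then write the object.
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_text("data")
            except OSError as exc:
                # FileExistsError, NotADirectoryError, etc. are all OSError subclasses.
                return type(exc).__name__
    return None


# "foo" is written as a file first, so it cannot later become a directory.
file_first = try_download_order(["foo", "foo/documents"])
# "foo" is created as a directory first, so it cannot later be written as a file.
dir_first = try_download_order(["foo/documents", "foo"])
# Keys that do not overlap download without error.
no_overlap = try_download_order(["foo", "bar/documents"])
```

Both conflicting orders fail, just at different points, which is why the observed error varied with download order.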

Solution

Used tempfile to create unique download paths for each S3 object:

Before:

S3: "foo" → Local: /downloads/foo
S3: "foo/documents" → Local: /downloads/foo/documents
Conflict: foo cannot be both file and directory

After:

S3: "foo" → Local: /downloads/a1b2c3d4e5f6/foo
S3: "foo/documents" → Local: /downloads/9g8h7i6j5k4l/documents
No conflicts: Each file gets unique directory
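The mapping above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the function name `isolated_download_path` is hypothetical, and passing `dir=` to `tempfile.mkdtemp` is an assumption based on the example paths living under the download directory.

```python
import tempfile
from pathlib import Path


def isolated_download_path(download_dir: Path, s3_key: str) -> Path:
    """Return a local path for s3_key inside its own unique temporary directory."""
    download_dir.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a new, uniquely named directory and returns its path.
    temp_dir = tempfile.mkdtemp(prefix="unstructured_", dir=str(download_dir))
    # Only the final path component of the key is kept as the local filename.
    return Path(temp_dir) / Path(s3_key).name


with tempfile.TemporaryDirectory() as root:
    first = isolated_download_path(Path(root), "foo")
    second = isolated_download_path(Path(root), "foo/documents")
    first.write_text("a")
    second.write_text("b")
    both_written = first.is_file() and second.is_file()
```

Because each object lands in its own directory, `foo` never has to be both a file and a directory at the same local path.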

Future Work

This PR targets only the S3 downloads. I think it would make sense to use tempfiles for all downloads (as in PR #571), but that requires more extensive changes to implement cleanly. This fix provides immediate relief from the path conflict issues while we work on the more comprehensive tempfile solution.

cmscmadd · 2025-07-29T20:53:12Z

Does this fix the file not found errors we sometimes see as an S3 source?

PastelStorm · 2025-07-31T02:05:28Z

test/integration/connectors/utils/validation/source.py

+        expected_filenames.sort()
+        actual_filenames.sort()
+        assert expected_filenames == actual_filenames, (


It's not super important here and shouldn't be a blocker, but in general I would avoid this pattern in the code.

I did some math, and you should get about a 10x-12x speedup if you create and compare two sets instead, because TimSort has O(n log n) complexity, while comparing two sets costs the same O(n) as comparing two lists.

For 100k files, this would be a difference of roughly 3.5 million operations (sorted lists) vs. 300k operations (sets).

expected_filenames = {Path(s3_key).name for s3_key in s3_keys}
actual_filenames = {Path(download_file).name for download_file in download_files}
assert expected_filenames == actual_filenames
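A small sketch comparing the two approaches discussed above, using made-up keys and paths. One caveat worth keeping in mind: sets collapse duplicate names, so the set comparison is only equivalent to the sorted-list comparison when names are unique. `collections.Counter` is shown as one possible linear-time option that still counts duplicates; it is not something proposed in this review.

```python
from collections import Counter
from pathlib import Path

# Hypothetical sample data for illustration only.
s3_keys = ["foo", "foo/documents", "bar/report.pdf"]
download_files = [
    "/downloads/x1/report.pdf",
    "/downloads/x2/foo",
    "/downloads/x3/documents",
]

expected = [Path(key).name for key in s3_keys]
actual = [Path(path).name for path in download_files]

# Sorted-list comparison: O(n log n) for the sorts, order-insensitive.
sorted_equal = sorted(expected) == sorted(actual)
# Set comparison: O(n) on average, order-insensitive, ignores duplicates.
sets_equal = set(expected) == set(actual)

# Where the two checks can disagree: same distinct names, different counts.
with_dupes_a = ["a", "a", "b"]
with_dupes_b = ["a", "b", "b"]
sets_hide_dupes = set(with_dupes_a) == set(with_dupes_b)
counters_equal = Counter(with_dupes_a) == Counter(with_dupes_b)
```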

PastelStorm · 2025-07-31T02:09:04Z

unstructured_ingest/processes/connectors/fsspec/fsspec.py

+        if not file_data.source_identifiers:
+            return None
+
+        filename = file_data.source_identifiers.filename
+        if not filename:
+            return None


Define both conditions as boolean variables, join them with an "and", and return None once.
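One way to read this suggestion, as a hedged sketch. The `SourceIdentifiers` and `FileData` classes below are simplified stand-ins for the real types, the function name `get_filename` is hypothetical, and returning the filename at the end is an assumption, since the diff snippet does not show what follows the guards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceIdentifiers:  # simplified stand-in, not the real class
    filename: Optional[str] = None


@dataclass
class FileData:  # simplified stand-in, not the real class
    source_identifiers: Optional[SourceIdentifiers] = None


def get_filename(file_data: FileData) -> Optional[str]:
    identifiers = file_data.source_identifiers
    has_identifiers = bool(identifiers)
    # Short-circuit so .filename is only read when identifiers exist.
    has_filename = has_identifiers and bool(identifiers.filename)
    if not (has_identifiers and has_filename):
        return None  # single early return covering both checks
    return identifiers.filename
```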

PastelStorm · 2025-07-31T02:13:35Z

unstructured_ingest/processes/connectors/fsspec/fsspec.py

+        mkdir_concurrent_safe(self.download_dir)
+
+        temp_dir = tempfile.mkdtemp(
+            prefix="unstructured_", 


I'd make this a class-level constant
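A sketch of what a class-level constant for the prefix might look like. The class name `Downloader`, the constant name `TEMP_DIR_PREFIX`, and the method `make_temp_dir` are all hypothetical; `Path.mkdir(exist_ok=True)` stands in for the project's `mkdir_concurrent_safe` helper, and `dir=` is an assumption because the diff snippet is cut off after the `prefix` argument.

```python
import tempfile
from pathlib import Path


class Downloader:  # hypothetical class for illustration
    # Single place to change the prefix used for per-file temp directories.
    TEMP_DIR_PREFIX = "unstructured_"

    def __init__(self, download_dir):
        self.download_dir = Path(download_dir)

    def make_temp_dir(self) -> Path:
        # Stand-in for mkdir_concurrent_safe(self.download_dir).
        self.download_dir.mkdir(parents=True, exist_ok=True)
        return Path(
            tempfile.mkdtemp(prefix=self.TEMP_DIR_PREFIX, dir=str(self.download_dir))
        )


with tempfile.TemporaryDirectory() as root:
    created = Downloader(root).make_temp_dir()
    created_is_dir = created.is_dir()
    created_name = created.name
```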

PastelStorm

A few nits but otherwise LGTM!

CyMule temporarily deployed to ci July 29, 2025 20:17 — with GitHub Actions Inactive

CyMule temporarily deployed to ci July 29, 2025 20:28 — with GitHub Actions Inactive

CyMule temporarily deployed to ci July 30, 2025 15:38 — with GitHub Actions Inactive

CyMule temporarily deployed to ci July 30, 2025 15:39 — with GitHub Actions Inactive

CyMule temporarily deployed to ci July 30, 2025 16:37 — with GitHub Actions Inactive

CyMule temporarily deployed to ci July 30, 2025 20:57 — with GitHub Actions Inactive

remove line

bc8e2ed

CyMule temporarily deployed to ci July 30, 2025 22:27 — with GitHub Actions Inactive

CyMule marked this pull request as ready for review July 30, 2025 22:34

CyMule changed the title ~~fix: prevent S3 path conflicts using hash-based directory isolation~~ fix: prevent S3 path conflicts using tempfile Jul 30, 2025

PastelStorm reviewed Jul 31, 2025


PastelStorm approved these changes Jul 31, 2025


CyMule and others added 2 commits July 31, 2025 08:41

addres feedback

ed2a837

Merge branch 'main' into fix/s3-path-conflicts-hash-isolation

9d1ee57

CyMule temporarily deployed to ci July 31, 2025 12:52 — with GitHub Actions Inactive

CyMule merged commit 3bfc8df into main Jul 31, 2025
31 checks passed

CyMule deleted the fix/s3-path-conflicts-hash-isolation branch July 31, 2025 13:00
