Commit f8c1baa
downloader: avoid duplicate downloads (#243)
GitHub: fix GH-242
Red Datasets sometimes handles large files. Downloading the same data
multiple times is not acceptable for this use case: we should not download
the same data once per worker.
Add a cache check not only before acquiring the lock but also after
acquiring it.
Checking only after acquiring the lock would be enough, but it would
create a lock file unnecessarily, so the pre-lock cache check is kept as
well.
This patch will avoid duplicate downloads in the following case:
```mermaid
sequenceDiagram
participant P1 as Process 1
participant P2 as Process 2
participant FS as File System
P1->>FS: check output_path.exist? => false
P2->>FS: check output_path.exist? => false
P1->>FS: create lock => success
P2->>FS: create lock => failure (sleep 1~10s)
P1->>FS: download
P1->>FS: delete lock
P2->>FS: create lock => success
Note over P2: No re-check after lock
P2->>FS: download (duplicate)
P2->>FS: delete lock
```
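The pattern described above is essentially double-checked locking around the download. Here is a minimal Ruby sketch of that pattern, assuming a `Pathname` `output_path`, a `.lock` suffix for the lock file, and a hypothetical `download_raw` helper; these names are illustrative and not the actual Red Datasets API:

```ruby
require "fileutils"
require "pathname"

# Hypothetical sketch: output_path is a Pathname, download_raw stands in
# for the real download logic. Names are illustrative only.
def download_once(url, output_path)
  return if output_path.exist? # pre-lock check: avoid creating a lock file at all

  lock_path = Pathname("#{output_path}.lock")
  loop do
    begin
      # O_EXCL makes lock file creation atomic across processes
      lock = File.open(lock_path, File::WRONLY | File::CREAT | File::EXCL)
    rescue Errno::EEXIST
      sleep(rand(1..10)) # another process holds the lock; wait and retry
      next
    end
    begin
      # Re-check after acquiring the lock: another process may have
      # finished the download while this one was waiting.
      download_raw(url, output_path) unless output_path.exist?
    ensure
      lock.close
      FileUtils.rm_f(lock_path)
    end
    return
  end
end
```

With the post-lock re-check in place, Process 2 in the sequence above finds `output_path` already present once it finally acquires the lock and skips the duplicate download.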
---------
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
1 file changed: 12 additions and 4 deletions.