@gsaluja9 gsaluja9 commented Feb 28, 2025

Data ingestion with streaming took 41 mins, while the previous version consistently does it in under 9 mins.

The problem was that our pulls from URL were executing in lock step with the db query. This prevented the image flow from being truly pipelined and kept the worker threads from using data effectively.

| Configuration | Time | Batch size | Workers in PQ |
| --- | --- | --- | --- |
| downloading and extracting | 9 min | 100 | 8 |
| naive streaming | 41 min | 100 | 8 |
| streaming with decoupled clients | 9 min 15 s | 100 | 8 |
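The fix described above — running URL downloads independently of the db query path — can be sketched as a producer-consumer pipeline with a bounded queue between the two sides. This is an illustrative sketch only, not the PR's actual `ingest_streaming.py`: the `fetch` and `insert_batch` callables and the single downloader thread are assumptions.

```python
import queue
import threading

def decoupled_ingest(urls, fetch, insert_batch, batch_size=100, num_workers=8):
    """Decouple downloading from db inserts: a downloader thread keeps
    pulling from URLs while the main thread batches and inserts, so
    neither side waits in lock step with the other."""
    q = queue.Queue(maxsize=num_workers * 2)  # bounded queue gives back-pressure
    SENTINEL = object()                       # marks end of the download stream

    def downloader():
        for url in urls:
            q.put(fetch(url))  # blocks only when the queue is full
        q.put(SENTINEL)

    t = threading.Thread(target=downloader, daemon=True)
    t.start()

    inserted = 0
    batch = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) == batch_size:
            insert_batch(batch)  # db query runs while downloads continue
            inserted += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        insert_batch(batch)
        inserted += len(batch)
    t.join()
    return inserted
```

The bounded `maxsize` is the key design choice: it lets downloads run ahead of inserts without buffering the whole dataset in memory.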

@gsaluja9 gsaluja9 marked this pull request as draft March 4, 2025 14:53

gsaluja9 commented May 16, 2025

Further, a comparison of ingesting just 5000 images into a Basic cloud instance from 2 different clients.

Run time of the jupyterlab workflow on cloud:

```
> adb utils execute remove_all
Danger
[17:37:46] This will execute remove_all.    utilities.py:29
Are you sure you want to continue? [y/N]: y
> python ingest_streaming.py val_images.adb.csv 100 8
Synced to <queue.Queue object at 0x7fa18015b040>
Progress: 5.60kbatches [04:44, 19.7batches/s]
============ ApertureDB Loader Stats ============
Total time (s): 284.45887088775635
Total queries executed: 56
Avg Query time (s): 4.380231142044067
Query time std: 5.751092935590115
Avg Query Throughput (q/s): 1.8263876358512763
Overall insertion throughput (element/s): 17.577233518489685
Total inserted elements: 5000
Total successful commands: 5000
=================================================
Done
```

Run time from the local dev box:

```
> adb utils execute remove_all
Danger
[13:52:13] This will execute remove_all.    utilities.py:29
Are you sure you want to continue? [y/N]: y
> python ingest_streaming.py val_images.adb.csv 100 8
Synced to <queue.Queue object at 0x7135d8597d90>
Progress: 5.60kbatches [00:55, 101batches/s]
============ ApertureDB Loader Stats ============
Total time (s): 55.247413635253906
Total queries executed: 56
Avg Query time (s): 6.890184359891074
Query time std: 5.404900144666826
Avg Query Throughput (q/s): 1.1610719803913159
Overall insertion throughput (element/s): 90.50197413783461
Total inserted elements: 5000
Total successful commands: 5000
=================================================
Done
```
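The reported stats in both runs are consistent with simple formulas: avg query throughput ≈ workers / avg query time (8 / 4.38 s ≈ 1.83 q/s on cloud, 8 / 6.89 s ≈ 1.16 q/s locally) and overall insertion throughput = elements / total time (5000 / 284.46 s ≈ 17.58 element/s, 5000 / 55.25 s ≈ 90.50 element/s). A hypothetical helper reproducing those numbers — the function name and signature are assumptions, not the loader's actual code:

```python
import statistics

def loader_stats(query_times, total_elements, total_time_s, num_workers):
    """Recompute the ApertureDB Loader Stats lines from per-query wall
    times and the overall run time (sketch; formulas inferred from the
    printed numbers, not taken from the loader's source)."""
    avg_q = statistics.mean(query_times)
    return {
        "total_queries": len(query_times),
        "avg_query_time_s": avg_q,
        "query_time_std_s": statistics.stdev(query_times),
        # with num_workers parallel query threads, each completing a
        # query every avg_q seconds on average:
        "avg_query_throughput_qps": num_workers / avg_q,
        "overall_insertion_throughput_eps": total_elements / total_time_s,
    }
```

Note the local run has a *higher* avg query time but far higher overall throughput, which fits the decoupled design: insertion throughput is bounded by wall time end to end, not by individual query latency.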
