Speed up loading of bounding box csv files #769
Draft
sfmig wants to merge 44 commits into main from smg/parquet-bboxes
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##              main     #769   +/-  ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           34       34
  Lines         2111     2106    -5
=========================================
- Hits          2111     2106    -5
Description
What is this PR
Why is this PR needed?
We noticed that reading a bbox dataset from a VIA tracks file could be very slow, especially for files with a lot of frames and individuals. In this PR I made several changes to improve this.
What does this PR do?
This PR makes changes mainly to the validation of VIA tracks files and their loading as `movement` datasets.

Previously, in the validator I prioritised giving users helpful and specific messages about errors over loading speed. So I would loop through the rows to report all the rows with issues, and keep the logic of each check almost independent. However, this makes everything very slow with very large files. I have now changed the validator to a fail-fast approach, and where possible consolidated the checks into single passes through the data (while keeping the validation checks separate). Now we only report the first error encountered, but in exchange loading files with no errors is faster.
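A minimal sketch of the fail-fast, single-pass idea; the column names and checks are illustrative rather than the actual `movement` validator:

```python
import pandas as pd


def validate_via_rows_fail_fast(df: pd.DataFrame) -> None:
    """Raise on the first invalid row instead of collecting all errors.

    Illustrative only: column names and checks are placeholders.
    """
    # Check the required columns exist up front.
    required = {"filename", "region_shape_attributes", "region_attributes"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    # Vectorised check instead of a Python-level loop over rows:
    # fail on the first row whose shape attributes are empty.
    empty = df["region_shape_attributes"].str.strip().isin(["", "{}"])
    if empty.any():
        first_bad = empty.idxmax()  # index of the first offending row
        raise ValueError(f"Row {first_bad} has empty region_shape_attributes")
```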
There was some redundancy between the validator checks and the loader, because during validation we compute quantities that are also needed to build the dataset. For example, we check whether the frame numbers are defined correctly and whether integers can be extracted from them. At the time we prioritised keeping the validation and the loading logic separate, but again this redundancy slows things down for larger files. I have changed the validator to also pre-parse the data required for loading (x, y, width, height, frame number, track IDs, and confidence values if defined). These data are collected as attributes of the file validator, and the file validator object is the input to the loading function. This blurs the boundary between validator and loader, but reduces the redundancy.
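Conceptually, the validator now also carries the pre-parsed arrays. A hypothetical sketch of that idea (class and attribute names are made up for illustration, not the real `movement` API):

```python
import json
from dataclasses import dataclass, field

import numpy as np
import pandas as pd


@dataclass
class ValidVIATracksCSV:
    """Hypothetical validator that keeps the data it already parsed."""

    path: str
    bbox_xywh: np.ndarray = field(init=False)
    track_ids: np.ndarray = field(init=False)
    frame_numbers: np.ndarray = field(init=False)

    def __post_init__(self) -> None:
        df = pd.read_csv(self.path)
        # Fail-fast checks would run here (see the sketch above); the
        # parsed values are then stored so the loader need not re-parse.
        shapes = [json.loads(s) for s in df["region_shape_attributes"]]
        attrs = [json.loads(s) for s in df["region_attributes"]]
        self.bbox_xywh = np.array(
            [[s["x"], s["y"], s["width"], s["height"]] for s in shapes]
        )
        self.track_ids = np.array([int(a["track"]) for a in attrs])
        # e.g. frame number encoded in the image filename
        self.frame_numbers = (
            df["filename"].str.extract(r"(\d+)", expand=False).astype(int).to_numpy()
        )
```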
The loading module has become quite simple: it now retrieves the relevant data from the validator object, formats it as a dataframe, fills empty rows with NaNs, and extracts the relevant numpy arrays from it.
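On the loading side, the idea is roughly as below. Again a sketch under the assumptions above; function and variable names are placeholders, not the PR's actual code:

```python
import numpy as np
import pandas as pd


def arrays_from_validator(valid_file: "ValidVIATracksCSV") -> dict[str, np.ndarray]:
    """Turn the validator's pre-parsed data into dense per-frame arrays."""
    df = pd.DataFrame(
        {
            "frame": valid_file.frame_numbers,
            "id": valid_file.track_ids,
            "x": valid_file.bbox_xywh[:, 0],
            "y": valid_file.bbox_xywh[:, 1],
            "w": valid_file.bbox_xywh[:, 2],
            "h": valid_file.bbox_xywh[:, 3],
        }
    )
    # Reindex onto the full (frame, id) grid so missing rows become NaN.
    full_index = pd.MultiIndex.from_product(
        [np.arange(df["frame"].max() + 1), np.unique(df["id"])],
        names=["frame", "id"],
    )
    df = df.set_index(["frame", "id"]).reindex(full_index)
    n_frames = df.index.levels[0].size
    n_ids = df.index.levels[1].size
    position = df[["x", "y"]].to_numpy().reshape(n_frames, n_ids, 2)
    shape = df[["w", "h"]].to_numpy().reshape(n_frames, n_ids, 2)
    return {"position": position, "shape": shape}
```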
In the validator, I replaced `json.loads` with `orjson.loads`, which gave us some extra speedups (see the snippet below). Other improvements include moving the attribute parsing from `ast.literal_eval` to `json.loads` and then to `orjson.loads`. I also modified how the path to the pytest plugins is defined in `tests/conftest.py`, since at the moment it requires the tests to be launched from the root directory.
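For the JSON-formatted VIA columns, `orjson.loads` is essentially a drop-in replacement for `json.loads` on the reading side (the example string below is just illustrative):

```python
import json

import orjson

via_shape = '{"name": "rect", "x": 10.0, "y": 20.0, "width": 5.0, "height": 8.0}'

# Both return the same dict; orjson parses the string noticeably faster
# on large files.
assert json.loads(via_shape) == orjson.loads(via_shape)
```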
References
I benchmarked `from_via_tracks_file` on a few files I have access to that represent realistic datasets with many individuals and frames. Below are the median values in seconds, except where specified. When two values are shown, the first one refers to the bulk of the original loading function, `_df_from_via_tracks_file`, which runs after validation; that function is refactored and renamed in this PR.

[Benchmark table: median loading times on `main` vs this PR, for files with 144, 295, 602 and 4968 individuals.]
I attach the `pytest-benchmark` results for the first and the last two columns in the table above (the first column is the `before` json, the last two columns are the `after` json and the `after_orjson` json):

0005_before_small_34mb.json
0011_before_medium_100mb_data.json
0005_after_small_34mb_data.json
0004_after_medium_100mb_data.json
0003_after_medium_400mb_data.json
0006_after_large_2gb_data.json
0007_after_orjson_small_34mb_data.json
0008_after_orjson_medium_100mb_data.json
0009_after_orjson_medium_400mb_data.json
0010_after_orjson_large_2gb_data.json
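For context, a minimal `pytest-benchmark` test in this style would look like the following; the file names are placeholders, and the benchmark suite used for the numbers above may be organised differently:

```python
import pytest

from movement.io import load_bboxes


@pytest.mark.parametrize(
    "via_file",
    ["small_34mb.csv", "medium_100mb.csv"],  # placeholder file names
)
def test_from_via_tracks_file_speed(benchmark, via_file):
    # pytest-benchmark calls the function repeatedly and reports the
    # median, which is the statistic quoted in the table above.
    ds = benchmark(load_bboxes.from_via_tracks_file, via_file)
    assert ds is not None
```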
orjson vs json
I would like to know others' opinions on replacing `json` with `orjson`, since it can be significantly faster. The migration seems worth a separate PR; the main differences between them are that `orjson.dumps()` returns bytes, not str, and that it does not have an `indent` parameter. Without this being an exhaustive review, I highlighted some of the differences from their docs and from consulting with Claude; apparently file I/O also differs.
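A small sketch of the two differences mentioned above, as I understand them from the orjson docs (output file names are placeholders):

```python
import json

import orjson

data = {"frame": 1, "bbox": [10.0, 20.0, 5.0, 8.0]}

# json.dumps returns a str; orjson.dumps returns bytes and has no
# indent= parameter (it uses an option flag instead).
as_str = json.dumps(data, indent=2)
as_bytes = orjson.dumps(data, option=orjson.OPT_INDENT_2)

# File I/O therefore differs: json writes to a text-mode handle,
# while orjson output must be written as bytes.
with open("out_json.json", "w") as f:
    json.dump(data, f, indent=2)

with open("out_orjson.json", "wb") as f:
    f.write(orjson.dumps(data, option=orjson.OPT_INDENT_2))
```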
One option could be to allow them to co-exist in `movement` for a while (we use `orjson` for loading VIA tracks files for now, and open an issue for the migration in the future).

How has this PR been tested?
Tests pass locally and in CI.
Is this a breaking change?
No, only private functions are modified.
Does this PR require an update to the documentation?
No.
Checklist: