Speed up loading of bounding box csv files #769
Draft
sfmig wants to merge 44 commits into main from smg/parquet-bboxes
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##              main     #769   +/-  ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           34       34
  Lines         2111     2106    -5
=========================================
- Hits          2111     2106    -5
Description
What is this PR
Why is this PR needed?
We noticed that reading a bbox dataset from a VIA tracks file could be very slow, especially for files with a lot of frames and individuals. In this PR I made several changes to improve this.
What does this PR do?
This PR makes changes mainly to the validation of VIA tracks files and their loading as `movement` datasets.

Previously, in the validator I prioritised giving users helpful and specific messages about errors over loading speed. So I would loop through the rows to report all the rows with issues, and keep the logic of each check almost independent. However, this makes everything very slow with very large files. I have now changed the validator to a fail-fast approach, and where possible consolidated the checks into single passes through the data (while keeping the validation checks separate). Now we only report the first error encountered, but in exchange loading files with no errors is faster.
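A minimal sketch of the fail-fast, single-pass idea; the column names and checks are illustrative rather than the actual `movement` validator:

```python
import pandas as pd


def validate_via_rows_fail_fast(df: pd.DataFrame) -> None:
    """Raise on the first invalid row instead of collecting all errors.

    Illustrative only: column names and checks are placeholders.
    """
    # Check the required columns exist up front.
    required = {"filename", "region_shape_attributes", "region_attributes"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    # Vectorised check instead of a Python-level loop over rows:
    # fail on the first row whose shape attributes are empty.
    empty = df["region_shape_attributes"].str.strip().isin(["", "{}"])
    if empty.any():
        first_bad = empty.idxmax()  # index of the first offending row
        raise ValueError(f"Row {first_bad} has empty region_shape_attributes")
```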
There was some redundancy between the validator checks and the loader, because during validation we compute quantities that are also needed to build the dataset. For example, we check whether the frame numbers are defined correctly and whether integers can be extracted from them. At the time we prioritised keeping the validation and the loading logic separate, but again this redundancy slows things down for larger files. I have changed the validator to also pre-parse the data required for loading (x, y, width, height, frame number, track IDs, and confidence values if defined). These data are collected as attributes of the file validator, and the file validator object is the input to the loading function. This blurs the boundary between validator and loader, but reduces the redundancy.
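Conceptually, the validator now also carries the pre-parsed arrays. A hypothetical sketch of that idea (class and attribute names are made up for illustration, not the real `movement` API):

```python
import json
from dataclasses import dataclass, field

import numpy as np
import pandas as pd


@dataclass
class ValidVIATracksCSV:
    """Hypothetical validator that keeps the data it already parsed."""

    path: str
    bbox_xywh: np.ndarray = field(init=False)
    track_ids: np.ndarray = field(init=False)
    frame_numbers: np.ndarray = field(init=False)

    def __post_init__(self) -> None:
        df = pd.read_csv(self.path)
        # Fail-fast checks would run here (see the sketch above); the
        # parsed values are then stored so the loader need not re-parse.
        shapes = [json.loads(s) for s in df["region_shape_attributes"]]
        attrs = [json.loads(s) for s in df["region_attributes"]]
        self.bbox_xywh = np.array(
            [[s["x"], s["y"], s["width"], s["height"]] for s in shapes]
        )
        self.track_ids = np.array([int(a["track"]) for a in attrs])
        # e.g. frame number encoded in the image filename
        self.frame_numbers = (
            df["filename"].str.extract(r"(\d+)", expand=False).astype(int).to_numpy()
        )
```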
The loading module has become quite simple: it now retrieves the relevant data from the validator object, formats it as a dataframe, fills empty rows with NaNs, and extracts the relevant numpy arrays from it.
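On the loading side, the idea is roughly as below. Again a sketch under the assumptions above; function and variable names are placeholders, not the PR's actual code:

```python
import numpy as np
import pandas as pd


def arrays_from_validator(valid_file: "ValidVIATracksCSV") -> dict[str, np.ndarray]:
    """Turn the validator's pre-parsed data into dense per-frame arrays."""
    df = pd.DataFrame(
        {
            "frame": valid_file.frame_numbers,
            "id": valid_file.track_ids,
            "x": valid_file.bbox_xywh[:, 0],
            "y": valid_file.bbox_xywh[:, 1],
            "w": valid_file.bbox_xywh[:, 2],
            "h": valid_file.bbox_xywh[:, 3],
        }
    )
    # Reindex onto the full (frame, id) grid so missing rows become NaN.
    full_index = pd.MultiIndex.from_product(
        [np.arange(df["frame"].max() + 1), np.unique(df["id"])],
        names=["frame", "id"],
    )
    df = df.set_index(["frame", "id"]).reindex(full_index)
    n_frames = df.index.levels[0].size
    n_ids = df.index.levels[1].size
    position = df[["x", "y"]].to_numpy().reshape(n_frames, n_ids, 2)
    shape = df[["w", "h"]].to_numpy().reshape(n_frames, n_ids, 2)
    return {"position": position, "shape": shape}
```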
In the validator, I replaced `json.loads` with `orjson.loads`, which gave us some extra speedups (see the snippet below). Other improvements include moving the attribute parsing from `ast.literal_eval` to `json.loads` and then to `orjson.loads`. I also modified how the path to the pytest plugins is defined in `tests/conftest.py`, since at the moment it requires the tests to be launched from the root directory.
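For the JSON-formatted VIA columns, `orjson.loads` is essentially a drop-in replacement for `json.loads` on the reading side (the example string below is just illustrative):

```python
import json

import orjson

via_shape = '{"name": "rect", "x": 10.0, "y": 20.0, "width": 5.0, "height": 8.0}'

# Both return the same dict; orjson parses the string noticeably faster
# on large files.
assert json.loads(via_shape) == orjson.loads(via_shape)
```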
References
I benchmarked `from_via_tracks_file` on a few files I have access to that represent realistic datasets with many individuals and frames. Below are the median values in seconds, except where specified. When two values are shown, the first one refers to the bulk of the original loading function, `_df_from_via_tracks_file`, which runs after validation; that function is refactored and renamed in this PR.

[Benchmark table: median loading times on `main` vs this PR, for files with 144, 295, 602 and 4968 individuals.]
I attach the `pytest-benchmark` results for the first and the last two columns in the table above (the first column is the `before` json, the last two columns are the `after` json and the `after_orjson` json):

0005_before_small_34mb.json
0011_before_medium_100mb_data.json
0005_after_small_34mb_data.json
0004_after_medium_100mb_data.json
0003_after_medium_400mb_data.json
0006_after_large_2gb_data.json
0007_after_orjson_small_34mb_data.json
0008_after_orjson_medium_100mb_data.json
0009_after_orjson_medium_400mb_data.json
0010_after_orjson_large_2gb_data.json
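For context, a minimal `pytest-benchmark` test in this style would look like the following; the file names are placeholders, and the benchmark suite used for the numbers above may be organised differently:

```python
import pytest

from movement.io import load_bboxes


@pytest.mark.parametrize(
    "via_file",
    ["small_34mb.csv", "medium_100mb.csv"],  # placeholder file names
)
def test_from_via_tracks_file_speed(benchmark, via_file):
    # pytest-benchmark calls the function repeatedly and reports the
    # median, which is the statistic quoted in the table above.
    ds = benchmark(load_bboxes.from_via_tracks_file, via_file)
    assert ds is not None
```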
orjson vs json
I would like to know others' opinions on replacing `json` with `orjson`, since it can be significantly faster. The migration seems worth a separate PR; the main differences between them are that `orjson.dumps()` returns bytes, not str, and that it does not have an `indent` parameter. Without this being an exhaustive review, I highlighted some of the differences from their docs and from consulting with Claude; apparently file I/O also differs.
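A small sketch of the two differences mentioned above, as I understand them from the orjson docs (output file names are placeholders):

```python
import json

import orjson

data = {"frame": 1, "bbox": [10.0, 20.0, 5.0, 8.0]}

# json.dumps returns a str; orjson.dumps returns bytes and has no
# indent= parameter (it uses an option flag instead).
as_str = json.dumps(data, indent=2)
as_bytes = orjson.dumps(data, option=orjson.OPT_INDENT_2)

# File I/O therefore differs: json writes to a text-mode handle,
# while orjson output must be written as bytes.
with open("out_json.json", "w") as f:
    json.dump(data, f, indent=2)

with open("out_orjson.json", "wb") as f:
    f.write(orjson.dumps(data, option=orjson.OPT_INDENT_2))
```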
One option could be to allow them to co-exist in `movement` for a while (we use `orjson` for loading VIA tracks files for now, and open an issue for the migration in the future).

How has this PR been tested?
Tests pass locally and in CI.
Is this a breaking change?
No, only private functions are modified.
Does this PR require an update to the documentation?
No.
Checklist: