@sfmig sfmig commented Jan 20, 2026

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
We noticed that reading a bbox dataset from a VIA tracks file could be very slow, especially for files with a lot of frames and individuals. In this PR I made several changes to improve this.

What does this PR do?
This PR mainly changes the validation of VIA tracks files and their loading as movement datasets.

  • Previously, the validator prioritised giving helpful and specific error messages over loading speed: it looped through all rows to report every row with issues, and kept the logic of each check almost independent. However, this made things very slow for large files. I have now changed the validator to a fail-fast approach and, where possible, to single passes through the data (while keeping the validation checks separate). We now only report the first error encountered, but in exchange loading files with no errors is faster.

  • There was some redundancy between the validator checks and the loader, because during validation we compute quantities that are also needed to build the dataset. For example, we check whether the frame numbers are correctly defined and whether integers can be extracted from them. At the time we prioritised keeping the validation and loading logic separate, but this redundancy slows things down for larger files. I have changed the validator to also pre-parse the data required for loading (x, y, width, height, frame number, track IDs, and confidence values if defined). These data are collected as attributes of the file validator, and the file validator object is the input to the loading function. This blurs the boundary between validator and loader, but removes the redundancy.

  • The loading module is now much simpler: it retrieves the relevant data from the validation object, formats it as a dataframe, fills empty rows with NaNs, and extracts the relevant numpy arrays from it.

  • In the validator, I replaced json.loads with orjson.loads, which gave us some extra speedups. Other improvements include:

    • ast.literal_eval --> json.loads --> orjson.loads
    • replacing split and stack with a more efficient approach
    • replacing float64 numpy arrays with float32 (consistent with poses)
      • this also affects the exporting function
    • adapting the tests
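The fail-fast, single-pass idea can be sketched as follows. This is a minimal illustration, not movement's actual API: the function and column names are hypothetical, and stdlib `json` stands in for `orjson` (both expose the same `loads()` call).

```python
import json  # the PR swaps this for orjson, which has the same loads() API


def validate_and_parse_regions(rows):
    """Single pass over the file rows: raise on the first malformed row
    (fail-fast) and cache the parsed values so the loader does not need
    to re-parse them. Names here are illustrative, not movement's API.
    """
    parsed = []
    for i, row in enumerate(rows):
        try:
            region = json.loads(row["region_shape_attributes"])
        except json.JSONDecodeError as e:
            raise ValueError(f"Row {i}: invalid region_shape_attributes") from e
        if region.get("name") != "rect":
            raise ValueError(f"Row {i}: expected a 'rect' shape")
        # Pre-parse the quantities the loader needs (x, y, width, height).
        parsed.append(
            (region["x"], region["y"], region["width"], region["height"])
        )
    return parsed
```

A loader receiving `parsed` can then build the dataframe directly instead of re-parsing every row a second time.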

I also modified how the path to the pytest plugins is defined (in tests/conftest.py), since currently it requires the tests to be launched from the root directory.
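For reference, resolving plugin paths relative to the conftest file (rather than the current working directory) can look roughly like this. The `fixtures` directory name and the `tests.fixtures` module prefix are assumptions for illustration, not necessarily movement's layout.

```python
from pathlib import Path


def discover_fixture_plugins(conftest_file):
    """Return dotted module names for the fixture modules that sit next
    to conftest.py, resolved relative to the conftest file itself so the
    result does not depend on the current working directory.
    (Directory layout and module prefix are assumptions.)
    """
    fixtures_dir = Path(conftest_file).parent / "fixtures"
    return sorted(
        f"tests.fixtures.{p.stem}"
        for p in fixtures_dir.glob("*.py")
        if p.stem != "__init__"
    )


# In tests/conftest.py one would then write:
# pytest_plugins = discover_fixture_plugins(Path(__file__))
```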

References

I benchmarked from_via_tracks_file on a few files I have access to, which represent realistic datasets with many individuals and frames. Below are the median values in seconds, except where specified.

When two values are shown, the first refers to the bulk of the original loading function, _df_from_via_tracks_file, which runs after validation. That function is refactored and renamed in this PR.

| VIA tracks csv file (size on disk) | n coords | before (main) | after loading improvements | after validator improvements | plus orjson |
|---|---|---|---|---|---|
| small (34MB) | time: 2067<br>individuals: 144 | 23 // 36 | 0.8 // 9 | .. // 0.89 | .. // 0.53 |
| medium (100MB) | time: 6553<br>individuals: 295 | 1min 8s // 2min 33s | 2.6 // 75.1 | .. // 2.7 | .. // 1.7 |
| medium (400MB) | time: 27489<br>individuals: 602 | | | .. // 11.7 | .. // 8.0 |
| large (1.8GB) | time: 108922<br>individuals: 4968 | | | .. // ~3min | .. // 2min 32s |

I attach the pytest-benchmark results for the first and the last two columns in the table above (the first column is the before json, the last two columns are the after json and the after_orjson json).
0005_before_small_34mb.json
0011_before_medium_100mb_data.json

0005_after_small_34mb_data.json
0004_after_medium_100mb_data.json
0003_after_medium_400mb_data.json
0006_after_large_2gb_data.json

0007_after_orjson_small_34mb_data.json
0008_after_orjson_medium_100mb_data.json
0009_after_orjson_medium_400mb_data.json
0010_after_orjson_large_2gb_data.json

orjson vs json

I would like to hear others' opinions on replacing json with orjson, since orjson can be significantly faster.

The migration seems worth a separate PR. The main differences between the two libraries are that orjson.dumps() returns bytes, not str, and that it does not have an indent parameter. Without this being an exhaustive review, I highlight some of the differences from their docs and from consulting Claude:

```python
# Standard json - returns str
json.dumps({"key": "value"})  # '{"key": "value"}'

# orjson - returns bytes
orjson.dumps({"key": "value"})  # b'{"key": "value"}'

# If you need a string (common for CSV columns, attributes, etc.):
orjson.dumps({"key": "value"}).decode("utf-8")

# Re: indent - before
json.dumps(data, indent=2)

# after
orjson.dumps(data, option=orjson.OPT_INDENT_2).decode("utf-8")
```

Apparently file I/O also differs:

```python
# Before - json.load() / json.dump() with file objects
with open("file.json") as f:
    data = json.load(f)

with open("file.json", "w") as f:
    json.dump(data, f)

# After - orjson doesn't have load()/dump(), use loads()/dumps()
with open("file.json", "rb") as f:  # note: 'rb' for bytes
    data = orjson.loads(f.read())

with open("file.json", "wb") as f:  # note: 'wb' for bytes
    f.write(orjson.dumps(data))
```

One option could be to let the two libraries co-exist in movement for a while: use orjson for loading VIA tracks files now, and open an issue for a full migration later.
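One way to let them co-exist is a thin wrapper that prefers orjson when installed and falls back to the stdlib. This is only a sketch of the idea, not an agreed design; the function names are hypothetical.

```python
import json

try:
    import orjson
except ImportError:
    orjson = None


def loads(text):
    """Parse JSON with orjson when available, stdlib json otherwise.
    Both accept str; orjson also accepts bytes directly."""
    if orjson is not None:
        return orjson.loads(text)
    return json.loads(text)


def dumps(obj):
    """Serialise to str regardless of backend (orjson returns bytes).
    Compact separators make the stdlib output match orjson's."""
    if orjson is not None:
        return orjson.dumps(obj).decode("utf-8")
    return json.dumps(obj, separators=(",", ":"))
```

Callers then always receive `str` from `dumps()`, so the bytes/str difference stays contained in one module.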

How has this PR been tested?

Tests pass locally and in CI.

Is this a breaking change?

No, only private functions are modified.

Does this PR require an update to the documentation?

No.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@sfmig sfmig force-pushed the smg/parquet-bboxes branch from 9faa8ba to b4c944a Compare January 20, 2026 12:07

codecov bot commented Jan 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (b3f5a9c) to head (5e99927).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #769   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           34        34           
  Lines         2111      2106    -5     
=========================================
- Hits          2111      2106    -5     

@sfmig sfmig force-pushed the smg/parquet-bboxes branch from df74f48 to 2399ca8 Compare January 20, 2026 19:36
@sfmig sfmig mentioned this pull request Jan 22, 2026
6 tasks
@sfmig sfmig force-pushed the smg/parquet-bboxes branch 2 times, most recently from d7534a8 to 43da863 Compare January 22, 2026 10:36
@sfmig sfmig force-pushed the smg/parquet-bboxes branch 2 times, most recently from a538200 to e43a9d9 Compare January 26, 2026 16:49
@sfmig sfmig marked this pull request as ready for review January 26, 2026 17:07
@sfmig sfmig marked this pull request as draft January 26, 2026 17:47
@sfmig sfmig force-pushed the smg/parquet-bboxes branch from bb47e28 to f086f05 Compare January 27, 2026 18:32
@sfmig sfmig force-pushed the smg/parquet-bboxes branch from 5663d23 to 3419670 Compare January 27, 2026 18:40