Use git sparse-checkout in participant_job for more efficient dataset cloning #337

Merged
tien-tong merged 4 commits into main from git-sparse-checkout on Mar 2, 2026

@tien-tong tien-tong commented Feb 27, 2026

Summary

This branch switches participant_job.sh to use git sparse-checkout when cloning, instead of checking out the full repository tree.


1. participant_job: sparse-checkout

File: participant_job.sh.jinja2

Participant jobs now clone the analysis dataset with:

# Clone the dataset without checking out the working tree.
# This initializes the Git repository and DataLad metadata,
# but does not populate any files on disk.
datalad clone ... -- --no-checkout

# Enable sparse-checkout in "cone" mode (optimized for directory-level patterns).
# This allows us to specify top-level directories to materialize.
git sparse-checkout init --cone

# Restrict the working tree to only the required paths:
git sparse-checkout set code containers <input_dataset paths>

# Populate the working tree with only the paths defined above
git checkout -f
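
The git steps above could also be assembled programmatically, for example when rendering the job template. The following is a minimal sketch, not the template's actual code: the function name `sparse_checkout_commands` and the example path `inputs/data/BIDS` are hypothetical, and the `datalad clone` step is left out because its arguments are elided in the description.

```python
from typing import List, Sequence


def sparse_checkout_commands(input_paths: Sequence[str]) -> List[List[str]]:
    """Return the git commands that sparsely populate a --no-checkout clone.

    `input_paths` stands in for the <input_dataset paths> placeholder above.
    """
    # code/ and containers/ are always materialized; input paths are appended
    # in order, skipping duplicates.
    paths = ["code", "containers"]
    for path in input_paths:
        if path not in paths:
            paths.append(path)
    return [
        ["git", "sparse-checkout", "init", "--cone"],
        ["git", "sparse-checkout", "set", *paths],
        ["git", "checkout", "-f"],
    ]


if __name__ == "__main__":
    # Hypothetical input dataset path, for illustration only.
    for cmd in sparse_checkout_commands(["inputs/data/BIDS"]):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, cwd=clone_dir, check=True)`, where `clone_dir` is the freshly cloned dataset.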

Key changes

  • Sparse checkout: Only required paths are materialized instead of the full tree.

  • Non-zipped BIDS inputs: Sparse-checkout is applied per dataset so that only:

    • the current subject/session, and
    • dataset_description.json
      are present.

    This avoids full-dataset scans (e.g., via BIDSLayout) over unretrieved content.

  • Cleanup refactor: Replaces manual datalad drop / rm -rf calls with a single trap cleanup EXIT, ensuring consistent cleanup behavior.
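
For the non-zipped BIDS case, the per-dataset path list can be sketched as below. This is an illustration under assumptions: the helper name `bids_sparse_paths` is hypothetical, and the paths are assumed to be relative to the root of each input dataset.

```python
from typing import List, Optional


def bids_sparse_paths(subject: str, session: Optional[str] = None) -> List[str]:
    """Paths to keep in one non-zipped BIDS input dataset for a single job."""
    # Session-level jobs keep only <subject>/<session>; subject-level jobs
    # keep the whole subject directory.
    subject_path = f"{subject}/{session}" if session else subject
    # dataset_description.json is kept alongside the subject data, as
    # described in the bullet above.
    return [subject_path, "dataset_description.json"]
```

The result would be passed to `git sparse-checkout set` inside the corresponding input dataset.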


2. Shared container image handling

File: participant_job.sh.jinja2

After sparse-checkout, jobs no longer re-fetch containers via DataLad. Instead:

The container image is symlinked from:

${PROJECT_ROOT}/analysis/containers/.datalad/environments/<container_name>/image

into the job’s local containers/ tree.

This avoids redundant container downloads across jobs and prevents failures that can occur when hundreds of jobs simultaneously attempt to datalad get the same file.
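
The linking step can be sketched in Python as follows. The source path follows the description above; the link location inside the job clone (the same relative path under `containers/`), the function name `link_shared_image`, and the assumption that the image is a single file are all illustrative guesses, not the template's actual behavior.

```python
import os
from pathlib import Path


def link_shared_image(project_root: Path, job_dir: Path, container_name: str) -> Path:
    """Symlink the shared container image into a job's local containers/ tree."""
    # Shared image location, as given in the PR description.
    source = (project_root / "analysis" / "containers" / ".datalad"
              / "environments" / container_name / "image")
    # ASSUMPTION: the link mirrors the same relative path in the job clone.
    target = (job_dir / "containers" / ".datalad"
              / "environments" / container_name / "image")
    target.parent.mkdir(parents=True, exist_ok=True)
    # Replace any existing entry, including a dangling symlink left by the
    # checkout for content that was never retrieved.
    if target.is_symlink() or target.exists():
        target.unlink()
    os.symlink(source, target)
    return target
```

Because every job points at the same file on disk, no job needs to run `datalad get` for the image.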


3. Zip discovery (sparse-checkout safe)

File: determine_zipfilename.sh.jinja2

Replaces filesystem-based find with a Git-tree-based approach:

git -C "${zip_search_path}" ls-tree -r --name-only HEAD

combined with a grep pattern to identify the zip file.

This allows zip filename resolution directly from the Git tree, making it compatible with --no-checkout and sparse-checkout workflows, without relying on materialized files.
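
A rough Python equivalent of this lookup is sketched below. The `git ls-tree` call is the one shown above; the zip naming pattern (`<subject>[_<session>]_*.zip`) and the function names are assumptions made for illustration, since the actual grep pattern is not given here.

```python
import re
import subprocess
from typing import List, Optional, Sequence


def list_tracked_files(repo_path: str) -> List[str]:
    """List every file path recorded in HEAD, whether or not it is checked out."""
    result = subprocess.run(
        ["git", "-C", repo_path, "ls-tree", "-r", "--name-only", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


def find_zip(tracked_files: Sequence[str], subject: str,
             session: Optional[str] = None) -> str:
    """Pick the single zip file that belongs to one subject (and session)."""
    # ASSUMPTION: zips are named <subject>[_<session>]_<anything>.zip.
    prefix = f"{subject}_{session}_" if session else f"{subject}_"
    pattern = re.compile(rf"(^|/){re.escape(prefix)}.*\.zip$")
    matches = [path for path in tracked_files if pattern.search(path)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one zip for {prefix!r}, found {len(matches)}")
    return matches[0]
```

Usage would look like `find_zip(list_tracked_files(zip_search_path), "sub-01", "ses-01")`. Because the listing comes from the Git tree, it works even when no files are present in the working tree.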


4. Merge and status fixes

merge.py

  • Job branch refs are now resolved as origin/<branch> (e.g., origin/job-0001-sub-01) in the merge clone.
  • These refs are passed to get_git_show_ref_shasum, ensuring correct merge behavior when working with remote refs.
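
To illustrate the remote-ref resolution, here is a small sketch that builds the `origin/<branch>` name and looks up its commit hash in `git show-ref` output. It does not reproduce `get_git_show_ref_shasum`, whose signature is not shown in this PR; `remote_ref` and `sha_from_show_ref` are hypothetical helpers.

```python
def remote_ref(branch: str, remote: str = "origin") -> str:
    """Name a job branch as it appears in a clone, e.g. origin/job-0001-sub-01."""
    return f"{remote}/{branch}"


def sha_from_show_ref(show_ref_output: str, ref: str) -> str:
    """Return the commit hash for `ref` from `git show-ref` output.

    Each output line has the form "<sha> <full refname>", for example
    "<sha> refs/remotes/origin/job-0001-sub-01".
    """
    for line in show_ref_output.splitlines():
        sha, _, name = line.partition(" ")
        # Match either the full ref name or its trailing path components.
        if name == ref or name.endswith("/" + ref):
            return sha
    raise KeyError(f"ref not found: {ref}")
```

In a fresh merge clone the job branches exist only as remote-tracking refs, which is why the lookup uses the `origin/` prefix rather than a local branch name.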

base.py

  • If output_ria_data_dir points to the RIA store root (without a .git directory), it is resolved via the RIA alias/data symlink so that the actual dataset path is used.
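
A minimal sketch of this resolution is shown below. It follows the wording of the bullet literally and assumes the store exposes the dataset through an `alias/data` symlink under the store root; the function name `resolve_ria_dataset_path` is hypothetical and the real logic in base.py may differ.

```python
from pathlib import Path


def resolve_ria_dataset_path(output_ria_data_dir: Path) -> Path:
    """Return the dataset path, following alias/data when given the store root."""
    # A path that already contains .git is used as-is.
    if (output_ria_data_dir / ".git").exists():
        return output_ria_data_dir
    # ASSUMPTION: the store root exposes the dataset via an alias/data symlink.
    alias = output_ria_data_dir / "alias" / "data"
    if alias.exists():
        return alias.resolve()
    # Nothing to resolve; hand back the original path unchanged.
    return output_ria_data_dir
```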

utils.py

  • get_results_branches_from_clone(clone_path)
    Lists job branches via git branch -r in a clone (used by merge_ds), avoiding slow or hanging RIA listing in CI.

  • get_results_branches_from_ria(ria_data_dir, timeout)
    Lists job branches via git ls-remote --heads with a timeout to prevent CI hangs.

  • update_results_status
    Uses:

    .infer_objects(copy=False).astype(bool)

    when filling has_results, avoiding pandas dtype issues in boolean logic.
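
The two branch-listing helpers above can be sketched as follows. This is an illustration rather than the code in utils.py: the function names are hypothetical, and the `job-` prefix filter is an assumption based on the example branch name `job-0001-sub-01`.

```python
import subprocess
from typing import List

JOB_PREFIX = "job-"  # ASSUMPTION: job branches are named job-<id>-<subject>...


def parse_branch_r(output: str, remote: str = "origin") -> List[str]:
    """Extract job branch names from `git branch -r` output."""
    branches = []
    for line in output.splitlines():
        name = line.strip()
        # Skip blank lines and the symbolic "origin/HEAD -> origin/main" entry.
        if not name or " -> " in name:
            continue
        if name.startswith(remote + "/"):
            name = name[len(remote) + 1:]
        if name.startswith(JOB_PREFIX):
            branches.append(name)
    return branches


def parse_ls_remote_heads(output: str) -> List[str]:
    """Extract job branch names from `git ls-remote --heads` output."""
    branches = []
    for line in output.splitlines():
        # Each line is "<sha>\trefs/heads/<branch>".
        _sha, _, ref = line.partition("\t")
        if ref.startswith("refs/heads/"):
            name = ref[len("refs/heads/"):]
            if name.startswith(JOB_PREFIX):
                branches.append(name)
    return branches


def list_job_branches_from_ria(ria_data_dir: str, timeout: float = 30.0) -> List[str]:
    """Query the RIA dataset directly; raises subprocess.TimeoutExpired on a hang."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", ria_data_dir],
        check=True,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return parse_ls_remote_heads(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the helpers testable without a real RIA store, and the `timeout` argument turns a hung listing into an exception the caller can handle.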

@tien-tong tien-tong linked an issue Feb 27, 2026 that may be closed by this pull request

codecov-commenter commented Feb 28, 2026

Codecov Report

❌ Patch coverage is 84.37500% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.07%. Comparing base (c56b73d) to head (1faf33a).

Files with missing lines   Patch %   Lines
babs/utils.py              83.33%    2 Missing and 2 partials ⚠️
babs/base.py               75.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
+ Coverage   79.02%   79.07%   +0.04%     
==========================================
  Files          16       16              
  Lines        1812     1840      +28     
  Branches      304      312       +8     
==========================================
+ Hits         1432     1455      +23     
- Misses        266      268       +2     
- Partials      114      117       +3     


@tien-tong tien-tong requested a review from mattcieslak March 2, 2026 14:45
@tien-tong tien-tong changed the title from "Use git sparse-checkout" to "Use git sparse-checkout in participant_job for more efficient dataset cloning" Mar 2, 2026
@tien-tong tien-tong added the enhancement label Mar 2, 2026
@tien-tong tien-tong merged commit 5740ba2 into main Mar 2, 2026
10 checks passed
@tien-tong tien-tong deleted the git-sparse-checkout branch March 2, 2026 14:51

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Reduce inode usage in participant_job.sh

2 participants