Use git sparse-checkout in participant_job for more efficient dataset cloning #337
Merged
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
+ Coverage   79.02%   79.07%   +0.04%
==========================================
  Files          16       16
  Lines        1812     1840      +28
  Branches      304      312       +8
==========================================
+ Hits         1432     1455      +23
- Misses        266      268       +2
- Partials      114      117       +3
Summary
This branch switches `participant_job.sh` to use `git sparse-checkout` when cloning, instead of checking out the full repository tree.

1. participant_job: sparse-checkout
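As a rough illustration of the sparse-checkout idea, the sketch below builds the sequence of plain `git` commands for a clone that materializes only selected directories. It is not the template's actual code: the function names, the use of cone mode, and the example paths are assumptions made for illustration, and the real job script may go through DataLad instead.

```python
import subprocess
from typing import List


def sparse_clone_commands(
    source: str, dest: str, branch: str, directories: List[str]
) -> List[List[str]]:
    """Build git commands for a clone that only materializes `directories`.

    Hypothetical helper: cone mode is assumed, in which files at the
    repository root (e.g. dataset_description.json) are always included.
    """
    return [
        # Clone history and refs, but do not populate the working tree yet.
        ["git", "clone", "--no-checkout", source, dest],
        # Enable sparse-checkout in cone (directory-based) mode.
        ["git", "-C", dest, "sparse-checkout", "init", "--cone"],
        # Restrict the working tree to the requested directories.
        ["git", "-C", dest, "sparse-checkout", "set", *directories],
        # Populate the working tree with only the selected paths.
        ["git", "-C", dest, "checkout", branch],
    ]


def run_sparse_clone(
    source: str, dest: str, branch: str, directories: List[str]
) -> None:
    """Execute the commands in order, stopping at the first failure."""
    for cmd in sparse_clone_commands(source, dest, branch, directories):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes the sequence easy to inspect or log before anything touches the filesystem.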
File: `participant_job.sh.jinja2`

Participant jobs now clone the analysis dataset with:

Key changes

- Sparse checkout: Only required paths are materialized instead of the full tree.
- Non-zipped BIDS inputs: Sparse-checkout is applied per dataset so that only:
  `dataset_description.json`
  are present. This avoids full-dataset scans (e.g., via `BIDSLayout`) over unretrieved content.
- Cleanup refactor: Replaces manual `datalad drop` / `rm -rf` calls with a single `trap cleanup EXIT`, ensuring consistent cleanup behavior.

2. Shared container image handling
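Sharing one container image across jobs through a symlink could look roughly like the sketch below. The function name and the directory layout are hypothetical; only the idea of linking into a job-local `containers/` tree comes from this PR.

```python
from pathlib import Path


def link_shared_container(shared_image: Path, job_dir: Path) -> Path:
    """Symlink an already-downloaded image into the job's containers/ tree.

    Hypothetical helper: `shared_image` is assumed to exist in a location
    that every job can read, so no job needs to fetch its own copy.
    """
    target_dir = job_dir / "containers"
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / shared_image.name
    # Replace a stale link or placeholder file left by an earlier attempt.
    if link.is_symlink() or link.is_file():
        link.unlink()
    link.symlink_to(shared_image)
    return link
```

Because each job only creates a link, hundreds of concurrent jobs never compete to download the same file.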
File: `participant_job.sh.jinja2`

After sparse-checkout, jobs no longer re-fetch containers via DataLad. Instead:

- The container image is symlinked from:
  into the job’s local `containers/` tree.

This avoids redundant container downloads across jobs and prevents failures that can occur when hundreds of jobs simultaneously attempt to `datalad get` the same file.

3. Zip discovery (sparse-checkout safe)
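A minimal sketch of resolving a zip filename from the Git tree rather than from the filesystem follows. The helper names and the filename pattern (`<subject>[_<session>]_*.zip`) are assumptions for illustration; the template itself uses a shell pipeline with its own `grep` pattern.

```python
import re
import subprocess
from typing import List, Optional


def list_tracked_files(repo_path: str) -> List[str]:
    """List every path recorded in HEAD, whether or not it is checked out."""
    result = subprocess.run(
        ["git", "-C", repo_path, "ls-tree", "-r", "--name-only", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


def pick_zip(
    names: List[str], subject: str, session: Optional[str] = None
) -> Optional[str]:
    """Return the first tracked zip whose basename starts with the given IDs.

    The naming scheme matched here is assumed, not taken from the PR.
    """
    prefix = subject if session is None else f"{subject}_{session}"
    pattern = re.compile(rf"(^|/){re.escape(prefix)}_[^/]*\.zip$")
    for name in names:
        if pattern.search(name):
            return name
    return None
```

Since `git ls-tree` reads the committed tree, `pick_zip(list_tracked_files(path), ...)` works even in a `--no-checkout` or sparse clone where the zip file itself is absent from disk.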
File: `determine_zipfilename.sh.jinja2`

Replaces filesystem-based `find` with a Git-tree-based approach: `git -C "${zip_search_path}" ls-tree -r --name-only HEAD` combined with a `grep` pattern to identify the zip file.

This allows zip filename resolution directly from the Git tree, making it compatible with `--no-checkout` and sparse-checkout workflows, without relying on materialized files.

4. Merge and status fixes
`merge.py`
- `origin/<branch>` (e.g., `origin/job-0001-sub-01`) in the merge clone.
- `get_git_show_ref_shasum`, ensuring correct merge behavior when working with remote refs.

`base.py`
- When `output_ria_data_dir` points to the RIA store root (without a `.git` directory), it is resolved via the RIA `alias/data` symlink so that the actual dataset path is used.

`utils.py`
- `get_results_branches_from_clone(clone_path)`: Lists job branches via `git branch -r` in a clone (used by `merge_ds`), avoiding slow or hanging RIA listing in CI.
- `get_results_branches_from_ria(ria_data_dir, timeout)`: Lists job branches via `git ls-remote --heads` with a timeout to prevent CI hangs.
- `update_results_status`: Uses:
  when filling `has_results`, avoiding pandas dtype issues in boolean logic.
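Listing job branches with a timeout, as described for `utils.py` above, might be sketched as follows. This is not the PR's implementation: the function names, the `job-` prefix filter, and the choice to return an empty list on timeout are all assumptions for illustration.

```python
import subprocess
from typing import List


def parse_job_branches(ls_remote_output: str, prefix: str = "job-") -> List[str]:
    """Extract branch names from `git ls-remote --heads` output.

    Each line has the form "<sha>\trefs/heads/<branch>"; only branches
    starting with `prefix` (an assumed naming convention) are kept.
    """
    heads = "refs/heads/"
    branches = []
    for line in ls_remote_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2 or not parts[1].startswith(heads):
            continue
        name = parts[1][len(heads):]
        if name.startswith(prefix):
            branches.append(name)
    return branches


def list_job_branches(repo: str, timeout: float = 30.0) -> List[str]:
    """Query a repository for job branches, giving up after `timeout` seconds."""
    try:
        result = subprocess.run(
            ["git", "ls-remote", "--heads", repo],
            check=True,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Assumed policy: treat a hung listing as "no branches found".
        return []
    return parse_job_branches(result.stdout)
```

Separating parsing from the subprocess call lets the parsing logic be tested without a repository, which suits the CI concern raised above.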