Skip to content

Conversation

yarikoptic
Copy link
Collaborator

@yarikoptic yarikoptic commented Oct 29, 2024

edit: formerly it was "project" but then renamed to "study" for better alignment

This PR was initially submitted as #1861 but I made a mistake to combine it with a discussion of transformations of existing projects' layouts into such BIDS project dataset. Please refer to that PR for examples but otherwise let's concentrate here on the discussion of this specific proposed change.

  • Rationale 1 (major): BIDS standard already provides reasonable structure to formalize organization of various components of a neuroscientific data project: where to place code, original (source) data, derivative data, README, CHANGES. Many projects (e.g. nipoppy, YODA, etc) propose similar and often might be even "inspired" templates . If we explicitly allow for BIDS standard to prescribe study level organization, IMHO it would help many people and projects decide on how to organize their studies/projects.
  • Rationale 2: IMHO BIDS standard should describe only what standard prescribe and not recommend some potential "non-standardized" layouts. That is why I "reworked" that example into a legitimate BIDS dataset merely by adding dataset_description.json.

TODOs:

When accepted:

@yarikoptic yarikoptic added the opinions wanted Please read and offer your opinion on this matter label Oct 29, 2024
@yarikoptic yarikoptic requested a review from effigies October 29, 2024 21:01
@yarikoptic yarikoptic requested a review from tsalo October 29, 2024 21:01
@yarikoptic yarikoptic changed the title Enh project datasettype Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset Oct 30, 2024
@yarikoptic yarikoptic force-pushed the enh-project-datasettype branch from d0d5c37 to fb4f5a4 Compare January 23, 2025 02:40
Copy link

codecov bot commented Jan 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.71%. Comparing base (5a357dd) to head (1fe00d9).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1972   +/-   ##
=======================================
  Coverage   82.71%   82.71%           
=======================================
  Files          20       20           
  Lines        1608     1608           
=======================================
  Hits         1330     1330           
  Misses        278      278           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@yarikoptic
Copy link
Collaborator Author

yarikoptic commented Mar 1, 2025

FWIW, we conversed with @effigies and he brought up an interesting argument, although IMHO not contradicting this one per se, is that ATM any BIDS dataset (raw or derivative) which already contains some subdatasets under derivatives/ could be considered to be a "project" BIDS dataset. Indeed it is true. But IMHO it is just a legacy side-effect of BIDS recommendation to stick derivatives under derivatives/ which somewhat invalidate the claim that the BIDS dataset as a whole is a "raw" dataset.

edit: related linked below is #2103 highlighting the same situation with "raw" dataset containing "derivatives/"

@yarikoptic
Copy link
Collaborator Author

@effigies I wonder if we should extend DatasetType to become a list or replace or compliment with DatasetTypes which could then indeed be just "project" when containing other derivative bids-datasets or sourcedata, or ["raw", "project"] when also a raw and project, or could be also ["derivative", "project"] when derivative but points to sourcedata/ original "raw" ... ?

@effigies
Copy link
Collaborator

I'm skeptical of that need. I would expect your ["raw", "project"] to look like:

project/
  dataset_description.json  # "DatasetType": "project"
  rawdata/
    dataset_description.json  # "DatasetType": "raw"
  derivatives/
    preproc/
      dataset_description.json  # "DatasetType": "derivative"

And ["derivative", "project"]:

project/
  dataset_description.json  # "DatasetType": "project"
  sourcedata/
    dataset_description.json  # "DatasetType": "raw"
  rawdata/
    dataset_description.json  # "DatasetType": "derivative"
  deriv/
    analysis/
      dataset_description.json  # "DatasetType": "derivative"

Subdatasets should be validatable BIDS datasets in their own right, avoiding the need for a top-level dataset_description.json to modify how they are intended to be validated.

@effigies
Copy link
Collaborator

I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.

I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.

In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.

@yarikoptic
Copy link
Collaborator Author

yet to "process" but a quick side idea inspired by #1928 --- I wonder if there is a hierarchy here: project (everything common) -> raw (current default, requires having sub- folder(s)) -> derivative (more stuff could be added), as every next level adds capabilities but includes all of the prior one as derivative could include raw in it? or we have already something which invalidates that?

@yarikoptic
Copy link
Collaborator Author

I think this overall needs more specification. What are valid directories in a project-type dataset? We should add them to https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/directories.yaml.

They are already there, that's somewhat the point here -- that we are already defining the structure of all those folders, nothing new to add.

I think a project dataset is barely worth specifying if we don't validate at least the raw data subdataset. Possibly we should have rules for indicating where validators should look for subdatasets.

In OpenNeuroDerivatives, we use sourcedata/* as a mixture of BIDS and non-BIDS (FreeSurfer) inputs. Nipoppy uses derivatives/<pipeline>/<version>/<output> as derivative datasets; presumably this could also be a mixture of BIDS and non-BIDS.

I think presence of the subdatasets is not really the differentiation here, and formalization of rules for their validation is orthogonal to this issue. Having in mind my prior observation that "raw" is pretty much "a project with data in sub-* folders" we might be circling back to that issue of requiring sub- folders in "raw" BIDS datasets? ;-) and otherwise they are just "BIDS project" datasets, which could contain sourcedata/ and derivatives/ subdatasets, logs/, code/ etc, but not sub-* subfolders with actual data?

Could even kinda become nice that we would facilitate people to even start their "raw BIDS datasets" as "project BIDS datasets" where they plan (README, code/ etc) until they start populate with data and thus becoming raw?

@yarikoptic yarikoptic force-pushed the enh-project-datasettype branch 2 times, most recently from 9c68bf8 to 6f236ce Compare June 10, 2025 12:45
@yarikoptic
Copy link
Collaborator Author

@effigies Following your idea, I have now added a "warning" (to reflect level of the analogous SubjectFolders check in "raw" BIDS). I guess, in principle, we could take this as an opportunity to revert SubjectFolders check back to "error" and demand making it "project" type, and then make it also "error" in project type. WDYT?

@yarikoptic yarikoptic added the copenhagen For discussion in Copenhagen label Jun 11, 2025
@yarikoptic
Copy link
Collaborator Author

While discussing with @jbpoline we wondered, if may be study would be a better descriptor to use here in favor of project. One of the rationales, is that e.g. in BEP035 (attn @bids-standard/bep035) on Mega-analysis they introduce study- entity as a groupping element. It kinda then would match natively.

we also mention "study" in various places in BIDS which seems to align nicely here
❯ git grep study
src/CHANGES.md:-   \[FIX] update physio bids name in longitudinal study page examples [#863](https://github.com/bids-standard/bids-specification/pull/863) ([Remi-Gau](https://github.com/Remi-Gau))
src/appendices/coordinate-systems.md:The following template identifiers are RECOMMENDED for individual- and study-specific reference
src/appendices/coordinate-systems.md:In the case of multiple study templates, additional names may need to be defined.
src/appendices/coordinate-systems.md:| study                 | Custom space defined using a group/study-specific template. This coordinate system requires specifying an additional file to be fully defined.                                                                                                                         |
src/appendices/hed.md:numerical values that are similar across the recordings in the study.
src/appendices/hed.md:repository on GitHub should be used to validate the study event annotations.
src/common-principles.md:    unless when appropriate given the study goals, for example, when scanning babies.
src/introduction.md:> The data used in the study were organized using the
src/modality-specific-files/genetic-descriptor.md:     "Dataset": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001364.v1.p1",
src/modality-specific-files/intracranial-electroencephalography.md:Note that the date and time information SHOULD be stored in the study key file
src/modality-specific-files/magnetic-resonance-spectroscopy.md:acquisition parameters in filenames is helpful or necessary to distinguish datasets in a given study.
src/modality-specific-files/motion.md:Note that the onsets of the recordings SHOULD be stored in the study key file [(`scans.tsv`)](../modality-agnostic-files.md#scans-file).
src/modality-specific-files/positron-emission-tomography.md:This entity is OPTIONAL if only one tracer is used in the study,
src/modality-specific-files/task-events.md:Please mind that this does not imply that only so called "event related" study designs
src/schema/objects/common_principles.yaml:    A set of neuroimaging and behavioral data acquired for a purpose of a particular study.
src/schema/objects/common_principles.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/common_principles.yaml:    A person or animal participating in the study.
src/schema/objects/entities.yaml:    For example, this should be used when a study includes two T1w images -
src/schema/objects/entities.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/entities.yaml:    A person or animal participating in the study.
src/schema/objects/enums.yaml:study:
src/schema/objects/enums.yaml:  value: study
src/schema/objects/enums.yaml:  display_name: study
src/schema/objects/enums.yaml:    Custom space defined using a group/study-specific template.
src/schema/objects/metadata.yaml:    Reference to the study/studies on which the implementation is based.
src/schema/objects/metadata.yaml:    The version of the HED schema used to validate HED tags for study.
tools/schemacode/src/bidsschematools/tests/data/broken_dataset_description.json:"EthicsApprovals": ["The original study from which this BIDS example dataset was derived was approved by the Ethics committee of Ghent University Hospital with identifier EC 2017/1103."]

and "project" mentionings are not particularly aligned. So, I think, we should just make it a "study", hence renaming accordingly.

@yarikoptic yarikoptic changed the title Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset Add DatasetType="study" and rework existing "layout" example into a proper BIDS dataset Jun 12, 2025
This reverts commit a3c12f8 where I have tried
to introduce it in
bids-standard#1741 but it required a
little more of further detailing.
Idea from @effigies while discussing this PR at BIDS Maintainers meeting 2025
…ith SubjectFolders check

Also adjusted wording to be aligned too
While discussing with @jbpoline we wondered, if may be `study` would be a better descriptor to use here in favor of `project`.  One of the rationales, is that e.g. in [BEP035](https://bids.neuroimaging.io/extensions/beps/bep_035.html) (attn @bids-standard/bep035) on Mega-analysis they introduce `study-` entity as a groupping element.  It kinda then would match natively.

we also  mention "study" in various places in BIDS which seems to align nicely here

```shell
❯ git grep study
src/CHANGES.md:-   \[FIX] update physio bids name in longitudinal study page examples [bids-standard#863](bids-standard#863) ([Remi-Gau](https://github.com/Remi-Gau))
src/appendices/coordinate-systems.md:The following template identifiers are RECOMMENDED for individual- and study-specific reference
src/appendices/coordinate-systems.md:In the case of multiple study templates, additional names may need to be defined.
src/appendices/coordinate-systems.md:| study                 | Custom space defined using a group/study-specific template. This coordinate system requires specifying an additional file to be fully defined.                                                                                                                         |
src/appendices/hed.md:numerical values that are similar across the recordings in the study.
src/appendices/hed.md:repository on GitHub should be used to validate the study event annotations.
src/common-principles.md:    unless when appropriate given the study goals, for example, when scanning babies.
src/introduction.md:> The data used in the study were organized using the
src/modality-specific-files/genetic-descriptor.md:     "Dataset": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001364.v1.p1",
src/modality-specific-files/intracranial-electroencephalography.md:Note that the date and time information SHOULD be stored in the study key file
src/modality-specific-files/magnetic-resonance-spectroscopy.md:acquisition parameters in filenames is helpful or necessary to distinguish datasets in a given study.
src/modality-specific-files/motion.md:Note that the onsets of the recordings SHOULD be stored in the study key file [(`scans.tsv`)](../modality-agnostic-files.md#scans-file).
src/modality-specific-files/positron-emission-tomography.md:This entity is OPTIONAL if only one tracer is used in the study,
src/modality-specific-files/task-events.md:Please mind that this does not imply that only so called "event related" study designs
src/schema/objects/common_principles.yaml:    A set of neuroimaging and behavioral data acquired for a purpose of a particular study.
src/schema/objects/common_principles.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/common_principles.yaml:    A person or animal participating in the study.
src/schema/objects/entities.yaml:    For example, this should be used when a study includes two T1w images -
src/schema/objects/entities.yaml:    Session can (but doesn't have to) be synonymous to a visit in a longitudinal study.
src/schema/objects/entities.yaml:    A person or animal participating in the study.
src/schema/objects/enums.yaml:study:
src/schema/objects/enums.yaml:  value: study
src/schema/objects/enums.yaml:  display_name: study
src/schema/objects/enums.yaml:    Custom space defined using a group/study-specific template.
src/schema/objects/metadata.yaml:    Reference to the study/studies on which the implementation is based.
src/schema/objects/metadata.yaml:    The version of the HED schema used to validate HED tags for study.
tools/schemacode/src/bidsschematools/tests/data/broken_dataset_description.json:"EthicsApprovals": ["The original study from which this BIDS example dataset was derived was approved by the Ethics committee of Ghent University Hospital with identifier EC 2017/1103."]

```

and "project" mentionings are not particularly  aligned.  So, I think, we should just make it a "study", hence renaming accordingly.
@effigies effigies force-pushed the enh-project-datasettype branch from 154a8cc to 315c08f Compare June 12, 2025 09:20
@effigies effigies changed the title Add "study" DatasetType to organize a collection of source and derivative datasets [ENH] Add "study" DatasetType to organize a collection of source and derivative datasets Jun 16, 2025
@effigies
Copy link
Collaborator

A functional question for the validator: Should we error if there are no valid BIDS datasets in sourcedata/?

cc @rwblair @nellh

@yarikoptic
Copy link
Collaborator Author

yarikoptic commented Jun 16, 2025

sorry, I do not see how this relates to this PR since having a "valid BIDS datasets in sourcedata/" seems to point to be a "derivative BIDS" dataset by its definition.

edit: the point is that "study" dataset could even be empty to start with, start collecting various other sourcedata/ and eventually do get sourcedata/raw BIDS dataset. I think it should remain valid through those life cycles.

@yarikoptic
Copy link
Collaborator Author

@effigies who, among maintainers, do you think might also be interested to review this PR?

@effigies
Copy link
Collaborator

But what are you validating if there are no valid subdatasets? Why are you running the validator?

@yarikoptic
Copy link
Collaborator Author

dang... I guess I mixed up this PR with something else since I thought that we have merged it!

But what are you validating if there are no valid subdatasets? Why are you running the validator?

I did some edit above, may be it came during your question:?

edit: the point is that "study" dataset could even be empty to start with, start collecting various other sourcedata/ and eventually do get sourcedata/raw BIDS dataset. I think it should remain valid through those life cycles.

so the point is that you would validate what is available -- that there is no sub- folders, that dataset_description.json is proper with correct DatasetType, etc. Later I will adjust

to get some examples.

@yarikoptic
Copy link
Collaborator Author

@effigies could you invite or nominate a few other PR reviewers who might potentially be interested? ;-)

Copy link
Member

@kabilar kabilar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yarikoptic. A study BIDS dataset would be useful for the LINC project where we are disseminating source, raw, and derived data together.

cc @satra @balbasty @ayendiki

@yarikoptic yarikoptic requested a review from kabilar August 8, 2025 17:05
Copy link
Member

@kabilar kabilar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @yarikoptic.

@yarikoptic yarikoptic force-pushed the enh-project-datasettype branch from 6e3b18b to 02a8074 Compare August 8, 2025 20:10
@kabilar
Copy link
Member

kabilar commented Aug 13, 2025

Hi @yarikoptic @effigies, I am new to the BIDS release process. What are the next steps to get this merged and released as part of the latest BIDS spec? Thanks.

Copy link
Member

@julia-pfarr julia-pfarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me for a start!

@yarikoptic
Copy link
Collaborator Author

2 maintainers + 1 (associated) contributor approvals. How many more to request/wait for ? ;)

@julia-pfarr julia-pfarr merged commit 132a9a4 into bids-standard:master Aug 20, 2025
17 of 22 checks passed
@yarikoptic
Copy link
Collaborator Author

One trivial click for a maintainer, one giant leap for the BIDSkind!

Thank you @julia-pfarr ! ;-)

yarikoptic added a commit to yarikoptic/bids-specification that referenced this pull request Aug 26, 2025
…where was missing

Apparently I have missed this file entirely whenever I was preparing
  bids-standard#1972 (study DatasetType)

and also due to all the duplication we (I) missed that docs was not listed among "root.subdirs"
or not listed at all.

This would provide a fix, but I wonder if we could/should make it so we could avoid duplication
altogether.  As I have argued in bids-standard#1972 I feel that "study" is the base dataset type and
next ones just add potentially more to them.  So may be we could come up with some more
compact representation here... but not in this PR
julia-pfarr pushed a commit that referenced this pull request Aug 26, 2025
…d "docs" where was missing (#2185)

* Add "directories" description for "study" DatasetType and add "docs" where was missing

Apparently I have missed this file entirely whenever I was preparing
  #1972 (study DatasetType)

and also due to all the duplication we (I) missed that docs was not listed among "root.subdirs"
or not listed at all.

This would provide a fix, but I wonder if we could/should make it so we could avoid duplication
altogether.  As I have argued in #1972 I feel that "study" is the base dataset type and
next ones just add potentially more to them.  So may be we could come up with some more
compact representation here... but not in this PR

* Remove "phentoype/" and "stimuli/" from the "root" of the "study" dataset

Since per se they should either be under "sourcedata/" or "derivatives/" one way or
another even if it is some stimuli-only dataset(s)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

copenhagen For discussion in Copenhagen needs-review opinions wanted Please read and offer your opinion on this matter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants