bclenet
Contributor

@bclenet bclenet commented Apr 10, 2025

This is a work-in-progress PR proposing a specification update for BEP028 BIDS-Prov.

bclenet and others added 30 commits March 18, 2025 11:06
Collaborator

@cmaumet cmaumet left a comment

Hi Boris,

Thanks a lot for this new version of the BIDS-Prov spec.

As discussed by email I reviewed the "key concepts" section as well as the outline.

Overall, it looks great and I think the outline gives a good flow of information.

One suggestion would be:


Overview
---- Goals
---- General principles
---- Key concepts
Provenance files
---- Activities
---- Entities
---- Software
---- Environments
Provenance of a BIDS file
---- Sidecar json
---- Provenance files
-------- Activities
-------- Software
-------- Environments
-------- Entities
Provenance of a BIDS dataset
---- Description using provenance records
---- Description of processes or pipelines
Consistency and uniqueness of identifiers
---- Identifiers for entities
---- Identifiers for other provenance records
Minimal examples
---- Provenance of a BIDS raw dataset
---- Provenance of a BIDS study dataset


And the section "Provenance of a BIDS dataset" would refer back to the subsection "Provenance files" as needed.


Provenance records are described as JSON objects in BIDS. They are stored inside **provenance files** (see [Provenance files](#provenance-files)).

Additionally, **provenance metadata** of entities can be stored as regular BIDS metadata inside:
Collaborator

Suggested change
Additionally, **provenance metadata** of entities can be stored as regular BIDS metadata inside:
Additionally, **provenance metadata** of entities can be stored as regular BIDS metadata inside sidecar JSON files (see [Provenance of a BIDS file](#provenance-of-a-bids-file)).


Additionally, **provenance metadata** of entities can be stored as regular BIDS metadata inside:

- sidecar JSON files (see [Provenance of a BIDS file](#provenance-of-a-bids-file));
Collaborator

Suggested change
- sidecar JSON files (see [Provenance of a BIDS file](#provenance-of-a-bids-file));

Additionally, **provenance metadata** of entities can be stored as regular BIDS metadata inside:

- sidecar JSON files (see [Provenance of a BIDS file](#provenance-of-a-bids-file));
- `dataset_description.json` files (see [Provenance of a BIDS dataset](#provenance-of-a-bids-dataset)).
Collaborator

Suggested change
- `dataset_description.json` files (see [Provenance of a BIDS dataset](#provenance-of-a-bids-dataset)).
Finally, activities responsible for the creation of the dataset can be stored in `dataset_description.json` files (see [Provenance of a BIDS dataset](#provenance-of-a-bids-dataset)).

Contributor Author

Activities are not stored in dataset_description.json.

Is your suggestion only related to the writing as a list, or is it a matter of meaning?

Collaborator

But a link to an Activity is stored in dataset_description.json? If yes then the text above can be amended as follows: "Finally, activities responsible for the creation of the dataset can be linked from "

@effigies
Collaborator

effigies commented Oct 3, 2025

Here are the outstanding issues I can see, given the current state of the pull requests:

  1. SIDECAR_WITHOUT_DATAFILE errors. I have pointed @bclenet to where this would need to be addressed: https://github.com/bids-standard/bids-validator/blob/c331a16/src/validators/internal/unusedFile.ts. The basic problem is that JSON files are not generally distinguishable from sidecars, and we do want to check that every sidecar applies to some data file. We currently check against the two exceptions to this rule, but that list-inclusion check could be turned into a function that also handles prov files.
  2. Unvalidated prov contents. Use schema/rules/json/prov.yaml to define the contents of the files. I hope the existing files will be a good model, but let me know if you need some scaffolding there.
  3. The arbitrary subdirectories are going to be a pain. The simplest thing would be to drop the idea and have a flat directory, which works right now. The alternative is probably going to involve new concepts in both the file rules and the directory rules (cc @rwblair):
    provenance:
      level: optional
      datatypes:
        - prov
      suffixes:
        - act
        - ent
        - env
        - soft
      extensions:
        - .json
      entities:
        prov: required

    raw:
      root:
        subdirs:
          - code
          - docs
          - derivatives
          - logs
          - phenotype
          - prov
          - sourcedata
          - stimuli
          - subject
      code:
        name: code
        level: optional
        opaque: true
      derivatives:
        name: derivatives
        level: optional
        opaque: true
      docs:
        name: docs
        level: optional
        opaque: true
      logs:
        name: logs
        level: optional
        opaque: true
      phenotype:
        name: phenotype
        level: optional
        opaque: false
      prov:
        name: prov
        level: optional
        opaque: false
      sourcedata:
        name: sourcedata
        level: optional
        opaque: true
      stimuli:
        name: stimuli
        level: optional
        opaque: true
      subject:
        entity: subject
        level: required
        opaque: false
        subdirs:
          - oneOf:
              - session
              - datatype
      session:
        entity: session
        level: optional
        opaque: false
        subdirs:
          - datatype
      datatype:
        value: datatype
        level: required
        opaque: false
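
For item 1 above, here is a minimal sketch of what turning the list-inclusion check into a function might look like. It is written in Python for brevity (the validator itself is TypeScript), and every name in it is hypothetical: `STANDALONE_JSON_NAMES` is a placeholder for the validator's existing exceptions, and the rule "a JSON file under a top-level `prov/` directory with an `act`, `ent`, `env` or `soft` suffix is not a sidecar" is an assumption drawn from the file rule in item 3, not the validator's actual behaviour.

```python
from pathlib import PurePosixPath

# Placeholder for the validator's existing exceptions (assumed names,
# not copied from unusedFile.ts).
STANDALONE_JSON_NAMES = {"dataset_description.json", "genetic_info.json"}

# Suffixes taken from the proposed "provenance" file rule.
PROV_SUFFIXES = {"act", "ent", "env", "soft"}


def is_standalone_json(path: str) -> bool:
    """Return True if a JSON file should be exempt from the sidecar check."""
    p = PurePosixPath(path.lstrip("/"))
    if p.suffix != ".json":
        return False
    if p.name in STANDALONE_JSON_NAMES:
        return True
    # Assumed rule: provenance files live under a top-level prov/ directory
    # and end with one of the provenance suffixes.
    in_prov_dir = len(p.parts) > 1 and p.parts[0] == "prov"
    file_suffix = p.stem.rsplit("_", 1)[-1]
    return in_prov_dir and file_suffix in PROV_SUFFIXES
```

With this shape, the unused-sidecar check would only need to call one predicate, and adding further standalone JSON families later would not touch the call site.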

Collaborator

@cmaumet cmaumet left a comment

This is another set of proposed updates, as per our discussions.


### Provenance of a BIDS raw dataset

Consider the following BIDS raw dataset:
Collaborator

To ease reading of this section, it would be nice to add some textual description of what is found in the dataset, something along the lines of "following BIDS raw dataset that contains a single T1-weighted image that was generated from a set of DICOM files:"


### Provenance of a BIDS derivative dataset

Consider the following BIDS derivative dataset:
Collaborator

Same here, a few words describing the dataset would be great.

@effigies
Collaborator

effigies commented Oct 7, 2025

I had a chance to talk with @rwblair about the arbitrary subdirectories yesterday. At least from the examples given (/prov/preprocspm/prov-preprocspm{1,2}_*.json), the idea is to allow grouping, which BIDS generally does with entities. Why not:

/prov[/prov-<label>]/prov-<label>_desc-<label>_<suffix>.json

Although a bit different from how BIDS has done things (data type before first entity directory), the machinery we already have in the schema is sufficient to encode this and the changes to the validator should not be difficult.
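
To make the proposed pattern concrete, here is a rough Python sketch of a matcher for `/prov[/prov-<label>]/prov-<label>_desc-<label>_<suffix>.json`. It rests on several assumptions that are not settled in this thread: labels are alphanumeric, `desc` is required as written in the pattern, the suffix is one of `act`, `ent`, `env`, `soft`, and the optional directory label must equal the file's `prov` label. `PROV_PATH` and `parse_prov_path` are illustrative names only.

```python
import re

# Assumed grammar for the proposed layout; labels are taken to be alphanumeric.
PROV_PATH = re.compile(
    r"^/prov"
    r"(?:/prov-(?P<dir_label>[A-Za-z0-9]+))?"   # optional grouping directory
    r"/prov-(?P<label>[A-Za-z0-9]+)"
    r"_desc-(?P<desc>[A-Za-z0-9]+)"
    r"_(?P<suffix>act|ent|env|soft)\.json$"
)


def parse_prov_path(path):
    """Return the entities of a provenance file path, or None if it does not match."""
    m = PROV_PATH.match(path)
    if m is None:
        return None
    # Assumption: when the grouping directory is present, its label must
    # agree with the prov entity in the file name.
    if m["dir_label"] is not None and m["dir_label"] != m["label"]:
        return None
    return {"prov": m["label"], "desc": m["desc"], "suffix": m["suffix"]}
```

Under these assumptions, both the flat and the nested form are accepted, while an arbitrary subdirectory such as `/prov/preprocspm/...` is rejected.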

@bclenet
Contributor Author

bclenet commented Oct 7, 2025

@effigies, thanks for the input on the arbitrary subdirectories! We'll discuss that with Yarik and Camille tomorrow.

@yarikoptic
Collaborator

> /prov[/prov-<label>]/prov-<label>_desc-<label>_<suffix>.json
>
> Although a bit different from how BIDS has done things (data type before first entity directory),

FWIW, I like it since it is generic. Might also be applicable to e.g. BEP044:Stimuli where ATM stimuli files are not grouped but have stim-<label>_...otherentities flat listing.

But overall it comes down to the question of when it is worth keeping flat vs creating those folders, and it is kinda a generic aspect: e.g. if there is a BIDS dataset with only T1w images for 10 subjects -- folders are not really making it easier to navigate the data. Somewhat of a usecase for

Similarly here: if it is just a single "stage" derivative dataset (e.g. bids-app applied in one go across all subjects) -- there is no need for subfolders there, right?
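
As a small illustration of the flat option, the sketch below recovers the grouping from a flat listing using only the leading `prov-<label>` entity. It assumes file names start with that entity, and `group_by_prov_label` is a hypothetical helper, not part of any BIDS tool.

```python
from collections import defaultdict


def group_by_prov_label(filenames):
    """Group flat provenance file names by their leading prov-<label> entity."""
    groups = defaultdict(list)
    for name in filenames:
        # The first underscore-separated chunk is assumed to be "prov-<label>".
        first_entity = name.split("_", 1)[0]
        key, sep, label = first_entity.partition("-")
        if key == "prov" and sep and label:
            groups[label].append(name)
    return dict(groups)
```

Files that do not start with a `prov-<label>` entity are simply skipped, so a tool could present the same groups whether or not the subfolders exist.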

@effigies
Collaborator

effigies commented Oct 7, 2025

I have no objections to keeping it flat. I was under the impression that nesting was important.

@yarikoptic
Collaborator

@effigies I see now that you proposed to make prov-<label>/ folders optional -- would that be somehow consistent with the rest of BIDS and wouldn't introduce difficulty? I do not remember any other entity for which we have it optional kinda.

Per our discussion I also would recommend establishing prov/provenance.{tsv,json} with descriptions for those labels. Here I assume that provenance is the plural form here. Begs a question if should be provenance/ folder then, similar to stimuli/ for stim- entities in that BEP etc.

I lean toward always requiring them for the sake of consistency.

@effigies
Collaborator

effigies commented Oct 8, 2025

> would that be somehow consistent with the rest of BIDS and wouldn't introduce difficulty? I do not remember any other entity for which we have it optional kinda.

We don't, but it is a convention, not a technical problem.

> Per our discussion I also would recommend establishing prov/provenance.{tsv,json} with descriptions for those labels. Here I assume that provenance is the plural form here. Begs a question if should be provenance/ folder then, similar to stimuli/ for stim- entities in that BEP etc.
>
> I lean toward always requiring them for the sake of consistency.

I defer to the BEP leads to make the proposal they want.

5 participants