How to handle file exclusions and inclusions in 0.8? #378

@tyarkoni


This is an issue that has cropped up repeatedly in various guises (e.g., #215, #364, #277, #184, #131, and probably others). The question is how to allow users to specify explicit inclusion and exclusion paths at BIDSLayout initialization. The reason for bringing this up again is that, as of version 0.8 (see #369), pybids will no longer depend on grabbit. The cord-cutting means we can no longer rely on the inclusion/exclusion behavior implemented in grabbit. Since that behavior was entirely undocumented in pybids, I think we have a good opportunity to start afresh and hopefully settle on something that works for everyone.

The main constraints I think we should try to respect are:

  • We want to exclude a bunch of hard-coded subdirectories by default (e.g., 'code', 'stimuli', 'sourcedata', etc.)
  • Users should be able to easily override any of the default exclusions and make sure they're indexed
  • Users should be able to specify arbitrary directories anywhere in the file system that should not be indexed (in the event they're encountered in any raw or derivatives BIDSLayout)

The current approach doesn't allow users to specify explicit exclusions at all (well, it does, but only through an undocumented grabbit feature). It uses an include argument only as a means of negating the default exclusions. E.g., if you want 'stimuli/' to be indexed, you pass include=['stimuli']. Beyond this, there's no pybids-level ability to control inclusions or exclusions (aside from specifying derivatives, which is a separate matter that I think we're handling in a satisfactory way). I don't think this is good enough, and a bunch of the opened issues reflect that.
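To make the current semantics concrete, here is a minimal sketch of how include acts purely as a negation of the default exclusion list. This is not pybids or grabbit code: the helper name `effective_exclusions` is hypothetical, and `DEFAULT_EXCLUSIONS` holds only the three directory names mentioned above, not the full internal list.

```python
# Illustrative sketch only; not the pybids or grabbit implementation.
# Only the directory names mentioned in this issue are listed here.
DEFAULT_EXCLUSIONS = ["code", "stimuli", "sourcedata"]


def effective_exclusions(include=None):
    """Return the default exclusions minus anything named in `include`.

    `include` never adds new paths to the index; it only cancels defaults.
    """
    include = set(include or [])
    return [name for name in DEFAULT_EXCLUSIONS if name not in include]


print(effective_exclusions())                     # ['code', 'stimuli', 'sourcedata']
print(effective_exclusions(include=["stimuli"]))  # ['code', 'sourcedata']
```

Note that nothing in this model lets a user add an exclusion of their own, which is the gap the proposals below try to close.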

Here are a few proposals (feel free to suggest others):

  1. Keep the current approach, where include negates values in the default exclusion list, but add an exclude argument that causes any matching files/dirs to be skipped during indexing. The main downside I see here is that the behavior is counterintuitive, as include and exclude act asymmetrically. A potential fix is to give these arguments different names (e.g., override_exclusions and exclude_paths).

  2. Stick with just exclude, and have any manually specified value override the default internal list (e.g., if you pass ['code', 'sourcedata'], then things like 'stimuli' will now be indexed, and only files/dirs that match the elements in your list will be skipped). The downside of this is it requires users to know what the default exclusions are, and reproduce them, and this will probably get pretty messy.

  3. Get rid of the current default exclusion list entirely, and treat exclude as a strict list of paths to exclude from indexing. Now that the validator is working properly, directories like 'stimuli' will automatically be skipped if validate=True, because files won't pass the validator unless they're explicitly part of the spec. The downside of this option is that it makes it difficult to index selectively. E.g., if you want to index only what's in 'stimuli', you need to set validate=False and then pass a whole pile of exclusions (i.e., everything that doesn't pass the validator except for 'stimuli').
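The three proposals can be compared side by side with a rough sketch. Everything here is an assumption made for illustration: the function names are hypothetical, matching is done on a bare directory name (the matching rules are not settled in this issue), the default list contains only the names mentioned above, and the validator's role in proposal 3 is left out entirely.

```python
# Illustrative sketch only; none of these functions exist in pybids.
# Partial default list: just the names mentioned in this issue.
DEFAULT_EXCLUSIONS = ("code", "stimuli", "sourcedata")


def skipped_proposal_1(name, override_exclusions=(), exclude_paths=()):
    """Proposal 1: overrides cancel defaults; exclude_paths adds new skips."""
    active_defaults = set(DEFAULT_EXCLUSIONS) - set(override_exclusions)
    return name in active_defaults or name in set(exclude_paths)


def skipped_proposal_2(name, exclude=None):
    """Proposal 2: a user-supplied list replaces the defaults wholesale."""
    active = DEFAULT_EXCLUSIONS if exclude is None else exclude
    return name in set(active)


def skipped_proposal_3(name, exclude=()):
    """Proposal 3: no defaults; only explicitly listed names are skipped.

    (With validate=True the validator would filter further; not modeled.)
    """
    return name in set(exclude)


# Proposal 1: 'stimuli' is skipped by default, indexed once overridden.
assert skipped_proposal_1("stimuli")
assert not skipped_proposal_1("stimuli", override_exclusions=["stimuli"])
assert skipped_proposal_1("scratch", exclude_paths=["scratch"])

# Proposal 2: passing a list silently re-enables anything left off it.
assert not skipped_proposal_2("stimuli", exclude=["code", "sourcedata"])

# Proposal 3: nothing is skipped unless the user names it.
assert not skipped_proposal_3("stimuli")
```

The asymmetry in proposal 1 shows up in the signature: one argument shrinks the skip set and the other grows it, which is why distinct names like override_exclusions and exclude_paths read more clearly than include and exclude.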

I lean towards (1) (with more explicit argument names). Thoughts? If I don't get any feedback in the next couple of days, I'll make an executive decision in the interest of getting 0.8 merged, so speak up now if you have an opinion! (Tagging in @effigies @adelavega @yarikoptic @gkiar)
