-
Notifications
You must be signed in to change notification settings - Fork 130
[ENH]: exclude common non-BIDS data sources + evade potential named tuple problem #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -69,7 +69,7 @@ class BIDSLayout(Layout): | |
| was left empty (in which case the root defaults to the file system | ||
| root). | ||
| derivatives (bool, str, list): Specificies whether and/or which | ||
| derivatives to to index. If True, all pipelines found in the | ||
| derivatives to index. If True, all pipelines found in the | ||
| derivatives/ subdirectory will be indexed. If a str or list, gives | ||
| the paths to one or more derivatives directories to index. If False | ||
| or None, the derivatives/ directory is ignored during indexing, and | ||
|
|
@@ -114,7 +114,10 @@ def __init__(self, root, validate=True, index_associated=True, | |
| "dataset_description.json." % k) | ||
|
|
||
| # Determine which subdirectories to exclude from indexing | ||
| excludes = {"code", "stimuli", "sourcedata", "models", "derivatives"} | ||
| # A set is extended with non-BIDS-defined common suspects | ||
| # which are not expected to carry any data of interest | ||
| excludes = {"code", "stimuli", "sourcedata", "models", "derivatives", | ||
| ".git", ".datalad", ".cache", ".workdir", "workdir"} | ||
| if include is not None: | ||
| include = listify(include) | ||
| if "derivatives" in include: | ||
|
|
@@ -563,18 +566,24 @@ def index_file(self, f, overwrite=False): | |
| an entry already exists. | ||
| """ | ||
| if isinstance(f, six.string_types): | ||
| f = self.layout.get_file(f) | ||
| filename = self.layout.get_file(f).path | ||
| elif isinstance(f, tuple): | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the motivation for this? For methods like this, I'd prefer to minimize the number of valid types that can be passed as input. It's cleaner if the caller has to first extract a string or
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Motivation was pointed to below this line: http://github.com/grabbles/grabbit/pull/84 . In the effort of using fitlins with current pybids we ended up here with the
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @yarikoptic How would you feel about getting rid of the namedtuple entirely, and just returning the
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see no problems with it, as long as its backwards compatible. It does seem a bit redundant and confusing two have two types of objects that can be returned.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've opened #281 to deal with this. |
||
| # See http://github.com/grabbles/grabbit/pull/84 | ||
| # .path is not available within File named tuple | ||
| filename = f.filename | ||
| else: | ||
| filename = f | ||
|
|
||
| if f.path in self.file_index and not overwrite: | ||
| if filename in self.file_index and not overwrite: | ||
| return | ||
|
|
||
| md = self._get_metadata(f.path) | ||
| md = self._get_metadata(filename) | ||
|
|
||
| for md_key, md_val in md.items(): | ||
| if md_key not in self.key_index: | ||
| self.key_index[md_key] = {} | ||
| self.key_index[md_key][f.path] = md_val | ||
| self.file_index[f.path][md_key] = md_val | ||
| self.key_index[md_key][filename] = md_val | ||
| self.file_index[filename][md_key] = md_val | ||
|
|
||
| def _get_metadata(self, path, **kwargs): | ||
| potential_jsons = self.layout._get_nearest_helper(path, '.json', | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than hard-coding these, I think it's reasonable to expect users to maintain a
.bidsignorefile, as these should also cause validator issues.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> cat .bidsignore inputs/ workdir/ eyetrack/ README.md model-frst_smdl.json CHANGELOG.mdthis is what I have in my .bidsignore, files in workdir were still considered.
@yarikoptic remarks: "Would it ever makes sense to consider .git files?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
.gitshould never be considered. I'm not sure.bidsignorereading is implemented. That would be a valuable contribution. But I also think sensible hardcoded defaults (such as these) make sense.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please update the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to specify sensible defaults, I'd be inclined to disallow all
.*files except.bidsignore.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason these particular values are hardcoded is that they're explicitly mentioned in the spec as places people should put stuff, but we don't want to index them by default. The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like
workdir/, there's no limit to what people might want to add.I agree that
.bidsignoresupport is the right way to handle this. Pybids currently passes (inkwargs) anexcludeargument onto grabbit, which one can use to pass arbitrary regexes that (I believe) are matched against every file. So probably the easiest way to implement.bidsignoresupport would be to add entries in.bidsignoreto theexcludelist. There may be unintended consequences, however, so this should be investigated.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, but I think excluding
.*folders by default is probably a good idea (when would it be a good idea to scan them in?) Other folders can go into abidsignoreThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious -- what for to allow/consider .bidsignore? if any function would be needing to treat this this file, it would just test for its presence and open it. It doesn't need it to be a part of the Layout schema/pool of files, since there is nothing to query for in its name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree too! Support of .bidsignore would allow for all those
workdiretc. But also I agree with @adelavega -- I do not think in any BIDS spec we would/should allow/be interested in files/directories starting with.for the purpose of paths matching etc.As I have pointed above,
.bidsignoreis a special file which is not part of the path layout template matching and actually not a part of the BIDS specification itself! Its existence was instigated within bids-validator, so it somewhat of a tooling level file. So it should IMHO be ok to ignore all.files/dirs in general for pattern matching. I think it might even be worth to "hardcode" such consideration in BIDS specification itself? e.g."Note on BIDS dataset tools support/development: No directories or files in BIDS are generally named with leading
.and thus should be ignored for the purpose of discovery of relevant data files. A.bidsignorefile can be used by dataset users to inform tools of additional directories/files which should also be ignored by the tools. The format of.bidsignorefile is .."Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed—let's add
"^\."to the list of default excludes (if I'm not mistaken, they're regexes by default) and remove the other entries (i.e.,workdirand anything that starts with.). We can add.bidsignoresupport in a separate PR, unless someone wants to take a pass at it here. It shouldn't be difficult—unless I'm missing something, I think just adding all the lines in that file to the internally-constructedexcludelist that gets passed onto grabbit'sLayoutinitializer should do the trick.I don't have strong feelings about whether or not the spec should mention
.and.bidsignore. I guess I lean towards not for the time being; to my mind the place to put that would be some kind of developer guide or appendix.