[ENH]: exclude common non-BIDS data sources + evade potential named tuple problem#277
[ENH]: exclude common non-BIDS data sources + evade potential named tuple problem#277adswa wants to merge 3 commits intobids-standard:masterfrom
Conversation
| # A set is extended with non-BIDS-defined common suspects | ||
| # which are not expected to carry any data of interest | ||
| excludes = {"code", "stimuli", "sourcedata", "models", "derivatives", | ||
| ".git", ".datalad", ".cache", ".workdir", "workdir"} |
There was a problem hiding this comment.
Rather than hard-coding these, I think it's reasonable to expect users to maintain a .bidsignore file, as these should also cause validator issues.
There was a problem hiding this comment.
> cat .bidsignore
inputs/
workdir/
eyetrack/
README.md
model-frst_smdl.json
CHANGELOG.mdthis is what I have in my .bidsignore, files in workdir were still considered.
@yarikoptic remarks: "Would it ever makes sense to consider .git files?"
There was a problem hiding this comment.
I agree .git should never be considered. I'm not sure .bidsignore reading is implemented. That would be a valuable contribution. But I also think sensible hardcoded defaults (such as these) make sense.
There was a problem hiding this comment.
Also, please update the docstring
There was a problem hiding this comment.
If we're going to specify sensible defaults, I'd be inclined to disallow all .* files except .bidsignore.
There was a problem hiding this comment.
The reason these particular values are hardcoded is that they're explicitly mentioned in the spec as places people should put stuff, but we don't want to index them by default. The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like workdir/, there's no limit to what people might want to add.
I agree that .bidsignore support is the right way to handle this. Pybids currently passes (in kwargs) an exclude argument onto grabbit, which one can use to pass arbitrary regexes that (I believe) are matched against every file. So probably the easiest way to implement .bidsignore support would be to add entries in .bidsignore to the exclude list. There may be unintended consequences, however, so this should be investigated.
There was a problem hiding this comment.
Sounds good to me, but I think excluding .* folders by default is probably a good idea (when would it be a good idea to scan them in?) Other folders can go into a bidsignore
There was a problem hiding this comment.
If we're going to specify sensible defaults, I'd be inclined to disallow all
.*files except.bidsignore.
Just curious -- what for to allow/consider .bidsignore? if any function would be needing to treat this this file, it would just test for its presence and open it. It doesn't need it to be a part of the Layout schema/pool of files, since there is nothing to query for in its name.
There was a problem hiding this comment.
The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like
workdir/, there's no limit to what people might want to add.I agree that
.bidsignoresupport is the right way to handle this.
I agree too! Support of .bidsignore would allow for all those workdir etc. But also I agree with @adelavega -- I do not think in any BIDS spec we would/should allow/be interested in files/directories starting with . for the purpose of paths matching etc.
As I have pointed above, .bidsignore is a special file which is not part of the path layout template matching and actually not a part of the BIDS specification itself! Its existence was instigated within bids-validator, so it somewhat of a tooling level file. So it should IMHO be ok to ignore all . files/dirs in general for pattern matching. I think it might even be worth to "hardcode" such consideration in BIDS specification itself? e.g.
"Note on BIDS dataset tools support/development: No directories or files in BIDS are generally named with leading . and thus should be ignored for the purpose of discovery of relevant data files. A .bidsignore file can be used by dataset users to inform tools of additional directories/files which should also be ignored by the tools. The format of .bidsignore file is .."
There was a problem hiding this comment.
Agreed—let's add "^\." to the list of default excludes (if I'm not mistaken, they're regexes by default) and remove the other entries (i.e., workdir and anything that starts with .). We can add .bidsignore support in a separate PR, unless someone wants to take a pass at it here. It shouldn't be difficult—unless I'm missing something, I think just adding all the lines in that file to the internally-constructed exclude list that gets passed onto grabbit's Layout initializer should do the trick.
I don't have strong feelings about whether or not the spec should mention . and .bidsignore. I guess I lean towards not for the time being; to my mind the place to put that would be some kind of developer guide or appendix.
Codecov Report
@@ Coverage Diff @@
## master #277 +/- ##
==========================================
- Coverage 71.16% 71.07% -0.09%
==========================================
Files 24 24
Lines 2507 2510 +3
Branches 609 610 +1
==========================================
Hits 1784 1784
- Misses 558 561 +3
Partials 165 165
Continue to review full report at Codecov.
|
tyarkoni
left a comment
There was a problem hiding this comment.
The hard-coded list should only include directories mentioned in the spec. See comments.
| if isinstance(f, six.string_types): | ||
| f = self.layout.get_file(f) | ||
| filename = self.layout.get_file(f).path | ||
| elif isinstance(f, tuple): |
There was a problem hiding this comment.
What's the motivation for this? For methods like this, I'd prefer to minimize the number of valid types that can be passed as input. It's cleaner if the caller has to first extract a string or BIDSFile, I think. Plus, long term, I'd like to get rid of the namedtuple entirely—I think it mostly just adds confusion.
There was a problem hiding this comment.
Motivation was pointed to below this line: http://github.com/grabbles/grabbit/pull/84 . In the effort of using fitlins with current pybids we ended up here with the File tuple which has no .path (see that issue for our past discussion). Since our attempts are largely shots in the darkness trying to maneuver between fluid APIs, I did not want to try to change more than minimally necessary and this change was such a minimal change to stay compatible possibly with older and newer pybids API.
There was a problem hiding this comment.
@yarikoptic How would you feel about getting rid of the namedtuple entirely, and just returning the File object by default? The main functional difference is that entity names are attributes in the tuple, but not in the File class. But for the sake of backward compatibility, I'd be okay with hacking getattr to check in the .entities dict before raising an exception. That way I think nobody's existing code would break, and we would reduce code complexity. @effigies @adelavega feel free to chime in also.
There was a problem hiding this comment.
I see no problems with it, as long as its backwards compatible. It does seem a bit redundant and confusing two have two types of objects that can be returned.
| # A set is extended with non-BIDS-defined common suspects | ||
| # which are not expected to carry any data of interest | ||
| excludes = {"code", "stimuli", "sourcedata", "models", "derivatives", | ||
| ".git", ".datalad", ".cache", ".workdir", "workdir"} |
There was a problem hiding this comment.
The reason these particular values are hardcoded is that they're explicitly mentioned in the spec as places people should put stuff, but we don't want to index them by default. The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like workdir/, there's no limit to what people might want to add.
I agree that .bidsignore support is the right way to handle this. Pybids currently passes (in kwargs) an exclude argument onto grabbit, which one can use to pass arbitrary regexes that (I believe) are matched against every file. So probably the easiest way to implement .bidsignore support would be to add entries in .bidsignore to the exclude list. There may be unintended consequences, however, so this should be investigated.
|
The namedtuple is now deprecated, and a |
|
Oh, right, there's still the outstanding question of how to handle exclusions. I think my suggestion to add |
|
I think ignoring |
|
I think this PR is no longer necessary given various changes in 0.7 and 0.8, but it would in any event probably be less work to start from scratch than to resolve the merge conflicts. So I'm closing it. |
This PR