-
Notifications
You must be signed in to change notification settings - Fork 11
[New Check] Entire column of a table is not NaN #231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
abbe8f1
2a6516b
a322c04
db1b426
28fc042
29a171d
2ac54f8
73d4978
c72a095
48c5e60
b67f2c8
e5f2618
75a687e
62350db
fd35f34
50f81ac
44cec05
1e83c6b
ecc129d
09b0474
10c2073
41056ea
8def458
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -183,3 +183,18 @@ def check_table_values_for_dict(table: DynamicTable, nelems: int = 200): | |||||||||||||||||||||||
| if is_string_json_loadable(string=string): | ||||||||||||||||||||||||
| message += " This string is also JSON loadable, so call `json.loads(...)` on the string to unpack." | ||||||||||||||||||||||||
| yield InspectorMessage(message=message) | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| @register_check(importance=Importance.BEST_PRACTICE_SUGGESTION, neurodata_type=DynamicTable) | ||||||||||||||||||||||||
| def check_col_not_nan(table: DynamicTable, nelems: int = 200): | ||||||||||||||||||||||||
| """Check if all of the values in a single column of a table are NaN.""" | ||||||||||||||||||||||||
| for column in table.columns: | ||||||||||||||||||||||||
| if not hasattr(column, "data") or isinstance(column, VectorIndex) or isinstance(column.data[0], str): | ||||||||||||||||||||||||
| continue | ||||||||||||||||||||||||
| subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | ||||||||||||||||||||||||
| if np.any(~np.isnan(column[subindex_selection])): | ||||||||||||||||||||||||
| continue | ||||||||||||||||||||||||
| else: | ||||||||||||||||||||||||
| yield InspectorMessage( | ||||||||||||||||||||||||
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | ||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
| subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | |
| if np.any(~np.isnan(column[subindex_selection])): | |
| continue | |
| else: | |
| yield InspectorMessage( | |
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
| ) | |
| if np.all(np.isnan(column[:nelems])): | |
| yield InspectorMessage( | |
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's written this way in case np.any has an early return condition once it detects a single non-NaN value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.all can have the same early stopping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, interesting results...
Indeed as you say, all does indeed have early stopping just like any...
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.006882190704345703s!
Took 3.719329833984375e-05s!... however numpy does not appear to (likely because it's vectorized). Maybe that gives better performance for very large in-memory arrays, but for sparse inputs the early return from Python built-ins might be nicer.
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
np.all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
np.all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.05324196815490723s!
Took 0.05084705352783203s!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By sparse do you mean short?
Either approach is fine with me, either using all or np.all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess 'sparsity' (% of the number of False items being passed into all) isn't as relevant - I just mean for any case where there is at least one False element occurring somewhat early on in the input to all (as opposed to occurring later in the vector, as seen above), it will exit a few ~ms faster per check (which could add up to seconds when streaming over an entire dataset).
Anyway, I implemented the all logic here.
Uh oh!
There was an error while loading. Please reload this page.