
Commit 248b78b

Merge branch 'dev' into enh/subject_species
2 parents 9b94659 + 8e85ed6 commit 248b78b

10 files changed, 378 additions, 42 deletions

docs/best_practices/extensions.rst

Lines changed: 25 additions & 18 deletions
@@ -1,40 +1,47 @@
 Extensions
 ==========

-Extend only when necessary. Extensions are an essential mechanism to integrate data with NWB that is otherwise not
-supported. However, we here need to consider that there are certain costs associated with extensions, e.g., cost of
-creating, supporting, documenting, and maintaining new extensions and effort for users to use and learn extensions.
-As such, users should attempt to use core neurodata_types or existing extentions before creating extensions.
-``DynamicTables``, which are used throughout the NWB schema e.g., to store information about time intervals and
-electrodes, provide the ability to dynamically add columns without the need for extensions, and can help avoid the
-need for custom extensions in many cases.
+Extend the core NWB schema only when necessary. Extensions are an essential mechanism to integrate
+data with NWB that is otherwise not supported. However, we need to consider that there are certain costs associated
+with extensions, *e.g.*, the cost of creating, supporting, documenting, and maintaining new extensions, and the effort
+for users to use and learn already-created extensions. As such, users should attempt to use core ``neurodata_types`` or
+pre-existing extensions before creating new ones. :nwb-schema:ref:`sec-DynamicTables`, which are used throughout the
+NWB schema to store information about time intervals, electrodes, or spiking output, provide the ability to
+dynamically add columns without the need for extensions, and can help avoid the need for custom extensions in many
+cases.
+
+If an extension is required, tutorials for the process may be found through the
+:nwb-overview:`NWB Overview for Extensions <extensions_tutorial/extensions_tutorial_home.html>`.
+
+Extensions are also encouraged to contain check functions for their own best practices.
+See the :ref:`adding_custom_checks` section of the Developer Guide for how to do this.

-TODO, add links to the tutorials for extensions


 Use Existing Neurodata Types
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-When possible, use existing types when creating extensions either by creating new neurodata_types that inherit from
-existing ones, or by creating neurodata_types that contain existing ones. Building on existing types facilitates the
+
+When possible, use existing types when creating extensions, either by creating new ``neurodata_types`` that inherit
+from existing ones, or by creating ``neurodata_types`` that contain existing ones. Building on existing types facilitates the
 reuse of existing functionality and interpretation of the data. If a community extension already exists that has a
 similar scope, it is preferable to use that extension rather than creating a new one.


 Provide Documentation
 ~~~~~~~~~~~~~~~~~~~~~

-When creating extensions be sure to provide meaningful documentation as part of the extension specification, of all
-fields (groups, datasets, attributes, links etc.) to describe what they store and how they are used.
+When creating extensions, be sure to provide thorough, meaningful documentation as part of the extension specification.
+Explain all fields (groups, datasets, attributes, links, etc.) and describe what they store and how they
+should be used.


 Write the Specification to the NWBFile
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-When using pynwb, you can store the specification (core and extension) within the NWBFile by using
-. Caching the specification is preferable, particularly if you are using a
-custom extension, because this ensures that anybody who receives the data also receives the necessary data to
-interpret it.
+You can store the specification (core and extension) within the NWBFile through caching.
+Caching the specification is preferable, particularly if you are using a custom extension, because this ensures that
+anybody who receives the data also receives the necessary data to interpret it.

 .. note::
-    In PyNWB, the extension is cached automatically. This can be specified explicitly with ``io.write(filepath,
-    cache_spec=True)``
+    In :pynwb-docs:`PyNWB <>`, the extension is cached automatically. This can be specified explicitly with
+    ``io.write(filepath, cache_spec=True)``.

docs/best_practices/tables.rst

Lines changed: 73 additions & 21 deletions
@@ -1,7 +1,8 @@
 Tables
 ======

-The DynamicTable data type that NWB uses allows you to define custom columns, which offer a high degree of flexibility.
+The :nwb-schema:ref:`dynamictable` data type stores tabular data. It also allows you to define custom columns, which
+offer a high degree of flexibility.



@@ -10,38 +11,89 @@
 Tables With Only a Single Row
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-It is not common to save a table with only a single row entry. Consider other ``neurodata_types``, such as a one-dimensional :nwb-schema:ref:`sec-TimeSeries` or any of its subtypes.
+It is not common to save a table with only a single row entry. Consider other ``neurodata_types``, such as a one-dimensional :nwb-schema:ref:`sec-TimeSeries`.

 Check function: :py:meth:`~nwbinspector.checks.tables.check_single_row`



 .. _best_practice_dynamic_table_region_data_validity:

-Unassigned
-~~~~~~~~~~
+Table Region Data
+~~~~~~~~~~~~~~~~~

-Store data with long columns rather than long rows. When constructing dynamic tables, keep in mind that the data is stored by column, so it will be
-inefficient to store data in a table with many columns.
-bools
+Store data with long columns rather than long rows. When constructing dynamic tables, keep in mind that the data is
+stored by column, so it will be less efficient to store data in a table that has many more columns than rows.

-Check function :ref:`check_column_binary_capability <check_column_binary_capability>`
+Check function: :py:meth:`~nwbinspector.checks.tables.check_dynamic_table_region_data_validity`



+.. _best_practice_column_binary_capability:

-Use boolean values where appropriate. Although boolean values (True/False) are not used in the core schema, they are a supported data type, and we
-encourage the use of DynamicTable columns with boolean values. For instance, boolean values would be appropriate for a correct custom column to the trials table.
-times
+Boolean Columns
+~~~~~~~~~~~~~~~

-Times are always stored in seconds in NWB. This rule applies to times in TimeSeries, TimeIntervals and across NWB:N in general. E.g., in TimeInterval
-objects such as the trials and epochs table, start_time and stop_time should both be in seconds with respect to the timestamps_reference_time (which by
-default is set to the session_start_time).
-Additional time columns in TimeInterval tables (e.g., trials) should have _time as name suffix. E.g., if you add more times in the trials table, for
-instance a subject response time, name it with _time at the end (e.g. response_time) and store the time values in seconds from the timestamps_reference_time,
-just like start_time and stop_time.
+Use boolean values where appropriate. Although boolean values (``True``/``False``) are not used in the core schema,
+they are a supported data type, and we encourage the use of :nwb-schema:ref:`dynamictable` columns with boolean
+values. It is also encouraged that boolean columns be named ``is_condition``, where ``condition`` is
+whatever the positive state of the variable is; *e.g.*, you might create a column called ``is_correct`` that has
+boolean values.

-Set the timestamps_reference_time if you need to use a different reference time. Rather than relative times, it can in practice be useful to use a common
-global reference time across files (e.g., Posix time). To do so, NWB:N allows users to set the timestamps_reference_time which serves as reference for all
-timestamps in a file. By default, timestamp_reference_time is usually set to the session_start_time to use relative times.
-electrodes: ‘location’
+The reason for this practice is two-fold:
+
+(i) It allows for easier user comprehension of the information by intuitively restricting the range of possible values
+for the column; a user would otherwise have to extract all the values and calculate the unique set to see that there
+are only two values.
+
+(ii) For large amounts of data, it also saves storage space within the HDF5 file by using the minimal
+number of bytes per item. This can be especially important if the repeated values are long strings or float casts of
+``1`` and ``0``.
+
+An example of a violation of this practice would be a column of strings with the following values throughout:
+
+.. code-block:: python
+
+    hit_or_miss_col = ["Hit", "Miss", "Miss", "Hit", ...]
+
+This should instead become:
+
+.. code-block:: python
+
+    is_hit = [True, False, False, True, ...]
+
+
+Check function: :py:meth:`~nwbinspector.checks.tables.check_column_binary_capability`
+
+.. note::
+
+    If the two unique values in your column are ``float`` types that differ from ``1`` and ``0``, the reported values
+    are to be considered additional contextual information for the column, and this practice does not apply.
+
+.. note::
+
+    HDF5 does not natively store boolean values. ``h5py`` handles this by automatically transforming boolean values
+    into an enumerated type, where ``False`` maps to 0 ("FALSE") and ``True`` maps to 1 ("TRUE"). On read, these
+    values are converted back to the ``np.bool_`` type. ``pynwb`` does the same, so if you are reading and writing
+    with ``pynwb`` you may not need to worry about this. However, this will be important to know if you write using
+    PyNWB and read with some other language.
+
+
+
+Timing Columns
+~~~~~~~~~~~~~~
+
+Times are always stored in seconds in NWB. In :nwb-schema:ref:`sec-TimeIntervals` tables such as the
+:nwb-schema:ref:`trials <sec-groups-intervals-trials>` and
+:nwb-schema:ref:`epochs <epochs>` tables, ``start_time`` and ``stop_time`` should both be in seconds with respect to
+the ``timestamps_reference_time`` of the :nwb-schema:ref:`sec-NWBFile` (which by default is the
+``session_start_time``; see :ref:`best_practice_global_time_reference` for more details).
+
+Additional time columns in :nwb-schema:ref:`sec-TimeIntervals` tables, such as the
+:nwb-schema:ref:`trials <sec-groups-intervals-trials>` table, should have ``_time`` as a suffix to the name.
+*E.g.*, if you add more times to :nwb-schema:ref:`trials <sec-groups-intervals-trials>`, such as a subject
+response time, name it ``response_time`` and store the time values in seconds from the ``timestamps_reference_time``
+of the :nwb-schema:ref:`sec-NWBFile`, just like ``start_time`` and ``stop_time``.
+This convention is used by downstream processing tools. For instance, NWBWidgets uses these times to create
+peri-stimulus time histograms relating spiking activity to trial events. See
+:ref:`best_practice_global_time_reference` for more details.
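As a quick illustration of the timing convention above, here is a minimal sketch (hypothetical timestamps, standard library only) of converting an absolute wall-clock event time into the seconds-relative value that a ``response_time`` column would store:

```python
from datetime import datetime, timezone

# Hypothetical reference time: the timestamps_reference_time, which by
# default is the session_start_time of the NWBFile.
session_start_time = datetime(2022, 3, 1, 12, 0, 0, tzinfo=timezone.utc)

# Hypothetical absolute time at which the subject responded during a trial.
absolute_response = datetime(2022, 3, 1, 12, 0, 42, 500000, tzinfo=timezone.utc)

# NWB stores times in seconds relative to the reference time, so the value
# written to a "response_time" column would be the elapsed seconds:
response_time = (absolute_response - session_start_time).total_seconds()
print(response_time)  # 42.5
```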

docs/conf_extlinks.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
 # "nwb-widgets-docs": ("https://github.com/NeurodataWithoutBorders/nwb-jupyter-widgets", ""),
 # "nwb-widgets-src": ("https://github.com/NeurodataWithoutBorders/nwb-jupyter-widgets", ""),
 # "caiman-docs": ("https://caiman.readthedocs.io/en/master/", ""),
-"nwb-overview": ("https://nwb-overview.readthedocs.io/en/latest/", ""),
+"nwb-overview": ("https://nwb-overview.readthedocs.io/en/latest/%s", "%s"),
 # "nwb-overview-src": ("https://github.com/NeurodataWithoutBorders/nwb-overview", ""),
 # "nwb-main": ("https://www.nwb.org/", ""),
 "conda-install": (
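For context on this one-line change: Sphinx's ``extlinks`` extension substitutes the role's target into the ``%s`` placeholder of the base URL (and of the caption), which is why the placeholders were added. A small sketch of that substitution, using the target from the extensions doc:

```python
# Sketch of how Sphinx extlinks expands a role such as
# :nwb-overview:`NWB Overview for Extensions <extensions_tutorial/extensions_tutorial_home.html>`
# given the (base_url, caption) pair configured in conf_extlinks.py.
base_url = "https://nwb-overview.readthedocs.io/en/latest/%s"
target = "extensions_tutorial/extensions_tutorial_home.html"

url = base_url % target
print(url)
# https://nwb-overview.readthedocs.io/en/latest/extensions_tutorial/extensions_tutorial_home.html
```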

docs/developer_guide.rst

Lines changed: 24 additions & 0 deletions
@@ -11,3 +11,27 @@
 :nwbinspector-contributing:`Contributing <>` page.

 Otherwise feel free to raise a bug report, documentation mistake, or general feature request for our maintainers to address!
+
+
+
+.. _adding_custom_checks:
+
+Adding Custom Checks to the Registry
+------------------------------------
+
+If you are writing an extension, or have any personal Best Practices specific to your lab, you can incorporate these
+into your own usage of the NWBInspector. To add a custom check to your default registry, all you have to do is wrap
+your check function with the :py:class:`~nwbinspector.register_checks.register_check` decorator, like so:
+
+.. code-block:: python
+
+    from nwbinspector.register_checks import available_checks, register_check, Importance
+
+    @register_check(importance=Importance.SOME_IMPORTANCE_LEVEL, neurodata_type=some_neurodata_type)
+    def check_personal_practice(...):
+        ...
+
+Then, for this check to be automatically included when you run the inspector through the CLI, simply specify
+the modules flag ``-m`` or ``--modules`` along with the name of the module that contains your custom check. If using
+the library instead, you need only import the ``available_checks`` global variable from your own submodules, or
+otherwise import your check functions after importing ``nwbinspector`` in your ``__init__.py``.
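The registry pattern described above can be sketched independently of nwbinspector with a plain decorator that appends functions to a module-level list. This is an illustrative sketch of the *pattern* only; the names and the string-valued metadata below are hypothetical, not nwbinspector's actual internals:

```python
# Minimal, self-contained sketch of a decorator-based check registry.
available_checks = []

def register_check(importance, neurodata_type):
    """Parameterized decorator: tag the check with metadata and register it."""
    def decorator(check_function):
        check_function.importance = importance
        check_function.neurodata_type = neurodata_type
        available_checks.append(check_function)
        return check_function
    return decorator

# Hypothetical custom check; in real usage the importance would be an
# Importance enum member and neurodata_type a pynwb class.
@register_check(importance="BEST_PRACTICE_SUGGESTION", neurodata_type="DynamicTable")
def check_personal_practice(table):
    ...

print([check.__name__ for check in available_checks])  # ['check_personal_practice']
```

Because registration happens at import time, simply importing the module that defines the decorated functions is enough to populate the registry, which matches the described behavior of the ``--modules`` flag.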

nwbinspector/checks/ecephys.py

Lines changed: 25 additions & 0 deletions
@@ -50,3 +50,28 @@
 def check_electrical_series_reference_electrodes_table(electrical_series: ElectricalSeries):
     if electrical_series.electrodes.table.name != "electrodes":
         return InspectorMessage(message="electrodes does not reference an electrodes table.")
+
+
+@register_check(importance=Importance.BEST_PRACTICE_VIOLATION, neurodata_type=Units)
+def check_spike_times_not_in_unobserved_interval(units_table: Units, nunits: int = 4):
+    """Check if a Units table has spike times that occur outside of observed intervals."""
+    if not units_table.obs_intervals:
+        return
+    for unit_spike_times, unit_obs_intervals in zip(
+        units_table["spike_times"][:nunits], units_table["obs_intervals"][:nunits]
+    ):
+        spike_times_array = np.array(unit_spike_times)
+        if not all(
+            sum(
+                [
+                    np.logical_and(start <= spike_times_array, spike_times_array <= stop)
+                    for start, stop in unit_obs_intervals
+                ]
+            )
+        ):
+            return InspectorMessage(
+                message=(
+                    "This Units table contains spike times that occur during periods of time not labeled as being "
+                    "observed intervals."
+                )
+            )
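The interval test inside the new check can be seen in isolation with plain NumPy: a spike time counts as observed if it falls inside at least one ``(start, stop)`` pair, and summing the per-interval boolean masks gives, for each spike, how many intervals contain it (the values below are toy data, not from any real file):

```python
import numpy as np

# Toy data: two observed intervals and four spike times; the third spike
# (at t = 15.0) falls in the gap between the intervals.
obs_intervals = [(0.0, 10.0), (20.0, 30.0)]
spike_times = np.array([1.5, 9.9, 15.0, 25.0])

# Sum the per-interval membership masks; a zero entry means the spike was
# never inside an observed interval, which is what the check flags.
in_any_interval = sum(
    np.logical_and(start <= spike_times, spike_times <= stop)
    for start, stop in obs_intervals
)
print(in_any_interval.astype(bool))  # [ True  True False  True]
```

The check itself applies `all(...)` to this summed mask, so a single unobserved spike time is enough to trigger the inspector message.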

nwbinspector/checks/tables.py

Lines changed: 19 additions & 1 deletion
@@ -8,11 +8,12 @@
 from pynwb.file import TimeIntervals, Units

 from ..register_checks import register_check, InspectorMessage, Importance
-from ..utils import format_byte_size, is_ascending_series
+from ..utils import format_byte_size, is_ascending_series, is_dict_in_string, is_string_json_loadable


 @register_check(importance=Importance.CRITICAL, neurodata_type=DynamicTableRegion)
 def check_dynamic_table_region_data_validity(dynamic_table_region: DynamicTableRegion, nelems=200):
+    """Check if a DynamicTableRegion is valid."""
     if np.any(np.asarray(dynamic_table_region.data[:nelems]) > len(dynamic_table_region.table)):
         return InspectorMessage(
             message=(
@@ -158,6 +159,23 @@
         )


+@register_check(importance=Importance.BEST_PRACTICE_VIOLATION, neurodata_type=DynamicTable)
+def check_table_values_for_dict(table: DynamicTable, nelems: int = 200):
+    """Check if any values in a row or column of a table contain a string casting of a Python dictionary."""
+    for column in table.columns:
+        if not hasattr(column, "data") or isinstance(column, VectorIndex) or not isinstance(column.data[0], str):
+            continue
+        for string in column.data[:nelems]:
+            if is_dict_in_string(string=string):
+                message = (
+                    f"The column '{column.name}' contains a string value that contains a dictionary! Please "
+                    "unpack dictionaries as additional rows or columns of the table."
+                )
+                if is_string_json_loadable(string=string):
+                    message += " This string is also JSON loadable, so call `json.loads(...)` on the string to unpack."
+                yield InspectorMessage(message=message)
+
+
 # @register_check(importance="Best Practice Violation", neurodata_type=pynwb.core.DynamicTable)
 # def check_column_data_is_not_none(nwbfile):
 #     """Check column values in DynamicTable to ensure they are not None."""

nwbinspector/utils.py

Lines changed: 26 additions & 0 deletions
@@ -1,4 +1,6 @@
 """Commonly reused logic for evaluating conditions; must not have external dependencies."""
+import re
+import json
 import numpy as np
 from typing import TypeVar, Optional, List
 from pathlib import Path
@@ -7,6 +9,8 @@
 FilePathType = TypeVar("FilePathType", str, Path)
 OptionalListOfStrings = Optional[List[str]]

+dict_regex = r"({.+:.+})"
+

 def format_byte_size(byte_size: int, units: str = "SI"):
     """
@@ -46,3 +50,25 @@

 def is_ascending_series(series: np.ndarray, nelems=None):
     return np.all(np.diff(series[:nelems]) > 0)
+
+
+def is_dict_in_string(string: str):
+    """
+    Determine if the string value contains an encoded Python dictionary.
+
+    Can also be the direct result of string-casting a dictionary, *e.g.*, ``str(dict(a=1))``.
+    """
+    return any(re.findall(pattern=dict_regex, string=string))
+
+
+def is_string_json_loadable(string: str):
+    """
+    Determine if the serialized dictionary is a JSON object.
+
+    Rather than constructing a complicated regex pattern, a simple try/except of ``json.loads`` should suffice.
+    """
+    try:
+        json.loads(string)
+        return True
+    except json.JSONDecodeError:
+        return False
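The two helpers added above can be exercised directly. Note the asymmetry they are built around: ``str(dict(...))`` yields single-quoted keys, which the regex catches but ``json.loads`` rejects, while a double-quoted serialization satisfies both. This standalone version mirrors the diff's definitions:

```python
import re
import json

dict_regex = r"({.+:.+})"

def is_dict_in_string(string: str) -> bool:
    """Detect a string-cast dictionary, e.g. str(dict(a=1))."""
    return any(re.findall(pattern=dict_regex, string=string))

def is_string_json_loadable(string: str) -> bool:
    """A try/except around json.loads instead of a complex regex."""
    try:
        json.loads(string)
        return True
    except json.JSONDecodeError:
        return False

print(is_dict_in_string(str(dict(a=1))))        # True
print(is_string_json_loadable(str(dict(a=1))))  # False (single quotes)
print(is_string_json_loadable('{"a": 1}'))      # True
```

This is why `check_table_values_for_dict` only appends the "call `json.loads(...)`" hint when the second helper also returns ``True``.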
