
Commit a934816

Merge pull request #594 from lincc-frameworks/config_dataflow
Adding Doc Pages For Config System and Data Flow
2 parents 9308a12 + 5468b86 commit a934816

File tree

4 files changed (+48, -0 lines)

324 KB

docs/_static/hyrax_data_flow.png

1.2 MB

docs/configuration_system.rst

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
The ``Hyrax`` Configuration System
==================================

``hyrax`` makes extensive use of configuration variables to manage the runtime environment of training and inference runs. A ``hyrax_default_config.toml`` file (full contents listed here), included with ``hyrax``, contains every variable that ``hyrax`` could need to operate. To create a custom configuration file, simply create a ``.toml`` file and change variables as you see fit; if you're running with a custom dataset or model, add your own variables as well.
Config variables are inherited from a hierarchy of sources, similar to ``python`` classes. First, ``hyrax`` loads the variables set in its default configuration. Next, it loads the relevant default config of any custom ``hyrax`` packages the user is utilizing; it determines which packages to include by checking which custom classes are loaded initially and looking for their default configs. If a package doesn't have a default config, ``hyrax`` will throw a warning. Finally, it uses whatever variables have been declared in the user-defined config ``.toml`` (see here for how to load those through a notebook/script or the CLI).
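The layering described above behaves like a series of dictionary merges in which later sources override earlier ones. The sketch below is a simplified illustration of that idea, not ``hyrax``'s actual merge code, and all variable names are hypothetical::

```python
# Simplified illustration of config inheritance: later layers override
# earlier ones. Not hyrax's actual merge logic; names are hypothetical.
def merge_configs(*layers):
    """Merge config layers; later layers take precedence."""
    merged = {}
    for layer in layers:
        for section, variables in layer.items():
            merged.setdefault(section, {}).update(variables)
    return merged

hyrax_defaults = {"train": {"epochs": 10, "batch_size": 32}}
package_defaults = {"train": {"batch_size": 64}}
user_config = {"train": {"epochs": 20}}

runtime = merge_configs(hyrax_defaults, package_defaults, user_config)
print(runtime)  # -> {'train': {'epochs': 20, 'batch_size': 64}}
```
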
.. figure:: _static/hyrax_config_system.png
   :width: 100%
   :alt: The inheritance hierarchy of the hyrax configuration system.

``hyrax`` passes all of the configuration variables along to the relevant model and dataset classes, allowing them to configure the runtime through one system. This allows for extensibility and cross-compatibility within the broader "hyrax ecosystem". From the point of view of the code, these configuration variables should be treated as static; this makes it easier for researchers to develop code separately from the runtime environment.
A core design principle of ``hyrax`` is "code by config", meaning that all runtime parameters should be set through configuration files rather than hard-coded values. This approach enhances flexibility, reproducibility, and ease of experimentation, as users can modify configurations without altering the underlying codebase. This also facilitates sharing and collaboration, as configurations can be easily shared and adapted for different use cases while keeping fundamental models and datasets consistent.
After training is completed, ``hyrax`` writes out all of the variables used at runtime (combined from the various source configs) to the runtime directory as a ``runtime_config.toml`` file, so the user can see in one place which variables were actually used.

docs/data_flow.rst

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
Data Flow Through Hyrax
========================

Accessing Data on Disk
----------------------

``Hyrax`` makes use of ``Dataset`` and ``DataProvider`` classes to act as interfaces between ``hyrax``, ``pytorch``, and data on disk. A main goal of ``hyrax`` development is to create a large, stable collection of dataset classes based on different data types and sources (like LSST, HSC, Gaia, Roman, etc.) that will allow researchers to hit the ground running with their machine learning projects.
``Datasets`` are the direct interface between the ``hyrax`` ecosystem and the desired data. The ``Dataset`` is responsible for handling the data on a per-index level (managing specific data types, labels, and other metadata), while the ``DataProvider`` is responsible for batching the data and passing it along toward the training step. This separation of concerns allows for great flexibility in how data is handled and processed.
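As a schematic of the per-index responsibility of a dataset, consider the following toy class. It is a hypothetical illustration, not ``hyrax``'s actual ``Dataset`` API::

```python
# Schematic per-index dataset: handles one item at a time, returning the
# data plus its label and metadata. A hypothetical toy class, not
# hyrax's actual Dataset API.
class ToyDataset:
    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # One dictionary per item: data, label, and metadata together.
        return {
            "image": self.images[index],
            "label": self.labels[index],
            "object_id": index,
        }

dataset = ToyDataset(images=[[0.1, 0.2], [0.3, 0.4]], labels=[0, 1])
item = dataset[1]
print(item["label"])  # -> 1
```

A ``DataProvider`` would then be responsible for grouping such per-index items into batches before they reach the training step.
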
The ``collate`` Function
------------------------

The ``collate`` function is responsible for taking in a batch of data from the ``DataProvider`` and transforming it into a format that can be more readily ingested by the model. By default, the ``collate`` function takes in a list of dictionaries (one dictionary per item in the batch) and converts each field into a list, passing along a dictionary of lists to the ``to_tensor`` function. The function is customizable and has options to handle ragged data with padding.
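The default behavior described above can be sketched as follows. This is a simplified stand-in, not ``hyrax``'s actual ``collate`` implementation, and it omits the ragged-data padding options::

```python
# Default-style collate: a list of per-item dictionaries becomes one
# dictionary of lists, ready for the to_tensor step. A simplified
# sketch, not hyrax's actual implementation (padding options omitted).
def collate(batch):
    fields = batch[0].keys()
    return {field: [item[field] for item in batch] for field in fields}

batch = [
    {"image": [0.1, 0.2], "label": 0},
    {"image": [0.3, 0.4], "label": 1},
]
collated = collate(batch)
print(collated["label"])  # -> [0, 1]
```
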
The ``to_tensor`` Function
--------------------------

The ``to_tensor`` function is responsible for taking in the output of the ``collate`` function (a dictionary of lists) and converting the lists for each requested field into a numpy array. Please note that ``to_tensor`` is a misnomer from an earlier period of development and that the function will be renamed in the future. The function is customizable and acts as the last step in the data flow for the user to modify how data is transformed before being passed into the model.
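A minimal sketch of that conversion step, assuming the dictionary-of-lists input described above. This is not ``hyrax``'s actual ``to_tensor`` and the field names are hypothetical::

```python
import numpy as np

# Sketch of the to_tensor step: convert the list for each requested
# field into a numpy array. Simplified illustration, not hyrax's
# actual implementation; field names are hypothetical.
def to_tensor(collated, fields):
    return {field: np.asarray(collated[field]) for field in fields}

collated = {"image": [[0.1, 0.2], [0.3, 0.4]], "label": [0, 1]}
arrays = to_tensor(collated, fields=["image"])
print(arrays["image"].shape)  # -> (2, 2)
```
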
Model Input and Output Pipeline
-------------------------------

The data is then sent either to ``pytorch`` ignite for training or to ``onnx`` for inference. In the training case, the data is converted into a tensor and passed into the model's ``train_step`` function; the model processes the data and returns the output predictions. If the user wishes to perform data augmentations, these can also be set up in the model's ``train_step`` function.
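Schematically, an augmentation hook inside ``train_step`` could look like the following. This is a toy model in plain Python; ``hyrax``'s real ``train_step`` operates on ``pytorch`` tensors, and all names here are hypothetical::

```python
# Schematic train_step showing where a user could hook in
# augmentations. A hypothetical toy model; hyrax's real train_step
# receives pytorch tensors via ignite.
class ToyModel:
    def __init__(self, augment=None):
        # augment is an optional callable applied to each batch
        self.augment = augment

    def train_step(self, batch):
        if self.augment is not None:
            batch = self.augment(batch)
        # Stand-in for a forward pass: one "prediction" per item.
        return [sum(item) for item in batch]

def flip(batch):
    """Toy augmentation: reverse each item in the batch."""
    return [list(reversed(item)) for item in batch]

model = ToyModel(augment=flip)
predictions = model.train_step([[1, 2], [3, 4]])
print(predictions)  # -> [3, 7]
```
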
In the ``onnx`` case, the data remains a numpy array throughout model evaluation and into the result. Both paths produce a numpy array as their output.
Data Flow Diagram
-----------------

.. figure:: _static/hyrax_data_flow.png
   :width: 100%
   :alt: The data flow through hyrax from disk to model training.
