# PILArNet-Medium

We provide the 156 GB **PILArNet-M** dataset, a continuation of [PILArNet](https://arxiv.org/abs/2006.01993), consisting of >1M [LArTPC](https://www.symmetrymagazine.org/article/october-2012/time-projection-chambers-a-milestone-in-particle-detector-technology?language_content_entity=und) events. Download the dataset from this [link](https://drive.google.com/drive/folders/1nec9WYPRqMn-_3m6TdM12TmpoInHDosb?usp=drive_link) or via the following command:

```bash
gdown --folder 1nec9WYPRqMn-_3m6TdM12TmpoInHDosb -O /path/to/save/dataset
```

> [!NOTE]
> `gdown` must be installed (e.g., `pip install gdown` or `conda install gdown`). If large downloads from Google Drive fail, check for quota limits or network issues, and verify that your `gdown` version supports folder downloads.

## Indexing the Dataset

Before loading the dataset, it's necessary to index the point cloud sizes. This step ensures that only events with a sufficient number of points are used, effectively filtering out sparse events that might otherwise introduce noise or errors into downstream analysis. Run the following command to create an index:

```bash
python -m polarmae.datasets.build_index /path/to/dataset/**/*.h5 -j N
```

- The `-j N` argument is optional and enables parallel processing, speeding up the indexing.
- This command creates a corresponding `*_points.npy` file for each `.h5` file. These `.npy` files contain the number of points in each event and are used by the dataloader to decide which events to include.
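
The resulting index can be inspected directly with NumPy. A small sketch (the helper name is illustrative; the one-count-per-event layout follows the description above):

```python
import numpy as np

def summarize_index(index_path, min_points=1024):
    """Load a *_points.npy index and return (total events, events passing the cut).

    Illustrative helper, not part of polarmae: the index is assumed to hold
    one point count per event, as described above.
    """
    counts = np.load(index_path)
    return int(counts.size), int((counts >= min_points).sum())
```

For instance, calling `summarize_index` on one of the generated `*_points.npy` files reports how many of that file's events would survive a `min_points=1024` cut.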

## Directory Structure

The dataset is stored in HDF5 format and organized as follows:

```plaintext
/path/to/dataset/
    /train/
        /generic_v2_196200_v1.h5
        /generic_v2_153600_v1.h5
        ...
    /val/
        /generic_v2_10880_v1.h5
        ...
```

Here, the number preceding `v1` indicates the number of events contained in the file. The dataset is split into train and validation sets with:
- 1,199,200 events in the train set
- 10,880 events in the validation set
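
The event count encoded in each filename can be recovered programmatically. A small sketch (the regular expression is an assumption based on the naming pattern above):

```python
import re

def events_in_file(filename):
    """Extract the event count from names like 'generic_v2_196200_v1.h5'."""
    match = re.search(r"_(\d+)_v1\.h5$", filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")
    return int(match.group(1))
```

Summing `events_in_file` over all files in `train/` should reproduce the 1,199,200-event total.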

## Dataset Attributes

Each HDF5 file contains two main attributes:

- **`point`:**
  Each entry holds the spacepoints of a single event, storing for each point:
  - 3D point coordinates
  - Energy deposit
  - Absolute time
  - Number of electrons
  - dx

  The raw data is stored as a flattened 1D array, which should be reshaped to `(N, 8)` for an event with `N` points. For example:

  ```python
  import numpy as np

  # Assuming `data` is the flattened `point` array loaded for one event
  N = len(data) // 8
  reshaped_points = data.reshape((N, 8))
  ```

- **`cluster`:**
  Each entry corresponds to a cluster of spacepoints, containing:
  - Number of points in the cluster
  - Fragment ID
  - Group ID
  - Interaction ID
  - Semantic type

  Similarly, reshape the flattened array to `(N, 5)` for an event with `N` clusters:

  ```python
  # Assuming `cluster_data` is the flattened `cluster` array loaded for one event
  N = len(cluster_data) // 5
  reshaped_clusters = cluster_data.reshape((N, 5))
  ```

*Note:* Points in the `point` array are ordered by the cluster they belong to, enabling an association with the corresponding attributes in `cluster`.
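
Because of this ordering, per-cluster attributes can be broadcast to individual points with `np.repeat`. A minimal sketch (the column positions follow the field order listed above; the helper name is illustrative):

```python
import numpy as np

def semantic_label_per_point(clusters):
    """Expand per-cluster semantic types into one label per point.

    `clusters` is the reshaped (N, 5) array: column 0 holds the number of
    points in each cluster, column 4 the semantic type.
    """
    counts = clusters[:, 0].astype(int)
    return np.repeat(clusters[:, 4], counts)
```

The result lines up index-for-index with the `(N, 8)` point array, since points are stored cluster by cluster.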

A [Colab notebook](https://colab.research.google.com/drive/1x8WatdJa5D7Fxd3sLX5XSJiMkT_sG_im) is provided for a hands-on introduction to loading and inspecting the dataset.
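
The reshaping recipe above can also be combined with `h5py` to pull a single event out of a file. A minimal sketch (the `point` key and per-event indexing are assumptions based on the attribute description above):

```python
import h5py
import numpy as np

def load_event_points(h5_path, event_index):
    """Load one event's flattened point array and reshape it to (N, 8)."""
    with h5py.File(h5_path, "r") as f:
        flat = np.asarray(f["point"][event_index])  # flattened 1D array
    return flat.reshape(-1, 8)                      # one row per spacepoint
```

The same pattern applies to the `cluster` attribute with a row width of 5.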

## Usage in PoLAr-MAE

The dataset and its `PILArNetDataModule` dataloader are used throughout this repository. The data path is specified in the config file as follows:

```yaml
data:
  class_path: polarmae.datasets.PILArNetDataModule
  init_args:
    data_path: /path/to/dataset/train/*.h5
    batch_size: 32
    num_workers: 4
    dataset_kwargs:
      energy_threshold: 0.13 # Minimum energy for a point to be included.
      remove_low_energy_scatters: true # Remove low-energy scatter points (semantic ID 4).
      emin: 1.0e-2 # Lower bound for energy log-normalization.
      emax: 20.0 # Upper bound for energy log-normalization.
      maxlen: 10000 # Maximum number of events to load.
      min_points: 1024 # Minimum number of points per event.
      return_semantic_id: false # Set to true if semantic segmentation labels are needed.
      return_cluster_id: false # Set to true if cluster identification is required.
```

Keyword arguments:

- **`energy_threshold`:** Excludes low-interest points by ensuring only points with a sufficient energy deposit are processed.
- **`remove_low_energy_scatters`:** Low-energy scatters appear as isolated points with seemingly no relation to other particle trajectories, and are thus often removed.
- **`emin` and `emax`:** Define the energy range for the log-transformation, aiding numerical stability and performance.
- **`maxlen`:** Limits the dataset size for quick iteration during testing or debugging.
- **`min_points`:** Ensures that only events with enough points are used, which is critical for reliable analysis.
- **`return_semantic_id` & `return_cluster_id`:** Toggle additional labels depending on the downstream task (e.g., segmentation vs. clustering).
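
As an illustration of how `emin` and `emax` bound the log-transformation (this exact normalization is an assumption, not necessarily the repository's code): energies are clipped to `[emin, emax]` and the log is mapped linearly onto `[-1, 1]`.

```python
import numpy as np

def log_normalize_energy(e, emin=1.0e-2, emax=20.0):
    """Clip energies to [emin, emax] and map log(e) linearly onto [-1, 1].

    Illustrative sketch of a log-normalization using the config bounds above.
    """
    e = np.clip(e, emin, emax)
    lo, hi = np.log(emin), np.log(emax)
    return 2.0 * (np.log(e) - lo) / (hi - lo) - 1.0
```

Clipping keeps outlier deposits from stretching the dynamic range, which helps numerical stability during training.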