Commit e709c86: Initial upload

54 files changed: +5378 −0 lines changed

.gitignore

Lines changed: 171 additions & 0 deletions
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# PyPI configuration file
.pypirc

DATASET.md

Lines changed: 113 additions & 0 deletions
# PILArNet-Medium

We provide the 156 GB **PILArNet-M** dataset, a continuation of [PILArNet](https://arxiv.org/abs/2006.01993), consisting of >1M [LArTPC](https://www.symmetrymagazine.org/article/october-2012/time-projection-chambers-a-milestone-in-particle-detector-technology?language_content_entity=und) events. Download the dataset from this [link](https://drive.google.com/drive/folders/1nec9WYPRqMn-_3m6TdM12TmpoInHDosb?usp=drive_link) or via the following command:

```bash
gdown --folder 1nec9WYPRqMn-_3m6TdM12TmpoInHDosb -O /path/to/save/dataset
```

> [!NOTE]
> `gdown` must be installed (e.g., `pip install gdown` or `conda install gdown`). If you encounter issues with large file downloads from Google Drive, check for quota limitations or network issues. You may also need to verify that your `gdown` version supports folder downloads.

## Indexing the Dataset

Before loading the dataset, the point cloud sizes must be indexed. This step ensures that only events with a sufficient number of points are used, effectively filtering out sparse events that might otherwise introduce noise or errors into downstream analysis. Run the following command to create an index:

```bash
python -m polarmae.datasets.build_index /path/to/dataset/**/*.h5 -j N
```

- The `-j N` argument is optional and enables parallel processing with `N` workers, speeding up the indexing.
- This command creates a corresponding `*_points.npy` file for each `.h5` file. These `.npy` files contain the number of points in each event and are used by the dataloader to decide which events to include.

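Each index file is a plain NumPy array with one entry per event. A minimal sketch of how such an index might be used to select events, with a toy array standing in for a real `*_points.npy` file and a cut mirroring the dataloader's `min_points` filter:

```python
import numpy as np

# Toy index standing in for a `*_points.npy` file:
# one entry per event, giving that event's number of spacepoints.
points_per_event = np.array([4096, 512, 2048, 100, 1500])

# Keep only events with at least `min_points` spacepoints,
# mirroring the dataloader's `min_points` filter.
min_points = 1024
keep = np.flatnonzero(points_per_event >= min_points)
print(keep)  # indices of events that pass the cut: [0 2 4]
```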
## Directory Structure

The dataset is stored in HDF5 format and organized as follows:

```plaintext
/path/to/dataset/
    /train/
        /generic_v2_196200_v1.h5
        /generic_v2_153600_v1.h5
        ...
    /val/
        /generic_v2_10880_v1.h5
        ...
```

Here, the number preceding `v1` indicates the number of events contained in the file. The dataset is split into train and validation sets with:

- 1,199,200 events in the train set
- 10,880 events in the validation set

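Given that naming convention, the event count can be read directly from a filename; a small sketch (the file names are illustrative):

```python
import re

# Hypothetical file names following the documented pattern
# generic_v2_<n_events>_v1.h5.
names = ["generic_v2_196200_v1.h5", "generic_v2_153600_v1.h5"]

# The number preceding `v1` is the number of events in the file.
counts = [int(re.match(r"generic_v2_(\d+)_v1\.h5$", n).group(1)) for n in names]
print(counts)  # [196200, 153600]
```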
## Dataset Attributes

Each HDF5 file contains two main attributes:

- **`point`:**
  Each entry corresponds to the spacepoints of a single event, containing:
  - 3D point coordinates
  - Energy deposit
  - Absolute time
  - Number of electrons
  - dx

  The raw data is stored as a flattened 1D array, which should be reshaped to `(N, 8)` for an event with `N` points. For example:

  ```python
  import numpy as np

  # Assuming `data` is the flattened array loaded from an event
  N = len(data) // 8
  reshaped_points = data.reshape((N, 8))
  ```

- **`cluster`:**
  Each entry corresponds to a cluster of spacepoints, containing:
  - Number of points in the cluster
  - Fragment ID
  - Group ID
  - Interaction ID
  - Semantic type

  Similarly, reshape the flattened array to `(N, 5)` for an event with `N` clusters:

  ```python
  # Assuming `cluster_data` is the flattened array loaded from an event
  N = len(cluster_data) // 5
  reshaped_clusters = cluster_data.reshape((N, 5))
  ```

*Note:* Points in the `point` array are ordered by the cluster they belong to, enabling an association with the corresponding attributes in `cluster`.

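Because points are stored cluster-by-cluster, a per-point cluster label can be recovered from the first column of the cluster array (number of points per cluster); a minimal sketch with toy arrays (values are illustrative only):

```python
import numpy as np

# Toy cluster table with shape (N_clusters, 5):
# [n_points, fragment_id, group_id, interaction_id, semantic_type]
reshaped_clusters = np.array([
    [3, 0, 0, 0, 1],
    [2, 1, 0, 0, 2],
])

# Repeating each cluster's index by its point count yields one
# cluster label per point, in the same order as the `point` array.
n_points = reshaped_clusters[:, 0].astype(int)
point_cluster_id = np.repeat(np.arange(len(reshaped_clusters)), n_points)
print(point_cluster_id)  # [0 0 0 1 1]
```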
A [Colab notebook](https://colab.research.google.com/drive/1x8WatdJa5D7Fxd3sLX5XSJiMkT_sG_im) is provided for a hands-on introduction to loading and inspecting the dataset.

## Usage in PoLAr-MAE

The dataset and its corresponding dataloader are used throughout this repository. The data path is specified in the config file as follows:

```yaml
data:
  class_path: polarmae.datasets.PILArNetDataModule
  init_args:
    data_path: /path/to/dataset/train/*.h5
    batch_size: 32
    num_workers: 4
    dataset_kwargs:
      energy_threshold: 0.13           # Minimum energy for a point to be included.
      remove_low_energy_scatters: true # Filter out points with low energy deposits (semantic ID 4).
      emin: 1.0e-2                     # Lower bound for energy.
      emax: 20.0                       # Upper bound for energy.
      maxlen: 10000                    # Maximum number of events to load.
      min_points: 1024                 # Minimum number of points per event.
      return_semantic_id: false        # Set to true if semantic segmentation labels are needed.
      return_cluster_id: false         # Set to true if cluster identification is required.
```

Keyword arguments:

- **`energy_threshold`:** Excludes low-interest points by ensuring only points with sufficient energy deposits are processed.
- **`remove_low_energy_scatters`:** Low energy scatters appear as scattered points in each image with seemingly no relation to other particle trajectories, and are thus often removed.
- **`emin` and `emax`:** Define the energy range for log-transformation, aiding numerical stability and performance.
- **`maxlen`:** Allows quick iterations by limiting the dataset size during testing or debugging.
- **`min_points`:** Ensures that only events with enough data points are used, which is critical for reliable analysis.
- **`return_semantic_id` & `return_cluster_id`:** Toggle additional labels depending on the downstream task (e.g., segmentation vs. clustering).
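
As an illustration of how `emin` and `emax` can bound a log-energy transform, here is a minimal sketch of one common normalization scheme; the dataloader's actual transform may differ in detail, so treat this as an assumption:

```python
import numpy as np

def log_transform(energy, emin=1.0e-2, emax=20.0):
    """Clip energies to [emin, emax] and map log10-energy linearly to [0, 1].

    Hypothetical helper for illustration: one common normalization scheme,
    not necessarily the exact transform used by the dataloader.
    """
    e = np.clip(energy, emin, emax)
    return (np.log10(e) - np.log10(emin)) / (np.log10(emax) - np.log10(emin))

energies = np.array([0.001, 0.13, 20.0, 50.0])
# Values below emin clip to 0, values above emax clip to 1.
print(log_transform(energies))
```

Clipping before the log keeps out-of-range deposits from producing extreme or undefined values, which is the numerical-stability benefit the `emin`/`emax` bullets describe.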

LICENSE

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2025 DeepLearnPhysics

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
