# PILArNet-Medium

We provide the 156 GB **PILArNet-M** dataset, a continuation of [PILArNet](https://arxiv.org/abs/2006.01993), consisting of >1M [LArTPC](https://www.symmetrymagazine.org/article/october-2012/time-projection-chambers-a-milestone-in-particle-detector-technology?language_content_entity=und) events. Download the dataset from this [link](https://drive.google.com/drive/folders/1nec9WYPRqMn-_3m6TdM12TmpoInHDosb?usp=drive_link) or via the following command:

```bash
gdown --folder 1nec9WYPRqMn-_3m6TdM12TmpoInHDosb -O /path/to/save/dataset
```

> [!NOTE]
> `gdown` must be installed (e.g., `pip install gdown` or `conda install gdown`). If large downloads from Google Drive fail, check for quota limits or network issues, and verify that your `gdown` version supports folder downloads.

## Indexing the Dataset

Before loading the dataset, it's necessary to index the point cloud sizes. This step ensures that only events with a sufficient number of points are used, effectively filtering out sparse events that might otherwise introduce noise or errors into downstream analysis. Run the following command to create an index:

```bash
python -m polarmae.datasets.build_index /path/to/dataset/**/*.h5 -j N
```

- The `-j N` argument is optional and enables parallel processing, speeding up the indexing.
- This command creates a corresponding `*_points.npy` file for each `.h5` file. These `.npy` files contain the number of points in each event and are used by the dataloader to decide which events to include.
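
The resulting index can be inspected directly with NumPy. A small sketch (the helper name is illustrative; the one-count-per-event layout follows the description above):

```python
import numpy as np

def summarize_index(index_path, min_points=1024):
    """Load a *_points.npy index and return (total events, events passing the cut).

    Illustrative helper, not part of polarmae: the index is assumed to hold
    one point count per event, as described above.
    """
    counts = np.load(index_path)
    return int(counts.size), int((counts >= min_points).sum())
```

For instance, calling `summarize_index` on one of the generated `*_points.npy` files reports how many of that file's events would survive a `min_points=1024` cut.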

## Directory Structure

The dataset is stored in HDF5 format and organized as follows:

```plaintext
/path/to/dataset/
    /train/
        /generic_v2_196200_v1.h5
        /generic_v2_153600_v1.h5
        ...
    /val/
        /generic_v2_10880_v1.h5
        ...
```

Here, the number preceding `v1` indicates the number of events contained in the file. The dataset is split into train and validation sets with:
- 1,199,200 events in the train set
- 10,880 events in the validation set
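
The event count encoded in each filename can be recovered programmatically. A small sketch (the regular expression is an assumption based on the naming pattern above):

```python
import re

def events_in_file(filename):
    """Extract the event count from names like 'generic_v2_196200_v1.h5'."""
    match = re.search(r"_(\d+)_v1\.h5$", filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")
    return int(match.group(1))
```

Summing `events_in_file` over all files in `train/` should reproduce the 1,199,200-event total.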

## Dataset Attributes

Each HDF5 file contains two main attributes:

- **`point`:**
  Each entry holds the spacepoints of a single event, storing for each point:
  - 3D point coordinates
  - Energy deposit
  - Absolute time
  - Number of electrons
  - dx

  The raw data is stored as a flattened 1D array, which should be reshaped to `(N, 8)` for an event with `N` points. For example:

  ```python
  import numpy as np

  # Assuming `data` is the flattened `point` array loaded for one event
  N = len(data) // 8
  reshaped_points = data.reshape((N, 8))
  ```

- **`cluster`:**
  Each entry corresponds to a cluster of spacepoints, containing:
  - Number of points in the cluster
  - Fragment ID
  - Group ID
  - Interaction ID
  - Semantic type

  Similarly, reshape the flattened array to `(N, 5)` for an event with `N` clusters:

  ```python
  # Assuming `cluster_data` is the flattened `cluster` array loaded for one event
  N = len(cluster_data) // 5
  reshaped_clusters = cluster_data.reshape((N, 5))
  ```

*Note:* Points in the `point` array are ordered by the cluster they belong to, enabling an association with the corresponding attributes in `cluster`.
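
Because of this ordering, per-cluster attributes can be broadcast to individual points with `np.repeat`. A minimal sketch (the column positions follow the field order listed above; the helper name is illustrative):

```python
import numpy as np

def semantic_label_per_point(clusters):
    """Expand per-cluster semantic types into one label per point.

    `clusters` is the reshaped (N, 5) array: column 0 holds the number of
    points in each cluster, column 4 the semantic type.
    """
    counts = clusters[:, 0].astype(int)
    return np.repeat(clusters[:, 4], counts)
```

The result lines up index-for-index with the `(N, 8)` point array, since points are stored cluster by cluster.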

A [Colab notebook](https://colab.research.google.com/drive/1x8WatdJa5D7Fxd3sLX5XSJiMkT_sG_im) is provided for a hands-on introduction to loading and inspecting the dataset.
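
The reshaping recipe above can also be combined with `h5py` to pull a single event out of a file. A minimal sketch (the `point` key and per-event indexing are assumptions based on the attribute description above):

```python
import h5py
import numpy as np

def load_event_points(h5_path, event_index):
    """Load one event's flattened point array and reshape it to (N, 8)."""
    with h5py.File(h5_path, "r") as f:
        flat = np.asarray(f["point"][event_index])  # flattened 1D array
    return flat.reshape(-1, 8)                      # one row per spacepoint
```

The same pattern applies to the `cluster` attribute with a row width of 5.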

## Usage in PoLAr-MAE

The dataset and its `PILArNetDataModule` dataloader are used throughout this repository. The data path is specified in the config file as follows:

```yaml
data:
  class_path: polarmae.datasets.PILArNetDataModule
  init_args:
    data_path: /path/to/dataset/train/*.h5
    batch_size: 32
    num_workers: 4
    dataset_kwargs:
      energy_threshold: 0.13 # Minimum energy for a point to be included.
      remove_low_energy_scatters: true # Remove low-energy scatter points (semantic ID 4).
      emin: 1.0e-2 # Lower bound for energy log-normalization.
      emax: 20.0 # Upper bound for energy log-normalization.
      maxlen: 10000 # Maximum number of events to load.
      min_points: 1024 # Minimum number of points per event.
      return_semantic_id: false # Set to true if semantic segmentation labels are needed.
      return_cluster_id: false # Set to true if cluster identification is required.
```

Keyword arguments:

- **`energy_threshold`:** Excludes low-interest points by ensuring only points with a sufficient energy deposit are processed.
- **`remove_low_energy_scatters`:** Low-energy scatters appear as isolated points with seemingly no relation to other particle trajectories, and are thus often removed.
- **`emin` and `emax`:** Define the energy range for the log-transformation, aiding numerical stability and performance.
- **`maxlen`:** Limits the dataset size for quick iteration during testing or debugging.
- **`min_points`:** Ensures that only events with enough points are used, which is critical for reliable analysis.
- **`return_semantic_id` & `return_cluster_id`:** Toggle additional labels depending on the downstream task (e.g., segmentation vs. clustering).
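
As an illustration of how `emin` and `emax` bound the log-transformation (this exact normalization is an assumption, not necessarily the repository's code): energies are clipped to `[emin, emax]` and the log is mapped linearly onto `[-1, 1]`.

```python
import numpy as np

def log_normalize_energy(e, emin=1.0e-2, emax=20.0):
    """Clip energies to [emin, emax] and map log(e) linearly onto [-1, 1].

    Illustrative sketch of a log-normalization using the config bounds above.
    """
    e = np.clip(e, emin, emax)
    lo, hi = np.log(emin), np.log(emax)
    return 2.0 * (np.log(e) - lo) / (hi - lo) - 1.0
```

Clipping keeps outlier deposits from stretching the dynamic range, which helps numerical stability during training.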