# Processing CSV with Python

Tutorials on processing CSV files using common Python data science tools.


## CSV for analysis

The CSV (Comma-Separated Values) format is exceptionally convenient for data processing.
It is simple yet efficiently processed, supported by many analysis and introspection tools,
and is human-readable even without them.
However, CSV expects a fixed number of columns, while our data is often hierarchical,
resembling a tree: events contain particles, which have hypotheses corresponding to tracks,
which in turn are composed of clusters, which are composed of hits.

Consequently, we cannot directly convert our EDM4EIC ROOT files to CSV.
Instead, we first process, refine, and flatten this hierarchical data structure
into something simpler and more table-like that is pleasant to work with.
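
As a rough illustration of what flattening means, here is a minimal sketch with an entirely hypothetical nested event structure and column names; it turns a list of nested event records into one flat row per particle, ready to be written to CSV:

```python
import pandas as pd

# Hypothetical nested structure: each event holds a list of particles
events = [
    {'evt': 0, 'particles': [{'pdg': 3122, 'px': 0.1}, {'pdg': 2212, 'px': 0.4}]},
    {'evt': 1, 'particles': [{'pdg': 3122, 'px': -0.2}]},
]

# Flatten: one row per particle, with the event number repeated on every row
rows = [
    {'evt': event['evt'], **particle}
    for event in events
    for particle in event['particles']
]

flat_df = pd.DataFrame(rows)
flat_df.to_csv('particles_flat.csv', index=False)
print(flat_df)
```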

## Introduction: CSV Files as a Database

For analysis we can work with multiple CSV files that contain related information.
For example, one CSV file may contain MC-level event information (xBj, Q2, -t), another table
reconstructed-level information, and a third table Lambda decay information, all linked
together by event numbers.

```mermaid
erDiagram
    MC_Events {
        int event_id PK "Event Number"
        float xBj "True x"
        float Q2 "True Q2"
        float t "True t"
    }
    Reconstructed_Events {
        int event_id PK "Event Number"
        float xBj "Reco x"
        float Q2 "Reco Q2"
        float t "Reco -t"
    }
    Lambda_Decays {
        int event_id FK "Event Number"
        int lambda_id PK "Lambda Number"
        float momentum "Lambda reco data"
    }

    MC_Events ||--|| Reconstructed_Events : "links to"
    MC_Events ||--o{ Lambda_Decays : "links to"
```

These CSV files are essentially **database tables**,
and understanding this relationship helps us organize and analyze data more effectively.

With Python and pandas it is easy to combine them into joined tables such as
`MCvsReconstructed events`.
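
As a minimal sketch, assuming hypothetical file names `mc_events.csv` and `reco_events.csv` with the columns from the diagram above, such a joined table is a single merge:

```python
import pandas as pd

# Hypothetical file names; columns follow the ER diagram above
mc_df = pd.read_csv('mc_events.csv')      # event_id, xBj, Q2, t (true values)
reco_df = pd.read_csv('reco_events.csv')  # event_id, xBj, Q2, t (reconstructed)

# One-to-one join on the event number; suffixes keep both versions of each column
mc_vs_reco = mc_df.merge(reco_df, on='event_id', suffixes=('_mc', '_reco'))

# For example, compare true vs reconstructed Q2 event by event
print((mc_vs_reco['Q2_reco'] - mc_vs_reco['Q2_mc']).describe())
```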

## Meson Structure data tables

### 1. MC DIS Parameters Table (`dis_parameters*.csv`)
Contains Deep Inelastic Scattering parameters for each event:
- **One event has exactly one set of DIS parameters** (one-to-one relationship)
- Each row represents one complete event
- Includes kinematic variables: Q², x_Bjorken, energy, etc.
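
For a first look at this table, a minimal sketch, assuming the files match the `dis_parameters*.csv` pattern and use the column names from the Database Relationship Diagram below:

```python
import glob
import pandas as pd

# Take the first matching file just to inspect the schema
first_file = sorted(glob.glob('dis_parameters*.csv'))[0]
dis_df = pd.read_csv(first_file)

print(dis_df.columns.tolist())   # expect columns such as evt, q2, xbj, nu, w, y_d
print(dis_df.head())             # one row per event
print(dis_df['evt'].is_unique)   # True if the one-to-one relationship holds
```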


### 2. Lambda Particle Table (`mcpart_lambda*.csv`)
Contains detailed information about Lambda particles found in each event:
- **One event can have multiple Lambda particles** (one-to-many relationship)
- Each row represents one Lambda particle
- Includes particle properties: momentum, position, decay products, etc.
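
A similar hedged sketch for the Lambda table, again reading just the first matching file; the `lam_px`, `lam_py`, `lam_pz` columns are those listed in the Database Relationship Diagram below:

```python
import glob
import numpy as np
import pandas as pd

# Take the first matching file just to inspect the schema
lambda_df = pd.read_csv(sorted(glob.glob('mcpart_lambda*.csv'))[0])

# Total momentum of each Lambda candidate from its components
lambda_df['lam_p'] = np.sqrt(
    lambda_df['lam_px'] ** 2 + lambda_df['lam_py'] ** 2 + lambda_df['lam_pz'] ** 2
)

# One-to-many: the same evt value may appear on several rows
print(lambda_df[['evt', 'lam_pdg', 'lam_p']].head())
```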


## Database Relationship Diagram

```mermaid
erDiagram
    EVENTS ||--o{ LAMBDA_PARTICLES : "contains"
    EVENTS ||--|| DIS_PARAMETERS : "has"

    EVENTS {
        int evt PK "Event ID (Primary Key)"
    }

    LAMBDA_PARTICLES {
        int evt FK "Event ID (Foreign Key)"
        int lam_id "Lambda particle ID"
        int lam_pdg "Particle type (3122 for Λ⁰)"
        float lam_px "Momentum X"
        float lam_py "Momentum Y"
        float lam_pz "Momentum Z"
        int prot_id "Proton from decay"
        int pimin_id "Pi-minus from decay"
        string file_name "Source file"
    }

    DIS_PARAMETERS {
        int evt FK "Event ID (Foreign Key)"
        float q2 "Momentum transfer squared"
        float xbj "Bjorken x"
        float nu "Energy transfer"
        float w "Invariant mass"
        float y_d "Inelasticity"
        string file_name "Source file"
    }
```

## Combine Multiple Files

**The Key Challenge**: Multiple Files = Broken Relationships

When we have multiple CSV files from different runs or datasets, each file starts its event numbering from 0:

```
File 1: evt = [0, 1, 2, 3, 4, ...]
File 2: evt = [0, 1, 2, 3, 4, ...]  ← ID Collision!
File 3: evt = [0, 1, 2, 3, 4, ...]  ← ID Collision!
```

**Problem**: Event 0 from File 1 is completely different from Event 0 from File 2, but they have the same ID!

**Solution**: Global Unique Event IDs

We need to create globally unique event IDs across all files:

```python
import pandas as pd
import glob

def concat_csvs_with_unique_events(files):
    """Load and concatenate CSV files with globally unique event IDs.

    Note: offsets are derived from each table's own `evt` values, so IDs from
    different tables stay aligned only if each file's maximum `evt` is the same
    in both tables; if that is not guaranteed, joining on ['file_name', 'evt']
    is a safer alternative.
    """
    dfs = []
    offset = 0

    for file in files:
        df = pd.read_csv(file)
        df['evt'] = df['evt'] + offset  # Make IDs globally unique
        offset = df['evt'].max() + 1    # Set offset for next file
        dfs.append(df)

    return pd.concat(dfs, ignore_index=True)

# Load both tables with unique event IDs
lambda_df = concat_csvs_with_unique_events(sorted(glob.glob("mcpart_lambda*.csv")))
dis_df = concat_csvs_with_unique_events(sorted(glob.glob("dis_parameters*.csv")))
```

**Result**: Now we have globally unique event IDs:
```
File 1: evt = [0, 1, 2, 3, 4]
File 2: evt = [5, 6, 7, 8, 9]      ← No collision!
File 3: evt = [10, 11, 12, 13, 14] ← No collision!
```
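
As a quick sanity check after concatenation, based on the relationships described above (one DIS row per event, zero or more Lambda rows per event):

```python
# Each event should appear exactly once in the DIS table...
assert dis_df['evt'].is_unique

# ...and every Lambda row should point to an event that exists in the DIS table
assert lambda_df['evt'].isin(dis_df['evt']).all()
```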

## Database Operations in Pandas

Now we can perform standard database operations:

### 1. Inner Join (SQL equivalent: `INNER JOIN`)
Get Lambda particles with their corresponding DIS parameters:

```python
# Join tables on event ID
joined = lambda_df.merge(dis_df, on='evt', how='inner')
print(f"Found {len(joined)} lambda particles with DIS data")
```

### 2. Filter and Join (SQL equivalent: `WHERE` + `JOIN`)
Find Lambda particles in high-Q² events:

```python
# High-Q² events
high_q2_events = dis_df[dis_df['q2'] > 50]

# Lambda particles in those events
high_q2_lambdas = lambda_df.merge(high_q2_events[['evt', 'q2']], on='evt')
print(f"Found {len(high_q2_lambdas)} lambdas in high-Q² events")
```

### 3. Aggregation (SQL equivalent: `GROUP BY`)
Count Lambda particles per event:

```python
lambdas_per_event = lambda_df.groupby('evt').size()
print(f"Average lambdas per event: {lambdas_per_event.mean():.2f}")
```

## Real-World Example: Physics Analysis

Let's analyze Lambda particle production in different kinematic regions:

```python
# Join lambda and DIS data
physics_data = lambda_df.merge(dis_df, on='evt', how='inner')

# Define kinematic regions
low_x = physics_data[physics_data['xbj'] < 0.1]
high_x = physics_data[physics_data['xbj'] > 0.3]

print("Lambda production rates:")
print(f"Low-x region (x < 0.1): {len(low_x)} lambdas in {low_x['evt'].nunique()} events")
print(f"High-x region (x > 0.3): {len(high_x)} lambdas in {high_x['evt'].nunique()} events")

# Calculate production rates
low_x_rate = len(low_x) / low_x['evt'].nunique()
high_x_rate = len(high_x) / high_x['evt'].nunique()
print(f"Production rates: {low_x_rate:.3f} vs {high_x_rate:.3f} lambdas/event")
```

## Key Database Concepts Applied

| Database Concept | Our Implementation | Example |
|------------------|-------------------|---------|
| **Primary Key** | `evt` column | Unique identifier for each event |
| **Foreign Key** | `evt` in both tables | Links lambda particles to their events |
| **One-to-Many** | Event → Lambda particles | One event can have 0, 1, or many lambdas |
| **One-to-One** | Event → DIS parameters | Each event has exactly one set of DIS data |
| **JOIN** | `pandas.merge()` | Combine related data from both tables |
| **Index** | Setting `evt` as index | Fast lookups and joins |
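
As a minimal sketch of the last row, setting `evt` as the index enables fast lookups and index-based joins (event 0 below is just an example ID):

```python
# Index the DIS table by event ID for fast lookups
dis_indexed = dis_df.set_index('evt')

# Direct lookup of a single event's kinematics
print(dis_indexed.loc[0])

# Index-based join: attach DIS columns to every Lambda row
lambda_with_dis = lambda_df.join(dis_indexed, on='evt', rsuffix='_dis')
```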

## Best Practices

1. **Always ensure unique IDs** when combining multiple files
2. **Keep original file information** for traceability
3. **Validate relationships** after joining (check for missing data)
4. **Use appropriate join types** (see the sketch after this list):
   - `inner`: Only events with both lambda and DIS data
   - `left`: All lambda particles, even if no DIS data
   - `outer`: All events from both tables
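
A minimal sketch of points 3 and 4, reusing `lambda_df` and `dis_df` from above:

```python
# Left join keeps every Lambda row, even when an event has no DIS entry
lambdas_all = lambda_df.merge(dis_df, on='evt', how='left')

# Validate the relationship: rows with missing DIS data show up as NaN
missing_dis = lambdas_all['q2'].isna().sum()
print(f"{missing_dis} lambda rows have no matching DIS parameters")
```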

## Summary

Thinking of CSV files as database tables helps organize complex particle physics analyses:

- **CSV files** = Database tables
- **evt column** = Primary/Foreign key linking related data
- **pandas operations** = SQL queries
- **Global unique IDs** = Solution for multi-file datasets

This approach scales well from small analyses to large datasets with millions of events across hundreds of files!