# CSV Data

We provide the relevant part of `*.edm4eic.root` data converted to the CSV format:

- The CSV files are located in the same place as the `*.edm4eic.root` files
- File names correspond to each other, e.g. `k_lambda_5x41_5000evt_001.*`
- Access to the CSV files is the same. See the [DATA ACCESS](data) page
- CSV table names are embedded in the extension before `.csv`,
  e.g. `*.mcdis.csv`, `*.mcpart_lambda.csv`
- Column names are listed in the first line of the file (standard for CSV)

Example file names:

```bash
# Original file
k_lambda_5x41_5000evt_001.edm4eic.root

# Related CSV files
k_lambda_5x41_5000evt_001.mcdis.csv
k_lambda_5x41_5000evt_001.mcpart_lambda.csv
```


## Table definitions

For analyzing data, we can work with multiple CSV files that contain related information.
The files are linked relationally. The first column of a CSV table is always
a primary key (e.g. event number) or a composite key (e.g. event number + particle index).
For example, all data related to `k_lambda_5x41_5000evt_001.*`
refers to the same events.

```mermaid
erDiagram
    MC_Events {
        int event_id PK "Event Number"
        float xBj "True x"
        float Q2 "True Q2"
        float etc "True values"
    }
    Reconstructed_Events {
        int event_id PK "Event Number"
        float xBj "Reconstructed x"
        float Q2 "Reconstructed Q2"
        float etc "Reconstructed values"
    }
    Lambda_Decays {
        int event_id FK "Event Number"
        int lambda_id PK "Lambda Number"
        float info "Lambda reco data"
    }

    MC_Events ||--|| Reconstructed_Events : "links to"
    MC_Events ||--o{ Lambda_Decays : "contains"
```

Understanding this relationship helps us organize and analyze data more effectively.
With Python and pandas it is easy to organize them into joined tables like
`MCvsReconstructed events`.

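As a minimal sketch of such a join (tiny made-up frames stand in for two related CSV tables; the values are illustrative):

```python
import pandas as pd

# Tiny made-up stand-ins for two related CSV tables (values are illustrative)
mc = pd.DataFrame({"evt": [0, 1, 2], "xbj": [0.10, 0.20, 0.30]})
reco = pd.DataFrame({"evt": [0, 1, 2], "xbj": [0.11, 0.19, 0.33]})

# Join on the shared primary key; suffixes distinguish same-named columns
mc_vs_reco = mc.merge(reco, on="evt", suffixes=("_mc", "_reco"))
print(mc_vs_reco.columns.tolist())  # ['evt', 'xbj_mc', 'xbj_reco']
```

The same pattern works for any pair of tables that share the `evt` key.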
## mcdis

Files: `*.mcdis.csv`

True event-level values that come from the event generator.
`evt` is the event id in the file; the rest of the names correspond to the table in
[mc-variables](mc-variables).

Columns:

```
evt
alphas
mx2
nu
p_rt
pdrest
pperps
pperpz
q2
s_e
s_q
tempvar
tprime
tspectator
twopdotk
twopdotq
w
x_d
xbj
y_d
yplus
```
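A quick way to see this layout in pandas: here a small in-memory string stands in for the first lines of a `*.mcdis.csv` file (made-up values, and only a subset of the columns above):

```python
import io
import pandas as pd

# In-memory stand-in for the first lines of a *.mcdis.csv file
# (made-up values; only a subset of the columns listed above)
csv_text = "evt,xbj,q2,w\n0,0.12,4.5,3.1\n1,0.08,2.9,4.0\n"
dis_df = pd.read_csv(io.StringIO(csv_text))

print(dis_df.columns.tolist())  # column names come from the first CSV line
print(dis_df.shape)             # two rows, four columns
```

Reading a real file is the same call with a file path instead of `StringIO`.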
## mcpart_lambda

Files: `*.mcpart_lambda.csv`

Full-chain lambda decays built from the `MCParticles` EDM4EIC table.
`MCParticles` has relations like daughters and parents. Those relations are
flattened for lambda decays. The columns representing possible lambda decays
are grouped by particle:

Prefixes (each has the same set of parameters after it):

1. `lam` - Λ
2. `prot` - p (if pπ⁻ decay, else nulls)
3. `pimin` - π⁻ (if pπ⁻ decay, else nulls)
4. `neut` - neutron (if nπ⁰ decay, else nulls)
5. `pizero` - π⁰ (if nπ⁰ decay, else nulls)
6. `gamone` - first γ from π⁰ decay (if the π⁰ decays)
7. `gamtwo` - second γ from π⁰ decay (if the π⁰ decays)

For each particle prefix, the following columns are saved:

116+ 01 . ` {0}_id ` - id - particle index in MCParticles table
117+ 02 . ` {0}_pdg ` - pdg - particle PDG
118+ 03 . ` {0}_gen ` - gen - Generator Status (1 stable... probably)
119+ 04 . ` {0}_sim ` - sim - Simulation Status (by Geant4)
120+ 05 . ` {0}_px ` - px - Momentum
121+ 06 . ` {0}_py ` - py
122+ 07 . ` {0}_pz ` - pz
123+ 08 . ` {0}_vx ` - vx - Origin vertex information
124+ 09 . ` {0}_vy ` - vy
125+ 10 . ` {0}_vz ` - vz
126+ 11 . ` {0}_epx ` - epx - End Point (decay, or out of detector)
127+ 12 . ` {0}_epy ` - epy
128+ 13 . ` {0}_epz ` - epz
129+ 14 . ` {0}_time ` - time - Time of origin
130+ 15 . ` {0}_nd ` - nd - Number of daughters
131+
So in the end the columns are:

```yaml
evt,
lam_id,lam_pdg,lam_gen,lam_sim,lam_px,lam_py,lam_pz,lam_vx,lam_vy,lam_vz,lam_epx,lam_epy,lam_epz,lam_time,lam_nd,
prot_id,prot_pdg,prot_gen,prot_sim,prot_px,prot_py,prot_pz,prot_vx,prot_vy,prot_vz,prot_epx,prot_epy,prot_epz,prot_time,prot_nd,
pimin_id,pimin_pdg,pimin_gen,pimin_sim,pimin_px,pimin_py,pimin_pz,pimin_vx,pimin_vy,pimin_vz,pimin_epx,pimin_epy,pimin_epz,pimin_time,pimin_nd,
neut_id,neut_pdg,neut_gen,neut_sim,neut_px,neut_py,neut_pz,neut_vx,neut_vy,neut_vz,neut_epx,neut_epy,neut_epz,neut_time,neut_nd,
pizero_id,pizero_pdg,pizero_gen,pizero_sim,pizero_px,pizero_py,pizero_pz,pizero_vx,pizero_vy,pizero_vz,pizero_epx,pizero_epy,pizero_epz,pizero_time,pizero_nd,
gamone_id,gamone_pdg,gamone_gen,gamone_sim,gamone_px,gamone_py,gamone_pz,gamone_vx,gamone_vy,gamone_vz,gamone_epx,gamone_epy,gamone_epz,gamone_time,gamone_nd,
gamtwo_id,gamtwo_pdg,gamtwo_gen,gamtwo_sim,gamtwo_px,gamtwo_py,gamtwo_pz,gamtwo_vx,gamtwo_vy,gamtwo_vz,gamtwo_epx,gamtwo_epy,gamtwo_epz,gamtwo_time,gamtwo_nd
```
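Since every particle prefix carries the same 15 fields, the column names can be generated instead of typed out; a small sketch (the helper name `columns_for` is ours, not part of the data):

```python
# Each particle prefix carries the same 15 fields, so column names
# can be generated from a prefix (helper name is illustrative)
FIELDS = ["id", "pdg", "gen", "sim", "px", "py", "pz",
          "vx", "vy", "vz", "epx", "epy", "epz", "time", "nd"]

def columns_for(prefix):
    return [f"{prefix}_{f}" for f in FIELDS]

print(columns_for("lam")[:3])  # ['lam_id', 'lam_pdg', 'lam_gen']
```

This is handy for selecting one particle's sub-table, e.g. `df[columns_for("prot")]`.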

Notes:

- Particles may not have decayed. E.g. a Lambda may simply exit the designated detector volume;
  in this case `lam_nd` (number of daughters) will be 0 and the rest of the columns will be null.

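Because only one decay branch is filled per row, the channels can be separated by checking which ID columns are non-null; a sketch on made-up rows mimicking this schema:

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the mcpart_lambda schema:
# one p π⁻ decay, one n π⁰ decay, one undecayed lambda
df = pd.DataFrame({
    "evt":     [0, 1, 2],
    "lam_nd":  [2, 2, 0],            # number of daughters
    "prot_id": [5, np.nan, np.nan],  # filled only for p π⁻ decays
    "neut_id": [np.nan, 7, np.nan],  # filled only for n π⁰ decays
})

ppim = df[df["prot_id"].notna()]   # p π⁻ channel
npiz = df[df["neut_id"].notna()]   # n π⁰ channel
escaped = df[df["lam_nd"] == 0]    # lambda left the detector volume

print(len(ppim), len(npiz), len(escaped))  # 1 1 1
```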
## Combine Multiple Files

When we have multiple CSV files from different runs or datasets,
each file starts its event numbering from 0:

```
File 1: evt = [0, 1, 2, 3, 4, ...]
File 2: evt = [0, 1, 2, 3, 4, ...] ← ID collision!
File 3: evt = [0, 1, 2, 3, 4, ...] ← ID collision!
```

**Problem**: Event 0 from File 1 is completely different from Event 0 from File 2,
but they get the same ID if read into pandas directly!

Use a function like this to read multiple files into one DataFrame:

```python
import pandas as pd

def read_csv_files(file_names):
    # (reconstructed sketch; the original function body was elided here)
    frames = []
    offset = 0
    for name in file_names:
        df = pd.read_csv(name)
        df["evt"] += offset              # shift local event IDs
        df["file_name"] = name           # keep source file for traceability
        offset = df["evt"].max() + 1     # next file continues after the last event
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

After combining, the event IDs no longer collide:

```
File 1: evt = [0, 1, 2, 3, 4]
File 2: evt = [5, 6, 7, 8, 9] ← No collision!
File 3: evt = [10, 11, 12, 13, 14] ← No collision!
```
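The running-offset idea can be verified on synthetic tables (the frame names here are illustrative, standing in for files read from disk):

```python
import pandas as pd

# Two synthetic "files", each numbering its events from 0
file1 = pd.DataFrame({"evt": [0, 1, 2], "q2": [1.0, 2.0, 3.0]})
file2 = pd.DataFrame({"evt": [0, 1, 2], "q2": [4.0, 5.0, 6.0]})

frames, offset = [], 0
for df in (file1, file2):
    df = df.copy()
    df["evt"] += offset              # shift local IDs into the global range
    offset = df["evt"].max() + 1     # next file continues after the last event
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined["evt"].tolist())  # [0, 1, 2, 3, 4, 5]
```

With unique `evt` values, joins across the combined tables behave exactly as in the single-file case.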