Commit 69dba8f

committed
CSV updates
1 parent fe3bbc6 commit 69dba8f

1 file changed

+114
-155
lines changed

docs/data-csv.md

Lines changed: 114 additions & 155 deletions
@@ -1,43 +1,53 @@
11
# CSV Data
22

3+
We provide the relevant parts of the `*.edm4eic.root` data converted to CSV format.
34

5+
- The CSV files are located in the same place as the `*.edm4eic.root` files
6+
- File names correspond to each other. E.g. `k_lambda_5x41_5000evt_001.*`
7+
- Access to the CSV files is the same. See [DATA ACCESS](data) page
8+
- CSV table names are embedded in the file name, right before the `.csv` extension,
9+
e.g. `*.mcdis.csv`, `*.mcpart_lambda.csv`
10+
- Column names are listed in the first line of the file (standard for CSV)
411

5-
The CSV (Comma-Separated Values) format is exceptionally convenient for data processing.
6-
It is simple, yet processed efficiently, supported by many analysis and introspection tools,
7-
and is human-readable even without them.
8-
However, CSV expects a fixed number of columns while our data is often presented hierarchically,
9-
resembling a tree: events contain particles, which have hypotheses corresponding to tracks,
10-
which in turn are composed of clusters composed of hits.
12+
Example file names:
1113

12-
Consequently, we cannot directly convert our ROOT files to CSV.
13-
Instead, we first process, refine and flatten this hierarchical data structure
14-
to get something more simple and more table-wise. And work pleasantly.
14+
```bash
15+
# Original file
16+
k_lambda_5x41_5000evt_001.edm4eic.root
17+
18+
# Related CSV-s
19+
k_lambda_5x41_5000evt_001.mcdis.csv
20+
k_lambda_5x41_5000evt_001.mcpart_lambda.csv
21+
```
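
The naming rule can be sketched in Python. This is only an illustration: the helper name `csv_names_for` and its `tables` default are hypothetical, and the two table names are the ones listed on this page.

```python
from pathlib import Path


def csv_names_for(root_file, tables=("mcdis", "mcpart_lambda")):
    """Return the CSV file names that correspond to one *.edm4eic.root file.

    Hypothetical helper: it only applies the naming convention described
    above and does not check that the files exist.
    """
    name = Path(root_file).name
    suffix = ".edm4eic.root"
    if not name.lower().endswith(suffix):
        raise ValueError(f"not an edm4eic ROOT file name: {name}")
    stem = name[: -len(suffix)]
    # The table name sits between the common stem and the .csv extension
    return [f"{stem}.{table}.csv" for table in tables]


csv_names = csv_names_for("k_lambda_5x41_5000evt_001.edm4eic.root")
print(csv_names)
```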
1522

16-
## Introduction: From CSV Files as a Database
1723

24+
25+
## Table definitions
26+
1827
For analyzing data, we can work with multiple CSV files that contain related information.
19-
For example a CSV file containing MC level event information (xBj, Q2, -t), another table
20-
containing reconstructed level information, and table representing lambda decay information linked
21-
together by event numbers.
28+
The files are linked relationally. The first column or columns of a CSV table always form
29+
a primary key (e.g. event number) or a composite key (e.g. event number + particle index).
30+
For example, all data related to e.g. `k_lambda_5x41_5000evt_001.*`
31+
will refer to the same events.
2232

2333
```mermaid
2434
erDiagram
2535
MC_Events {
2636
int event_id PK "Event Number"
2737
float xBj "True x"
2838
float Q2 "True Q2"
29-
float t "True t"
39+
float etc "True values"
3040
}
3141
Reconstructed_Events {
3242
int event_id PK "Event Number"
33-
float xBj "Reco x"
34-
float Q2 "Reco Q2"
35-
float t "Reco -t"
43+
float xBj "Reconstructed x"
44+
float Q2 "Reconstructed Q2"
45+
float etc "Reconstructed values"
3646
}
3747
Lambda_Decays {
3848
int event_id FK "Event Number"
3949
int lambda_id PK "Lambda Number"
40-
float momentum "Lambda reco data"
50+
float info "Lambda reco data"
4151
}
4252
4353
MC_Events ||--|| Reconstructed_Events : "links to"
@@ -49,74 +59,110 @@ and understanding this relationship helps us organize and analyze data more effe
4959
With Python and pandas it is easy to organize them into joined tables like
5060
`MCvsReconstructed events`
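
A minimal sketch of such a join with `pandas.merge`. The two small DataFrames stand in for tables read from `*.mcdis.csv` and `*.mcpart_lambda.csv`; their values are made up for illustration, and only a few of the real columns are shown.

```python
import pandas as pd

# Stand-ins for pd.read_csv("...mcdis.csv") and pd.read_csv("...mcpart_lambda.csv");
# the numbers below are illustrative, not real data.
mcdis = pd.DataFrame({
    "evt": [0, 1, 2],
    "q2": [1.5, 12.0, 3.2],
    "xbj": [0.01, 0.2, 0.05],
})
mcpart_lambda = pd.DataFrame({
    "evt": [0, 1, 2],
    "lam_pz": [38.1, 40.2, 35.7],
    "lam_nd": [2, 0, 2],
})

# Join on the shared event key; validate= makes pandas raise an error
# if `evt` is not unique in the event-level (left) table.
joined = mcdis.merge(mcpart_lambda, on="evt", how="inner", validate="one_to_many")
print(joined)
```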
5161

52-
## Meson Structure data tables
53-
54-
### 2. MC DIS Parameters Table (`dis_parameters*.csv`)
55-
Contains Deep Inelastic Scattering parameters for each event:
56-
- **One event has exactly one set of DIS parameters** (one-to-one relationship)
57-
- Each row represents one complete event
58-
- Includes kinematic variables: Q², x_Bjorken, energy, etc.
62+
## mcdis
5963

6064

65+
Files: `*.mcdis.csv`
6166

62-
### 1. Lambda Particle Table (`mcpart_lambda*.csv`)
63-
Contains detailed information about Lambda particles found in each event:
64-
- **One event can have multiple Lambda particles** (one-to-many relationship)
65-
- Each row represents one Lambda particle
66-
- Includes particle properties: momentum, position, decay products, etc.
67+
True event level values that come from the event generator.
68+
`evt` - event id in file; the rest of the names correspond to the table:
69+
[mc-variables](http://localhost:5173/meson-structure/mc-variables.html)
6770

71+
Columns:
6872

69-
## Database Relationship Diagram
73+
```
74+
evt
75+
alphas
76+
mx2
77+
nu
78+
p_rt
79+
pdrest
80+
pperps
81+
pperpz
82+
q2
83+
s_e,s_q
84+
tempvar
85+
tprime
86+
tspectator
87+
twopdotk
88+
twopdotq
89+
w
90+
x_d
91+
xbj
92+
y_d
93+
yplus
94+
```
7095

71-
```mermaid
72-
erDiagram
73-
EVENTS ||--o{ LAMBDA_PARTICLES : "contains"
74-
EVENTS ||--|| DIS_PARAMETERS : "has"
75-
76-
EVENTS {
77-
int evt PK "Event ID (Primary Key)"
78-
}
79-
80-
LAMBDA_PARTICLES {
81-
int evt FK "Event ID (Foreign Key)"
82-
int lam_id "Lambda particle ID"
83-
int lam_pdg "Particle type (3122 for Λ⁰)"
84-
float lam_px "Momentum X"
85-
float lam_py "Momentum Y"
86-
float lam_pz "Momentum Z"
87-
int prot_id "Proton from decay"
88-
int pimin_id "Pi-minus from decay"
89-
string file_name "Source file"
90-
}
91-
92-
DIS_PARAMETERS {
93-
int evt FK "Event ID (Foreign Key)"
94-
float q2 "Momentum transfer squared"
95-
float xbj "Bjorken x"
96-
float nu "Energy transfer"
97-
float w "Invariant mass"
98-
float y_d "Inelasticity"
99-
string file_name "Source file"
100-
}
96+
## mcpart_lambda
97+
98+
Files: `*.mcpart_lambda.csv`
99+
100+
Full-chain lambda decays, built from the `MCParticles` EDM4EIC table.
101+
MCParticles has relations like daughters and parents. Those relations are
102+
flattened for lambda decays. The columns representing possible lambda decays are grouped by particle:
103+
104+
Prefixes (each is followed by the same set of parameters):
105+
106+
1. `lam` - Λ
107+
1. `prot` - p (if pπ- decay or nulls)
108+
1. `pimin` - π- (if pπ- decay or nulls)
109+
1. `neut` - Neutron (if n π0 decay)
110+
1. `pizero` - π0 (if n π0 decay)
111+
1. `gamone` - γ one from π0 decay (if pi0 decays)
112+
1. `gamtwo` - γ two from π0 decay (if pi0 decays)
113+
114+
For each particle prefix, the following columns are saved:
115+
116+
01. `{0}_id` - id - particle index in MCParticles table
117+
02. `{0}_pdg` - pdg - particle PDG
118+
03. `{0}_gen` - gen - Generator Status (1 stable... probably)
119+
04. `{0}_sim` - sim - Simulation Status (by Geant4)
120+
05. `{0}_px` - px - Momentum
121+
06. `{0}_py` - py
122+
07. `{0}_pz` - pz
123+
08. `{0}_vx` - vx - Origin vertex information
124+
09. `{0}_vy` - vy
125+
10. `{0}_vz` - vz
126+
11. `{0}_epx` - epx - End Point (decay, or out of detector)
127+
12. `{0}_epy` - epy
128+
13. `{0}_epz` - epz
129+
14. `{0}_time` - time - Time of origin
130+
15. `{0}_nd` - nd - Number of daughters
131+
132+
So in the end the columns are:
133+
134+
```yaml
135+
evt,
136+
lam_id,lam_pdg,lam_gen,lam_sim,lam_px,lam_py,lam_pz,lam_vx,lam_vy,lam_vz,lam_epx,lam_epy,lam_epz,lam_time,lam_nd,
137+
prot_id,prot_pdg,prot_gen,prot_sim,prot_px,prot_py,prot_pz,prot_vx,prot_vy,prot_vz,prot_epx,prot_epy,prot_epz,prot_time,prot_nd,
138+
pimin_id,pimin_pdg,pimin_gen,pimin_sim,pimin_px,pimin_py,pimin_pz,pimin_vx,pimin_vy,pimin_vz,pimin_epx,pimin_epy,pimin_epz,pimin_time,pimin_nd,
139+
neut_id,neut_pdg,neut_gen,neut_sim,neut_px,neut_py,neut_pz,neut_vx,neut_vy,neut_vz,neut_epx,neut_epy,neut_epz,neut_time,neut_nd,
140+
pizero_id,pizero_pdg,pizero_gen,pizero_sim,pizero_px,pizero_py,pizero_pz,pizero_vx,pizero_vy,pizero_vz,pizero_epx,pizero_epy,pizero_epz,pizero_time,pizero_nd,
141+
gamone_id,gamone_pdg,gamone_gen,gamone_sim,gamone_px,gamone_py,gamone_pz,gamone_vx,gamone_vy,gamone_vz,gamone_epx,gamone_epy,gamone_epz,gamone_time,gamone_nd,
142+
gamtwo_id,gamtwo_pdg,gamtwo_gen,gamtwo_sim,gamtwo_px,gamtwo_py,gamtwo_pz,gamtwo_vx,gamtwo_vy,gamtwo_vz,gamtwo_epx,gamtwo_epy,gamtwo_epz,gamtwo_time,gamtwo_nd
101143
```
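
The full header follows mechanically from the prefix list and the per-particle parameter list. A short sketch that rebuilds it, useful for example to sanity-check a file's header (the variable names are illustrative):

```python
# Particle prefixes and per-particle parameters, in the order listed above
PREFIXES = ["lam", "prot", "pimin", "neut", "pizero", "gamone", "gamtwo"]
FIELDS = [
    "id", "pdg", "gen", "sim",
    "px", "py", "pz",
    "vx", "vy", "vz",
    "epx", "epy", "epz",
    "time", "nd",
]

# `evt` comes first, then every prefix combined with every parameter
columns = ["evt"] + [f"{prefix}_{field}" for prefix in PREFIXES for field in FIELDS]

print(len(columns))  # 1 + 7 * 15 = 106
print(columns[:4])
```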
102144

103-
## Combine Multiple Files
145+
Notes:
146+
147+
- Particles may not decay. E.g. a Lambda may simply leave the designated detector volume,
148+
in this case `lam_nd` (number of daughters) will be 0 and the rest of the columns will be null
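
A small sketch of how these conventions can be used to separate decay modes in pandas. The DataFrame below is a made-up stand-in for a table read from `*.mcpart_lambda.csv`, with only four of its columns; it assumes that null values are read as NaN, which is the `pandas.read_csv` default for empty fields.

```python
import pandas as pd

nan = float("nan")

# Illustrative stand-in for pd.read_csv("...mcpart_lambda.csv")
lambdas = pd.DataFrame({
    "evt": [0, 1, 2],
    "lam_nd": [2, 0, 2],
    "prot_id": [5, nan, nan],   # filled only for the p pi- decay
    "neut_id": [nan, nan, 7],   # filled only for the n pi0 decay
})

# Lambdas that did not decay inside the detector volume
undecayed = lambdas[lambdas["lam_nd"] == 0]

# Decay modes, selected by which daughter columns are filled
p_pi_minus = lambdas[lambdas["prot_id"].notna()]
n_pi_zero = lambdas[lambdas["neut_id"].notna()]

print(len(undecayed), len(p_pi_minus), len(n_pi_zero))
```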
104149

105-
The Key Challenge: Multiple Files = Broken Relationships
106150

107-
When we have multiple CSV files from different runs or datasets, each file starts its event numbering from 0:
151+
## Combine Multiple Files
152+
153+
When we have multiple CSV files from different runs or datasets,
154+
each file starts its event numbering from 0:
108155

109156
```
110157
File 1: evt = [0, 1, 2, 3, 4, ...]
111158
File 2: evt = [0, 1, 2, 3, 4, ...] ← ID Collision!
112159
File 3: evt = [0, 1, 2, 3, 4, ...] ← ID Collision!
113160
```
114161

115-
**Problem**: Event 0 from File 1 is completely different from Event 0 from File 2, but they have the same ID!
116-
117-
**Solution**: Global Unique Event IDs
162+
**Problem**: Event 0 from File 1 is completely different from Event 0 from File 2,
163+
but they have the same ID if read into pandas directly!
118164

119-
We need to create globally unique event IDs across all files:
165+
Use a function like this to read multiple files into one DataFrame:
120166

121167
```python
122168
import pandas as pd
@@ -146,90 +192,3 @@ File 1: evt = [0, 1, 2, 3, 4]
146192
File 2: evt = [5, 6, 7, 8, 9] ← No collision!
147193
File 3: evt = [10, 11, 12, 13, 14] ← No collision!
148194
```
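
For illustration, a minimal sketch of one way to obtain such non-colliding IDs. It assumes every file numbers its events from 0 and shifts each file by the running maximum; the function name `concat_with_unique_evt` is hypothetical and is not necessarily the function used in this documentation.

```python
import pandas as pd


def concat_with_unique_evt(frames, key="evt"):
    """Concatenate per-file tables, shifting `key` so it is unique overall.

    Hypothetical helper: each frame is offset by (largest ID seen so far + 1).
    """
    shifted = []
    offset = 0
    for frame in frames:
        frame = frame.copy()          # do not modify the caller's data
        frame[key] = frame[key] + offset
        shifted.append(frame)
        if not frame.empty:
            offset = int(frame[key].max()) + 1
    return pd.concat(shifted, ignore_index=True)


# Three stand-in "files", each with evt = 0..4
files = [pd.DataFrame({"evt": range(5)}) for _ in range(3)]
combined = concat_with_unique_evt(files)
print(combined["evt"].tolist())
```

When several linked tables (e.g. `mcdis` and `mcpart_lambda`) are combined this way, the same per-file offset has to be applied to each table, otherwise the `evt` keys stop matching across tables.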
149-
150-
## Database Operations in Pandas
151-
152-
Now we can perform standard database operations:
153-
154-
### 1. Inner Join (SQL equivalent: `INNER JOIN`)
155-
Get Lambda particles with their corresponding DIS parameters:
156-
157-
```python
158-
# Join tables on event ID
159-
joined = lambda_df.merge(dis_df, on='evt', how='inner')
160-
print(f"Found {len(joined)} lambda particles with DIS data")
161-
```
162-
163-
### 2. Filter and Join (SQL equivalent: `WHERE` + `JOIN`)
164-
Find Lambda particles in high-Q² events:
165-
166-
```python
167-
# High-Q² events
168-
high_q2_events = dis_df[dis_df['q2'] > 50]
169-
170-
# Lambda particles in those events
171-
high_q2_lambdas = lambda_df.merge(high_q2_events[['evt', 'q2']], on='evt')
172-
print(f"Found {len(high_q2_lambdas)} lambdas in high-Q² events")
173-
```
174-
175-
### 3. Aggregation (SQL equivalent: `GROUP BY`)
176-
Count Lambda particles per event:
177-
178-
```python
179-
lambdas_per_event = lambda_df.groupby('evt').size()
180-
print(f"Average lambdas per event: {lambdas_per_event.mean():.2f}")
181-
```
182-
183-
## Real-World Example: Physics Analysis
184-
185-
Let's analyze Lambda particle production in different kinematic regions:
186-
187-
```python
188-
# Join lambda and DIS data
189-
physics_data = lambda_df.merge(dis_df, on='evt', how='inner')
190-
191-
# Define kinematic regions
192-
low_x = physics_data[physics_data['xbj'] < 0.1]
193-
high_x = physics_data[physics_data['xbj'] > 0.3]
194-
195-
print("Lambda production rates:")
196-
print(f"Low-x region (x < 0.1): {len(low_x)} lambdas in {low_x['evt'].nunique()} events")
197-
print(f"High-x region (x > 0.3): {len(high_x)} lambdas in {high_x['evt'].nunique()} events")
198-
199-
# Calculate production rates
200-
low_x_rate = len(low_x) / low_x['evt'].nunique()
201-
high_x_rate = len(high_x) / high_x['evt'].nunique()
202-
print(f"Production rates: {low_x_rate:.3f} vs {high_x_rate:.3f} lambdas/event")
203-
```
204-
205-
## Key Database Concepts Applied
206-
207-
| Database Concept | Our Implementation | Example |
208-
|------------------|-------------------|---------|
209-
| **Primary Key** | `evt` column | Unique identifier for each event |
210-
| **Foreign Key** | `evt` in both tables | Links lambda particles to their events |
211-
| **One-to-Many** | Event → Lambda particles | One event can have 0, 1, or many lambdas |
212-
| **One-to-One** | Event → DIS parameters | Each event has exactly one set of DIS data |
213-
| **JOIN** | `pandas.merge()` | Combine related data from both tables |
214-
| **Index** | Setting `evt` as index | Fast lookups and joins |
215-
216-
## Best Practices
217-
218-
1. **Always ensure unique IDs** when combining multiple files
219-
2. **Keep original file information** for traceability
220-
3. **Validate relationships** after joining (check for missing data)
221-
4. **Use appropriate join types**:
222-
- `inner`: Only events with both lambda and DIS data
223-
- `left`: All lambda particles, even if no DIS data
224-
- `outer`: All events from both tables
225-
226-
## Summary
227-
228-
Thinking of CSV files as database tables helps organize complex particle physics analyses:
229-
230-
- **CSV files** = Database tables
231-
- **evt column** = Primary/Foreign key linking related data
232-
- **pandas operations** = SQL queries
233-
- **Global unique IDs** = Solution for multi-file datasets
234-
235-
This approach scales well from small analyses to large datasets with millions of events across hundreds of files!
