
Commit fe3bbc6

Website update
1 parent 4766e9f commit fe3bbc6

File tree: 3 files changed, +245 −4 lines


docs/.vitepress/config.mts

Lines changed: 6 additions & 4 deletions
```diff
@@ -49,9 +49,11 @@ export default withMermaid({
         link: '/data',
         items: [
           { text: 'Data Access', link: '/data' },
+          { text: 'CSV Data', link: '/data-csv' },
           { text: 'MC Variables', link: '/mc-variables' },
           { text: 'EDM4EIC Tree', link: '/edm4eic-tree' },
           { text: 'EDM4EIC Diagram', link: '/edm4eic-diagram' },
+
           { text: 'Analysis', link: '/analysis' },

         ]
@@ -61,10 +63,10 @@ export default withMermaid({
         link: '/tutorials',
         items: [
           { text: 'Overview', link: '/tutorials' },
-          { text: 'py-edm4eic-01 Uproot', link: '/tutorials/01_using_uproot' },
-          { text: 'py-edm4eic-02 Metadata', link: '/tutorials/02_metadata' },
-          { text: 'py-edm4eic-03 References', link: '/tutorials/03_references' },
-          { text: 'cpp1 EDM4EIC', link: '/tutorials/cpp01_edm4eic' },
+          { text: '01 Python CSV', link: '/tutorials/py-csv' },
+          { text: '02 Python EDM4EIC', link: '/tutorials/py-edm4eic-uproot' },
+          { text: '03 C++ EDM4EIC', link: '/tutorials/cpp-edm4eic' },
+
         ]
       },
       {
```
File renamed without changes.

docs/tutorials/py-csv.md

Lines changed: 239 additions & 0 deletions
# Processing CSV with Python

Tutorials on processing CSV files using common Python data science tools.

## CSV for analysis

The CSV (Comma-Separated Values) format is exceptionally convenient for data processing.
It is simple yet efficient to process, is supported by many analysis and introspection tools,
and is human-readable even without them.
However, CSV expects a fixed number of columns, while our data is often hierarchical,
resembling a tree: events contain particles, which have hypotheses corresponding to tracks,
which in turn are composed of clusters built from hits.

Consequently, we cannot directly convert our EDM4EIC ROOT files to CSV.
Instead, we first process, refine, and flatten this hierarchical data structure
into something simpler and more table-like that is pleasant to work with.
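To make the flattening step concrete, here is a minimal sketch (the nested structure and field names are hypothetical, not the actual EDM4EIC layout) of turning event → particle hierarchies into flat rows keyed by an event column:

```python
import pandas as pd

# Hypothetical nested events: each event holds a variable number of particles
events = [
    {"evt": 0, "particles": [{"pdg": 3122, "px": 0.1}, {"pdg": 2212, "px": 0.4}]},
    {"evt": 1, "particles": [{"pdg": 3122, "px": 0.2}]},
]

# Flatten: one row per particle, repeating the event ID on every row
rows = [{"evt": e["evt"], **p} for e in events for p in e["particles"]]
flat = pd.DataFrame(rows)
flat.to_csv("particles.csv", index=False)  # fixed columns: evt, pdg, px
```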
## Introduction: CSV Files as a Database

For data analysis, we can work with multiple CSV files that contain related information.
For example, one CSV file may contain MC-level event information (xBj, Q², -t),
another the reconstructed-level information, and a third the Lambda decay information,
all linked together by event numbers.

```mermaid
erDiagram
    MC_Events {
        int event_id PK "Event Number"
        float xBj "True x"
        float Q2 "True Q2"
        float t "True t"
    }
    Reconstructed_Events {
        int event_id PK "Event Number"
        float xBj "Reco x"
        float Q2 "Reco Q2"
        float t "Reco -t"
    }
    Lambda_Decays {
        int event_id FK "Event Number"
        int lambda_id PK "Lambda Number"
        float momentum "Lambda reco data"
    }

    MC_Events ||--|| Reconstructed_Events : "links to"
    MC_Events ||--o{ Lambda_Decays : "links to"
```
These CSV files are essentially **database tables**,
and understanding this relationship helps us organize and analyze data more effectively.

With Python and pandas, it is easy to combine them into joined tables such as
`MC vs Reconstructed` events.

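For instance, a minimal sketch of such a join (with hypothetical column names; later sections use the real tables) puts MC and reconstructed values side by side:

```python
import pandas as pd

# Hypothetical MC and reconstructed tables linked by event_id
mc_df = pd.DataFrame({"event_id": [0, 1], "Q2": [1.2, 3.4]})
reco_df = pd.DataFrame({"event_id": [0, 1], "Q2": [1.1, 3.6]})

# Suffixes keep both versions of each column side by side
mc_vs_reco = mc_df.merge(reco_df, on="event_id", suffixes=("_mc", "_reco"))
mc_vs_reco["Q2_diff"] = mc_vs_reco["Q2_reco"] - mc_vs_reco["Q2_mc"]
print(mc_vs_reco)
```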
## Meson Structure data tables

### 1. MC DIS Parameters Table (`dis_parameters*.csv`)

Contains Deep Inelastic Scattering parameters for each event:

- **One event has exactly one set of DIS parameters** (one-to-one relationship)
- Each row represents one complete event
- Includes kinematic variables: Q², x_Bjorken, energy, etc.

### 2. Lambda Particle Table (`mcpart_lambda*.csv`)

Contains detailed information about Lambda particles found in each event:

- **One event can have multiple Lambda particles** (one-to-many relationship)
- Each row represents one Lambda particle
- Includes particle properties: momentum, position, decay products, etc.

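A quick way to see what each table actually contains (a sketch; the concrete file name is hypothetical) is to load one file and inspect its columns:

```python
import pandas as pd

# Hypothetical concrete file name matching the dis_parameters*.csv pattern
dis_df = pd.read_csv("dis_parameters_run0.csv")
print(dis_df.columns.tolist())  # expected fields like evt, q2, xbj, nu, w, y_d
print(dis_df.head())
```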
## Database Relationship Diagram

```mermaid
erDiagram
    EVENTS ||--o{ LAMBDA_PARTICLES : "contains"
    EVENTS ||--|| DIS_PARAMETERS : "has"

    EVENTS {
        int evt PK "Event ID (Primary Key)"
    }

    LAMBDA_PARTICLES {
        int evt FK "Event ID (Foreign Key)"
        int lam_id "Lambda particle ID"
        int lam_pdg "Particle type (3122 for Λ⁰)"
        float lam_px "Momentum X"
        float lam_py "Momentum Y"
        float lam_pz "Momentum Z"
        int prot_id "Proton from decay"
        int pimin_id "Pi-minus from decay"
        string file_name "Source file"
    }

    DIS_PARAMETERS {
        int evt FK "Event ID (Foreign Key)"
        float q2 "Momentum transfer squared"
        float xbj "Bjorken x"
        float nu "Energy transfer"
        float w "Invariant mass"
        float y_d "Inelasticity"
        string file_name "Source file"
    }
```

## Combine Multiple Files

**The Key Challenge: Multiple Files = Broken Relationships**

When we have multiple CSV files from different runs or datasets, each file starts its event numbering from 0:

```
File 1: evt = [0, 1, 2, 3, 4, ...]
File 2: evt = [0, 1, 2, 3, 4, ...]  ← ID Collision!
File 3: evt = [0, 1, 2, 3, 4, ...]  ← ID Collision!
```

**Problem**: Event 0 from File 1 is a completely different event than Event 0 from File 2, yet they share the same ID!

**Solution**: Global Unique Event IDs

We need to create globally unique event IDs across all files:

```python
import pandas as pd
import glob

def concat_csvs_with_unique_events(files):
    """Load and concatenate CSV files with globally unique event IDs"""
    dfs = []
    offset = 0

    for file in files:
        df = pd.read_csv(file)
        df['evt'] = df['evt'] + offset  # Make IDs globally unique
        offset = df['evt'].max() + 1    # Set offset for the next file
        dfs.append(df)

    return pd.concat(dfs, ignore_index=True)

# Load both tables with unique event IDs
lambda_df = concat_csvs_with_unique_events(sorted(glob.glob("mcpart_lambda*.csv")))
dis_df = concat_csvs_with_unique_events(sorted(glob.glob("dis_parameters*.csv")))
```

**Result**: Now we have globally unique event IDs:

```
File 1: evt = [0, 1, 2, 3, 4]
File 2: evt = [5, 6, 7, 8, 9]      ← No collision!
File 3: evt = [10, 11, 12, 13, 14] ← No collision!
```

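One caveat worth noting (our addition, not part of the original recipe): offsetting each table independently only keeps the `evt` keys aligned across tables if, for every file, both tables end at the same maximum event ID. If a lambda file happens to end at a lower `evt` than its DIS counterpart, the offsets drift apart and later joins silently mismatch. A more defensive sketch, assuming the two file lists pair up one-to-one in sorted order, derives a single per-file offset from the DIS table (which has exactly one row per event) and applies it to both:

```python
import pandas as pd
import glob

def concat_paired_csvs(lambda_files, dis_files):
    """Concatenate paired lambda/DIS files using one shared offset per file."""
    lambda_dfs, dis_dfs = [], []
    offset = 0

    for lam_file, dis_file in zip(sorted(lambda_files), sorted(dis_files)):
        lam_df = pd.read_csv(lam_file)
        dis_df = pd.read_csv(dis_file)
        lam_df['evt'] += offset
        dis_df['evt'] += offset
        # The DIS table has one row per event, so it defines the next offset
        offset = dis_df['evt'].max() + 1
        lambda_dfs.append(lam_df)
        dis_dfs.append(dis_df)

    return (pd.concat(lambda_dfs, ignore_index=True),
            pd.concat(dis_dfs, ignore_index=True))

lambda_df, dis_df = concat_paired_csvs(glob.glob("mcpart_lambda*.csv"),
                                       glob.glob("dis_parameters*.csv"))
```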
## Database Operations in Pandas

Now we can perform standard database operations:

### 1. Inner Join (SQL equivalent: `INNER JOIN`)

Get Lambda particles with their corresponding DIS parameters:

```python
# Join tables on event ID
joined = lambda_df.merge(dis_df, on='evt', how='inner')
print(f"Found {len(joined)} lambda particles with DIS data")
```

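To validate the relationship after joining (see Best Practices below), pandas' `merge` accepts an `indicator` flag; a minimal sketch:

```python
# Outer-merge with an indicator column to spot unmatched rows on either side
check = lambda_df.merge(dis_df, on='evt', how='outer', indicator=True)
print(check['_merge'].value_counts())  # counts of 'both', 'left_only', 'right_only'
```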
### 2. Filter and Join (SQL equivalent: `WHERE` + `JOIN`)

Find Lambda particles in high-Q² events:

```python
# High-Q² events
high_q2_events = dis_df[dis_df['q2'] > 50]

# Lambda particles in those events
high_q2_lambdas = lambda_df.merge(high_q2_events[['evt', 'q2']], on='evt')
print(f"Found {len(high_q2_lambdas)} lambdas in high-Q² events")
```

### 3. Aggregation (SQL equivalent: `GROUP BY`)

Count Lambda particles per event:

```python
# Note: groupby only sees events that contain at least one lambda
lambdas_per_event = lambda_df.groupby('evt').size()
print(f"Average lambdas per event: {lambdas_per_event.mean():.2f}")
```

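If the average should run over all events, including those with zero lambdas, one sketch reindexes the counts against the DIS table, which has exactly one row per event:

```python
# Fill 0 for events that have DIS parameters but no lambda candidates
counts = (lambda_df.groupby('evt').size()
          .reindex(dis_df['evt'].unique(), fill_value=0))
print(f"Average lambdas per event (all events): {counts.mean():.2f}")
```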
## Real-World Example: Physics Analysis

Let's analyze Lambda particle production in different kinematic regions:

```python
# Join lambda and DIS data
physics_data = lambda_df.merge(dis_df, on='evt', how='inner')

# Define kinematic regions
low_x = physics_data[physics_data['xbj'] < 0.1]
high_x = physics_data[physics_data['xbj'] > 0.3]

print("Lambda production rates:")
print(f"Low-x region (x < 0.1): {len(low_x)} lambdas in {low_x['evt'].nunique()} events")
print(f"High-x region (x > 0.3): {len(high_x)} lambdas in {high_x['evt'].nunique()} events")

# Calculate production rates (per event containing at least one lambda)
low_x_rate = len(low_x) / low_x['evt'].nunique()
high_x_rate = len(high_x) / high_x['evt'].nunique()
print(f"Production rates: {low_x_rate:.3f} vs {high_x_rate:.3f} lambdas/event")
```

## Key Database Concepts Applied

| Database Concept | Our Implementation | Example |
|------------------|--------------------|---------|
| **Primary Key** | `evt` column | Unique identifier for each event |
| **Foreign Key** | `evt` in both tables | Links lambda particles to their events |
| **One-to-Many** | Event → Lambda particles | One event can have 0, 1, or many lambdas |
| **One-to-One** | Event → DIS parameters | Each event has exactly one set of DIS data |
| **JOIN** | `pandas.merge()` | Combine related data from both tables |
| **Index** | Setting `evt` as index | Fast lookups and joins |

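As a small illustration of the last table row, a sketch of index-based lookups (event number 42 is an arbitrary example):

```python
# Index the DIS table by event ID for fast label-based lookups and joins
dis_indexed = dis_df.set_index('evt')
print(dis_indexed.loc[42, 'q2'])  # Q² of event 42
```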
## Best Practices

1. **Always ensure unique IDs** when combining multiple files
2. **Keep original file information** for traceability (see the sketch after this list)
3. **Validate relationships** after joining (check for missing data)
4. **Use appropriate join types**:
   - `inner`: Only events with both lambda and DIS data
   - `left`: All lambda particles, even if no DIS data
   - `outer`: All events from both tables

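For item 2: the tables here already carry a `file_name` column, but if yours do not, a sketch of tagging rows at load time:

```python
import pandas as pd
import glob

def load_with_source(pattern):
    """Concatenate CSVs matching a glob pattern, tagging rows with their source file."""
    dfs = []
    for file in sorted(glob.glob(pattern)):
        df = pd.read_csv(file)
        df['file_name'] = file  # keep provenance for traceability
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
```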
## Summary

Thinking of CSV files as database tables helps organize complex particle physics analyses:

- **CSV files** = Database tables
- **`evt` column** = Primary/Foreign key linking related data
- **pandas operations** = SQL queries
- **Global unique IDs** = Solution for multi-file datasets

This approach scales well from small analyses to large datasets with millions of events across hundreds of files!
