CSV updates

DraTeots · DraTeots · commit 69dba8fca8cc · 2025-06-13T13:57:44.000-04:00
diff --git a/docs/data-csv.md b/docs/data-csv.md
@@ -1,43 +1,53 @@
 # CSV Data
 
+We provide the relevant part `*.EDM4EIC.root` data converted to the CSV format 
 
+- The CVS files are located in the same place as `*.edm4eic.root` files
+- File names correspond to each other. E.g. `k_lambda_5x41_5000evt_001.*`
+- Access to the CSV files is the same. See [DATA ACCESS](data) page
+- CSV table names are embedded in extension before `.csv` , 
+  e.g. `*.mcdis.csv`, `*.mcpart_lambda.csv`
+- Column names are listed in the first line of the file (standard for CSV)
 
-The CSV (Comma-Separated Values) format is exceptionally convenient for data processing. 
-It is simple, yet processed efficiently, supported by many analysis and introspection tools, 
-and is human-readable even without them. 
-However, CSV expects a fixed number of columns while our data is often presented hierarchically, 
-resembling a tree: events contain particles, which have hypotheses corresponding to tracks, 
-which in turn are composed of clusters composed of hits. 
+Example file names: 
 
-Consequently, we cannot directly convert our ROOT files to CSV. 
-Instead, we first process, refine and flatten this hierarchical data structure 
-to get something more simple and more table-wise. And work pleasantly.
+```bash
+# Original file
+k_lambda_5x41_5000evt_001.edm4eic.root
+
+# Related CSV-s
+k_lambda_5x41_5000evt_001.mcdis.csv
+k_lambda_5x41_5000evt_001.mcpart_lambda.csv
+```
 
-## Introduction: From CSV Files as a Database
 
+
+## Table definitions
+ 
 For analyzing data, we can work with multiple CSV files that contain related information.
-For example a CSV file containing MC level event information (xBj, Q2, -t), another table
-containing reconstructed level information, and table representing lambda decay information linked
-together by event numbers. 
+The files are linked relationally. The first columns of a CSV table is always 
+a primary key (e.g. event number). Or a composite key (e.g. event number + particle index). 
+For example, all data related to e.g. `k_lambda_5x41_5000evt_001.*` 
+will refer the same events. 
 
 ```mermaid
 erDiagram
     MC_Events {
         int event_id PK "Event Number"
         float xBj "True x"
         float Q2 "True Q2"
-        float t "True t"
+        float etc "True values"
     }
     Reconstructed_Events {
         int event_id PK "Event Number"
-        float xBj "Reco x"
-        float Q2 "Reco Q2"
-        float t "Reco -t"
+        float xBj "Reconstructed x"
+        float Q2 "Reconstructed Q2"
+        float etc "Reconstructed values"
     }
     Lambda_Decays {
         int event_id FK "Event Number"
         int lambda_id PK "Lambda Number"
-        float momentum "Lambda reco data"
+        float info "Lambda reco data"
     }
 
     MC_Events ||--|| Reconstructed_Events : "links to"
@@ -49,74 +59,110 @@ and understanding this relationship helps us organize and analyze data more effe
 With python and pandas it is easy to organize them joined tables like 
 `MCvsReconstructed events`
 
-## Meson Structure data tables
-
-### 2. MC DIS Parameters Table (`dis_parameters*.csv`)  
-Contains Deep Inelastic Scattering parameters for each event:
-- **One event has exactly one set of DIS parameters** (one-to-one relationship)
-- Each row represents one complete event
-- Includes kinematic variables: Q², x_Bjorken, energy, etc.
+## mcdis
 
 
+Files: `*.mcdis.csv`
 
-### 1. Lambda Particle Table (`mcpart_lambda*.csv`)
-Contains detailed information about Lambda particles found in each event:
-- **One event can have multiple Lambda particles** (one-to-many relationship)
-- Each row represents one Lambda particle
-- Includes particle properties: momentum, position, decay products, etc.
+True event level values that come from the event generator.
+`evt` - evnet id in file, the rest of the names correspond to table: 
+[mc-variables](http://localhost:5173/meson-structure/mc-variables.html)
 
+Columns: 
 
-## Database Relationship Diagram
+```
+evt
+alphas
+mx2
+nu
+p_rt
+pdrest
+pperps
+pperpz
+q2
+s_e,s_q
+tempvar
+tprime
+tspectator
+twopdotk
+twopdotq
+w
+x_d
+xbj
+y_d
+yplus
+```
 
-```mermaid
-erDiagram
-    EVENTS ||--o{ LAMBDA_PARTICLES : "contains"
-    EVENTS ||--|| DIS_PARAMETERS : "has"
-    
-    EVENTS {
-        int evt PK "Event ID (Primary Key)"
-    }
-    
-    LAMBDA_PARTICLES {
-        int evt FK "Event ID (Foreign Key)"
-        int lam_id "Lambda particle ID"
-        int lam_pdg "Particle type (3122 for Λ⁰)"
-        float lam_px "Momentum X"
-        float lam_py "Momentum Y" 
-        float lam_pz "Momentum Z"
-        int prot_id "Proton from decay"
-        int pimin_id "Pi-minus from decay"
-        string file_name "Source file"
-    }
-    
-    DIS_PARAMETERS {
-        int evt FK "Event ID (Foreign Key)"
-        float q2 "Momentum transfer squared"
-        float xbj "Bjorken x"
-        float nu "Energy transfer"
-        float w "Invariant mass"
-        float y_d "Inelasticity"
-        string file_name "Source file"
-    }
+## mcpart_lambda
+
+Files: `*.mcpart_lambda.csv`
+
+Full chane lambda decays by using `MCParticles` EDM4EIC table. 
+MCParticles has relations like daughters and parents. Those relations are 
+flattened for lambda decays. The column represent possible lambda decays are grouped by particles: 
+
+Prefixes (each has the same parameters after)
+
+1. `lam` - Λ 
+1. `prot` - p (if pπ- decay or nulls)
+1. `pimin` - π- (if pπ- decay or nulls)
+1. `neut` - Neutron (if n π0 decay)
+1. `pizero` - pi0 - (if n π0 decay)
+1. `gamone` - γ one from π0 decay (if pi0 decays)
+1. `gamtwo` - γ two from π0 decay (if pi0 decays)
+
+For each particle prefix, the next columns are saved: 
+
+01. `{0}_id`     -   id - particle index in MCParticles table
+02. `{0}_pdg`    -   pdg - particle PDG
+03. `{0}_gen`    -   gen - Generator Status (1 stable... probably)
+04. `{0}_sim`    -   sim - Simulation Status (by Geant4)
+05. `{0}_px`     -   px - Momentum
+06. `{0}_py`     -   py
+07. `{0}_pz`     -   pz
+08. `{0}_vx`     -   vx - Origin vertex information
+09. `{0}_vy`     -   vy
+10. `{0}_vz`     -   vz
+11. `{0}_epx`    -   epx - End Point (decay, or out of detector)
+12. `{0}_epy`    -   epy
+13. `{0}_epz`    -   epz
+14. `{0}_time`   -   time - Time of origin
+15. `{0}_nd`     -   nd - Number of daughters
+
+So in the end the columns are: 
+
+```yaml
+evt,
+lam_id,lam_pdg,lam_gen,lam_sim,lam_px,lam_py,lam_pz,lam_vx,lam_vy,lam_vz,lam_epx,lam_epy,lam_epz,lam_time,lam_nd,
+prot_id,prot_pdg,prot_gen,prot_sim,prot_px,prot_py,prot_pz,prot_vx,prot_vy,prot_vz,prot_epx,prot_epy,prot_epz,prot_time,prot_nd,
+pimin_id,pimin_pdg,pimin_gen,pimin_sim,pimin_px,pimin_py,pimin_pz,pimin_vx,pimin_vy,pimin_vz,pimin_epx,pimin_epy,pimin_epz,pimin_time,pimin_nd,neut_id,
+neut_pdg,neut_gen,neut_sim,neut_px,neut_py,neut_pz,neut_vx,neut_vy,neut_vz,neut_epx,neut_epy,neut_epz,neut_time,neut_nd,
+pizero_id,pizero_pdg,pizero_gen,pizero_sim,pizero_px,pizero_py,pizero_pz,pizero_vx,pizero_vy,pizero_vz,pizero_epx,pizero_epy,pizero_epz,pizero_time,pizero_nd,
+gamone_id,gamone_pdg,gamone_gen,gamone_sim,gamone_px,gamone_py,gamone_pz,gamone_vx,gamone_vy,gamone_vz,gamone_epx,gamone_epy,gamone_epz,gamone_time,gamone_nd,
+gamtwo_id,gamtwo_pdg,gamtwo_gen,gamtwo_sim,gamtwo_px,gamtwo_py,gamtwo_pz,gamtwo_vx,gamtwo_vy,gamtwo_vz,gamtwo_epx,gamtwo_epy,gamtwo_epz,gamtwo_time,gamtwo_nd
 ```
 
-## Combine Multiple Files
+Notes:
+
+- Particles may not be decayed. E.g. Lambda may just go outside of detector designated volume, 
+  in this case `lam_nd` - Number of daughters will be 0 and the rest of columns will be null 
 
-The Key Challenge: Multiple Files = Broken Relationships
 
-When we have multiple CSV files from different runs or datasets, each file starts its event numbering from 0:
+## Combine Multiple Files
+
+When we have multiple CSV files from different runs or datasets, 
+each file starts its event numbering from 0:
 
 ```
 File 1: evt = [0, 1, 2, 3, 4, ...]
 File 2: evt = [0, 1, 2, 3, 4, ...]  ← ID Collision!
 File 3: evt = [0, 1, 2, 3, 4, ...]  ← ID Collision!
 ```
 
-**Problem**: Event 0 from File 1 is completely different from Event 0 from File 2, but they have the same ID!
-
-**Solution**: Global Unique Event IDs
+**Problem**: Event 0 from File 1 is completely different from Event 0 from File 2, 
+but they have the same ID if read in pandas directly!
 
-We need to create globally unique event IDs across all files:
+Use functions like this to read multiple files in one DF
 
 ```python
 import pandas as pd
@@ -146,90 +192,3 @@ File 1: evt = [0, 1, 2, 3, 4]
 File 2: evt = [5, 6, 7, 8, 9]     ← No collision!  
 File 3: evt = [10, 11, 12, 13, 14] ← No collision!
 ```
-
-## Database Operations in Pandas
-
-Now we can perform standard database operations:
-
-### 1. Inner Join (SQL equivalent: `INNER JOIN`)
-Get Lambda particles with their corresponding DIS parameters:
-
-```python
-# Join tables on event ID
-joined = lambda_df.merge(dis_df, on='evt', how='inner')
-print(f"Found {len(joined)} lambda particles with DIS data")
-```
-
-### 2. Filter and Join (SQL equivalent: `WHERE` + `JOIN`)
-Find Lambda particles in high-Q² events:
-
-```python
-# High-Q² events
-high_q2_events = dis_df[dis_df['q2'] > 50]
-
-# Lambda particles in those events
-high_q2_lambdas = lambda_df.merge(high_q2_events[['evt', 'q2']], on='evt')
-print(f"Found {len(high_q2_lambdas)} lambdas in high-Q² events")
-```
-
-### 3. Aggregation (SQL equivalent: `GROUP BY`)
-Count Lambda particles per event:
-
-```python
-lambdas_per_event = lambda_df.groupby('evt').size()
-print(f"Average lambdas per event: {lambdas_per_event.mean():.2f}")
-```
-
-## Real-World Example: Physics Analysis
-
-Let's analyze Lambda particle production in different kinematic regions:
-
-```python
-# Join lambda and DIS data
-physics_data = lambda_df.merge(dis_df, on='evt', how='inner')
-
-# Define kinematic regions
-low_x = physics_data[physics_data['xbj'] < 0.1]
-high_x = physics_data[physics_data['xbj'] > 0.3]
-
-print("Lambda production rates:")
-print(f"Low-x region (x < 0.1): {len(low_x)} lambdas in {low_x['evt'].nunique()} events")
-print(f"High-x region (x > 0.3): {len(high_x)} lambdas in {high_x['evt'].nunique()} events")
-
-# Calculate production rates
-low_x_rate = len(low_x) / low_x['evt'].nunique()
-high_x_rate = len(high_x) / high_x['evt'].nunique() 
-print(f"Production rates: {low_x_rate:.3f} vs {high_x_rate:.3f} lambdas/event")
-```
-
-## Key Database Concepts Applied
-
-| Database Concept | Our Implementation | Example |
-|------------------|-------------------|---------|
-| **Primary Key** | `evt` column | Unique identifier for each event |
-| **Foreign Key** | `evt` in both tables | Links lambda particles to their events |
-| **One-to-Many** | Event → Lambda particles | One event can have 0, 1, or many lambdas |
-| **One-to-One** | Event → DIS parameters | Each event has exactly one set of DIS data |
-| **JOIN** | `pandas.merge()` | Combine related data from both tables |
-| **Index** | Setting `evt` as index | Fast lookups and joins |
-
-## Best Practices
-
-1. **Always ensure unique IDs** when combining multiple files
-2. **Keep original file information** for traceability
-3. **Validate relationships** after joining (check for missing data)
-4. **Use appropriate join types**:
-   - `inner`: Only events with both lambda and DIS data
-   - `left`: All lambda particles, even if no DIS data
-   - `outer`: All events from both tables
-
-## Summary
-
-Thinking of CSV files as database tables helps organize complex particle physics analyses:
-
-- **CSV files** = Database tables
-- **evt column** = Primary/Foreign key linking related data  
-- **pandas operations** = SQL queries
-- **Global unique IDs** = Solution for multi-file datasets
-
-This approach scales well from small analyses to large datasets with millions of events across hundreds of files!