gridfm
diff --git a/README.md b/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/components/cli.md b/docs/components/cli.md
Lines changed: 28 additions & 33 deletions
diff --git a/docs/manual/admittance_perturbations.md b/docs/manual/admittance_perturbations.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/manual/generation_perturbations.md b/docs/manual/generation_perturbations.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/manual/getting_started.md b/docs/manual/getting_started.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/manual/outputs.md b/docs/manual/outputs.md
Lines changed: 24 additions & 11 deletions
diff --git a/docs/manual/topology_perturbations.md b/docs/manual/topology_perturbations.md
Lines changed: 1 addition & 1 deletion
diff --git a/gridfm_datakit/process/process_network.py b/gridfm_datakit/process/process_network.py
Lines changed: 2 additions & 2 deletions
diff --git a/scripts/compare_parquet_files.py b/scripts/compare_parquet_files.py
Lines changed: 4 additions & 4 deletions
diff --git a/scripts/config/Texas2k_case1_2016summerpeak.yaml b/scripts/config/Texas2k_case1_2016summerpeak.yaml
Lines changed: 4 additions & 4 deletions
@@ -182,9 +182,9 @@ settings:
   large_chunk_size: 1000 # Number of load scenarios processed before saving
   overwrite: true # If true, overwrites existing files, if false, appends to files
   mode: "pf" # Mode of the script; options: pf, opf. pf: power flow data where one or more operating limits – the inequality constraints defined in OPF, e.g., voltage magnitude or branch limits – may be violated. opf: generates datapoints for training OPF solvers, with cost-optimal dispatches that satisfy all operating limits (OPF-feasible)
-  include_dc_res: true # If true, also stores the results of dc power flow (in addition to the results AC power flow). does not work with mode "opf"
+  include_dc_res: true # If true, also stores the results of dc power flow or dc optimal power flow
   enable_solver_logs: true # If true, write OPF/PF logs to {data_dir}/solver_log; PF fast and DCPF fast do not log.
-  pf_fast: true # Whether to use fast PF solver by default (compute_ac_pf from powermodels.jl); if false, uses Ipopt-based PF. Some networks e.g. case10000_goc do not work with pf_fast: true. pf_fast is faster and more accurate than the Ipopt-based PF.
+  pf_fast: true # Whether to use fast PF solver by default (compute_ac_pf from powermodels.jl); if false, uses Ipopt-based PF. Some networks (typically large ones e.g. case10000_goc) do not work with pf_fast: true. pf_fast is faster and more accurate than the Ipopt-based PF.
   dcpf_fast: true # Whether to use fast DCPF solver by default (compute_dc_pf from PowerModels.jl)
   max_iter: 200 # Max iterations for Ipopt-based solvers
 ```
 
@@ -30,21 +30,47 @@ gridfm-datakit validate path/to/data/directory [--n-partitions N] [--sn-mva 100]
 
 **Arguments:**
 - `data_path`: Path to directory containing generated CSV files
-- `--n-partitions N`: Number of partitions to sample for validation (default: 100). Use 0 to validate all partitions.
+- `--n-partitions N`: Number of partitions (of 200 scenarios) to sample for validation (default: 100). Use 0 to validate all partitions.
 - `--sn-mva`: Base MVA used to scale power quantities (default: 100).
 
 **Examples:**
 ```bash
 # Validate with default sampling (100 partitions)
 gridfm-datakit validate ./data_out/case24_ieee_rts/raw
 
-# Validate with custom partition sampling
+# Validate custom number of partitions
 gridfm-datakit validate ./data_out/case24_ieee_rts/raw --n-partitions 50
 
 # Validate all partitions (slower but complete)
 gridfm-datakit validate ./data_out/case24_ieee_rts/raw --n-partitions 0
 ```
 
+The validation command performs the following checks:
+
+#### Y-Bus Consistency
+- Consistency of bus admittance matrix with branch admittance data
+- Y-bus matrix structure validation
+
+#### Branch Constraints
+- Deactivated lines have zero power flows and admittances
+- Computed vs stored power flow consistency
+- Branch loading limits (OPF mode only)
+
+#### Generator Constraints
+- Deactivated generators have zero power output
+- Generator power limits validation
+- Reactive power limits (OPF mode only)
+
+#### Power Balance
+- Bus generation consistency between bus_data and gen_data
+- Power Balance
+
+#### Data Integrity
+- Scenario indexing consistency across all files
+- Bus indexing consistency
+- Data completeness and missing value checks
+
+
 ### Stats
 
 Compute and display statistics from generated power flow data:
@@ -90,34 +116,3 @@ gridfm-datakit plots ./data_out/case24_ieee_rts/raw --sn-mva 100
 ```
 
 This command reads `bus_data.parquet`, normalizes power columns by `sn_mva`, and writes violin plots named `distribution_{feature_name}.png` to the output directory for quick visualization of feature distributions.
-
-## Validation Checks
-
-The validation command performs the following checks:
-
-### Y-Bus Consistency
-- Consistency of bus admittance matrix with branch admittance data
-- Y-bus matrix structure validation
-
-### Branch Constraints
-- Deactivated lines have zero power flows and admittances
-- Computed vs stored power flow consistency
-- Branch loading limits (OPF mode only)
-
-### Generator Constraints
-- Deactivated generators have zero power output
-- Generator power limits validation
-- Reactive power limits (OPF mode only)
-
-### Power Balance
-- Bus generation consistency between bus_data and gen_data
-- Power Balance
-
-### Data Integrity
-- Scenario indexing consistency across all files
-- Bus indexing consistency
-- Data completeness and missing value checks
-
-### `main`
-
-::: gridfm_datakit.cli.main
@@ -1,10 +1,10 @@
 # Admittance Perturbations
 
 ## Overview
-Admittance perturbations introduce changes to line admittance values by applying random scaling factors to the resistance ($R$) and reactance ($X$) parameters of grid lines. Admittance ($Y$) is related to impedance ($Z$) through $Y=1/Z$, and the impedance, in turn, is related to resistance and reactance through $Z=R+jX$. This results in more variance and diversity in power flow solutions which is beneficial for training ML models to improve generalization. Admittance perturbations are applied to the existing topology and generation perturbations.
+Admittance perturbations introduce changes to branch admittance values by applying random scaling factors to the resistance ($R$) and reactance ($X$) parameters of grid branches. This results in more variance and diversity in power flow solutions which is beneficial for training ML models to improve generalization.
 
 The module provides two options for admittance perturbation strategies:
 
-- `NoAdmittancePerturbationGenerator` yields the original example produced by the generation perturbation generator without any additional changes in line admittances.
+- `NoAdmittancePerturbationGenerator` yields the original example without any additional changes in branch admittances.
 
-- `PerturbAdmittanceGenerator` applies a scaling factor to all resistance and reactance values of network lines. The scaling factor is sampled from a uniform distribution with a range given by `[max(0, 1-sigma), 1+sigma)`, where `sigma` is a user-defined adjustable parameter.
+- `PerturbAdmittanceGenerator` applies a scaling factor to all resistance and reactance values of network branches. The scaling factor is sampled from a uniform distribution with a range given by `[max(0, 1-sigma), 1+sigma)`, where `sigma` is a user-defined adjustable parameter.
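
The sampling rule described for `PerturbAdmittanceGenerator` can be sketched in a few lines of Python. This is an illustrative sketch, not the package's implementation: the names `sample_scaling_factor` and `perturb_branch` are hypothetical, and drawing an independent factor for $R$ and for $X$ is an assumption that the text above does not confirm.

```python
import random


def sample_scaling_factor(sigma: float, rng: random.Random) -> float:
    """Draw a factor from the half-open range [max(0, 1 - sigma), 1 + sigma)."""
    low = max(0.0, 1.0 - sigma)
    high = 1.0 + sigma
    # rng.random() lies in [0.0, 1.0), so the result stays below `high`
    # (up to floating-point rounding).
    return low + (high - low) * rng.random()


def perturb_branch(
    r: float, x: float, sigma: float, rng: random.Random
) -> tuple[float, float]:
    """Scale resistance and reactance of one branch (hypothetical helper).

    Assumption: R and X each get their own independently sampled factor.
    """
    return r * sample_scaling_factor(sigma, rng), x * sample_scaling_factor(sigma, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
factors = [sample_scaling_factor(0.2, rng) for _ in range(1000)]
wide_factors = [sample_scaling_factor(2.0, rng) for _ in range(1000)]
r_new, x_new = perturb_branch(0.01, 0.1, 0.2, rng)
```

The `max(0, 1 - sigma)` clamp matters once `sigma` exceeds 1: without it the lower bound would go negative and could flip the sign of a resistance or reactance.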
@@ -5,7 +5,7 @@ Generation perturbations introduce random changes to the cost functions of gener
 
 The module provides three options for generation perturbation strategies:
 
-- `NoGenPerturbationGenerator` yields the original example produced by the topology perturbation generator without any additional changes in generation cost.
+- `NoGenPerturbationGenerator` yields the original example without any additional changes in generation cost.
 
 - `PermuteGenCostGenerator` randomly permutes the generator cost coefficients across and among generator elements.
 
 
@@ -91,17 +91,17 @@ The `mode` parameter controls how the power flow scenarios are generated and val
 - **Constraints**: Since the topology perturbations are performed after solving OPF, the inequality constraints of OPF (e.g. branch loading, voltage magnitude at PQ buses, generator bounds on reactive power, etc) might be violated.
 - **Use Case**: Training data for power flow, contingency analysis, etc
 - **Performance**: Faster as it avoids re-solving OPF for each perturbed scenario
-- **PF Solver Choice**: Controlled by `settings.pf_fast`. If `true`, uses the fast `compute_ac_pf` path. If `false`, uses the Ipopt-based AC PF for higher fidelity at the cost of speed.
+- **PF Solver Choice**: Controlled by `settings.pf_fast`. If `true`, uses the fast `compute_ac_pf` path. If `false`, uses the Ipopt-based AC PF which is slower for smaller grids but has better convergence properties for large grids.
 
 ## Data Validation
 
 The generated data can be validated using the CLI validation command:
 
 ```bash
-# Validate with default sampling (100 partitions)
+# Validate with default sampling (100 partitions of 200 scenarios)
 gridfm-datakit validate ./data_out/case24_ieee_rts/raw
 
-# Validate with custom partition sampling
+# Validate with custom number of partitions
 gridfm-datakit validate ./data_out/case24_ieee_rts/raw --n-partitions 50
 
 # Validate all partitions (slower but complete)
 
@@ -31,10 +31,13 @@ Metadata file containing the total number of scenarios (used for efficient parti
 
 ### Network Data Files
 
+**Note**: All network data files are saved as partitioned parquet directories. Each file includes a `scenario_partition` column used for partitioning, which groups scenarios into partitions (default: 200 scenarios per partition).
+
 #### `bus_data.parquet`
-Bus-level features for each processed scenario. Columns (BUS_COLUMNS):
+Bus-level features for each processed scenario. Columns:
 
-- **scenario**: Index of the scenario (unique identifier of the power flow case)
+- **scenario**: Global scenario index (unique identifier)
+- **load_scenario_idx**: Index of the load scenario
 - **bus**: Index of the bus
 - **Pd**: Active power demand at the bus (MW)
 - **Qd**: Reactive power demand at the bus (MVAr)
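
To make the partitioning note concrete, the sketch below maps scenario indices to partitions and shows how many scenarios a sampled validation run would cover. It assumes partitions are formed by integer division of the global scenario index by the partition size and that partitions are sampled uniformly at random; both of these, and the names `partition_of` and `sample_partitions`, are illustrative assumptions rather than the package's code.

```python
import random

PARTITION_SIZE = 200  # default scenarios per partition, per the note above


def partition_of(scenario: int, size: int = PARTITION_SIZE) -> int:
    """Assumed mapping from a global scenario index to its partition id."""
    return scenario // size


def sample_partitions(n_scenarios: int, n_partitions: int, seed: int = 0) -> list[int]:
    """Pick partition ids to validate; 0 selects every partition."""
    total = -(-n_scenarios // PARTITION_SIZE)  # ceiling division
    all_ids = list(range(total))
    if n_partitions == 0 or n_partitions >= total:
        return all_ids
    return sorted(random.Random(seed).sample(all_ids, n_partitions))


chosen = sample_partitions(n_scenarios=50_000, n_partitions=100)
# Upper bound on scenarios covered by the sampled partitions
print(len(chosen) * PARTITION_SIZE)
```

Under these assumptions, the default of 100 partitions covers at most 100 × 200 scenarios, which is why validating all partitions is slower on large datasets.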
@@ -56,9 +59,10 @@ If `settings.include_dc_res=True`, also includes DC power flow columns (DC_BUS_C
 - **Pg_dc**: DC active power generation at the bus (MW)
 
 #### `gen_data.parquet`
-Generator features per scenario. Columns (GEN_COLUMNS):
+Generator features per scenario. Columns:
 
-- **scenario**: Index of the scenario
+- **scenario**: Global scenario index (unique identifier)
+- **load_scenario_idx**: Index of the load scenario
 - **idx**: Generator row index (0-based)
 - **bus**: Bus index where the generator is connected
 - **p_mw**: Active power output (MW)
@@ -77,9 +81,10 @@ If `settings.include_dc_res=True`, also includes DC generator column (DC_GEN_COL
 - **p_mw_dc**: Active power from DC solution (MW)
 
 #### `branch_data.parquet`
-Branch features per scenario. Columns (BRANCH_COLUMNS):
+Branch features per scenario. Columns:
 
-- **scenario**: Index of the scenario
+- **scenario**: Global scenario index (unique identifier)
+- **load_scenario_idx**: Index of the load scenario
 - **idx**: Branch row index (0-based)
 - **from_bus**: Index of the source bus
 - **to_bus**: Index of the destination bus
@@ -110,9 +115,10 @@ If `settings.include_dc_res=True`, also includes DC branch columns (DC_BRANCH_CO
 - **pt_dc**: DC active power flow from destination to source (MW)
 
 #### `y_bus_data.parquet`
-Nonzero Y-bus entries per scenario with columns:
+Nonzero Y-bus entries per scenario. Columns:
 
-- **scenario**: Index of the scenario
+- **scenario**: Global scenario index (unique identifier)
+- **load_scenario_idx**: Index of the load scenario
 - **index1**: Row index in the Y-bus matrix
 - **index2**: Column index in the Y-bus matrix
 - **G**: Conductance value (p.u.)
@@ -121,7 +127,14 @@ Nonzero Y-bus entries per scenario with columns:
 ### Runtime Data Files
 
 #### `runtime_data.parquet`
-Runtime data for each scenario (AC and DC solver execution times).
+Runtime data for each scenario. Columns:
+
+- **scenario**: Global scenario index (unique identifier)
+- **load_scenario_idx**: Index of the load scenario
+- **ac**: AC solver execution time (seconds)
+
+If `settings.include_dc_res=True`, also includes DC runtime column (DC_RUNTIME_COLUMNS):
+- **dc**: DC solver execution time (seconds)
 
 ### Statistics Files
 
@@ -134,8 +147,8 @@ Aggregated statistics collected during generation (if `settings.no_stats=False`)
 - Maximum loading values
 - Other network performance metrics
 
-#### `stats_plot.html`
-HTML dashboard of the aggregated statistics (if `settings.no_stats=False`).
+#### `stats_plot.png`
+Visualization of the aggregated statistics (if `settings.no_stats=False`).
 
 ### Feature Visualization
 
 
@@ -2,7 +2,7 @@
 
 ## Overview
 
-Topology perturbations generate variations of the original network by altering its structure. These variations simulate contingencies and component failures, and are useful for robustness testing, contingency analysis, and training ML models on diverse grid conditions.
+Topology perturbations generate variations of the original network by altering its topology. These variations simulate contingencies and component failures, and are useful for robustness testing, contingency analysis, and training ML models on diverse grid conditions.
 
 The module provides three topology perturbation strategies:
 
 
@@ -741,9 +741,9 @@ def pf_post_processing(
     X_gen[:, 6] = net.gens[:, PMAX]
     X_gen[:, 7] = net.gens[:, QMIN]
     X_gen[:, 8] = net.gens[:, QMAX]
-    X_gen[:, 9] = net.gencosts[:, COST]
+    X_gen[:, 9] = net.gencosts[:, COST + 2]
     X_gen[:, 10] = net.gencosts[:, COST + 1]
-    X_gen[:, 11] = net.gencosts[:, COST + 2]
+    X_gen[:, 11] = net.gencosts[:, COST]
     X_gen[net.idx_gens_in_service, 12] = 1
 
     # slack gen (can be any generator connected to the ref node)
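
The swap in this hunk is consistent with the MATPOWER `gencost` convention, in which polynomial cost coefficients are stored from highest order to lowest, so for a quadratic cost the columns starting at `COST` hold $c_2, c_1, c_0$. The sketch below illustrates that ordering. It assumes that `net.gencosts` follows the MATPOWER layout and that columns 9 to 11 of `X_gen` are meant to hold $c_0, c_1, c_2$; the helper name and the sample row are hypothetical.

```python
# 0-based column indices of a MATPOWER-style gencost row:
# [MODEL, STARTUP, SHUTDOWN, NCOST, c_n, ..., c_1, c_0]
MODEL, STARTUP, SHUTDOWN, NCOST, COST = 0, 1, 2, 3, 4


def quadratic_coefficients(gencost_row: list[float]) -> tuple[float, float, float]:
    """Return (c0, c1, c2) for a quadratic polynomial cost row."""
    if int(gencost_row[NCOST]) != 3:
        raise ValueError("expected a quadratic cost (NCOST == 3)")
    c2 = gencost_row[COST]      # highest-order coefficient comes first
    c1 = gencost_row[COST + 1]
    c0 = gencost_row[COST + 2]  # constant term comes last
    return c0, c1, c2


# Hypothetical generator with cost 0.02 * P**2 + 15 * P + 100
row = [2, 0.0, 0.0, 3, 0.02, 15.0, 100.0]
c0, c1, c2 = quadratic_coefficients(row)


def cost(p_mw: float) -> float:
    """Evaluate the quadratic cost at an active power output in MW."""
    return c2 * p_mw**2 + c1 * p_mw + c0
```

Reading the columns in the wrong order would silently exchange the constant and quadratic terms, which produces plausible-looking but incorrect cost features.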
 
@@ -37,7 +37,7 @@
 generation_perturbation:
   type: "none" # Type of generation perturbation; options: cost_permutation, cost_perturbation, none
   # WARNING: the following parameter is only used if type is "cost_permutation"
-  sigma: 1.0 # Size of range use for sampling scaling factor
+  sigma: 1.0 # Size of range used for sampling scaling factor
 
 admittance_perturbation:
   type: "none" # Type of admittance perturbation; options: random_perturbation, none
@@ -49,10 +49,10 @@
   data_dir: "./testdelll" # Directory to save generated data relative to the project root
   large_chunk_size: 1000 # Number of load scenarios processed before saving
   overwrite: true # If true, overwrites existing files, if false, appends to files
-  mode: "pf" # Mode of the script; options: pf, opf. pf: power flow data where one or more operating limits – the inequality constraints defined in OPF, e.g., voltage magnitude or branch limits – may be violated. opf:  datapoints for training OPF solvers, with cost-optimal dispatches that satisfy all operating limits (OPF-feasible)
-  include_dc_res: true # If true, also stores the results of dc power flow (in addition to the results AC power flow). does not work with mode "opf"
+  mode: "pf" # Mode of the script; options: pf, opf. pf: power flow data where one or more operating limits – the inequality constraints defined in OPF, e.g., voltage magnitude or branch limits – may be violated. opf:  generates datapoints for training OPF solvers, with cost-optimal dispatches that satisfy all operating limits (OPF-feasible)
+  include_dc_res: true # If true, also stores the results of dc power flow or dc optimal power flow
   enable_solver_logs: true # If true, write OPF/PF logs to {data_dir}/solver_log; PF fast and DCPF fast do not log.
-  pf_fast: true # Whether to use fast PF solver by default (compute_ac_pf from powermodels.jl); if false, uses Ipopt-based PF. Some networks e.g. case10000_goc do not work with pf_fast: true. pf_fast is faster and more accurate than the Ipopt-based PF.
+  pf_fast: true # Whether to use fast PF solver by default (compute_ac_pf from powermodels.jl); if false, uses Ipopt-based PF. Some networks (typically large ones e.g. case10000_goc) do not work with pf_fast: true. pf_fast is faster and more accurate than the Ipopt-based PF.
   dcpf_fast: true # Whether to use fast DCPF solver by default (compute_dc_pf from PowerModels.jl)
   max_iter: 200 # Max iterations for Ipopt-based solvers
 
 
@@ -27,7 +27,7 @@ topology_perturbation:
 generation_perturbation:
   type: "cost_permutation" # Type of generation perturbation; options: cost_permutation, cost_perturbation, none
   # WARNING: the following parameter is only used if type is "cost_permutation"
-  sigma: 1.0 # Size of range use for sampling scaling factor
+  sigma: 1.0 # Size of range used for sampling scaling factor
 
 admittance_perturbation:
   type: "random_perturbation" # Type of admittance perturbation; options: random_perturbation, none
@@ -39,9 +39,9 @@ settings:
   data_dir: "./baseline_perturbations" # Directory to save generated data relative to the project root
   large_chunk_size: 10000 # Number of load scenarios processed before saving
   overwrite: true # If true, overwrites existing files, if false, appends to files
-  mode: "pf" # Mode of the script; options: pf, opf. pf: power flow data where one or more operating limits – the inequality constraints defined in OPF, e.g., voltage magnitude or branch limits – may be violated. opf:  datapoints for training OPF solvers, with cost-optimal dispatches that satisfy all operating limits (OPF-feasible)
-  include_dc_res: true # If true, also stores the results of dc power flow (in addition to the results AC power flow). does not work with mode "opf"
+  mode: "pf" # Mode of the script; options: pf, opf. pf: power flow data where one or more operating limits – the inequality constraints defined in OPF, e.g., voltage magnitude or branch limits – may be violated. opf:  generates datapoints for training OPF solvers, with cost-optimal dispatches that satisfy all operating limits (OPF-feasible)
+  include_dc_res: true # If true, also stores the results of dc power flow or dc optimal power flow
   enable_solver_logs: false # If true, write OPF/PF logs to {data_dir}/solver_log; PF fast and DCPF fast do not log.
-  pf_fast: true # Whether to use fast PF solver by default (compute_ac_pf from powermodels.jl); if false, uses Ipopt-based PF. Some networks e.g. case10000_goc do not work with pf_fast: true. pf_fast is faster and more accurate than the Ipopt-based PF.
+  pf_fast: true # Whether to use fast PF solver by default (compute_ac_pf from powermodels.jl); if false, uses Ipopt-based PF. Some networks (typically large ones e.g. case10000_goc) do not work with pf_fast: true. pf_fast is faster and more accurate than the Ipopt-based PF.
   dcpf_fast: true # Whether to use fast DCPF solver by default (compute_dc_pf from PowerModels.jl)
   max_iter: 200 # Max iterations for Ipopt-based solvers