Commit 62fe311

Internal environment variable handling and documentation (TGSAI#730)
* Add isolated environment variable getters
* Fix pre-commit issues
* Remove unnecessary intermediate variable
* Refactor environment management to expose basic functions for lighter imports
* Add documentation page on environment variables
* pre-commit
* Update to use pydantic-settings
* Remove pre-v1 special case environment variable
* Remove MDIO__SEGY__SPEC from docs
* Remove helper functions in favor of direct attribute access and pydantic validation
* Update to instantiate MDIOSettings object at beginning of functions for easy access
* Fix pre-commit
* move settings to core
* remove extra boolean parsing because pydantic handles it.
* rename env to settings
* remove validation tests since pydantic already guarantees that.
* consistent version string for pydantic-settings
* rename settings to config
* rename config and add reference to Pydantic Settings object
* refactor doc page
* reorder tables

---------

Co-authored-by: Altay Sansal <[email protected]>
1 parent 8c40d61 commit 62fe311

File tree

12 files changed (+556, -260 lines changed)

docs/configuration.md

Lines changed: 201 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,201 @@
1+
```{eval-rst}
2+
:tocdepth: 2
3+
```
4+
5+
```{currentModule} mdio.core.config
6+
7+
```
8+
9+
# Configuration
10+
11+
MDIO can be configured using environment variables to customize behavior for import, export,
12+
and validation operations. These variables provide runtime control without requiring code changes.
13+
You can find a summary of the available variables and their defaults below.
14+
15+
| **Variable** | **Type** | **Default** |
16+
| ------------------------------------- | -------- | -------------------------------- |
17+
| `MDIO__IMPORT__CPU_COUNT` | `int` | Number of logical CPUs available |
18+
| `MDIO__EXPORT__CPU_COUNT` | `int` | Number of logical CPUs available |
19+
| `MDIO__GRID__SPARSITY_RATIO_WARN` | `float` | `2.0` |
20+
| `MDIO__GRID__SPARSITY_RATIO_LIMIT` | `float` | `10.0` |
21+
| `MDIO__IMPORT__SAVE_SEGY_FILE_HEADER` | `bool` | `False` |
22+
| `MDIO__IMPORT__CLOUD_NATIVE` | `bool` | `False` |
23+
| `MDIO__IMPORT__RAW_HEADERS` | `bool` | `False` |
24+
| `MDIO_IGNORE_CHECKS` | `bool` | `False` |
25+
26+
## CPU and Performance
27+
28+
### `MDIO__EXPORT__CPU_COUNT`
29+
30+
Controls the number of CPUs used during SEG-Y export operations. Adjust this to balance
31+
performance with system resource availability.
32+
33+
```shell
34+
$ export MDIO__EXPORT__CPU_COUNT=8
35+
$ mdio segy export input.mdio output.segy
36+
```
37+
38+
### `MDIO__IMPORT__CPU_COUNT`
39+
40+
Controls the number of CPUs used during SEG-Y import operations. Higher values can
41+
significantly speed up ingestion of large datasets.
42+
43+
```shell
44+
$ export MDIO__IMPORT__CPU_COUNT=16
45+
$ mdio segy import input.segy output.mdio --header-locations 189,193
46+
```
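
As a rough illustration of how a CPU-count variable with a "logical CPUs available" default can be resolved, here is a minimal standard-library sketch. The helper name `resolve_cpu_count` and its `environ` parameter are hypothetical and not part of MDIO. In this commit MDIO resolves these values through its pydantic-settings `MDIOSettings` object instead.

```python
from __future__ import annotations

import os


def resolve_cpu_count(var_name: str, environ: dict[str, str] | None = None) -> int:
    """Return the integer stored in `var_name`, or the logical CPU count if unset.

    Hypothetical helper for illustration only; not an MDIO API.
    """
    env = os.environ if environ is None else environ
    raw = env.get(var_name)
    if raw is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    # int() raises ValueError when the variable holds a non-integer string.
    return int(raw)


if __name__ == "__main__":
    print(resolve_cpu_count("MDIO__IMPORT__CPU_COUNT"))
```

Passing an explicit mapping, as in `resolve_cpu_count("MDIO__IMPORT__CPU_COUNT", {"MDIO__IMPORT__CPU_COUNT": "16"})`, keeps the lookup testable without touching the real environment.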
47+
48+
## Grid Validation
49+
50+
### `MDIO__GRID__SPARSITY_RATIO_WARN`
51+
52+
Sparsity ratio threshold that triggers warnings during grid validation. The sparsity ratio
53+
measures how sparse the trace grid is compared to a dense grid. Values above this threshold
54+
will log warnings but won't prevent operations.
55+
56+
```shell
57+
$ export MDIO__GRID__SPARSITY_RATIO_WARN=3.0
58+
```
59+
60+
### `MDIO__GRID__SPARSITY_RATIO_LIMIT`
61+
62+
Sparsity ratio threshold that triggers errors and prevents operations. Use this to enforce
63+
quality standards and prevent ingestion of excessively sparse datasets that may indicate
64+
data quality issues.
65+
66+
```shell
67+
$ export MDIO__GRID__SPARSITY_RATIO_LIMIT=15.0
68+
```
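
The interaction between the two thresholds can be sketched as follows. This mirrors the logic visible in `grid_density_qc` in this commit (grid size excluding the sample dimension, divided by the trace count), but it uses only the standard library. The function name `classify_sparsity` and its string return values are assumptions made for illustration; the real function logs a warning or raises `GridTraceSparsityError`.

```python
from __future__ import annotations

import math


def classify_sparsity(
    grid_shape: tuple[int, ...],
    num_traces: int,
    warn_ratio: float = 2.0,
    limit_ratio: float = 10.0,
    ignore_checks: bool = False,
) -> str:
    """Classify a grid as "ok", "warn", or "error" (hypothetical helper)."""
    # Total possible traces in the grid, excluding the last (sample) dimension.
    grid_traces = math.prod(grid_shape[:-1])

    # Avoid division by zero when there are no traces at all.
    sparsity_ratio = float("inf") if num_traces == 0 else grid_traces / num_traces

    # The limit only blocks the operation when checks are not ignored.
    if sparsity_ratio > limit_ratio and not ignore_checks:
        return "error"
    if sparsity_ratio > warn_ratio:
        return "warn"
    return "ok"
```

For example, a 10 x 10 grid that holds only 25 traces has a ratio of 4.0, which exceeds the default warning threshold of 2.0 but stays under the default limit of 10.0.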
69+
70+
## SEG-Y Processing
71+
72+
### `MDIO__IMPORT__SAVE_SEGY_FILE_HEADER`
73+
74+
**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
75+
76+
When enabled, preserves the original SEG-Y textual file header during import.
77+
This is useful for maintaining full SEG-Y standard compliance and preserving survey metadata.
78+
79+
```shell
80+
$ export MDIO__IMPORT__SAVE_SEGY_FILE_HEADER=true
81+
$ mdio segy import input.segy output.mdio --header-locations 189,193
82+
```
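
To make the accepted boolean spellings concrete, here is a small standard-library sketch of how such strings could be interpreted. It is not MDIO's implementation: per the commit message, boolean parsing is delegated to pydantic. The name `parse_bool_env` is hypothetical, and treating the values case-insensitively is an assumption of this sketch.

```python
TRUTHY = {"true", "1", "yes", "on"}
FALSY = {"false", "0", "no", "off"}


def parse_bool_env(raw: str) -> bool:
    """Map a documented boolean spelling to True or False (hypothetical helper)."""
    # Assumption: surrounding whitespace and letter case are ignored.
    value = raw.strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    raise ValueError(f"Not a recognized boolean value: {raw!r}")
```

Rejecting unknown strings, rather than silently treating them as false, surfaces typos such as `ture` early.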
83+
84+
### `MDIO__IMPORT__CLOUD_NATIVE`
85+
86+
**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
87+
88+
Enables buffered reads during SEG-Y header scans to optimize performance when reading from or
89+
writing to cloud object storage (S3, GCS, Azure). This mode balances bandwidth usage with read
90+
latency by processing the file twice: first to determine optimal buffering, then to perform the
91+
actual ingestion.
92+
93+
```{note}
94+
This variable is designed for cloud storage I/O, regardless of where the compute is running.
95+
```
96+
97+
**When to use:**
98+
99+
- Reading from cloud storage (e.g., `s3://bucket/input.segy`)
100+
- Writing to cloud storage (e.g., `gs://bucket/output.mdio`)
101+
102+
**When to skip:**
103+
104+
- Local file paths on fast storage
105+
- Very slow network connections where bandwidth is the primary bottleneck
106+
107+
```shell
108+
$ export MDIO__IMPORT__CLOUD_NATIVE=true
109+
$ mdio segy import s3://bucket/input.segy output.mdio --header-locations 189,193
110+
```
111+
112+
## Development and Testing
113+
114+
### `MDIO_IGNORE_CHECKS`
115+
116+
**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
117+
118+
Bypasses validation checks during MDIO operations. This is primarily intended for development,
119+
testing, or debugging scenarios where you need to work with non-standard data.
120+
121+
```{warning}
122+
Disabling checks can lead to corrupted output or unexpected behavior. Only use this
123+
when you understand the implications and are working in a controlled environment.
124+
```
125+
126+
```shell
127+
$ export MDIO_IGNORE_CHECKS=true
128+
$ mdio segy import input.segy output.mdio --header-locations 189,193
129+
```
130+
131+
## Deprecated Features
132+
133+
### `MDIO__IMPORT__RAW_HEADERS`
134+
135+
**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
136+
137+
```{warning}
138+
This is a deprecated feature and is expected to be removed without warning in a future release.
139+
```
140+
141+
## Configuration Best Practices
142+
143+
### Setting Multiple Variables
144+
145+
You can configure multiple environment variables at once:
146+
147+
```shell
148+
# Set for current session
149+
export MDIO__IMPORT__CPU_COUNT=16
150+
export MDIO__GRID__SPARSITY_RATIO_LIMIT=15.0
151+
export MDIO__IMPORT__CLOUD_NATIVE=true
152+
153+
# Run MDIO commands
154+
mdio segy import input.segy output.mdio --header-locations 189,193
155+
```
156+
157+
### Persistent Configuration
158+
159+
To make environment variables permanent, add them to your shell profile:
160+
161+
**Bash/Zsh:**
162+
163+
```shell
164+
# Add to ~/.bashrc or ~/.zshrc
165+
export MDIO__IMPORT__CPU_COUNT=16
166+
export MDIO__IMPORT__CLOUD_NATIVE=true
167+
```
168+
169+
**Windows:**
170+
171+
```console
172+
# Set permanently for the current user in PowerShell (the 'User' scope does not require Administrator)
173+
[System.Environment]::SetEnvironmentVariable('MDIO__IMPORT__CPU_COUNT', '16', 'User')
174+
```
175+
176+
### Project-Specific Configuration
177+
178+
For project-specific settings, use a `.env` file with tools like `python-dotenv`:
179+
180+
```python
181+
# example_import.py
182+
from dotenv import load_dotenv
183+
import mdio
184+
185+
load_dotenv() # Load environment variables from .env file
186+
# Your MDIO operations here
187+
```
188+
189+
```shell
190+
# .env file
191+
MDIO__IMPORT__CPU_COUNT=16
192+
MDIO__GRID__SPARSITY_RATIO_LIMIT=15.0
193+
MDIO__IMPORT__CLOUD_NATIVE=true
194+
```
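
Variables can also be set from Python for the duration of a single operation. Since the converters in this commit instantiate `MDIOSettings` at the start of each call, values placed in `os.environ` before the call would presumably be picked up. The context manager below is a minimal standard-library sketch; the name `temporary_env` is hypothetical and not part of MDIO.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**overrides: str):
    """Set environment variables inside a `with` block, then restore them."""
    # Remember the previous values (None means the variable was unset).
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old_value in saved.items():
            if old_value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old_value


if __name__ == "__main__":
    with temporary_env(MDIO__IMPORT__CPU_COUNT="16", MDIO__IMPORT__CLOUD_NATIVE="true"):
        # Your MDIO operations here
        print(os.environ["MDIO__IMPORT__CPU_COUNT"])
```

Restoring the previous values in `finally` keeps one import's settings from leaking into later operations in the same process.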
195+
196+
## Reference
197+
198+
```{eval-rst}
199+
.. autopydantic_model:: MDIOSettings
200+
:inherited-members: BaseSettings
201+
```

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ end-before: <!-- github-only -->
1616
1717
installation
1818
cli_usage
19+
configuration
1920
```
2021

2122
```{toctree}

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ dependencies = [
2525
"pint>=0.25.0",
2626
"psutil>=7.1.0",
2727
"pydantic>=2.12.0",
28+
"pydantic-settings>=2.6.1",
2829
"rich>=14.1.0",
2930
"segy>=0.5.3",
3031
"tqdm>=4.67.1",

src/mdio/converters/mdio.py

Lines changed: 4 additions & 7 deletions
@@ -2,16 +2,15 @@
22

33
from __future__ import annotations
44

5-
import os
65
from tempfile import TemporaryDirectory
76
from typing import TYPE_CHECKING
87

98
import numpy as np
10-
from psutil import cpu_count
119
from tqdm.dask import TqdmCallback
1210

1311
from mdio.api.io import _normalize_path
1412
from mdio.api.io import open_mdio
13+
from mdio.core.config import MDIOSettings
1514
from mdio.segy.blocked_io import to_segy
1615
from mdio.segy.creation import concat_files
1716
from mdio.segy.creation import mdio_spec_to_segy
@@ -29,10 +28,6 @@
2928
from upath import UPath
3029

3130

32-
default_cpus = cpu_count(logical=True)
33-
NUM_CPUS = int(os.getenv("MDIO__EXPORT__CPU_COUNT", default_cpus))
34-
35-
3631
def mdio_to_segy( # noqa: PLR0912, PLR0913, PLR0915
3732
segy_spec: SegySpec,
3833
input_path: UPath | Path | str,
@@ -73,6 +68,8 @@ def mdio_to_segy( # noqa: PLR0912, PLR0913, PLR0915
7368
>>> output_path = UPath("prefix/file.segy")
7469
>>> mdio_to_segy(input_path, output_path)
7570
"""
71+
settings = MDIOSettings()
72+
7673
input_path = _normalize_path(input_path)
7774
output_path = _normalize_path(output_path)
7875

@@ -148,7 +145,7 @@ def mdio_to_segy( # noqa: PLR0912, PLR0913, PLR0915
148145
if client is not None:
149146
block_records = block_records.compute()
150147
else:
151-
block_records = block_records.compute(num_workers=NUM_CPUS)
148+
block_records = block_records.compute(num_workers=settings.export_cpus)
152149

153150
ordered_files = [rec.path for rec in block_records.ravel() if rec != 0]
154151
ordered_files = [output_path] + ordered_files

src/mdio/converters/segy.py

Lines changed: 12 additions & 22 deletions
@@ -4,7 +4,6 @@
44

55
import base64
66
import logging
7-
import os
87
from typing import TYPE_CHECKING
98

109
import numpy as np
@@ -28,10 +27,10 @@
2827
from mdio.builder.schemas.v1.variable import VariableMetadata
2928
from mdio.builder.xarray_builder import to_xarray_dataset
3029
from mdio.constants import ZarrFormat
31-
from mdio.converters.exceptions import EnvironmentFormatError
3230
from mdio.converters.exceptions import GridTraceCountError
3331
from mdio.converters.exceptions import GridTraceSparsityError
3432
from mdio.converters.type_converter import to_structured_type
33+
from mdio.core.config import MDIOSettings
3534
from mdio.core.grid import Grid
3635
from mdio.core.utils_write import MAX_COORDINATES_BYTES
3736
from mdio.core.utils_write import MAX_SIZE_LIVE_MASK
@@ -92,30 +91,18 @@ def grid_density_qc(grid: Grid, num_traces: int) -> None:
9291
Raises:
9392
GridTraceSparsityError: If the sparsity ratio exceeds `MDIO__GRID__SPARSITY_RATIO_LIMIT`
9493
and `MDIO_IGNORE_CHECKS` is not set to a truthy value (e.g., "1", "true").
95-
EnvironmentFormatError: If `MDIO__GRID__SPARSITY_RATIO_WARN` or
96-
`MDIO__GRID__SPARSITY_RATIO_LIMIT` cannot be converted to a float.
9794
"""
95+
settings = MDIOSettings()
9896
# Calculate total possible traces in the grid (excluding sample dimension)
9997
grid_traces = np.prod(grid.shape[:-1], dtype=np.uint64)
10098

10199
# Handle division by zero if num_traces is 0
102100
sparsity_ratio = float("inf") if num_traces == 0 else grid_traces / num_traces
103101

104102
# Fetch and validate environment variables
105-
warning_ratio_env = os.getenv("MDIO__GRID__SPARSITY_RATIO_WARN", "2")
106-
error_ratio_env = os.getenv("MDIO__GRID__SPARSITY_RATIO_LIMIT", "10")
107-
ignore_checks_env = os.getenv("MDIO_IGNORE_CHECKS", "false").lower()
108-
ignore_checks = ignore_checks_env in ("1", "true", "yes", "on")
109-
110-
try:
111-
warning_ratio = float(warning_ratio_env)
112-
except ValueError as e:
113-
raise EnvironmentFormatError("MDIO__GRID__SPARSITY_RATIO_WARN", "float") from e # noqa: EM101
114-
115-
try:
116-
error_ratio = float(error_ratio_env)
117-
except ValueError as e:
118-
raise EnvironmentFormatError("MDIO__GRID__SPARSITY_RATIO_LIMIT", "float") from e # noqa: EM101
103+
warning_ratio = settings.grid_sparsity_ratio_warn
104+
error_ratio = settings.grid_sparsity_ratio_limit
105+
ignore_checks = settings.ignore_checks
119106

120107
# Check sparsity
121108
should_warn = sparsity_ratio > warning_ratio
@@ -373,8 +360,9 @@ def _populate_coordinates(
373360

374361

375362
def _add_segy_file_headers(xr_dataset: xr_Dataset, segy_file_info: SegyFileInfo) -> xr_Dataset:
376-
save_file_header = os.getenv("MDIO__IMPORT__SAVE_SEGY_FILE_HEADER", "") in ("1", "true", "yes", "on")
377-
if not save_file_header:
363+
settings = MDIOSettings()
364+
365+
if not settings.save_segy_file_header:
378366
return xr_dataset
379367

380368
expected_rows = 40
@@ -398,7 +386,7 @@ def _add_segy_file_headers(xr_dataset: xr_Dataset, segy_file_info: SegyFileInfo)
398386
"binaryHeader": segy_file_info.binary_header_dict,
399387
}
400388
)
401-
if os.getenv("MDIO__IMPORT__RAW_HEADERS") in ("1", "true", "yes", "on"):
389+
if settings.raw_headers:
402390
raw_binary_base64 = base64.b64encode(segy_file_info.raw_binary_headers).decode("ascii")
403391
xr_dataset["segy_file_header"].attrs.update({"rawBinaryHeader": raw_binary_base64})
404392

@@ -536,6 +524,8 @@ def segy_to_mdio( # noqa PLR0913
536524
Raises:
537525
FileExistsError: If the output location already exists and overwrite is False.
538526
"""
527+
settings = MDIOSettings()
528+
539529
_validate_spec_in_template(segy_spec, mdio_template)
540530

541531
input_path = _normalize_path(input_path)
@@ -565,7 +555,7 @@ def segy_to_mdio( # noqa PLR0913
565555
_, non_dim_coords = _get_coordinates(grid, segy_headers, mdio_template)
566556
header_dtype = to_structured_type(segy_spec.trace.header.dtype)
567557

568-
if os.getenv("MDIO__IMPORT__RAW_HEADERS") in ("1", "true", "yes", "on"):
558+
if settings.raw_headers:
569559
if zarr.config.get("default_zarr_format") == ZarrFormat.V2:
570560
logger.warning("Raw headers are only supported for Zarr v3. Skipping raw headers.")
571561
else:
