Commit 90508c9

Merge branch 'main' into numpy_ingestion
2 parents 9868df2 + e10c50c commit 90508c9

32 files changed, 727 additions and 394 deletions

docs/configuration.md

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
+```{eval-rst}
+:tocdepth: 2
+```
+
+```{currentModule} mdio.core.config
+
+```
+
+# Configuration
+
+MDIO can be configured using environment variables to customize behavior for import, export,
+and validation operations. These variables provide runtime control without requiring code changes.
+You can find a summary of the available variables and their defaults below.
+
+| **Variable** | **Type** | **Default** |
+| ------------------------------------- | -------- | -------------------------------- |
+| `MDIO__IMPORT__CPU_COUNT` | `int` | Number of logical CPUs available |
+| `MDIO__EXPORT__CPU_COUNT` | `int` | Number of logical CPUs available |
+| `MDIO__GRID__SPARSITY_RATIO_WARN` | `float` | `2.0` |
+| `MDIO__GRID__SPARSITY_RATIO_LIMIT` | `float` | `10.0` |
+| `MDIO__IMPORT__SAVE_SEGY_FILE_HEADER` | `bool` | `False` |
+| `MDIO__IMPORT__CLOUD_NATIVE` | `bool` | `False` |
+| `MDIO__IMPORT__RAW_HEADERS` | `bool` | `False` |
+| `MDIO_IGNORE_CHECKS` | `bool` | `False` |
+
+## CPU and Performance
+
+### `MDIO__EXPORT__CPU_COUNT`
+
+Controls the number of CPUs used during SEG-Y export operations. Adjust this to balance
+performance with system resource availability.
+
+```shell
+$ export MDIO__EXPORT__CPU_COUNT=8
+$ mdio segy export input.mdio output.segy
+```
+
+### `MDIO__IMPORT__CPU_COUNT`
+
+Controls the number of CPUs used during SEG-Y import operations. Higher values can
+significantly speed up ingestion of large datasets.
+
+```shell
+$ export MDIO__IMPORT__CPU_COUNT=16
+$ mdio segy import input.segy output.mdio --header-locations 189,193
+```
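Review note: the two CPU variables share the same contract, an integer read from the environment with a fallback to the logical CPU count. A minimal Python sketch of that contract follows. `resolve_cpu_count` is a hypothetical helper written for illustration, not an MDIO API, and the rejection of values below 1 is an assumption that this diff does not confirm.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_cpu_count(variable: str, environ: Optional[Mapping] = None) -> int:
    """Hypothetical helper (not an MDIO API): resolve a CPU-count variable."""
    env = os.environ if environ is None else environ
    raw = env.get(variable)
    if raw is None:
        # Documented default: the number of logical CPUs available.
        return os.cpu_count() or 1
    count = int(raw)  # raises ValueError for non-integer text
    if count < 1:
        # Assumption: a non-positive worker count is treated as an error.
        raise ValueError(f"{variable} must be a positive integer, got {count}")
    return count


# Pass an explicit mapping so the example does not depend on the real environment.
print(resolve_cpu_count("MDIO__IMPORT__CPU_COUNT", {"MDIO__IMPORT__CPU_COUNT": "16"}))
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to test.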
+
+## Grid Validation
+
+### `MDIO__GRID__SPARSITY_RATIO_WARN`
+
+Sparsity ratio threshold that triggers warnings during grid validation. The sparsity ratio
+measures how sparse the trace grid is compared to a dense grid. Values above this threshold
+will log warnings but won't prevent operations.
+
+```shell
+$ export MDIO__GRID__SPARSITY_RATIO_WARN=3.0
+```
+
+### `MDIO__GRID__SPARSITY_RATIO_LIMIT`
+
+Sparsity ratio threshold that triggers errors and prevents operations. Use this to enforce
+quality standards and prevent ingestion of excessively sparse datasets that may indicate
+data quality issues.
+
+```shell
+$ export MDIO__GRID__SPARSITY_RATIO_LIMIT=15.0
+```
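Review note: the two thresholds form a warn-then-fail ladder. The sketch below illustrates that ladder under a stated assumption: the sparsity ratio is taken to be the number of cells in the dense grid divided by the number of traces actually present. The docs only say the ratio compares the trace grid to a dense grid, so that formula and the function name `check_sparsity` are illustrative, not MDIO's implementation.

```python
import math
import warnings


def check_sparsity(grid_shape, num_traces, warn_ratio=2.0, limit_ratio=10.0):
    """Hypothetical check (not an MDIO API) using the documented default thresholds."""
    # Assumption: ratio = dense grid cells / traces present.
    ratio = math.prod(grid_shape) / num_traces
    if ratio > limit_ratio:
        # Above the limit: stop the operation.
        raise ValueError(f"Sparsity ratio {ratio:.2f} exceeds limit {limit_ratio}")
    if ratio > warn_ratio:
        # Above the warning threshold only: report and continue.
        warnings.warn(f"Sparsity ratio {ratio:.2f} exceeds warning threshold {warn_ratio}")
    return ratio


# A fully populated 10 x 10 grid has ratio 1.0 and passes silently.
ratio = check_sparsity((10, 10), 100)
```

Checking the limit before the warning threshold ensures that a dataset over the limit raises once, rather than warning and then raising.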
+
+## SEG-Y Processing
+
+### `MDIO__IMPORT__SAVE_SEGY_FILE_HEADER`
+
+**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
+
+When enabled, preserves the original SEG-Y textual file header during import.
+This is useful for maintaining full SEG-Y standard compliance and preserving survey metadata.
+
+```shell
+$ export MDIO__IMPORT__SAVE_SEGY_FILE_HEADER=true
+$ mdio segy import input.segy output.mdio --header-locations 189,193
+```
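Review note: the boolean variables all list the same accepted spellings. The sketch below maps those spellings to `True` or `False`. `parse_bool` is a hypothetical helper, not MDIO's parser. Treating the input as case-insensitive and rejecting any other spelling are both assumptions.

```python
# Spellings taken from the "Accepted values" lists in the docs.
TRUE_VALUES = {"true", "1", "yes", "on"}
FALSE_VALUES = {"false", "0", "no", "off"}


def parse_bool(raw: str) -> bool:
    """Hypothetical helper (not an MDIO API): parse a boolean environment value."""
    value = raw.strip().lower()  # assumption: matching ignores case and padding
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"Not a recognized boolean value: {raw!r}")


print(parse_bool("on"), parse_bool("0"))
```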
+
+### `MDIO__IMPORT__CLOUD_NATIVE`
+
+**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
+
+Enables buffered reads during SEG-Y header scans to optimize performance when reading from or
+writing to cloud object storage (S3, GCS, Azure). This mode balances bandwidth usage with read
+latency by processing the file twice: first to determine optimal buffering, then to perform the
+actual ingestion.
+
+```{note}
+This variable is designed for cloud storage I/O, regardless of where the compute is running.
+```
+
+**When to use:**
+
+- Reading from cloud storage (e.g., `s3://bucket/input.segy`)
+- Writing to cloud storage (e.g., `gs://bucket/output.mdio`)
+
+**When to skip:**
+
+- Local file paths on fast storage
+- Very slow network connections where bandwidth is the primary bottleneck
+
+```shell
+$ export MDIO__IMPORT__CLOUD_NATIVE=true
+$ mdio segy import s3://bucket/input.segy output.mdio --header-locations 189,193
+```
+
+## Development and Testing
+
+### `MDIO_IGNORE_CHECKS`
+
+**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
+
+Bypasses validation checks during MDIO operations. This is primarily intended for development,
+testing, or debugging scenarios where you need to work with non-standard data.
+
+```{warning}
+Disabling checks can lead to corrupted output or unexpected behavior. Only use this
+when you understand the implications and are working in a controlled environment.
+```
+
+```shell
+$ export MDIO_IGNORE_CHECKS=true
+$ mdio segy import input.segy output.mdio --header-locations 189,193
+```
+
+## Deprecated Features
+
+### `MDIO__IMPORT__RAW_HEADERS`
+
+**Accepted values:** `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
+
+```{warning}
+This is a deprecated feature and is expected to be removed without warning in a future release.
+```
+
+## Configuration Best Practices
+
+### Setting Multiple Variables
+
+You can configure multiple environment variables at once:
+
+```shell
+# Set for current session
+export MDIO__IMPORT__CPU_COUNT=16
+export MDIO__GRID__SPARSITY_RATIO_LIMIT=15.0
+export MDIO__IMPORT__CLOUD_NATIVE=true
+
+# Run MDIO commands
+mdio segy import input.segy output.mdio --header-locations 189,193
+```
+
+### Persistent Configuration
+
+To make environment variables permanent, add them to your shell profile:
+
+**Bash/Zsh:**
+
+```shell
+# Add to ~/.bashrc or ~/.zshrc
+export MDIO__IMPORT__CPU_COUNT=16
+export MDIO__IMPORT__CLOUD_NATIVE=true
+```
+
+**Windows:**
+
+```console
+# Set permanently in PowerShell (run as Administrator)
+[System.Environment]::SetEnvironmentVariable('MDIO__IMPORT__CPU_COUNT', '16', 'User')
+```
+
+### Project-Specific Configuration
+
+For project-specific settings, use a `.env` file with tools like `python-dotenv`:
+
+```python
+# example_import.py
+from dotenv import load_dotenv
+import mdio
+
+load_dotenv()  # Load environment variables from .env file
+# Your MDIO operations here
+```
+
+```shell
+# .env file
+MDIO__IMPORT__CPU_COUNT=16
+MDIO__GRID__SPARSITY_RATIO_LIMIT=15.0
+MDIO__IMPORT__CLOUD_NATIVE=true
+```
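Review note: for readers without `python-dotenv` installed, the sketch below shows roughly what loading the `.env` file above amounts to, using only the standard library. It is a simplified stand-in, not the `python-dotenv` implementation: it ignores quoting, `export` prefixes, and variable interpolation. `load_env_text` is a hypothetical name. Existing variables are left untouched unless `override=True`, which mirrors the default behavior of `load_dotenv`.

```python
import os

ENV_TEXT = """\
# .env file
MDIO__IMPORT__CPU_COUNT=16
MDIO__GRID__SPARSITY_RATIO_LIMIT=15.0
MDIO__IMPORT__CLOUD_NATIVE=true
"""


def load_env_text(text, environ=None, override=False):
    """Hypothetical stand-in for a .env loader: apply KEY=VALUE lines to a mapping."""
    env = os.environ if environ is None else environ
    applied = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Variables that are already set take precedence unless override is requested.
        if override or key not in env:
            env[key] = value
            applied[key] = value
    return applied


# Use a plain dict so the example leaves the real environment alone.
settings = {"MDIO__IMPORT__CPU_COUNT": "4"}
print(load_env_text(ENV_TEXT, settings))
```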
+
+## Reference
+
+```{eval-rst}
+.. autopydantic_model:: MDIOSettings
+    :inherited-members: BaseSettings
+```

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ end-before: <!-- github-only -->
 
 installation
 cli_usage
+configuration
 ```
 
 ```{toctree}

docs/tutorials/custom_template.ipynb

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 "id": "85114119ae7a4db0",
 "metadata": {},
 "source": [
-"# Create and Register a Custom Template\n",
+"# MDIO Template Usage\n",
 "\n",
 "```{article-info}\n",
 ":author: Altay Sansal\n",

pyproject.toml

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,6 @@
 [project]
 name = "multidimio"
-version = "1.0.8"
+version = "1.0.9"
 description = "Cloud-native, scalable, and user-friendly multi dimensional energy data!"
 authors = [{ name = "Altay Sansal", email = "[email protected]" }]
 requires-python = ">=3.11,<3.14"
@@ -25,6 +25,7 @@ dependencies = [
     "pint>=0.25.0",
     "psutil>=7.1.0",
     "pydantic>=2.12.0",
+    "pydantic-settings>=2.6.1",
     "rich>=14.1.0",
     "segy>=0.5.3",
     "tqdm>=4.67.1",
@@ -182,7 +183,7 @@ init_typed = true
 warn_required_dynamic_aliases = true
 
 [tool.bumpversion]
-current_version = "1.0.8"
+current_version = "1.0.9"
 allow_dirty = true
 commit = false
 tag = false

src/mdio/builder/schemas/v1/units.py

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ class DensityUnitEnum(UnitEnum):
 class SpeedUnitEnum(UnitEnum):
     """Enum class representing units of speed."""
 
-    METER_PER_SECOND = ureg.meter / ureg.second
+    METERS_PER_SECOND = ureg.meter / ureg.second
     FEET_PER_SECOND = ureg.feet / ureg.second
 

src/mdio/builder/template_registry.py

Lines changed: 11 additions & 11 deletions
@@ -20,13 +20,13 @@
 from typing import TYPE_CHECKING
 
 from mdio.builder.formatting_html import template_registry_repr_html
+from mdio.builder.templates.seismic_2d_cdp import Seismic2DCdpGathersTemplate
 from mdio.builder.templates.seismic_2d_poststack import Seismic2DPostStackTemplate
-from mdio.builder.templates.seismic_2d_prestack_cdp import Seismic2DPreStackCDPTemplate
-from mdio.builder.templates.seismic_2d_prestack_shot import Seismic2DPreStackShotTemplate
+from mdio.builder.templates.seismic_2d_streamer_shot import Seismic2DStreamerShotGathersTemplate
+from mdio.builder.templates.seismic_3d_cdp import Seismic3DCdpGathersTemplate
+from mdio.builder.templates.seismic_3d_coca import Seismic3DCocaGathersTemplate
 from mdio.builder.templates.seismic_3d_poststack import Seismic3DPostStackTemplate
-from mdio.builder.templates.seismic_3d_prestack_cdp import Seismic3DPreStackCDPTemplate
-from mdio.builder.templates.seismic_3d_prestack_coca import Seismic3DPreStackCocaTemplate
-from mdio.builder.templates.seismic_3d_prestack_shot import Seismic3DPreStackShotTemplate
+from mdio.builder.templates.seismic_3d_streamer_shot import Seismic3DStreamerShotGathersTemplate
 
 if TYPE_CHECKING:
     from mdio.builder.templates.base import AbstractDatasetTemplate
@@ -126,15 +126,15 @@ def _register_default_templates(self) -> None:
         # CDP/CMP Ordered Data
         for data_domain in ("time", "depth"):
             for gather_domain in ("offset", "angle"):
-                self.register(Seismic3DPreStackCDPTemplate(data_domain, gather_domain))
-                self.register(Seismic2DPreStackCDPTemplate(data_domain, gather_domain))
+                self.register(Seismic3DCdpGathersTemplate(data_domain, gather_domain))
+                self.register(Seismic2DCdpGathersTemplate(data_domain, gather_domain))
 
-        self.register(Seismic3DPreStackCocaTemplate("time"))
-        self.register(Seismic3DPreStackCocaTemplate("depth"))
+        self.register(Seismic3DCocaGathersTemplate("time"))
+        self.register(Seismic3DCocaGathersTemplate("depth"))
 
         # Field (shot) data
-        self.register(Seismic2DPreStackShotTemplate("time"))
-        self.register(Seismic3DPreStackShotTemplate("time"))
+        self.register(Seismic2DStreamerShotGathersTemplate())
+        self.register(Seismic3DStreamerShotGathersTemplate())
 
     def get(self, template_name: str) -> AbstractDatasetTemplate:
         """Get an instance of a template from the registry by its name.

src/mdio/builder/templates/base.py

Lines changed: 18 additions & 3 deletions
@@ -148,14 +148,29 @@ def coordinate_names(self) -> tuple[str, ...]:
     @property
     def full_chunk_shape(self) -> tuple[int, ...]:
         """Returns the chunk shape for the variables."""
-        return copy.deepcopy(self._var_chunk_shape)
+        # If dimension sizes are not set yet, return the stored shape as-is
+        if len(self._dim_sizes) != len(self._dim_names):
+            return self._var_chunk_shape
+
+        # Expand -1 values to full dimension sizes
+        return tuple(
+            dim_size if chunk_size == -1 else chunk_size
+            for chunk_size, dim_size in zip(self._var_chunk_shape, self._dim_sizes, strict=False)
+        )
 
     @full_chunk_shape.setter
     def full_chunk_shape(self, shape: tuple[int, ...]) -> None:
         """Sets the chunk shape for the variables."""
-        if len(shape) != len(self._dim_sizes):
-            msg = f"Chunk shape {shape} does not match dimension sizes {self._dim_sizes}"
+        if len(shape) != len(self._dim_names):
+            msg = f"Chunk shape {shape} has {len(shape)} dimensions, expected {len(self._dim_names)}"
             raise ValueError(msg)
+
+        # Validate that all values are positive integers or -1
+        for chunk_size in shape:
+            if chunk_size != -1 and chunk_size <= 0:
+                msg = f"Chunk size must be positive integer or -1, got {chunk_size}"
+                raise ValueError(msg)
+
         self._var_chunk_shape = shape
 
     @property
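Review note: the new getter and setter treat `-1` as "use the full dimension size". The sketch below restates that logic as free functions so it can be run outside the template class. The function names are hypothetical. One simplification is an assumption of the sketch: it compares `dim_sizes` against the chunk shape's length, whereas the diff compares `_dim_sizes` against `_dim_names`.

```python
def validate_chunk_shape(shape, ndim):
    """Mirror of the setter checks: right number of dimensions, each positive or -1."""
    if len(shape) != ndim:
        raise ValueError(f"Chunk shape {shape} has {len(shape)} dimensions, expected {ndim}")
    for chunk_size in shape:
        if chunk_size != -1 and chunk_size <= 0:
            raise ValueError(f"Chunk size must be positive integer or -1, got {chunk_size}")


def expand_chunk_shape(chunk_shape, dim_sizes):
    """Mirror of the getter: replace -1 with the full size of that dimension."""
    if len(dim_sizes) != len(chunk_shape):
        # Dimension sizes not known yet: return the stored shape unchanged.
        return tuple(chunk_shape)
    return tuple(
        dim_size if chunk_size == -1 else chunk_size
        for chunk_size, dim_size in zip(chunk_shape, dim_sizes)
    )


shape = (128, -1, 64)
validate_chunk_shape(shape, ndim=3)
print(expand_chunk_shape(shape, dim_sizes=(512, 300, 1501)))
```

Because expansion happens on read, a template can store `-1` before the dimension sizes are known and resolve it later.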

src/mdio/builder/templates/seismic_2d_prestack_cdp.py renamed to src/mdio/builder/templates/seismic_2d_cdp.py

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-"""Seismic2DPreStackCDPTemplate MDIO v1 dataset templates."""
+"""Seismic2DCDPGathersTemplate MDIO v1 dataset templates."""
 
 from typing import Any
 
@@ -7,7 +7,7 @@
 from mdio.builder.templates.types import SeismicDataDomain
 
 
-class Seismic2DPreStackCDPTemplate(AbstractDatasetTemplate):
+class Seismic2DCdpGathersTemplate(AbstractDatasetTemplate):
     """Seismic CDP pre-stack 2D time or depth Dataset template."""
 
     def __init__(self, data_domain: SeismicDataDomain, gather_domain: CdpGatherDomain):
@@ -26,7 +26,7 @@ def __init__(self, data_domain: SeismicDataDomain, gather_domain: CdpGatherDomai
     def _name(self) -> str:
         gather_domain_suffix = self._gather_domain.capitalize()
         data_domain_suffix = self._data_domain.capitalize()
-        return f"PreStackCdp{gather_domain_suffix}Gathers2D{data_domain_suffix}"
+        return f"Cdp{gather_domain_suffix}Gathers2D{data_domain_suffix}"
 
     def _load_dataset_attributes(self) -> dict[str, Any]:
         return {"surveyType": "2D", "gatherType": "cdp"}

src/mdio/builder/templates/seismic_2d_prestack_shot.py renamed to src/mdio/builder/templates/seismic_2d_streamer_shot.py

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-"""Seismic2DPreStackShotTemplate MDIO v1 dataset templates."""
+"""Seismic2DStreamerShotGathersTemplate MDIO v1 dataset templates."""
 
 from typing import Any
 
@@ -9,10 +9,10 @@
 from mdio.builder.templates.types import SeismicDataDomain
 
 
-class Seismic2DPreStackShotTemplate(AbstractDatasetTemplate):
+class Seismic2DStreamerShotGathersTemplate(AbstractDatasetTemplate):
     """Seismic Shot pre-stack 2D time or depth Dataset template."""
 
-    def __init__(self, data_domain: SeismicDataDomain):
+    def __init__(self, data_domain: SeismicDataDomain = "time"):
         super().__init__(data_domain=data_domain)
 
         self._dim_names = ("shot_point", "channel", self._data_domain)
@@ -21,7 +21,7 @@ def __init__(self, data_domain: SeismicDataDomain):
 
     @property
     def _name(self) -> str:
-        return f"PreStackShotGathers2D{self._data_domain.capitalize()}"
+        return "StreamerShotGathers2D"
 
     def _load_dataset_attributes(self) -> dict[str, Any]:
         return {"surveyType": "2D", "ensembleType": "common_source"}
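Review note: the `_name` changes in the two renamed 2D templates alter the strings used to look templates up in the registry. The sketch below derives the old-to-new mapping for the 2D templates from the f-strings shown in this diff and from the domains registered in `_register_default_templates`. The helper names are hypothetical. Whether the registry keeps aliases for the old names is not shown in this diff, so the sketch assumes callers must switch to the new names.

```python
from itertools import product


def cdp_gathers_2d_name(gather_domain, data_domain, legacy=False):
    """Rebuild the 2D CDP template name from the f-strings shown in the diff."""
    prefix = "PreStackCdp" if legacy else "Cdp"
    return f"{prefix}{gather_domain.capitalize()}Gathers2D{data_domain.capitalize()}"


def renamed_2d_templates():
    """Map each old 2D template name to its new name."""
    mapping = {
        cdp_gathers_2d_name(gather, data, legacy=True): cdp_gathers_2d_name(gather, data)
        for data, gather in product(("time", "depth"), ("offset", "angle"))
    }
    # The shot template was registered only for "time" and no longer carries a domain suffix.
    mapping["PreStackShotGathers2DTime"] = "StreamerShotGathers2D"
    return mapping


for old_name, new_name in renamed_2d_templates().items():
    print(f"{old_name} -> {new_name}")
```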
