Commit f4a5ad4

Parsing local and cloud SEG-Y files with new I/O library (#381)
* Refactor text_header setter type check

  Removed the list type check from the text_header setter in accessor.py. The application now expects a string input instead of a list, simplifying the validation process.

* Refactor workers for SEG-Y parsing

  Simplify `header_scan_worker` and `trace_worker` in SEG-Y module by removing unused imports and streamlining parameter list. Update functions to work directly with `SegyFile` instances and clean up data handling logic for efficiency.

* Refactor SEG-Y parser and streamline imports

  Refactor the parsing functions in `src/mdio/segy/parsers.py` to simplify the codebase and improve maintainability. Redundant functions such as `parse_binary_header`, `parse_text_header`, and `get_trace_count` have been removed, while imports have been condensed to only essential modules. The `NUM_CORES` logic is updated to count logical cores instead of just physical ones.

* Refactor SEG-Y converter and simplify imports

  Removed unused imports and functions in the SEG-Y converter module to enhance code maintainability. Simplified the arguments for the `segy_to_mdio` function to increase ease of use and readability. Reduced complexity by utilizing `SegyFile` class for SEG-Y file operations.

* Refactor get_grid_plan and remove unused imports

  The get_grid_plan function in utilities.py has been refactored to accept a SegyFile instance instead of individual parameters for the file path. Unused imports were eliminated, and type checking imports are now conditional, improving readability and modularity.

* use NDArray typing since we now return struct

* Refactor to use 'segy' instead of 'segyio'.

  The changes involve major refactoring of the code base to use the 'segy' library instead of 'segyio'. Most notably, this included updating the handling of SEG-Y dtypes, byte order, and trace headers. Unused imports have been removed to clean up the code. A new multiprocessing chunk size has been introduced and set attributes to SegyFile instance instead of passing them as function arguments.

* refactor override tests to use ndarray headers instead of a dictionary to make it work with 'segy'.

* Remove unit tests for IBM/IEEE conversions and text headers

* Refactor and simplify 6D tests related to SEG-Y

* Upgrade segy package version

  The segy package version has been updated from 0.0.13 to 0.0.14 in the pyproject.toml file. This upgrade was performed to update software dependencies and to integrate the latest bug fixes and features delivered with the new version.

* Refactored segy factory creation in mdio_to_segy function

  A new helper function, 'make_segy_factory', has been created to handle the generation of SegyFactory. This function accepts more parameters to provide better control over the creation of the SEG-Y based on the MDIO metadata. Changes also include updates in import declarations and reorganization of some code blocks in the 'mdio_spec_to_segy' function.

* Update segy library version

* Multiply sample_interval by 1000 in SegyFactory

  In the SegyFactory initialization within creation.py, the sample_interval parameter has been modified to be multiplied by 1000. This change ensures that the value is correctly represented in microseconds, aligning with the expected data format.

* fix docstring errors
* Update dependency package versions
* update field name for segy data
* import Endianness from new location
* use bleeding edge segy during dev
* allow configuring endianness on export
* update binary header
* Update the 'segy' git repository link
* Update virtualenv version in constraints.txt
* Update poetry version in workflow constraints
* update RtD dependencies
* switch myst-nb to stable
* fix broken tests
* simplify factory usage and fix tests
* add original segyio fields as spec
* Add pytest-dependency to project dev dependencies
* fix: headers were missed due to early return
* streamline mdio segy spec
* simplify mock 4d generation
* enforce mdio segy spec
* update type hints to the correct segy type.
* revert api
* update type hints
* remove endian from segy import because it's inferred
* remove output format from seg-y export. we only export as it's set in "binary header"
* update endian kwarg name
* revert to old api
* enable all tests
* Update get_grid_plan
* Remove unused byte swapping function from segy creation module.
* Remove now unused byte utils module

* Add temporary safety check ignore for specific CVE

  The safety check in noxfile.py has been updated to temporarily ignore a specific Common Vulnerabilities and Exposures (CVE) number because it's not deemed critical. A TODO note is added to remind removal of this exception once the issue is resolved.

* fix safety ignore syntax
* make temp zarr files module scoped
* revert to_segy endian api
* simplify changes
* Correct variable in default chunk selection
* Update segy package version in pyproject.toml
* use correct spec for factory
* use new endian inference from `segy`
* get header dtype from spec instead of reading a header
* remove unnecessary cast
* remove commented line
* Implement dynamic CPU count for header parsing
* backward_compat: revert text header to write as list[str] instead of str with newline
* generate spec as needed and avoid singleton bugs
* bump version
* add missing return doc
* Add future annotations import for type hints
1 parent 6bb1956 commit f4a5ad4
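
Two behaviours named in the commit message are easy to picture in isolation: the sample interval being multiplied by 1000 so that it lands in microseconds, and the text header being written as `list[str]` rather than one newline-joined `str`. The sketch below is illustrative only. The helper names are hypothetical, and the assumption that the incoming interval is in milliseconds is inferred from the factor of 1000 rather than stated in the commit.

```python
from __future__ import annotations


def ms_to_us(sample_interval_ms: float) -> int:
    """Convert a sample interval from milliseconds to microseconds.

    Assumption: the stored interval is in milliseconds, which is what
    a multiplication by 1000 to reach microseconds implies.
    """
    return int(round(sample_interval_ms * 1000))


def text_header_to_rows(text: str) -> list[str]:
    """Split a newline-joined text header into a list of rows."""
    return text.split("\n")


def rows_to_text_header(rows: list[str]) -> str:
    """Join text header rows back into one newline-separated string."""
    return "\n".join(rows)


# Hypothetical two-row header, just to show the round trip.
header = "C01 CLIENT: EXAMPLE\nC02 LINE: 001"
rows = text_header_to_rows(header)
```

Keeping the header as a list of rows means callers can index individual rows directly, while the joined form is convenient when a single string is expected.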

29 files changed: +1577 -2306 lines changed
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-poetry==1.8.2
+poetry==1.8.3

.github/workflows/constraints.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 pip==24.0
 nox==2024.4.15
 nox-poetry==1.0.3
-virtualenv==20.26.1
+virtualenv==20.26.2

docs/requirements.txt

Lines changed: 2 additions & 3 deletions
@@ -1,7 +1,6 @@
 furo==2024.5.6
 sphinx==7.3.7
-sphinx-click==5.1.0
+sphinx-click==6.0.0
 sphinx-copybutton==0.5.2
-# myst-nb==0.17.2
-myst-nb @ git+https://github.com/executablebooks/MyST-NB@35ebd54
+myst-nb==1.1.0
 linkify-it-py==2.0.3

noxfile.py

Lines changed: 11 additions & 7 deletions
@@ -144,7 +144,15 @@ def safety(session: Session) -> None:
     """Scan dependencies for insecure packages."""
     requirements = session.poetry.export_requirements()
     session.install("safety")
-    session.run("safety", "check", "--full-report", f"--file={requirements}")
+    # TODO(Altay): Remove the CVE ignore once its resolved. Its not critical, so ignoring now.
+    ignore = ["70612"]
+    session.run(
+        "safety",
+        "check",
+        "--full-report",
+        f"--file={requirements}",
+        f"--ignore={','.join(ignore)}",
+    )


 @session(python=python_versions)
@@ -219,9 +227,7 @@ def docs_build(session: Session) -> None:
         "sphinx-click",
         "sphinx-copybutton",
         "furo",
-        # TODO(Altay): Update this to v1.0.0 when its out. Right now we
-        # use this because myst-nb stable doesn't work with Sphinx 7.
-        "myst-nb@git+https://github.com/executablebooks/MyST-NB@35ebd54",
+        "myst-nb",
         "linkify-it-py",
     )

@@ -243,9 +249,7 @@ def docs(session: Session) -> None:
         "sphinx-click",
         "sphinx-copybutton",
         "furo",
-        # TODO(Altay): Update this to v1.0.0 when its out. Right now we
-        # use this because myst-nb stable doesn't work with Sphinx 7.
-        "myst-nb@git+https://github.com/executablebooks/MyST-NB@35ebd54",
+        "myst-nb",
         "linkify-it-py",
     )
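
The `safety` hunk above joins a list of vulnerability IDs into a single comma-separated `--ignore` flag. A minimal sketch of that argument construction, outside of nox, might look like the following. The function name `build_safety_args` is hypothetical, and unlike the diff it omits the flag when the list is empty, which is a design choice of this sketch rather than something the commit does.

```python
from __future__ import annotations


def build_safety_args(requirements: str, ignore: list[str]) -> list[str]:
    """Assemble the argument list for a `safety check` invocation."""
    args = ["safety", "check", "--full-report", f"--file={requirements}"]
    if ignore:
        # Several IDs collapse into one comma-separated flag value.
        args.append(f"--ignore={','.join(ignore)}")
    return args


# "70612" is the ID from the diff; "70613" is a made-up second ID
# used only to show the comma joining.
single = build_safety_args("requirements.txt", ["70612"])
double = build_safety_args("requirements.txt", ["70612", "70613"])
```

Keeping the IDs in a list makes it a one-line change to add or drop an entry once the TODO in the diff is resolved.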

poetry.lock

Lines changed: 1059 additions & 492 deletions
Generated file; diff not rendered by default.

pyproject.toml

Lines changed: 31 additions & 31 deletions
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "multidimio"
-version = "0.7.4"
+version = "0.8.0"
 description = "Cloud-native, scalable, and user-friendly multi dimensional energy data!"
 authors = ["TGS <[email protected]>"]
 maintainers = [
@@ -26,22 +26,21 @@ Changelog = "https://github.com/TGSAI/mdio-python/releases"
 python = ">=3.9,<3.13"
 click = "^8.1.7"
 click-params = "^0.5.0"
-zarr = "^2.16.1"
-dask = ">=2023.10.0"
-tqdm = "^4.66.1"
-segyio = "^1.9.3"
-numba = "^0.59.1"
-psutil = "^5.9.5"
-fsspec = ">=2023.9.1"
+zarr = "^2.18.2"
+dask = ">=2024.6.1"
+tqdm = "^4.66.4"
+psutil = "^6.0.0"
+fsspec = ">=2024.6.0"
+segy = "^0.1.4"
 rich = "^13.7.1"
 urllib3 = "^1.26.18" # Workaround for poetry-plugin-export/issues/183

 # Extras
-distributed = {version = ">=2023.10.0", optional = true}
-bokeh = {version = "^3.2.2", optional = true}
-s3fs = {version = ">=2023.5.0", optional = true}
-gcsfs = {version = ">=2023.5.0", optional = true}
-adlfs = {version = ">=2023.4.0", optional = true}
+distributed = {version = ">=2024.6.1", optional = true}
+bokeh = {version = "^3.4.1", optional = true}
+s3fs = {version = ">=2024.6.0", optional = true}
+gcsfs = {version = ">=2024.6.0", optional = true}
+adlfs = {version = ">=2024.4.1", optional = true}
 zfpy = {version = "^0.5.5", optional = true}

 [tool.poetry.extras]
@@ -51,30 +50,31 @@ lossy = ["zfpy"]

 [tool.poetry.group.dev.dependencies]
 black = "^24.4.2"
-coverage = {version = "^7.4.0", extras = ["toml"]}
+coverage = {version = "^7.5.3", extras = ["toml"]}
 darglint = "^1.8.1"
-flake8 = "^7.0.0"
+flake8 = "^7.1.0"
 flake8-bandit = "^4.1.1"
-flake8-bugbear = "^23.12.2"
+flake8-bugbear = "^24.4.26"
 flake8-docstrings = "^1.7.0"
 flake8-rst-docstrings = "^0.3.0"
-furo = ">=2023.9.10"
+furo = ">=2024.5.6"
 isort = "^5.13.2"
-mypy = "^1.8.0"
-pep8-naming = "^0.13.3"
-pre-commit = "^3.6.0"
-pre-commit-hooks = "^4.5.0"
-pytest = "^7.4.4"
-pyupgrade = "^3.15.0"
-safety = "^2.3.5"
-sphinx-autobuild = "^2021.3.14"
-sphinx-click = "^5.1.0"
+mypy = "^1.10.0"
+pep8-naming = "^0.14.1"
+pre-commit = "^3.7.1"
+pre-commit-hooks = "^4.6.0"
+pytest = "^8.2.2"
+pytest-dependency = "^0.6.0"
+pyupgrade = "^3.16.0"
+safety = "^3.2.3"
+sphinx-autobuild = ">=2024.4.16"
+sphinx-click = "^6.0.0"
 sphinx-copybutton = "^0.5.2"
-typeguard = "^4.1.5"
-xdoctest = {version = "^1.1.2", extras = ["colors"]}
-myst-parser = "^2.0.0"
-Pygments = "^2.17.2"
-Sphinx = "^7.2.6"
+typeguard = "^4.3.0"
+xdoctest = {version = "^1.1.5", extras = ["colors"]}
+myst-parser = "^3.0.1"
+Pygments = "^2.18.0"
+Sphinx = "^7.3.7"

 [tool.poetry.scripts]
 mdio = "mdio.__main__:main"
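
Most of the constraints in this file use Poetry's caret syntax, where the leftmost non-zero component is the one that may not change. That is why `segy = "^0.1.4"` is a much tighter pin than `zarr = "^2.18.2"`. The helper below is a small sketch of that rule for plain three-part versions; `caret_bounds` is a hypothetical name and this is not Poetry's own resolver code.

```python
from __future__ import annotations


def caret_bounds(spec: str) -> tuple[str, str]:
    """Return (inclusive lower, exclusive upper) for a caret constraint.

    Only handles plain `^MAJOR.MINOR.PATCH` specs with at least one
    non-zero component; anything else raises ValueError.
    """
    if not spec.startswith("^"):
        raise ValueError(f"not a caret constraint: {spec!r}")
    lower = spec[1:]
    parts = [int(piece) for piece in lower.split(".")]
    if len(parts) != 3:
        raise ValueError(f"expected three components: {spec!r}")
    for index, value in enumerate(parts):
        if value != 0:
            # Bump the leftmost non-zero component and zero the rest.
            upper = parts[:index] + [value + 1] + [0] * (len(parts) - index - 1)
            return lower, ".".join(str(piece) for piece in upper)
    raise ValueError(f"all-zero version not handled: {spec!r}")


segy_bounds = caret_bounds("^0.1.4")
zarr_bounds = caret_bounds("^2.18.2")
```

Under this rule the `segy` dependency accepts only 0.1.x releases from 0.1.4 upward, while `zarr` accepts any 2.x release from 2.18.2 upward.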

src/mdio/commands/segy.py

Lines changed: 0 additions & 24 deletions
@@ -96,16 +96,6 @@
     help="Custom chunk size for bricked storage",
     type=IntListParamType(),
 )
-@option(
-    "-endian",
-    "--endian",
-    required=False,
-    default="big",
-    help="Endianness of the SEG-Y file",
-    type=Choice(["little", "big"]),
-    show_default=True,
-    show_choices=True,
-)
 @option(
     "-lossless",
     "--lossless",
@@ -152,7 +142,6 @@ def segy_import(
     header_types: list[str],
     header_names: list[str],
     chunk_size: list[int],
-    endian: str,
     lossless: bool,
     compression_tolerance: float,
     storage_options: dict[str, Any],
@@ -356,7 +345,6 @@ def segy_import(
         index_types=header_types,
         index_names=header_names,
         chunksize=chunk_size,
-        endian=endian,
         lossless=lossless,
         compression_tolerance=compression_tolerance,
         storage_options=storage_options,
@@ -377,16 +365,6 @@ def segy_import(
     type=STRING,
     show_default=True,
 )
-@option(
-    "-format",
-    "--segy-format",
-    required=False,
-    default="ibm32",
-    help="SEG-Y sample format",
-    type=Choice(["ibm32", "ieee32"]),
-    show_default=True,
-    show_choices=True,
-)
 @option(
     "-storage",
     "--storage-options",
@@ -408,7 +386,6 @@ def segy_export(
     mdio_file: str,
     segy_path: str,
     access_pattern: str,
-    segy_format: str,
     storage_options: dict[str, Any],
     endian: str,
 ):
@@ -438,7 +415,6 @@ def segy_export(
         mdio_path_or_buffer=mdio_file,
         output_segy_path=segy_path,
         access_pattern=access_pattern,
-        out_sample_format=segy_format,
         storage_options=storage_options,
         endian=endian,
     )
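
The import command loses its `--endian` option because, per the commit message, endianness is now inferred by the `segy` library. How that library does it is not shown in this diff. One common heuristic, sketched below purely as an assumption, is to read the two-byte data sample format code from the binary header (bytes 3225-3226 of the file) in both byte orders and keep the order that yields a small, plausible code. The names `infer_endianness`, `FORMAT_CODE_OFFSET`, and `PLAUSIBLE_CODES` are hypothetical.

```python
import struct

# 0-based offset of the data sample format code (file bytes 3225-3226).
FORMAT_CODE_OFFSET = 3224
# Assumption: format codes are small integers, 1 through 16.
PLAUSIBLE_CODES = range(1, 17)


def infer_endianness(header: bytes) -> str:
    """Guess byte order from the sample format code in a SEG-Y header."""
    raw = header[FORMAT_CODE_OFFSET : FORMAT_CODE_OFFSET + 2]
    if len(raw) != 2:
        raise ValueError("header too short to contain the format code")
    (as_big,) = struct.unpack(">H", raw)
    if as_big in PLAUSIBLE_CODES:
        return "big"
    (as_little,) = struct.unpack("<H", raw)
    if as_little in PLAUSIBLE_CODES:
        return "little"
    raise ValueError("format code is not plausible in either byte order")


# Synthetic 3600-byte headers (3200 text + 400 binary) with code 5.
big_header = bytes(3224) + struct.pack(">H", 5) + bytes(374)
little_header = bytes(3224) + struct.pack("<H", 5) + bytes(374)
```

The check is unambiguous for these codes because a small value read in the wrong byte order becomes a multiple of 256, which falls outside the plausible range.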

src/mdio/converters/mdio.py

Lines changed: 4 additions & 17 deletions
@@ -12,8 +12,6 @@

 from mdio import MDIOReader
 from mdio.segy.blocked_io import to_segy
-from mdio.segy.byte_utils import ByteOrder
-from mdio.segy.byte_utils import Dtype
 from mdio.segy.creation import concat_files
 from mdio.segy.creation import mdio_spec_to_segy
 from mdio.segy.utilities import segy_export_rechunker
@@ -34,7 +32,6 @@ def mdio_to_segy( # noqa: C901
     output_segy_path: str,
     endian: str = "big",
     access_pattern: str = "012",
-    out_sample_format: str = "ibm32",
     storage_options: dict = None,
     new_chunks: tuple[int, ...] = None,
     selection_mask: np.ndarray = None,
@@ -65,8 +62,6 @@ def mdio_to_segy( # noqa: C901
             endian. Default is 'big'.
         access_pattern: This specificies the chunk access pattern. Underlying
             zarr.Array must exist. Examples: '012', '01'
-        out_sample_format: Output sample format.
-            Currently support: {'ibm32', 'float32'}. Default is 'ibm32'.
         storage_options: Storage options for the cloud storage backend.
             Default: None (will assume anonymous access)
         new_chunks: Set manual chunksize. For development purposes only.
@@ -99,7 +94,6 @@ def mdio_to_segy( # noqa: C901
         ...     mdio_path_or_buffer="prefix2/file.mdio",
         ...     output_segy_path="prefix/file.segy",
         ...     selection_mask=boolean_mask,
-        ...     out_sample_format="float32",
         ... )

     """
@@ -117,25 +111,23 @@ def mdio_to_segy( # noqa: C901
     creation_args = [
         mdio_path_or_buffer,
         output_segy_path,
-        endian,
         access_pattern,
-        out_sample_format,
+        endian,
         storage_options,
         new_chunks,
-        selection_mask,
         backend,
     ]

     if client is not None:
         if distributed is not None:
             # This is in case we work with big data
             feature = client.submit(mdio_spec_to_segy, *creation_args)
-            mdio, sample_format = feature.result()
+            mdio, segy_factory = feature.result()
         else:
             msg = "Distributed client was provided, but `distributed` is not installed"
             raise ImportError(msg)
     else:
-        mdio, sample_format = mdio_spec_to_segy(*creation_args)
+        mdio, segy_factory = mdio_spec_to_segy(*creation_args)

     live_mask = mdio.live_mask.compute()

@@ -163,10 +155,6 @@ def mdio_to_segy( # noqa: C901
         selection_mask = selection_mask[dim_slices]
         live_mask = live_mask & selection_mask

-    # Parse output type and byte order
-    out_dtype = Dtype[out_sample_format.upper()]
-    out_byteorder = ByteOrder[endian.upper()]
-
     # tmp file root
     out_dir = path.dirname(output_segy_path)
     tmp_dir = TemporaryDirectory(dir=out_dir)
@@ -177,8 +165,7 @@ def mdio_to_segy( # noqa: C901
         samples=samples,
         headers=headers,
         live_mask=live_mask,
-        out_dtype=out_dtype,
-        out_byteorder=out_byteorder,
+        segy_factory=segy_factory,
         file_root=tmp_dir.name,
         axis=tuple(range(1, samples.ndim)),
     )
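
The `client.submit(...)` / `.result()` branch in `mdio_to_segy` runs the same function either on a distributed client or locally, and in both cases unpacks a two-item result. The sketch below reproduces that dispatch shape with the standard library's `ThreadPoolExecutor` standing in for a Dask `Client`, since both expose `submit` returning a future with `result()`. The names `run_maybe_remote` and `fake_spec_to_segy` are hypothetical stand-ins, not MDIO functions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_maybe_remote(func, args, client=None):
    """Call func(*args) on the client if one is given, else locally."""
    if client is not None:
        # Any object with submit() returning a future works here.
        future = client.submit(func, *args)
        return future.result()
    return func(*args)


def fake_spec_to_segy(path, endian):
    """Stand-in returning a (reader, factory) style pair."""
    return f"reader:{path}", f"factory:{endian}"


args = ["file.mdio", "big"]
with ThreadPoolExecutor(max_workers=1) as pool:
    remote_reader, remote_factory = run_maybe_remote(fake_spec_to_segy, args, client=pool)
local_reader, local_factory = run_maybe_remote(fake_spec_to_segy, args)
```

Because both paths return the same pair, the code after the dispatch does not need to know where the work ran.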
