Hapi csv codec #277

Merged
jeandet merged 17 commits into SciQLop:main from co-libri-org:hapi_csv_codec on Mar 20, 2026
Conversation

@RichardHitier (Contributor)

Enhance HAPI CSV codec features: read multi-axis variables, and write Speasy variables to HAPI CSV.

@RichardHitier force-pushed the hapi_csv_codec branch 2 times, most recently from 93da373 to 3c454fe, on March 3, 2026 11:02

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 83.75000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.48%. Comparing base (92d5bec) to head (36f946d).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...peasy/core/codecs/bundled_codecs/hapi_csv/codec.py 81.53% 8 Missing and 4 partials ⚠️
...easy/core/codecs/bundled_codecs/hapi_csv/writer.py 93.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #277      +/-   ##
==========================================
+ Coverage   84.34%   84.48%   +0.14%     
==========================================
  Files          69       69              
  Lines        4771     4834      +63     
  Branches      656      668      +12     
==========================================
+ Hits         4024     4084      +60     
+ Misses        516      508       -8     
- Partials      231      242      +11     
Flag Coverage Δ
unittests 84.48% <83.75%> (+0.14%) ⬆️


@RichardHitier RichardHitier force-pushed the hapi_csv_codec branch 3 times, most recently from c648d39 to ddc0ad1 Compare March 9, 2026 14:05
def test_spz_getdata_to_csv(self):
    hapi_csv_codec: CodecInterface = get_codec('hapi/csv')
    spz_var = spz.get_data("cda/STA_L1_HET/Proton_Flux", "2020-10-28", "2020-10-28T01")
    hapi_csv_file = hapi_csv_codec.save_variables(variables=[spz_var], file='test_output.csv')
Member

Use tmp file or dir for auto cleanup.

Contributor Author

done
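For reference, the temp-directory pattern suggested above could be sketched like this (a minimal, hypothetical helper; `_fake_save` stands in for the codec's `save_variables` call):

```python
import os
import tempfile


def save_to_tmp(save_fn):
    """Run a save function against a path inside a TemporaryDirectory,
    so the output file is removed automatically when the test finishes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_path = os.path.join(tmp_dir, "test_output.csv")
        save_fn(out_path)  # e.g. codec.save_variables(variables=[v], file=out_path)
        existed = os.path.exists(out_path)
    # tmp_dir (and everything in it) is gone here
    return existed, tmp_dir


def _fake_save(path):
    # stand-in for the real save call, just to make the sketch runnable
    with open(path, "w") as f:
        f.write("#header\n")


existed, tmp_dir = save_to_tmp(_fake_save)
cleaned_up = not os.path.exists(tmp_dir)
```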

raise ValueError("No variables to save")
hapi_csv_file = HapiCsvFile()
hapi_csv_file.add_parameter(_make_hapi_csv_time_axis(variables[0].time))
hapi_csv_file.add_parameter(_make_hapi_csv_time_axis(variables[0].axes[0]))
Member

time is cleaner and shows the intent

Contributor Author

done

for ax in _get_variable_axes(variable, is_time_dependent=True):
    meta = {
        "name": _time_dependent_axis_name(ax),
        "type": "double",
Member

Not sure hard-coding the type here is safe, ax.values can be of any numerical type.

Contributor Author

done (refactored with _create_meta())
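A dtype-driven mapping along the lines of that refactor could look like this sketch (`_create_meta` and the exact HAPI type names chosen here are assumptions based on the discussion, not the PR's actual code):

```python
import numpy as np


def _create_meta(name, values):
    """Hypothetical sketch: derive the HAPI 'type' field from the array
    dtype instead of hard-coding 'double'."""
    dtype = values.dtype
    if np.issubdtype(dtype, np.integer):
        hapi_type = "integer"
    elif np.issubdtype(dtype, np.floating):
        hapi_type = "double"
    elif np.issubdtype(dtype, np.datetime64):
        hapi_type = "isotime"
    elif np.issubdtype(dtype, np.str_):
        hapi_type = "string"
    else:
        raise ValueError(f"Unsupported data type {dtype}")
    return {"name": name, "type": hapi_type}


meta_float = _create_meta("frequency", np.array([0.5, 1.0]))
meta_int = _create_meta("index", np.array([1, 2]))
```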

Comment on lines +19 to +21
json_str = json.dumps(headers, indent=2, ensure_ascii=False)
commented = "\n".join("#" + line for line in json_str.splitlines())
dest.write((commented + "\n").encode("utf-8"))
Member

Why do we need this?

Contributor Author

this is how we prepend the JSON headers as comments on top of the .csv file

Member

Sure, but why binary and multi-line?

Contributor Author

  • multi-line because I find it more human readable
  • binary because it was the original format sent by save_hapi_csv()

do you prefer

  • a one-line header?
  • changing save_hapi_csv() to a text data type?

Member

One line is faster, so yes for that one.
For text, I don't know; it just seemed suspicious that we had to do this, and I want to be sure we are not missing anything that would break later.
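A one-line text header along those lines could be as simple as this sketch (`write_header_one_line` is a hypothetical name; the point is dropping `indent=` and writing str instead of bytes):

```python
import io
import json


def write_header_one_line(dest, headers):
    """Sketch: serialize the HAPI header as a single commented JSON line,
    writing text rather than UTF-8 bytes."""
    json_str = json.dumps(headers, ensure_ascii=False)  # no indent -> one line
    dest.write("#" + json_str + "\n")


buf = io.StringIO()
write_header_one_line(buf, {"HAPI": "3.2"})
header_line = buf.getvalue()
```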

time_dependent_axes = _get_variable_axes(variable, is_time_dependent=True)
if time_dependent_axes:
    bins.extend([
        {"name": ax.name, "unit": ax.unit, "centers": _time_dependent_axis_name(ax)}
Member

Suggested change
{"name": ax.name, "unit": ax.unit, "centers": _time_dependent_axis_name(ax)}
{"name": ax.name, "units": ax.unit, "centers": _time_dependent_axis_name(ax)}

Contributor Author

done

variables = hapi_csv_codec.load_variables(file=f, variables=['spectra_time_dependent_bins'], disable_cache=True)
self.assertIn('spectra_time_dependent_bins', variables)

def test_load_time_independant_axis(self):
Member

Suggested change
def test_load_time_independant_axis(self):
def test_load_time_independent_axis(self):

Contributor Author

done

Copilot AI left a comment

Pull request overview

Enhances the HAPI CSV codec to better support multi-axis (binned) variables on read and to export Speasy variables to HAPI CSV, with added regression tests and new CSV fixtures.

Changes:

  • Add tests and fixtures for time-independent and time-varying bin axes in HAPI CSV.
  • Extend HAPI CSV codec to decode bins into VariableAxis objects and emit bins metadata when saving.
  • Update HAPI CSV writer to output pretty-printed commented JSON headers and to expand multi-dimensional values into CSV columns.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

Show a summary per file
File | Description
tests/test_hapi_codecs.py | Adds unit tests for bin-axis decoding and for saving variables back to HAPI CSV.
tests/resources/HAPI_ndData_TimeVarying_Axis.csv | New fixture covering time-varying bin centers/ranges.
tests/resources/HAPI_ndData_TimeIndependent_Axis.csv | New fixture covering time-independent bin centers/ranges.
speasy/core/codecs/bundled_codecs/hapi_csv/writer.py | Updates header serialization and flattens multi-dim parameters into multiple CSV columns.
speasy/core/codecs/bundled_codecs/hapi_csv/codec.py | Adds bins<->axes mapping for load/save and writes additional parameters for time-varying axes.


Comment on lines +19 to +21
json_str = json.dumps(headers, indent=2, ensure_ascii=False)
commented = "\n".join("#" + line for line in json_str.splitlines())
dest.write((commented + "\n").encode("utf-8"))
Copilot AI Mar 11, 2026

The header is written as bytes (encode('utf-8')) but pandas.DataFrame.to_csv typically writes strings to the same dest. If dest is a text stream, line 21 will raise TypeError; if dest is a binary stream, to_csv may fail when writing strings. Use a consistent text writer for both header and dataframe output (e.g., write commented + '\\n' as str, or wrap a binary stream with a TextIOWrapper and pass that same wrapper to both writes).

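One way to make both writes consistent, per that suggestion, is to wrap a binary `dest` once and detach the wrapper afterwards so the underlying stream stays open (a sketch under the assumption that `dest` may be either a text or a binary stream; not the PR's actual code):

```python
import io
import json

import pandas as pds


def save_hapi_csv(dest, headers, df):
    """Sketch: route both the commented header and DataFrame.to_csv
    through the same text interface, whatever kind of stream dest is."""
    wrapper = None
    if isinstance(dest, (io.RawIOBase, io.BufferedIOBase)):
        wrapper = io.TextIOWrapper(dest, encoding="utf-8", newline="")
        text_dest = wrapper
    else:
        text_dest = dest
    text_dest.write("#" + json.dumps(headers) + "\n")
    df.to_csv(text_dest, index=False, header=False)
    if wrapper is not None:
        wrapper.flush()
        wrapper.detach()  # keep the underlying binary stream open


buf = io.BytesIO()
save_hapi_csv(buf, {"HAPI": "3.2"}, pds.DataFrame({"a": [1, 2]}))
content = buf.getvalue().decode("utf-8")
```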
Comment on lines 33 to 34
df = pds.DataFrame(data)
df.to_csv(dest, index=False, header=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ', float_format='%.3g')
Copilot AI Mar 11, 2026
The header is written as bytes (encode('utf-8')) but pandas.DataFrame.to_csv typically writes strings to the same dest. If dest is a text stream, line 21 will raise TypeError; if dest is a binary stream, to_csv may fail when writing strings. Use a consistent text writer for both header and dataframe output (e.g., write commented + '\\n' as str, or wrap a binary stream with a TextIOWrapper and pass that same wrapper to both writes).

Comment on lines +26 to +31
vals = param.values
if vals.ndim == 1:
    data[param.name] = vals
else:
    for i in range(vals.shape[1]):
        data[f"{param.name}_{i}"] = vals[:, i]
Copilot AI Mar 11, 2026

Multi-dimensional values are only expanded along shape[1] and sliced with vals[:, i]. For ndim > 2, vals[:, i] remains multi-dimensional and will produce object-like columns or inconsistent CSV output. Consider reshaping values to 2D (time, -1) and generating column names for all flattened components (or iterating across all non-time indices) so ndData variables with >2 dimensions serialize predictably.

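Reshaping to 2D as suggested could be sketched like this (`expand_columns` is a hypothetical helper illustrating the idea, not the PR's code):

```python
import numpy as np


def expand_columns(name, vals):
    """Sketch: flatten all non-time dimensions to shape (time, -1) so that
    ndData variables with >2 dimensions serialize into predictable columns."""
    if vals.ndim == 1:
        return {name: vals}
    flat = vals.reshape(vals.shape[0], -1)
    return {f"{name}_{i}": flat[:, i] for i in range(flat.shape[1])}


# 2 time steps, each with a 3x2 block of values -> 6 flat columns
cols = expand_columns("flux", np.arange(12).reshape(2, 3, 2))
```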
Comment on lines +57 to +64
{"name": ax.name, "unit": ax.unit, "centers": ax.values.tolist()}
for ax in time_independent_axes
])

time_dependent_axes = _get_variable_axes(variable, is_time_dependent=True)
if time_dependent_axes:
bins.extend([
{"name": ax.name, "unit": ax.unit, "centers": _time_dependent_axis_name(ax)}
Copilot AI Mar 11, 2026

The bin metadata uses the key unit, but HAPI parameter metadata (and your fixtures) use units. This will cause _decode_meta / _bin_to_axis to miss units and drop axis units on load. Rename unit to units in the created bins entries to match the HAPI schema and the rest of the codec.

Suggested change
-    {"name": ax.name, "unit": ax.unit, "centers": ax.values.tolist()}
+    {"name": ax.name, "units": ax.unit, "centers": ax.values.tolist()}
     for ax in time_independent_axes
 ])

 time_dependent_axes = _get_variable_axes(variable, is_time_dependent=True)
 if time_dependent_axes:
     bins.extend([
-        {"name": ax.name, "unit": ax.unit, "centers": _time_dependent_axis_name(ax)}
+        {"name": ax.name, "units": ax.unit, "centers": _time_dependent_axis_name(ax)}

Copilot uses AI. Check for mistakes.
Comment on lines +80 to +86
def _bin_to_axis(json_bin: Dict[str, Any], hap_csv_file: HapiCsvFile) -> VariableAxis:
    centers = json_bin.get("centers")
    name = json_bin.get("name", "bin_axis")
    if centers is None:
        raise ValueError("Invalid bin specification: missing 'centers' field")
    if isinstance(centers, str):
        hapi_parameter = hap_csv_file.get_parameter(centers)
Copilot AI Mar 11, 2026

Parameter name hap_csv_file looks like a typo and is easy to confuse with hapi_csv_file used elsewhere. Renaming it to hapi_csv_file would improve readability and reduce the chance of mistakes when editing these helpers.

Suggested change
-def _bin_to_axis(json_bin: Dict[str, Any], hap_csv_file: HapiCsvFile) -> VariableAxis:
+def _bin_to_axis(json_bin: Dict[str, Any], hapi_csv_file: HapiCsvFile) -> VariableAxis:
     centers = json_bin.get("centers")
     name = json_bin.get("name", "bin_axis")
     if centers is None:
         raise ValueError("Invalid bin specification: missing 'centers' field")
     if isinstance(centers, str):
-        hapi_parameter = hap_csv_file.get_parameter(centers)
+        hapi_parameter = hapi_csv_file.get_parameter(centers)

Copilot uses AI. Check for mistakes.
return variable_axis


def _bins_to_axes(json_bins: List[Dict[str, Any]], hap_csv_file: HapiCsvFile) -> List[VariableAxis]:
Copilot AI Mar 11, 2026

Parameter name hap_csv_file looks like a typo and is easy to confuse with hapi_csv_file used elsewhere. Renaming it to hapi_csv_file would improve readability and reduce the chance of mistakes when editing these helpers.

Comment on lines +131 to +135
def test_spz_getdata_to_csv(self):
    hapi_csv_codec: CodecInterface = get_codec('hapi/csv')
    spz_var = spz.get_data("cda/STA_L1_HET/Proton_Flux", "2020-10-28", "2020-10-28T01")
    hapi_csv_file = hapi_csv_codec.save_variables(variables=[spz_var], file='test_output.csv')
    self.assertTrue(hapi_csv_file)
Copilot AI Mar 11, 2026

This test performs a live spz.get_data(...) call (network / remote dependency) and writes to a fixed path (test_output.csv) in the working directory, which can make CI flaky and pollute the repo. Consider converting this into an integration test (skipped by default), mocking spz.get_data, and writing to a NamedTemporaryFile similar to the other save tests.

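A hermetic version of that test could combine mock.patch with a NamedTemporaryFile (sketch; `FakeSpz` stands in for the real speasy module so the example is self-contained, and the `f.write` line stands in for the codec's save call):

```python
import os
import tempfile
from unittest import mock


class FakeSpz:
    """Hypothetical stand-in for the module exposing get_data."""

    def get_data(self, *args):
        raise RuntimeError("network call - should be patched in tests")


spz = FakeSpz()


def test_save_without_network():
    """Patch get_data and write to a temp file so the test has no network
    dependency and leaves no test_output.csv behind in the repo."""
    with mock.patch.object(spz, "get_data", return_value="fake variable"):
        var = spz.get_data("cda/STA_L1_HET/Proton_Flux", "2020-10-28")
        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
            f.write(f"#{var}\n")  # stand-in for codec.save_variables(...)
            path = f.name
    with open(path) as f:
        written = f.read()
    os.unlink(path)  # explicit cleanup since delete=False
    return var, written


var, written = test_save_without_network()
```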
variables = hapi_csv_codec.load_variables(file=f, variables=['spectra_time_dependent_bins'], disable_cache=True)
self.assertIn('spectra_time_dependent_bins', variables)

def test_load_time_independant_axis(self):
Copilot AI Mar 11, 2026

Correct spelling: 'independant' -> 'independent' in the test name.

Suggested change
def test_load_time_independant_axis(self):
def test_load_time_independent_axis(self):

#}

1970-01-01T00:00:00.000Z,0,1,0.5,0.3333333333333333,0.25,0.2,0.16666666666666666,0.14285714285714285,0.125,0.1111111111111111,11,13,15,17,19,21,23,25,27,29,10,12,12,14,14,16,16,18,18,20,20,22,22,24,24,26,26,28,28,30
1970-01-01T00:01:08.000Z,0,1,0.5,0.3333333333333333,0.25,0.2,0.16666666666666666,0.14285714285714285,0.125,0.1111111111111111,1,3,5,7,9,11,13,15,17,19 No newline at end of file
Copilot AI Mar 11, 2026

The header declares three parameters after Time: spectra_time_dependent_bins (10 values), frequency_centers_time_varying (10 values), and frequency_ranges_time_varying (10x2 = 20 values). Row 68 includes the additional 20 range values, but row 69 does not, so the CSV row length is inconsistent with the declared schema. Update row 69 to include the missing 20 values (or remove/adjust the declared ranges parameter) so the fixture can be parsed reliably.

Suggested change
1970-01-01T00:01:08.000Z,0,1,0.5,0.3333333333333333,0.25,0.2,0.16666666666666666,0.14285714285714285,0.125,0.1111111111111111,1,3,5,7,9,11,13,15,17,19
1970-01-01T00:01:08.000Z,0,1,0.5,0.3333333333333333,0.25,0.2,0.16666666666666666,0.14285714285714285,0.125,0.1111111111111111,1,3,5,7,9,11,13,15,17,19,0,2,2,4,4,6,6,8,8,10,10,12,12,14,14,16,16,18,18,20

@RichardHitier force-pushed the hapi_csv_codec branch 2 times, most recently from 6e92056 to e19a699, on March 13, 2026 10:18
@brenard-irap (Collaborator)

Thank you, Richard.

A few remarks that could be discussed.

Here is what I am doing:

import speasy as spz
imf_data = spz.get_data(spz.inventories.tree.amda.Parameters.ACE.MFI.ace_imf_all.imf, "2008-01-01", "2008-01-02")
hapi_csv_codec = spz.core.codecs.get_codec('hapi/csv')
hapi_csv_codec.save_variables(variables=[imf_data], file='./imf_data_hapi.csv')

1. Missing required fields in the header

Currently, we have the following header:

#{
#  "HAPI": "3.2",
#  "status": {
#    "code": 1200,
#    "message": "OK request successful"
#  },
#  "parameters": [
#    {
#      "name": "Time",
#      "type": "isotime",
#      "units": "UTC",
#      "length": 30,
#      "fill": null
#    },
#    {
#      "name": "imf",
#      "units": "nT",
#      "fill": [
#        NaN
#      ],
#      "description": "",
#      "type": "double",
#      "label": [
#        "bx",
#        "by",
#        "bz"
#      ],
#      "size": [
#        3
#      ]
#    }
#  ]
#}

In my view, the writer should directly produce an output that complies with the HAPI specification: https://github.com/hapi-server/data-specification/blob/master/hapi-3.2.0/HAPI-data-access-spec-3.2.0.md#372-response

As stated in the spec, “the contents of the header should be the same as returned from the info endpoint.”

Based on this, I would expect the header to include at least the required fields listed here: https://github.com/hapi-server/data-specification/blob/master/hapi-3.2.0/HAPI-data-access-spec-3.2.0.md#362-info-response-object

In the current response, the following fields are missing:

  • format: for the hapi/csv writer this should be "csv"
  • startDate: could default to the first time in the variable
  • stopDate: could default to the last time in the variable
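Putting those required fields together, a compliant header builder might look like this (a sketch; the function name is hypothetical and the defaults follow the suggestions above):

```python
def build_hapi_header(parameters, times):
    """Sketch: include the required info-response fields noted as missing;
    startDate/stopDate default to the variable's time bounds."""
    return {
        "HAPI": "3.2",
        "status": {"code": 1200, "message": "OK request successful"},
        "format": "csv",          # required; "csv" for the hapi/csv writer
        "startDate": times[0],    # first time in the variable
        "stopDate": times[-1],    # last time in the variable
        "parameters": parameters,
    }


header = build_hapi_header(
    [{"name": "Time", "type": "isotime", "units": "UTC", "length": 30, "fill": None}],
    ["2008-01-01T00:00:08Z", "2008-01-02T00:00:00Z"],
)
```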

2. Data precision in the written file

The first data row in the generated CSV file is:

2008-01-01T00:00:08.000000Z,-3.84,0.428,-1.9

However, when inspecting the first data point directly in Python, I get:

>>> imf_data.values[0]
array([-3.84 ,  0.428, -1.905], dtype=float32)

There is a noticeable loss of precision on the third component.
After checking, this comes from the float_format definition in the writer:
https://github.com/co-libri-org/speasy/blob/6e920569ae0172cf7f967b281ff1b74e4e364c5b/speasy/core/codecs/bundled_codecs/hapi_csv/writer.py#L34
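To illustrate the precision point: '%.3g' cannot represent -1.905, while 9 significant digits are enough to round-trip any float32 value exactly (17 for float64); numpy's shortest-unique formatter is another option. A small sketch:

```python
import numpy as np

vals = np.array([-3.84, 0.428, -1.905], dtype=np.float32)

# '%.3g' truncates -1.905 to -1.9 (the loss seen in the CSV output);
# '%.9g' round-trips any float32 exactly, at the cost of longer output.
lossy = ["%.3g" % v for v in vals]
lossless = ["%.9g" % v for v in vals]

# Shortest representation that still round-trips uniquely:
shortest = [np.format_float_positional(v, unique=True) for v in vals]
```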

3. Optional header

Since the "data" endpoint allows the header to be optional, it would be useful to add a writer option to enable or disable header generation.
This would make the writer more flexible and aligned with the endpoint’s behavior.
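Such an option could be a simple keyword flag (a sketch with an assumed signature, returning a string instead of writing a file for brevity):

```python
def save_variables(variables, file, include_header=True):
    """Sketch (assumed signature): let callers skip the commented JSON
    header, mirroring the HAPI data endpoint's optional header behavior."""
    lines = []
    if include_header:
        lines.append('#{"HAPI": "3.2"}')
    lines.append("1970-01-01T00:00:00Z,1.0")  # placeholder data row
    return "\n".join(lines)


with_header = save_variables([], "out.csv")
without_header = save_variables([], "out.csv", include_header=False)
```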

I’ll continue running some additional tests on my side.

elif np.issubdtype(dtype, np.floating):
    return "double"
else:
    raise ValueError(f"Unsupported data type {variable.values.dtype}")
Collaborator

variable is not defined

Contributor Author

yes, sorry.
I have my own flake8 settings on pre-commit now ;-)

@@ -1,10 +1,17 @@
from datetime import datetime

Check notice — Code scanning / CodeQL: Unused import (note, test)

Import of 'datetime' is not used.

@jeandet merged commit 93ce7a0 into SciQLop:main on Mar 20, 2026 (16 of 23 checks passed)
@RichardHitier deleted the hapi_csv_codec branch on March 23, 2026 at 13:51