Merged
Changes from 2 commits
40 changes: 25 additions & 15 deletions dataretrieval/waterdata/utils.py
@@ -668,32 +668,40 @@ def _arrange_cols(
     return df.rename(columns={"id": output_id})
 
 
-def _cleanup_cols(df: pd.DataFrame, service: str = "daily") -> pd.DataFrame:
+def _type_cols(df: pd.DataFrame) -> pd.DataFrame:
     """
-    Cleans and standardizes columns in a pandas DataFrame for water data endpoints.
+    Casts columns into appropriate types.
 
     Parameters
     ----------
     df : pd.DataFrame
         The input DataFrame containing water data.
-    service : str, optional
-        The type of water data service (default is "daily").
 
     Returns
     -------
     pd.DataFrame
-        The cleaned DataFrame with standardized columns.
+        The DataFrame with columns cast to appropriate types.
-
-    Notes
-    -----
-    - If the 'time' column exists and service is "daily", it is converted to date objects.
-    - The 'value' and 'contributing_drainage_area' columns are coerced to numeric types.
     """
-    if "time" in df.columns and service == "daily":
-        df["time"] = pd.to_datetime(df["time"]).dt.date
Collaborator:

This piece of the function originally was just changing the "time" column to simply a date (not a timestamp) for the daily values endpoint only, so that the user wouldn't be confused about whether the value represents a daily aggregated value (min, max, mean, etc.) or a particular measurement. This logic was initially introduced to match what R dataRetrieval was doing: https://github.com/DOI-USGS/dataRetrieval/blob/develop/R/walk_pages.R#L141

@thodson-usgs (Collaborator, Author), Dec 2, 2025:

Right. My recollection was that a dt.date lacks datetime functionality; furthermore, the parsing behavior of pd.to_datetime seems to have changed. By default, it now correctly omits the time information. Maybe this was a pandas update, but in the current version it seems correct to leave "time" as a datetime object.

Collaborator:

I am not seeing that, or I am misunderstanding. When I run your branch and pull from get_latest_daily, the "time" column shows up as "2025-12-01 00:00:00", whereas in the existing implementation, it shows up as "2025-12-01".

check, md = waterdata.get_latest_daily(
    monitoring_location_id="USGS-05129115",
    parameter_code="00060"
)

I like the existing implementation for daily summaries only, because the date cannot be confused with a singular measurement, and it indeed represents a "summary" value. However, if it causes problems by being inconsistent in the daily summary services, I'm open to applying a consistent rule.

Collaborator:

Ah, you're correct, dt.date does lack datetime functionality. It changes it to an object. Hm. It does make sense to give it a datetime type. Never mind. We might then just want to note that the additional "00:00:00" appended to it doesn't represent a singular value.

Collaborator (Author):

Did it display 00:00:00? It didn't for me, so this behavior was probably changed in some version of pandas.
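A minimal sketch of the dtype difference under discussion (the column values here are made up, not data from the daily endpoint):

```python
import pandas as pd

# A daily-values-style "time" column parsed as the new code does:
# dtype stays datetime64[ns], so the .dt accessor keeps working
s = pd.to_datetime(pd.Series(["2025-12-01", "2025-12-02"]))
print(s.dtype)  # datetime64[ns]

# The old approach downcast to Python date objects, which demotes the
# Series to object dtype and drops datetime functionality
dates = s.dt.date
print(dates.dtype)  # object
```

Whether the midnight timestamps render as "2025-12-01" or "2025-12-01 00:00:00" depends on the display path and pandas version, but the dtype behavior above is stable.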

-    for col in ["value", "contributing_drainage_area"]:
-        if col in df.columns:
-            df[col] = pd.to_numeric(df[col], errors="coerce")
+    cols = set(df.columns)
+    numerical_cols = ["value", "contributing_drainage_area"]
+    time_cols = ["time", "datetime", "last_modified"]
@ehinman (Collaborator), Dec 1, 2025:

Is there a "datetime" column in any of the endpoints? I went through all of them (including continuous) and I don't see it. However, I do see some others:

Time series metadata: begin, end, begin_utc, end_utc
Monitoring locations: construction_date

And some additional numerics:

Monitoring locations: altitude, altitude_accuracy, drainage_area, well_constructed_depth, hole_constructed_depth

Collaborator (Author):

I'm not aware of a datetime column, but see no harm in including it.
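If the extra endpoint fields listed above were folded in, the lists might grow along these lines. This is a sketch of a possible follow-up, not code from this PR; treating construction_date as a time column and the altitude/depth fields as numerics is an assumption:

```python
# Hypothetical extension of the typed-column lists, based on the fields
# noted in review (not merged code)
time_cols = [
    "time", "datetime", "last_modified",
    "begin", "end", "begin_utc", "end_utc",  # time series metadata
    "construction_date",                      # monitoring locations
]
numerical_cols = [
    "value", "contributing_drainage_area",
    "altitude", "altitude_accuracy", "drainage_area",
    "well_constructed_depth", "hole_constructed_depth",
]
```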

+    categorical_cols = [
+        "approval_status",
+        "monitoring_location_id",
+        "parameter_code",
+        "unit_of_measure",
+    ]
+
+    for col in cols.intersection(time_cols):
+        df[col] = pd.to_datetime(df[col], errors="coerce")
+
+    for col in cols.intersection(numerical_cols):
+        df[col] = pd.to_numeric(df[col], errors="coerce")
+
+    for col in cols.intersection(categorical_cols):
+        df[col] = df[col].astype("category")
+
+    return df
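As a sanity check, the new typing logic can be exercised standalone. The function body below is reproduced from the diff; the sample frame and its column values are made up:

```python
import pandas as pd

def _type_cols(df: pd.DataFrame) -> pd.DataFrame:
    """Casts columns into appropriate types (reproduced from the diff)."""
    cols = set(df.columns)
    numerical_cols = ["value", "contributing_drainage_area"]
    time_cols = ["time", "datetime", "last_modified"]
    categorical_cols = [
        "approval_status",
        "monitoring_location_id",
        "parameter_code",
        "unit_of_measure",
    ]
    for col in cols.intersection(time_cols):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    for col in cols.intersection(numerical_cols):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    for col in cols.intersection(categorical_cols):
        df[col] = df[col].astype("category")
    return df

df = _type_cols(pd.DataFrame({
    "time": ["2025-12-01", "not-a-date"],
    "value": ["3.2", "ice"],
    "approval_status": ["Approved", "Provisional"],
}))
# time            -> datetime64[ns], unparseable entries coerced to NaT
# value           -> float64, unparseable entries coerced to NaN
# approval_status -> category
print(df.dtypes)
```

Note that errors="coerce" silently converts bad values to NaT/NaN rather than raising, which is the trade-off the function makes for robustness across endpoints.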


@@ -749,8 +757,10 @@ def get_ogc_data(
     )
     # Manage some aspects of the returned dataset
     return_list = _deal_with_empty(return_list, properties, service)
+
     if convert_type:
-        return_list = _cleanup_cols(return_list, service=service)
+        return_list = _type_cols(return_list)
+
     return_list = _arrange_cols(return_list, properties, output_id)
     # Create metadata object from response
     metadata = BaseMetadata(response)