Set waterdata data types #195
Changes from 2 commits
```diff
@@ -668,32 +668,40 @@ def _arrange_cols(
     return df.rename(columns={"id": output_id})
 
 
-def _cleanup_cols(df: pd.DataFrame, service: str = "daily") -> pd.DataFrame:
+def _type_cols(df: pd.DataFrame) -> pd.DataFrame:
     """
-    Cleans and standardizes columns in a pandas DataFrame for water data endpoints.
+    Casts columns into appropriate types.
 
     Parameters
     ----------
     df : pd.DataFrame
         The input DataFrame containing water data.
-    service : str, optional
-        The type of water data service (default is "daily").
 
     Returns
     -------
     pd.DataFrame
-        The cleaned DataFrame with standardized columns.
+        The DataFrame with columns cast to appropriate types.
-
-    Notes
-    -----
-    - If the 'time' column exists and service is "daily", it is converted to date objects.
-    - The 'value' and 'contributing_drainage_area' columns are coerced to numeric types.
     """
-    if "time" in df.columns and service == "daily":
-        df["time"] = pd.to_datetime(df["time"]).dt.date
-    for col in ["value", "contributing_drainage_area"]:
-        if col in df.columns:
-            df[col] = pd.to_numeric(df[col], errors="coerce")
+    cols = set(df.columns)
+    numerical_cols = ["value", "contributing_drainage_area"]
+
+    time_cols = ["time", "datetime", "last_modified"]
+
+    categorical_cols = [
+        "approval_status",
+        "monitoring_location_id",
+        "parameter_code",
+        "unit_of_measure",
+    ]
+
+    for col in cols.intersection(time_cols):
+        df[col] = pd.to_datetime(df[col], errors="coerce")
+
+    for col in cols.intersection(numerical_cols):
+        df[col] = pd.to_numeric(df[col], errors="coerce")
+
+    for col in cols.intersection(categorical_cols):
+        df[col] = df[col].astype("category")
 
     return df
```
```diff
@@ -749,8 +757,10 @@ def get_ogc_data(
     )
     # Manage some aspects of the returned dataset
     return_list = _deal_with_empty(return_list, properties, service)
 
     if convert_type:
-        return_list = _cleanup_cols(return_list, service=service)
+        return_list = _type_cols(return_list)
 
     return_list = _arrange_cols(return_list, properties, output_id)
     # Create metadata object from response
     metadata = BaseMetadata(response)
```
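To make the effect of the new `_type_cols` helper concrete, here is a minimal standalone sketch that applies the same casts to a toy frame. The column values and station ID below are invented for illustration and are not taken from an actual waterdata response.

```python
import pandas as pd

# Toy frame mimicking a waterdata response; everything arrives as strings/objects.
df = pd.DataFrame(
    {
        "time": ["2025-12-01", "2025-12-02"],
        "value": ["1.5", "2.0"],
        "contributing_drainage_area": ["10.2", None],
        "approval_status": ["Approved", "Provisional"],
        "monitoring_location_id": ["USGS-01646500", "USGS-01646500"],
    }
)

cols = set(df.columns)
for col in cols.intersection(["time", "datetime", "last_modified"]):
    df[col] = pd.to_datetime(df[col], errors="coerce")
for col in cols.intersection(["value", "contributing_drainage_area"]):
    df[col] = pd.to_numeric(df[col], errors="coerce")
for col in cols.intersection(
    ["approval_status", "monitoring_location_id", "parameter_code", "unit_of_measure"]
):
    df[col] = df[col].astype("category")

print(df.dtypes)
# time                          datetime64[ns]
# value                                float64
# contributing_drainage_area           float64
# approval_status                     category
# monitoring_location_id              category
```

The `errors="coerce"` arguments mean unparseable entries become NaT/NaN rather than raising, which is the same trade-off the changed code makes.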
This piece of the function originally just changed the "time" column to a plain date (not a timestamp), for the daily values endpoint only, so that the user wouldn't be confused about whether the value represents a daily aggregated value (min, max, mean, etc.) or a particular measurement. This logic was initially introduced to match what R `dataRetrieval` was doing: https://github.com/DOI-USGS/dataRetrieval/blob/develop/R/walk_pages.R#L141
Right. My recollection was that `dt.date` lacks datetime functionality. Furthermore, the parsing behavior of `pd.to_datetime` seems to have changed: by default it now correctly omits the time information. Maybe this was a pandas update, but in the current version it seems correct to leave "time" as a datetime object.
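A minimal sketch of the parsing behavior described here (exact dtype strings may differ slightly between pandas versions):

```python
import pandas as pd

# Date-only strings parse straight to a datetime64 column; no extra time
# information is introduced beyond the implicit midnight of each Timestamp.
s = pd.to_datetime(pd.Series(["2025-12-01", "2025-12-02"]))
print(s.dtype)             # datetime64[ns]
print(s.dt.year.tolist())  # [2025, 2025] -- datetime functionality is retained
```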
I am not seeing that, or I am misunderstanding. When I run your branch and pull from `get_latest_daily`, the "time" column shows up as "2025-12-01 00:00:00", whereas in the existing implementation it shows up as "2025-12-01". I like the existing implementation for daily summaries only, because the date cannot be confused with a singular measurement, and it indeed represents a "summary" value. However, if it causes problems by being inconsistent in the daily summary services, I'm open to applying a consistent rule.
Ah, you're correct, `dt.date` does lack datetime functionality; it changes the column to an object dtype. Hm. It does make sense to give it a datetime type. Never mind. We might then just want to say that the additional "00:00:00" appended to it doesn't represent a singular value.
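For reference, a small sketch (not part of the PR) of the dtype change `dt.date` causes:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2025-12-01", "2025-12-02"]))
print(s.dtype)        # datetime64[ns]

dates = s.dt.date     # plain datetime.date objects
print(dates.dtype)    # object

# The object column no longer supports the .dt accessor:
# dates.dt.year  -> AttributeError: Can only use .dt accessor with datetimelike values
```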
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did it display 00:00:00? It didn't for me, so this behavior was probably changed in some version of pandas.
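Whether the trailing 00:00:00 shows up may also depend on how the column is inspected (as well as the pandas version); a small illustrative sketch:

```python
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(["2025-12-01", "2025-12-02"])})

# An all-midnight datetime64 column prints date-only in the frame repr...
print(df)
#         time
# 0 2025-12-01
# 1 2025-12-02

# ...but a single element is still a Timestamp with a midnight time component.
print(df["time"].iloc[0])  # 2025-12-01 00:00:00
```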