diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml index c7d487f2..fc467976 100644 --- a/.github/workflows/python-package.yml +++ b/.github/workflows/python-package.yml @@ -17,6 +17,10 @@ jobs: matrix: os: [ubuntu-latest, windows-latest] python-version: [3.8, 3.9, '3.10', 3.11, 3.12] + exclude: + - os: ubuntu-latest + python-version: 3.8 + steps: - uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 @@ -36,7 +40,5 @@ jobs: flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics - name: Test with pytest and report coverage run: | - cd tests - coverage run -m pytest + coverage run -m pytest tests/ coverage report -m - cd .. diff --git a/NEWS.md b/NEWS.md new file mode 100644 index 00000000..2efdc76c --- /dev/null +++ b/NEWS.md @@ -0,0 +1,7 @@ +**11/24/2025:** `dataretrieval` is pleased to offer a new module, `waterdata`, which gives users access to USGS's modernized [Water Data APIs](https://api.waterdata.usgs.gov/). The Water Data API endpoints include daily values, instantaneous values, field measurements (modernized groundwater levels service), time series metadata, and discrete water quality data from the Samples database. Though there will be a period of overlap, the functions within `waterdata` will eventually replace the `nwis` module, which currently provides access to the legacy [NWIS Water Services](https://waterservices.usgs.gov/). More example workflows and functions coming soon. Check `help(waterdata)` for more information. + +**09/03/2024:** The groundwater levels service has switched endpoints, and `dataretrieval` was updated accordingly in [`v1.0.10`](https://github.com/DOI-USGS/dataretrieval-python/releases/tag/v1.0.10). Older versions using the discontinued endpoint will return 503 errors for `nwis.get_gwlevels` or the `service='gwlevels'` argument. Visit [Water Data For the Nation](https://waterdata.usgs.gov/blog/wdfn-waterservices-2024/) for more information. + +**03/01/2024:** USGS data availability and format have changed on Water Quality Portal (WQP). Since March 2024, data obtained from WQP legacy profiles will not include new USGS data or recent updates to existing data. All USGS data (up to and beyond March 2024) are available using the new WQP beta services. You can access the beta services by setting `legacy=False` in the functions in the `wqp` module. + +To view the status of changes in data availability and code functionality, visit: https://doi-usgs.github.io/dataRetrieval/articles/Status.html \ No newline at end of file diff --git a/README.md b/README.md index f8c14a36..3821c478 100644 --- a/README.md +++ b/README.md @@ -4,123 +4,263 @@ ![Conda Version](https://img.shields.io/conda/v/conda-forge/dataretrieval) ![Downloads](https://static.pepy.tech/badge/dataretrieval) -:warning: USGS data availability and format have changed on Water Quality Portal (WQP). Since March 2024, data obtained from WQP legacy profiles will not include new USGS data or recent updates to existing data. All USGS data (up to and beyond March 2024) are available using the new WQP beta services. You can access the beta services by setting `legacy=False` in the functions in the `wqp` module. 
+## Latest Announcements -To view the status of changes in data availability and code functionality, visit: https://doi-usgs.github.io/dataRetrieval/articles/Status.html +:mega: **11/24/2025:** `dataretrieval` now features the new `waterdata` module, +which provides access to USGS's modernized [Water Data +APIs](https://api.waterdata.usgs.gov/). The Water Data API endpoints include +daily values, instantaneous values, field measurements, time series metadata, +and discrete water quality data from the Samples database. This new module will +eventually replace the `nwis` module, which provides access to the legacy [NWIS +Water Services](https://waterservices.usgs.gov/). -:mega: **09/03/2024:** The groundwater levels service has switched endpoints, and `dataretrieval` was updated accordingly in [`v1.0.10`](https://github.com/DOI-USGS/dataretrieval-python/releases/tag/v1.0.10). Older versions using the discontinued endpoint will return 503 errors for `nwis.get_gwlevels` or the `service='gwlevels'` argument. Visit [Water Data For the Nation](https://waterdata.usgs.gov/blog/wdfn-waterservices-2024/) for more information. +**Important:** Users of the Water Data APIs are strongly encouraged to obtain an +API key for higher rate limits and greater access to USGS data. [Register for +an API key](https://api.waterdata.usgs.gov/signup/) and set it as an +environment variable: + +```python +import os +os.environ["API_USGS_PAT"] = "your_api_key_here" +``` + +Check out the [NEWS](NEWS.md) file for all updates and announcements. ## What is dataretrieval? -`dataretrieval` was created to simplify the process of loading hydrologic data into the Python environment. -Like the original R version [`dataRetrieval`](https://github.com/DOI-USGS/dataRetrieval), -it is designed to retrieve the major data types of U.S. Geological Survey (USGS) hydrology -data that are available on the Web, as well as data from the Water -Quality Portal (WQP), which currently houses water quality data from the -Environmental Protection Agency (EPA), U.S. Department of Agriculture -(USDA), and USGS. Direct USGS data is obtained from a service called the -National Water Information System (NWIS). -Note that the python version is not a direct port of the original: it attempts to reproduce the functionality of the R package, -though its organization and interface often differ. +`dataretrieval` simplifies the process of loading hydrologic data into Python. +Like the original R version +[`dataRetrieval`](https://github.com/DOI-USGS/dataRetrieval), it retrieves major +U.S. Geological Survey (USGS) hydrology data types available on the Web, as well +as data from the Water Quality Portal (WQP) and Network Linked Data Index +(NLDI). -If there's a hydrologic or environmental data portal that you'd like dataretrieval to -work with, raise it as an [issue](https://github.com/USGS-python/dataretrieval/issues). +## Installation -Here's an example using `dataretrieval` to retrieve data from the National Water Information System (NWIS). +Install dataretrieval using pip: -```python -# first import the functions for downloading data from NWIS -import dataretrieval.nwis as nwis +```bash +pip install dataretrieval +``` + +Or using conda: + +```bash +conda install -c conda-forge dataretrieval +``` + +## Usage Examples + +### Water Data API (Recommended - Modern USGS Data) -# specify the USGS site code for which we want data. -site = '03339000' +The `waterdata` module provides access to modern USGS Water Data APIs. 
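Before running the examples below, you can optionally confirm that your API key is visible to Python. This is a minimal sketch, assuming you registered a key and exported it as the `API_USGS_PAT` environment variable described above:

```python
import os

# Pre-flight check: API_USGS_PAT is the variable name from the registration
# instructions above; without it, requests fall back to the anonymous rate limit.
if not os.environ.get("API_USGS_PAT"):
    print("No API_USGS_PAT set; requests will use the lower anonymous rate limit.")
```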
-# get instantaneous values (iv) +The example below retrieves daily streamflow data for a specific monitoring +location for water year 2025, where a "/" between two dates in the "time" +input argument indicates a desired date range: -df = nwis.get_record(sites=site, service='iv', start='2017-12-31', end='2018-01-01') +```python +import dataretrieval.waterdata as waterdata + +# Get daily streamflow data (returns DataFrame and metadata) +df, metadata = waterdata.get_daily( -# get basic info about the site -df2 = nwis.get_record(sites=site, service='site') + monitoring_location_id='USGS-01646500', + parameter_code='00060', # Discharge + time='2024-10-01/2025-09-30' +) + +print(f"Retrieved {len(df)} records") +print(f"Site: {df['monitoring_location_id'].iloc[0]}") +print(f"Mean discharge: {df['value'].mean():.2f} {df['unit_of_measure'].iloc[0]}") ``` -Services available from NWIS include: -- instantaneous values (iv) -- daily values (dv) -- statistics (stat) -- site info (site) -- discharge peaks (peaks) -- discharge measurements (measurements) - -Water quality data are available from: -- [Samples](https://waterdata.usgs.gov/download-samples/#dataProfile=site) - Discrete USGS water quality data only -- [Water Quality Portal](https://www.waterqualitydata.us/) - Discrete water quality data from USGS and EPA. Older data are available in the legacy WQX version 2 format; all data are available in the beta WQX3.0 format. - -To access the full functionality available from NWIS web services, nwis.get record appends any additional kwargs into the REST request. For example, this function call: +Fetch daily discharge data for multiple sites from a start date to present +using the following code: + ```python -nwis.get_record(sites='03339000', service='dv', start='2017-12-31', parameterCd='00060') +df, metadata = waterdata.get_daily( + monitoring_location_id=["USGS-13018750", "USGS-13013650"], + parameter_code='00060', + time='2024-10-01/..' +) + +print(f"Retrieved {len(df)} records") ``` -...will download daily data with the parameter code 00060 (discharge). +The following example downloads location information for all monitoring +locations that are categorized as stream sites in the state of Maryland: -## Accessing the "Internal" NWIS -If you're connected to the USGS network, dataretrieval call pull from the internal (non-public) NWIS interface. -Most dataretrieval functions pass kwargs directly to NWIS's REST API, which provides simple access to internal data; simply specify "access='3'". -For example ```python -nwis.get_record(sites='05404147',service='iv', start='2021-01-01', end='2021-3-01', access='3') +# Get monitoring location information +locations, metadata = waterdata.get_monitoring_locations( + state_name='Maryland', + site_type_code='ST' # Stream sites +) + +print(f"Found {len(locations)} stream monitoring locations in Maryland") ``` +Visit the +[API Reference](https://doi-usgs.github.io/dataretrieval-python/reference/waterdata.html) +for more information and examples on available services and input parameters. -More services and documentation to come! +**NEW:** This new module implements +[logging](https://docs.python.org/3/howto/logging.html#logging-basic-tutorial), +which lets users view the URL requests sent to the USGS Water Data APIs +and the number of requests they have remaining each hour. These messages can +be helpful for troubleshooting and support. 
To enable logging in your Python +console or notebook: -## Quick start +```python +import logging +logging.basicConfig(level=logging.INFO) +``` +To log messages to a file, you can specify a filename in the +`basicConfig` call: -dataretrieval can be installed using pip: - - $ python3 -m pip install -U dataretrieval +```python +logging.basicConfig(filename='waterdata.log', level=logging.INFO) +``` -or conda: ### NWIS Legacy Services (Deprecated but still functional) - $ conda install -c conda-forge dataretrieval +The `nwis` module accesses legacy NWIS Water Services: -More examples of use are include in [`demos`](https://github.com/USGS-python/dataretrieval/tree/main/demos). +```python +import dataretrieval.nwis as nwis -## Issue tracker +# Get site information +info, metadata = nwis.get_info(sites='01646500') + +print(f"Site name: {info['station_nm'].iloc[0]}") + +# Get daily values +dv, metadata = nwis.get_dv( + sites='01646500', + start='2024-10-01', + end='2024-10-02', + parameterCd='00060', +) + +print(f"Retrieved {len(dv)} daily values") ``` -Please report any bugs and enhancement ideas using the dataretrieval issue -tracker: ### Water Quality Portal (WQP) - https://github.com/USGS-python/dataretrieval/issues +Access water quality data from multiple agencies: -Feel free to also ask questions on the tracker. +```python +import dataretrieval.wqp as wqp +# Find water quality monitoring sites +sites, metadata = wqp.what_sites( + statecode='US:55', # Wisconsin + siteType='Stream' +) -## Contributing +print(f"Found {len(sites)} stream monitoring sites in Wisconsin") + +# Get water quality results +results, metadata = wqp.get_results( + siteid='USGS-05427718', + characteristicName='Temperature, water' +) -Any help in testing, development, documentation and other tasks is welcome. -For more details, see the file [CONTRIBUTING.md](CONTRIBUTING.md). +print(f"Retrieved {len(results)} temperature measurements") ``` +### Network Linked Data Index (NLDI) -## Need help? +Discover and navigate hydrologic networks: -The Water Mission Area of the USGS supports the development and maintenance of `dataretrieval`. Any questions can be directed to the Computational Tools team at -comptools@usgs.gov. +```python +import dataretrieval.nldi as nldi -Resources are available primarily for maintenance and responding to user questions. -Priorities on the development of new features are determined by the `dataretrieval` development team. 
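+# NLDI navigation_mode codes: 'UM' upstream main stem, 'UT' upstream
+# tributaries, 'DM' downstream main stem, 'DD' downstream with diversions.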
+# Get watershed basin for a stream reach +basin = nldi.get_basin( + feature_source='comid', + feature_id='13293474' # NHD reach identifier +) +print(f"Basin contains {len(basin)} feature(s)") + +# Find upstream flowlines +flowlines = nldi.get_flowlines( + feature_source='comid', + feature_id='13293474', + navigation_mode='UT', # Upstream tributaries + distance=50 # km +) + +print(f"Found {len(flowlines)} upstream tributaries within 50km") +``` + +## Available Data Services + +### Modern USGS Water Data APIs (Recommended) +- **Daily values**: Daily statistical summaries (mean, min, max) +- **Field measurements**: Discrete measurements from field visits +- **Monitoring locations**: Site information and metadata +- **Time series metadata**: Information about available data parameters +- **Latest daily values**: Most recent daily statistical summary data +- **Latest instantaneous values**: Most recent high-frequency continuous data +- **Samples data**: Discrete USGS water quality data +- **Instantaneous values** (*COMING SOON*): High-frequency continuous data + +### Legacy NWIS Services (Deprecated) +- **Daily values (dv)**: Legacy daily statistical data +- **Instantaneous values (iv)**: Legacy continuous data +- **Site info (site)**: Basic site information +- **Statistics (stat)**: Statistical summaries +- **Discharge peaks (peaks)**: Annual peak discharge events +- **Discharge measurements (measurements)**: Direct flow measurements + +### Water Quality Portal +- **Results**: Water quality analytical results from USGS, EPA, and other agencies +- **Sites**: Monitoring location information +- **Organizations**: Data provider information +- **Projects**: Sampling project details + +### Network Linked Data Index (NLDI) +- **Basin delineation**: Watershed boundaries for any point +- **Flow navigation**: Upstream/downstream network traversal +- **Feature discovery**: Find monitoring sites, dams, and other features +- **Hydrologic connectivity**: Link data across the stream network + +## More Examples + +Explore additional examples in the +[`demos`](https://github.com/USGS-python/dataretrieval/tree/main/demos) +directory, including Jupyter notebooks demonstrating advanced usage patterns. + +## Getting Help + +- **Issue tracker**: Report bugs and request features at https://github.com/USGS-python/dataretrieval/issues +- **Documentation**: Full API documentation available in the source code docstrings + +## Contributing + +Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for +development guidelines. ## Acknowledgments -This material is partially based upon work supported by the National Science Foundation (NSF) under award 1931297. -Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. + +This material is partially based upon work supported by the National Science +Foundation (NSF) under award 1931297. Any opinions, findings, conclusions, or +recommendations expressed in this material are those of the authors and do not +necessarily reflect the views of the NSF. ## Disclaimer -This software is preliminary or provisional and is subject to revision. -It is being provided to meet the need for timely best science. -The software has not received final approval by the U.S. Geological Survey (USGS). -No warranty, expressed or implied, is made by the USGS or the U.S. 
Government as to the functionality of the software and related material nor shall the fact of release constitute any such warranty. -The software is provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the software. +This software is preliminary or provisional and is subject to revision. It is +being provided to meet the need for timely best science. The software has not +received final approval by the U.S. Geological Survey (USGS). No warranty, +expressed or implied, is made by the USGS or the U.S. Government as to the +functionality of the software and related material nor shall the fact of release +constitute any such warranty. The software is provided on the condition that +neither the USGS nor the U.S. Government shall be held liable for any damages +resulting from the authorized or unauthorized use of the software. ## Citation -Hodson, T.O., Hariharan, J.A., Black, S., and Horsburgh, J.S., 2023, dataretrieval (Python): a Python package for discovering -and retrieving water data available from U.S. federal hydrologic web services: -U.S. Geological Survey software release, -https://doi.org/10.5066/P94I5TX3. +Hodson, T.O., Hariharan, J.A., Black, S., and Horsburgh, J.S., 2023, +dataretrieval (Python): a Python package for discovering and retrieving water +data available from U.S. federal hydrologic web services: U.S. Geological Survey +software release, https://doi.org/10.5066/P94I5TX3. diff --git a/dataretrieval/__init__.py b/dataretrieval/__init__.py index 07374f47..5c35a3f2 100644 --- a/dataretrieval/__init__.py +++ b/dataretrieval/__init__.py @@ -1,6 +1,12 @@ from importlib.metadata import PackageNotFoundError, version +try: + __version__ = version("dataretrieval") +except PackageNotFoundError: + __version__ = "version-unknown" + from dataretrieval.nadp import * +from dataretrieval.nldi import * from dataretrieval.nwis import * from dataretrieval.samples import * from dataretrieval.streamstats import * @@ -8,8 +14,3 @@ from dataretrieval.waterdata import * from dataretrieval.waterwatch import * from dataretrieval.wqp import * - -try: - __version__ = version("dataretrieval") -except PackageNotFoundError: - __version__ = "version-unknown" diff --git a/dataretrieval/nwis.py b/dataretrieval/nwis.py index 1189b790..e4615d10 100644 --- a/dataretrieval/nwis.py +++ b/dataretrieval/nwis.py @@ -2,13 +2,6 @@ .. _National Water Information System (NWIS): https://waterdata.usgs.gov/nwis - -.. todo:: - - * Create a test to check whether functions pull multiple sites - * Work on multi-index capabilities. - * Check that all timezones are handled properly for each service. - """ import re @@ -19,7 +12,7 @@ import pandas as pd import requests -from dataretrieval.utils import BaseMetadata, format_datetime, to_str +from dataretrieval.utils import BaseMetadata, format_datetime from .utils import query @@ -28,6 +21,14 @@ except ImportError: gpd = None +# Issue deprecation warning upon import +warnings.warn( + "The 'nwis' services are deprecated and being decommissioned. 
" + "Please use the 'waterdata' module to access the new services.", + DeprecationWarning, + stacklevel=2 +) + WATERDATA_BASE_URL = "https://nwis.waterdata.usgs.gov/" WATERDATA_URL = WATERDATA_BASE_URL + "nwis/" WATERSERVICE_URL = "https://waterservices.usgs.gov/nwis/" diff --git a/dataretrieval/samples.py b/dataretrieval/samples.py index c55c1a84..a6df85b3 100644 --- a/dataretrieval/samples.py +++ b/dataretrieval/samples.py @@ -11,18 +11,17 @@ import pandas as pd import warnings -from dataretrieval.utils import BaseMetadata, to_str -from dataretrieval.waterdata import get_samples +from dataretrieval.utils import BaseMetadata if TYPE_CHECKING: from typing import Optional, Tuple, Union - from dataretrieval.waterdata import _SERVICES, _PROFILES + from dataretrieval.waterdata import SERVICES, PROFILES from pandas import DataFrame def get_usgs_samples( ssl_check: bool = True, - service: _SERVICES = "results", - profile: _PROFILES = "fullphyschem", + service: SERVICES = "results", + profile: PROFILES = "fullphyschem", activityMediaName: Optional[Union[str, list[str]]] = None, activityStartDateLower: Optional[str] = None, activityStartDateUpper: Optional[str] = None, @@ -212,7 +211,8 @@ def get_usgs_samples( DeprecationWarning, stacklevel=2, ) - + + from dataretrieval.waterdata import get_samples result = get_samples( ssl_check=ssl_check, service=service, diff --git a/dataretrieval/waterdata.py b/dataretrieval/waterdata.py deleted file mode 100644 index ceed581e..00000000 --- a/dataretrieval/waterdata.py +++ /dev/null @@ -1,350 +0,0 @@ -"""Functions for downloading data from the Water Data APIs, including the USGS Aquarius Samples database. - -See https://api.waterdata.usgs.gov/ for API reference. -""" - -from __future__ import annotations - -import json -from io import StringIO -from typing import TYPE_CHECKING, Literal, get_args - -import pandas as pd -import requests -from requests.models import PreparedRequest - -from dataretrieval.utils import BaseMetadata, to_str - -if TYPE_CHECKING: - from typing import Optional, Tuple, Union - - from pandas import DataFrame - - -_BASE_URL = "https://api.waterdata.usgs.gov/samples-data" - -_CODE_SERVICES = Literal[ - "characteristicgroup", - "characteristics", - "counties", - "countries", - "observedproperty", - "samplemedia", - "sitetype", - "states", -] - - -_SERVICES = Literal["activities", "locations", "organizations", "projects", "results"] - -_PROFILES = Literal[ - "actgroup", - "actmetric", - "basicbio", - "basicphyschem", - "count", - "fullbio", - "fullphyschem", - "labsampleprep", - "narrow", - "organization", - "project", - "projectmonitoringlocationweight", - "resultdetectionquantitationlimit", - "sampact", - "site", -] - -_PROFILE_LOOKUP = { - "activities": ["sampact", "actmetric", "actgroup", "count"], - "locations": ["site", "count"], - "organizations": ["organization", "count"], - "projects": ["project", "projectmonitoringlocationweight"], - "results": [ - "fullphyschem", - "basicphyschem", - "fullbio", - "basicbio", - "narrow", - "resultdetectionquantitationlimit", - "labsampleprep", - "count", - ], -} - - -def get_codes(code_service: _CODE_SERVICES) -> DataFrame: - """Return codes from a Samples code service. 
- - Parameters - ---------- - code_service : string - One of the following options: "states", "counties", "countries" - "sitetype", "samplemedia", "characteristicgroup", "characteristics", - or "observedproperty" - """ - valid_code_services = get_args(_CODE_SERVICES) - if code_service not in valid_code_services: - raise ValueError( - f"Invalid code service: '{code_service}'. " - f"Valid options are: {valid_code_services}." - ) - - url = f"{_BASE_URL}/codeservice/{code_service}?mimeType=application%2Fjson" - - response = requests.get(url) - - response.raise_for_status() - - data_dict = json.loads(response.text) - data_list = data_dict['data'] - - df = pd.DataFrame(data_list) - - return df - -def get_samples( - ssl_check: bool = True, - service: _SERVICES = "results", - profile: _PROFILES = "fullphyschem", - activityMediaName: Optional[Union[str, list[str]]] = None, - activityStartDateLower: Optional[str] = None, - activityStartDateUpper: Optional[str] = None, - activityTypeCode: Optional[Union[str, list[str]]] = None, - characteristicGroup: Optional[Union[str, list[str]]] = None, - characteristic: Optional[Union[str, list[str]]] = None, - characteristicUserSupplied: Optional[Union[str, list[str]]] = None, - boundingBox: Optional[list[float]] = None, - countryFips: Optional[Union[str, list[str]]] = None, - stateFips: Optional[Union[str, list[str]]] = None, - countyFips: Optional[Union[str, list[str]]] = None, - siteTypeCode: Optional[Union[str, list[str]]] = None, - siteTypeName: Optional[Union[str, list[str]]] = None, - usgsPCode: Optional[Union[str, list[str]]] = None, - hydrologicUnit: Optional[Union[str, list[str]]] = None, - monitoringLocationIdentifier: Optional[Union[str, list[str]]] = None, - organizationIdentifier: Optional[Union[str, list[str]]] = None, - pointLocationLatitude: Optional[float] = None, - pointLocationLongitude: Optional[float] = None, - pointLocationWithinMiles: Optional[float] = None, - projectIdentifier: Optional[Union[str, list[str]]] = None, - recordIdentifierUserSupplied: Optional[Union[str, list[str]]] = None, -) -> Tuple[DataFrame, BaseMetadata]: - """Search Samples database for USGS water quality data. - This is a wrapper function for the Samples database API. All potential - filters are provided as arguments to the function, but please do not - populate all possible filters; leave as many as feasible with their default - value (None). This is important because overcomplicated web service queries - can bog down the database's ability to return an applicable dataset before - it times out. - - The web GUI for the Samples database can be found here: - https://waterdata.usgs.gov/download-samples/#dataProfile=site - - If you would like more details on feasible query parameters (complete with - examples), please visit the Samples database swagger docs, here: - https://api.waterdata.usgs.gov/samples-data/docs#/ - - Parameters - ---------- - ssl_check : bool, optional - Check the SSL certificate. - service : string - One of the available Samples services: "results", "locations", "activities", - "projects", or "organizations". Defaults to "results". - profile : string - One of the available profiles associated with a service. 
Options for each - service are: - results - "fullphyschem", "basicphyschem", - "fullbio", "basicbio", "narrow", - "resultdetectionquantitationlimit", - "labsampleprep", "count" - locations - "site", "count" - activities - "sampact", "actmetric", - "actgroup", "count" - projects - "project", "projectmonitoringlocationweight" - organizations - "organization", "count" - activityMediaName : string or list of strings, optional - Name or code indicating environmental medium in which sample was taken. - Check the `activityMediaName_lookup()` function in this module for all - possible inputs. - Example: "Water". - activityStartDateLower : string, optional - The start date if using a date range. Takes the format YYYY-MM-DD. - The logic is inclusive, i.e. it will also return results that - match the date. If left as None, will pull all data on or before - activityStartDateUpper, if populated. - activityStartDateUpper : string, optional - The end date if using a date range. Takes the format YYYY-MM-DD. - The logic is inclusive, i.e. it will also return results that - match the date. If left as None, will pull all data after - activityStartDateLower up to the most recent available results. - activityTypeCode : string or list of strings, optional - Text code that describes type of field activity performed. - Example: "Sample-Routine, regular". - characteristicGroup : string or list of strings, optional - Characteristic group is a broad category of characteristics - describing one or more results. Check the `characteristicGroup_lookup()` - function in this module for all possible inputs. - Example: "Organics, PFAS" - characteristic : string or list of strings, optional - Characteristic is a specific category describing one or more results. - Check the `characteristic_lookup()` function in this module for all - possible inputs. - Example: "Suspended Sediment Discharge" - characteristicUserSupplied : string or list of strings, optional - A user supplied characteristic name describing one or more results. - boundingBox: list of four floats, optional - Filters on the the associated monitoring location's point location - by checking if it is located within the specified geographic area. - The logic is inclusive, i.e. it will include locations that overlap - with the edge of the bounding box. Values are separated by commas, - expressed in decimal degrees, NAD83, and longitudes west of Greenwich - are negative. - The format is a string consisting of: - - Western-most longitude - - Southern-most latitude - - Eastern-most longitude - - Northern-most longitude - Example: [-92.8,44.2,-88.9,46.0] - countryFips : string or list of strings, optional - Example: "US" (United States) - stateFips : string or list of strings, optional - Check the `stateFips_lookup()` function in this module for all - possible inputs. - Example: "US:15" (United States: Hawaii) - countyFips : string or list of strings, optional - Check the `countyFips_lookup()` function in this module for all - possible inputs. - Example: "US:15:001" (United States: Hawaii, Hawaii County) - siteTypeCode : string or list of strings, optional - An abbreviation for a certain site type. Check the `siteType_lookup()` - function in this module for all possible inputs. - Example: "GW" (Groundwater site) - siteTypeName : string or list of strings, optional - A full name for a certain site type. Check the `siteType_lookup()` - function in this module for all possible inputs. 
- Example: "Well" - usgsPCode : string or list of strings, optional - 5-digit number used in the US Geological Survey computerized - data system, National Water Information System (NWIS), to - uniquely identify a specific constituent. Check the - `characteristic_lookup()` function in this module for all possible - inputs. - Example: "00060" (Discharge, cubic feet per second) - hydrologicUnit : string or list of strings, optional - Max 12-digit number used to describe a hydrologic unit. - Example: "070900020502" - monitoringLocationIdentifier : string or list of strings, optional - A monitoring location identifier has two parts: the agency code - and the location number, separated by a dash (-). - Example: "USGS-040851385" - organizationIdentifier : string or list of strings, optional - Designator used to uniquely identify a specific organization. - Currently only accepting the organization "USGS". - pointLocationLatitude : float, optional - Latitude for a point/radius query (decimal degrees). Must be used - with pointLocationLongitude and pointLocationWithinMiles. - pointLocationLongitude : float, optional - Longitude for a point/radius query (decimal degrees). Must be used - with pointLocationLatitude and pointLocationWithinMiles. - pointLocationWithinMiles : float, optional - Radius for a point/radius query. Must be used with - pointLocationLatitude and pointLocationLongitude - projectIdentifier : string or list of strings, optional - Designator used to uniquely identify a data collection project. Project - identifiers are specific to an organization (e.g. USGS). - Example: "ZH003QW03" - recordIdentifierUserSupplied : string or list of strings, optional - Internal AQS record identifier that returns 1 entry. Only available - for the "results" service. - - Returns - ------- - df : ``pandas.DataFrame`` - Formatted data returned from the API query. - md : :obj:`dataretrieval.utils.Metadata` - Custom ``dataretrieval`` metadata object pertaining to the query. - - Examples - -------- - .. code:: - - >>> # Get PFAS results within a bounding box - >>> df, md = dataretrieval.waterdata.get_samples( - ... boundingBox=[-90.2,42.6,-88.7,43.2], - ... characteristicGroup="Organics, PFAS" - ... ) - - >>> # Get all activities for the Commonwealth of Virginia over a date range - >>> df, md = dataretrieval.waterdata.get_samples( - ... service="activities", - ... profile="sampact", - ... activityStartDateLower="2023-10-01", - ... activityStartDateUpper="2024-01-01", - ... stateFips="US:51") - - >>> # Get all pH samples for two sites in Utah - >>> df, md = dataretrieval.waterdata.get_samples( - ... monitoringLocationIdentifier=['USGS-393147111462301', 'USGS-393343111454101'], - ... usgsPCode='00400') - - """ - - _check_profiles(service, profile) - - params = { - k: v for k, v in locals().items() - if k not in ["ssl_check", "service", "profile"] - and v is not None - } - - - params.update({"mimeType": "text/csv"}) - - if "boundingBox" in params: - params["boundingBox"] = to_str(params["boundingBox"]) - - url = f"{_BASE_URL}/{service}/{profile}" - - req = PreparedRequest() - req.prepare_url(url, params=params) - print(f"Request: {req.url}") - - response = requests.get(url, params=params, verify=ssl_check) - - response.raise_for_status() - - df = pd.read_csv(StringIO(response.text), delimiter=",") - - return df, BaseMetadata(response) - -def _check_profiles( - service: _SERVICES, - profile: _PROFILES, -) -> None: - """Check whether a service profile is valid. 
- - Parameters - ---------- - service : string - One of the service names from the "services" list. - profile : string - One of the profile names from "results_profiles", - "locations_profiles", "activities_profiles", - "projects_profiles" or "organizations_profiles". - """ - valid_services = get_args(_SERVICES) - if service not in valid_services: - raise ValueError( - f"Invalid service: '{service}'. " - f"Valid options are: {valid_services}." - ) - - valid_profiles = _PROFILE_LOOKUP[service] - if profile not in valid_profiles: - raise ValueError( - f"Invalid profile: '{profile}' for service '{service}'. " - f"Valid options are: {valid_profiles}." - ) - diff --git a/dataretrieval/waterdata/__init__.py b/dataretrieval/waterdata/__init__.py new file mode 100644 index 00000000..7f68bfd6 --- /dev/null +++ b/dataretrieval/waterdata/__init__.py @@ -0,0 +1,45 @@ +""" +Water Data API module for accessing USGS water data services. + +This module provides functions for downloading data from the Water Data APIs, +including the USGS Aquarius Samples database. + +See https://api.waterdata.usgs.gov/ for API reference. +""" + +from __future__ import annotations + +# Public API exports +from .api import ( + _check_profiles, + get_codes, + get_daily, + get_field_measurements, + get_latest_continuous, + get_latest_daily, + get_monitoring_locations, + get_samples, + get_time_series_metadata, +) +from .types import ( + CODE_SERVICES, + PROFILE_LOOKUP, + PROFILES, + SERVICES, +) + +__all__ = [ + "get_codes", + "get_daily", + "get_field_measurements", + "get_latest_continuous", + "get_latest_daily", + "get_monitoring_locations", + "get_samples", + "get_time_series_metadata", + "_check_profiles", + "CODE_SERVICES", + "SERVICES", + "PROFILES", + "PROFILE_LOOKUP", +] diff --git a/dataretrieval/waterdata/api.py b/dataretrieval/waterdata/api.py new file mode 100644 index 00000000..7e17f254 --- /dev/null +++ b/dataretrieval/waterdata/api.py @@ -0,0 +1,1506 @@ +"""Functions for downloading data from the Water Data APIs, including the USGS +Aquarius Samples database. + +See https://api.waterdata.usgs.gov/ for API reference. 
+""" + +import json +import logging +from io import StringIO +from typing import List, Optional, Tuple, Union, get_args + +import pandas as pd +import requests +from requests.models import PreparedRequest + +from dataretrieval.utils import BaseMetadata, to_str +from dataretrieval.waterdata.types import ( + CODE_SERVICES, + PROFILE_LOOKUP, + PROFILES, + SERVICES, +) +from dataretrieval.waterdata.utils import SAMPLES_URL, get_ogc_data + +# Set up logger for this module +logger = logging.getLogger(__name__) + + +def get_daily( + monitoring_location_id: Optional[Union[str, List[str]]] = None, + parameter_code: Optional[Union[str, List[str]]] = None, + statistic_id: Optional[Union[str, List[str]]] = None, + properties: Optional[List[str]] = None, + time_series_id: Optional[Union[str, List[str]]] = None, + daily_id: Optional[Union[str, List[str]]] = None, + approval_status: Optional[Union[str, List[str]]] = None, + unit_of_measure: Optional[Union[str, List[str]]] = None, + qualifier: Optional[Union[str, List[str]]] = None, + value: Optional[Union[str, List[str]]] = None, + last_modified: Optional[str] = None, + skip_geometry: Optional[bool] = None, + time: Optional[Union[str, List[str]]] = None, + bbox: Optional[List[float]] = None, + limit: Optional[int] = None, + convert_type: bool = True, +) -> Tuple[pd.DataFrame, BaseMetadata]: + """Daily data provide one data value to represent water conditions for the + day. + + Throughout much of the history of the USGS, the primary water data available + was daily data collected manually at the monitoring location once each day. + With improved availability of computer storage and automated transmission of + data, the daily data published today are generally a statistical summary or + metric of the continuous data collected each day, such as the daily mean, + minimum, or maximum value. Daily data are automatically calculated from the + continuous data of the same parameter code and are described by parameter + code and a statistic code. These data have also been referred to as “daily + values” or “DV”. + + Parameters + ---------- + monitoring_location_id : string or list of strings, optional + A unique identifier representing a single monitoring location. This + corresponds to the id field in the monitoring-locations endpoint. + Monitoring location IDs are created by combining the agency code of + the agency responsible for the monitoring location (e.g. USGS) with + the ID number of the monitoring location (e.g. 02238500), separated + by a hyphen (e.g. USGS-02238500). + parameter_code : string or list of strings, optional + Parameter codes are 5-digit codes used to identify the constituent + measured and the units of measure. A complete list of parameter + codes and associated groupings can be found at + https://help.waterdata.usgs.gov/codes-and-parameters/parameters. + statistic_id : string or list of strings, optional + A code corresponding to the statistic an observation represents. + Example codes include 00001 (max), 00002 (min), and 00003 (mean). + A complete list of codes and their descriptions can be found at + https://help.waterdata.usgs.gov/code/stat_cd_nm_query?stat_nm_cd=%25&fmt=html. + properties : string or list of strings, optional + A vector of requested columns to be returned from the query. 
+ Available options are: geometry, id, time_series_id, + monitoring_location_id, parameter_code, statistic_id, time, value, + unit_of_measure, approval_status, qualifier, last_modified + time_series_id : string or list of strings, optional + A unique identifier representing a single time series. This + corresponds to the id field in the time-series-metadata endpoint. + daily_id : string or list of strings, optional + A universally unique identifier (UUID) representing a single version of + a record. It is not stable over time. Every time the record is refreshed + in our database (which may happen as part of normal operations and does + not imply any change to the data itself) a new ID will be generated. To + uniquely identify a single observation over time, compare the time and + time_series_id fields; each time series will only have a single + observation at a given time. + approval_status : string or list of strings, optional + Some of the data that you have obtained from this U.S. Geological Survey + database may not have received Director's approval. Any such data values + are qualified as provisional and are subject to revision. Provisional + data are released on the condition that neither the USGS nor the United + States Government may be held liable for any damages resulting from its + use. This field reflects the approval status of each record, and is either + "Approved", meaning processing review has been completed and the data is + approved for publication, or "Provisional" and subject to revision. For + more information about provisional data, go to + [https://waterdata.usgs.gov/provisional-data-statement/](https://waterdata.usgs.gov/provisional-data-statement/). + unit_of_measure : string or list of strings, optional + A human-readable description of the units of measurement associated + with an observation. + qualifier : string or list of strings, optional + This field indicates any qualifiers associated with an observation, for + instance if a sensor may have been impacted by ice or if values were + estimated. + value : string or list of strings, optional + The value of the observation. Values are transmitted as strings in + the JSON response format in order to preserve precision. + last_modified : string, optional + The last time a record was refreshed in our database. This may happen + due to regular operational processes and does not necessarily indicate + anything about the measurement has changed. You can query this field + using date-times or intervals, adhering to RFC 3339, or using ISO 8601 + duration objects. Intervals may be bounded or half-bounded (double-dots + at start or end). + Examples: + - A date-time: "2018-02-12T23:20:50Z" + - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z" + - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or + "../2018-03-18T12:31:12Z" + - Duration objects: "P1M" for data from the past month or "PT36H" + for the last 36 hours + Only features that have a last_modified that intersects the value of + datetime are selected. + skip_geometry : boolean, optional + This option can be used to skip response geometries for each feature. + The returned object will be a data frame with no spatial information. + Note that the USGS Water Data APIs use camelCase "skipGeometry" in + CQL2 queries. + time : string, optional + The date an observation represents. You can query this field using + date-times or intervals, adhering to RFC 3339, or using ISO 8601 + duration objects. 
Intervals may be bounded or half-bounded (double-dots + at start or end). Only features that have a time that intersects the + value of datetime are selected. If a feature has multiple temporal + properties, it is the decision of the server whether only a single + temporal property is used to determine the extent or all relevant + temporal properties. + Examples: + - A date-time: "2018-02-12T23:20:50Z" + - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z" + - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or + "../2018-03-18T12:31:12Z" + - Duration objects: "P1M" for data from the past month or "PT36H" + for the last 36 hours + bbox : list of numbers, optional + Only features that have a geometry that intersects the bounding box are + selected. The bounding box is provided as four or six numbers, + depending on whether the coordinate reference system includes a vertical + axis (height or depth). Coordinates are assumed to be in crs 4326. The + expected format is a numeric list structured: [xmin, ymin, xmax, ymax]. + Another way to think of it is [Western-most longitude, Southern-most + latitude, Eastern-most longitude, Northern-most latitude]. + limit : numeric, optional + The optional limit parameter is used to control the subset of the + selected features that should be returned in each page. The maximum + allowable limit is 10000. It may be beneficial to set this number lower + if your internet connection is spotty. The default (None) will set the + limit to the maximum allowable limit for the service. + convert_type : boolean, optional + If True, the function will convert time columns to datetimes and the + qualifier column to strings. + + Returns + ------- + df : ``pandas.DataFrame`` or ``geopandas.GeoDataFrame`` + Formatted data returned from the API query. + md : :obj:`dataretrieval.utils.Metadata` + A custom metadata object + + Examples + -------- + .. code:: + + >>> # Get daily flow data from a single site + >>> # over a yearlong period + >>> df, md = dataretrieval.waterdata.get_daily( + ... monitoring_location_id="USGS-02238500", + ... parameter_code="00060", + ... time="2021-01-01T00:00:00Z/2022-01-01T00:00:00Z", + ... ) + + >>> # Get approved daily flow data from multiple sites + >>> df, md = dataretrieval.waterdata.get_daily( + ... monitoring_location_id = ["USGS-05114000", "USGS-09423350"], + ... approval_status = "Approved", + ... time = "2024-01-01/.." + ... ) 
+ """ + service = "daily" + output_id = "daily_id" + + # Build argument dictionary, omitting None values + args = { + k: v + for k, v in locals().items() + if k not in {"service", "output_id"} and v is not None + } + + return get_ogc_data(args, output_id, service) + + +def get_monitoring_locations( + monitoring_location_id: Optional[List[str]] = None, + agency_code: Optional[List[str]] = None, + agency_name: Optional[List[str]] = None, + monitoring_location_number: Optional[List[str]] = None, + monitoring_location_name: Optional[List[str]] = None, + district_code: Optional[List[str]] = None, + country_code: Optional[List[str]] = None, + country_name: Optional[List[str]] = None, + state_code: Optional[List[str]] = None, + state_name: Optional[List[str]] = None, + county_code: Optional[List[str]] = None, + county_name: Optional[List[str]] = None, + minor_civil_division_code: Optional[List[str]] = None, + site_type_code: Optional[List[str]] = None, + site_type: Optional[List[str]] = None, + hydrologic_unit_code: Optional[List[str]] = None, + basin_code: Optional[List[str]] = None, + altitude: Optional[List[str]] = None, + altitude_accuracy: Optional[List[str]] = None, + altitude_method_code: Optional[List[str]] = None, + altitude_method_name: Optional[List[str]] = None, + vertical_datum: Optional[List[str]] = None, + vertical_datum_name: Optional[List[str]] = None, + horizontal_positional_accuracy_code: Optional[List[str]] = None, + horizontal_positional_accuracy: Optional[List[str]] = None, + horizontal_position_method_code: Optional[List[str]] = None, + horizontal_position_method_name: Optional[List[str]] = None, + original_horizontal_datum: Optional[List[str]] = None, + original_horizontal_datum_name: Optional[List[str]] = None, + drainage_area: Optional[List[str]] = None, + contributing_drainage_area: Optional[List[str]] = None, + time_zone_abbreviation: Optional[List[str]] = None, + uses_daylight_savings: Optional[List[str]] = None, + construction_date: Optional[List[str]] = None, + aquifer_code: Optional[List[str]] = None, + national_aquifer_code: Optional[List[str]] = None, + aquifer_type_code: Optional[List[str]] = None, + well_constructed_depth: Optional[List[str]] = None, + hole_constructed_depth: Optional[List[str]] = None, + depth_source_code: Optional[List[str]] = None, + properties: Optional[List[str]] = None, + skip_geometry: Optional[bool] = None, + time: Optional[Union[str, List[str]]] = None, + bbox: Optional[List[float]] = None, + limit: Optional[int] = None, + convert_type: bool = True, +) -> Tuple[pd.DataFrame, BaseMetadata]: + """Location information is basic information about the monitoring location + including the name, identifier, agency responsible for data collection, and + the date the location was established. It also includes information about + the type of location, such as stream, lake, or groundwater, and geographic + information about the location, such as state, county, latitude and + longitude, and hydrologic unit code (HUC). + + Parameters + ---------- + monitoring_location_id : string or list of strings, optional + A unique identifier representing a single monitoring location. This + corresponds to the id field in the monitoring-locations endpoint. + Monitoring location IDs are created by combining the agency code of + the agency responsible for the monitoring location (e.g. USGS) with + the ID number of the monitoring location (e.g. 02238500), separated + by a hyphen (e.g. USGS-02238500). 
+ agency_code : string or list of strings, optional + The agency that is reporting the data. Agency codes are fixed values + assigned by the National Water Information System (NWIS). A list of + agency codes is available at + [this link](https://help.waterdata.usgs.gov/code/agency_cd_query?fmt=html). + agency_name : string or list of strings, optional + The name of the agency that is reporting the data. + monitoring_location_number : string or list of strings, optional + Each monitoring location in the USGS database has a unique 8- to + 15-digit identification number. Monitoring location numbers are + assigned based on [this logic](https://help.waterdata.usgs.gov/faq/sites/do-station-numbers-have-any-particular-meaning). + monitoring_location_name : string or list of strings, optional + This is the official name of the monitoring location in the database. + For well information this can be a district-assigned local number. + district_code : string or list of strings, optional + The Water Science Centers (WSCs) across the United States use the FIPS + state code as the district code. In some cases, monitoring locations and + samples may be managed by a water science center that is adjacent to the + state in which the monitoring location actually resides. For example, a + monitoring location may have a district code of 30, which translates to + Montana, but the state code could be 56 for Wyoming because that is where + the monitoring location is actually located. + country_code : string or list of strings, optional + The code for the country in which the monitoring location is located. + country_name : string or list of strings, optional + The name of the country in which the monitoring location is located. + state_code : string or list of strings, optional + State code. A two-digit ANSI code (formerly FIPS code) as defined by + the American National Standards Institute, to define States and + equivalents. A three-digit ANSI code is used to define counties and + county equivalents. A [lookup table](https://www.census.gov/library/reference/code-lists/ansi.html#states) + is available. The only countries with + political subdivisions other than the US are Mexico and Canada. The Mexican + states have US state codes ranging from 81-86 and Canadian provinces have + state codes ranging from 90-98. + state_name : string or list of strings, optional + The name of the state or state equivalent in which the monitoring location + is located. + county_code : string or list of strings, optional + The code for the county or county equivalent (parish, borough, etc.) in which + the monitoring location is located. A [list of codes](https://help.waterdata.usgs.gov/code/county_query?fmt=html) + is available. + county_name : string or list of strings, optional + The name of the county or county equivalent (parish, borough, etc.) in which + the monitoring location is located. A [list of codes](https://help.waterdata.usgs.gov/code/county_query?fmt=html) + is available. + minor_civil_division_code : string or list of strings, optional + Codes for primary governmental or administrative divisions of the county or + county equivalent in which the monitoring location is located. + site_type_code : string or list of strings, optional + A code describing the hydrologic setting of the monitoring location. A [list of + codes](https://help.waterdata.usgs.gov/code/site_tp_query?fmt=html) is available. 
+ Example: "ST" (Stream site) + site_type : string or list of strings, optional + A description of the hydrologic setting of the monitoring location. A [list of + codes](https://help.waterdata.usgs.gov/code/site_tp_query?fmt=html) is available. + hydrologic_unit_code : string or list of strings, optional + The United States is divided and sub-divided into successively smaller + hydrologic units, which are classified into four levels: regions, + sub-regions, accounting units, and cataloging units. The hydrologic + units are arranged within each other, from the smallest (cataloging + units) to the largest (regions). Each hydrologic unit is identified by a + unique hydrologic unit code (HUC) consisting of two to eight digits + based on the four levels of classification in the hydrologic unit + system. + basin_code : string or list of strings, optional + The Basin Code or "drainage basin code" is a two-digit code that further + subdivides the 8-digit hydrologic-unit code. The drainage basin code is + defined by the USGS State Office where the monitoring location is + located. + altitude : string or list of strings, optional + Altitude of the monitoring location referenced to the specified Vertical + Datum. + altitude_accuracy : string or list of strings, optional + Accuracy of the altitude, in feet. An accuracy of +/- 0.1 foot would be + entered as “.1”. Many altitudes are interpolated from the contours on + topographic maps; accuracies determined in this way are generally + entered as one-half of the contour interval. + altitude_method_code : string or list of strings, optional + Codes representing the method used to measure altitude. A [list of codes](https://help.waterdata.usgs.gov/code/alt_meth_cd_query?fmt=html) + is available. + altitude_method_name : string or list of strings, optional + The name of the method used to measure altitude. A [list of codes](https://help.waterdata.usgs.gov/code/alt_meth_cd_query?fmt=html) + is available. + vertical_datum : string or list of strings, optional + The datum used to determine altitude and vertical position at the + monitoring location. A [list of codes](https://help.waterdata.usgs.gov/code/alt_datum_cd_query?fmt=html) + is available. + vertical_datum_name : string or list of strings, optional + The name of the datum used to determine altitude and vertical position + at the monitoring location. A [list of codes](https://help.waterdata.usgs.gov/code/alt_datum_cd_query?fmt=html) + is available. + horizontal_positional_accuracy_code : string or list of strings, optional + Indicates the accuracy of the latitude longitude values. A [list of codes](https://help.waterdata.usgs.gov/code/coord_acy_cd_query?fmt=html) + is available. + horizontal_positional_accuracy : string or list of strings, optional + Indicates the accuracy of the latitude longitude values. A [list of codes](https://help.waterdata.usgs.gov/code/coord_acy_cd_query?fmt=html) + is available. + horizontal_position_method_code : string or list of strings, optional + Indicates the method used to determine latitude longitude values. A + [list of codes](https://help.waterdata.usgs.gov/code/coord_meth_cd_query?fmt=html) + is available. + horizontal_position_method_name : string or list of strings, optional + Indicates the method used to determine latitude longitude values. A + [list of codes](https://help.waterdata.usgs.gov/code/coord_meth_cd_query?fmt=html) + is available. + original_horizontal_datum : string or list of strings, optional + Coordinates are published in EPSG:4326 / WGS84 / World Geodetic System + 1984. 
This field indicates the original datum used to determine + coordinates before they were converted. A [list of codes](https://help.waterdata.usgs.gov/code/coord_datum_cd_query?fmt=html) + is available. + original_horizontal_datum_name : string or list of strings, optional + Coordinates are published in EPSG:4326 / WGS84 / World Geodetic System + 1984. This field indicates the original datum used to determine coordinates + before they were converted. A [list of codes](https://help.waterdata.usgs.gov/code/coord_datum_cd_query?fmt=html) + is available. + drainage_area : string or list of strings, optional + The area enclosed by a topographic divide from which direct surface runoff + from precipitation normally drains by gravity into the stream above that + point. + contributing_drainage_area : string or list of strings, optional + The contributing drainage area of a lake, stream, wetland, or estuary + monitoring location, in square miles. This item should be present only + if the contributing area is different from the total drainage area. This + situation can occur when part of the drainage area consists of very + porous soil or depressions that either allow all runoff to enter the + groundwater or trap the water in ponds so that rainfall does not + contribute to runoff. A transbasin diversion can also affect the total + drainage area. + time_zone_abbreviation : string or list of strings, optional + A short code describing the time zone used by a monitoring location. + uses_daylight_savings : string or list of strings, optional + A flag indicating whether or not a monitoring location uses daylight savings. + construction_date : string or list of strings, optional + Date the well was completed. + aquifer_code : string or list of strings, optional + Local aquifers in the USGS water resources database are identified by a + geohydrologic unit code (a three-digit number related to the age of the + formation, followed by a 4- or 5-character abbreviation for the geologic + unit or aquifer name). Additional information is available + [at this link](https://help.waterdata.usgs.gov/faq/groundwater/local-aquifer-description). + national_aquifer_code : string or list of strings, optional + National aquifers are the principal aquifers or aquifer systems in the United + States, defined as regionally extensive aquifers or aquifer systems that have + the potential to be used as a source of potable water. Not all groundwater + monitoring locations can be associated with a National Aquifer. Such + monitoring locations will not be retrieved using this search criterion. A [list + of National aquifer codes and names](https://help.waterdata.usgs.gov/code/nat_aqfr_query?fmt=html) + is available. + aquifer_type_code : string or list of strings, optional + Groundwater occurs in aquifers under two different conditions. Where water + only partly fills an aquifer, the upper surface is free to rise and decline. + These aquifers are referred to as unconfined (or water-table) aquifers. Where + water completely fills an aquifer that is overlain by a confining bed, the + aquifer is referred to as a confined (or artesian) aquifer. When a confined + aquifer is penetrated by a well, the water level in the well will rise above + the top of the aquifer (but not necessarily above land surface). Additional + information is available [at this link](https://help.waterdata.usgs.gov/faq/groundwater/local-aquifer-description). 
+    well_constructed_depth : string or list of strings, optional
+        The depth of the finished well, in feet below land surface datum. Note: not
+        all groundwater monitoring locations have information on Well Depth; such
+        monitoring locations will not be retrieved using this search criterion.
+    hole_constructed_depth : string or list of strings, optional
+        The total depth to which the hole is drilled, in feet below land surface datum.
+        Note: not all groundwater monitoring locations have information on Hole Depth;
+        such monitoring locations will not be retrieved using this search criterion.
+    depth_source_code : string or list of strings, optional
+        A code indicating the source of water-level data. A [list of codes](https://help.waterdata.usgs.gov/code/water_level_src_cd_query?fmt=html)
+        is available.
+    properties : string or list of strings, optional
+        A vector of requested columns to be returned from the query. Available
+        options are: geometry, id, agency_code, agency_name,
+        monitoring_location_number, monitoring_location_name, district_code,
+        country_code, country_name, state_code, state_name, county_code,
+        county_name, minor_civil_division_code, site_type_code, site_type,
+        hydrologic_unit_code, basin_code, altitude, altitude_accuracy,
+        altitude_method_code, altitude_method_name, vertical_datum,
+        vertical_datum_name, horizontal_positional_accuracy_code,
+        horizontal_positional_accuracy, horizontal_position_method_code,
+        horizontal_position_method_name, original_horizontal_datum,
+        original_horizontal_datum_name, drainage_area,
+        contributing_drainage_area, time_zone_abbreviation,
+        uses_daylight_savings, construction_date, aquifer_code,
+        national_aquifer_code, aquifer_type_code, well_constructed_depth,
+        hole_constructed_depth, depth_source_code.
+    bbox : list of numbers, optional
+        Only features that have a geometry that intersects the bounding box are
+        selected. The bounding box is provided as four or six numbers,
+        depending on whether the coordinate reference system includes a vertical
+        axis (height or depth). Coordinates are assumed to be in CRS 4326. The
+        expected format is a list structured: [xmin, ymin, xmax, ymax].
+        Another way to think of it is [Western-most longitude, Southern-most
+        latitude, Eastern-most longitude, Northern-most latitude].
+    limit : numeric, optional
+        The optional limit parameter is used to control the subset of the
+        selected features that should be returned in each page. The maximum
+        allowable limit is 10000. It may be beneficial to set this number lower
+        if your internet connection is spotty. The default (None) will set the
+        limit to the maximum allowable limit for the service.
+    skip_geometry : boolean, optional
+        This option can be used to skip response geometries for each feature.
+        The returned object will be a data frame with no spatial information.
+        Note that the USGS Water Data APIs use camelCase "skipGeometry" in
+        CQL2 queries.
+
+    Returns
+    -------
+    df : ``pandas.DataFrame`` or ``geopandas.GeoDataFrame``
+        Formatted data returned from the API query.
+    md: :obj:`dataretrieval.utils.Metadata`
+        A custom metadata object
+
+    Examples
+    --------
+    .. code::
+
+        >>> # Get monitoring locations within a bounding box
+        >>> # and leave out geometry
+        >>> df, md = dataretrieval.waterdata.get_monitoring_locations(
+        ...     bbox=[-90.2, 42.6, -88.7, 43.2], skip_geometry=True
+        ...
) + + >>> # Get monitoring location info for specific sites + >>> # and only specific properties + >>> df, md = dataretrieval.waterdata.get_monitoring_locations( + ... monitoring_location_id=["USGS-05114000", "USGS-09423350"], + ... properties=["monitoring_location_id", "state_name", "country_name"], + ... ) + """ + service = "monitoring-locations" + output_id = "monitoring_location_id" + + # Build argument dictionary, omitting None values + args = { + k: v + for k, v in locals().items() + if k not in {"service", "output_id"} and v is not None + } + + return get_ogc_data(args, output_id, service) + + +def get_time_series_metadata( + monitoring_location_id: Optional[Union[str, List[str]]] = None, + parameter_code: Optional[Union[str, List[str]]] = None, + parameter_name: Optional[Union[str, List[str]]] = None, + properties: Optional[Union[str, List[str]]] = None, + statistic_id: Optional[Union[str, List[str]]] = None, + last_modified: Optional[Union[str, List[str]]] = None, + begin: Optional[Union[str, List[str]]] = None, + end: Optional[Union[str, List[str]]] = None, + unit_of_measure: Optional[Union[str, List[str]]] = None, + computation_period_identifier: Optional[Union[str, List[str]]] = None, + computation_identifier: Optional[Union[str, List[str]]] = None, + thresholds: Optional[int] = None, + sublocation_identifier: Optional[Union[str, List[str]]] = None, + primary: Optional[Union[str, List[str]]] = None, + parent_time_series_id: Optional[Union[str, List[str]]] = None, + time_series_id: Optional[Union[str, List[str]]] = None, + web_description: Optional[Union[str, List[str]]] = None, + skip_geometry: Optional[bool] = None, + time: Optional[Union[str, List[str]]] = None, + bbox: Optional[List[float]] = None, + limit: Optional[int] = None, + convert_type: bool = True, +) -> Tuple[pd.DataFrame, BaseMetadata]: + """Daily data and continuous measurements are grouped into time series, + which represent a collection of observations of a single parameter, + potentially aggregated using a standard statistic, at a single monitoring + location. This endpoint provides metadata about those time series, + including their operational thresholds, units of measurement, and when + the earliest and most recent observations in a time series occurred. + + Parameters + ---------- + monitoring_location_id : string or list of strings, optional + A unique identifier representing a single monitoring location. This + corresponds to the id field in the monitoring-locations endpoint. + Monitoring location IDs are created by combining the agency code of + the agency responsible for the monitoring location (e.g. USGS) with + the ID number of the monitoring location (e.g. 02238500), separated + by a hyphen (e.g. USGS-02238500). + parameter_code : string or list of strings, optional + Parameter codes are 5-digit codes used to identify the constituent + measured and the units of measure. A complete list of parameter + codes and associated groupings can be found at + https://help.waterdata.usgs.gov/codes-and-parameters/parameters. + parameter_name : string or list of strings, optional + A human-understandable name corresponding to parameter_code. + properties : string or list of strings, optional + A vector of requested columns to be returned from the query. 
+        Available options are: geometry, id, time_series_id,
+        monitoring_location_id, parameter_code, statistic_id, time, value,
+        unit_of_measure, approval_status, qualifier, last_modified
+    statistic_id : string or list of strings, optional
+        A code corresponding to the statistic an observation represents.
+        Example codes include 00001 (max), 00002 (min), and 00003 (mean).
+        A complete list of codes and their descriptions can be found at
+        https://help.waterdata.usgs.gov/code/stat_cd_nm_query?stat_nm_cd=%25&fmt=html.
+    last_modified : string, optional
+        The last time a record was refreshed in our database. This may happen
+        due to regular operational processes and does not necessarily indicate
+        that anything about the measurement has changed. You can query this field
+        using date-times or intervals, adhering to RFC 3339, or using ISO 8601
+        duration objects. Intervals may be bounded or half-bounded (double-dots
+        at start or end). Only features that have a last_modified that
+        intersects the value of datetime are selected.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or
+          "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H"
+          for the last 36 hours
+    begin : string or list of strings, optional
+        The datetime of the earliest observation in the time series. Together
+        with end, this field represents the period of record of a time series.
+        Note that some time series may have large gaps in their collection
+        record. This field is currently in the local time of the monitoring
+        location. We intend to update this in version v0 to use UTC with a time
+        zone. You can query this field using date-times or intervals, adhering
+        to RFC 3339, or using ISO 8601 duration objects. Intervals may be
+        bounded or half-bounded (double-dots at start or end). Only features
+        that have a begin that intersects the value of datetime are selected.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H" for the last 36 hours
+    end : string or list of strings, optional
+        The datetime of the most recent observation in the time series. Data returned by
+        this endpoint update at most once per day, and potentially less frequently than
+        that, and as such there may be more recent observations within a time series
+        than the time series end value reflects. Together with begin, this field
+        represents the period of record of a time series. It is additionally used to
+        determine whether a time series is "active". We intend to update this in
+        version v0 to use UTC with a time zone. You can query this field using date-times
+        or intervals, adhering to RFC 3339, or using ISO 8601 duration objects. Intervals
+        may be bounded or half-bounded (double-dots at start or end). Only
+        features that have an end that intersects the value of datetime are
+        selected.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or
+          "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H" for
+          the last 36 hours
+    unit_of_measure : string or list of strings, optional
+        A human-readable description of the units of measurement associated
+        with an observation.
+    computation_period_identifier : string or list of strings, optional
+        Indicates the period of data used for any statistical computations.
+    computation_identifier : string or list of strings, optional
+        Indicates whether the data from this time series represent a specific
+        statistical computation.
+    thresholds : numeric or list of numbers, optional
+        Thresholds represent known numeric limits for a time series, for example
+        the historic maximum value for a parameter or a level below which a
+        sensor is non-operative. These thresholds are sometimes used to
+        automatically determine if an observation is erroneous due to sensor
+        error, and therefore shouldn't be included in the time series.
+    sublocation_identifier : string or list of strings, optional
+    primary : string or list of strings, optional
+    parent_time_series_id : string or list of strings, optional
+    time_series_id : string or list of strings, optional
+        A unique identifier representing a single time series. This
+        corresponds to the id field in the time-series-metadata endpoint.
+    web_description : string or list of strings, optional
+        A description of what this time series represents, as used by WDFN and
+        other USGS data dissemination products.
+    skip_geometry : boolean, optional
+        This option can be used to skip response geometries for each feature.
+        The returned object will be a data frame with no spatial information.
+        Note that the USGS Water Data APIs use camelCase "skipGeometry" in
+        CQL2 queries.
+    bbox : list of numbers, optional
+        Only features that have a geometry that intersects the bounding box are
+        selected. The bounding box is provided as four or six numbers,
+        depending on whether the coordinate reference system includes a vertical
+        axis (height or depth). Coordinates are assumed to be in CRS 4326. The
+        expected format is a list structured: [xmin, ymin, xmax, ymax].
+        Another way to think of it is [Western-most longitude, Southern-most
+        latitude, Eastern-most longitude, Northern-most latitude].
+    limit : numeric, optional
+        The optional limit parameter is used to control the subset of the
+        selected features that should be returned in each page. The maximum
+        allowable limit is 10000. It may be beneficial to set this number lower
+        if your internet connection is spotty. The default (None) will set the
+        limit to the maximum allowable limit for the service.
+    convert_type : boolean, optional
+        If True, the function will convert time fields to dates/datetimes and
+        the qualifier field to strings.
+
+    Returns
+    -------
+    df : ``pandas.DataFrame`` or ``geopandas.GeoDataFrame``
+        Formatted data returned from the API query.
+    md: :obj:`dataretrieval.utils.Metadata`
+        A custom metadata object
+
+    Examples
+    --------
+    .. code::
+
+        >>> # Get time series metadata for a single site
+        >>> df, md = dataretrieval.waterdata.get_time_series_metadata(
+        ...     monitoring_location_id="USGS-02238500"
+        ... )
+
+        >>> # Get time series metadata for multiple sites whose
+        >>> # records begin after January 1, 1990.
+        >>> df, md = dataretrieval.waterdata.get_time_series_metadata(
+        ...     monitoring_location_id=["USGS-05114000", "USGS-09423350"],
+        ...     begin="1990-01-01/.."
+        ... )
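+
+        >>> # A further illustration (a hypothetical combination of the
+        >>> # documented filters, not from the original changeset): metadata
+        >>> # for the daily-mean discharge series at one site
+        >>> df, md = dataretrieval.waterdata.get_time_series_metadata(
+        ...     monitoring_location_id="USGS-02238500",
+        ...     parameter_code="00060",
+        ...     statistic_id="00003",
+        ... )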
+    """
+    service = "time-series-metadata"
+    output_id = "time_series_id"
+
+    # Build argument dictionary, omitting None values
+    args = {
+        k: v
+        for k, v in locals().items()
+        if k not in {"service", "output_id"} and v is not None
+    }
+
+    return get_ogc_data(args, output_id, service)
+
+
+def get_latest_continuous(
+    monitoring_location_id: Optional[Union[str, List[str]]] = None,
+    parameter_code: Optional[Union[str, List[str]]] = None,
+    statistic_id: Optional[Union[str, List[str]]] = None,
+    properties: Optional[Union[str, List[str]]] = None,
+    time_series_id: Optional[Union[str, List[str]]] = None,
+    latest_continuous_id: Optional[Union[str, List[str]]] = None,
+    approval_status: Optional[Union[str, List[str]]] = None,
+    unit_of_measure: Optional[Union[str, List[str]]] = None,
+    qualifier: Optional[Union[str, List[str]]] = None,
+    value: Optional[Union[str, List[str]]] = None,
+    last_modified: Optional[Union[str, List[str]]] = None,
+    skip_geometry: Optional[bool] = None,
+    time: Optional[Union[str, List[str]]] = None,
+    bbox: Optional[List[float]] = None,
+    limit: Optional[int] = None,
+    convert_type: bool = True,
+) -> Tuple[pd.DataFrame, BaseMetadata]:
+    """This endpoint provides the most recent observation for each time series
+    of continuous data. Continuous data are collected via automated sensors
+    installed at a monitoring location. They are collected at a high frequency
+    and often at a fixed 15-minute interval. Depending on the specific monitoring
+    location, the data may be transmitted automatically via telemetry and be
+    available on WDFN within minutes of collection, while at other times the
+    delivery of data may be delayed if the monitoring location does not have
+    the capacity to automatically transmit data. Continuous data are described
+    by parameter name and parameter code. These data might also be referred to
+    as "instantaneous values" or "IV".
+
+    Parameters
+    ----------
+    monitoring_location_id : string or list of strings, optional
+        A unique identifier representing a single monitoring location. This
+        corresponds to the id field in the monitoring-locations endpoint.
+        Monitoring location IDs are created by combining the agency code of the
+        agency responsible for the monitoring location (e.g. USGS) with the ID
+        number of the monitoring location (e.g. 02238500), separated by a hyphen
+        (e.g. USGS-02238500).
+    parameter_code : string or list of strings, optional
+        Parameter codes are 5-digit codes used to identify the constituent
+        measured and the units of measure. A complete list of parameter codes
+        and associated groupings can be found at
+        https://help.waterdata.usgs.gov/codes-and-parameters/parameters.
+    statistic_id : string or list of strings, optional
+        A code corresponding to the statistic an observation represents.
+        Example codes include 00001 (max), 00002 (min), and 00003 (mean).
+        A complete list of codes and their descriptions can be found at
+        https://help.waterdata.usgs.gov/code/stat_cd_nm_query?stat_nm_cd=%25&fmt=html.
+    properties : string or list of strings, optional
+        A vector of requested columns to be returned from the query. Available
+        options are: geometry, id, time_series_id, monitoring_location_id,
+        parameter_code, statistic_id, time, value, unit_of_measure,
+        approval_status, qualifier, last_modified
+    time_series_id : string or list of strings, optional
+        A unique identifier representing a single time series. This
+        corresponds to the id field in the time-series-metadata endpoint.
+    latest_continuous_id : string or list of strings, optional
+        A universally unique identifier (UUID) representing a single version of
+        a record. It is not stable over time. Every time the record is refreshed
+        in our database (which may happen as part of normal operations and does
+        not imply any change to the data itself) a new ID will be generated. To
+        uniquely identify a single observation over time, compare the time and
+        time_series_id fields; each time series will only have a single
+        observation at a given time.
+    approval_status : string or list of strings, optional
+        Some of the data that you have obtained from this U.S. Geological Survey
+        database may not have received Director's approval. Any such data values
+        are qualified as provisional and are subject to revision. Provisional
+        data are released on the condition that neither the USGS nor the United
+        States Government may be held liable for any damages resulting from its
+        use. This field reflects the approval status of each record, and is
+        either "Approved", meaning processing review has been completed and the
+        data are approved for publication, or "Provisional", meaning the data
+        are subject to revision. For more information about provisional data,
+        go to https://waterdata.usgs.gov/provisional-data-statement/.
+    unit_of_measure : string or list of strings, optional
+        A human-readable description of the units of measurement associated
+        with an observation.
+    qualifier : string or list of strings, optional
+        This field indicates any qualifiers associated with an observation, for
+        instance if a sensor may have been impacted by ice or if values were
+        estimated.
+    value : string or list of strings, optional
+        The value of the observation. Values are transmitted as strings in
+        the JSON response format in order to preserve precision.
+    last_modified : string, optional
+        The last time a record was refreshed in our database. This may happen
+        due to regular operational processes and does not necessarily indicate
+        that anything about the measurement has changed. You can query this field
+        using date-times or intervals, adhering to RFC 3339, or using ISO 8601
+        duration objects. Intervals may be bounded or half-bounded (double-dots
+        at start or end). Only features that have a last_modified that
+        intersects the value of datetime are selected.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or
+          "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H"
+          for the last 36 hours
+    skip_geometry : boolean, optional
+        This option can be used to skip response geometries for each feature.
+        The returned object will be a data frame with no spatial information.
+        Note that the USGS Water Data APIs use camelCase "skipGeometry" in
+        CQL2 queries.
+    time : string, optional
+        The date an observation represents. You can query this field using
+        date-times or intervals, adhering to RFC 3339, or using ISO 8601
+        duration objects. Intervals may be bounded or half-bounded (double-dots
+        at start or end). Only features that have a time that intersects the
+        value of datetime are selected. If a feature has multiple temporal
+        properties, it is the decision of the server whether only a single
+        temporal property is used to determine the extent or all relevant
+        temporal properties.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or
+          "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H"
+          for the last 36 hours
+    bbox : list of numbers, optional
+        Only features that have a geometry that intersects the bounding box are
+        selected. The bounding box is provided as four or six numbers,
+        depending on whether the coordinate reference system includes a vertical
+        axis (height or depth). Coordinates are assumed to be in CRS 4326. The
+        expected format is a list structured: [xmin, ymin, xmax, ymax].
+        Another way to think of it is [Western-most longitude, Southern-most
+        latitude, Eastern-most longitude, Northern-most latitude].
+    limit : numeric, optional
+        The optional limit parameter is used to control the subset of the
+        selected features that should be returned in each page. The maximum
+        allowable limit is 10000. It may be beneficial to set this number lower
+        if your internet connection is spotty. The default (None) will set the
+        limit to the maximum allowable limit for the service.
+    convert_type : boolean, optional
+        If True, the function will convert time fields to dates/datetimes and
+        the qualifier field to strings.
+
+    Returns
+    -------
+    df : ``pandas.DataFrame`` or ``geopandas.GeoDataFrame``
+        Formatted data returned from the API query.
+    md: :obj:`dataretrieval.utils.Metadata`
+        A custom metadata object
+
+    Examples
+    --------
+    .. code::
+
+        >>> # Get latest flow data from a single site
+        >>> df, md = dataretrieval.waterdata.get_latest_continuous(
+        ...     monitoring_location_id="USGS-02238500", parameter_code="00060"
+        ... )
+
+        >>> # Get latest continuous measurements for multiple sites
+        >>> df, md = dataretrieval.waterdata.get_latest_continuous(
+        ...     monitoring_location_id=["USGS-05114000", "USGS-09423350"]
+        ... )
+    """
+    service = "latest-continuous"
+    output_id = "latest_continuous_id"
+
+    # Build argument dictionary, omitting None values
+    args = {
+        k: v
+        for k, v in locals().items()
+        if k not in {"service", "output_id"} and v is not None
+    }
+
+    return get_ogc_data(args, output_id, service)
+
+
+def get_latest_daily(
+    monitoring_location_id: Optional[Union[str, List[str]]] = None,
+    parameter_code: Optional[Union[str, List[str]]] = None,
+    statistic_id: Optional[Union[str, List[str]]] = None,
+    properties: Optional[Union[str, List[str]]] = None,
+    time_series_id: Optional[Union[str, List[str]]] = None,
+    latest_daily_id: Optional[Union[str, List[str]]] = None,
+    approval_status: Optional[Union[str, List[str]]] = None,
+    unit_of_measure: Optional[Union[str, List[str]]] = None,
+    qualifier: Optional[Union[str, List[str]]] = None,
+    value: Optional[Union[str, List[str]]] = None,
+    last_modified: Optional[Union[str, List[str]]] = None,
+    skip_geometry: Optional[bool] = None,
+    time: Optional[Union[str, List[str]]] = None,
+    bbox: Optional[List[float]] = None,
+    limit: Optional[int] = None,
+    convert_type: bool = True,
+) -> Tuple[pd.DataFrame, BaseMetadata]:
+    """Daily data provide one data value to represent water conditions for the
+    day.
+
+    Throughout much of the history of the USGS, the primary water data available
+    was daily data collected manually at the monitoring location once each day.
+    With improved availability of computer storage and automated transmission of
+    data, the daily data published today are generally a statistical summary or
+    metric of the continuous data collected each day, such as the daily mean,
+    minimum, or maximum value. Daily data are automatically calculated from the
+    continuous data of the same parameter code and are described by a parameter
+    code and a statistic code. These data have also been referred to as "daily
+    values" or "DV".
+
+    Parameters
+    ----------
+    monitoring_location_id : string or list of strings, optional
+        A unique identifier representing a single monitoring location. This
+        corresponds to the id field in the monitoring-locations endpoint.
+        Monitoring location IDs are created by combining the agency code of the
+        agency responsible for the monitoring location (e.g. USGS) with the ID
+        number of the monitoring location (e.g. 02238500), separated by a hyphen
+        (e.g. USGS-02238500).
+    parameter_code : string or list of strings, optional
+        Parameter codes are 5-digit codes used to identify the constituent
+        measured and the units of measure. A complete list of parameter codes
+        and associated groupings can be found at
+        https://help.waterdata.usgs.gov/codes-and-parameters/parameters.
+    statistic_id : string or list of strings, optional
+        A code corresponding to the statistic an observation represents.
+        Example codes include 00001 (max), 00002 (min), and 00003 (mean).
+        A complete list of codes and their descriptions can be found at
+        https://help.waterdata.usgs.gov/code/stat_cd_nm_query?stat_nm_cd=%25&fmt=html.
+    properties : string or list of strings, optional
+        A vector of requested columns to be returned from the query. Available
+        options are: geometry, id, time_series_id, monitoring_location_id,
+        parameter_code, statistic_id, time, value, unit_of_measure,
+        approval_status, qualifier, last_modified
+    time_series_id : string or list of strings, optional
+        A unique identifier representing a single time series. This
+        corresponds to the id field in the time-series-metadata endpoint.
+    latest_daily_id : string or list of strings, optional
+        A universally unique identifier (UUID) representing a single version of
+        a record. It is not stable over time. Every time the record is refreshed
+        in our database (which may happen as part of normal operations and does
+        not imply any change to the data itself) a new ID will be generated. To
+        uniquely identify a single observation over time, compare the time and
+        time_series_id fields; each time series will only have a single
+        observation at a given time.
+    approval_status : string or list of strings, optional
+        Some of the data that you have obtained from this U.S. Geological Survey
+        database may not have received Director's approval. Any such data values
+        are qualified as provisional and are subject to revision. Provisional
+        data are released on the condition that neither the USGS nor the United
+        States Government may be held liable for any damages resulting from its
+        use. This field reflects the approval status of each record, and is
+        either "Approved", meaning processing review has been completed and the
+        data are approved for publication, or "Provisional", meaning the data
+        are subject to revision. For more information about provisional data,
+        go to https://waterdata.usgs.gov/provisional-data-statement/.
+    unit_of_measure : string or list of strings, optional
+        A human-readable description of the units of measurement associated
+        with an observation.
+    qualifier : string or list of strings, optional
+        This field indicates any qualifiers associated with an observation, for
+        instance if a sensor may have been impacted by ice or if values were
+        estimated.
+    value : string or list of strings, optional
+        The value of the observation. Values are transmitted as strings in
+        the JSON response format in order to preserve precision.
+    last_modified : string, optional
+        The last time a record was refreshed in our database. This may happen
+        due to regular operational processes and does not necessarily indicate
+        that anything about the measurement has changed. You can query this field
+        using date-times or intervals, adhering to RFC 3339, or using ISO 8601
+        duration objects. Intervals may be bounded or half-bounded (double-dots
+        at start or end). Only features that have a last_modified that
+        intersects the value of datetime are selected.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or
+          "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H"
+          for the last 36 hours
+    skip_geometry : boolean, optional
+        This option can be used to skip response geometries for each feature.
+        The returned object will be a data frame with no spatial information.
+        Note that the USGS Water Data APIs use camelCase "skipGeometry" in
+        CQL2 queries.
+    time : string, optional
+        The date an observation represents. You can query this field using
+        date-times or intervals, adhering to RFC 3339, or using ISO 8601
+        duration objects. Intervals may be bounded or half-bounded (double-dots
+        at start or end). Only features that have a time that intersects the
+        value of datetime are selected. If a feature has multiple temporal
+        properties, it is the decision of the server whether only a single
+        temporal property is used to determine the extent or all relevant
+        temporal properties.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or
+          "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H"
+          for the last 36 hours
+    bbox : list of numbers, optional
+        Only features that have a geometry that intersects the bounding box are
+        selected. The bounding box is provided as four or six numbers,
+        depending on whether the coordinate reference system includes a vertical
+        axis (height or depth). Coordinates are assumed to be in CRS 4326. The
+        expected format is a list structured: [xmin, ymin, xmax, ymax].
+        Another way to think of it is [Western-most longitude, Southern-most
+        latitude, Eastern-most longitude, Northern-most latitude].
+    limit : numeric, optional
+        The optional limit parameter is used to control the subset of the
+        selected features that should be returned in each page. The maximum
+        allowable limit is 10000. It may be beneficial to set this number lower
+        if your internet connection is spotty. The default (None) will set the
+        limit to the maximum allowable limit for the service.
+    convert_type : boolean, optional
+        If True, the function will convert time fields to dates/datetimes and
+        the qualifier field to strings.
+
+    Returns
+    -------
+    df : ``pandas.DataFrame`` or ``geopandas.GeoDataFrame``
+        Formatted data returned from the API query.
+    md: :obj:`dataretrieval.utils.Metadata`
+        A custom metadata object
+
+    Examples
+    --------
+    .. code::
+
+        >>> # Get most recent daily flow data from a single site
+        >>> df, md = dataretrieval.waterdata.get_latest_daily(
+        ...     monitoring_location_id="USGS-02238500", parameter_code="00060"
+        ... )
+
+        >>> # Get most recent daily measurements for two sites
+        >>> df, md = dataretrieval.waterdata.get_latest_daily(
+        ...     monitoring_location_id=["USGS-05114000", "USGS-09423350"]
+        ... )
+    """
+    service = "latest-daily"
+    output_id = "latest_daily_id"
+
+    # Build argument dictionary, omitting None values
+    args = {
+        k: v
+        for k, v in locals().items()
+        if k not in {"service", "output_id"} and v is not None
+    }
+
+    return get_ogc_data(args, output_id, service)
+
+
+def get_field_measurements(
+    monitoring_location_id: Optional[Union[str, List[str]]] = None,
+    parameter_code: Optional[Union[str, List[str]]] = None,
+    observing_procedure_code: Optional[Union[str, List[str]]] = None,
+    properties: Optional[List[str]] = None,
+    field_visit_id: Optional[Union[str, List[str]]] = None,
+    approval_status: Optional[Union[str, List[str]]] = None,
+    unit_of_measure: Optional[Union[str, List[str]]] = None,
+    qualifier: Optional[Union[str, List[str]]] = None,
+    value: Optional[Union[str, List[str]]] = None,
+    last_modified: Optional[Union[str, List[str]]] = None,
+    observing_procedure: Optional[Union[str, List[str]]] = None,
+    vertical_datum: Optional[Union[str, List[str]]] = None,
+    measuring_agency: Optional[Union[str, List[str]]] = None,
+    skip_geometry: Optional[bool] = None,
+    time: Optional[Union[str, List[str]]] = None,
+    bbox: Optional[List[float]] = None,
+    limit: Optional[int] = None,
+    convert_type: bool = True,
+) -> Tuple[pd.DataFrame, BaseMetadata]:
+    """Field measurements are physically measured values collected during a
+    visit to the monitoring location. Field measurements consist of gage-height
+    and discharge measurements and groundwater-level readings, and are
+    primarily used as calibration readings for the automated sensors collecting
+    continuous data. They are collected at a low frequency, and delivery of the
+    data in WDFN may be delayed due to data processing time.
+
+    Parameters
+    ----------
+    monitoring_location_id : string or list of strings, optional
+        A unique identifier representing a single monitoring location. This
+        corresponds to the id field in the monitoring-locations endpoint.
+        Monitoring location IDs are created by combining the agency code of the
+        agency responsible for the monitoring location (e.g. USGS) with the ID
+        number of the monitoring location (e.g. 02238500), separated by a hyphen
+        (e.g. USGS-02238500).
+    parameter_code : string or list of strings, optional
+        Parameter codes are 5-digit codes used to identify the constituent
+        measured and the units of measure. A complete list of parameter codes
+        and associated groupings can be found at
+        https://help.waterdata.usgs.gov/codes-and-parameters/parameters.
+    observing_procedure_code : string or list of strings, optional
+        A short code corresponding to the observing procedure for the field
+        measurement.
+    properties : string or list of strings, optional
+        A vector of requested columns to be returned from the query. Available
+        options are: geometry, id, time_series_id, monitoring_location_id,
+        parameter_code, statistic_id, time, value, unit_of_measure,
+        approval_status, qualifier, last_modified
+    field_visit_id : string or list of strings, optional
+        A universally unique identifier (UUID) for the field visit.
+        Multiple measurements may be made during a single field visit.
+    approval_status : string or list of strings, optional
+        Some of the data that you have obtained from this U.S. Geological Survey
+        database may not have received Director's approval. Any such data values
+        are qualified as provisional and are subject to revision. Provisional
+        data are released on the condition that neither the USGS nor the United
+        States Government may be held liable for any damages resulting from its
+        use. This field reflects the approval status of each record, and is
+        either "Approved", meaning processing review has been completed and the
+        data are approved for publication, or "Provisional", meaning the data
+        are subject to revision. For more information about provisional data,
+        go to https://waterdata.usgs.gov/provisional-data-statement/.
+    unit_of_measure : string or list of strings, optional
+        A human-readable description of the units of measurement associated
+        with an observation.
+    qualifier : string or list of strings, optional
+        This field indicates any qualifiers associated with an observation, for
+        instance if a sensor may have been impacted by ice or if values were
+        estimated.
+    value : string or list of strings, optional
+        The value of the observation. Values are transmitted as strings in
+        the JSON response format in order to preserve precision.
+    last_modified : string, optional
+        The last time a record was refreshed in our database. This may happen
+        due to regular operational processes and does not necessarily indicate
+        that anything about the measurement has changed. You can query this field
+        using date-times or intervals, adhering to RFC 3339, or using ISO 8601
+        duration objects. Intervals may be bounded or half-bounded (double-dots
+        at start or end). Only features that have a last_modified that
+        intersects the value of datetime are selected.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H" for the last 36 hours
+    observing_procedure : string or list of strings, optional
+        Water measurement or water-quality observing procedure descriptions.
+    vertical_datum : string or list of strings, optional
+        The datum used to determine altitude and vertical position at the
+        monitoring location. A [list of codes](https://help.waterdata.usgs.gov/code/alt_datum_cd_query?fmt=html)
+        is available.
+    measuring_agency : string or list of strings, optional
+        The agency performing the measurement.
+    skip_geometry : boolean, optional
+        This option can be used to skip response geometries for each feature.
+        The returned object will be a data frame with no spatial information.
+        Note that the USGS Water Data APIs use camelCase "skipGeometry" in
+        CQL2 queries.
+    time : string, optional
+        The date an observation represents. You can query this field using
+        date-times or intervals, adhering to RFC 3339, or using ISO 8601
+        duration objects. Intervals may be bounded or half-bounded (double-dots
+        at start or end). Only features that have a time that intersects the
+        value of datetime are selected. If a feature has multiple temporal
+        properties, it is the decision of the server whether only a single
+        temporal property is used to determine the extent or all relevant
+        temporal properties.
+        Examples:
+        - A date-time: "2018-02-12T23:20:50Z"
+        - A bounded interval: "2018-02-12T00:00:00Z/2018-03-18T12:31:12Z"
+        - Half-bounded intervals: "2018-02-12T00:00:00Z/.." or
+          "../2018-03-18T12:31:12Z"
+        - Duration objects: "P1M" for data from the past month or "PT36H"
+          for the last 36 hours
+    bbox : list of numbers, optional
+        Only features that have a geometry that intersects the bounding box are
+        selected. The bounding box is provided as four or six numbers,
+        depending on whether the coordinate reference system includes a vertical
+        axis (height or depth). Coordinates are assumed to be in CRS 4326. The
+        expected format is a list structured: [xmin, ymin, xmax, ymax].
+        Another way to think of it is [Western-most longitude, Southern-most
+        latitude, Eastern-most longitude, Northern-most latitude].
+    limit : numeric, optional
+        The optional limit parameter is used to control the subset of the
+        selected features that should be returned in each page. The maximum
+        allowable limit is 10000. It may be beneficial to set this number lower
+        if your internet connection is spotty. The default (None) will set the
+        limit to the maximum allowable limit for the service.
+    convert_type : boolean, optional
+        If True, the function will convert time fields to dates/datetimes and
+        the qualifier field to strings.
+
+    Returns
+    -------
+    df : ``pandas.DataFrame`` or ``geopandas.GeoDataFrame``
+        Formatted data returned from the API query.
+    md: :obj:`dataretrieval.utils.Metadata`
+        A custom metadata object
+
+    Examples
+    --------
+    .. code::
+
+        >>> # Get field measurements from a single groundwater site
+        >>> # and parameter code, and do not return geometry
+        >>> df, md = dataretrieval.waterdata.get_field_measurements(
+        ...     monitoring_location_id="USGS-375907091432201",
+        ...     parameter_code="72019",
+        ...     skip_geometry=True,
+        ... )
+
+        >>> # Get field measurements from multiple sites and
+        >>> # parameter codes from the last 20 years
+        >>> df, md = dataretrieval.waterdata.get_field_measurements(
+        ...     monitoring_location_id=["USGS-451605097071701",
+        ...                             "USGS-263819081585801"],
+        ...     parameter_code=["62611", "72019"],
+        ...     time="P20Y"
+        ... )
+    """
+    service = "field-measurements"
+    output_id = "field_measurement_id"
+
+    # Build argument dictionary, omitting None values
+    args = {
+        k: v
+        for k, v in locals().items()
+        if k not in {"service", "output_id"} and v is not None
+    }
+
+    return get_ogc_data(args, output_id, service)
+
+
+def get_codes(code_service: CODE_SERVICES) -> pd.DataFrame:
+    """Return codes from a Samples code service.
+
+    Parameters
+    ----------
+    code_service : string
+        One of the following options: "states", "counties", "countries",
+        "sitetype", "samplemedia", "characteristicgroup", "characteristics",
+        or "observedproperty".
+    """
+    valid_code_services = get_args(CODE_SERVICES)
+    if code_service not in valid_code_services:
+        raise ValueError(
+            f"Invalid code service: '{code_service}'. "
+            f"Valid options are: {valid_code_services}."
+ ) + + url = f"{SAMPLES_URL}/codeservice/{code_service}?mimeType=application%2Fjson" + + response = requests.get(url) + + response.raise_for_status() + + data_dict = json.loads(response.text) + data_list = data_dict["data"] + + df = pd.DataFrame(data_list) + + return df + + +def get_samples( + ssl_check: bool = True, + service: SERVICES = "results", + profile: PROFILES = "fullphyschem", + activityMediaName: Optional[Union[str, list[str]]] = None, + activityStartDateLower: Optional[str] = None, + activityStartDateUpper: Optional[str] = None, + activityTypeCode: Optional[Union[str, list[str]]] = None, + characteristicGroup: Optional[Union[str, list[str]]] = None, + characteristic: Optional[Union[str, list[str]]] = None, + characteristicUserSupplied: Optional[Union[str, list[str]]] = None, + boundingBox: Optional[list[float]] = None, + countryFips: Optional[Union[str, list[str]]] = None, + stateFips: Optional[Union[str, list[str]]] = None, + countyFips: Optional[Union[str, list[str]]] = None, + siteTypeCode: Optional[Union[str, list[str]]] = None, + siteTypeName: Optional[Union[str, list[str]]] = None, + usgsPCode: Optional[Union[str, list[str]]] = None, + hydrologicUnit: Optional[Union[str, list[str]]] = None, + monitoringLocationIdentifier: Optional[Union[str, list[str]]] = None, + organizationIdentifier: Optional[Union[str, list[str]]] = None, + pointLocationLatitude: Optional[float] = None, + pointLocationLongitude: Optional[float] = None, + pointLocationWithinMiles: Optional[float] = None, + projectIdentifier: Optional[Union[str, list[str]]] = None, + recordIdentifierUserSupplied: Optional[Union[str, list[str]]] = None, +) -> Tuple[pd.DataFrame, BaseMetadata]: + """Search Samples database for USGS water quality data. + This is a wrapper function for the Samples database API. All potential + filters are provided as arguments to the function, but please do not + populate all possible filters; leave as many as feasible with their default + value (None). This is important because overcomplicated web service queries + can bog down the database's ability to return an applicable dataset before + it times out. + + The web GUI for the Samples database can be found here: + https://waterdata.usgs.gov/download-samples/#dataProfile=site + + If you would like more details on feasible query parameters (complete with + examples), please visit the Samples database swagger docs, here: + https://api.waterdata.usgs.gov/samples-data/docs#/ + + Parameters + ---------- + ssl_check : bool, optional + Check the SSL certificate. + service : string + One of the available Samples services: "results", "locations", "activities", + "projects", or "organizations". Defaults to "results". + profile : string + One of the available profiles associated with a service. Options for each + service are: + results - "fullphyschem", "basicphyschem", + "fullbio", "basicbio", "narrow", + "resultdetectionquantitationlimit", + "labsampleprep", "count" + locations - "site", "count" + activities - "sampact", "actmetric", + "actgroup", "count" + projects - "project", "projectmonitoringlocationweight" + organizations - "organization", "count" + activityMediaName : string or list of strings, optional + Name or code indicating environmental medium in which sample was taken. + Check the `activityMediaName_lookup()` function in this module for all + possible inputs. + Example: "Water". + activityStartDateLower : string, optional + The start date if using a date range. Takes the format YYYY-MM-DD. + The logic is inclusive, i.e. 
it will also return results that
+        match the date. If left as None, will pull all data on or before
+        activityStartDateUpper, if populated.
+    activityStartDateUpper : string, optional
+        The end date if using a date range. Takes the format YYYY-MM-DD.
+        The logic is inclusive, i.e. it will also return results that
+        match the date. If left as None, will pull all data after
+        activityStartDateLower up to the most recent available results.
+    activityTypeCode : string or list of strings, optional
+        Text code that describes the type of field activity performed.
+        Example: "Sample-Routine, regular".
+    characteristicGroup : string or list of strings, optional
+        Characteristic group is a broad category of characteristics
+        describing one or more results. Check the `characteristicGroup_lookup()`
+        function in this module for all possible inputs.
+        Example: "Organics, PFAS"
+    characteristic : string or list of strings, optional
+        Characteristic is a specific category describing one or more results.
+        Check the `characteristic_lookup()` function in this module for all
+        possible inputs.
+        Example: "Suspended Sediment Discharge"
+    characteristicUserSupplied : string or list of strings, optional
+        A user-supplied characteristic name describing one or more results.
+    boundingBox : list of four floats, optional
+        Filters on the associated monitoring location's point location
+        by checking if it is located within the specified geographic area.
+        The logic is inclusive, i.e. it will include locations that overlap
+        with the edge of the bounding box. Values are separated by commas,
+        expressed in decimal degrees, NAD83, and longitudes west of Greenwich
+        are negative.
+        The format is a list consisting of:
+        - Western-most longitude
+        - Southern-most latitude
+        - Eastern-most longitude
+        - Northern-most latitude
+        Example: [-92.8, 44.2, -88.9, 46.0]
+    countryFips : string or list of strings, optional
+        Example: "US" (United States)
+    stateFips : string or list of strings, optional
+        Check the `stateFips_lookup()` function in this module for all
+        possible inputs.
+        Example: "US:15" (United States: Hawaii)
+    countyFips : string or list of strings, optional
+        Check the `countyFips_lookup()` function in this module for all
+        possible inputs.
+        Example: "US:15:001" (United States: Hawaii, Hawaii County)
+    siteTypeCode : string or list of strings, optional
+        An abbreviation for a certain site type. Check the `siteType_lookup()`
+        function in this module for all possible inputs.
+        Example: "GW" (Groundwater site)
+    siteTypeName : string or list of strings, optional
+        A full name for a certain site type. Check the `siteType_lookup()`
+        function in this module for all possible inputs.
+        Example: "Well"
+    usgsPCode : string or list of strings, optional
+        5-digit number used in the US Geological Survey computerized
+        data system, National Water Information System (NWIS), to
+        uniquely identify a specific constituent. Check the
+        `characteristic_lookup()` function in this module for all possible
+        inputs.
+        Example: "00060" (Discharge, cubic feet per second)
+    hydrologicUnit : string or list of strings, optional
+        Max 12-digit number used to describe a hydrologic unit.
+        Example: "070900020502"
+    monitoringLocationIdentifier : string or list of strings, optional
+        A monitoring location identifier has two parts: the agency code
+        and the location number, separated by a dash (-).
+ Example: "USGS-040851385" + organizationIdentifier : string or list of strings, optional + Designator used to uniquely identify a specific organization. + Currently only accepting the organization "USGS". + pointLocationLatitude : float, optional + Latitude for a point/radius query (decimal degrees). Must be used + with pointLocationLongitude and pointLocationWithinMiles. + pointLocationLongitude : float, optional + Longitude for a point/radius query (decimal degrees). Must be used + with pointLocationLatitude and pointLocationWithinMiles. + pointLocationWithinMiles : float, optional + Radius for a point/radius query. Must be used with + pointLocationLatitude and pointLocationLongitude + projectIdentifier : string or list of strings, optional + Designator used to uniquely identify a data collection project. Project + identifiers are specific to an organization (e.g. USGS). + Example: "ZH003QW03" + recordIdentifierUserSupplied : string or list of strings, optional + Internal AQS record identifier that returns 1 entry. Only available + for the "results" service. + + Returns + ------- + df : ``pandas.DataFrame`` + Formatted data returned from the API query. + md : :obj:`dataretrieval.utils.Metadata` + Custom ``dataretrieval`` metadata object pertaining to the query. + + Examples + -------- + .. code:: + + >>> # Get PFAS results within a bounding box + >>> df, md = dataretrieval.waterdata.get_samples( + ... boundingBox=[-90.2, 42.6, -88.7, 43.2], + ... characteristicGroup="Organics, PFAS", + ... ) + + >>> # Get all activities for the Commonwealth of Virginia over a date range + >>> df, md = dataretrieval.waterdata.get_samples( + ... service="activities", + ... profile="sampact", + ... activityStartDateLower="2023-10-01", + ... activityStartDateUpper="2024-01-01", + ... stateFips="US:51", + ... ) + + >>> # Get all pH samples for two sites in Utah + >>> df, md = dataretrieval.waterdata.get_samples( + ... monitoringLocationIdentifier=[ + ... "USGS-393147111462301", + ... "USGS-393343111454101", + ... ], + ... usgsPCode="00400", + ... ) + + """ + + _check_profiles(service, profile) + + params = { + k: v + for k, v in locals().items() + if k not in ["ssl_check", "service", "profile"] and v is not None + } + + params.update({"mimeType": "text/csv"}) + + if "boundingBox" in params: + params["boundingBox"] = to_str(params["boundingBox"]) + + url = f"{SAMPLES_URL}/{service}/{profile}" + + req = PreparedRequest() + req.prepare_url(url, params=params) + logger.info("Request: %s", req.url) + + response = requests.get(url, params=params, verify=ssl_check) + + response.raise_for_status() + + df = pd.read_csv(StringIO(response.text), delimiter=",") + + return df, BaseMetadata(response) + + +def _check_profiles( + service: SERVICES, + profile: PROFILES, +) -> None: + """Check whether a service profile is valid. + + Parameters + ---------- + service : string + One of the service names from the "services" list. + profile : string + One of the profile names from "results_profiles", + "locations_profiles", "activities_profiles", + "projects_profiles" or "organizations_profiles". + """ + valid_services = get_args(SERVICES) + if service not in valid_services: + raise ValueError( + f"Invalid service: '{service}'. Valid options are: {valid_services}." + ) + + valid_profiles = PROFILE_LOOKUP[service] + if profile not in valid_profiles: + raise ValueError( + f"Invalid profile: '{profile}' for service '{service}'. " + f"Valid options are: {valid_profiles}." 
+ ) diff --git a/dataretrieval/waterdata/types.py b/dataretrieval/waterdata/types.py new file mode 100644 index 00000000..65e73394 --- /dev/null +++ b/dataretrieval/waterdata/types.py @@ -0,0 +1,55 @@ +from typing import Literal + +CODE_SERVICES = Literal[ + "characteristicgroup", + "characteristics", + "counties", + "countries", + "observedproperty", + "samplemedia", + "sitetype", + "states", +] + +SERVICES = Literal[ + "activities", + "locations", + "organizations", + "projects", + "results", +] + +PROFILES = Literal[ + "actgroup", + "actmetric", + "basicbio", + "basicphyschem", + "count", + "fullbio", + "fullphyschem", + "labsampleprep", + "narrow", + "organization", + "project", + "projectmonitoringlocationweight", + "resultdetectionquantitationlimit", + "sampact", + "site", +] + +PROFILE_LOOKUP = { + "activities": ["sampact", "actmetric", "actgroup", "count"], + "locations": ["site", "count"], + "organizations": ["organization", "count"], + "projects": ["project", "projectmonitoringlocationweight"], + "results": [ + "fullphyschem", + "basicphyschem", + "fullbio", + "basicbio", + "narrow", + "resultdetectionquantitationlimit", + "labsampleprep", + "count", + ], +} diff --git a/dataretrieval/waterdata/utils.py b/dataretrieval/waterdata/utils.py new file mode 100644 index 00000000..68ae9e13 --- /dev/null +++ b/dataretrieval/waterdata/utils.py @@ -0,0 +1,778 @@ +import json +import logging +import warnings +import os +import re +from datetime import datetime +from typing import Any, Dict, List, Optional, Tuple, Union + +import pandas as pd +import requests +from zoneinfo import ZoneInfo + +from dataretrieval.utils import BaseMetadata +from dataretrieval import __version__ + +try: + import geopandas as gpd + + GEOPANDAS = True +except ImportError: + GEOPANDAS = False + +# Set up logger for this module +logger = logging.getLogger(__name__) + +BASE_URL = "https://api.waterdata.usgs.gov" +OGC_API_VERSION = "v0" +OGC_API_URL = f"{BASE_URL}/ogcapi/{OGC_API_VERSION}" +SAMPLES_URL = f"{BASE_URL}/samples-data" + + +def _switch_arg_id(ls: Dict[str, Any], id_name: str, service: str): + """ + Switch argument id from its package-specific identifier to the standardized "id" key + that the API recognizes. + + Sets the "id" key in the provided dictionary `ls` + with the value from either the service name or the expected id column name. + If neither key exists, "id" will be set to None. + + Parameters + ---------- + ls : Dict[str, Any] + The dictionary containing identifier keys to be standardized. + id_name : str + The name of the specific identifier key to look for. + service : str + The service name. + + Returns + ------- + Dict[str, Any] + The modified dictionary with the "id" key set appropriately. + + Examples + -------- + For service "time-series-metadata", the function will look for either + "time_series_metadata_id" or "time_series_id" and change the key to simply + "id". + """ + + service_id = service.replace("-", "_") + "_id" + + if "id" not in ls: + if service_id in ls: + ls["id"] = ls[service_id] + elif id_name in ls: + ls["id"] = ls[id_name] + + # Remove the original keys regardless of whether they were used + ls.pop(service_id, None) + ls.pop(id_name, None) + + return ls + + +def _switch_properties_id(properties: Optional[List[str]], id_name: str, service: str): + """ + Switch properties id from its package-specific identifier to the + standardized "id" key that the API recognizes. 
+
+    Replaces any package-specific identifier entries in the `properties` list
+    with the standardized "id" name that the API recognizes, and drops entries
+    that are handled separately, such as "geometry". If `properties` is None
+    or empty, an empty list is returned.
+
+    Parameters
+    ----------
+    properties : Optional[List[str]]
+        A list containing the properties or column names to be pulled from the
+        service, or None.
+    id_name : str
+        The name of the specific identifier key to look for.
+    service : str
+        The service name.
+
+    Returns
+    -------
+    List[str]
+        The modified list with identifier entries replaced by "id".
+
+    Examples
+    --------
+    For service "monitoring-locations", the function will look for
+    "monitoring_location_id" and change it to "id".
+    """
+    if not properties:
+        return []
+    service_id = service.replace("-", "_") + "_id"
+    last_letter = service[-1]
+    service_id_singular = ""
+    if last_letter == "s":
+        service_singular = service[:-1]
+        service_id_singular = service_singular.replace("-", "_") + "_id"
+    # Replace id fields with "id"
+    id_fields = [service_id, service_id_singular, id_name]
+    properties = ["id" if p in id_fields else p.replace("-", "_") for p in properties]
+    # Remove unwanted fields
+    return [p for p in properties if p not in ["geometry", service_id]]
+
+
+def _format_api_dates(
+    datetime_input: Union[str, List[str]], date: bool = False
+) -> Union[str, None]:
+    """
+    Formats date or datetime input(s) for use with an API.
+
+    Handles single values or ranges, converting to ISO 8601 or date-only
+    formats as needed.
+
+    Parameters
+    ----------
+    datetime_input : Union[str, List[str]]
+        A single date/datetime string or a list of one or two date/datetime
+        strings. Accepts formats like "%Y-%m-%d %H:%M:%S", ISO 8601, or relative
+        periods (e.g., "P7D").
+    date : bool, optional
+        If True, uses only the date portion ("YYYY-MM-DD"). If False (default),
+        returns the full datetime in UTC ISO 8601 format ("YYYY-MM-DDTHH:MM:SSZ").
+
+    Returns
+    -------
+    Union[str, None]
+        - If input is a single value, returns the formatted date/datetime string
+          or None if parsing fails.
+        - If input is a list of two values, returns a date/datetime range string
+          separated by "/" (e.g., "YYYY-MM-DD/YYYY-MM-DD" or
+          "YYYY-MM-DDTHH:MM:SSZ/YYYY-MM-DDTHH:MM:SSZ").
+        - Returns None if input is empty, all NA, or cannot be parsed.
+
+    Raises
+    ------
+    ValueError
+        If `datetime_input` contains more than two values.
+
+    Notes
+    -----
+    - Handles blank or NA values by returning None.
+    - Supports relative period strings (e.g., "P7D") and passes them through
+      unchanged.
+    - Converts datetimes to UTC and formats as ISO 8601 with a 'Z' suffix when
+      `date` is False.
+    - For date ranges, replaces "nan" with ".." in the output.
+    """
+    # Get timezone
+    local_timezone = datetime.now().astimezone().tzinfo
+
+    # Convert single string to list for uniform processing
+    if isinstance(datetime_input, str):
+        datetime_input = [datetime_input]
+
+    # Check for null or all NA and return None
+    if all(pd.isna(dt) or dt == "" or dt is None for dt in datetime_input):
+        return None
+
+    if len(datetime_input) <= 2:
+        # If the list is of length 1, first look for things like "P7D" or
+        # dates already formatted in ISO 8601.
+
+
+def _cql2_param(args: Dict[str, Any]) -> str:
+    """
+    Convert query parameters to CQL2 JSON format for POST requests.
+
+    Parameters
+    ----------
+    args : Dict[str, Any]
+        Dictionary of query parameters to convert to CQL2 format.
+
+    Returns
+    -------
+    str
+        JSON string representation of the CQL2 query.
+    """
+    filters = []
+    for key, values in args.items():
+        filters.append({"op": "in", "args": [{"property": key}, values]})
+
+    query = {"op": "and", "args": filters}
+
+    return json.dumps(query, indent=4)
+
+
+def _default_headers():
+    """
+    Generate default HTTP headers for API requests.
+
+    Returns
+    -------
+    dict
+        A dictionary containing default headers including 'Accept-Encoding',
+        'Accept', 'User-Agent', and 'lang'. If the environment variable
+        'API_USGS_PAT' is set, its value is included as the 'X-Api-Key'
+        header.
+    """
+    headers = {
+        "Accept-Encoding": "compress, gzip",
+        "Accept": "application/json",
+        "User-Agent": f"python-dataretrieval/{__version__}",
+        "lang": "en-US",
+    }
+    token = os.getenv("API_USGS_PAT")
+    if token:
+        headers["X-Api-Key"] = token
+    return headers
+
+
+def _check_ogc_requests(endpoint: str = "daily", req_type: str = "queryables"):
+    """
+    Sends an HTTP GET request to the specified OGC endpoint and request type,
+    returning the JSON response.
+
+    Parameters
+    ----------
+    endpoint : str, optional
+        The OGC collection endpoint to query (default is "daily").
+    req_type : str, optional
+        The type of request to make. Must be either "queryables" or "schema"
+        (default is "queryables").
+
+    Returns
+    -------
+    dict
+        The JSON response from the OGC endpoint.
+
+    Raises
+    ------
+    AssertionError
+        If req_type is not "queryables" or "schema".
+    requests.HTTPError
+        If the HTTP request returns an unsuccessful status code.
+    """
+    assert req_type in ["queryables", "schema"]
+    url = f"{OGC_API_URL}/collections/{endpoint}/{req_type}"
+    resp = requests.get(url, headers=_default_headers())
+    resp.raise_for_status()
+    return resp.json()
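+
+# Illustrative sketch of the CQL2 body built by _cql2_param, defined earlier
+# in this module (the parameter values here are hypothetical):
+#
+#     >>> print(_cql2_param({"parameter_code": ["00060", "00065"]}))
+#     {
+#         "op": "and",
+#         "args": [
+#             {
+#                 "op": "in",
+#                 "args": [
+#                     {
+#                         "property": "parameter_code"
+#                     },
+#                     [
+#                         "00060",
+#                         "00065"
+#                     ]
+#                 ]
+#             }
+#         ]
+#     }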
+
+
+def _error_body(resp: requests.Response):
+    """
+    Provide a more informative error message based on the response status.
+
+    Parameters
+    ----------
+    resp : requests.Response
+        The HTTP response object to extract the error message from.
+
+    Returns
+    -------
+    str
+        The extracted error message. For status codes 429 and 403, returns a
+        predefined message (rate limiting and denied queries, respectively).
+        For other status codes, returns the 'code' and 'description' fields
+        from the JSON error body.
+    """
+    status = resp.status_code
+    if status == 429:
+        return "429: Too many requests made. Please obtain an API token or try again later."
+    elif status == 403:
+        return "403: Query request denied. Possible reasons include query exceeding server limits."
+    j_txt = resp.json()
+    return (
+        f"{status}: {j_txt.get('code', 'Unknown type')}. "
+        f"{j_txt.get('description', 'No description provided')}."
+    )
+
+
+def _construct_api_requests(
+    service: str,
+    properties: Optional[List[str]] = None,
+    bbox: Optional[List[float]] = None,
+    limit: Optional[int] = None,
+    skip_geometry: bool = False,
+    **kwargs,
+):
+    """
+    Constructs an HTTP request object for the specified water data API
+    service.
+
+    Depending on the input parameters (in particular, whether any argument
+    holds a list of multiple values), the function determines whether to use
+    a GET or POST request, formats parameters appropriately, and sets
+    required headers.
+
+    Parameters
+    ----------
+    service : str
+        The name of the API service to query (e.g., "daily").
+    properties : Optional[List[str]], optional
+        List of property names to include in the request.
+    bbox : Optional[List[float]], optional
+        Bounding box coordinates as a list of floats.
+    limit : Optional[int], optional
+        Maximum number of results to return per request.
+    skip_geometry : bool, optional
+        Whether to exclude geometry from the response (default is False).
+    **kwargs
+        Additional query parameters, including date/time filters and other
+        API-specific options.
+
+    Returns
+    -------
+    requests.PreparedRequest
+        The constructed HTTP request object ready to be sent.
+
+    Notes
+    -----
+    - Date/time parameters are automatically formatted to ISO 8601.
+    - If multiple values are provided for non-single parameters, a POST
+      request is constructed with a CQL2 filter body.
+    - The function sets appropriate headers for GET and POST requests.
+    """
+    service_url = f"{OGC_API_URL}/collections/{service}/items"
+
+    # Single parameters can only have one value
+    single_params = {"datetime", "last_modified", "begin", "end", "time"}
+
+    # Identify which parameters should be included in the POST content body
+    post_params = {
+        k: v
+        for k, v in kwargs.items()
+        if k not in single_params and isinstance(v, (list, tuple)) and len(v) > 1
+    }
+
+    # Everything else goes into the params dictionary for the URL
+    params = {k: v for k, v in kwargs.items() if k not in post_params}
+    # Set skipGeometry parameter (API expects camelCase)
+    params["skipGeometry"] = skip_geometry
+
+    # If limit is None or greater than 10000, set limit to the maximum
+    # number of results per page. Otherwise, use the limit.
+    params["limit"] = 10000 if limit is None or limit > 10000 else limit
+
+    # Indicate whether the function needs to perform a POST conversion
+    POST = bool(post_params)
+
+    # Convert dates to ISO 8601 format
+    time_periods = {"last_modified", "datetime", "time", "begin", "end"}
+    for i in time_periods:
+        if i in params:
+            dates = service == "daily" and i != "last_modified"
+            params[i] = _format_api_dates(params[i], date=dates)
+
+    # Join bbox elements from a list into a comma-separated string, and
+    # join properties if provided
+    if bbox:
+        params["bbox"] = ",".join(map(str, bbox))
+    if properties:
+        params["properties"] = ",".join(properties)
+
+    headers = _default_headers()
+
+    if POST:
+        headers["Content-Type"] = "application/query-cql-json"
+        request = requests.Request(
+            method="POST",
+            url=service_url,
+            headers=headers,
+            data=_cql2_param(post_params),
+            params=params,
+        )
+    else:
+        request = requests.Request(
+            method="GET",
+            url=service_url,
+            headers=headers,
+            params=params,
+        )
+    return request.prepare()
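+
+# Illustrative sketch (hypothetical values): a single-valued query stays a
+# GET with URL parameters, while a multi-valued filter is moved into a CQL2
+# POST body by _construct_api_requests above.
+#
+#     >>> req = _construct_api_requests("daily", parameter_code="00060")
+#     >>> req.method
+#     'GET'
+#     >>> req = _construct_api_requests(
+#     ...     "daily", monitoring_location_id=["USGS-01", "USGS-02"]
+#     ... )
+#     >>> req.method
+#     'POST'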
+
+
+def _next_req_url(resp: requests.Response) -> Optional[str]:
+    """
+    Extracts the URL for the next page of results from an HTTP response from
+    a water data endpoint.
+
+    Parameters
+    ----------
+    resp : requests.Response
+        The HTTP response object containing JSON data and headers.
+
+    Returns
+    -------
+    Optional[str]
+        The URL for the next page of results if available, otherwise None.
+
+    Notes
+    -----
+    - If the environment variable "API_USGS_PAT" is set, logs the remaining
+      requests for the current hour.
+    - Logs the next URL at info level if found.
+    - Expects the response JSON to contain a "links" list with objects
+      having "rel" and "href" keys.
+    - Checks for the "next" relation in the "links" to determine the next
+      URL.
+    """
+    body = resp.json()
+    if not body.get("numberReturned"):
+        return None
+    header_info = resp.headers
+    if os.getenv("API_USGS_PAT", ""):
+        logger.info(
+            "Remaining requests this hour: %s",
+            header_info.get("x-ratelimit-remaining", ""),
+        )
+    for link in body.get("links", []):
+        if link.get("rel") == "next":
+            next_url = link.get("href")
+            logger.info("Next URL: %s", next_url)
+            return next_url
+    return None
+
+
+def _get_resp_data(resp: requests.Response, geopd: bool) -> pd.DataFrame:
+    """
+    Extracts and normalizes data from an HTTP response containing GeoJSON
+    features.
+
+    Parameters
+    ----------
+    resp : requests.Response
+        The HTTP response object expected to contain a JSON body with a
+        "features" key.
+    geopd : bool
+        Indicates whether geopandas is installed and should be used to
+        handle geometries.
+
+    Returns
+    -------
+    gpd.GeoDataFrame or pd.DataFrame
+        A geopandas GeoDataFrame if geometry is included, or a pandas
+        DataFrame containing the feature properties and each row's
+        service-specific id. Returns an empty pandas DataFrame if no
+        features are returned.
+    """
+    # Check if it's an empty response
+    body = resp.json()
+    if not body.get("numberReturned"):
+        return pd.DataFrame()
+
+    # If geopandas is not installed, return a pandas DataFrame
+    if not geopd:
+        df = pd.json_normalize(body["features"], sep="_")
+        df = df.drop(
+            columns=["type", "geometry", "AsGeoJSON(geometry)"], errors="ignore"
+        )
+        df.columns = [col.replace("properties_", "") for col in df.columns]
+        df.rename(columns={"geometry_coordinates": "geometry"}, inplace=True)
+        return df
+
+    # Organize the JSON into a GeoDataFrame and make sure the id column
+    # comes along.
+    df = gpd.GeoDataFrame.from_features(body["features"])
+    df["id"] = pd.json_normalize(body["features"])["id"].values
+    df = df[["id"] + [col for col in df.columns if col != "id"]]
+
+    # If no geometry is present, return a pandas DataFrame. A GeoDataFrame
+    # is not needed.
+    if df["geometry"].isnull().all():
+        df = pd.DataFrame(df.drop(columns="geometry"))
+
+    return df
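+
+# Illustrative sketch of the paging fields that _next_req_url above relies
+# on (abbreviated, hypothetical response body):
+#
+#     {
+#         "numberReturned": 10000,
+#         "features": [...],
+#         "links": [
+#             {"rel": "self", "href": "https://api.waterdata.usgs.gov/..."},
+#             {"rel": "next", "href": "https://api.waterdata.usgs.gov/..."}
+#         ]
+#     }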
+
+
+def _walk_pages(
+    geopd: bool,
+    req: requests.PreparedRequest,
+    client: Optional[requests.Session] = None,
+) -> Tuple[pd.DataFrame, requests.Response]:
+    """
+    Iterates through paginated API responses and aggregates the results into
+    a single DataFrame.
+
+    Parameters
+    ----------
+    geopd : bool
+        Indicates whether geopandas is installed and should be used for
+        handling geometries.
+    req : requests.PreparedRequest
+        The initial HTTP request to send.
+    client : Optional[requests.Session], default None
+        An optional HTTP client to use for requests. If not provided, a new
+        client is created.
+
+    Returns
+    -------
+    pd.DataFrame
+        A DataFrame containing the aggregated results from all pages.
+    requests.Response
+        The initial response object containing metadata about the first
+        request.
+
+    Raises
+    ------
+    Exception
+        If the initial request fails or returns a non-200 status code.
+        Failures on subsequent pages are warned about and logged, and the
+        data collected so far is returned.
+    """
+    logger.info("Requesting: %s", req.url)
+
+    if not geopd:
+        logger.warning(
+            "Geopandas is not installed. "
+            "Geometries will be flattened into pandas DataFrames."
+        )
+
+    # Get the first response from the client using a GET or POST call
+    close_client = client is None
+    client = client or requests.Session()
+    try:
+        resp = client.send(req)
+        if resp.status_code != 200:
+            raise Exception(_error_body(resp))
+
+        # Store the initial response for metadata
+        initial_response = resp
+
+        # Grab some aspects of the original request: headers and the
+        # request type (GET or POST)
+        method = req.method.upper()
+        headers = dict(req.headers)
+        content = req.body if method == "POST" else None
+
+        dfs = _get_resp_data(resp, geopd=geopd)
+        curr_url = _next_req_url(resp)
+        while curr_url:
+            try:
+                resp = client.request(
+                    method,
+                    curr_url,
+                    headers=headers,
+                    data=content if method == "POST" else None,
+                )
+                if resp.status_code != 200:
+                    raise Exception(_error_body(resp))
+                df1 = _get_resp_data(resp, geopd=geopd)
+                dfs = pd.concat([dfs, df1], ignore_index=True)
+                curr_url = _next_req_url(resp)
+            except Exception as e:
+                warnings.warn(f"{e}. Data request incomplete.")
+                logger.error("Request incomplete. %s", e)
+                logger.warning(
+                    "Request failed for URL: %s. Data download interrupted.",
+                    curr_url,
+                )
+                curr_url = None
+        return dfs, initial_response
+    finally:
+        if close_client:
+            client.close()
+
+
+def _deal_with_empty(
+    return_list: pd.DataFrame, properties: Optional[List[str]], service: str
+) -> pd.DataFrame:
+    """
+    Handles empty DataFrame results by returning a DataFrame with
+    appropriate columns.
+
+    If `return_list` is empty, determines the column names to use:
+
+    - If `properties` is not provided or contains only NaN values, retrieves
+      the schema properties from the specified service.
+    - Otherwise, uses the provided `properties` list as column names.
+
+    Parameters
+    ----------
+    return_list : pd.DataFrame
+        The DataFrame to check for emptiness.
+    properties : Optional[List[str]]
+        List of property names to use as columns, or None.
+    service : str
+        The service endpoint to query for schema properties if needed.
+
+    Returns
+    -------
+    pd.DataFrame
+        The original DataFrame if not empty, otherwise an empty DataFrame
+        with the appropriate columns.
+    """
+    if return_list.empty:
+        if not properties or all(pd.isna(properties)):
+            schema = _check_ogc_requests(endpoint=service, req_type="schema")
+            properties = list(schema.get("properties", {}).keys())
+        return pd.DataFrame(columns=properties)
+    return return_list
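+
+# Illustrative sketch (hypothetical column names): an empty page set still
+# yields a column-complete frame, with columns pulled from the service
+# schema when no properties were requested.
+#
+#     >>> _deal_with_empty(pd.DataFrame(), ["id", "time", "value"], "daily")
+#     Empty DataFrame
+#     Columns: [id, time, value]
+#     Index: []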
+
+
+def _arrange_cols(
+    df: pd.DataFrame, properties: Optional[List[str]], output_id: str
+) -> pd.DataFrame:
+    """
+    Rearranges and renames columns in a DataFrame based on the provided
+    properties and the service's output id.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        The input DataFrame whose columns are to be rearranged or renamed.
+    properties : Optional[List[str]]
+        A list of column names to possibly rename. If None or containing
+        only NaN values, the function will rename 'id' to `output_id`.
+    output_id : str
+        The name to which the 'id' column should be renamed if applicable.
+
+    Returns
+    -------
+    pd.DataFrame or gpd.GeoDataFrame
+        The DataFrame with columns rearranged and/or renamed according to
+        the specified properties and output_id.
+    """
+    if properties and not all(pd.isna(properties)):
+        if "id" not in properties:
+            # If the user refers to the service-specific output id in
+            # properties, then rename the "id" column to the output_id (the
+            # id column is automatically included).
+            if output_id in properties:
+                df = df.rename(columns={"id": output_id})
+            # If the output id is not in properties, but the user requests
+            # the plural of the output_id (e.g. "monitoring_locations_id"),
+            # then rename "id" to the plural. This is pretty niche.
+            else:
+                plural = output_id.replace("_id", "s_id")
+                if plural in properties:
+                    df = df.rename(columns={"id": plural})
+        return df.loc[:, [col for col in properties if col in df.columns]]
+    else:
+        return df.rename(columns={"id": output_id})
+
+
+def _cleanup_cols(df: pd.DataFrame, service: str = "daily") -> pd.DataFrame:
+    """
+    Cleans and standardizes columns in a pandas DataFrame for water data
+    endpoints.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        The input DataFrame containing water data.
+    service : str, optional
+        The type of water data service (default is "daily").
+
+    Returns
+    -------
+    pd.DataFrame
+        The cleaned DataFrame with standardized columns.
+
+    Notes
+    -----
+    - If the 'time' column exists and service is "daily", it is converted to
+      date objects.
+    - The 'value' and 'contributing_drainage_area' columns are coerced to
+      numeric types.
+    """
+    if "time" in df.columns and service == "daily":
+        df["time"] = pd.to_datetime(df["time"]).dt.date
+    for col in ["value", "contributing_drainage_area"]:
+        if col in df.columns:
+            df[col] = pd.to_numeric(df[col], errors="coerce")
+    return df
+
+
+def get_ogc_data(
+    args: Dict[str, Any], output_id: str, service: str
+) -> Tuple[pd.DataFrame, BaseMetadata]:
+    """
+    Retrieves OGC (Open Geospatial Consortium) data from a specified water
+    data endpoint and returns it as a pandas DataFrame with metadata.
+
+    This function prepares request arguments, constructs API requests,
+    handles pagination, processes the results, and formats the output
+    DataFrame according to the specified parameters.
+
+    Parameters
+    ----------
+    args : Dict[str, Any]
+        Dictionary of request arguments for the OGC service.
+    output_id : str
+        The name of the output identifier to use in the request.
+    service : str
+        The OGC collection to query (e.g., "daily", "monitoring-locations").
+ + Returns + ------- + pd.DataFrame or gpd.GeoDataFrame + A DataFrame containing the retrieved and processed OGC data. + BaseMetadata + A metadata object containing request information including URL and query time. + + Notes + ----- + - The function does not mutate the input `args` dictionary. + - Handles optional arguments such as `convert_type`. + - Applies column cleanup and reordering based on service and properties. + """ + args = args.copy() + # Add service as an argument + args["service"] = service + # Switch the input id to "id" if needed + args = _switch_arg_id(args, id_name=output_id, service=service) + properties = args.get("properties") + # Switch properties id to "id" if needed + args["properties"] = _switch_properties_id( + properties, id_name=output_id, service=service + ) + convert_type = args.pop("convert_type", False) + # Create fresh dictionary of args without any None values + args = {k: v for k, v in args.items() if v is not None} + # Build API request + req = _construct_api_requests(**args) + # Run API request and iterate through pages if needed + return_list, response = _walk_pages( + geopd=GEOPANDAS, req=req + ) + # Manage some aspects of the returned dataset + return_list = _deal_with_empty(return_list, properties, service) + if convert_type: + return_list = _cleanup_cols(return_list, service=service) + return_list = _arrange_cols(return_list, properties, output_id) + # Create metadata object from response + metadata = BaseMetadata(response) + return return_list, metadata + + +# def _get_description(service: str): +# tags = _get_collection().get("tags", []) +# for tag in tags: +# if tag.get("name") == service: +# return tag.get("description") +# return None + +# def _get_params(service: str): +# url = f"{_base_url()}collections/{service}/schema" +# resp = requests.get(url, headers=_default_headers()) +# resp.raise_for_status() +# properties = resp.json().get("properties", {}) +# return {k: v.get("description") for k, v in properties.items()} + +# def _get_collection(): +# url = f"{_base_url()}openapi?f=json" +# resp = requests.get(url, headers=_default_headers()) +# resp.raise_for_status() +# return resp.json() diff --git a/pyproject.toml b/pyproject.toml index a276f113..e55dc812 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -48,6 +48,10 @@ nldi = [ 'geopandas>=0.10' ] +waterdata = [ + 'geopandas>=0.10', +] + [project.urls] homepage = "https://github.com/DOI-USGS/dataretrieval-python" documentation = "https://doi-usgs.github.io/dataretrieval-python/" diff --git a/tests/nadp_test.py b/tests/nadp_test.py index 123e9e04..5d71b516 100644 --- a/tests/nadp_test.py +++ b/tests/nadp_test.py @@ -2,7 +2,7 @@ import os -import dataretrieval.nadp as nadp +from dataretrieval import nadp class TestMDNmap: diff --git a/tests/nldi_test.py b/tests/nldi_test.py index c4d6675f..9993a899 100644 --- a/tests/nldi_test.py +++ b/tests/nldi_test.py @@ -47,7 +47,7 @@ def test_get_basin(requests_mock): f"{NLDI_API_BASE_URL}/WQP/USGS-054279485/basin" f"?simplified=true&splitCatchment=false" ) - response_file_path = "data/nldi_get_basin.json" + response_file_path = "tests/data/nldi_get_basin.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -62,7 +62,7 @@ def test_get_flowlines(requests_mock): f"{NLDI_API_BASE_URL}/WQP/USGS-054279485/navigation/UM/flowlines" f"?distance=5&trimStart=false" ) - response_file_path = "data/nldi_get_flowlines.json" + response_file_path = "tests/data/nldi_get_flowlines.json" 
mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -78,7 +78,7 @@ def test_get_flowlines_by_comid(requests_mock): request_url = ( f"{NLDI_API_BASE_URL}/comid/13294314/navigation/UM/flowlines?distance=50" ) - response_file_path = "data/nldi_get_flowlines_by_comid.json" + response_file_path = "tests/data/nldi_get_flowlines_by_comid.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -94,7 +94,7 @@ def test_features_by_feature_source_with_navigation(requests_mock): request_url = ( f"{NLDI_API_BASE_URL}/WQP/USGS-054279485/navigation/UM/nwissite?distance=50" ) - response_file_path = "data/nldi_get_features_by_feature_source_with_nav_mode.json" + response_file_path = "tests/data/nldi_get_features_by_feature_source_with_nav_mode.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -115,7 +115,7 @@ def test_features_by_feature_source_without_navigation(requests_mock): """ request_url = f"{NLDI_API_BASE_URL}/WQP/USGS-054279485" response_file_path = ( - "data/nldi_get_features_by_feature_source_without_nav_mode.json" + "tests/data/nldi_get_features_by_feature_source_without_nav_mode.json" ) mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -128,7 +128,7 @@ def test_features_by_feature_source_without_navigation(requests_mock): def test_get_features_by_comid(requests_mock): """Tests NLDI get features query using comid as the origin""" request_url = f"{NLDI_API_BASE_URL}/comid/13294314/navigation/UM/WQP?distance=5" - response_file_path = "data/nldi_get_features_by_comid.json" + response_file_path = "tests/data/nldi_get_features_by_comid.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -144,7 +144,7 @@ def test_get_features_by_lat_long(requests_mock): request_url = ( f"{NLDI_API_BASE_URL}/comid/position?coords=POINT%28-89.509%2043.087%29" ) - response_file_path = "data/nldi_get_features_by_lat_long.json" + response_file_path = "tests/data/nldi_get_features_by_lat_long.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -156,7 +156,7 @@ def test_get_features_by_lat_long(requests_mock): def test_search_for_basin(requests_mock): """Tests NLDI search query for basin""" request_url = f"{NLDI_API_BASE_URL}/WQP/USGS-054279485/basin" - response_file_path = "data/nldi_get_basin.json" + response_file_path = "tests/data/nldi_get_basin.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -172,7 +172,7 @@ def test_search_for_basin(requests_mock): def test_search_for_flowlines(requests_mock): """Tests NLDI search query for flowlines""" request_url = f"{NLDI_API_BASE_URL}/WQP/USGS-054279485/navigation/UM/flowlines" - response_file_path = "data/nldi_get_flowlines.json" + response_file_path = "tests/data/nldi_get_flowlines.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -191,7 +191,7 @@ def test_search_for_flowlines(requests_mock): def test_search_for_flowlines_by_comid(requests_mock): """Tests NLDI search query for flowlines by comid""" request_url = f"{NLDI_API_BASE_URL}/comid/13294314/navigation/UM/flowlines" - response_file_path = "data/nldi_get_flowlines_by_comid.json" + response_file_path = "tests/data/nldi_get_flowlines_by_comid.json" 
mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -207,7 +207,7 @@ def test_search_for_features_by_feature_source_with_navigation(requests_mock): request_url = ( f"{NLDI_API_BASE_URL}/WQP/USGS-054279485/navigation/UM/nwissite?distance=50" ) - response_file_path = "data/nldi_get_features_by_feature_source_with_nav_mode.json" + response_file_path = "tests/data/nldi_get_features_by_feature_source_with_nav_mode.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -228,7 +228,7 @@ def test_search_for_features_by_feature_source_without_navigation(requests_mock) """Tests NLDI search query for features by feature source""" request_url = f"{NLDI_API_BASE_URL}/WQP/USGS-054279485" response_file_path = ( - "data/nldi_get_features_by_feature_source_without_nav_mode.json" + "tests/data/nldi_get_features_by_feature_source_without_nav_mode.json" ) mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -245,7 +245,7 @@ def test_search_for_features_by_feature_source_without_navigation(requests_mock) def test_search_for_features_by_comid(requests_mock): """Tests NLDI search query for features by comid""" request_url = f"{NLDI_API_BASE_URL}/comid/13294314/navigation/UM/WQP?distance=5" - response_file_path = "data/nldi_get_features_by_comid.json" + response_file_path = "tests/data/nldi_get_features_by_comid.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) @@ -267,7 +267,7 @@ def test_search_for_features_by_lat_long(requests_mock): request_url = ( f"{NLDI_API_BASE_URL}/comid/position?coords=POINT%28-89.509%2043.087%29" ) - response_file_path = "data/nldi_get_features_by_lat_long.json" + response_file_path = "tests/data/nldi_get_features_by_lat_long.json" mock_request_data_sources(requests_mock) mock_request(requests_mock, request_url, response_file_path) diff --git a/tests/utils_test.py b/tests/utils_test.py index a99f91e7..711e5886 100644 --- a/tests/utils_test.py +++ b/tests/utils_test.py @@ -4,8 +4,10 @@ import pytest -import dataretrieval.nwis as nwis -from dataretrieval import utils +from dataretrieval import ( + utils, + nwis +) class Test_query: diff --git a/tests/waterdata_test.py b/tests/waterdata_test.py index 50eefdc5..816bc112 100755 --- a/tests/waterdata_test.py +++ b/tests/waterdata_test.py @@ -1,13 +1,20 @@ import datetime - +import sys import pytest from pandas import DataFrame +if sys.version_info < (3, 10): + pytest.skip("Skip entire module on Python < 3.10", allow_module_level=True) + from dataretrieval.waterdata import ( _check_profiles, get_samples, - _SERVICES, - _PROFILES + get_daily, + get_monitoring_locations, + get_latest_continuous, + get_latest_daily, + get_field_measurements, + get_time_series_metadata, ) def mock_request(requests_mock, request_url, file_path): @@ -24,7 +31,7 @@ def test_mock_get_samples(requests_mock): "activityMediaName=Water&activityStartDateLower=2020-01-01" "&activityStartDateUpper=2024-12-31&monitoringLocationIdentifier=USGS-05406500&mimeType=text%2Fcsv" ) - response_file_path = "data/samples_results.txt" + response_file_path = "tests/data/samples_results.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_samples( service="results", @@ -105,3 +112,109 @@ def test_samples_organizations(): ) assert len(df) == 1 assert df.size == 3 + +def test_get_daily(): + df, md = get_daily( + 
+        monitoring_location_id="USGS-05427718",
+        parameter_code="00060",
+        time="2025-01-01/.."
+    )
+    assert "daily_id" in df.columns
+    assert "geometry" in df.columns
+    assert df.shape[1] == 12
+    assert df.parameter_code.unique().tolist() == ["00060"]
+    assert df.monitoring_location_id.unique().tolist() == ["USGS-05427718"]
+    assert df["time"].apply(lambda x: isinstance(x, datetime.date)).all()
+    assert hasattr(md, 'url')
+    assert hasattr(md, 'query_time')
+    assert df["value"].dtype == "float64"
+
+def test_get_daily_properties():
+    df, md = get_daily(
+        monitoring_location_id="USGS-05427718",
+        parameter_code="00060",
+        time="2025-01-01/..",
+        properties=[
+            "daily_id",
+            "monitoring_location_id",
+            "parameter_code",
+            "time",
+            "value",
+            "geometry",
+        ]
+    )
+    assert "daily_id" in df.columns
+    assert "geometry" in df.columns
+    assert df.shape[1] == 6
+    assert df.parameter_code.unique().tolist() == ["00060"]
+
+def test_get_daily_no_geometry():
+    df, md = get_daily(
+        monitoring_location_id="USGS-05427718",
+        parameter_code="00060",
+        time="2025-01-01/..",
+        skip_geometry=True
+    )
+    assert "geometry" not in df.columns
+    assert df.shape[1] == 11
+    assert isinstance(df, DataFrame)
+
+def test_get_monitoring_locations():
+    df, md = get_monitoring_locations(
+        state_name="Connecticut",
+        site_type_code="GW"
+    )
+    assert df.site_type_code.unique().tolist() == ["GW"]
+    assert hasattr(md, 'url')
+    assert hasattr(md, 'query_time')
+
+def test_get_monitoring_locations_hucs():
+    df, md = get_monitoring_locations(
+        hydrologic_unit_code=["010802050102", "010802050103"]
+    )
+    assert set(df.hydrologic_unit_code.unique().tolist()) == {
+        "010802050102",
+        "010802050103",
+    }
+
+def test_get_latest_continuous():
+    df, md = get_latest_continuous(
+        monitoring_location_id=["USGS-05427718", "USGS-05427719"],
+        parameter_code=["00060", "00065"]
+    )
+    assert "latest_continuous_id" in df.columns
+    assert df.shape[0] <= 4
+    assert df.statistic_id.unique().tolist() == ["00011"]
+    assert hasattr(md, 'url')
+    assert hasattr(md, 'query_time')
+    try:
+        datetime.datetime.strptime(df['time'].iloc[0], "%Y-%m-%dT%H:%M:%S+00:00")
+        out = True
+    except (TypeError, ValueError):
+        out = False
+    assert out
+
+def test_get_latest_daily():
+    df, md = get_latest_daily(
+        monitoring_location_id=["USGS-05427718", "USGS-05427719"],
+        parameter_code=["00060", "00065"]
+    )
+    assert "latest_daily_id" in df.columns
+    assert df.shape[1] == 12
+    assert hasattr(md, 'url')
+    assert hasattr(md, 'query_time')
+
+def test_get_field_measurements():
+    df, md = get_field_measurements(
+        monitoring_location_id="USGS-05427718",
+        unit_of_measure="ft^3/s",
+        time="2025-01-01/2025-10-01",
+        skip_geometry=True
+    )
+    assert "field_measurement_id" in df.columns
+    assert "geometry" not in df.columns
+    assert df.unit_of_measure.unique().tolist() == ["ft^3/s"]
+    assert hasattr(md, 'url')
+    assert hasattr(md, 'query_time')
+
+def test_get_time_series_metadata():
+    df, md = get_time_series_metadata(
+        bbox=[-89.840355, 42.853411, -88.818626, 43.422598],
+        parameter_code=["00060", "00065", "72019"],
+        skip_geometry=True
+    )
+    assert set(df['parameter_name'].unique().tolist()) == {
+        "Gage height",
+        "Water level, depth LSD",
+        "Discharge",
+    }
+    assert hasattr(md, 'url')
+    assert hasattr(md, 'query_time')
diff --git a/tests/waterservices_test.py b/tests/waterservices_test.py
index 19cc30fb..449650aa 100755
--- a/tests/waterservices_test.py
+++ b/tests/waterservices_test.py
@@ -93,7 +93,7 @@ def test_get_dv(requests_mock):
         "https://waterservices.usgs.gov/nwis/dv?format={}"
"&startDT=2020-02-14&endDT=2020-02-15&sites={}".format(format, site) ) - response_file_path = "data/waterservices_dv.txt" + response_file_path = "tests/data/waterservices_dv.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_dv( sites=["01491000", "01645000"], start="2020-02-14", end="2020-02-15" @@ -115,7 +115,7 @@ def test_get_dv_site_value_types(requests_mock, site_input_type_list): "https://waterservices.usgs.gov/nwis/dv?format={}" "&startDT=2020-02-14&endDT=2020-02-15&sites={}".format(_format, site) ) - response_file_path = "data/waterservices_dv.txt" + response_file_path = "tests/data/waterservices_dv.txt" mock_request(requests_mock, request_url, response_file_path) if site_input_type_list: sites = [site] @@ -136,7 +136,7 @@ def test_get_iv(requests_mock): "https://waterservices.usgs.gov/nwis/iv?format={}" "&startDT=2019-02-14&endDT=2020-02-15&sites={}".format(format, site) ) - response_file_path = "data/waterservices_iv.txt" + response_file_path = "tests/data/waterservices_iv.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_iv( sites=["01491000", "01645000"], start="2019-02-14", end="2020-02-15" @@ -158,7 +158,7 @@ def test_get_iv_site_value_types(requests_mock, site_input_type_list): "https://waterservices.usgs.gov/nwis/iv?format={}" "&startDT=2019-02-14&endDT=2020-02-15&sites={}".format(_format, site) ) - response_file_path = "data/waterservices_iv.txt" + response_file_path = "tests/data/waterservices_iv.txt" mock_request(requests_mock, request_url, response_file_path) if site_input_type_list: sites = [site] @@ -183,7 +183,7 @@ def test_get_info(requests_mock): request_url = "https://waterservices.usgs.gov/nwis/site?sites={}¶meterCd={}&siteOutput=Expanded&format={}".format( site, parameter_cd, format ) - response_file_path = "data/waterservices_site.txt" + response_file_path = "tests/data/waterservices_site.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_info(sites=["01491000", "01645000"], parameterCd="00618") if not isinstance(df, DataFrame): @@ -210,7 +210,7 @@ def test_get_gwlevels(requests_mock): "https://nwis.waterdata.usgs.gov/nwis/gwlevels?format={}&begin_date=1851-01-01" "&site_no={}".format(format, site) ) - response_file_path = "data/waterdata_gwlevels.txt" + response_file_path = "tests/data/waterdata_gwlevels.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_gwlevels(sites=site) if not isinstance(df, DataFrame): @@ -229,7 +229,7 @@ def test_get_gwlevels_site_value_types(requests_mock, site_input_type_list): "https://nwis.waterdata.usgs.gov/nwis/gwlevels?format={}&begin_date=1851-01-01" "&site_no={}".format(_format, site) ) - response_file_path = "data/waterdata_gwlevels.txt" + response_file_path = "tests/data/waterdata_gwlevels.txt" mock_request(requests_mock, request_url, response_file_path) if site_input_type_list: sites = [site] @@ -249,7 +249,7 @@ def test_get_discharge_peaks(requests_mock): "https://nwis.waterdata.usgs.gov/nwis/peaks?format={}&site_no={}" "&begin_date=2000-02-14&end_date=2020-02-15".format(format, site) ) - response_file_path = "data/waterservices_peaks.txt" + response_file_path = "tests/data/waterservices_peaks.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_discharge_peaks(sites=[site], start="2000-02-14", end="2020-02-15") if not isinstance(df, DataFrame): @@ -269,7 +269,7 @@ def test_get_discharge_peaks_sites_value_types(requests_mock, site_input_type_li 
"https://nwis.waterdata.usgs.gov/nwis/peaks?format={}&site_no={}" "&begin_date=2000-02-14&end_date=2020-02-15".format(_format, site) ) - response_file_path = "data/waterservices_peaks.txt" + response_file_path = "tests/data/waterservices_peaks.txt" mock_request(requests_mock, request_url, response_file_path) if site_input_type_list: sites = [site] @@ -292,7 +292,7 @@ def test_get_discharge_measurements(requests_mock): "https://nwis.waterdata.usgs.gov/nwis/measurements?site_no={}" "&begin_date=2000-02-14&end_date=2020-02-15&format={}".format(site, format) ) - response_file_path = "data/waterdata_measurements.txt" + response_file_path = "tests/data/waterdata_measurements.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_discharge_measurements( sites=[site], start="2000-02-14", end="2020-02-15" @@ -315,7 +315,7 @@ def test_get_discharge_measurements_sites_value_types( "https://nwis.waterdata.usgs.gov/nwis/measurements?site_no={}" "&begin_date=2000-02-14&end_date=2020-02-15&format={}".format(site, format) ) - response_file_path = "data/waterdata_measurements.txt" + response_file_path = "tests/data/waterdata_measurements.txt" mock_request(requests_mock, request_url, response_file_path) if site_input_type_list: sites = [site] @@ -334,7 +334,7 @@ def test_get_pmcodes(requests_mock): DataFrame""" format = "rdb" request_url = "https://help.waterdata.usgs.gov/code/parameter_cd_nm_query?fmt=rdb&parm_nm_cd=%2500618%25" - response_file_path = "data/waterdata_pmcodes.txt" + response_file_path = "tests/data/waterdata_pmcodes.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_pmcodes(parameterCd="00618") if not isinstance(df, DataFrame): @@ -352,7 +352,7 @@ def test_get_pmcodes_parameterCd_value_types( parameterCd = "00618" request_url = "https://help.waterdata.usgs.gov/code/parameter_cd_nm_query?fmt={}&parm_nm_cd=%25{}%25" request_url = request_url.format(_format, parameterCd) - response_file_path = "data/waterdata_pmcodes.txt" + response_file_path = "tests/data/waterdata_pmcodes.txt" mock_request(requests_mock, request_url, response_file_path) if parameterCd_input_type_list: parameterCd = [parameterCd] @@ -372,7 +372,7 @@ def test_get_water_use_national(requests_mock): "https://nwis.waterdata.usgs.gov/nwis/water_use?rdb_compression=value&format={}&wu_year=ALL" "&wu_category=ALL&wu_county=ALL".format(format) ) - response_file_path = "data/water_use_national.txt" + response_file_path = "tests/data/water_use_national.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_water_use() if not isinstance(df, DataFrame): @@ -390,7 +390,7 @@ def test_get_water_use_national_year_value_types(requests_mock, year_input_type_ "https://nwis.waterdata.usgs.gov/nwis/water_use?rdb_compression=value&format={}&wu_year=ALL" "&wu_category=ALL&wu_county=ALL".format(_format) ) - response_file_path = "data/water_use_national.txt" + response_file_path = "tests/data/water_use_national.txt" mock_request(requests_mock, request_url, response_file_path) if year_input_type_list: years = [year] @@ -412,7 +412,7 @@ def test_get_water_use_national_county_value_types( "https://nwis.waterdata.usgs.gov/nwis/water_use?rdb_compression=value&format={}&wu_year=ALL" "&wu_category=ALL&wu_county=ALL".format(_format) ) - response_file_path = "data/water_use_national.txt" + response_file_path = "tests/data/water_use_national.txt" mock_request(requests_mock, request_url, response_file_path) if county_input_type_list: counties = [county] @@ -435,7 +435,7 @@ def 
test_get_water_use_national_county_value_types( "https://nwis.waterdata.usgs.gov/nwis/water_use?rdb_compression=value&format={}&wu_year=ALL" "&wu_category=ALL&wu_county=ALL".format(_format) ) - response_file_path = "data/water_use_national.txt" + response_file_path = "tests/data/water_use_national.txt" mock_request(requests_mock, request_url, response_file_path) if category_input_type_list: categories = [category] @@ -455,7 +455,7 @@ def test_get_water_use_allegheny(requests_mock): "https://nwis.waterdata.usgs.gov/PA/nwis/water_use?rdb_compression=value&format=rdb&wu_year=ALL" "&wu_category=ALL&wu_county=003&wu_area=county" ) - response_file_path = "data/water_use_allegheny.txt" + response_file_path = "tests/data/water_use_allegheny.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_water_use(state="PA", counties="003") if not isinstance(df, DataFrame): @@ -481,7 +481,7 @@ def test_get_ratings(requests_mock): request_url = "https://nwis.waterdata.usgs.gov/nwisweb/get_ratings/?site_no={}&file_type=base".format( site ) - response_file_path = "data/waterservices_ratings.txt" + response_file_path = "tests/data/waterservices_ratings.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_ratings(site_no=site) if not isinstance(df, DataFrame): @@ -501,7 +501,7 @@ def test_what_sites(requests_mock): "https://waterservices.usgs.gov/nwis/site?bBox=-83.0%2C36.5%2C-81.0%2C38.5" "¶meterCd={}&hasDataTypeCd=dv&format={}".format(parameter_cd, format) ) - response_file_path = "data/nwis_sites.txt" + response_file_path = "tests/data/nwis_sites.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_sites( @@ -534,7 +534,7 @@ def test_get_stats(requests_mock): request_url = "https://waterservices.usgs.gov/nwis/stat?sites=01491000%2C01645000&format={}".format( format ) - response_file_path = "data/waterservices_stats.txt" + response_file_path = "tests/data/waterservices_stats.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_stats(sites=["01491000", "01645000"]) @@ -552,7 +552,7 @@ def test_get_stats_site_value_types(requests_mock, site_input_type_list): request_url = "https://waterservices.usgs.gov/nwis/stat?sites={}&format={}".format( site, _format ) - response_file_path = "data/waterservices_stats.txt" + response_file_path = "tests/data/waterservices_stats.txt" mock_request(requests_mock, request_url, response_file_path) if site_input_type_list: sites = [site] @@ -579,7 +579,7 @@ def assert_metadata(requests_mock, request_url, md, site, parameter_cd, format): site_request_url = ( "https://waterservices.usgs.gov/nwis/site?sites={}&format=rdb".format(site) ) - with open("data/waterservices_site.txt") as text: + with open("tests/data/waterservices_site.txt") as text: requests_mock.get(site_request_url, text=text.read()) site_info, _ = md.site_info if not isinstance(site_info, DataFrame): @@ -591,7 +591,7 @@ def assert_metadata(requests_mock, request_url, md, site, parameter_cd, format): pcode_request_url = "https://help.waterdata.usgs.gov/code/parameter_cd_nm_query?fmt=rdb&parm_nm_cd=%25{}%25".format( param ) - with open("data/waterdata_pmcodes.txt") as text: + with open("tests/data/waterdata_pmcodes.txt") as text: requests_mock.get(pcode_request_url, text=text.read()) variable_info, _ = md.variable_info assert type(variable_info) is DataFrame diff --git a/tests/wqp_test.py b/tests/wqp_test.py index acf48c36..f36558bc 100755 --- a/tests/wqp_test.py +++ b/tests/wqp_test.py @@ -24,7 +24,7 @@ def 
test_get_results(requests_mock): "&characteristicName=Specific+conductance&startDateLo=05-01-2011&startDateHi=09-30-2011" "&mimeType=csv" ) - response_file_path = "data/wqp_results.txt" + response_file_path = "tests/data/wqp_results.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_results( siteid="WIDNR_WQX-10032762", @@ -48,7 +48,7 @@ def test_get_results_WQX3(requests_mock): "&mimeType=csv" "&dataProfile=fullPhysChem" ) - response_file_path = "data/wqp3_results.txt" + response_file_path = "tests/data/wqp3_results.txt" mock_request(requests_mock, request_url, response_file_path) df, md = get_results( legacy=False, @@ -71,7 +71,7 @@ def test_what_sites(requests_mock): "https://www.waterqualitydata.us/data/Station/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_sites.txt" + response_file_path = "tests/data/wqp_sites.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_sites(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame @@ -88,7 +88,7 @@ def test_what_organizations(requests_mock): "https://www.waterqualitydata.us/data/Organization/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_organizations.txt" + response_file_path = "tests/data/wqp_organizations.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_organizations(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame @@ -105,7 +105,7 @@ def test_what_projects(requests_mock): "https://www.waterqualitydata.us/data/Project/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_projects.txt" + response_file_path = "tests/data/wqp_projects.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_projects(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame @@ -122,7 +122,7 @@ def test_what_activities(requests_mock): "https://www.waterqualitydata.us/data/Activity/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_activities.txt" + response_file_path = "tests/data/wqp_activities.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_activities(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame @@ -139,7 +139,7 @@ def test_what_detection_limits(requests_mock): "https://www.waterqualitydata.us/data/ResultDetectionQuantitationLimit/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_detection_limits.txt" + response_file_path = "tests/data/wqp_detection_limits.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_detection_limits(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame @@ -156,7 +156,7 @@ def test_what_habitat_metrics(requests_mock): "https://www.waterqualitydata.us/data/BiologicalMetric/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_habitat_metrics.txt" + response_file_path = "tests/data/wqp_habitat_metrics.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_habitat_metrics(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame @@ -173,7 +173,7 @@ def test_what_project_weights(requests_mock): 
"https://www.waterqualitydata.us/data/ProjectMonitoringLocationWeighting/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_project_weights.txt" + response_file_path = "tests/data/wqp_project_weights.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_project_weights(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame @@ -190,7 +190,7 @@ def test_what_activity_metrics(requests_mock): "https://www.waterqualitydata.us/data/ActivityMetric/Search?statecode=US%3A34&characteristicName=Chloride" "&mimeType=csv" ) - response_file_path = "data/wqp_activity_metrics.txt" + response_file_path = "tests/data/wqp_activity_metrics.txt" mock_request(requests_mock, request_url, response_file_path) df, md = what_activity_metrics(statecode="US:34", characteristicName="Chloride") assert type(df) is DataFrame