@divine7022
Collaborator

Description

This pull request introduces a new utility script, IC_SOILGRID_Utilities.R, that processes SoilGrids data to generate soil carbon initial condition (IC) files. The script provides functions for extracting and processing SoilGrids250m data and for generating ensemble members from it. It also adds a general get.site.info function, which extracts site information from several input formats.

New Functionality:

1) Implemented a general, reusable get.site.info function to standardize site information handling:

  • Extracts site data from Settings or MultiSettings objects, or from CSV input.
  • Validates required fields (site_id, lat, lon) and data types.
  • Supports both single sites and vectorized site data.
  • Includes coordinate validation with strict/lenient modes.
  • Returns a data frame with consistent columns (site_id, name, lat, lon, str_id).
  • Testing:
    • Developed a comprehensive test suite for the get.site.info function, covering various scenarios such as settings objects, CSV files, MultiSettings objects, and vectorized site information.
    • Ensured that all tests pass successfully, validating the correctness of the new functionality.
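
The validation behavior listed above could look roughly like the following. This is a minimal sketch, not the PR's implementation: the helper name validate_site_info is hypothetical, and the meaning of the two modes (strict stops on invalid coordinates, lenient warns and drops the affected sites) is an assumption.

```r
# Hypothetical helper, not the PR's get.site.info: validate a site table.
validate_site_info <- function(sites, strict = TRUE) {
  required <- c("site_id", "lat", "lon")
  missing_cols <- setdiff(required, names(sites))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }

  # Coerce coordinates to numeric; non-numeric strings become NA
  sites$lat <- suppressWarnings(as.numeric(sites$lat))
  sites$lon <- suppressWarnings(as.numeric(sites$lon))

  ok <- !is.na(sites$lat) & !is.na(sites$lon) &
    abs(sites$lat) <= 90 & abs(sites$lon) <= 180

  if (!all(ok)) {
    msg <- paste("Invalid coordinates for site(s):",
                 paste(sites$site_id[!ok], collapse = ", "))
    if (strict) stop(msg)  # assumed strict behavior: fail fast
    warning(msg)           # assumed lenient behavior: warn and drop
    sites <- sites[ok, , drop = FALSE]
  }
  sites
}

# Example input: site B has an out-of-range latitude, site C a bad longitude
sites <- data.frame(
  site_id = c("A", "B", "C"),
  lat = c("45.5", "95", "10"),
  lon = c("-90.1", "20", "not_a_number"),
  stringsAsFactors = FALSE
)
clean <- suppressWarnings(validate_site_info(sites, strict = FALSE))
```

In lenient mode only site A survives; calling the same helper with strict = TRUE stops with an error instead.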

2) soilgrids_ic_process:

  • An end-to-end function that processes SoilGrids data into initial condition files: it extracts soil carbon data, handles missing values, generates ensemble members, and writes the output to NetCDF files.
  • Supports input from both PEcAn settings lists and optional CSV files for site information.
  • Includes parameters for output directory management, file overwriting, and verbosity for detailed logging.

3) preprocess_soilgrids_data:

  • A helper function that preprocesses the raw soil carbon data, handling missing values and ensuring data integrity.
  • Implements logic for scaling and defaulting values where necessary.

4) generate_soilgrids_ensemble:

  • A function that generates ensemble members for a given site based on processed soil carbon data, ensuring reproducibility through random seed setting.
  • Uses a truncated normal distribution (no negative values).
  • Seed set by site_id for deterministic results.
  • Configurable ensemble size.
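
As an illustration of the sampling described above, here is a minimal base-R sketch. It is not the PR's code: the function name sample_soilc_ensemble, its arguments, and the way the seed is derived from site_id are all assumptions, and the truncation is done by inverse-CDF sampling rather than with any particular package.

```r
# Hypothetical sketch: draw non-negative ensemble members for one site.
sample_soilc_ensemble <- function(site_id, mean_c, sd_c, size = 10) {
  if (is.na(mean_c) || is.na(sd_c)) {
    # Missing inputs stay missing; nothing is filled in silently
    return(rep(NA_real_, size))
  }
  # Assumed seeding scheme: integer derived from the characters of the site ID
  # (note that set.seed changes the global RNG state)
  set.seed(sum(utf8ToInt(as.character(site_id))))

  # Inverse-CDF sampling from a normal distribution truncated below at zero
  p_lower <- stats::pnorm(0, mean = mean_c, sd = sd_c)
  u <- stats::runif(size, min = p_lower, max = 1)
  draws <- stats::qnorm(u, mean = mean_c, sd = sd_c)
  pmax(draws, 0)  # guard against tiny negative values from rounding
}

ens_a <- sample_soilc_ensemble("site_1", mean_c = 5, sd_c = 2, size = 20)
ens_b <- sample_soilc_ensemble("site_1", mean_c = 5, sd_c = 2, size = 20)
```

Because the seed depends only on the site ID, repeated calls for the same site return the same members, which is the reproducibility property the description refers to.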

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@divine7022 divine7022 changed the title from "Add IC_SOILGRID_Utilities for SoilGrids Data Processing and Implement get.site.info Function" to "Add IC_SOILGRID_Utilities for SoilGrids Data Processing" on Apr 11, 2025
@divine7022
Collaborator Author

Hi @dlebauer @infotroph
Could you please review this PR when you have a moment?

Thank you!

@divine7022
Collaborator Author

I'll add tests for IC_SOILGRID_Utilities after addressing your feedback. Currently, this PR only includes tests for get.site.info.

# Sites still with missing data - use default value
still_missing <- is.na(processed$`Total_soilC_0-30cm`)
if (any(still_missing)) {
  processed$`Total_soilC_0-30cm`[still_missing] <- default_soilC
Member

Missing data should not be replaced with a default value

Collaborator Author

I'm considering the following options:

  1. Identify sites with missing data, log a warning, and filter those sites out completely.
  2. Keep missing data as NA and let downstream processes handle it.
  3. Implement a hierarchical approach that looks for data from similar sites or ecoregions when direct measurements are unavailable.

I'd appreciate your thoughts on these approaches.
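
For concreteness, option 1 might look like the sketch below. The helper name drop_missing_soilc and the site_id column are assumptions for illustration; only the `Total_soilC_0-30cm` column name comes from the code under review, and the numbers are made up.

```r
# Hypothetical sketch of option 1: warn about, then drop, sites with no data.
drop_missing_soilc <- function(processed, column = "Total_soilC_0-30cm") {
  missing <- is.na(processed[[column]])
  if (any(missing)) {
    warning(sum(missing), " site(s) have no ", column,
            " value and were dropped: ",
            paste(processed$site_id[missing], collapse = ", "))
  }
  processed[!missing, , drop = FALSE]
}

# Made-up example values; check.names = FALSE keeps the hyphenated column name
processed <- data.frame(
  site_id = c("s1", "s2", "s3"),
  `Total_soilC_0-30cm` = c(4.2, NA, 6.8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
kept <- suppressWarnings(drop_missing_soilc(processed))
```

The warning names the dropped sites, so the removal is visible to the user rather than hidden inside the processing step.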

# Sites with missing 0-30cm but available 0-200cm uncertainty data
has_200cm_data <- is.na(processed$`Std_soilC_0-30cm`) & !is.na(processed$`Std_soilC_0-200cm`)
if (any(has_200cm_data)) {
  processed$`Std_soilC_0-30cm`[has_200cm_data] <- processed$`Std_soilC_0-200cm`[has_200cm_data] * 0.15
Member

First, it's not clear this is a valid substitution. Second, it's not clear, either from the units or from basic knowledge of soil carbon depth distributions, that you can multiply the SD by 0.15.

Collaborator Author

The 0.15 factor was intended as a simple depth ratio (30cm/200cm), assuming uniform carbon distribution with depth. However, I realize I should have considered that soil carbon typically decreases non-linearly with depth and varies by ecosystem type.

Would you recommend:

  1. Keeping the missing standard deviation values as NA?
  2. Implementing a more sophisticated approach based on literature about soil carbon depth distributions?
  3. Using a different method to estimate uncertainty for these sites?

If you recommend a literature-based approach, are there specific models or papers, used successfully elsewhere, that I should reference?
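
For reference, multiplying the SD by the depth ratio rests on the identity sd(a * X) = |a| * sd(X), which applies only if the 0-30 cm stock is exactly the fixed fraction a = 30/200 of the 0-200 cm stock at every site. The quick check below uses made-up values (not SoilGrids data) to show the identity under that assumption:

```r
# Made-up 0-200 cm stocks, for illustration only
set.seed(1)
c_200 <- rnorm(1000, mean = 20, sd = 5)

a <- 30 / 200              # depth ratio, 0.15
c_30_uniform <- a * c_200  # uniform-with-depth assumption

# Under exact proportionality the SD scales by the same factor
sd_ratio <- sd(c_30_uniform) / sd(c_200)
```

If the fraction of carbon in the top 30 cm differs among sites, the 0-30 cm stock is no longer a constant multiple of the 0-200 cm stock, and the SD ratio is not guaranteed to equal the depth ratio.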

@mdietze
Member

mdietze commented Apr 11, 2025

  1. Unclear where these functions fit in the larger workflow, as I don't see any changes to any of the ic process functions that would be calling this (and merging it with aboveground data). Would be good to check that any function, and function argument, naming conventions used here are consistent with conventions assumed within the larger ic process workflow to ensure interoperability and polymorphism.
  2. I'd really like @Qianyuxuan and @DongchenZ to review this before it is approved as a lot of this feels redundant with all the existing SoilGrids code they have developed. It's not obvious to me how this work fits with prior SoilGrids code or with prior SOC IC code.
  3. bookdown documentation not updated

@divine7022
Collaborator Author

Hello sir, I’ve addressed the previous concerns in my comments and would value your thoughts on the updated approach. Could you please take a look whenever you have a moment?
Thank you!

#' # From CSV file
#' site_info <- PEcAn.settings::get.site.info(csv_path = "sites.csv")
#' }
get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) {
Contributor

You can reduce the number of arguments by taking a single path instead of settings + csv_path. To do so, detect whether the path extension is .xml or .csv, and read the settings inside the function if it is .xml.
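
A minimal sketch of that dispatch, assuming a hypothetical helper named site_input_type; the readers mentioned in the comments are assumptions about how each branch would then be handled, not code from this PR:

```r
# Hypothetical sketch: decide how to read site info from a single path.
site_input_type <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    xml = "settings",  # assumed to be read via PEcAn.settings::read.settings(path)
    csv = "csv",       # assumed to be read via utils::read.csv(path)
    stop("Unsupported file extension: '", ext, "' (expected .xml or .csv)")
  )
}

xml_type <- site_input_type("pecan.xml")
csv_type <- site_input_type("sites.CSV")
```

Lower-casing the extension means sites.CSV and sites.csv are treated the same way, and any other extension fails with an explicit error.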

#' # From CSV file
#' site_info <- PEcAn.settings::get.site.info(csv_path = "sites.csv")
#' }
get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) {
Contributor

In general, I find this function too complicated. In my previous code, I usually use a single chunk of code to do the conversion, which might be useful in your case:

Suggested change
get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) {
  # Requires the magrittr pipe (%>%), e.g. via library(dplyr)
  site_info <- settings$run %>%
    purrr::map('site') %>%
    purrr::map(function(site.list) {
      # conversion from string to number
      site.list$lat <- as.numeric(site.list$lat)
      site.list$lon <- as.numeric(site.list$lon)
      list(site.id = site.list$id, lat = site.list$lat, lon = site.list$lon, site_name = site.list$name)
    }) %>%
    dplyr::bind_rows() %>%
    as.list()

@dlebauer dlebauer requested review from DongchenZ and mdietze August 4, 2025 21:42
@dlebauer
Member

dlebauer commented Aug 4, 2025

@divine7022 could you please resolve conflicts?

@mdietze and @DongchenZ have all of your suggestions / requested changes been addressed?

Member

@infotroph infotroph left a comment

As requested in previous comments, please do not automatically fill the mean or other fixed default for sites with missing data. Users might choose to do that, but they need to know it's happening and not have it hidden inside the sampling process.

I'm also leery of the design of writing NC files with zeroes for wood and leaf C and would strongly prefer a workflow that prepares only the soil C data, with assembly saved for a later step when all needed datasets are available.

Member

@dlebauer dlebauer left a comment

After reading through again, the remaining requested changes relate to handling missing data:

  • for soilgrids, don't fill missing data with mean or other fixed defaults
  • for IC not covered by soilgrids, don't add 0s. The best fix is to exclude these variables and write out only the ones provided by soilgrids (and defer to #3603)

@dlebauer dlebauer dismissed stale reviews from infotroph, mdietze, and themself August 27, 2025 16:09

have been addressed

@dlebauer dlebauer enabled auto-merge August 27, 2025 16:09
Member

@dlebauer dlebauer left a comment

Yay! Thanks for the heroic effort of addressing all feedback and pushing this through!

@dlebauer dlebauer added this pull request to the merge queue Aug 27, 2025
Merged via the queue into PecanProject:develop with commit 7c50adc Aug 27, 2025
18 of 25 checks passed

6 participants