Add IC_SOILGRID_Utilities for SoilGrids Data Processing #3508

divine7022 · 2025-04-11T00:41:46Z

Description

This pull request introduces a new utility script, IC_SOILGRID_Utilities.R, that processes SoilGrids data to generate soil carbon initial condition (IC) files. The script provides functions for extracting and processing SoilGrids250m data and for generating ensemble members from it. It also implements a general `get.site.info` function, which extracts site information from several input formats.

New Functionality:

1) Implemented a general, reusable `get.site.info` function to standardize site information handling:

- Extracts data from Settings or MultiSettings objects, and also accepts CSV input.
- Validates required fields (site_id, lat, lon) and data types.
- Supports both single sites and vectorized site data.
- Includes coordinate validation with strict and lenient modes.
- Returns a consistent data frame (site_id, name, lat, lon, str_id).

Testing:
- Developed a test suite for `get.site.info` covering Settings objects, CSV files, MultiSettings objects, and vectorized site information.
- All tests pass.

2) `soilgrids_ic_process`:

- An end-to-end function that processes SoilGrids data into initial condition files: it extracts soil carbon data, handles missing values, generates ensemble members, and writes the output to NetCDF files.
- Supports input from both PEcAn settings lists and optional CSV files for site information.
- Includes parameters for output directory management, file overwriting, and verbose logging.

3) `preprocess_soilgrids_data`:

- A helper function that preprocesses the raw soil carbon data, handling missing values and checking data integrity.
- Implements logic for scaling and defaulting values where necessary.

4) `generate_soilgrids_ensemble`:

- Generates ensemble members for a given site from the processed soil carbon data.
- Uses a truncated normal distribution (no negative values).
- Sets the random seed from site_id so results are deterministic and reproducible.
- Ensemble size is configurable.
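
As an illustration of the ensemble step described above, here is a minimal sketch in base R. It is not the code in this PR. It draws from a normal distribution truncated at zero using the inverse CDF and derives the seed from the site ID. The names `seed_from_site_id`, `rtruncnorm_lower`, and `generate_ensemble_sketch` are hypothetical, and the string hash is only one assumed way to turn an ID into an integer seed.

```r
# Hypothetical helper: map a site ID (numeric or character) to an integer seed.
# A simple polynomial hash keeps the result inside the 32-bit integer range.
seed_from_site_id <- function(site_id) {
  h <- 0
  for (code in utf8ToInt(as.character(site_id))) {
    h <- (h * 31 + code) %% 2147483647
  }
  as.integer(h)
}

# Draw n values from a normal distribution truncated below at `lower`,
# by sampling uniformly on the allowed part of the CDF and inverting it.
# Assumes `lower` is not so far in the upper tail that pnorm() rounds to 1.
rtruncnorm_lower <- function(n, mean, sd, lower = 0) {
  p_lower <- stats::pnorm(lower, mean = mean, sd = sd)
  u <- stats::runif(n, min = p_lower, max = 1)
  # pmax() guards against tiny negative values from floating-point rounding
  pmax(stats::qnorm(u, mean = mean, sd = sd), lower)
}

# Assumed interface: one site, one mean and one SD, `size` ensemble members.
generate_ensemble_sketch <- function(site_id, mean_soilC, sd_soilC, size = 10) {
  # Missing inputs stay missing in this sketch
  if (is.na(mean_soilC) || is.na(sd_soilC)) {
    return(rep(NA_real_, size))
  }
  set.seed(seed_from_site_id(site_id))  # note: this resets the global RNG state
  rtruncnorm_lower(size, mean = mean_soilC, sd = sd_soilC)
}
```

Calling `generate_ensemble_sketch()` twice with the same site ID and inputs returns identical draws, which is the reproducibility property the description refers to.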

Motivation and Context

Review Time Estimate

Immediately
Within one week
When possible

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My change requires a change to the documentation.
My name is in the list of CITATION.cff
I agree that PEcAn Project may distribute my contribution under any or all of
- the same license as the existing code,
- and/or the BSD 3-clause license.
I have updated the CHANGELOG.md.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

divine7022 · 2025-04-11T01:04:33Z

Hi @dlebauer @infotroph
Could you please review this PR when you have a moment?

Thank you!

divine7022 · 2025-04-11T01:08:39Z

I'll add tests for IC_SOILGRID_Utilities after addressing your feedback. Currently, this PR only includes tests for get.site.info.

mdietze · 2025-04-11T11:18:37Z

modules/data.land/R/IC_SOILGRID_Utilities.R

+    # Sites still with missing data - use default value
+    still_missing <- is.na(processed$`Total_soilC_0-30cm`)
+    if (any(still_missing)) {
+      processed$`Total_soilC_0-30cm`[still_missing] <- default_soilC


Missing data should not be replaced with a default value

divine7022 · 2025-04-13

I'm considering the following options:

- Identify sites with missing data, log a warning, and filter those sites out completely.
- Keep missing data as NA and let downstream processes handle it.
- Implement a hierarchical approach that looks for data from similar sites or ecoregions when direct measurements are unavailable.

I'd appreciate your thoughts on these approaches.
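
For reference, the option of warning about and dropping sites with missing data could look roughly like the sketch below. The column name `Total_soilC_0-30cm` comes from the diff above. The function name `drop_missing_sites` and the `site_id` column are assumptions, and base `warning()` stands in for whatever logger the package actually uses.

```r
# Hypothetical helper: drop rows whose soil carbon value is missing and
# report which sites were removed, instead of filling in a default value.
drop_missing_sites <- function(soilc, column = "Total_soilC_0-30cm") {
  missing <- is.na(soilc[[column]])
  if (any(missing)) {
    warning(
      sprintf(
        "Dropping %d site(s) with missing %s: %s",
        sum(missing), column,
        paste(soilc$site_id[missing], collapse = ", ")
      ),
      call. = FALSE
    )
  }
  soilc[!missing, , drop = FALSE]
}
```

Keeping the drop explicit and logged means the caller can see exactly which sites were excluded.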

mdietze · 2025-04-11T11:21:06Z

modules/data.land/R/IC_SOILGRID_Utilities.R

+    # Sites with missing 0-30cm but available 0-200cm uncertainty data
+    has_200cm_data <- is.na(processed$`Std_soilC_0-30cm`) & !is.na(processed$`Std_soilC_0-200cm`)
+    if (any(has_200cm_data)) {
+      processed$`Std_soilC_0-30cm`[has_200cm_data] <- processed$`Std_soilC_0-200cm`[has_200cm_data] * 0.15


First, it's not clear this is a valid substitution. Second, it's not clear, either from the units or from basic knowledge of soil carbon depth distributions, that you can multiply the SD by 0.15.

divine7022 · 2025-04-13

The 0.15 factor was intended as a simple depth ratio (30 cm / 200 cm), assuming carbon is distributed uniformly with depth. However, I realize I should have considered that soil carbon typically decreases non-linearly with depth and varies by ecosystem type.

Would you recommend:

- Keeping the missing standard deviation values as NA?
- Implementing a more sophisticated approach based on the literature on soil carbon depth distributions?
- Using a different method to estimate uncertainty for these sites?

If you recommend a literature-based approach, are there specific models or papers that have been used successfully in other parts of the project that I should reference?
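
To make the depth-ratio concern concrete, here is a small sketch under a purely illustrative assumption: that carbon density declines exponentially with depth at a rate `k` per cm. Both the profile shape and the `k` values are hypothetical and are not taken from SoilGrids or from the literature. Under that assumption, the share of the 0-200 cm stock held in the top 30 cm equals 0.15 only in the limit of a uniform profile (`k` near 0) and is larger otherwise. The sketch concerns the mean stock only and says nothing about how a standard deviation should scale.

```r
# Fraction of the 0-`total` cm carbon stock found in the top `top` cm,
# assuming density proportional to exp(-k * depth). Integrating the density
# over each depth range gives (1 - exp(-k*top)) / (1 - exp(-k*total)).
top_fraction <- function(k, top = 30, total = 200) {
  # expm1() keeps precision when k is close to zero; the signs cancel
  expm1(-k * top) / expm1(-k * total)
}

# Illustrative decay rates only (per cm); the first approximates a uniform profile
k_values <- c(1e-8, 0.01, 0.02)
fractions <- vapply(k_values, top_fraction, numeric(1))
print(data.frame(k = k_values, top30_fraction = round(fractions, 3)))
```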

mdietze · 2025-04-11T11:38:33Z

Unclear where these functions fit in the larger workflow, as I don't see any changes to any of the ic process functions that would be calling this (and merging it with aboveground data). Would be good to check that any function, and function argument, naming conventions used here are consistent with conventions assumed within the larger ic process workflow to ensure interoperability and polymorphism.
I'd really like @Qianyuxuan and @DongchenZ to review this before it is approved as a lot of this feels redundant with all the existing SoilGrids code they have developed. It's not obvious to me how this work fits with prior SoilGrids code or with prior SOC IC code.
bookdown documentation not updated

divine7022 · 2025-04-17T04:40:55Z

Hello sir, I've addressed the previous concerns in my comments and would value your thoughts on the updated approach. Could you please take a look whenever you have a moment?
Thank you!

DongchenZ · 2025-04-22T00:30:35Z

base/settings/R/get.site.info.R

+#' # From CSV file
+#' site_info <- PEcAn.settings::get.site.info(csv_path = "sites.csv")
+#' }
+get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) {


You can reduce the number of arguments by taking a single path instead of settings + csv_path. To do so, you just need to detect whether the path extension is .xml or .csv, and read the settings inside the function if it ends with .xml.
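
A minimal sketch of that single-path idea follows, assuming a hypothetical wrapper named `read_site_source`. It uses `tools::file_ext()` to pick the reader. The `.xml` branch calls `PEcAn.settings::read.settings()`, so it only runs where PEcAn.settings is installed; the `.csv` branch uses base R only.

```r
# Hypothetical helper: choose how to read site information from the file
# extension, so callers pass one `path` instead of settings + csv_path.
read_site_source <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "csv") {
    # Plain table of sites, e.g. columns site_id, lat, lon
    utils::read.csv(path, stringsAsFactors = FALSE)
  } else if (ext == "xml") {
    # PEcAn settings file; evaluated only when an .xml path is supplied
    PEcAn.settings::read.settings(path)
  } else {
    stop("Unsupported file extension: '", ext, "'", call. = FALSE)
  }
}
```

One trade-off of this design is that callers who already hold a Settings object in memory would need to pass the object through some other argument or write it to disk first.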

DongchenZ · 2025-04-22T00:36:51Z

base/settings/R/get.site.info.R

+#' # From CSV file
+#' site_info <- PEcAn.settings::get.site.info(csv_path = "sites.csv")
+#' }
+get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) {


In general, I find this function too complicated. In my previous code, I usually use a single chunk of code to do the conversion, which I think might be useful in your case:

Suggested change

-get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) {
+# %>% is the magrittr pipe
+site_info <- settings$run %>%
+  purrr::map('site') %>%
+  purrr::map(function(site.list) {
+    # conversion from string to number
+    site.list$lat <- as.numeric(site.list$lat)
+    site.list$lon <- as.numeric(site.list$lon)
+    list(site.id = site.list$id, lat = site.list$lat, lon = site.list$lon, site_name = site.list$name)
+  }) %>%
+  dplyr::bind_rows() %>%
+  as.list()

dlebauer · 2025-08-04T21:44:22Z

@divine7022 could you please resolve conflicts?

@mdietze and @DongchenZ have all of your suggestions / requested changes been addressed?

infotroph

As requested in previous comments, please do not automatically fill the mean or other fixed default for sites with missing data. Users might choose to do that, but they need to know it's happening and not have it hidden inside the sampling process.

I'm also leery of the design of writing NC files with zeroes for wood and leaf C and would strongly prefer a workflow that prepares only the soil C data, with assembly saved for a later step when all needed datasets are available.

modules/data.land/R/IC_SOILGRID_Utilities.R

modules/data.land/NEWS.md

modules/data.land/R/IC_SOILGRID_Utilities.R

modules/data.land/R/ic_process.R

dlebauer

After reading through again, the remaining requested changes relate to handling missing data:

- For SoilGrids, don't fill missing data with the mean or other fixed defaults.
- For ICs not covered by SoilGrids, don't add 0s. The best fix is to exclude these variables and write out only the ones provided by SoilGrids (deferring the rest to #3603).

These have been addressed.

dlebauer

Yay! Thanks for the heroic effort of addressing all feedback and pushing this through!

Added SoilGrid IC Utilities

bb09c4b

github-actions bot added Tests Modules Base labels Apr 11, 2025

divine7022 changed the title ~~Add IC_SOILGRID_Utilities for SoilGrids Data Processing and Implement get.site.info Function~~ Add IC_SOILGRID_Utilities for SoilGrids Data Processing Apr 11, 2025

mdietze previously requested changes Apr 11, 2025

mdietze assigned Qianyuxuan and DongchenZ Apr 11, 2025

DongchenZ reviewed Apr 22, 2025

DongchenZ reviewed Apr 22, 2025

divine7022 added 4 commits June 5, 2025 08:49

refactor(soilgrids): improve soil carbon IC generation and uncertaint…

7a7c091

…y handling

simplified general function for site information extraction and valid…

c646fd1

…ation logic

updated .Rd and test for get.site.info

2f389ba

add the support of soilgrid source to ic_process

f90b881

dlebauer requested review from DongchenZ and mdietze August 4, 2025 21:42

divine7022 added 7 commits August 8, 2025 10:23

Merge remote-tracking branch 'origin/develop' into ic-generation

6863bd7

update CHANGELOG.md

c904b92

remove get.site.info from NAMESPACE

286f007

remove get.site.info() entry from NEWS.md

62676f6

remove get.site.info(), since soilgrid_ic_process() now uses direct s…

333c1f7

…ettings extraction

remove unused get.site.info.Rd

5160a7f

delete test.get.site.info.R

5b2e439

divine7022 added 4 commits August 12, 2025 02:42

remove dot notation to resolve CI global variable binding error

a4d3ae8

resolve package dependency and namespace issues

5f337e5

reduce import count by moving MASS to suggests to fix ci dependency l…

d3a8e9a

…imit

add MASS to docker dependency

6f90246

github-actions bot added the Dockerfile label Aug 12, 2025

divine7022 added 2 commits August 12, 2025 03:05

Merge remote-tracking branch 'origin/develop' into ic-generation

1304eb3

Revert "Enhance robustness and error handling in soilgrids_soilC_extr…

c4340f6

…act()" This reverts commit 51931a6.

infotroph previously requested changes Aug 13, 2025

divine7022 added 6 commits August 15, 2025 14:33

add R >= 4.1.0 dependency for base pipe support

5c1aaee

refactor and remove drop deterministic fill

a07b35a

update .Rd

ce87e2f

update preprocess_soilgrids_data.Rd

9c2b936

udpdate soilgrids_ic_process.Rd

01556b0

Merge remote-tracking branch 'origin/develop' into ic-generation

a703645

infotroph reviewed Aug 15, 2025

divine7022 added 2 commits August 15, 2025 23:58

remove all clamping logic and added pre-allocate where needed

4f4d621

move update to unreleased section

37b368e

dlebauer previously requested changes Aug 18, 2025

remove unintended changes to exclude from pr

94a8fee

dlebauer mentioned this pull request Aug 22, 2025

Feature: Multi-source Initial Conditions (IC) assembly in ic_process #3603

Open

divine7022 and others added 2 commits August 22, 2025 19:41

remove ic_process changes to handle in dedicated PR

279944a

Merge branch 'develop' into ic-generation

9ad1921

dlebauer enabled auto-merge August 27, 2025 16:09

dlebauer approved these changes Aug 27, 2025

dlebauer added this pull request to the merge queue Aug 27, 2025

Merged via the queue into PecanProject:develop with commit 7c50adc Aug 27, 2025
18 of 25 checks passed
