-
Notifications
You must be signed in to change notification settings - Fork 287
Add IC_SOILGRID_Utilities for SoilGrids Data Processing #3508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @dlebauer @infotroph Thank you! |
|
I'll add tests for |
| # Sites still with missing data - use default value | ||
| still_missing <- is.na(processed$`Total_soilC_0-30cm`) | ||
| if (any(still_missing)) { | ||
| processed$`Total_soilC_0-30cm`[still_missing] <- default_soilC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing data should not be replaced with a default value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering the following options:
- Identify sites with missing data, log a warning about missing data, and filter out sites with missing data completely.
- Keep missing data as NA and let downstream processes handle it.
- Implementing a hierarchical approach where we try to find data from similar sites or ecoregions when direct measurements are unavailable.
I'd appreciate your thoughts on these approaches.
| # Sites with missing 0-30cm but available 0-200cm uncertainty data | ||
| has_200cm_data <- is.na(processed$`Std_soilC_0-30cm`) & !is.na(processed$`Std_soilC_0-200cm`) | ||
| if (any(has_200cm_data)) { | ||
| processed$`Std_soilC_0-30cm`[has_200cm_data] <- processed$`Std_soilC_0-200cm`[has_200cm_data] * 0.15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, it's not clear this is a valid substitution. Second, it's not clear, either from the units or from basic knowledge of soil carbon depth distributions, that you can multiply the SD by 0.15.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 0.15 factor was intended as a simple depth ratio (30cm/200cm), assuming uniform carbon distribution with depth. However, I realize I should have considered that soil carbon typically decreases non-linearly with depth and varies by ecosystem type.
Would you recommend:
- keeping the missing standard deviation values as
NA? - Implementing a more sophisticated approach based on literature about soil carbon depth distributions?
- Using a different method to estimate uncertainty for these sites?
If you recommend a literature-based approach, are there specific models or papers that have been used successfully in other parts that I should reference?
|
|
Hello sir, I’ve addressed the previous concerns in my comments and would value your thoughts on the updated approach. Could you please take a look whenever you have a moment. |
base/settings/R/get.site.info.R
Outdated
| #' # From CSV file | ||
| #' site_info <- PEcAn.settings::get.site.info(csv_path = "sites.csv") | ||
| #' } | ||
| get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can reduce the number of arguments by simply using the path instead of settings + csv_path. To do so, you just need to detect if the path extension is .xml or .csv. And read the settings inside the function if it ends with .xml.
base/settings/R/get.site.info.R
Outdated
| #' # From CSV file | ||
| #' site_info <- PEcAn.settings::get.site.info(csv_path = "sites.csv") | ||
| #' } | ||
| get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I found this function is too complicated. In my previous code, I usually just use a single chunck of code to do the conversion, which I think might be useful in your case:
| get.site.info <- function(settings = NULL, csv_path = NULL, strict_checking = TRUE) { | |
| site_info <- settings$run %>% | |
| purrr::map('site')%>% | |
| purrr::map(function(site.list){ | |
| #conversion from string to number | |
| site.list$lat <- as.numeric(site.list$lat) | |
| site.list$lon <- as.numeric(site.list$lon) | |
| list(site.id=site.list$id, lat=site.list$lat, lon=site.list$lon, site_name=site.list$name) | |
| }) %>% | |
| dplyr::bind_rows() %>% | |
| as.list()) |
|
@divine7022 could you please resolve conflicts? @mdietze and @DongchenZ have all of your suggestions / requested changes been addressed? |
…ettings extraction
infotroph
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As requested in previous comments, please do not automatically fill the mean or other fixed default for sites with missing data. Users might choose to do that, but they need to know it's happening and not have it hidden inside the sampling process.
I'm also leery of the design of writing NC files with zeroes for wood and leaf C and would strongly prefer a workflow that prepares only the soil C data, with assembly saved for a later step when all needed datasets are available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a reading through again, the remaining changes requested are related to handling missing data:
- for soilgrids, don't fill missing data with mean or other fixed defaults
- for IC not covered by soilgrids, don't add 0s. Best fix is to exclude these variables, just write out the ones provided by soilgrids (and defer to #3603)
have been addressed
dlebauer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay! Thanks for the heroic effort of addressing all feedback and pushing this through!
7c50adc
Description
This pull request introduces a new utility script, IC_SOILGRID_Utilities.R, designed to facilitate the processing of SoilGrids data for generating soil carbon initial condition (IC) files. The new script provides essential functions for extracting, processing, and generating ensemble members from SoilGrids250m data. Key changes include the implementation of general
get.site.infofunction, which extracts site information from various input formats.New Functionality:
1) Implemented general and reusable
get.site.infofunction to standardize site information handling:get.site.infofunction, covering various scenarios such as settings objects, CSV files, MultiSettings objects, and vectorized site information.2)
soilgrids_ic_process:3)
preprocess_soilgrids_data:4)
generate_soilgrids_ensemble:Motivation and Context
Review Time Estimate
Types of changes
Checklist: