Issue 106: analysis pipeline #107
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.

Walkthrough
The pull request introduces a restructured analysis pipeline that separates model execution from analysis workflows. It adds new exploratory data analysis visualisation functions, reorganises targets configuration into a config-driven structure, renames parameters in score conversion utilities, and updates global variable declarations to support the new analysis pipeline.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
_targets.R (1)
39-42: Invalid regex pattern in grep (same issue as in `_targets_model_run.R`). The pattern `*\\.R` is incorrect regex syntax. Consider extracting this shared logic into a helper function to avoid duplication and ensure consistency.
🔎 Proposed fix:
```diff
 # load target modules
 targets <- list.files(here("targets"), full.names = TRUE)
-targets <- grep("*\\.R", targets, value = TRUE)
+targets <- grep("\\.R$", targets, value = TRUE)
 purrr::walk(targets, source)
```
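If the same module-loading logic lives in both pipeline files, the reviewer's suggested helper could be sketched like this (the name `source_target_modules` is hypothetical):

```r
# Hypothetical shared helper: source every .R file under targets/ once
source_target_modules <- function(dir = "targets") {
  files <- list.files(here::here(dir), pattern = "\\.R$", full.names = TRUE)
  purrr::walk(files, source)
}
```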
🤖 Fix all issues with AI Agents
In @_targets_model_run.R:
- Around line 34-37: The grep call uses an invalid regex "*\\.R" which can fail;
update the selection to a correct regex (e.g., use grep("\\.R$", targets, value
= TRUE)) or better yet use list.files(..., pattern = "\\.R$", full.names = TRUE)
to only return .R files; adjust the existing targets variable usage (the grep
call that assigns to targets) so purrr::walk(targets, source) receives the
proper .R file paths.
- Around line 31-32: n_workers can become zero when future::availableCores() <
4; update the n_workers assignment so it never drops below 1 (compute the
original floor(as.integer(availableCores()/4)) and then enforce a minimum of 1)
and keep using plan(multisession, workers = n_workers) so the multisession plan
always has at least one worker.
In @R/EDA_plots.R:
- Around line 290-322: The roxygen import is wrong and the pivot keys are
inconsistent: update the @importFrom line to include the ggplot2 geoms actually
used (geom_point and geom_line) instead of geom_bar, and fix the pivot by
ensuring the id columns in pivot_wider match the grouping from summarise_scores
— either include "hosp_data_real_time" in id_cols or remove it from the
summarise_scores by= vector; locate get_scatterplot_scores, the summarise_scores
call and the pivot_wider call and make the id_cols =
c("forecast_date","location","hosp_data_real_time") (or drop hosp_data_real_time
from by=) and change the roxygen to @importFrom ggplot2 geom_point geom_line so
the imports and pivot behavior align.
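A sketch of the aligned version the prompt describes; the `names_from`/`values_from` columns are assumptions, since only the id columns and the geoms are named in the prompt:

```r
#' @importFrom ggplot2 ggplot aes geom_point geom_line
# Inside get_scatterplot_scores(), keep id_cols in sync with summarise_scores(by = ...):
scores_wide <- scores_summarised |>
  tidyr::pivot_wider(
    id_cols = c("forecast_date", "location", "hosp_data_real_time"),
    names_from = include_ww, # assumed pivot key
    values_from = wis # assumed value column
  )
```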
In @R/prep_scores_to_model.R:
- Around line 1-9: prep_scores_to_model currently returns NULL and breaks
downstream targets; implement it to pivot scores_long from long to wide (e.g.,
each sample/feature combination becomes a row with score columns) and then join
ww_metadata by the appropriate key (sample id or site id used in scores_long)
before returning a cleaned data.frame/tibble; update the function
prep_scores_to_model to accept scores_long and ww_metadata, use pivot_wider (or
spread) on the score name column and values column, perform a left_join with
ww_metadata using the matching id column present in both, ensure column names
are consistent with the caller scores_to_model in targets/run_gam_targets.R, and
return the resulting joined table instead of NULL.
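A minimal sketch of the shape the prompt describes; the join keys and the long-format column names (`metric`, `value`) are assumptions, not taken from the repository:

```r
# Sketch only: pivot long scores to wide, then attach wastewater metadata.
prep_scores_to_model <- function(scores_long, ww_metadata) {
  scores_wide <- scores_long |>
    tidyr::pivot_wider(names_from = metric, values_from = value)
  dplyr::left_join(scores_wide, ww_metadata,
    by = c("location", "forecast_date")
  )
}
```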
In @scratch/explore_results.R:
- Line 5: The summarise call computes wis with mean(wis) which will return NA if
any wis values are missing; update the aggregation in the summarise step (the
summarise function that defines wis) to call mean(wis, na.rm = TRUE) so missing
values are ignored and downstream plotting won't fail.
- Line 15: The division uses a hardcoded index scores_overall$wis[2], which is
brittle; update the calculation of rwis to use a named baseline lookup instead
(e.g., find the row in scores_overall whose model or baseline column equals the
intended baseline identifier and use its wis value) so that rwis <-
scores_overall$wis / <baseline_wis> uses a stable reference (referencing
scores_overall and the column name used for model/baseline to locate the
baseline row instead of index 2).
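A sketch of the named-baseline lookup from the second prompt; the `"baseline"` label and the `model` column are assumptions:

```r
# Look the baseline up by name rather than by a brittle positional index
baseline_wis <- scores_overall$wis[scores_overall$model == "baseline"]
rwis <- scores_overall$wis / baseline_wis
```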
In @targets/run_gam_targets.R:
- Around line 10-16: The tar target scores_to_model is broken because it calls
prep_scores_to_model which is currently a stub returning NULL and it references
a misspelled dependency ww_metadata_table_combines; change the target call to
use the correct ww_metadata_table_combined and implement prep_scores_to_model
(in R/prep_scores_to_model.R) to accept scores_long and ww_metadata, validate
inputs, join/transform the data into a non-NULL tidy data.frame or list suitable
for downstream modeling (e.g., merged scores with metadata and any necessary
cleaning), and return that object; ensure the function name prep_scores_to_model
and the target scores_to_model remain unchanged so the pipeline wiring stays
consistent.
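For the typo alone, the corrected wiring would look like the following (the function body itself still needs an implementation):

```r
tar_target(
  name = scores_to_model,
  command = prep_scores_to_model(
    scores_long = scores,
    ww_metadata = ww_metadata_table_combined # was: ww_metadata_table_combines
  )
)
```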
🧹 Nitpick comments (4)
scratch/explore_results.R (2)
10-13: Remove `position = "stack"` for single-value bars. The `position = "stack"` parameter is redundant when each bar represents a single value (one `wis` per `model_ww`). Use `position = "dodge"` if multiple aesthetics map to the same x-value, or remove the parameter entirely.
🔎 Proposed fix:
```diff
 ggplot(scores_overall) +
   geom_bar(aes(x = model_ww, y = wis, fill = model_ww),
-    stat = "identity", position = "stack"
+    stat = "identity"
   )
```
3-30: Consider extracting the repeated aggregation pattern into a helper function. The same aggregation pattern (group by model/include_ww/additional dimensions, summarise wis, add model_ww label) is repeated three times. Extracting this into a helper function would reduce duplication and improve maintainability.
🔎 Example refactor
```r
aggregate_scores <- function(scores, ...) {
  scores |>
    group_by(model, include_ww, ...) |>
    summarise(wis = mean(wis, na.rm = TRUE), .groups = "drop") |>
    mutate(model_ww = glue::glue("{model}-{include_ww}"))
}

scores_overall <- aggregate_scores(scores)
scores_by_date <- aggregate_scores(scores, forecast_date)
scores_by_loc <- aggregate_scores(scores, location)
```

targets/analysis_config_targets.R (1)
13-16: Consider validating that the scores file exists. The `scores_fp` target defines a hardcoded path to the scores CSV file. According to the PR description, this pipeline relies on locally saved outputs in `output/`. If the file does not exist, downstream targets that depend on `scores_fp` may fail. Consider adding validation to check file existence or documenting the prerequisite that model runs must be completed first.
🔎 Proposed validation approach:
```diff
 tar_target(
   name = scores_fp,
-  command = file.path("output", "overall_data_all_runs", "scores.csv")
+  command = {
+    fp <- file.path("output", "overall_data_all_runs", "scores.csv")
+    if (!file.exists(fp)) {
+      stop("Scores file not found. Please run the model pipeline first.")
+    }
+    fp
+  }
 )
```

_targets.R (1)
82-97: Placeholder comments for incomplete features. The comments at lines 84, 86, and 92-97 contain placeholders (empty parentheses, question marks, and TODOs for future figures). This aligns with the PR's stated remaining tasks. Consider converting these to proper TODO comments (e.g., `# TODO: compute coverage metrics`) for easier tracking.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- R/EDA_plots.R
- R/convert_to_su_object.R
- R/globals.R
- R/prep_scores_to_model.R
- _targets.R
- _targets_model_run.R
- man/convert_to_su_object.Rd
- man/get_scatterplot_scores.Rd
- scratch/explore_results.R
- targets/analysis_EDA_plot_targets.R
- targets/analysis_config_targets.R
- targets/get_metadata_targets.R
- targets/run_gam_targets.R
🔇 Additional comments (16)
man/get_scatterplot_scores.Rd (1)
1-17: LGTM! The documentation structure is standard and correctly describes the function interface. As this is a Roxygen2-generated file, ensure the source documentation in `R/EDA_plots.R` is kept up to date.

targets/run_gam_targets.R (1)
1-9: LGTM! The pipeline structure correctly chains reading raw scores and converting them to a scoring utilities object. The use of `tar_target` with explicit names and commands follows targets best practices. The `scores_fp` variable is properly defined in `analysis_config_targets.R` as a tar_target that generates the file path to scores.csv.

scratch/explore_results.R (1)
1-1: No action required; the file path in explore_results.R is correct. The path `output/overall_data/scores.csv` matches what the pipeline's scoring step actually produces (as defined in `targets/scoring_targets.R` line 48), so the script will read from the correct location. Although the analysis configuration in `targets/analysis_config_targets.R` references `output/overall_data_all_runs/scores.csv`, that represents a separate expectation; explore_results.R is an exploratory script correctly pointing to the scoring pipeline's actual output location.
Likely an incorrect or invalid review comment.
targets/analysis_EDA_plot_targets.R (1)
1-10: LGTM! The EDA plotting targets are well-structured with clear dependencies on the `scores` target from the pipeline. Both plotting functions (`get_plot_scores_by_date` and `get_scatterplot_scores`) are properly defined in R/EDA_plots.R, and the target is correctly configured in the secondary outputs. The naming convention is consistent and the pipeline flow is logical.

R/EDA_plots.R (1)
269-275: Excellent refactoring of variable name. The variable rename from `scores_by_loc` to `scores_by_date` correctly reflects that the data is aggregated by `forecast_date` rather than by location, improving code clarity.

man/convert_to_su_object.Rd (1)
1-17: LGTM! The documentation correctly reflects the parameter rename from `scores_data` to `scores_raw`, maintaining consistency with the function implementation.

targets/get_metadata_targets.R (2)
8-17: LGTM! The updates to use `ww_data_post` instead of `ww_data` correctly align with the new analysis pipeline structure where `ww_data_post` is defined in `analysis_config_targets.R`.
22-23: The use of `distinct()` is correct as-is. The duplicate rows are expected and legitimate: multiple scenarios with different `include_ww`, `model`, and `hosp_data_real_time` values map to the same (location, forecast_date) pair, resulting in identical metadata calculations. Since `calculate_ww_metadata_table()` returns one row per (forecast_date, location_abbr, location_name) combination and all columns derive from this grouping, using `distinct()` without specifying key columns is appropriate; any duplicates will be identical across all columns.
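As a toy illustration of why a bare `distinct()` is safe here: rows that agree on every column collapse to one, and nothing else changes.

```r
library(dplyr)
df <- tibble::tibble(
  location = c("ak", "ak", "al"),
  forecast_date = as.Date(c("2025-01-06", "2025-01-06", "2025-01-06"))
)
distinct(df) # two rows: the duplicated ("ak", 2025-01-06) row collapses to one
```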
targets/analysis_config_targets.R (1)
2-12: No action required. The `scenarios` object is properly defined in `targets/create_permutations_targets.R` (lines 122-137) as a `tar_group_by()` target that creates a grouped dataset with all required fields: `location_name`, `location_abbr`, and `forecast_date`. The pipeline execution order in `_targets.R` ensures that `create_permutations_targets` is evaluated before `analysis_config_targets`, so the `ww_data_post` target's dependency on `scenarios` is correctly satisfied.

R/convert_to_su_object.R (1)
3-21: Parameter and variable naming improvements look good. The parameter rename from `scores_data` to `scores_raw` improves clarity by indicating raw input data. The approach of setting the "metrics" attribute on the scores object is appropriate; the metrics attribute order does not affect functionality in the scoringutils package.

R/globals.R (1)
34-40: LGTM! The new global variable annotations follow the established pattern in this file and correctly document the variables used by `get_scatterplot_scores`. The duplicates with other sections are acceptable as they serve per-function documentation purposes.

_targets_model_run.R (2)
39-60: LGTM! The `tar_option_set` configuration is well-structured with sensible defaults for worker-based storage, transient memory, and parquet format.
62-100: LGTM! The pipeline structure is well-organised with clear separation between setup, data loading, model fitting, and scoring stages. The modular approach using sourced target definitions promotes maintainability.
_targets.R (3)
1-10: LGTM! The header documentation clearly explains the pipeline's purpose and its expected input structure from the `output/` folder. This is helpful for users running the analysis pipeline.
67-75: LGTM! The `analysis_config` section clearly combines permutation and configuration targets with helpful inline comments explaining each component's purpose.
99-104: LGTM! The final pipeline structure is well-organised and clearly separates configuration, metadata extraction, secondary computations, and plotting into distinct logical groups.
```r
prep_scores_to_model <- function(scores_long,
                                 ww_metadata) {
  # Pivot scores from long to wide

  # Join wastewater metadata

  # Placeholder
  return(NULL)
}
```
Implement the placeholder function before merging.
This function is called by scores_to_model in targets/run_gam_targets.R but currently returns NULL, which will break the downstream pipeline. The PR description mentions remaining tasks, but this function should be implemented or the calling target should be removed before merging.
Would you like me to generate a basic implementation template for pivoting scores and joining metadata, or should this target be temporarily removed from the pipeline?
🤖 Prompt for AI Agents
In @R/prep_scores_to_model.R around lines 1 - 9, prep_scores_to_model currently
returns NULL and breaks downstream targets; implement it to pivot scores_long
from long to wide (e.g., each sample/feature combination becomes a row with
score columns) and then join ww_metadata by the appropriate key (sample id or
site id used in scores_long) before returning a cleaned data.frame/tibble;
update the function prep_scores_to_model to accept scores_long and ww_metadata,
use pivot_wider (or spread) on the score name column and values column, perform
a left_join with ww_metadata using the matching id column present in both,
ensure column names are consistent with the caller scores_to_model in
targets/run_gam_targets.R, and return the resulting joined table instead of
NULL.
I am going to ignore this for a subsequent PR
@kaitejohnson, understood! Would you like me to open a follow-up issue to track the implementation of prep_scores_to_model?
```r
scores_overall <- scores |>
  group_by(model, include_ww) |>
  summarise(wis = mean(wis)) |>
```
🛠️ Refactor suggestion | 🟠 Major
Add `na.rm = TRUE` to the `mean()` call.
If the `wis` column contains missing values, the aggregation will return `NA` for all groups, causing downstream plotting to fail.
🔎 Proposed fix
```diff
- summarise(wis = mean(wis)) |>
+ summarise(wis = mean(wis, na.rm = TRUE)) |>
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```r
summarise(wis = mean(wis, na.rm = TRUE)) |>
```
🤖 Prompt for AI Agents
In @scratch/explore_results.R at line 5, The summarise call computes wis with
mean(wis) which will return NA if any wis values are missing; update the
aggregation in the summarise step (the summarise function that defines wis) to
call mean(wis, na.rm = TRUE) so missing values are ignored and downstream
plotting won't fail.
also ignoring stuff in scratch
@kaitejohnson, understood! Since scratch/ contains exploratory code, it makes sense to skip detailed reviews there.
If you'd like to prevent automated reviews on the scratch/ directory in future PRs, you can configure CodeRabbit to exclude it by adding a path_filters configuration in your .coderabbit.yaml file. For example:
```yaml
reviews:
  path_filters:
    - "!scratch/**"
```

This will exclude all files under `scratch/` from automated reviews.
```r
tar_target(
  name = scores_to_model,
  command = prep_scores_to_model(
    scores_long = scores,
    ww_metadata = ww_metadata_table_combines
  )
)
```
🧩 Analysis chain
🏁 Script executed:

```sh
# Check for prep_scores_to_model implementation and its return value
find . -name "prep_scores_to_model.R" -o -name "*prep_scores*" | head -20
```

Repository: epiforecasts/evalwwforecasts
Length of output: 98

🏁 Script executed:

```sh
# Verify ww_metadata_table_combines is defined as a target
rg -n "ww_metadata_table_combines" targets/
```

Repository: epiforecasts/evalwwforecasts
Length of output: 147

🏁 Script executed:

```sh
# Look for the function definition in R directory
rg -n "prep_scores_to_model" R/ --max-count 5
```

Repository: epiforecasts/evalwwforecasts
Length of output: 144

🏁 Script executed:

```sh
# Read the prep_scores_to_model function to check return value
cat -n R/prep_scores_to_model.R
```

Repository: epiforecasts/evalwwforecasts
Length of output: 327

🏁 Script executed:

```sh
# Search more broadly for ww_metadata_table_combines definition
rg -n "ww_metadata_table_combines\s*=" targets/ -A 5
```

Repository: epiforecasts/evalwwforecasts
Length of output: 54

🏁 Script executed:

```sh
# Check if it's defined in other files
rg -n "ww_metadata_table_combines\s*=" . --include="*.R" -A 3
```

Repository: epiforecasts/evalwwforecasts
Length of output: 513

🏁 Script executed:

```sh
# Search entire repo for ww_metadata_table_combines with corrected syntax
rg "ww_metadata_table_combines" . -A 2
```

Repository: epiforecasts/evalwwforecasts
Length of output: 212

🏁 Script executed:

```sh
# Search for similar variable names that might be the intended target
rg "ww_metadata" targets/ -n
```

Repository: epiforecasts/evalwwforecasts
Length of output: 543

🏁 Script executed:

```sh
# List all tar_target definitions to see what targets are actually defined
rg "tar_target\(" targets/ -A 2 | head -60
```

Repository: epiforecasts/evalwwforecasts
Length of output: 2691

🏁 Script executed:

```sh
# Final check: search entire repo for any other references to ww_metadata_table_combines
rg "ww_metadata_table_combines" . --count
```

Repository: epiforecasts/evalwwforecasts
Length of output: 101

🏁 Script executed:

```sh
# Verify the exact name of the defined target
rg "name = ww_metadata_table_combined" targets/ -A 3
```

Repository: epiforecasts/evalwwforecasts
Length of output: 319
Fix target name typo and implement prep_scores_to_model function.
Two critical issues prevent this target from functioning:
- `prep_scores_to_model` returns `NULL` (R/prep_scores_to_model.R) — the function is a placeholder with no actual implementation.
- The target parameter references `ww_metadata_table_combines`, but the actual defined target is `ww_metadata_table_combined` — this typo will cause a missing dependency error at runtime.

Update the target to use the correct parameter name and implement the function to return usable output.
🤖 Prompt for AI Agents
In @targets/run_gam_targets.R around lines 10 - 16, The tar target
scores_to_model is broken because it calls prep_scores_to_model which is
currently a stub returning NULL and it references a misspelled dependency
ww_metadata_table_combines; change the target call to use the correct
ww_metadata_table_combined and implement prep_scores_to_model (in
R/prep_scores_to_model.R) to accept scores_long and ww_metadata, validate
inputs, join/transform the data into a non-NULL tidy data.frame or list suitable
for downstream modeling (e.g., merged scores with metadata and any necessary
cleaning), and return that object; ensure the function name prep_scores_to_model
and the target scores_to_model remain unchanged so the pipeline wiring stays
consistent.
This is a placeholder function to be filled in!
@kaitejohnson, understood! Thanks for clarifying that the placeholder function is intentional.
Just a heads up: there's still the typo issue on line 14 — `ww_metadata = ww_metadata_table_combines` should be `ww_metadata = ww_metadata_table_combined` (missing the `d` at the end). That typo will cause a dependency error when targets tries to run this pipeline, so you'll want to fix that before executing the workflow.
ciaramccarthy1 left a comment:
This worked for me after adding the brackets around TRUE and FALSE in the get_scatterplot_scores function. I noticed that the forecast dates in create_permutations_targets don't go up to 2025-07-07 (as it says they should do in the comment on lines 26-27) - not sure if this is intentional? Also, what are you thinking re coderabbit's suggestions?
R/EDA_plots.R (Outdated)
```r
ww_plus_hosp = TRUE,
hosp_only = FALSE
```
I added brackets here to avoid this error: `Error in rename(): ! Can't rename columns with TRUE. ✖ TRUE must be numeric or character, not TRUE.`
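For reference, a sketch of that fix; it assumes the `TRUE`/`FALSE` columns come from pivoting on `include_ww`, and uses backticks so `rename()` reads them as column names rather than logicals:

```r
scores_wide |>
  dplyr::rename(
    ww_plus_hosp = `TRUE`,
    hosp_only = `FALSE`
  )
```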
Ooh ok, I can't remember why, but I think this should only go to June 30th, 2025 -- that's all that I have outputs for! Have gone through the coderabbit suggestions, fixed the rename, and fixed some bugs in the file paths!
Re the metadata table @kaitejohnson, the analysis pipeline is running much more quickly now, so I think it's okay to leave it as is (versus having that in a separate pipeline).
This PR addresses #106. It does the following:
- Renames the existing `_targets.R` file to `_targets_model_run.R`
- Creates a new `_targets.R` file which will just rely on the locally saved outputs in `output/` to generate figures and analyse the forecast performance across the locations and forecast dates for which the model has already been run
- Keeps the metadata table in the `_targets.R` pipeline. @ciaramccarthy1 Let me know what you think of this; alternatively, we just make it its own separate pipeline so that we can edit it to add things if needed but don't need to rerun it if we ever need to wipe the analysis targets clean

Still to do (could be in a separate PR)
Suggestions for testing:
Once the output data is in the correct folder in `outputs/`, run `tar_make()` and ensure that `tar_load(plot_scores_by_date)` and `tar_load(scatterplot_scores)` generate figures.
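A minimal sketch of that check from the repository root, assuming the targets package is installed:

```r
library(targets)
tar_make()                    # build the analysis pipeline
tar_load(plot_scores_by_date) # loads the figure object into the session
tar_load(scatterplot_scores)
print(plot_scores_by_date)    # render to confirm the figures were generated
print(scatterplot_scores)
```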
Summary by CodeRabbit
Release Notes: New Features, Documentation, Refactor