Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion R/utils_groupcomparison.R
Original file line number Diff line number Diff line change
Expand Up @@ -140,7 +140,8 @@ getSamplesInfo = function(summarization_output) {
is_single_subject, has_tech_replicates) {

if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
weight_input = 1/input$Variance
unnormalized_weight_input = 1/input$Variance
weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
} else {
weight_input = NULL
}
Comment on lines 142 to 147
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n R/utils_groupcomparison.R | head -200

Repository: Vitek-Lab/MSstats

Length of output: 8797


🏁 Script executed:

# Search for where Variance column is created/assigned
rg -n "Variance\s*=" --type r -B 3 -A 1 | head -100

Repository: Vitek-Lab/MSstats

Length of output: 1999


🏁 Script executed:

# Read the calculateProteinVariance function
cat -n R/utils_feature_selection.R | sed -n '246,300p'

Repository: Vitek-Lab/MSstats

Length of output: 2304


🏁 Script executed:

# Search for where Variance is assigned to ProteinLevelData or similar
rg -n "ProteinLevelData.*Variance|Variance.*ProteinLevel" --type r -B 2 -A 2

Repository: Vitek-Lab/MSstats

Length of output: 1616


🏁 Script executed:

# Search more broadly for Variance creation in dataProcess pipeline
rg -n "var_resid|Variance" --type r R/ | grep -E "(var_resid|Variance\s*=|Variance\$)" | head -50

Repository: Vitek-Lab/MSstats

Length of output: 745


🏁 Script executed:

# Search for where var_resid is renamed to Variance
rg -n "var_resid.*Variance|Variance.*var_resid|setnames.*[Vv]ariance" --type r -B 1 -A 1

Repository: Vitek-Lab/MSstats

Length of output: 309


🏁 Script executed:

# Look for where var_resid might be renamed to Variance by checking column selection/renaming
rg -n "Variance|var_resid" --type r R/dataProcess.R | head -50

Repository: Vitek-Lab/MSstats

Length of output: 43


🏁 Script executed:

# Find all R files and see what's the main entry point
fd "\.R$" R/ | head -20

Repository: Vitek-Lab/MSstats

Length of output: 515


🏁 Script executed:

# Check dataProcess.R for Variance column handling
cat -n R/dataProcess.R | grep -i "variance\|var_resid" -B 2 -A 2

Repository: Vitek-Lab/MSstats

Length of output: 3315


🏁 Script executed:

# Look for the actual output structure - search for ProteinLevelData creation
rg -n "ProteinLevelData\s*=" --type r R/dataProcess.R -A 5 | head -80

Repository: Vitek-Lab/MSstats

Length of output: 43


🏁 Script executed:

# Let's check the groupComparison.R file to see how the data flows into the function
cat -n R/groupComparison.R | head -150

Repository: Vitek-Lab/MSstats

Length of output: 10509


🏁 Script executed:

# Test R behavior with mean() and partial NAs
python3 << 'EOF'
import subprocess
result = subprocess.run(['Rscript', '-e', '''
# Test mean with partial NAs
x = c(1, 2, NA, 4, 5)
print("Vector with partial NAs:")
print(x)
print("mean(x) with na.rm=FALSE (default):")
print(mean(x))
print("mean(x) with na.rm=TRUE:")
print(mean(x, na.rm=TRUE))

# Test division by NA result
y = 1/c(0.1, 0.2, NA, 0.3)
print("1/c(0.1, 0.2, NA, 0.3):")
print(y)
print("mean of that:")
print(mean(y))
print("Dividing y by mean(y):")
print(y / mean(y))
'''], capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
EOF

Repository: Vitek-Lab/MSstats

Length of output: 616


🏁 Script executed:

# Search for test cases that might show Variance with NA values
rg -n "Variance.*NA|NA.*Variance" --type r -B 2 -A 2 | head -80

Repository: Vitek-Lab/MSstats

Length of output: 577


🏁 Script executed:

# Search for tests that explicitly test Variance with NAs
rg -n "all.*is.na.*Variance|is.na.*Variance" --type r -B 3 -A 3

Repository: Vitek-Lab/MSstats

Length of output: 634


🏁 Script executed:

# Look for test files that might test the weight calculation logic
cat -n inst/tinytest/test_utils_groupComparison_model.R

Repository: Vitek-Lab/MSstats

Length of output: 960


🏁 Script executed:

# Check if the test at line 17 is actually passing or if there's a mismatch
# The test expects unwormalized weights but the code normalizes them
cat -n inst/tinytest/test_utils_groupComparison_model.R | sed -n '14,18p'

Repository: Vitek-Lab/MSstats

Length of output: 392


Missing na.rm = TRUE in mean() silently zeros out all weights when any Variance is NA

The guard at line 142 only rejects the case where all variances are NA. When some (but not all) rows have NA variance, the branch is entered, unnormalized_weight_input contains some NA elements, and mean(unnormalized_weight_input) returns NA (R's default is na.rm = FALSE). Every element of weight_input then becomes NA, and the all-NA weight vector is forwarded to every lm() / lmer() call below — effectively dropping all observations and producing a failed or degenerate model.

🐛 Proposed fix
     if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
         unnormalized_weight_input = 1/input$Variance
-        weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
+        weight_input = unnormalized_weight_input / mean(unnormalized_weight_input, na.rm = TRUE)
     } else {
         weight_input = NULL
     }

Note: The existing test (inst/tinytest/test_utils_groupComparison_model.R:17) expects unnormalized weights but does not cover the partial-NA case, which is where this bug surfaces. The test should be updated to verify normalization and to include a case with partial NAs in the Variance column.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
weight_input = 1/input$Variance
unnormalized_weight_input = 1/input$Variance
weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
} else {
weight_input = NULL
}
if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
unnormalized_weight_input = 1/input$Variance
weight_input = unnormalized_weight_input / mean(unnormalized_weight_input, na.rm = TRUE)
} else {
weight_input = NULL
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/utils_groupcomparison.R` around lines 142 - 147, The current normalization
of unnormalized_weight_input uses mean(unnormalized_weight_input) which returns
NA if any Variance is NA; change the normalization to
mean(unnormalized_weight_input, na.rm = TRUE) and then guard against remaining
NA entries by replacing NA weights with 0 (e.g. after computing weight_input, do
weight_input[is.na(weight_input)] <- 0) so downstream lm()/lmer() calls don't
receive all-NA weights; update the code that references "Variance",
"unnormalized_weight_input", and "weight_input" accordingly and add/adjust tests
to cover the partial-NA Variance case.

Expand Down
Loading