Implementing and Testing DIANN converter for MSstatsBIG. #9
base: devel
📝 Walkthrough
Adds chunked DIANN-to-MSstats conversion with a new chunked reader.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant bigDIANN as bigDIANNtoMSstatsFormat
    participant reduce as reduceBigDIANN
    participant chunk as cleanDIANNChunk
    participant preprocess as MSstatsPreprocessBig
    participant output as Arrow/Sparklyr
    User->>bigDIANN: call with DIANN file & params
    bigDIANN->>reduce: orchestrate chunked reduction
    reduce->>reduce: detect delimiter
    rect rgba(200,220,255,0.5)
        Note over reduce,chunk: Chunked loop (1e6 rows/chunk)
        reduce->>chunk: process chunk 1
        chunk->>chunk: select columns (MBR-aware)
        chunk->>chunk: un-nest semicolon fields
        chunk->>chunk: parse FragmentIon & ProductCharge
        chunk->>chunk: apply Q-value and fragment filters
        chunk->>reduce: return processed chunk
        reduce->>reduce: append to CSV output
        reduce->>chunk: process chunk 2...N
    end
    reduce-->>bigDIANN: completed reduced CSV
    bigDIANN->>preprocess: MSstatsPreprocessBig (feature selection, filtering, anomaly scoring)
    preprocess->>output: produce Arrow/Sparklyr table
    bigDIANN->>User: return processed data
```
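The chunked loop above can be sketched in R. This is a minimal illustration, not the package's actual implementation; it assumes the readr package, and the helper name `reduce_in_chunks` and the `clean_fun` argument are hypothetical:

```r
library(readr)

# Read the DIANN report in 1e6-row chunks, clean each chunk,
# and append the result to a single reduced CSV.
reduce_in_chunks <- function(input_file, output_path, clean_fun,
                             delim = "\t", chunk_size = 1e6) {
  first_chunk <- TRUE
  read_delim_chunked(
    input_file, delim = delim, chunk_size = chunk_size,
    callback = function(chunk, pos) {
      cleaned <- clean_fun(chunk)
      # Write the header only for the first chunk, then append
      write_csv(cleaned, output_path, append = !first_chunk)
      first_chunk <<- FALSE
    }
  )
  invisible(output_path)
}
```

This keeps peak memory proportional to the chunk size rather than the full report.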
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 checks passed
Actionable comments posted: 5
Fix all issues with AI Agents 🤖
In @DESCRIPTION:
- Around line 26-27: Move the test-only packages out of Imports and into
Suggests in DESCRIPTION: remove "testthat" and "mockery" from the Imports list
and add them under Suggests so they are only required for development/testing;
update the DESCRIPTION sections accordingly so Imports contains runtime
dependencies only and Suggests lists testthat and mockery.
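For illustration, the resulting DESCRIPTION sections might look like this (the runtime package list is shortened and illustrative; only the testthat and mockery placements come from the comment above):

```
Imports:
    dplyr,
    tidyr
Suggests:
    testthat,
    mockery
```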
In @man/bigDIANNtoMSstatsFormat.Rd:
- Around line 33-34: The documentation and default parameter use incorrect DIANN
column names without dots; update the man page entry for quantificationColumn to
use "Fragment.Quant.Corrected" (DIANN 1.8.x) and "Fragment.Quant.Raw" (DIANN
1.9.x) instead of the dotted-less variants, and update the default value in the
reduceBigDIANN function (R/clean_DIANN.R, function reduceBigDIANN) to
"Fragment.Quant.Corrected" so the code and docs match DIANN's actual output
column names.
In @R/clean_DIANN.R:
- Around line 49-58: The current sub('\\^\\.\\*', '', .data$Fragment.Info) in
the dplyr::mutate (assigning FragmentIon) uses an incorrect literal regex and
does nothing; either remove the sub and set FragmentIon = .data$Fragment.Info
directly, or replace the pattern with one that strips the charge/suffix (e.g.
sub("\\^.*$", "", .data$Fragment.Info)) so "y7^1/1" -> "y7" and "b3-H2O^1/1" ->
"b3-H2O"; update the FragmentIon assignment accordingly in the mutate call.
In @R/converters.R:
- Around line 162-163: The roxygen docs for the quantificationColumn parameter
are inconsistent with the actual default; update the documentation text for
quantificationColumn to match the real default value used in the function
(change the displayed 'FragmentQuantCorrected' to 'Fragment.Quant.Corrected')
and likewise ensure the alternative value text matches the actual option name
used in code (e.g., use 'Fragment.Quant.Raw' if that is the real token), so the
docstring and the default parameter value for quantificationColumn are
identical.
🧹 Nitpick comments (6)
R/clean_DIANN.R (2)
91-91: Address the TODO comment regarding annotation columns. The TODO indicates uncertainty about whether `Condition` and `BioReplicate` columns are needed. This should be confirmed with Tony (or through documentation review) to ensure the converter produces the correct MSstats format. Would you like me to help search the MSstats documentation or existing converter implementations to verify the required columns?
2-21: Consider adding error handling for I/O operations. Both `reduceBigDIANN` and `cleanDIANNChunk` lack error handling for file operations. Consider wrapping I/O operations in `tryCatch` blocks to provide informative error messages when:
- Input files are missing, corrupted, or have incorrect format
- Output path is invalid or write-protected
- Disk space is exhausted during chunked writes
This would improve the user experience when processing large files that might fail midway through.
Also applies to: 24-113
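One possible shape for such a wrapper (a sketch; the helper name `safe_read` is illustrative, not part of the package):

```r
# Wrap a reader function so failures surface as informative errors
safe_read <- function(path, reader, ...) {
  if (!file.exists(path)) {
    stop("Input file not found: ", path, call. = FALSE)
  }
  tryCatch(
    reader(path, ...),
    error = function(e) {
      stop("Failed to read '", path, "': ", conditionMessage(e), call. = FALSE)
    }
  )
}
```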
tests/testthat/test-converters.R (4)
31-32: Use `tempfile()` for the output file to avoid test pollution. The input file uses `tempfile()` but the output file is hardcoded as `"preprocess_output.csv"`. This could cause issues with parallel test execution or leave artifacts in the working directory if the test fails before cleanup.

🔎 Proposed fix

```diff
  input_file <- tempfile(fileext = ".csv")
- output_file <- "preprocess_output.csv"
+ output_file <- tempfile(fileext = ".csv")
```
57-60: Use `on.exit()` for guaranteed cleanup. If an assertion fails before reaching cleanup, temporary files may persist. Using `on.exit()` ensures cleanup runs regardless of test outcome.

🔎 Proposed fix

```diff
  input_file <- tempfile(fileext = ".csv")
- output_file <- "preprocess_output.csv"
+ output_file <- tempfile(fileext = ".csv")
+ on.exit({
+   if (file.exists(input_file)) file.remove(input_file)
+   if (file.exists(output_file)) file.remove(output_file)
+ }, add = TRUE)
  # ... test code ...
- # Cleanup
- file.remove(input_file)
- if (file.exists(output_file)) file.remove(output_file)
 })
```
64-64: Unused `mock_reduce` variable. `mock_reduce` is created but never used. The `stub()` call on line 66 is what actually replaces `reduceBigSpectronaut`. This line can be removed.

🔎 Proposed fix

```diff
 test_that("bigSpectronauttoMSstatsFormat works correctly", {
-  # Mock reduceBigSpectronaut as its source is not provided
-  mock_reduce <- mock(NULL)
   stub(bigSpectronauttoMSstatsFormat, "reduceBigSpectronaut", function(input_file, output_path, ...) {
```
76-77: Use `tempfile()` for test files. Similar to the previous test, hardcoded filenames can cause test pollution. While `input_file` isn't actually read (due to mocking), using `tempfile()` for `output_file` ensures proper isolation.

🔎 Proposed fix

```diff
- input_file <- "dummy_spectro_input.csv"
- output_file <- "spectro_output.csv"
+ input_file <- "dummy_spectro_input.csv"  # Not actually read due to mock
+ output_file <- tempfile(fileext = ".csv")
+ on.exit({
+   if (file.exists(output_file)) file.remove(output_file)
+   reduce_file <- paste0("reduce_output_", basename(output_file))
+   if (file.exists(reduce_file)) file.remove(reduce_file)
+ }, add = TRUE)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `tests/testthat/topN_preprocess_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_spectro_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_test_diann_output.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (7)
- `DESCRIPTION`
- `NAMESPACE`
- `R/clean_DIANN.R`
- `R/converters.R`
- `man/bigDIANNtoMSstatsFormat.Rd`
- `tests/testthat/test-converters.R`
- `tests/testthat/test-diann_converter.R`
🔇 Additional comments (6)
NAMESPACE (1)
5-5: LGTM! The new export for `bigDIANNtoMSstatsFormat` is correctly added and aligns with the new converter functionality.

tests/testthat/test-diann_converter.R (1)

6-109: LGTM! Comprehensive test coverage. The test suite provides excellent coverage of the DIANN converter functionality:
- cleanDIANNChunk test: Validates chunk processing, fragment filtering (H2O removal), and column mapping
- reduceBigDIANN test: Tests file-based chunked processing with multiple proteins and charge extraction
- bigDIANNtoMSstatsFormat test: End-to-end validation of feature selection logic with arrow backend
The tests correctly exercise key edge cases and validate expected outputs.
tests/testthat/test-converters.R (3)
1-4: LGTM on test setup. The imports and context setup are appropriate for testing converter functions.

6-28: LGTM! Good coverage of the annotation merge functionality. The test validates that `Run` and `Intensity` are preserved while `Condition` and `BioReplicate` are correctly joined.

92-95: Cleanup may miss the intermediate file with the `tempfile()` approach. If you adopt `tempfile()` for `output_file`, the `paste0("reduce_output_", output_file)` pattern will break because `output_file` would be a full path like `/tmp/RtmpXXX/file123.csv`. Consider adjusting how the reduce output path is constructed or use `basename()`. Also, consider wrapping cleanup in `on.exit()` as suggested for the previous test.

R/converters.R (1)
170-199: Verify `reduceBigDIANN` exists and parameter alignment. The function implementation follows the established pattern from `bigSpectronauttoMSstatsFormat`. The function `reduceBigDIANN` is correctly defined in R/clean_DIANN.R and properly invoked.

Looking at the `MSstatsPreprocessBig` signature (R/converters.R:44-54), the parameter order is: `input_file, output_file_name, backend, max_feature_count, filter_unique_peptides, aggregate_psms, filter_few_obs, remove_annotation, calculateAnomalyScores, anomalyModelFeatures, connection`.

For `bigSpectronauttoMSstatsFormat` (R/converters.R:147-151):

```r
MSstatsPreprocessBig(
  paste0("reduce_output_", output_file_name),
  output_file_name, backend, max_feature_count,
  aggregate_psms, filter_few_obs, remove_annotation,
  calculateAnomalyScores, anomalyModelFeatures, connection)
```

This is missing `filter_unique_peptides` entirely, a bug in the existing function.

Your new function correctly includes all parameters in the proper order:

```r
MSstatsPreprocessBig(
  paste0("reduce_output_", output_file_name),
  output_file_name, backend, max_feature_count,
  filter_unique_peptides, aggregate_psms, filter_few_obs,
  remove_annotation, calculateAnomalyScores,
  anomalyModelFeatures, connection)
```
```
\item{quantificationColumn}{Use 'FragmentQuantCorrected'(default) column for quantified intensities for DIANN 1.8.x.
Use 'FragmentQuantRaw' for quantified intensities for DIANN 1.9.x.}
```
🧩 Analysis chain (verification scripts run against the Vitek-Lab/MSstatsBig repository; output omitted)
Correct the column name documentation to match DIANN's actual output format.
The documentation incorrectly specifies column names without dots. DIANN exports columns with dots in their names. The actual column names are:
- `Fragment.Quant.Corrected` (for DIANN 1.8.x), not `FragmentQuantCorrected`
- `Fragment.Quant.Raw` (for DIANN 1.9.x), not `FragmentQuantRaw`
Update lines 33-34 to use the correct dotted format, and also fix the default value in reduceBigDIANN (R/clean_DIANN.R, line 3) which currently uses "FragmentQuantCorrected" instead of "Fragment.Quant.Corrected".
🤖 Prompt for AI Agents
In @man/bigDIANNtoMSstatsFormat.Rd around lines 33-34, The documentation and
default parameter use incorrect DIANN column names without dots; update the man
page entry for quantificationColumn to use "Fragment.Quant.Corrected" (DIANN
1.8.x) and "Fragment.Quant.Raw" (DIANN 1.9.x) instead of the dotted-less
variants, and update the default value in the reduceBigDIANN function
(R/clean_DIANN.R, function reduceBigDIANN) to "Fragment.Quant.Corrected" so the
code and docs match DIANN's actual output column names.
Done
🧩 Analysis chain (verification scripts; output omitted)
@Rudhik1904, I don't see the changes yet in the repository. The files still contain the incorrect column names:
man/bigDIANNtoMSstatsFormat.Rd (lines 33-34) still shows:
- `'FragmentQuantCorrected'` → should be `'Fragment.Quant.Corrected'`
- `'FragmentQuantRaw'` → should be `'Fragment.Quant.Raw'`
R/clean_DIANN.R (line 2) still shows:
quantificationColumn = "FragmentQuantCorrected"→ should be"Fragment.Quant.Corrected"
Could you please commit and push the changes? The column names need to include the dots to match DIANN's actual output format.
| if ("Condition" %in% colnames(input)) msstats_cols <- c(msstats_cols, "Condition") | ||
| if ("BioReplicate" %in% colnames(input)) msstats_cols <- c(msstats_cols, "BioReplicate") |
Confirmed with Devon, you can remove this code and instead add an additional parameter for users to provide an annotation file (i.e. the table with Run, BioReplicate, and Condition)
Removed
```r
mbr_cols <- if (MBR) {
  c('Lib.Q.Value', 'Lib.PG.Q.Value')
} else {
  c('Global.Q.Value', 'Global.PG.Q.Value')
}
```
Here, we need to add a filter that drops rows whose values in these columns exceed a certain threshold (see reference).
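A minimal sketch of such a filter in the MBR branch, assuming a conventional 0.01 q-value cutoff (the threshold and its exact placement are illustrative, not confirmed project defaults):

```r
q_cutoff <- 0.01  # illustrative threshold, not a confirmed project default
input <- dplyr::filter(
  input,
  .data$Lib.Q.Value <= q_cutoff,
  .data$Lib.PG.Q.Value <= q_cutoff
)
```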
Done
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@R/clean_DIANN.R`:
- Around line 2-3: The default quantificationColumn in reduceBigDIANN is
incorrect ("FragmentQuantCorrected") and should match bigDIANNtoMSstatsFormat's
column name ("Fragment.Quant.Corrected") so the column can be found; update the
default value of the quantificationColumn parameter in reduceBigDIANN to
"Fragment.Quant.Corrected" (or otherwise normalize dotted vs. non-dotted names
inside reduceBigDIANN to accept both), referencing the reduceBigDIANN function
and the bigDIANNtoMSstatsFormat expectation when making the change.
- Around line 35-58: The code unconditionally uses columns "Fragment.Info" and
the variable quantificationColumn after only using intersect() earlier, so add
guards to skip downstream processing when those columns are missing: check if
quantificationColumn exists in colnames(input) before calling
tidyr::separate_rows on it or converting input[[quantificationColumn]] <-
as.numeric(...), and check for "Fragment.Info" in colnames(input) before
separating rows or any parsing of fragment data; use the existing split_cols or
explicit logical checks (e.g., if (quantificationColumn %in% colnames(input)) {
... } and if ("Fragment.Info" %in% colnames(input)) { ... }) around the
tidyr::separate_rows, fragment processing, and numeric conversion to avoid
crashes when columns are absent.
- Around line 64-70: ProductCharge currently becomes NA when Fragment.Info
contains "/" but no digits follow because as.integer(stringr::str_extract(...))
yields NA; modify the ProductCharge computation (the dplyr::if_else branch that
uses stringr::str_extract on .data$Fragment.Info) to wrap the extracted value
with dplyr::coalesce (or equivalent) so that if str_extract returns NA it falls
back to "1" before coercion, ensuring the final value is 1L when parsing fails.
In `@R/converters.R`:
- Around line 184-193: The intermediate path construction using
paste0("reduce_output_", output_file_name) is unsafe when output_file_name
contains directories; update calls that create the reduced intermediate filename
(in reduceBigDIANN -> MSstatsPreprocessBig sequence and the analogous locations
in bigSpectronauttoMSstatsFormat) to build the path with
file.path(dirname(output_file_name), paste0("reduce_output_",
basename(output_file_name))) so the reduce_output_* file is created alongside
the intended output file regardless of directory components.
In `@tests/testthat/test-diann_converter.R`:
- Around line 78-108: The test uses a static output filename
("test_diann_output.csv") which can cause collisions; change the output_file
assignment to use tempfile(fileext = ".csv") and update all subsequent
references (the call to bigDIANNtoMSstatsFormat, the dplyr::collect on
converted, and the cleanup file.remove calls that use output_file and
paste0("reduce_output_", output_file)) to use that tempfile variable so the test
writes/cleans unique temp files; relevant symbols: output_file,
bigDIANNtoMSstatsFormat, converted, and the cleanup file.remove/paste0 calls.
♻️ Duplicate comments (2)
R/clean_DIANN.R (2)
7-13: Stop on unsupported extensions; avoid treating .xlsx as TSV.
grepl("tsv|xls")matches.xlsx, which is binary. The";"fallback can silently mis-parse files. Prefer anchored, case-insensitive extension checks and error for unsupported formats.🐛 Proposed fix
- if (grepl("csv", input_file)) { + if (grepl("\\.csv$", input_file, ignore.case = TRUE)) { delim = "," - } else if (grepl("tsv|xls", input_file)) { + } else if (grepl("\\.(tsv|txt)$", input_file, ignore.case = TRUE)) { delim = "\t" } else { - delim <- ";" + stop("Unsupported file extension. Please provide a .csv or .tsv/.txt file.") }
60-63: Remove the ineffective FragmentIon regex.
`sub('\\^\\.\\*', ...)` matches the literal "^.*" and is effectively a no-op. If no transformation is intended, assign the value directly.

🔧 Suggested change

```diff
- FragmentIon = sub('\\^\\.\\*', '', .data$Fragment.Info),
+ FragmentIon = .data$Fragment.Info,
```
```r
reduceBigDIANN <- function(input_file, output_path, MBR = TRUE,
                           quantificationColumn = "FragmentQuantCorrected",
```
🧩 Analysis chain (verification scripts; output omitted)
Align quantificationColumn default with bigDIANNtoMSstatsFormat.
The default "FragmentQuantCorrected" differs from bigDIANNtoMSstatsFormat ("Fragment.Quant.Corrected"), and test data confirms DIANN output uses the dotted notation. Calling reduceBigDIANN without override will fail to locate the column.
🔧 Suggested change
```diff
-reduceBigDIANN <- function(input_file, output_path, MBR = TRUE,
-                           quantificationColumn = "FragmentQuantCorrected",
+reduceBigDIANN <- function(input_file, output_path, MBR = TRUE,
+                           quantificationColumn = "Fragment.Quant.Corrected",
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```r
reduceBigDIANN <- function(input_file, output_path, MBR = TRUE,
                           quantificationColumn = "Fragment.Quant.Corrected",
```
🤖 Prompt for AI Agents
In `@R/clean_DIANN.R` around lines 2 - 3, The default quantificationColumn in
reduceBigDIANN is incorrect ("FragmentQuantCorrected") and should match
bigDIANNtoMSstatsFormat's column name ("Fragment.Quant.Corrected") so the column
can be found; update the default value of the quantificationColumn parameter in
reduceBigDIANN to "Fragment.Quant.Corrected" (or otherwise normalize dotted vs.
non-dotted names inside reduceBigDIANN to accept both), referencing the
reduceBigDIANN function and the bigDIANNtoMSstatsFormat expectation when making
the change.
```r
# 1. Select required columns
base_cols <- c('Protein.Names', 'Stripped.Sequence', 'Modified.Sequence',
               'Precursor.Charge', quantificationColumn, 'Q.Value',
               'Precursor.Mz', 'Fragment.Info', 'Run')

mbr_cols <- if (MBR) {
  c('Lib.Q.Value', 'Lib.PG.Q.Value')
} else {
  c('Global.Q.Value', 'Global.PG.Q.Value')
}

req_cols <- intersect(c(base_cols, mbr_cols), colnames(input))
input <- dplyr::select(input, all_of(req_cols))

# 2. Split concatenated values (un-nest)
split_cols <- intersect(c(quantificationColumn, "Fragment.Info"), colnames(input))
if (length(split_cols) > 0) {
  input <- tidyr::separate_rows(input, all_of(split_cols), sep = ";")
}

# 3. Process fragment information

# Convert Intensity to numeric from character strings
input[[quantificationColumn]] <- as.numeric(input[[quantificationColumn]])
```
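The un-nesting step in isolation behaves like this on toy data (column names follow the DIANN format; `Quant` stands in for the chosen quantification column):

```r
library(tidyr)

df <- data.frame(
  Fragment.Info = "y7^1/1;b3^1/2",
  Quant = "100;200"
)
separate_rows(df, Fragment.Info, Quant, sep = ";")
# Two rows: (y7^1/1, 100) and (b3^1/2, 200)
```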
🧩 Analysis chain (verification scripts; output omitted)
Guard against missing Fragment.Info or quantificationColumn before unconditional use.
The code treats Fragment.Info and quantificationColumn as optional via intersect() at lines 46–50, but uses them unconditionally at lines 58, 62–68, and 76. If either column is absent from the input, the code will crash. Either enforce these as required columns upfront or conditionally apply the downstream processing when they are missing.
🔧 Example guard

```diff
- req_cols <- intersect(c(base_cols, mbr_cols), colnames(input))
- input <- dplyr::select(input, all_of(req_cols))
+ required_cols <- c(base_cols, mbr_cols)
+ missing_cols <- setdiff(required_cols, colnames(input))
+ if (length(missing_cols) > 0) {
+   stop("Missing required DIANN columns: ", paste(missing_cols, collapse = ", "))
+ }
+ input <- dplyr::select(input, all_of(required_cols))
```

🤖 Prompt for AI Agents
In `@R/clean_DIANN.R` around lines 35 - 58, The code unconditionally uses columns
"Fragment.Info" and the variable quantificationColumn after only using
intersect() earlier, so add guards to skip downstream processing when those
columns are missing: check if quantificationColumn exists in colnames(input)
before calling tidyr::separate_rows on it or converting
input[[quantificationColumn]] <- as.numeric(...), and check for "Fragment.Info"
in colnames(input) before separating rows or any parsing of fragment data; use
the existing split_cols or explicit logical checks (e.g., if
(quantificationColumn %in% colnames(input)) { ... } and if ("Fragment.Info" %in%
colnames(input)) { ... }) around the tidyr::separate_rows, fragment processing,
and numeric conversion to avoid crashes when columns are absent.
```r
# Extract product charge
ProductCharge = dplyr::if_else(
  grepl("/", .data$Fragment.Info),
  # Extract charge (number right after "/" in string), default to 1 if parsing fails
  as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
  1L
)
```
🧩 Analysis chain (verification scripts; output omitted)
Ensure ProductCharge defaults to 1 when parsing fails.
The current code returns NA when a / is present in Fragment.Info but no digits follow (e.g., "y1/"). The regex pattern "(?<=/)[0-9]+" fails to match, causing stringr::str_extract() to return NA, and subsequently as.integer(NA) returns NA instead of the intended default of 1L. Use dplyr::coalesce() to handle this edge case:
🔧 Suggested change

```diff
- ProductCharge = dplyr::if_else(
-   grepl("/", .data$Fragment.Info),
-   # Extract charge (number right after "/" in string), default to 1 if parsing fails
-   as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
-   1L
- )
+ ProductCharge = dplyr::if_else(
+   grepl("/", .data$Fragment.Info),
+   dplyr::coalesce(
+     as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
+     1L
+   ),
+   1L
+ )
```

🤖 Prompt for AI Agents
In `@R/clean_DIANN.R` around lines 64 - 70, ProductCharge currently becomes NA
when Fragment.Info contains "/" but no digits follow because
as.integer(stringr::str_extract(...)) yields NA; modify the ProductCharge
computation (the dplyr::if_else branch that uses stringr::str_extract on
.data$Fragment.Info) to wrap the extracted value with dplyr::coalesce (or
equivalent) so that if str_extract returns NA it falls back to "1" before
coercion, ensuring the final value is 1L when parsing fails.
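The edge case and the `coalesce()` fallback are easy to reproduce at the console:

```r
library(stringr)
library(dplyr)

info <- c("y7^1/2", "y1/", "b3")
if_else(
  grepl("/", info),
  coalesce(as.integer(str_extract(info, "(?<=/)[0-9]+")), 1L),
  1L
)
# 2 1 1
```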
```r
# Reduce and clean the DIANN report file in chunks
reduceBigDIANN(input_file,
               paste0("reduce_output_", output_file_name),
               MBR,
               quantificationColumn)

# Preprocess the cleaned data (feature selection, etc.)
msstats_data <- MSstatsPreprocessBig(
  paste0("reduce_output_", output_file_name),
  output_file_name, backend, max_feature_count,
```
🧩 Analysis chain (verification scripts; output omitted)
Fix unsafe path construction that breaks when output_file_name contains directories.
The paste0("reduce_output_", output_file_name) calls at lines 186 and 192 produce invalid paths if output_file_name includes directory separators (e.g., /tmp/out.csv → reduce_output_/tmp/out.csv). Use file.path() with dirname() and basename() to safely place the intermediate file alongside the output.
Note: The same pattern exists in bigSpectronauttoMSstatsFormat at lines 143 and 148 and should be fixed identically.
🐛 Proposed fix

```diff
+ reduce_path <- file.path(
+   dirname(output_file_name),
+   paste0("reduce_output_", basename(output_file_name))
+ )
+ reduceBigDIANN(input_file, reduce_path, MBR, quantificationColumn)
- reduceBigDIANN(input_file,
-                paste0("reduce_output_", output_file_name),
-                MBR,
-                quantificationColumn)
- msstats_data <- MSstatsPreprocessBig(
-   paste0("reduce_output_", output_file_name),
+ msstats_data <- MSstatsPreprocessBig(
+   reduce_path,
    output_file_name, backend, max_feature_count,
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```r
  # Reduce and clean the DIANN report file in chunks
  reduce_path <- file.path(
    dirname(output_file_name),
    paste0("reduce_output_", basename(output_file_name))
  )
  reduceBigDIANN(input_file, reduce_path, MBR, quantificationColumn)
  # Preprocess the cleaned data (feature selection, etc.)
  msstats_data <- MSstatsPreprocessBig(
    reduce_path,
    output_file_name, backend, max_feature_count,
```
🤖 Prompt for AI Agents
In `@R/converters.R` around lines 184 - 193, The intermediate path construction
using paste0("reduce_output_", output_file_name) is unsafe when output_file_name
contains directories; update calls that create the reduced intermediate filename
(in reduceBigDIANN -> MSstatsPreprocessBig sequence and the analogous locations
in bigSpectronauttoMSstatsFormat) to build the path with
file.path(dirname(output_file_name), paste0("reduce_output_",
basename(output_file_name))) so the reduce_output_* file is created alongside
the intended output file regardless of directory components.
```r
test_that("bigDIANNtoMSstatsFormat works with arrow backend", {
  input_file <- tempfile(fileext = ".csv")
  output_file <- "test_diann_output.csv"

  # 4 features for one protein. Feature selection should pick the top 2.
  diann_data <- rbind(
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(1000, 1100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y1", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(500, 600), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y2", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(100, 100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y3", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(2000, 2100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y4", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001)
  )
  write.csv(diann_data, input_file, row.names = FALSE)

  converted <- bigDIANNtoMSstatsFormat(
    input_file = input_file,
    output_file_name = output_file,
    backend = "arrow",
    max_feature_count = 2
  )
  result <- dplyr::collect(converted)

  # Avg intensities: y1=1050, y2=550, y3=100, y4=2050.
  # Top 2 features are y4 and y1.
  expect_equal(nrow(result), 4) # 2 features * 2 runs
  expect_true(all(c("y1", "y4") %in% unique(result$FragmentIon)))
  expect_false(any(c("y2", "y3") %in% unique(result$FragmentIon)))

  # Cleanup
  file.remove(input_file)
  if (file.exists(output_file)) file.remove(output_file)
  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
})
```
Use `tempfile()` instead of a static filename to prevent test collisions.
The other tests in this file (lines 7 and 45) correctly use `tempfile(fileext = ".csv")` for output files. This test should follow the same pattern to avoid collisions in parallel test runs or leftovers from previous test failures. Evidence of incomplete cleanup is already visible in the repository as stray CSV files (e.g., `topN_test_diann_output.csv`).
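A common testthat pattern (a sketch assuming only base R and testthat, not code from this PR) pairs `tempfile()` with `on.exit()` so the unique path is cleaned up even when an expectation fails mid-test:

```r
library(testthat)

test_that("output file is unique and always cleaned up", {
  output_file <- tempfile(fileext = ".csv")   # unique path per call
  on.exit(unlink(output_file), add = TRUE)    # cleanup runs even on failure

  write.csv(data.frame(x = 1:3), output_file, row.names = FALSE)
  expect_true(file.exists(output_file))
})
```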
🔧 Suggested change

```diff
-  output_file <- "test_diann_output.csv"
+  output_file <- basename(tempfile(fileext = ".csv"))
```

📝 Committable suggestion
```r
test_that("bigDIANNtoMSstatsFormat works with arrow backend", {
  input_file <- tempfile(fileext = ".csv")
  output_file <- basename(tempfile(fileext = ".csv"))

  # 4 features for one protein. Feature selection should pick the top 2.
  diann_data <- rbind(
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(1000, 1100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y1", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(500, 600), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y2", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(100, 100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y3", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(2000, 2100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y4", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001)
  )
  write.csv(diann_data, input_file, row.names = FALSE)

  converted <- bigDIANNtoMSstatsFormat(
    input_file = input_file,
    output_file_name = output_file,
    backend = "arrow",
    max_feature_count = 2
  )
  result <- dplyr::collect(converted)

  # Avg intensities: y1=1050, y2=550, y3=100, y4=2050.
  # Top 2 features are y4 and y1.
  expect_equal(nrow(result), 4) # 2 features * 2 runs
  expect_true(all(c("y1", "y4") %in% unique(result$FragmentIon)))
  expect_false(any(c("y2", "y3") %in% unique(result$FragmentIon)))

  # Cleanup
  file.remove(input_file)
  if (file.exists(output_file)) file.remove(output_file)
  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
})
```
🤖 Prompt for AI Agents
In `@tests/testthat/test-diann_converter.R` around lines 78 - 108, The test uses a
static output filename ("test_diann_output.csv") which can cause collisions;
change the output_file assignment to use tempfile(fileext = ".csv") and update
all subsequent references (the call to bigDIANNtoMSstatsFormat, the
dplyr::collect on converted, and the cleanup file.remove calls that use
output_file and paste0("reduce_output_", output_file)) to use that tempfile
variable so the test writes/cleans unique temp files; relevant symbols:
output_file, bigDIANNtoMSstatsFormat, converted, and the cleanup
file.remove/paste0 calls.
WIP: Added Clean_DIANN, the DIANN converter, and some basic tests (a more comprehensive test suite is still needed).
Summary by CodeRabbit
New Features
Tests
Chores