Add initial support for SDRF integration
minor changes
📝 Walkthrough

Refactors aggregation and plotting to add SDRF/sample-aware and DIA-aware per-run and per-sample metrics across the common, dia, quantms, mzidentml and plotting modules; adds utilities for identified-rate, charge-state and modification summaries; and bumps the MultiQC upper bound and adds urllib3 to the dependency specs.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ❌ 1 failed (warning), ✅ 2 passed.
Actionable comments posted: 8
🧹 Nitpick comments (9)
pmultiqc/modules/common/plots/general.py (2)
124-124: Use f-string conversion flag instead of explicit `str()` call.

The explicit `str()` call inside the f-string can be replaced with the more Pythonic `!s` conversion flag.

🔎 Proposed fix

```diff
- sample=SampleName(f"Sample {str(sample)}"),
+ sample=SampleName(f"Sample {sample!s}"),
```

Based on static analysis hints (Ruff RUF010).
187-187: Use f-string conversion flag instead of explicit `str()` call.

The explicit `str()` call inside the f-string can be replaced with the `!s` conversion flag, consistent with the suggestion for line 124.

🔎 Proposed fix

```diff
- sample=SampleName(f"Sample {str(sample)}"),
+ sample=SampleName(f"Sample {sample!s}"),
```

Based on static analysis hints (Ruff RUF010).
pmultiqc/modules/common/ms/idxml.py (1)
89-100: Simplify the `_init_labels()` return statement.

The method returns three unused `False` values alongside the `labels` dictionary. Only the `labels` dict is actually used by the caller (line 64). Simplify the return to just `return labels` to reduce unnecessary complexity.

🔎 Proposed simplification

```diff
 def _init_labels(self):
     labels = {
         "spec_e_label": [],
         "xcorr_label": [],
         "hyper_label": [],
         "peps_label": [],
         "consensus_label": [],
         "msgf_label": False,
         "comet_label": False,
         "sage_label": False,
     }
-    return False, False, False, labels
+    return labels
```

And update the caller:

```diff
-_, _, _, labels = self._init_labels()
+labels = self._init_labels()
```

pmultiqc/modules/common/ms_io.py (1)
128-146: LGTM! Efficient batch processing with boolean indexing.

The refactoring uses `info_df["scan"].isin(identified_spectrum_scan_id)` to create a boolean mask, then processes identified and unidentified scans separately. This is more efficient than row-by-row iteration.

Optional: further reduce code duplication

The three loops for each histogram could be extracted into a helper function:

```diff
+    def process_columns(df_subset, charge_hist, intensity_hist, peaks_hist):
+        for charge_state in df_subset["precursor_charge"]:
+            data_add_value(charge_hist, charge_state)
+        for base_peak_inte in df_subset["base_peak_intensity"]:
+            data_add_value(intensity_hist, base_peak_inte)
+        for peak_per_ms2 in df_subset["num_peaks"]:
+            data_add_value(peaks_hist, peak_per_ms2)
+
     identified_scans = info_df["scan"].isin(identified_spectrum_scan_id)
-
-    for charge_state in info_df[identified_scans]["precursor_charge"]:
-        data_add_value(mzml_charge_plot, charge_state)
-    for base_peak_inte in info_df[identified_scans]["base_peak_intensity"]:
-        data_add_value(mzml_peak_distribution_plot, base_peak_inte)
-    for peak_per_ms2 in info_df[identified_scans]["num_peaks"]:
-        data_add_value(mzml_peaks_ms2_plot, peak_per_ms2)
-
-    for charge_state in info_df[~identified_scans]["precursor_charge"]:
-        data_add_value(mzml_charge_plot_1, charge_state)
-    for base_peak_inte in info_df[~identified_scans]["base_peak_intensity"]:
-        data_add_value(mzml_peak_distribution_plot_1, base_peak_inte)
-    for peak_per_ms2 in info_df[~identified_scans]["num_peaks"]:
-        data_add_value(mzml_peaks_ms2_plot_1, peak_per_ms2)
+    process_columns(
+        info_df[identified_scans],
+        mzml_charge_plot, mzml_peak_distribution_plot, mzml_peaks_ms2_plot
+    )
+    process_columns(
+        info_df[~identified_scans],
+        mzml_charge_plot_1, mzml_peak_distribution_plot_1, mzml_peaks_ms2_plot_1
+    )
```

pmultiqc/modules/common/plots/dia.py (2)
9-10: Self-import creates circular reference risk.

Line 9 imports `dia as dia_plots` from the same module (`pmultiqc.modules.common.plots`). This self-reference is used in `calculate_dia_intensity_std` (line 560) to call `dia_plots.can_groupby_for_std`. Since `can_groupby_for_std` is defined in this same file (line 257), you can call it directly without the import.

🔎 Suggested fix

```diff
-from pmultiqc.modules.common.plots import dia as dia_plots
 from pmultiqc.modules.common.common_utils import group_charge
```

And at line 560:

```diff
-    if dia_plots.can_groupby_for_std(df_sub, "Run"):
+    if can_groupby_for_std(df_sub, "Run"):
```
536-582: Missing return statement when grouping fails.

When `can_groupby_for_std` returns `False` (line 580), the function logs a warning but doesn't explicitly return. This results in an implicit `return None`, which is handled by the caller but could be made explicit for clarity.

🔎 Suggested fix

```diff
     else:
         log.warning("No SDRF available; failed to parse experimental groups; SD Intensity not generated.")
+        return {}
```

pmultiqc/modules/common/common_utils.py (2)
272-293: Column conflict handling logs a warning but proceeds anyway.

When both "Modifications" and "modifications" columns exist (line 279), the warning is logged but the code continues using the original "modifications" column. This could silently produce incorrect results if the columns have different data. Consider either raising an error or explicitly documenting which column takes precedence.

🔎 Suggested fix to clarify precedence

```diff
 if "Modifications" in df_copy.columns and "modifications" not in df_copy.columns:
     df_copy = df_copy.rename(columns={"Modifications": "modifications"})
-else:
-    log.warning('Detected both "Modifications" and "modifications" columns.')
+elif "Modifications" in df_copy.columns and "modifications" in df_copy.columns:
+    log.warning('Detected both "Modifications" and "modifications" columns; using "modifications".')
```
540-548: `zip()` without a strict parameter may silently truncate mismatched iterables.

As flagged by static analysis (B905), using `zip()` without `strict=True` on line 544 could silently drop data if the iterables have different lengths. Since both come from the same DataFrame, this is unlikely but worth protecting against.

🔎 Suggested fix

```diff
 mod_plot_dict = dict(
-    zip(mod_group_processed["modifications"], mod_group_processed["percentage"])
+    zip(mod_group_processed["modifications"], mod_group_processed["percentage"], strict=True)
 )
```

pmultiqc/modules/common/dia_utils.py (1)
837-855: `dia_sample_level_modifications` duplicates `sample_level_modifications` from quantms.py.

This function has nearly identical logic to `sample_level_modifications` in `pmultiqc/modules/quantms/quantms.py` (lines 2286-2308). Both merge with SDRF, cast Sample to int, group by Sample, and call `summarize_modifications`. Consider extracting to a shared utility in `common_utils.py`.

Based on the relevant code snippets, `summarize_modifications` and `group_charge` are already duplicated between `common_utils.py` and `mztab.py`. This pattern of duplication should be consolidated.

```bash
#!/bin/bash
# Check for duplicate implementations of sample_level_modifications
rg -n "def.*sample_level_modifications" --type py
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `MultiQC_logo.png` is excluded by `!**/*.png`
- `pmultiqc/images/pmultiqc_logo_dark.png` is excluded by `!**/*.png`
📒 Files selected for processing (17)
- environment.yml
- pmultiqc/main.py
- pmultiqc/modules/common/common_utils.py
- pmultiqc/modules/common/dia_utils.py
- pmultiqc/modules/common/ms/idxml.py
- pmultiqc/modules/common/ms/msinfo.py
- pmultiqc/modules/common/ms_io.py
- pmultiqc/modules/common/mzidentml_utils.py
- pmultiqc/modules/common/plots/dia.py
- pmultiqc/modules/common/plots/general.py
- pmultiqc/modules/common/plots/id.py
- pmultiqc/modules/diann/diann.py
- pmultiqc/modules/mzidentml/mzidentml.py
- pmultiqc/modules/quantms/quantms.py
- pyproject.toml
- recipe/meta.yaml
- requirements.txt
💤 Files with no reviewable changes (1)
- pmultiqc/modules/diann/diann.py
🧰 Additional context used
🧬 Code graph analysis (7)
pmultiqc/modules/common/ms/msinfo.py (1)
pmultiqc/modules/common/ms_io.py (1)
add_ms_values(77-146)
pmultiqc/modules/quantms/quantms.py (1)
pmultiqc/modules/common/file_utils.py (1)
file_prefix(110-117)
pmultiqc/modules/common/ms_io.py (1)
pmultiqc/modules/common/histogram.py (1)
add_value(67-119)
pmultiqc/modules/mzidentml/mzidentml.py (3)
pmultiqc/modules/common/common_utils.py (1)
aggregate_msms_identified_rate (497-538)

pmultiqc/modules/common/plots/id.py (1)

draw_quantms_identification (377-506)

pmultiqc/modules/quantms/sparklines.py (1)
plot(19-56)
pmultiqc/modules/common/plots/id.py (1)
pmultiqc/modules/quantms/sparklines.py (1)
plot(19-56)
pmultiqc/modules/common/dia_utils.py (1)
pmultiqc/modules/common/common_utils.py (2)
cal_num_table_at_sample (451-480)
summarize_modifications (540-548)
pmultiqc/modules/common/plots/dia.py (1)
pmultiqc/modules/common/common_utils.py (1)
group_charge(551-561)
🪛 OSV Scanner (2.3.1)
requirements.txt
[HIGH] 1-1: urllib3 1.26.20: urllib3 streaming API improperly handles highly compressed data
[HIGH] 1-1: urllib3 1.26.20: urllib3 allows an unbounded number of links in the decompression chain
[HIGH] 1-1: urllib3 1.26.20: urllib3 redirects are not disabled when retries are disabled on PoolManager instantiation
🪛 Ruff (0.14.10)
pmultiqc/modules/common/plots/general.py
124-124: Use explicit conversion flag
Replace with conversion flag
(RUF010)
187-187: Use explicit conversion flag
Replace with conversion flag
(RUF010)
pmultiqc/modules/quantms/quantms.py
1105-1105: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
1177-1177: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1183-1183: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1251-1251: Use explicit conversion flag
Replace with conversion flag
(RUF010)
2306-2306: Use explicit conversion flag
Replace with conversion flag
(RUF010)
pmultiqc/modules/common/plots/id.py
415-415: Use explicit conversion flag
Replace with conversion flag
(RUF010)
647-647: Use explicit conversion flag
Replace with conversion flag
(RUF010)
709-709: Use explicit conversion flag
Replace with conversion flag
(RUF010)
pmultiqc/modules/common/common_utils.py
528-528: Use explicit conversion flag
Replace with conversion flag
(RUF010)
544-544: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
558-558: Use explicit conversion flag
Replace with conversion flag
(RUF010)
pmultiqc/modules/common/dia_utils.py
854-854: Use explicit conversion flag
Replace with conversion flag
(RUF010)
pmultiqc/modules/common/plots/dia.py
87-87: Use explicit conversion flag
Replace with conversion flag
(RUF010)
554-554: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: test_maxquant
- GitHub Check: test_mzid_mzML
- GitHub Check: test_lfq
- GitHub Check: test_maxquant_dia
- GitHub Check: test_mzid_mgf
- GitHub Check: test_diann
- GitHub Check: test_proteobench
- GitHub Check: test_tmt
- GitHub Check: test_dia
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (23)
pmultiqc/modules/common/plots/general.py (1)
66-67: Good practice: explicit clustering configuration.

Making the clustering behavior explicit is a good practice that improves code clarity and prevents reliance on default behavior that could change between library versions.
pmultiqc/modules/common/ms/idxml.py (1)
108-111: No action needed.

The project is explicitly constrained to `pyopenms<=3.4.0` (in pyproject.toml, requirements.txt, and environment.yml), so the codebase will never reach the pyopenms >= 3.5.0 scenario mentioned in the comment. The current implementation using plain lists is correct for all supported versions. The commented-out line appears to be a developer note about potential future compatibility, not an active issue. If the note causes confusion, it can optionally be removed, but this is not a critical concern.

Likely an incorrect or invalid review comment.
pmultiqc/main.py (1)
48-49: Verify the attribute check and consider adding file existence validation.

Two concerns:

1. Attribute check: using `custom_logo_width` to detect MultiQC v1.33+ dark logo support seems indirect. Consider checking for `hasattr(config, "custom_logo_dark")` instead to directly verify dark logo support.
2. Missing file validation: unlike the main logo (line 42), there's no existence check for `pmultiqc_logo_dark` before setting `config.custom_logo_dark`. If the file doesn't exist, the config will point to a non-existent path.

Consider adding file existence validation:

```python
if hasattr(config, "custom_logo_width"):
    if pmultiqc_logo_dark.exists():
        config.custom_logo_dark = str(pmultiqc_logo_dark)
    else:
        log.warning(f"pmultiqc dark logo file not found at: {pmultiqc_logo_dark}")
    config.custom_logo_width = 118
```

Additionally, please verify whether `custom_logo_width` is the correct attribute to check for v1.33+ support, or if `custom_logo_dark` would be more appropriate.

environment.yml (1)
8-8: LGTM!

The MultiQC version constraint update is consistent with other dependency files in this PR.

pyproject.toml (1)

44-44: LGTM!

The MultiQC version constraint update aligns with the dependency updates in requirements.txt, environment.yml, and recipe/meta.yaml.

recipe/meta.yaml (1)

28-28: LGTM!

The MultiQC runtime constraint update is consistent with the version updates across all other dependency files.
pmultiqc/modules/common/ms/msinfo.py (1)
114-127: LGTM! Performance improvement through batch processing.

The change from per-row iteration to passing the entire group DataFrame to `add_ms_values` is a good performance optimization. This aligns with the refactored `add_ms_values` function in ms_io.py that now accepts and processes DataFrames in batch.

pmultiqc/modules/mzidentml/mzidentml.py (3)
312-312: Ensure table.plot handles the nested data structure correctly.

The code now accesses `self.cal_num_table_data["ms_runs"]` instead of `self.cal_num_table_data` directly. This is consistent with the refactoring in `parse_out_mzid`, but verify that the table rendering works correctly with this structure.

143-147: Verified: `aggregate_msms_identified_rate` correctly handles a None `sdrf_file_df` parameter.

The function implementation in common_utils.py (lines 507-508) explicitly checks `if sdrf_file_df is None:` and returns the per-run identified rate. The call at lines 143-147 in mzidentml.py with `sdrf_file_df=None` will correctly return only the per-run identified rate computed by `cal_msms_identified_rate`.

1213-1266: The nested structure is correctly and consistently used throughout the module.

All consumers of `self.cal_num_table_data` access the nested structure `{"ms_runs": ...}` using bracket notation (line 312) or pass it to `draw_quantms_identification()`, which safely checks for optional keys like `"sdrf_samples"` using `.get()`. No code expects a flat structure. The refactoring maintains consistency.
110-117: LGTM! Good defensive coding with explicit type casting and NaN handling.

The explicit type casting to Int64 and float ensures data consistency, and the `data_add_value` helper properly handles NaN values by converting them to None before passing to the histogram.

119-126: Verify DIA mode processes all spectra correctly.

In DIA mode, the function processes all precursor_charge, base_peak_intensity, and num_peaks values without filtering by identified scans. This appears correct for DIA workflows where all spectra are typically considered.
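The identified/unidentified split used in ms_io.py can be illustrated with a tiny standalone frame (column names follow the review; the data is made up):

```python
import pandas as pd

# Toy stand-in for info_df with the columns named in the review.
info_df = pd.DataFrame({
    "scan": [1, 2, 3, 4],
    "precursor_charge": [2, 3, 2, 4],
})
identified_spectrum_scan_id = {1, 3}

# One boolean mask drives both subsets, replacing row-by-row iteration.
identified_scans = info_df["scan"].isin(identified_spectrum_scan_id)
identified = info_df[identified_scans]
unidentified = info_df[~identified_scans]

print(identified["precursor_charge"].tolist())
print(unidentified["precursor_charge"].tolist())
```

In DIA mode the mask is simply skipped, so all four rows would be processed together.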
pmultiqc/modules/common/mzidentml_utils.py (1)
45-52: Remove unused import and function or clarify intent.
`get_mzid_num_data` is defined but never called anywhere in the codebase. The import statement in `pmultiqc/modules/mzidentml/mzidentml.py` line 20 is also unused. Either remove the function and import or document the reason for keeping unused code.
71-105: SDRF merge and Sample type handling looks correct.

The implementation correctly:

- Checks `sdrf_file_df.empty` before merging
- Drops duplicates on the merge key columns
- Casts Sample to int for consistent sorting
- Generates both Run and Sample views when SDRF is available

One minor improvement: the static analysis hint on line 87 suggests using an explicit conversion flag (`f"Sample {run!s}"` instead of `f"Sample {str(run)}"`), but this is purely stylistic.
269-303: Early return when no data is produced is a good defensive pattern.

The function correctly handles the case when `calculate_dia_intensity_std` returns empty data by returning early (lines 273-274), preventing downstream errors.

pmultiqc/modules/common/common_utils.py (1)
551-561: `group_charge` utility function is well-implemented.

The function correctly:
- Groups by the specified columns
- Converts column types to string for consistent handling
- Adds "Sample " prefix when grouping by Sample for display purposes
pmultiqc/modules/quantms/quantms.py (2)
424-430: MSMS identified rate aggregation correctly handles optional SDRF.

The code properly checks for both `mzml_table` and `identified_msms_spectra` before calling `aggregate_msms_identified_rate`, and passes `self.file_df`, which may be empty if no SDRF is available.

1105-1112: Column subsetting before stand_spectra_ref is a good optimization.

Selecting only needed columns early reduces memory footprint when processing large peptide tables.
pmultiqc/modules/common/plots/id.py (2)
509-532: `rebuild_dict_structure` correctly transforms MC counts to percentages.

The function properly:
- Buckets missed cleavages into 0, 1, and >=2 categories
- Handles division by zero when total is 0
- Returns percentage values for plotting
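A small standalone sketch of that bucketing logic (the function name is hypothetical; the real `rebuild_dict_structure` signature is not shown in the review):

```python
def mc_to_percentages(mc_counts):
    # Bucket missed-cleavage counts into "0", "1" and ">=2", then convert
    # to percentages, guarding against a zero total.
    buckets = {"0": 0, "1": 0, ">=2": 0}
    for mc, n in mc_counts.items():
        buckets[str(mc) if mc < 2 else ">=2"] += n
    total = sum(buckets.values())
    if total == 0:
        return {key: 0.0 for key in buckets}
    return {key: 100.0 * count / total for key, count in buckets.items()}

print(mc_to_percentages({0: 6, 1: 3, 2: 1}))
```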
814-830: Dynamic label detection for modifications plot is consistent with other functions.

The pattern of checking `isinstance(modified_data["plot_data"], list)` to determine labeling is used consistently across multiple drawing functions, providing a uniform user experience.
212-259: `_process_run_data` restructured to return cal_num_table_data.

The refactored function now:

- Builds per-run statistics and data separately
- Computes sample-level aggregation via `cal_num_table_at_sample`
- Structures the return as `{"sdrf_samples": ..., "ms_runs": ...}`

This aligns with the data structure expected by downstream plotting functions.

262-289: `_calculate_run_statistics` now returns a tuple.

The function cleanly separates aggregated statistics (`stat_run`) from raw data needed for sample-level aggregation (`data_per_run`). This is a good separation of concerns.

298-298: Updated path for files without PSM uses the correct key.

The fix correctly writes to `cal_num_table_data["ms_runs"][i]` instead of a top-level key, maintaining consistency with the new data structure.
```python
# Supported starting from MultiQC v1.33
if hasattr(config, "custom_logo_width"):
    config.custom_logo_dark = str(pmultiqc_logo_dark)

if hasattr(config, "custom_logo_width"):
    config.custom_logo_width = 118
```
🛠️ Refactor suggestion | 🟠 Major
Combine duplicate conditional blocks.
Lines 48 and 51 both check the same condition, `if hasattr(config, "custom_logo_width")`. These should be merged into a single conditional block.

🔎 Proposed refactor

```diff
 # Supported starting from MultiQC v1.33
 if hasattr(config, "custom_logo_width"):
     config.custom_logo_dark = str(pmultiqc_logo_dark)
-
-if hasattr(config, "custom_logo_width"):
     config.custom_logo_width = 118
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Supported starting from MultiQC v1.33
if hasattr(config, "custom_logo_width"):
    config.custom_logo_dark = str(pmultiqc_logo_dark)
    config.custom_logo_width = 118
```
🤖 Prompt for AI Agents
In pmultiqc/main.py around lines 47 to 52, there are two separate if-blocks both
checking hasattr(config, "custom_logo_width"); merge them into a single
conditional that checks hasattr(config, "custom_logo_width") once and inside
that block set both config.custom_logo_dark = str(pmultiqc_logo_dark) and
config.custom_logo_width = 118 so both assignments occur only when the attribute
exists.
```python
pep_table = pep_table.merge(
    right=self.file_df[["Sample", "Run"]].drop_duplicates(),
    left_on="stand_spectra_ref",
    right_on="Run"
)

pep_intensity_by_sample = dict()
pep_table["Sample"] = pep_table["Sample"].astype(int)
for name, group in pep_table.groupby("Sample", sort=True):

    if config.kwargs["remove_decoy"]:
        pep_intensity_by_sample[f"Sample {str(name)}"] = group[
            (group["opt_global_cv_MS:1002217_decoy_peptide"] == 0)
        ]["average_intensity"].apply(
            lambda x: np.log2(x) if pd.notnull(x) and x > 0 else 1
        )
    else:
        pep_intensity_by_sample[f"Sample {str(name)}"] = group["average_intensity"].apply(
            lambda x: np.log2(x) if pd.notnull(x) and x > 0 else 1
        )

self.quantms_pep_intensity = [pep_intensity_by_run, pep_intensity_by_sample]
```
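The `apply()` lambda in the excerpt above can be checked in isolation; this is a minimal reproduction with made-up intensities:

```python
import numpy as np
import pandas as pd

def safe_log2(series):
    # Log2-transform intensities; NaN and non-positive values fall back to 1,
    # mirroring the lambda in the excerpt above.
    return series.apply(lambda x: np.log2(x) if pd.notnull(x) and x > 0 else 1)

intensities = pd.Series([8.0, 0.0, np.nan, 4.0])
print(safe_log2(intensities).tolist())
```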
🧩 Analysis chain
🏁 Scripts executed against bigbio/pmultiqc:

```bash
# Check for file_df emptiness checks and broader context
rg -n "file_df" pmultiqc/modules/quantms/quantms.py | head -30

# Look at the function containing lines 1166-1187 to understand validation
sed -n '1100,1200p' pmultiqc/modules/quantms/quantms.py | cat -n

# Search for any validations before the merge operation
rg -B 20 "pep_table.merge" pmultiqc/modules/quantms/quantms.py | head -50

# View context around line 1456 to see the if-block structure
sed -n '1450,1470p' pmultiqc/modules/quantms/quantms.py | cat -n

# Get broader context to understand function structure and block scope
sed -n '1140,1175p' pmultiqc/modules/quantms/quantms.py | cat -n
sed -n '1100,1180p' pmultiqc/modules/quantms/quantms.py | cat -n

# Check the full function definition and its entry conditions
sed -n '1050,1170p' pmultiqc/modules/quantms/quantms.py | cat -n

# Check what happens after the merge and pep_intensity_by_sample assignment
sed -n '1187,1200p' pmultiqc/modules/quantms/quantms.py | cat -n

# Look for where self.quantms_pep_intensity is used to understand impact of empty dict
rg -n "quantms_pep_intensity" pmultiqc/modules/quantms/quantms.py

# Check the context around line 2195 to see how quantms_pep_intensity is used
sed -n '2190,2210p' pmultiqc/modules/quantms/quantms.py | cat -n

# Verify if there's any validation before cal_heat_map_score is called
rg -B 10 "cal_heat_map_score" pmultiqc/modules/quantms/quantms.py | head -30

# Check what pep_table_exists is and how it relates to file_df
rg -B 5 -A 5 "pep_table_exists" pmultiqc/modules/quantms/quantms.py | head -40

# Check how file_df gets populated to understand if it can be empty when pep_table_exists is true
rg -n "self.file_df\s*=" pmultiqc/modules/quantms/quantms.py

# Find where pep_table_exists is set to True
rg -n "pep_table_exists\s*=\s*True" pmultiqc/modules/quantms/quantms.py

# Find where file_df gets populated with data (beyond initialization)
rg -n "file_df.*=" pmultiqc/modules/quantms/quantms.py -A 1 | head -40

# Search for where file_df data is being added/assigned in the class
rg -n "file_df\[|file_df\.loc|file_df\.insert|\.append|\.concat" pmultiqc/modules/quantms/quantms.py | head -20

# Check how pep_table_exists gets set to True and in what condition
sed -n '1340,1360p' pmultiqc/modules/quantms/quantms.py | cat -n

# Look at the __init__ method to see if file_df is populated there
sed -n '140,220p' pmultiqc/modules/quantms/quantms.py | cat -n

# Search for any constructor or initialization code that populates file_df
rg -n "def __init__|file_df.*pd\." pmultiqc/modules/quantms/quantms.py | head -10
```
Add a guard to check whether file_df is empty before the merge operation.

The merge at lines 1166-1170 will proceed even if `self.file_df` is empty (when experimental design/SDRF files are unavailable), resulting in an empty `pep_table` after the merge. This causes `pep_intensity_by_sample` to be an empty dict, while `self.quantms_pep_intensity` becomes a list containing an empty dict. The downstream check at line 2195 (`if self.quantms_pep_intensity:`) doesn't catch this, because non-empty lists remain truthy. Add an `if not self.file_df.empty:` guard before the merge, or handle the case explicitly.
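To see why the guard matters, an inner merge against an empty frame silently drops every row (standalone toy data):

```python
import pandas as pd

pep_table = pd.DataFrame({"stand_spectra_ref": ["run1", "run2"]})
file_df = pd.DataFrame(columns=["Sample", "Run"])  # no SDRF available

merged = pep_table.merge(
    file_df[["Sample", "Run"]].drop_duplicates(),
    left_on="stand_spectra_ref",
    right_on="Run",
)
print(len(merged))  # the merge yields no rows

# A falsy sentinel keeps downstream `if quantms_pep_intensity:` checks honest.
quantms_pep_intensity = [] if file_df.empty else ["per_run", "per_sample"]
print(bool(quantms_pep_intensity))
```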
🧰 Tools
🪛 Ruff (0.14.10)
1177-1177: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1183-1183: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
In pmultiqc/modules/quantms/quantms.py around lines 1166 to 1187, add a guard to
check if self.file_df is empty before performing the merge: if self.file_df is
empty, skip the merge/loop and set self.quantms_pep_intensity to an empty list
(or otherwise a falsy value) so downstream checks like `if
self.quantms_pep_intensity:` behave correctly; otherwise proceed with the merge,
grouping and population of pep_intensity_by_sample as before and assign
self.quantms_pep_intensity = [pep_intensity_by_run, pep_intensity_by_sample].
```python
def cal_charge_state(psm, sdrf_file_df):

    charge_state_df = psm[["filename", "charge"]].copy()

    charge_state_df = charge_state_df.merge(
        right=sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
        left_on="filename",
        right_on="Run"
    )

    charge_state_df["Sample"] = charge_state_df["Sample"].astype(int)

    charge_state_by_run = group_charge(charge_state_df, "filename", "charge")
    charge_state_by_sample = group_charge(charge_state_df, "Sample", "charge")

    return {
        "plot_data": [
            charge_state_by_run.to_dict(orient="index"),
            charge_state_by_sample.to_dict(orient="index"),
        ],
        "cats": sorted(
            set(charge_state_by_run.columns)
            | set(charge_state_by_sample.columns),
            key=int
        )
    }
```
`cal_charge_state` assumes `sdrf_file_df` is always valid.

The function merges `psm` with `sdrf_file_df` without checking whether `sdrf_file_df` is empty. If called with an empty DataFrame, the merge will produce no rows, causing `group_charge` to return empty tables.
🔎 Suggested fix to handle empty SDRF
```diff
 def cal_charge_state(psm, sdrf_file_df):
     charge_state_df = psm[["filename", "charge"]].copy()
+    if sdrf_file_df.empty:
+        charge_state_by_run = group_charge(charge_state_df, "filename", "charge")
+        return {
+            "plot_data": charge_state_by_run.to_dict(orient="index"),
+            "cats": sorted(charge_state_by_run.columns, key=int)
+        }
+
     charge_state_df = charge_state_df.merge(
         right=sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
         left_on="filename",
         right_on="Run"
     )
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def cal_charge_state(psm, sdrf_file_df):

    charge_state_df = psm[["filename", "charge"]].copy()

    if sdrf_file_df.empty:
        charge_state_by_run = group_charge(charge_state_df, "filename", "charge")
        return {
            "plot_data": charge_state_by_run.to_dict(orient="index"),
            "cats": sorted(charge_state_by_run.columns, key=int)
        }

    charge_state_df = charge_state_df.merge(
        right=sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
        left_on="filename",
        right_on="Run"
    )

    charge_state_df["Sample"] = charge_state_df["Sample"].astype(int)

    charge_state_by_run = group_charge(charge_state_df, "filename", "charge")
    charge_state_by_sample = group_charge(charge_state_df, "Sample", "charge")

    return {
        "plot_data": [
            charge_state_by_run.to_dict(orient="index"),
            charge_state_by_sample.to_dict(orient="index"),
        ],
        "cats": sorted(
            set(charge_state_by_run.columns)
            | set(charge_state_by_sample.columns),
            key=int
        )
    }
```
🤖 Prompt for AI Agents
In pmultiqc/modules/quantms/quantms.py around lines 2258-2283, cal_charge_state
merges psm with sdrf_file_df without checking for an empty SDRF which yields no
rows and empty groupings; add an early check: if sdrf_file_df is None or
sdrf_file_df.empty then return a consistent empty result (e.g.,
{"plot_data":[{},{}], "cats": []}) or alternatively fall back to using
psm["filename"] as the Run/Sample mapping before grouping; implement the chosen
fallback so group_charge always receives a non-empty DataFrame and ensure types
match (Sample as int) before calling group_charge.
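One way the suggested run-only fallback could look in isolation (a sketch; `group_charge` is replaced here by an inline pandas equivalent, and the SDRF path is omitted):

```python
import pandas as pd

def cal_charge_state_fallback(psm, sdrf_file_df):
    # When no SDRF is available, fall back to the per-run view only.
    charge_df = psm[["filename", "charge"]].copy()
    if sdrf_file_df is None or sdrf_file_df.empty:
        by_run = (
            charge_df.groupby("filename")["charge"].value_counts().unstack(fill_value=0)
        )
        return {
            "plot_data": by_run.to_dict(orient="index"),
            "cats": sorted((str(c) for c in by_run.columns), key=int),
        }
    raise NotImplementedError("SDRF path omitted in this sketch")

psm = pd.DataFrame({"filename": ["r1", "r1", "r2"], "charge": [2, 3, 2]})
result = cal_charge_state_fallback(psm, pd.DataFrame())
print(result["cats"])
```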
```python
def sample_level_modifications(df, sdrf_file_df):

    psm = df[
        ["filename", "sequence", "charge", "Modifications"]
    ].copy()

    psm = psm.merge(
        right=sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
        left_on="filename",
        right_on="Run"
    )

    psm["Sample"] = psm["Sample"].astype(int)

    mod_plot = dict()
    for sample, group in psm.groupby("Sample", sort=True):

        mod_plot_dict, _ = summarize_modifications(
            group[["sequence", "charge", "Modifications"]].drop_duplicates()
        )
        mod_plot[f"Sample {str(sample)}"] = mod_plot_dict

    return mod_plot
```
**`sample_level_modifications` similarly lacks an empty-SDRF check.**

Like `cal_charge_state`, this function does not handle the case where `sdrf_file_df` is empty: the merge then produces no rows, and the function returns an empty `mod_plot` dictionary.
🔎 Suggested fix:

```diff
 def sample_level_modifications(df, sdrf_file_df):
+    if sdrf_file_df.empty:
+        return {}
+
     psm = df[
         ["filename", "sequence", "charge", "Modifications"]
     ].copy()
```
🧰 Tools — 🪛 Ruff (0.14.10): 2306-2306: Use explicit conversion flag (RUF010)
🤖 Prompt for AI Agents

> In pmultiqc/modules/quantms/quantms.py around lines 2286 to 2308, `sample_level_modifications` doesn't handle an empty or missing SDRF (`sdrf_file_df`), which causes the merge to yield no rows and an empty `mod_plot`. Add an early guard that checks whether `sdrf_file_df` is None or `sdrf_file_df[["Sample", "Run"]].drop_duplicates()` is empty and immediately return an empty dict; otherwise proceed with the merge and subsequent processing. Also ensure that an empty post-merge result returns an empty dict before casting `Sample` to int, to avoid exceptions.
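The failure mode behind this whole family of comments is easy to reproduce: an inner `merge` against an empty right-hand frame silently drops every row. A minimal sketch of the guard the reviewer asks for (the `merge_with_sdrf` helper is hypothetical, not pmultiqc code):

```python
import pandas as pd

def merge_with_sdrf(psm: pd.DataFrame, sdrf: pd.DataFrame):
    # Guard first: merging against an empty (or missing) SDRF would
    # silently drop every PSM row via the inner join.
    if sdrf is None or sdrf.empty:
        return psm.copy()
    return psm.merge(
        sdrf[["Sample", "Run"]].drop_duplicates(),
        left_on="filename", right_on="Run",
    )

psm = pd.DataFrame({"filename": ["run1.mzML", "run2.mzML"], "charge": [2, 3]})
empty_sdrf = pd.DataFrame(columns=["Sample", "Run"])

# The unguarded merge against an empty SDRF loses all rows:
unguarded = psm.merge(empty_sdrf, left_on="filename", right_on="Run")
guarded = merge_with_sdrf(psm, empty_sdrf)
```

With the guard in place the later `groupby` still iterates over the original rows rather than nothing at all.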
Actionable comments posted: 11

🧹 Nitpick comments (11)

pmultiqc/main.py (1)

54-54: Consider mentioning the dark logo in the log message. The message only references the light logo, but the code also configures a dark logo (when supported). Including both would improve observability.

```diff
- log.info(f"pmultiqc: injected custom logo from local path: {pmultiqc_logo}")
+ log.info(f"pmultiqc: injected custom logo(s) from local path: {pmultiqc_logo}")
```

Or, for more detail:

```diff
  log.info(f"pmultiqc: injected custom logo from local path: {pmultiqc_logo}")
+ if hasattr(config, "custom_logo_dark") and pmultiqc_logo_dark.exists():
+     log.info(f"pmultiqc: injected custom dark logo from local path: {pmultiqc_logo_dark}")
```

pmultiqc/modules/common/common_utils.py (3)

546-552: Add `strict=True` to the `zip()` call for safer iteration. Per static analysis (B905), `zip()` without `strict=True` silently truncates if the iterables have different lengths. Both come from the same DataFrame here, so they should match, but explicit is better.

```diff
 mod_plot_dict = dict(
-    zip(mod_group_processed["modifications"], mod_group_processed["percentage"])
+    zip(mod_group_processed["modifications"], mod_group_processed["percentage"], strict=True)
 )
```

532-532: Use an f-string conversion flag instead of explicit `str()`. Per Ruff RUF010, prefer `{sample!s}` over `{str(sample)}` for cleaner f-strings.

```diff
- sample_key = f"Sample {str(sample)}"
+ sample_key = f"Sample {sample!s}"
```

560-564: Use an f-string conversion flag here as well (RUF010), preferring `{i!s}` over `{str(i)}`. Note that when `group_col == "Sample"` and the index is empty, this list comprehension is still safe.

```diff
 if group_col == "Sample":
     table.index = [
-        f"Sample {str(i)}" for i in table.index
+        f"Sample {i!s}" for i in table.index
     ]
```

pmultiqc/modules/common/plots/dia.py (3)

578-582: Missing explicit return in the else branch. When `can_groupby_for_std` returns False, the function logs a warning but implicitly returns None. This is correct behavior (the caller checks at line 273), but an explicit `return None` improves clarity.

```diff
 else:
     log.warning("No SDRF available; failed to parse experimental groups; SD Intensity not generated.")
+    return None
```

87-90: Use an f-string conversion flag instead of `str()` (RUF010): prefer `{run!s}` over `{str(run)}`.

```diff
 (
-    f"Sample {str(run)}"
+    f"Sample {run!s}"
     if data_type == "Sample"
     else str(run)
 ): group["log_intensity"].dropna().tolist()
```

553-556: Use an f-string conversion flag (RUF010): prefer `{sample!s}` over `{str(sample)}`.

```diff
 plot_data = {
-    f"Sample {str(sample)}": group["log_intensity_std"].dropna().tolist()
+    f"Sample {sample!s}": group["log_intensity_std"].dropna().tolist()
     for sample, group in grouped_std.groupby("Sample")
 }
```

pmultiqc/modules/common/plots/id.py (3)

414-418: Use an f-string conversion flag instead of `str()` (RUF010): prefer `{sample!s}` over `{str(sample)}`.

```diff
 (
-    f"{sample_prefix}{str(sample)}"
+    f"{sample_prefix}{sample!s}"
     if "sdrf_samples" == source
     else str(sample)
 ): {"Count": info[value_key]}
```

650-650: Use an f-string conversion flag (RUF010).

```diff
- sample=SampleName(f"Sample {str(sample)}"),
+ sample=SampleName(f"Sample {sample!s}"),
```

718-718: Use an f-string conversion flag (RUF010), consistent with line 650.

```diff
- sample=SampleName(f"Sample {str(sample)}"),
+ sample=SampleName(f"Sample {sample!s}"),
```

pmultiqc/modules/common/dia_utils.py (1)

854-854: Use an f-string conversion flag instead of `str()` (RUF010): prefer `{sample!s}` over `{str(sample)}`.

```diff
- mod_plot[f"Sample {str(sample)}"] = mod_plot_dict
+ mod_plot[f"Sample {sample!s}"] = mod_plot_dict
```

📜 Review details

Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (9)

- environment.yml
- pmultiqc/main.py
- pmultiqc/modules/common/common_utils.py
- pmultiqc/modules/common/dia_utils.py
- pmultiqc/modules/common/plots/dia.py
- pmultiqc/modules/common/plots/id.py
- pyproject.toml
- recipe/meta.yaml
- requirements.txt

🚧 Files skipped from review as they are similar to previous changes (4)

- environment.yml
- pyproject.toml
- requirements.txt
- recipe/meta.yaml

🧰 Additional context used

🧬 Code graph analysis (2)

- pmultiqc/modules/common/plots/dia.py (1)
  - pmultiqc/modules/common/common_utils.py (1): `group_charge` (555-565)
- pmultiqc/modules/common/plots/id.py (2)
  - pmultiqc/modules/quantms/sparklines.py (1): `plot` (19-56)
  - pmultiqc/modules/common/common_utils.py (1): `condition_split` (55-60)

🪛 Ruff (0.14.10)

- pmultiqc/modules/common/plots/dia.py — 87, 554: Use explicit conversion flag (RUF010)
- pmultiqc/modules/common/common_utils.py — 532, 562: Use explicit conversion flag (RUF010); 548: `zip()` without an explicit `strict=` parameter (B905)
- pmultiqc/modules/common/plots/id.py — 415, 650, 718: Use explicit conversion flag (RUF010)
- pmultiqc/modules/common/dia_utils.py — 854: Use explicit conversion flag (RUF010)

⏰ Context from checks skipped due to timeout of 90000ms; you can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

- GitHub Check: test_maxquant_dia
- GitHub Check: test_maxquant
- GitHub Check: test_dia
- GitHub Check: test_diann
- GitHub Check: test_lfq
- GitHub Check: test_proteobench
- GitHub Check: test_mzid_mzML
- GitHub Check: test_tmt
- GitHub Check: test_mzid_mgf
- GitHub Check: Codacy Static Code Analysis

🔇 Additional comments (5)

pmultiqc/main.py (2)

40-40: LGTM! The dark logo path follows a consistent pattern: its construction mirrors the existing light logo path and uses appropriate `Path` operations.

47-52: The past duplicate-conditional issue has been resolved. The previous review flagged duplicate conditionals checking `custom_logo_width`; the current code correctly checks two different attributes: `custom_logo_dark` (line 48) and `custom_logo_width` (line 51).

pmultiqc/modules/common/plots/id.py (1)

619-622: LGTM — defensive `.get()` access pattern. Good use of `.get()` on `cal_num_table_data` to safely access `sdrf_samples` and `ms_runs`, addressing the pattern from previous review comments.

pmultiqc/modules/common/dia_utils.py (2)

262-289: LGTM — clean refactor of `_calculate_run_statistics`. The function now returns a tuple `(stat_run, data_per_run)` instead of mutating external state: a good separation of concerns that improves testability.

241-246: LGTM — proper structure for `cal_num_table_data`. The structure with `"sdrf_samples"` and `"ms_runs"` keys cleanly separates sample-level and run-level data, aligning well with the SDRF-aware workflow.
```python
if hasattr(config, "custom_logo_dark"):
    config.custom_logo_dark = str(pmultiqc_logo_dark)
```

**Verify the dark logo file exists before use.**

The code checks whether `pmultiqc_logo` exists (line 42) but does not verify `pmultiqc_logo_dark` exists before setting `config.custom_logo_dark`. If the dark logo file is missing, this could cause issues at render time.

🔎 Proposed fix to add an existence check:

```diff
 # Supported starting from MultiQC v1.33
-if hasattr(config, "custom_logo_dark"):
+if hasattr(config, "custom_logo_dark") and pmultiqc_logo_dark.exists():
     config.custom_logo_dark = str(pmultiqc_logo_dark)
```

🤖 Prompt for AI Agents

> In pmultiqc/main.py around lines 48-49, the code sets `config.custom_logo_dark` without verifying that the `pmultiqc_logo_dark` file exists. Check that the file exists (e.g. `Path(pmultiqc_logo_dark).exists()` or `os.path.exists`) before assigning `config.custom_logo_dark = str(pmultiqc_logo_dark)`; if the file does not exist, simply skip setting the attribute (or log a warning) to avoid render-time errors.
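The degrade-gracefully behavior the prompt describes can be sketched with a small helper; `resolve_logos` is hypothetical, not a pmultiqc or MultiQC API:

```python
import tempfile
from pathlib import Path

def resolve_logos(light: Path, dark: Path):
    # Only expose paths that actually exist on disk; a missing dark
    # logo degrades to None instead of failing later at render time.
    return (
        str(light) if light.exists() else None,
        str(dark) if dark.exists() else None,
    )

with tempfile.TemporaryDirectory() as tmp:
    light = Path(tmp) / "logo.png"
    light.write_bytes(b"\x89PNG")       # light logo present
    dark = Path(tmp) / "logo_dark.png"  # dark logo deliberately absent
    light_path, dark_path = resolve_logos(light, dark)
```

A caller would then set each config attribute only when the corresponding value is not None.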
| if "Modifications" in df_copy.columns and "modifications" not in df_copy.columns: | ||
| df_copy = df_copy.rename(columns={"Modifications": "modifications"}) | ||
| else: | ||
| log.warning('Detected both "Modifications" and "modifications" columns.') | ||
|
|
There was a problem hiding this comment.
Incomplete column handling when both columns exist.
When both "Modifications" and "modifications" columns exist, the code logs a warning but then proceeds to use df_copy["modifications"] on line 281, which may not exist if only "Modifications" was present. The else branch doesn't handle the case where only "modifications" exists (which is fine) but also doesn't handle the "both exist" case properly.
🔎 Suggested fix
- if "Modifications" in df_copy.columns and "modifications" not in df_copy.columns:
- df_copy = df_copy.rename(columns={"Modifications": "modifications"})
- else:
- log.warning('Detected both "Modifications" and "modifications" columns.')
+ if "Modifications" in df_copy.columns and "modifications" in df_copy.columns:
+ log.warning('Detected both "Modifications" and "modifications" columns. Using "modifications".')
+ elif "Modifications" in df_copy.columns:
+ df_copy = df_copy.rename(columns={"Modifications": "modifications"})
+ elif "modifications" not in df_copy.columns:
+ raise ValueError("Neither 'Modifications' nor 'modifications' column found in DataFrame.")🤖 Prompt for AI Agents
In pmultiqc/modules/common/common_utils.py around lines 276-280, the current
else branch treats all non-rename cases as "both exist" and only logs a warning,
causing later code to fail if the lowercase column doesn't actually exist;
change the logic to explicitly handle three cases: (1) if both "Modifications"
and "modifications" exist, populate/merge into "modifications" (e.g., fillna
from "Modifications" or prefer non-null values) and then drop the
"Modifications" column while logging a warning; (2) if only "Modifications"
exists, rename it to "modifications" as before; (3) if only "modifications"
exists, do nothing; ensure after this block df_copy always has a "modifications"
column when either original column was present.
| sample_file_df["Sample"] = sample_file_df["Sample"].astype(int) | ||
|
|
There was a problem hiding this comment.
Potential ValueError if Sample column contains non-numeric values.
astype(int) will raise a ValueError if the Sample column contains non-integer-convertible values. Consider adding error handling or validation.
🔎 Suggested fix
sample_file_df = file_df.copy()
- sample_file_df["Sample"] = sample_file_df["Sample"].astype(int)
+ try:
+ sample_file_df["Sample"] = pd.to_numeric(sample_file_df["Sample"], errors="raise").astype(int)
+ except (ValueError, TypeError) as e:
+ log.warning(f"Failed to convert Sample column to int: {e}. Returning empty dict.")
+ return dict()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sample_file_df["Sample"] = sample_file_df["Sample"].astype(int) | |
| sample_file_df = file_df.copy() | |
| try: | |
| sample_file_df["Sample"] = pd.to_numeric(sample_file_df["Sample"], errors="raise").astype(int) | |
| except (ValueError, TypeError) as e: | |
| log.warning(f"Failed to convert Sample column to int: {e}. Returning empty dict.") | |
| return dict() |
🤖 Prompt for AI Agents
In pmultiqc/modules/common/common_utils.py around lines 457-458, the direct use
of sample_file_df["Sample"].astype(int) can raise a ValueError for non-numeric
values; replace this with a safe conversion using pandas.to_numeric (e.g.,
sample_file_df["Sample"] = pd.to_numeric(sample_file_df["Sample"].str.strip(),
errors="coerce")), then check for resulting NaNs and either raise a clear
ValueError listing offending rows or handle them (drop/fill) as appropriate for
the pipeline, ensuring any error message includes filename/row info for easier
debugging.
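The difference between the two conversion styles the prompt contrasts is worth seeing concretely: `astype(int)` fails fast on the first bad value, while `errors="coerce"` marks bad values as `NaN` so they can be inspected and reported before deciding to drop or fail.

```python
import pandas as pd

samples = pd.Series(["1", "2", "not-a-number"])

# astype(int) raises ValueError on the first non-numeric entry...
try:
    samples.astype(int)
    astype_raised = False
except ValueError:
    astype_raised = True

# ...while errors="coerce" keeps going and turns bad values into NaN,
# letting the caller report the offending rows with context.
coerced = pd.to_numeric(samples, errors="coerce")
bad_rows = samples[coerced.isna()]
```

After the coerce pass, the clean subset can still be cast to int safely.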
```diff
 mod_plot_by_sample = dia_sample_level_modifications(
     df=report_data,
     sdrf_file_df=sdrf_file_df
 )

 # Update quantms_modified with processed data
-quantms_modified["plot_data"] = mod_plot_dict
+quantms_modified["plot_data"] = [mod_plot_by_run, mod_plot_by_sample]
 quantms_modified["cats"] = list(
     sorted(modified_cats, key=lambda x: (x == "Modified (Total)", x))
```

**`dia_sample_level_modifications` called unconditionally may fail with an empty SDRF.**

The function performs a merge with `sdrf_file_df` without checking whether it is empty. If `sdrf_file_df` is empty, the merge produces an empty DataFrame, and subsequent operations may not behave as expected.

🔎 Suggested fix:

```diff
-mod_plot_by_sample = dia_sample_level_modifications(
-    df=report_data,
-    sdrf_file_df=sdrf_file_df
-)
+if sdrf_file_df is not None and not sdrf_file_df.empty:
+    mod_plot_by_sample = dia_sample_level_modifications(
+        df=report_data,
+        sdrf_file_df=sdrf_file_df
+    )
+else:
+    mod_plot_by_sample = {}
 # Update quantms_modified with processed data
-quantms_modified["plot_data"] = [mod_plot_by_run, mod_plot_by_sample]
+if mod_plot_by_sample:
+    quantms_modified["plot_data"] = [mod_plot_by_run, mod_plot_by_sample]
+else:
+    quantms_modified["plot_data"] = mod_plot_by_run
```

🤖 Prompt for AI Agents

> In pmultiqc/modules/common/dia_utils.py around lines 248 to 256, `dia_sample_level_modifications` is called unconditionally with `sdrf_file_df`, which can be empty and cause merges to yield empty DataFrames that break downstream logic. First check whether the SDRF is usable (e.g. `if sdrf_file_df is None or sdrf_file_df.empty`) and only call `dia_sample_level_modifications` when it contains rows; otherwise set `mod_plot_by_sample` to a safe default matching the expected output, and ensure `quantms_modified["plot_data"]` and `quantms_modified["cats"]` are populated consistently so downstream code does not assume non-empty merge results.
```python
def dia_sample_level_modifications(df, sdrf_file_df):

    report_data = df.copy()

    report_data = report_data.merge(
        right=sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
        on="Run"
    )

    report_data["Sample"] = report_data["Sample"].astype(int)

    mod_plot = dict()
    for sample, group in report_data.groupby("Sample", sort=True):

        mod_plot_dict, _ = summarize_modifications(
            group.drop_duplicates()
        )
        mod_plot[f"Sample {str(sample)}"] = mod_plot_dict

    return mod_plot
```

**Missing empty-DataFrame guard in `dia_sample_level_modifications`.**

The function assumes `sdrf_file_df` has data. If called with an empty DataFrame, the merge produces an empty result and the subsequent `groupby` iterates over nothing, returning an empty `mod_plot`. Consider adding an early return.

🔎 Suggested fix:

```diff
 def dia_sample_level_modifications(df, sdrf_file_df):
+    if sdrf_file_df is None or sdrf_file_df.empty:
+        return {}
+
     report_data = df.copy()
     report_data = report_data.merge(
```

🧰 Tools — 🪛 Ruff (0.14.10): 854-854: Use explicit conversion flag (RUF010)

🤖 Prompt for AI Agents

> In pmultiqc/modules/common/dia_utils.py around lines 837 to 856, `dia_sample_level_modifications` assumes `sdrf_file_df` has rows before merging and grouping. Add an early-empty guard: if `sdrf_file_df` is empty (or if the merge leaves `report_data` empty), immediately return an empty dict to avoid downstream processing; implement the check before the merge, add a second check after it, and return `{}` when empty.
```python
if not sdrf_file_df.empty:

    df_sub = df_sub.merge(
        sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
        on="Run"
    )
```

**Potential AttributeError if `sdrf_file_df` is None.**

The function signature accepts `sdrf_file_df` but checks `sdrf_file_df.empty` directly. If the caller passes None instead of an empty DataFrame, this raises an AttributeError.

🔎 Suggested fix:

```diff
-if not sdrf_file_df.empty:
+if sdrf_file_df is not None and not sdrf_file_df.empty:
```

🤖 Prompt for AI Agents

> In pmultiqc/modules/common/plots/dia.py around lines 75 to 80, the code checks `sdrf_file_df.empty` directly, which raises an AttributeError if `sdrf_file_df` is None. Change the conditional to explicitly guard against None (e.g. `if sdrf_file_df is not None and not sdrf_file_df.empty`) before calling `.merge`, so the merge only runs when `sdrf_file_df` is a real DataFrame with rows.
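The guard works because `and` short-circuits: `df is not None` is evaluated first, so `.empty` is never touched on None. A tiny helper (illustrative only, not part of pmultiqc) makes the pattern reusable:

```python
import pandas as pd

def has_rows(df) -> bool:
    # Short-circuit evaluation: when df is None, the right-hand
    # `.empty` access never runs, so no AttributeError is raised.
    return df is not None and not df.empty
```

Note that a plain `if not df:` is not a substitute here: pandas deliberately raises on truth-testing a DataFrame, so the explicit `is None` / `.empty` pair is the idiomatic check.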
```python
if sdrf_file_df.empty:

    bar_data = stat_data_by_run.to_dict(orient="index")
    plot_label = ["by Run"]
```

**Same None-safety issue for the `sdrf_file_df.empty` check.**

Consistent with `draw_dia_intensity_dis`, this check should guard against None.

🔎 Suggested fix:

```diff
-if sdrf_file_df.empty:
+if sdrf_file_df is None or sdrf_file_df.empty:
     bar_data = stat_data_by_run.to_dict(orient="index")
```

🤖 Prompt for AI Agents

> In pmultiqc/modules/common/plots/dia.py around lines 210 to 213, the check `if sdrf_file_df.empty:` is unsafe if `sdrf_file_df` can be None; broaden the guard to handle None explicitly (e.g. `if sdrf_file_df is None or sdrf_file_df.empty:`) so the branch executes correctly when `sdrf_file_df` is None or an empty DataFrame, then proceed to build `bar_data` and `plot_label` as before.
```python
if not sdrf_file_df.empty:

    plot_data = {
        condition: group["log_intensity_std"].dropna().tolist()
        for condition, group in grouped_std.groupby("run_condition")
    }
    df_sub = df_sub.merge(
        sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
        on="Run"
    )
```

**Guard against None for `sdrf_file_df` in `calculate_dia_intensity_std`.**

Same pattern as the other functions: check for None before accessing `.empty`.

🔎 Suggested fix:

```diff
-if not sdrf_file_df.empty:
+if sdrf_file_df is not None and not sdrf_file_df.empty:
```

🤖 Prompt for AI Agents

> In pmultiqc/modules/common/plots/dia.py around lines 540 to 545, the code checks `sdrf_file_df.empty` without guarding against `sdrf_file_df` being None. Update the condition to first check that `sdrf_file_df` is not None (e.g. `if sdrf_file_df is not None and not sdrf_file_df.empty:`) and keep the existing merge only when that check passes, so the function safely skips the merge when `sdrf_file_df` is None.
| "data_labels": ["by Run", "by Sample"] | ||
| } |
There was a problem hiding this comment.
Hardcoded data_labels may not match actual data structure.
The data_labels is hardcoded to ["by Run", "by Sample"], but if missed_cleavages_data["plot_data"] contains only one dataset (not a list), this will cause a mismatch between labels and data.
🔎 Suggested fix
+ if isinstance(missed_cleavages_data["plot_data"], list):
+ plot_label = ["by Run", "by Sample"]
+ else:
+ plot_label = ["by Run"]
+
draw_config = {
"id": "missed_cleavages",
"cpswitch": False,
"cpswitch_c_active": False,
"title": "Missed Cleavages",
"tt_decimals": 2,
"ylab": "Missed Cleavages [%]",
- "data_labels": ["by Run", "by Sample"]
+ "data_labels": plot_label
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "data_labels": ["by Run", "by Sample"] | |
| } | |
| if isinstance(missed_cleavages_data["plot_data"], list): | |
| plot_label = ["by Run", "by Sample"] | |
| else: | |
| plot_label = ["by Run"] | |
| draw_config = { | |
| "id": "missed_cleavages", | |
| "cpswitch": False, | |
| "cpswitch_c_active": False, | |
| "title": "Missed Cleavages", | |
| "tt_decimals": 2, | |
| "ylab": "Missed Cleavages [%]", | |
| "data_labels": plot_label | |
| } |
🤖 Prompt for AI Agents
In pmultiqc/modules/common/plots/id.py around lines 176-177, the data_labels
list is hardcoded to ["by Run", "by Sample"] which can mismatch when
missed_cleavages_data["plot_data"] is a single dataset (not a list); update the
code to detect if plot_data is a list and set data_labels dynamically (e.g., if
plot_data is not a list wrap it into a list or generate labels matching the
number of datasets, using sensible defaults or indexed names) so that the length
of data_labels always matches the number of datasets in plot_data.
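The label-per-dataset invariant can be captured in a small helper; `build_data_labels` is a hypothetical name sketching the suggested behavior, not an existing pmultiqc function:

```python
def build_data_labels(plot_data):
    # One label per dataset: a bare dict is a single "by Run" series,
    # while a list is assumed to carry run-level then sample-level data.
    if isinstance(plot_data, list):
        return ["by Run", "by Sample"][: len(plot_data)]
    return ["by Run"]
```

Slicing to `len(plot_data)` keeps the invariant even when the list holds a single dataset, so the plot config never carries more labels than datasets.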
```python
mc_ratio = list()

for k in ["ms_runs", "sdrf_samples"]:
    mc_ratio.append(rebuild_dict_structure(quantms_missed_cleavages[k]))

mc_data = {
    "plot_data": mc_ratio,
    "cats": ["0", "1", ">=2"]
}

draw_msms_missed_cleavages(sub_sections, mc_data, False)
```

**Potential KeyError when accessing `quantms_missed_cleavages` keys.**

The loop iterates over `["ms_runs", "sdrf_samples"]` but doesn't verify these keys exist in `quantms_missed_cleavages`. If `sdrf_samples` is missing, this raises a KeyError.

🔎 Suggested fix:

```diff
-mc_ratio = list()
-
-for k in ["ms_runs", "sdrf_samples"]:
-    mc_ratio.append(rebuild_dict_structure(quantms_missed_cleavages[k]))
+mc_ratio = [
+    rebuild_dict_structure(quantms_missed_cleavages[k])
+    for k in ["ms_runs", "sdrf_samples"]
+    if k in quantms_missed_cleavages and quantms_missed_cleavages[k]
+]
+
+if not mc_ratio:
+    mc_ratio = [rebuild_dict_structure(quantms_missed_cleavages.get("ms_runs", {}))]
```

🤖 Prompt for AI Agents

> In pmultiqc/modules/common/plots/id.py around lines 490 to 500, the loop directly indexes `quantms_missed_cleavages` for the keys "ms_runs" and "sdrf_samples", which raises a KeyError if a key is missing. Access those keys safely (e.g. `dict.get(key, {})` or a `key in quantms_missed_cleavages` check) and ensure `rebuild_dict_structure` receives a valid empty default, so `mc_ratio` always gets a safe value even when a key is absent, before calling `draw_msms_missed_cleavages`.
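The `dict.get`-based pattern the prompt recommends can be sketched as follows; `safe_mc_ratios` is an illustrative helper, not pmultiqc code:

```python
def safe_mc_ratios(quantms_missed_cleavages, rebuild):
    # .get() with a falsy default skips absent or empty sections
    # instead of raising KeyError on a missing "sdrf_samples".
    return [
        rebuild(quantms_missed_cleavages[k])
        for k in ("ms_runs", "sdrf_samples")
        if quantms_missed_cleavages.get(k)
    ]

runs_only = {"ms_runs": {"run1": {"0": 90.0, "1": 10.0}}}
both = {**runs_only, "sdrf_samples": {"Sample 1": {"0": 88.0}}}
```

With only run-level data present, the sample-level entry is simply omitted, which also pairs naturally with the dynamic `data_labels` fix above this comment.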
Pull Request
Description
Brief description of the changes made in this PR.
Type of Change
Summary by CodeRabbit
New Features
Improvements
Updates