Merged
Binary file removed MultiQC_logo.png
5 changes: 3 additions & 2 deletions environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ channels:
- defaults
dependencies:
- python>=3.10
- multiqc>=1.29, <=1.32
- multiqc>=1.29, <=1.33
- pandas>=1.5
- pyteomics
- pyopenms<=3.4.0
Expand All @@ -18,4 +18,5 @@ dependencies:
- uvicorn
- requests
- redis
- statsmodels
- statsmodels
- urllib3>=2.6.1
Binary file added pmultiqc/images/pmultiqc_logo_dark.png
9 changes: 9 additions & 0 deletions pmultiqc/main.py
Expand Up @@ -37,11 +37,20 @@ def pmultiqc_plugin_execution_start():
# logo
current_dir = Path(__file__).parent.resolve()
pmultiqc_logo = current_dir / "images" / "pmultiqc_logo_report.png"
pmultiqc_logo_dark = current_dir / "images" / "pmultiqc_logo_dark.png"

if pmultiqc_logo.exists():
config.custom_logo = str(pmultiqc_logo)
config.custom_logo_url = "https://github.com/bigbio/pmultiqc"
config.custom_logo_title = "pmultiqc"

# Supported starting from MultiQC v1.33
if hasattr(config, "custom_logo_dark"):
config.custom_logo_dark = str(pmultiqc_logo_dark)
Comment on lines +48 to +49

⚠️ Potential issue | 🟡 Minor

Verify dark logo file existence before use.

The code checks whether pmultiqc_logo exists (line 42) but does not verify pmultiqc_logo_dark exists before setting config.custom_logo_dark. If the dark logo file is missing, this could cause issues at render time.

🔎 Proposed fix to add existence check
     # Supported starting from MultiQC v1.33
-    if hasattr(config, "custom_logo_dark"):
+    if hasattr(config, "custom_logo_dark") and pmultiqc_logo_dark.exists():
         config.custom_logo_dark = str(pmultiqc_logo_dark)
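The suggested guard can be exercised in isolation. `Config` below is a stand-in for MultiQC's `config` module and the path is illustrative; the point is that the attribute check and the file check must both pass before the assignment:

```python
from pathlib import Path

class Config:
    """Stand-in for multiqc's config module (illustrative)."""
    custom_logo_dark = None

config = Config()
pmultiqc_logo_dark = Path("images") / "pmultiqc_logo_dark.png"  # illustrative path

# Set the dark logo only when the config attribute is supported
# (MultiQC >= 1.33) AND the file actually exists on disk.
if hasattr(config, "custom_logo_dark") and pmultiqc_logo_dark.exists():
    config.custom_logo_dark = str(pmultiqc_logo_dark)

print(config.custom_logo_dark)  # None when the file is absent
```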


if hasattr(config, "custom_logo_width"):
config.custom_logo_width = 118
Comment on lines 47 to 52

🛠️ Refactor suggestion | 🟠 Major

Combine duplicate conditional blocks.

Lines 48 and 51 both check for the same condition if hasattr(config, "custom_logo_width"). These should be merged into a single conditional block.

🔎 Proposed refactor
     # Supported starting from MultiQC v1.33
     if hasattr(config, "custom_logo_width"):
         config.custom_logo_dark = str(pmultiqc_logo_dark)
-
-    if hasattr(config, "custom_logo_width"):
         config.custom_logo_width = 118
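The merged form can be sketched as follows. `Config` is again a stand-in for MultiQC's `config` module; gating `custom_logo_dark` on the `custom_logo_width` attribute assumes both attributes appeared in the same MultiQC release:

```python
class Config:
    """Stand-in for multiqc's config module (illustrative)."""
    custom_logo_width = None  # attribute present from MultiQC v1.33 on

config = Config()
pmultiqc_logo_dark = "images/pmultiqc_logo_dark.png"  # illustrative path

# One hasattr check guards both v1.33+ settings.
if hasattr(config, "custom_logo_width"):
    config.custom_logo_dark = pmultiqc_logo_dark
    config.custom_logo_width = 118

print(config.custom_logo_width)  # 118
```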


log.info(f"pmultiqc: injected custom logo from local path: {pmultiqc_logo}")
else:
log.warning(f"pmultiqc logo file not found at: {pmultiqc_logo}")
Expand Down
135 changes: 129 additions & 6 deletions pmultiqc/modules/common/common_utils.py
Expand Up @@ -269,12 +269,17 @@ def parse_mzml(
)


def mod_group_percentage(group):
if "Modifications" in group.columns:
group.rename(columns={"Modifications": "modifications"}, inplace=True)
def mod_group_percentage(df):

counts = group["modifications"].str.split(",").explode().value_counts()
percentage_df = (counts / len(group["modifications"]) * 100).reset_index()
df_copy = df.copy()

if "Modifications" in df_copy.columns and "modifications" not in df_copy.columns:
df_copy = df_copy.rename(columns={"Modifications": "modifications"})
else:
log.warning('Detected both "Modifications" and "modifications" columns.')

Comment on lines +276 to +280

⚠️ Potential issue | 🟡 Minor

Incomplete column handling when both columns exist.

The code only renames when "Modifications" is present and "modifications" is absent; every other case falls into the else branch and logs the "both columns" warning. That warning is misleading when only "modifications" exists, and when neither column exists the later access to df_copy["modifications"] on line 281 raises a KeyError.

🔎 Suggested fix
-    if "Modifications" in df_copy.columns and "modifications" not in df_copy.columns:
-        df_copy = df_copy.rename(columns={"Modifications": "modifications"})
-    else:
-        log.warning('Detected both "Modifications" and "modifications" columns.')
+    if "Modifications" in df_copy.columns and "modifications" in df_copy.columns:
+        log.warning('Detected both "Modifications" and "modifications" columns. Using "modifications".')
+    elif "Modifications" in df_copy.columns:
+        df_copy = df_copy.rename(columns={"Modifications": "modifications"})
+    elif "modifications" not in df_copy.columns:
+        raise ValueError("Neither 'Modifications' nor 'modifications' column found in DataFrame.")
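The three-way handling in the suggested fix can be sketched on a toy frame (column contents are made up; the helper name is illustrative):

```python
import pandas as pd

def normalize_mod_column(df):
    """Ensure a single lowercase 'modifications' column (illustrative helper)."""
    has_upper = "Modifications" in df.columns
    has_lower = "modifications" in df.columns
    if has_upper and has_lower:
        # Both present: prefer non-null lowercase values, then drop the duplicate.
        df = df.copy()
        df["modifications"] = df["modifications"].fillna(df["Modifications"])
        df = df.drop(columns=["Modifications"])
    elif has_upper:
        df = df.rename(columns={"Modifications": "modifications"})
    elif not has_lower:
        raise ValueError("Neither 'Modifications' nor 'modifications' column found.")
    return df

df = pd.DataFrame({
    "Modifications": ["Oxidation", "Unmodified"],
    "modifications": [None, "Carbamidomethyl"],
})
out = normalize_mod_column(df)
print(out["modifications"].tolist())  # ['Oxidation', 'Carbamidomethyl']
```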

counts = df_copy["modifications"].str.split(",").explode().value_counts()
percentage_df = (counts / len(df_copy["modifications"]) * 100).reset_index()
percentage_df.columns = ["modifications", "percentage"]

# Modified (Total)
Expand Down Expand Up @@ -440,4 +445,122 @@ def parse_sdrf(
True,
False,
condition_config, # config.kwargs["condition"],
)
)


def cal_num_table_at_sample(file_df, data_per_run):

if file_df.empty:
return dict()

sample_file_df = file_df.copy()
sample_file_df["Sample"] = sample_file_df["Sample"].astype(int)

Comment on lines +457 to +458

⚠️ Potential issue | 🟡 Minor

Potential ValueError if Sample column contains non-numeric values.

astype(int) will raise a ValueError if the Sample column contains non-integer-convertible values. Consider adding error handling or validation.

🔎 Suggested fix
     sample_file_df = file_df.copy()
-    sample_file_df["Sample"] = sample_file_df["Sample"].astype(int)
+    try:
+        sample_file_df["Sample"] = pd.to_numeric(sample_file_df["Sample"], errors="raise").astype(int)
+    except (ValueError, TypeError) as e:
+        log.warning(f"Failed to convert Sample column to int: {e}. Returning empty dict.")
+        return dict()
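A minimal sketch of the pd.to_numeric route, showing how offending rows can be surfaced before casting (the data is made up):

```python
import pandas as pd

samples = pd.Series(["1", " 2 ", "n/a", "3"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(samples.str.strip(), errors="coerce")
bad_rows = samples[converted.isna()]

if not bad_rows.empty:
    # Report (or drop) the failing rows instead of letting astype(int) raise.
    print(f"non-numeric Sample values at rows {list(bad_rows.index)}: {bad_rows.tolist()}")

clean = converted.dropna().astype(int)
print(clean.tolist())  # [1, 2, 3]
```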

cal_num_table_sample = dict()
for sample, group in sample_file_df.groupby("Sample", sort=True):
proteins_set = set()
peptides_set = set()
unique_peptides_set = set()
modified_pep_set = set()

for run in group["Run"].unique():

run_data = data_per_run.get(run)
if not run_data:
continue

proteins_set.update(run_data.get("proteins", []))
peptides_set.update(run_data.get("peptides", []))
unique_peptides_set.update(run_data.get("unique_peptides", []))
modified_pep_set.update(run_data.get("modified_peps", []))

cal_num_table_sample[str(sample)] = {
"protein_num": len(proteins_set),
"peptide_num": len(peptides_set),
"unique_peptide_num": len(unique_peptides_set),
"modified_peptide_num": len(modified_pep_set),
}

return cal_num_table_sample


def cal_msms_identified_rate(ms2_num_data, identified_data):

identified_rate = dict()
for m, msms_info in identified_data.items():
identified_ms2 = msms_info.get("Identified", 0)
all_ms2 = ms2_num_data.get(m, {}).get("MS2_Num", 0)
if all_ms2:
identified_rate[m] = {
"Identified Rate": identified_ms2 / all_ms2 * 100
}

return identified_rate


def aggregate_msms_identified_rate(
mzml_table,
identified_msms_spectra,
sdrf_file_df=None
):
identified_rate_by_run = cal_msms_identified_rate(
ms2_num_data=mzml_table,
identified_data=identified_msms_spectra
)

if sdrf_file_df is None:
return identified_rate_by_run

else:
identified_by_sample = dict()
ms2_num_by_sample = dict()

sdrf_file_df["Sample"] = sdrf_file_df["Sample"].astype(int)

for sample, group in sdrf_file_df.groupby("Sample", sort=True):
runs = group["Run"]

sample_identified_ms2 = sum(
identified_msms_spectra.get(run, {}).get("Identified", 0)
for run in runs
)
sample_all_ms2 = sum(
mzml_table.get(run, {}).get("MS2_Num", 0)
for run in runs
)

sample_key = f"Sample {str(sample)}"

identified_by_sample[sample_key] = {"Identified": sample_identified_ms2}
ms2_num_by_sample[sample_key] = {"MS2_Num": sample_all_ms2}

identified_rate_by_sample = cal_msms_identified_rate(
ms2_num_data=ms2_num_by_sample,
identified_data=identified_by_sample
)

return [identified_rate_by_run, identified_rate_by_sample]

def summarize_modifications(df):

mod_group_processed = mod_group_percentage(df)
mod_plot_dict = dict(
zip(mod_group_processed["modifications"], mod_group_processed["percentage"])
)
modified_cat = mod_group_processed["modifications"]

return mod_plot_dict, modified_cat


def group_charge(df, group_col, charge_col):

table = df.groupby([group_col, charge_col], sort=True).size().unstack(fill_value=0)
table.columns = table.columns.astype(str)

if group_col == "Sample":
table.index = [
f"Sample {str(i)}" for i in table.index
]

return table
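The new group_charge helper can be sanity-checked on a toy frame (column names mirror the diff; the data is made up):

```python
import pandas as pd

def group_charge(df, group_col, charge_col):
    # Count rows per (group, charge) and pivot charge states into columns.
    table = df.groupby([group_col, charge_col], sort=True).size().unstack(fill_value=0)
    table.columns = table.columns.astype(str)
    if group_col == "Sample":
        table.index = [f"Sample {i}" for i in table.index]
    return table

df = pd.DataFrame({"Sample": [1, 1, 2], "Charge": [2, 3, 2]})
table = group_charge(df, "Sample", "Charge")
print(table.loc["Sample 1", "2"], table.loc["Sample 2", "3"])  # 1 0
```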

111 changes: 83 additions & 28 deletions pmultiqc/modules/common/dia_utils.py
Expand Up @@ -13,7 +13,8 @@
from pmultiqc.modules.common.file_utils import file_prefix
from pmultiqc.modules.common.common_utils import (
evidence_rt_count,
mod_group_percentage
cal_num_table_at_sample,
summarize_modifications
)
from pmultiqc.modules.core.section_groups import add_sub_section
from pmultiqc.modules.common.plots.id import draw_ids_rt_count
Expand All @@ -33,7 +34,6 @@ def parse_diann_report(
sample_df,
file_df,
ms_with_psm,
cal_num_table_data,
quantms_modified,
ms_paths,
msstats_input_valid=False
Expand All @@ -53,7 +53,7 @@ def parse_diann_report(
_process_modifications(report_data)

# Process run-specific data
_process_run_data(report_data, ms_with_psm, cal_num_table_data, quantms_modified)
cal_num_table_data = _process_run_data(report_data, ms_with_psm, quantms_modified, file_df)

# Handle files without PSM
ms_without_psm = _handle_files_without_psm(ms_paths, ms_with_psm, cal_num_table_data)
Expand Down Expand Up @@ -105,15 +105,15 @@ def _draw_diann_plots(sub_sections, report_data, heatmap_color_list, sample_df,
"""Draw all DIA-NN plots."""
# Draw intensity plots and heatmap
if "Precursor.Quantity" in report_data.columns:
draw_dia_intensitys(sub_sections["quantification"], report_data)
draw_dia_intensitys(sub_sections["quantification"], report_data, file_df)
_draw_heatmap(sub_sections["summary"], report_data, heatmap_color_list)

# Draw other plots
log.info("Draw the DIA MS1 subsection.")
draw_dia_ms1(sub_sections["ms1"], report_data)

log.info("Draw the DIA MS2 subsection.")
draw_dia_ms2s(sub_sections["ms2"], report_data)
draw_dia_ms2s(sub_sections["ms2"], report_data, file_df)

log.info("Draw the DIA mass_error subsection.")
draw_dia_mass_error(sub_sections["mass_error"], report_data)
Expand Down Expand Up @@ -209,37 +209,58 @@ def find_diann_modified(peptide):
report_data["Modifications"] = report_data["Modified.Sequence"].apply(find_diann_modified)


def _process_run_data(report_data, ms_with_psm, cal_num_table_data, quantms_modified):
"""Process run-specific data including modifications and statistics."""
def _process_run_data(df, ms_with_psm, quantms_modified, sdrf_file_df):
"""
Process run-specific data including modifications and statistics.
"""

log.info("Processing DIA mod_plot_dict.")
mod_plot_dict = dict()

report_data = df[
["Run", "Modified.Sequence", "Modifications", "Protein.Group", "sequence"]
].copy()

mod_plot_by_run = dict()
modified_cats = list()

statistics_at_run = dict()
data_per_run = dict()

for run_file, group in report_data.groupby("Run"):
run_file = str(run_file)
ms_with_psm.append(run_file)

# Process modifications for this run
mod_group_processed = mod_group_percentage(group.drop_duplicates())
mod_plot_dict[run_file] = dict(
zip(mod_group_processed["modifications"], mod_group_processed["percentage"])
)
modified_cats.extend(mod_group_processed["modifications"])
mod_plot_dict, modified_cat = summarize_modifications(group.drop_duplicates())
mod_plot_by_run[run_file] = mod_plot_dict
modified_cats.extend(modified_cat)

# Calculate statistics for this run
_calculate_run_statistics(group, run_file, cal_num_table_data)
statistics_at_run[run_file], data_per_run[run_file] = _calculate_run_statistics(group)

num_table_at_sample = cal_num_table_at_sample(sdrf_file_df, data_per_run)

cal_num_table_data = {
"sdrf_samples": num_table_at_sample,
"ms_runs": statistics_at_run
}

mod_plot_by_sample = dia_sample_level_modifications(
df=report_data,
sdrf_file_df=sdrf_file_df
)

# Update quantms_modified with processed data
quantms_modified["plot_data"] = mod_plot_dict
quantms_modified["plot_data"] = [mod_plot_by_run, mod_plot_by_sample]
quantms_modified["cats"] = list(
sorted(modified_cats, key=lambda x: (x == "Modified (Total)", x))
Comment on lines +248 to 256

⚠️ Potential issue | 🟡 Minor

dia_sample_level_modifications called unconditionally may fail with empty SDRF.

The function dia_sample_level_modifications performs a merge with sdrf_file_df without checking if it's empty. If sdrf_file_df is empty, the merge will produce an empty DataFrame, and subsequent operations may not behave as expected.

🔎 Suggested fix
-    mod_plot_by_sample = dia_sample_level_modifications(
-        df=report_data,
-        sdrf_file_df=sdrf_file_df
-    )
+    if sdrf_file_df is not None and not sdrf_file_df.empty:
+        mod_plot_by_sample = dia_sample_level_modifications(
+            df=report_data,
+            sdrf_file_df=sdrf_file_df
+        )
+    else:
+        mod_plot_by_sample = {}
 
     # Update quantms_modified with processed data
-    quantms_modified["plot_data"] = [mod_plot_by_run, mod_plot_by_sample]
+    if mod_plot_by_sample:
+        quantms_modified["plot_data"] = [mod_plot_by_run, mod_plot_by_sample]
+    else:
+        quantms_modified["plot_data"] = mod_plot_by_run

)

return cal_num_table_data

def _calculate_run_statistics(group, run_file, cal_num_table_data):

def _calculate_run_statistics(group):
"""Calculate statistics for a specific run."""
cal_num_table_data[run_file] = {"protein_num": len(set(group["Protein.Group"]))}
cal_num_table_data[run_file]["peptide_num"] = len(set(group["sequence"]))

peptides = set(group["Modified.Sequence"])
modified_pep = list(
Expand All @@ -251,8 +272,21 @@ def _calculate_run_statistics(group, run_file, cal_num_table_data):
pep for pep, prots in group_peptides.items() if len(set(prots)) == 1
]

cal_num_table_data[run_file]["unique_peptide_num"] = len(unique_peptides)
cal_num_table_data[run_file]["modified_peptide_num"] = len(modified_pep)
stat_run = {
"protein_num": len(set(group["Protein.Group"])),
"peptide_num": len(set(group["sequence"])),
"unique_peptide_num": len(unique_peptides),
"modified_peptide_num": len(modified_pep)
}

data_per_run = {
"proteins": set(group["Protein.Group"]),
"peptides": set(group["sequence"]),
"unique_peptides": unique_peptides,
"modified_peps": modified_pep
}

return stat_run, data_per_run


def _handle_files_without_psm(ms_paths, ms_with_psm, cal_num_table_data):
Expand All @@ -261,7 +295,7 @@ def _handle_files_without_psm(ms_paths, ms_with_psm, cal_num_table_data):

for i in ms_without_psm:
log.warning("No PSM found in '{}'!".format(i))
cal_num_table_data[i] = {
cal_num_table_data["ms_runs"][i] = {
"protein_num": 0,
"peptide_num": 0,
"unique_peptide_num": 0,
Expand All @@ -274,14 +308,13 @@ def _handle_files_without_psm(ms_paths, ms_with_psm, cal_num_table_data):
## Removed draw_dia_heatmap wrapper; call cal_dia_heatmap and dia_plots.draw_heatmap directly.


def draw_dia_intensitys(sub_section, report_df):
def draw_dia_intensitys(sub_section, report_df, sdrf_file_df):
df_sub = report_df[report_df["Precursor.Quantity"] > 0].copy()
df_sub["log_intensity"] = np.log2(df_sub["Precursor.Quantity"])

dia_plots.draw_dia_intensity_dis(sub_section, df_sub)
dia_plots.draw_dia_intensity_dis(sub_section, df_sub, sdrf_file_df)

if dia_plots.can_groupby_for_std(report_df, "Run"):
dia_plots.draw_dia_intensity_std(sub_section, df_sub)
dia_plots.draw_dia_intensity_std(sub_section, df_sub, sdrf_file_df)


def draw_dia_ms1(sub_section, df):
Expand All @@ -293,13 +326,13 @@ def draw_dia_ms1(sub_section, df):
dia_plots.draw_dia_ms1_area(sub_section, df_sub)


def draw_dia_ms2s(sub_section, df):
def draw_dia_ms2s(sub_section, df, sdrf_file_df):
# Distribution of Precursor Charges
if "Precursor.Charge" in df.columns:
dia_plots.draw_dia_whole_exp_charge(sub_section, df)

# Charge-state of Per File
dia_plots.draw_dia_ms2_charge(sub_section, df)
dia_plots.draw_dia_ms2_charge(sub_section, df, sdrf_file_df)


def draw_dia_mass_error(sub_section, df):
Expand Down Expand Up @@ -798,4 +831,26 @@ def create_protein_table(report_df, sample_df, file_df):

result_dict = {i: v for i, (_, v) in enumerate(table_dict.items(), start=1)}

return result_dict, headers
return result_dict, headers


def dia_sample_level_modifications(df, sdrf_file_df):

report_data = df.copy()

report_data = report_data.merge(
right=sdrf_file_df[["Sample", "Run"]].drop_duplicates(),
on="Run"
)

report_data["Sample"] = report_data["Sample"].astype(int)

⚠️ Potential issue | 🟡 Minor

Potential ValueError on astype(int) conversion.

Similar to other files, if the Sample column contains non-numeric values, this will raise a ValueError. Consider using pd.to_numeric with error handling for robustness.

🔎 Suggested fix
-    report_data["Sample"] = report_data["Sample"].astype(int)
+    try:
+        report_data["Sample"] = pd.to_numeric(report_data["Sample"], errors="raise").astype(int)
+    except (ValueError, TypeError):
+        log.warning("Failed to convert Sample column to int in dia_sample_level_modifications")
+        return {}


mod_plot = dict()
for sample, group in report_data.groupby("Sample", sort=True):

mod_plot_dict, _ = summarize_modifications(
group.drop_duplicates()
)
mod_plot[f"Sample {str(sample)}"] = mod_plot_dict

return mod_plot
Comment on lines +837 to +856

⚠️ Potential issue | 🟡 Minor

Missing empty DataFrame guard in dia_sample_level_modifications.

The function assumes sdrf_file_df has data. If called with an empty DataFrame, the merge at line 841-844 will produce an empty result, and the subsequent groupby will iterate over nothing, returning an empty mod_plot. Consider adding an early return.

🔎 Suggested fix
 def dia_sample_level_modifications(df, sdrf_file_df):
 
+    if sdrf_file_df is None or sdrf_file_df.empty:
+        return {}
+
     report_data = df.copy()
 
     report_data = report_data.merge(
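The early-return guard can be demonstrated directly. The frames below are toy data, and the helper is an illustrative reduction of dia_sample_level_modifications to its merge/groupby core:

```python
import pandas as pd

report = pd.DataFrame({"Run": ["run1", "run1"],
                       "modifications": ["Oxidation", "Unmodified"]})
sdrf_empty = pd.DataFrame(columns=["Sample", "Run"])

def sample_level_groups(df, sdrf_file_df):
    """Illustrative: map runs onto samples, guarding against an empty SDRF."""
    if sdrf_file_df is None or sdrf_file_df.empty:
        return {}  # nothing to map runs onto; skip the merge entirely
    merged = df.merge(sdrf_file_df[["Sample", "Run"]].drop_duplicates(), on="Run")
    return {f"Sample {s}": len(g) for s, g in merged.groupby("Sample", sort=True)}

print(sample_level_groups(report, sdrf_empty))  # {}

sdrf = pd.DataFrame({"Sample": [1], "Run": ["run1"]})
print(sample_level_groups(report, sdrf))  # {'Sample 1': 2}
```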
