CLAUDE.md — Developer Guide for AI Assistants

This file is for Claude (and future AI assistants) working in this repository. Read this before making any changes.


Project Overview

PyViscel is a Python port of the VisCello R/Bioconductor single-cell explorer. It provides a Dash web application for interactive visualization and annotation of single-cell transcriptomics data stored in AnnData/h5ad format.

Stack: Python 3.12, Dash 4.x, Plotly 6.x, AnnData, pandas 2.x, numpy, scanpy.


Architecture

src/pyviscel/
├── app.py               # Dash app factory + all callbacks (largest file)
├── ui_components.py     # Layout builders (no callbacks, pure HTML/Dash components)
├── plotting.py          # Plotly figure builders (no Dash, pure numpy/plotly)
├── cello_class.py       # Cello / CelloCollection data model
├── io.py                # load_adata / save_adata / validate_adata
├── dim_reduction.py     # PCA / tSNE / UMAP wrappers
├── clustering.py        # Leiden / Louvain / density clustering
├── differential_expression.py  # Chi-sq / MWU / sSeq DE
├── enrichment.py        # GO/KEGG via gseapy
├── heatmap.py           # Annotated heatmap
└── convert/
    └── from_r.py        # R VisCello → AnnData conversion

The app is structured as a factory function create_app(adata) inside app.py. All callbacks are registered inside that function (closure pattern) so they share access to the mutable adata object via _get_adata().


Key Dash / Plotly Gotchas (hard-won knowledge)

Plotly 6.x breaking changes

  • hoverinfo="skip" suppresses selectedData — points with hoverinfo="skip" are NOT included in the selectedData event in Plotly 6+. Use hoverinfo="none" instead.
  • This applies to every trace that the user must be able to lasso-select.

go.Scattergl vs go.Scatter

  • go.Scattergl does not reliably surface customdata in selectedData. Always use go.Scatter for any trace where lasso selection must extract customdata. (Scattergl is fine for display-only traces where selection is not needed.)

customdata extraction from selectedData

Points in selectedData["points"] may have customdata as a scalar or [scalar]. Always unwrap lists before casting to int:

cd = pt.get("customdata")
# Unwrap [scalar] (or deeper nesting) down to the scalar itself.
while isinstance(cd, (list, tuple)):
    if not cd:  # empty list: no usable index
        cd = None
        break
    cd = cd[0]
if cd is not None:
    idx = int(cd)

Also provide a pointNumber fallback for traces where customdata may be absent.
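A minimal sketch of the unwrap-plus-fallback logic described above, wrapped as a function. The helper name `extract_cell_index` and the `trace_cell_indices` argument (a per-trace list mapping pointNumber to global cell index) are assumptions for illustration; they are not names taken from app.py.

```python
from typing import Optional, Sequence


def extract_cell_index(
    pt: dict, trace_cell_indices: Optional[Sequence[int]] = None
) -> Optional[int]:
    """Return the global cell index for one selectedData point, or None."""
    cd = pt.get("customdata")
    # Unwrap [scalar] or [[scalar]] down to the scalar itself.
    while isinstance(cd, (list, tuple)):
        if not cd:
            cd = None
            break
        cd = cd[0]
    if cd is not None:
        return int(cd)

    # Fallback: map pointNumber through the (hypothetical) per-trace index list.
    pn = pt.get("pointNumber")
    if (
        pn is not None
        and trace_cell_indices is not None
        and 0 <= pn < len(trace_cell_indices)
    ):
        return int(trace_cell_indices[pn])
    return None
```

The fallback is only meaningful when the trace has a static structure, so that pointNumber always maps to the same cell.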

Dynamic trace structure breaks lasso indices

If you remove cells from a trace on re-render, pointNumber indices shift. A preserved Plotly lasso selection will then map to wrong global cell indices. Fix: background traces must always contain ALL cells (static structure). Use selectionrevision (not uirevision) to clear the lasso highlight after each selection event without resetting zoom/pan.

uirevision vs selectionrevision

  • uirevision — preserves zoom, pan, camera angle. Change it to reset the view.
  • selectionrevision — controls selection highlight only. Change it to clear the lasso without resetting the view. Increment it after each selection event.
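As an illustration of the split between the two revision keys, the sketch below builds the relevant layout fields as a plain dict. The function name `revision_layout` and its arguments are hypothetical; only the `uirevision` and `selectionrevision` layout attributes come from Plotly itself.

```python
def revision_layout(view_key: str, selection_count: int) -> dict:
    """Build the two revision fields for a Plotly figure layout."""
    return {
        # Keep constant while the same projection is shown so zoom/pan survive.
        "uirevision": view_key,
        # Bump after every selection event so the stale lasso highlight is dropped.
        "selectionrevision": selection_count,
    }
```

A callback could pass the result to `fig.update_layout(**revision_layout("umap", n))`, incrementing `n` each time a selection is handled.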

Dash 4.x

  • Multiple callbacks writing the same Output require allow_duplicate=True.
  • prevent_initial_call=True is required on most callbacks to avoid firing at page load.

pandas 2.x

  • Categorical fillna: series.fillna("NA") raises TypeError on Categorical dtypes when "NA" is not already one of the categories. Use: series.astype(object).fillna("NA").astype(str)
  • Safe in-place assignment: adata.obs.loc[bool_mask, col] = value (CoW-safe).
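Both pandas patterns can be tried on a toy frame. This is a small sketch with made-up data; the `obs` frame here stands in for adata.obs and is not taken from the repository.

```python
import numpy as np
import pandas as pd

obs = pd.DataFrame({"cluster": pd.Categorical(["a", None, "b"])})

# "NA" is not an existing category, so go through object dtype before filling.
labels = obs["cluster"].astype(object).fillna("NA").astype(str)

# Assign through .loc with a boolean mask rather than chained indexing.
mask = np.array([True, False, True])
obs.loc[mask, "Manual_Selection"] = "Group 1"
```

Rows outside the mask are left as missing values in the new column.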

Cell Selection Pipeline

2-D projections

User lasso on vc-scatter
  → handle_cell_selection (selectedData)
  → vc-store-cells (list of global cell indices)
  → vc-cell-count (display label)
  → User clicks Confirm
  → confirm_selection
  → adata.obs["Manual_Selection"] = Group 1 / 2 / 3 ...
  → vc-store-group-counter incremented
  → vc-color-dd options updated (Manual_Selection appears)
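The Confirm step at the end of this pipeline can be sketched roughly as below. This is not the real confirm_selection callback: the function name, the "Unassigned" placeholder label, and the use of a plain DataFrame in place of adata.obs are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def confirm_selection_sketch(
    obs: pd.DataFrame, cell_indices: list, group_counter: int
) -> int:
    """Label the selected rows as the next group; return the incremented counter."""
    if "Manual_Selection" not in obs.columns:
        obs["Manual_Selection"] = "Unassigned"  # placeholder label (assumption)

    # Global cell indices are positional, so turn them into a boolean mask.
    mask = np.zeros(len(obs), dtype=bool)
    mask[cell_indices] = True
    obs.loc[mask, "Manual_Selection"] = f"Group {group_counter}"

    # The counter advances whether or not a later re-render succeeds.
    return group_counter + 1
```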

3-D projections (camera-angle projection)

User rotates vc-scatter (3D)
  → track_3d_camera (relayoutData → vc-3d-camera)
  → User clicks "Snapshot Current View"
  → render_3d_proj_view
      reads vc-3d-camera + adata.obsm[proj_key]
      _project_3d_to_camera(xyz, camera) → px, py
      renders vc-3d-proj-view (go.Scatter, customdata=ci_array, dragmode=lasso)
  → User lassos on vc-3d-proj-view
  → handle_3d_proj_selection (selectedData)
  → vc-store-cells (same as 2-D from here)
  → User clicks Confirm → same confirm_selection callback

_project_3d_to_camera math

  1. Compute forward vector: fwd = (center - eye) / |center - eye|
  2. Right axis: right = cross(fwd, up) / |cross(fwd, up)|
  3. Up-ortho axis: up_ortho = cross(right, fwd)
  4. Normalize point cloud to [-1,1]³
  5. px = xyz_norm @ right, py = xyz_norm @ up_ortho
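The five steps above can be sketched in NumPy as follows. This is an illustrative reimplementation, not the repository's _project_3d_to_camera: the function name, the per-axis min-max normalisation, the handling of constant axes, and the fallback to Plotly's default camera values when a key is missing are all assumptions.

```python
import numpy as np


def project_3d_to_camera_sketch(xyz: np.ndarray, camera: dict) -> tuple:
    """Project an (n, 3) point cloud onto the plane facing a Plotly-style camera."""

    def vec(key: str, default: dict) -> np.ndarray:
        d = camera.get(key) or default
        return np.array([d["x"], d["y"], d["z"]], dtype=float)

    # Missing entries fall back to Plotly's default camera (assumption).
    eye = vec("eye", {"x": 1.25, "y": 1.25, "z": 1.25})
    center = vec("center", {"x": 0.0, "y": 0.0, "z": 0.0})
    up = vec("up", {"x": 0.0, "y": 0.0, "z": 1.0})

    # Steps 1-3: orthonormal camera basis.
    fwd = center - eye
    fwd /= np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right /= np.linalg.norm(right)  # degenerate if up is parallel to fwd
    up_ortho = np.cross(right, fwd)

    # Step 4: per-axis min-max scaling to [-1, 1]; constant axes map to 0.
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    xyz_norm = np.where(hi > lo, 2.0 * (xyz - lo) / span - 1.0, 0.0)

    # Step 5: screen coordinates.
    return xyz_norm @ right, xyz_norm @ up_ortho
```

With the eye on the +x axis and z as up, the projection reduces to (y, z) of the normalised cloud, which is a quick way to sanity-check the axis conventions.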

Important Store IDs

| Store ID | Type | Purpose |
|---|---|---|
| vc-store-cells | memory | Current lasso selection (list of global cell indices) |
| vc-store-group-counter | memory | Next group number for Manual_Selection |
| vc-3d-camera | memory | Last known Plotly camera dict for 3D scatter |
| vc-store-sel-history | memory | Legacy: kept in layout but no longer used by callbacks |

UI Component IDs (selection-related)

| ID | Component | Purpose |
|---|---|---|
| vc-scatter | dcc.Graph | Main scatter plot (2D or 3D) |
| vc-cell-count | html.Small | Displays "N cells selected" |
| vc-confirm-annotation-btn | dbc.Button | Saves current selection as next Group |
| vc-3d-sel-panel | html.Div | Hidden for 2D; shown for 3D |
| vc-3d-snapshot-btn | dbc.Button | Takes camera-angle snapshot |
| vc-3d-proj-view | dcc.Graph | 2D projection canvas (lasso here) |
| vc-3d-proj-status | html.Small | Status/instruction text |
| vc-3d-proj-clear-btn | dbc.Button | Clears projection + resets store-cells |

Testing

pytest tests/ -q          # should report 412 passed
  • Tests are in tests/ — one file per module.
  • test_app.py tests layout structure and callback helpers via direct function calls.
  • Never mock the AnnData object — tests build real small AnnData fixtures.
  • All gseapy API calls in test_enrichment.py are mocked with unittest.mock.patch.
  • All mygene.info API calls in test_enrichment.py are mocked with unittest.mock.patch("requests.post", ...).
  • All 412 tests must pass before committing.

What Was Changed (session history)

Bug fixes applied in sessions up to 2026-03-19

app.py

  • confirm_selection: Changed except Exception: raise PreventUpdate → always increment group_counter even when scatter re-render fails. This was the root cause of Manual_Selection never appearing in the Color By dropdown.
  • confirm_selection: Removed vc-store-sel-history State (no longer needed).
  • handle_cell_selection: Added pointNumber fallback + robust customdata unwrapping for both scalar and [scalar] formats.
  • _get_projection_options: Fixed fallback to guard with cello_name in adata.uns.get("cellos", {}) before iterating obsm.
  • Replaced entire 3D multi-view system (3×axis dropdowns, 3×view renders, 3×view handlers, apply/clear/summarize) with camera-angle projection approach:
    • _project_3d_to_camera helper
    • track_3d_camera callback
    • render_3d_proj_view callback (Snapshot button)
    • handle_3d_proj_selection callback (lasso → vc-store-cells)
    • clear_3d_proj_selection callback (Clear button)
  • Added dcc.Store(id="vc-3d-camera") to layout.

plotting.py

  • expression_scatter: Changed all 4 go.Scattergl → go.Scatter; added ci_array = np.array(cell_indices) and customdata=ci_array[mask] to every trace so lasso selection works on gene expression views.
  • All traces: changed hoverinfo="skip" → hoverinfo="none" (Plotly 6 fix).
  • scatter_plot cover0 background trace: same hoverinfo fix.
  • Fixed fillna("NA") for Categorical columns (pandas 2.x CoW).

ui_components.py

  • Removed _sel_view_block (3-panel multi-view layout).
  • Added _camera_sel_panel() with new IDs: vc-3d-snapshot-btn, vc-3d-proj-clear-btn, vc-3d-proj-status, vc-3d-proj-view.
  • vc-3d-sel-panel now renders _camera_sel_panel() instead of the old 3-grid layout.

Bug fixes applied in session 2026-03-20

app.py → track_3d_camera

  • Fixed 3-D camera tracking: Plotly 6 sends rotation events as {"scene.camera": {"eye": …, "up": …, "center": …}} (nested dict under one key), NOT as flat "scene.camera.eye" / "scene.camera.up" / "scene.camera.center" keys. The old code looked for the flat keys → always got empty dict → always raised PreventUpdate → vc-3d-camera store never updated → Snapshot always used default angle. Fix: check "scene.camera" first (nested form), then fall back to flat keys.
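A sketch of the nested-first, flat-fallback lookup described in this fix. The function name `extract_camera` is hypothetical, and the real callback raises PreventUpdate where this sketch returns None.

```python
from typing import Optional


def extract_camera(relayout_data: Optional[dict]) -> Optional[dict]:
    """Return a camera dict from a relayoutData payload, or None if absent."""
    if not relayout_data:
        return None

    # Nested form: {"scene.camera": {"eye": ..., "up": ..., "center": ...}}
    nested = relayout_data.get("scene.camera")
    if isinstance(nested, dict) and nested:
        return dict(nested)

    # Flat form: {"scene.camera.eye": ..., "scene.camera.up": ..., ...}
    prefix = "scene.camera."
    flat = {
        key[len(prefix):]: value
        for key, value in relayout_data.items()
        if key.startswith(prefix) and key[len(prefix):] in ("eye", "up", "center")
    }
    return flat or None
```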

plotting.py → scatter_plot_3d (line ~676)

  • Fixed pandas Categorical fillna crash in the 3-D colour path. The 2-D path (scatter_plot) already used the safe pattern; the 3-D path did not. Changed values.fillna("NA") → values.astype(object).fillna("NA").astype(str). (Same fix as documented in the pandas 2.x gotcha above.)

Bug fixes applied in session 2026-03-20 (DE, enrichment, UI)

app.py

  • run_de (bidirectional DEGs): single run_de_test call, split by log2fc > 0 (Group 1) and log2fc < 0 (Group 2, fold-change negated). Both groups now appear in DE result tabs.
  • _de_df_to_records: fixed bool→float corruption from select_dtypes(include=[np.number]), which includes np.bool_. Now explicitly excludes bool columns before rounding. Without this fix, significant became 1.0/0.0 after JSON round-trip, breaking downstream df[df["significant"]] filters.
  • All df[df["significant"]] → df[df["significant"].astype(bool)] for safety.
  • update_palette_options / render_scatter: default colormap for gene expression changed from "rainbow2" to "viridis".
  • run_go_enrichment: added min_overlap=3 (was 5), diagnostic status message (gene count, organism, go_type), full traceback in error tabs. vc-go-status div shows errors and completion status. min_overlap lowered to avoid silently discarding valid results.
  • update_de_proj callback: added vc-de-proj-dd projection selector in DE panel.
  • render_de_scatter / render_de_gene_scatter: new callbacks for DE scatter + gene expression scatter using selected projection (supports 3D).
  • Added lazy gene search callbacks (search_gene_options, search_de_gene_options) — return ≤50 matches on keystroke instead of loading all var_names at start.
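The lazy gene search can be pictured with the sketch below. It is not the search_gene_options callback itself: the function name, the case-insensitive substring match, and the option dict shape are assumptions; only the cap of 50 matches comes from the note above.

```python
def search_gene_options_sketch(search_value, var_names, limit=50):
    """Return up to `limit` dropdown options whose gene name contains the query."""
    if not search_value:
        return []
    query = search_value.lower()
    options = []
    for name in var_names:
        if query in name.lower():
            options.append({"label": name, "value": name})
            if len(options) >= limit:
                break  # stop scanning once the cap is reached
    return options
```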

plotting.py

  • expression_scatter / expression_scatter_3d: default pal changed to "viridis".

ui_components.py

  • DE controls row: added vc-de-proj-dd projection dropdown (5-column layout).
  • results_panel: restructured from 2-col to 3-col — scatter | gene expression scatter | heatmap.
  • Added vc-de-gene-search dropdown and vc-de-gene-scatter graph (middle column).
  • Added vc-go-organism dropdown (default "hsa") in enrichment section.
  • GO_TYPES KEGG value corrected: "KEGG" → "kegg".
  • Added "all" option to GO_TYPES.
  • Added vc-go-status div for error/status display.

enrichment.py

  • Added _validate_gene_symbols() — rejects bool lists, "True"/"False" string lists, and auto-coerces pandas bool Series to gene names with a warning.
  • Added organisms: rno (rat), dme (fly), dre (zebrafish/fish), sce (yeast) to ENRICHR_LIBRARIES, _ENRICHR_ORGANISM, _VALID_ORGANISMS.
  • run_enrichment gets organism: str | None = None parameter that overrides adata.uns config.
  • Background warning in run_enrichment fires only when caller explicitly supplies background_symbols (not for the default adata.var_names).
  • _parse_enrichr_result: normalises Genes column from list or string → semicolon-separated string. gseapy sometimes returns a Python list instead of a string.
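The Genes normalisation in the last item amounts to something like the following sketch. The helper name `normalise_genes` and the treatment of None as an empty string are assumptions, not code from enrichment.py.

```python
def normalise_genes(value) -> str:
    """Coerce a Genes cell (list or string) to a semicolon-separated string."""
    if isinstance(value, (list, tuple)):
        return ";".join(str(gene) for gene in value)
    if value is None:
        return ""  # assumption: missing values become an empty string
    return str(value)
```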

differential_expression.py

  • feature_name_column dtype check: if the configured column is not string/object dtype (e.g. a boolean "highly_variable" column), falls back to adata.var_names and clears the bad config entry from adata.uns so the warning fires only once per session.

tests/test_enrichment.py (30 new tests, 91 total)

  • TestValidateGeneSymbols — 8 tests: bool list, bool-string list, pandas bool Series, pandas Index, empty list.
  • TestParseEnrichrResultListGenes — 2 tests: list-type Genes column handling.
  • TestNewOrganisms — 12 tests: rno/dme/dre/sce registry + organism string + API routing.
  • TestRunEnrichment extended — 6 tests: organism= override for all new organisms.
  • Fixed test_background_ignored_warning to call run_enrichment (warning lives there, not in compute_go).

Changes applied in session 2026-03-23 (Heatmap UI + Full Enrichment Suite)

ui_components.py

  • Heatmap: replaced 10/20/50/100 button group with free-form number input (vc-de-top-n-input, default 50, min 1). Removed "⚠ Recommend ≤ 50 genes for performance" warning.
  • GO_ORGANISMS: removed rno (Enrichr does not support rat).
  • GO_TYPES: expanded to 9 options — BP, MF, CC, All GO (go_all), KEGG, WikiPathways (wiki), MSigDB Hallmark (msigdb), Reactome Pathways (reactome), All.
  • enrichment_section: full replacement — ORA/GSEA mode toggle (vc-enrich-mode), fast-mode checkbox (vc-enrich-fast-mode), side-by-side Group 1/Group 2 dotplot+table layout (vc-enrich-dotplot-g1/g2, vc-enrich-table-g1/g2), hidden GSEA panel (vc-enrich-gsea-results), library warning div (vc-go-lib-warning).

app.py

  • vc-hmap-topn-store default changed from 30 → 50.
  • store_top_n: rewritten to read from number input instead of 4 buttons.
  • download_heatmap: changed from write_html → write_image(format="png", scale=2) using kaleido.
  • Added stores: vc-store-enrich-g1, vc-store-enrich-g2, vc-store-gsea.
  • Removed old run_go_enrichment (single-group, tabs-based) and download_go_table (Excel) callbacks.
  • gate_library_options: new callback — hides MSigDB/Reactome for non-human/mouse organisms.
  • toggle_enrich_mode: new callback — shows ORA or GSEA results panel based on mode selector.
  • run_ora_enrichment: new callback — runs ORA for both DE groups simultaneously; produces side-by-side dotplots + tables; ID mismatch warning (<50% overlap).
  • run_gsea_enrichment: new callback — builds signed log2FC ranked list from DE results (g1 positive, g2 negated back to negative), runs run_gsea_prerank(), splits results by NES sign, renders mountain plots.
  • download_enrichment_csv: new callback — downloads ORA (both groups) or GSEA results as .csv.

enrichment.py — major overhaul

  • ENRICHR_LIBRARIES: fully replaced. GO 2025 for hsa/mmu, GO 2018 for dme/dre/sce/cel/rno. KEGG: KEGG_2019_Human / KEGG_2019_Mouse / KEGG_2019 (others). WikiPathways: WikiPathways_2024_Human / WikiPathways_2024_Mouse / WikiPathways_2018 (others). MSigDB (MSigDB_Hallmark_2020) and Reactome (Reactome_Pathways_2024) for hsa/mmu only.
  • _VALID_GO_TYPES: added wiki, msigdb, reactome, go_all.
  • _HUMAN_MOUSE_ONLY_TYPES = frozenset({"msigdb", "reactome"}): new constant — these types require mouse→human conversion for mmu.
  • mouse_to_human_online(): new function — replaces HMD file lookup with mygene.info API (2 batch POSTs, no key, ~0.4 s for 30 genes).
  • compute_go(): auto-triggers convert_mouse_to_human for mmu + msigdb/reactome; uses mouse_to_human_online() when hmd_path=None (no longer raises ValueError); added min_gene_set_size=10 / max_gene_set_size=500 filters; retry (3×, exponential backoff); low-overlap warning (<50%).
  • _parse_enrichr_result(): added gene_ratio = overlap_count / overlap_total column.
  • run_enrichment(): passes through min_gene_set_size / max_gene_set_size.
  • _compute_running_es(): new helper — weighted GSEA running enrichment score algorithm.
  • _parse_gsea_result() / _GSEA_COLS: new GSEA result normalizer.
  • run_gsea_prerank(): new function — wraps gseapy.prerank() with retry, returns {results, ranking, gene_sets} dict for downstream mountain plotting.
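For orientation, the weighted running enrichment score mentioned for _compute_running_es can be sketched in plain Python as below. This follows the standard weighted GSEA scheme and is not the repository's helper; the function name, argument order, and error handling are assumptions.

```python
def running_es_sketch(ranked_genes, scores, gene_set, weight=1.0):
    """Return (running_es, es) for a ranked gene list and one gene set."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("gene set must hit some, but not all, ranked genes")

    running, total = [], 0.0
    for s, h in zip(scores, hits):
        # Hits step up in proportion to |score|^weight; misses step down uniformly.
        total += (abs(s) ** weight / hit_norm) if h else (-1.0 / n_miss)
        running.append(total)

    # ES is the signed value with the largest deviation from zero.
    es = max(running, key=abs)
    return running, es
```

The running sum returns to zero at the end of the list, which is a convenient check when plotting the mountain curve.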

plotting.py

  • enrichment_dotplot(): new function — Plotly bubble chart (x = gene_ratio, size = overlap_count, color = pval_adj, top 10 terms).
  • gsea_mountain_plot(): new function — Plotly enrichment score curve with inline running ES computation, hit rug, peak marker, NES/FDR annotation.

tests/test_enrichment.py (22 new tests, 412 total)

  • Updated: test_hsa_bp_uses_2025_library, test_correct_library_for_hsa_kegg (→ KEGG_2019_Human), test_sce_kegg_library_defined (→ KEGG_2019), test_dme/dre_kegg_uses_kegg_2019, test_convert_mouse_to_human_uses_online_when_no_hmd_path.
  • Added: TestHumanMouseOnlyTypes (10 tests), TestMouseToHumanOnline (4 tests), TestRunGseaPrerank (5 tests), TestComputeRunningEs (3 tests).

System fix

  • Removed corrupt macOS metadata file /Volumes/Shared/Concord/src/._concord_sc.egg-info that was blocking all pip installs.
  • Installed kaleido (required for PNG heatmap export).

Enrichment Suite — Library Map (as of 2026-03-23)

| go_type | Human (hsa) | Mouse (mmu) | Fly/Fish/Yeast/Worm |
|---|---|---|---|
| BP | GO_Biological_Process_2025 | same | GO_Biological_Process_2018 |
| MF | GO_Molecular_Function_2025 | same | GO_Molecular_Function_2018 |
| CC | GO_Cellular_Component_2025 | same | GO_Cellular_Component_2018 |
| kegg | KEGG_2019_Human | KEGG_2019_Mouse | KEGG_2019 |
| wiki | WikiPathways_2024_Human | WikiPathways_2024_Mouse | WikiPathways_2018 |
| msigdb | MSigDB_Hallmark_2020 | same* | n/a (hsa/mmu only) |
| reactome | Reactome_Pathways_2024 | same* | n/a (hsa/mmu only) |

*Mouse msigdb/reactome: auto-converts mouse symbols → human orthologs via mygene.info before calling Enrichr.
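The library map can be expressed as a small lookup. The sketch below encodes only the names listed in the table; the function and constant names are hypothetical and do not mirror the actual ENRICHR_LIBRARIES structure.

```python
GO_NAMES = {
    "BP": "GO_Biological_Process",
    "MF": "GO_Molecular_Function",
    "CC": "GO_Cellular_Component",
}
SPECIES_SUFFIX = {"hsa": "Human", "mmu": "Mouse"}
HUMAN_MOUSE_ONLY = {
    "msigdb": "MSigDB_Hallmark_2020",
    "reactome": "Reactome_Pathways_2024",
}


def resolve_library_sketch(organism: str, go_type: str) -> str:
    """Return the Enrichr library name for an organism / go_type pair."""
    is_human_or_mouse = organism in SPECIES_SUFFIX
    if go_type in GO_NAMES:
        year = 2025 if is_human_or_mouse else 2018
        return f"{GO_NAMES[go_type]}_{year}"
    if go_type == "kegg":
        return f"KEGG_2019_{SPECIES_SUFFIX[organism]}" if is_human_or_mouse else "KEGG_2019"
    if go_type == "wiki":
        if is_human_or_mouse:
            return f"WikiPathways_2024_{SPECIES_SUFFIX[organism]}"
        return "WikiPathways_2018"
    if go_type in HUMAN_MOUSE_ONLY:
        if not is_human_or_mouse:
            raise ValueError(f"{go_type} is only available for hsa/mmu")
        return HUMAN_MOUSE_ONLY[go_type]
    raise ValueError(f"unknown go_type: {go_type}")
```

Note that for mmu the msigdb/reactome entries still require the symbol conversion described in the footnote before the query is sent.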

GSEA Prerank Pipeline

DE results (vc-store-de)
  → g1 genes: use stored log2fc as-is  (positive)
  → g2 genes: negate stored log2fc back (negative, since app.py negates them on storage)
  → combine → pd.Series sorted descending
  → run_gsea_prerank(ranked_series, organism, go_type, permutations)
      → gseapy.get_library(lib)  ← downloads + caches gene set dict
      → gseapy.prerank(rnk=ranked_series, gene_sets=lib_dict, ...)
      → returns {results, ranking, gene_sets}
  → split by NES sign: positive NES = Group 1 up, negative NES = Group 2 up
  → gsea_mountain_plot() for top term in each direction
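The first three steps of this pipeline (sign handling, combining, sorting) might look like the sketch below. The function name and the dict inputs are assumptions; in the app the values come from the DE store rather than plain dicts.

```python
import pandas as pd


def build_ranked_series(g1: dict, g2: dict) -> pd.Series:
    """Combine per-group log2FC values into one signed ranking, sorted descending."""
    up_g1 = pd.Series(g1, dtype=float)   # stored positive, used as-is
    up_g2 = -pd.Series(g2, dtype=float)  # stored negated, flipped back to negative
    ranked = pd.concat([up_g1, up_g2])
    # Keep one entry per gene in case a symbol appears in both groups.
    ranked = ranked[~ranked.index.duplicated(keep="first")]
    return ranked.sort_values(ascending=False)
```

The resulting Series is the kind of object passed as the ranking argument in the prerank step.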

pandas / numpy gotchas (additions)

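select_dtypes(include=[np.number]) and boolean columns

Boolean columns were observed being rounded as numeric in _de_df_to_records, which turned significant into 1.0/0.0. Whether df.select_dtypes(include=[np.number]) picks up a boolean column depends on the column's dtype and the pandas version (np.bool_ itself is not a subclass of np.number in numpy's scalar hierarchy), so do not rely on that filter alone. Always exclude boolean columns explicitly before numeric operations: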

bool_cols = set(df.select_dtypes(include=[bool]).columns)
numeric_cols = [c for c in df.select_dtypes(include=[np.number]).columns
                if c not in bool_cols]
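Applied to a rounding step, the exclusion pattern looks roughly like this. The function below is a stand-in written for this guide, not the actual _de_df_to_records; its name and the `decimals` parameter are assumptions.

```python
import numpy as np
import pandas as pd


def de_df_to_records_sketch(df: pd.DataFrame, decimals: int = 4) -> list:
    """Round numeric columns only, leaving boolean columns untouched."""
    out = df.copy()
    bool_cols = set(out.select_dtypes(include=[bool]).columns)
    numeric_cols = [
        c for c in out.select_dtypes(include=[np.number]).columns
        if c not in bool_cols
    ]
    out[numeric_cols] = out[numeric_cols].round(decimals)
    return out.to_dict("records")
```

Because the boolean column never enters the rounding path, a later df[df["significant"]] style filter still receives real booleans.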

Bug fixes applied in session 2026-03-24 (GSEA runtime fix)

src/pyviscel/enrichment.py

Three bugs in run_gsea_prerank() and _parse_gsea_result() prevented GSEA from running:

  1. Wrong parameter name (weighted_score_type → weight):

    • gseapy v1.1.7 renamed weighted_score_type to weight in gseapy.prerank().
    • The old call passed weighted_score_type=weighted_score_type, which was silently ignored (absorbed into **kwargs), causing gseapy to use its default and raise internally.
    • Fix: gseapy.prerank(..., weight=weighted_score_type).
  2. Duplicate "term" column in _parse_gsea_result:

    • gseapy v1.1.7 res2d has two columns: "Name" (library source, e.g. "prerank") and "Term" (gene set name).
    • The old rename_map mapped both "Name" → "term" and "Term" → "term", creating a duplicate column that caused df[_GSEA_COLS] to return a malformed DataFrame.
    • Fix: drop "Name" before renaming; only "Term" → "term".
  3. Percentage strings in tag_pct / gene_pct:

    • gseapy returns "Tag %" and "Gene %" as strings like "10.00%". pd.to_numeric() with errors="coerce" silently turned them into NaN.
    • Fix: strip trailing % and divide by 100 before converting.
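The third fix can be illustrated with a short pandas sketch. The helper name `parse_pct_column` is hypothetical; the real change lives inside _parse_gsea_result.

```python
import pandas as pd


def parse_pct_column(col: pd.Series) -> pd.Series:
    """Convert strings such as "10.00%" to fractions such as 0.10."""
    stripped = col.astype(str).str.strip().str.rstrip("%")
    # Anything that still fails to parse becomes NaN rather than raising.
    return pd.to_numeric(stripped, errors="coerce") / 100.0
```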

Root cause was identified via inspect.signature(gseapy.prerank) in the prior session.


Known Issues (as of 2026-03-24)

  • dash_table.DataTable deprecation warning from Dash — no functional impact.
  • Enrichr API uses its own built-in background — background_symbols is ignored in online mode. Use compute_go_offline() for custom-background ORA.
  • GSEA Prerank with 1000 permutations can take several minutes; use fast mode (100 perms) for exploratory work.
  • mygene.info mouse→human conversion requires internet access; offline runs with MSigDB/Reactome for mmu will fail.