
Merge dev with main branch #12

Closed
marioernestovaldes wants to merge 32 commits into main from dev

Conversation

marioernestovaldes (Owner) commented Jun 17, 2025

Merge dev with main branch

Summary by Sourcery

Merge development into main branch, applying code style standardizations, performance optimizations in scoring routines, expanded test coverage, and updated documentation and project configuration.

Enhancements:

  • Refactor and vectorize core scoring functions (populate_distance_dict, compute_mean_distance, compute_mean_variance_distance) to improve performance and readability
  • Update get_neighborhood_dict and concordance metric implementations with consistent typing, numpy-style logic, and enhanced docstrings
  • Streamline bootstrap and alignment pipeline in generate() by batching DataFrame assembly and optimizing Kabsch alignment
  • Standardize docstring style, import ordering, and PEP8 formatting across all modules

Build:

  • Add Ruff configuration in pyproject.toml to exclude tutorial notebooks from linting

Documentation:

  • Add AGENTS.md contributor guide covering development environment setup, testing, and PR guidelines

Tests:

  • Introduce new equivalence tests for distance computations, generate(), scoring, and concordance functions
  • Update existing test suite in test.py for improved error handling and formatting

Chores:

  • Remove unused imports and legacy code (e.g., dash dependencies) and clean up module imports

marioernestovaldes and others added 29 commits June 11, 2025 16:48

  • Add variance speed benchmark test (…core.variance-performance)
  • Improve distance dict build (…-calculation-in-populate_distance_dict)
  • Improve distance variance calculation (…mean_variance_distance-function)
  • Improve distance calculations (…-handling-with-numpy)
  • [dynamicviz] Run formatting and linting (…for-requested-task)
  • [dynamicviz] optimize bootstrap concatenation (…-in-list-and-update-tests)
  • [dynamicviz] remove artificial delay (…1-from-compute_mean_distance)
  • [dynamicviz] Fix test imports (…and-pytest-checks)
  • [dynamicviz] optimize distance computation (…_distance_dict-for-performance)
  • [dynamicviz] optimize distance computation (…e_distance_dict-for-performance)
  • [dynamicviz] Fix concordance metrics and add tests (…nce-score-calculation)

sourcery-ai bot commented Jun 17, 2025

Reviewer's Guide

This PR merges the development branch into main by refactoring core scoring and bootstrap algorithms for vectorized performance, standardizing code style and docstrings across modules, modernizing the visualization APIs, optimizing DataFrame assembly in bootstrapping, unifying distance metric implementations, updating CI lint configurations, and significantly expanding test coverage and contributor documentation.

Class diagram for refactored scoring and concordance functions in score.py

classDiagram
    class ScoreUtils {
        +variance(df, method, k, X_orig, neighborhoods, normalize_pairwise_distance, include_original, return_times)
        +stability(df, method, alpha, k, X_orig, neighborhoods, normalize_pairwise_distance, include_original, return_times)
        +stability_from_variance(mean_variance_distances, alpha)
        +get_neighborhood_dict(method, k, keys, neighborhoods, X_orig)
        +populate_distance_dict(neighborhood_dict, embeddings, bootstrap_indices)
        +compute_mean_distance(dist_dict, normalize_pairwise_distance)
        +compute_mean_variance_distance(dist_dict, normalize_pairwise_distance, mean_pairwise_distance)
    }
    class ConcordanceUtils {
        +concordance(df, X_orig, method, k, bootstrap_number)
        +ensemble_concordance(df, X_orig, methods, k, bootstrap_number, verbose)
        +get_jaccard(X_orig, X_red, k, precomputed)
        +get_distortion(X_orig, X_red, k, precomputed)
        +get_mean_projection_error(X_orig, X_red)
        +get_stretch(X_orig, X_red)
    }
    ScoreUtils <.. ConcordanceUtils : uses

Class diagram for refactored bootstrapping and DR pipeline in boot.py

classDiagram
    class BootUtils {
        +generate(X, method, Y, B, sigma_noise, no_bootstrap, random_seed, save, num_jobs, use_n_pcs, subsample, return_times, **kwargs)
        +bootstrap(X, method, B, sigma_noise, no_bootstrap, random_seed, num_jobs, use_n_pcs, subsample, **kwargs)
        +run_one_bootstrap(X, method, sigma_noise, no_bootstrap, random_seeded_sequence, b, use_n_pcs, subsample, **kwargs)
        +dimensionality_reduction(X, method, **kwargs)
    }

Class diagram for modernized visualization API in viz.py

classDiagram
    class VizUtils {
        +interactive(df, label, show, save, colors, xlabel, ylabel, title, legend_title, vmin, vmax, alpha, width, height, dpi, dim)
        +animated(df, label, save, get_frames, colors, cmap, width, height, dpi, xlabel, ylabel, title, title_fontsize, major_fontsize, minor_fontsize, vmin, vmax, marker, alpha, solid_cbar, show_legend, solid_legend, legend_fontsize, dim, **kwargs)
        +stacked(df, label, show, save, colors, cmap, width, height, dpi, xlabel, ylabel, title, title_fontsize, major_fontsize, minor_fontsize, vmin, vmax, marker, alpha, solid_cbar, show_legend, solid_legend, legend_fontsize, dim, frame, return_fig, **kwargs)
        +fig2img(fig, dpi)
    }

Class diagram for AnnData conversion utilities in utils.py

classDiagram
    class AnnDataUtils {
        +convert_anndata(adata, obs, obsm, n_components)
        +regenerate_anndata(adata, df, bootstrap_number, obsm)
    }

File-Level Changes

Change Details Files
Vectorize scoring functions and reduce nested loops for distance/variance computations
  • Rewrote populate_distance_dict using batched pairwise_distances and array slicing
  • Replaced per-entry loops in compute_mean_distance with list comprehensions and array operations
  • Transformed compute_mean_variance_distance into block-based numpy variance calculations
dynamicviz/score.py
tests/test_distance_equiv.py
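As an illustration of the vectorization pattern this change describes (a sketch, not the actual dynamicviz code), a full pairwise-distance matrix can be produced by one broadcast operation instead of a nested Python loop:

```python
import numpy as np

def pairwise_dists_looped(X):
    """Naive nested-loop pairwise Euclidean distances (the pre-refactor pattern)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.linalg.norm(X[i] - X[j])
    return D

def pairwise_dists_batched(X):
    """Vectorized equivalent: one broadcasted subtraction, then a reduction."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

For large n, `sklearn.metrics.pairwise_distances` (which the change notes mention) computes the same matrix without materializing the (n, n, d) intermediate.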
Standardize code style: signatures, imports, and docstrings
  • Expanded function definitions to multi-line with consistent indentation
  • Ordered imports alphabetically and removed unused/commented code
  • Replaced single-quoted comments with triple-quoted docstrings
dynamicviz/score.py
dynamicviz/viz.py
dynamicviz/boot.py
dynamicviz/utils.py
dynamicviz/__init__.py
Optimize bootstrap generation and alignment by batching DataFrame concatenation
  • Accumulate per-bootstrap DataFrames in a list then concat once
  • Simplified run_one_bootstrap and bootstrap loops with clearer argument formatting
dynamicviz/boot.py
tests/test_generate_equiv.py
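The "accumulate in a list, then concat once" pattern behind this change can be sketched as follows; the frame contents and column names here are toy stand-ins, not the real generate() output:

```python
import numpy as np
import pandas as pd

def make_bootstrap_frame(b, n, rng):
    """Toy stand-in for one bootstrap's embedding result (hypothetical columns)."""
    return pd.DataFrame({
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
        "bootstrap_number": b,
    })

rng = np.random.default_rng(0)
# Collect per-bootstrap frames in a list and concatenate once at the end;
# calling pd.concat inside the loop re-copies all prior rows every iteration.
frames = [make_bootstrap_frame(b, n=4, rng=rng) for b in range(3)]
df = pd.concat(frames, ignore_index=True)
```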
Modernize interactive and animated visualization APIs
  • Refactored viz.interactive and viz.animated for cleaner Plotly/Matplotlib calls
  • Removed unused Dash imports and expanded argument mapping
  • Unified color mapping and frame saving logic
dynamicviz/viz.py
tests/test_generate_equiv.py
Unify distance metric functions with robust precomputed handling
  • Replaced kneighbors-based Jaccard and distortion with argsort and vector formulas
  • Standardized input flags for precomputed distance matrices
dynamicviz/score.py
tests/test_concordance_equiv.py
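An argsort-based k-NN Jaccard, as described here, might look like the following sketch (illustrative only; the real get_jaccard signature and precomputed handling live in dynamicviz/score.py):

```python
import numpy as np

def knn_indices(D, k):
    """k-nearest-neighbor indices from a full distance matrix via argsort,
    dropping column 0 (each point's zero distance to itself)."""
    return np.argsort(D, axis=1)[:, 1:k + 1]

def jaccard_per_point(D_orig, D_red, k):
    """Pointwise Jaccard similarity between the k-NN sets of the original
    and reduced spaces."""
    nn_o, nn_r = knn_indices(D_orig, k), knn_indices(D_red, k)
    scores = np.empty(len(nn_o))
    for i, (a, b) in enumerate(zip(map(set, nn_o), map(set, nn_r))):
        scores[i] = len(a & b) / len(a | b)
    return scores
```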
Add CI lint configuration to exclude tutorial notebooks
  • Introduced [tool.ruff].extend_exclude in pyproject.toml
pyproject.toml
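For reference, the Ruff setting is spelled `extend-exclude` in pyproject.toml (the underscore form above is the summary's paraphrase); a minimal fragment, with the notebook glob being an assumption about the repo layout:

```toml
[tool.ruff]
# Keep tutorial notebooks out of lint runs (path pattern is illustrative).
extend-exclude = ["tutorials/*.ipynb"]
```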
Introduce contributor guide and documentation
  • Added AGENTS.md with setup, testing, and PR instructions
AGENTS.md
Bolster test coverage with equivalence and performance tests
  • Added tests for scoring equivalence and speed in tests/test_score_speed.py
  • Added generate equivalence test in tests/test_generate_equiv.py
  • Expanded distance and concordance equivalence tests
tests/test_score_speed.py
tests/test_distance_equiv.py
tests/test_concordance_equiv.py


sourcery-ai bot left a comment


Hey @marioernestovaldes - I've reviewed your changes and they look great!

## Individual Comments

### Comment 1
<location> `dynamicviz/score.py:305` </location>
<code_context>
+                        dist = 0.0
+                    dist_dict[key1][key2].append(dist)
+
+    for key1 in dist_dict.keys():
+        for key2 in dist_dict[key1].keys():
+            dist_dict[key1][key2] = np.asarray(dist_dict[key1][key2], dtype=float)
+
+    return dist_dict
</code_context>

<issue_to_address>
Potential for inconsistent array lengths in distance lists.

Arrays may have varying lengths due to differing numbers of co-occurrences. Please document or handle cases where (i, j) pairs have few or no co-occurrences to prevent downstream issues.
</issue_to_address>
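One way to make the ragged structure explicit, per this comment, is to reduce each per-pair list to a mean and mark under-observed pairs; a sketch, not the project's code, and `min_count` is a hypothetical knob rather than a dynamicviz parameter:

```python
import numpy as np

def mean_pair_distances(dist_dict, min_count=1):
    """Reduce {i: {j: list-of-distances}} to per-pair means. Lists may have
    unequal lengths because each (i, j) pair co-occurs in a different number
    of bootstraps; pairs with fewer than min_count observations map to NaN
    so downstream code can detect them instead of averaging over nothing."""
    means = {}
    for i, inner in dist_dict.items():
        means[i] = {}
        for j, values in inner.items():
            arr = np.asarray(values, dtype=float)
            means[i][j] = arr.mean() if arr.size >= min_count else np.nan
    return means
```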

### Comment 2
<location> `dynamicviz/score.py:330` </location>
<code_context>
+        for key2 in dist_dict[key1].keys()
+    ]
+
+    if not arrays:
+        return np.nan
+
+    maxlen = max(len(a) for a in arrays)
</code_context>

<issue_to_address>
Returning np.nan for empty arrays may mask upstream issues.

Consider raising an exception or warning instead, unless np.nan is explicitly expected and handled.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    if not arrays:
        return np.nan
=======
    if not arrays:
        raise ValueError("No arrays found in dist_dict; cannot compute mean pairwise distance.")
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `dynamicviz/score.py:532` </location>
<code_context>
-    stretchs = 1-stretchs
-    
-    return(stretchs)
+    stretchs = (U - np.min(U)) / (np.max(U) - np.min(U))
+    stretchs = 1 - stretchs
+
+    return stretchs
</code_context>

<issue_to_address>
Division by zero if all U values are equal in stretch calculation.

Handle the case where np.max(U) == np.min(U) to avoid division by zero.
</issue_to_address>
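A guarded version of the min-max normalization flagged here could look like this sketch; returning all ones in the constant case (stretch treated as perfect) is an illustrative choice, not taken from the source:

```python
import numpy as np

def normalized_stretch(U):
    """1 minus min-max-normalized values, guarding the degenerate case
    max(U) == min(U) that would otherwise divide by zero."""
    U = np.asarray(U, dtype=float)
    span = U.max() - U.min()
    if span == 0.0:
        # All singular values equal: normalization is undefined, so pick a
        # fixed convention rather than emitting NaN/inf.
        return np.ones_like(U)
    return 1.0 - (U - U.min()) / span
```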

### Comment 4
<location> `dynamicviz/score.py:565` </location>
<code_context>
+    ), "Error: number of observations are not consistent"
+
     # set k to a globally relevant value if None
     if k is None:
-        k = round(X_orig.shape[0]/2-1)
+        k = round(X_orig.shape[0] / 2 - 1)
     if k < 5:
-        raise Exception('k needs to be >= 5 or number of observations in X is too small')
</code_context>

<issue_to_address>
Default k value may be too large for small datasets.

Consider capping k or choosing a default that better handles small datasets to avoid errors or meaningless results.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    # set k to a globally relevant value if None
    if k is None:
        k = round(X_orig.shape[0] / 2 - 1)
    if k < 5:
        raise Exception(
            "k needs to be >= 5 or number of observations in X is too small"
        )
=======
    # set k to a globally relevant value if None, with sensible capping for small datasets
    if k is None:
        max_k = min(20, X_orig.shape[0] - 1)
        k = max(5, min(round(X_orig.shape[0] / 2 - 1), max_k))
    if k < 5 or k > X_orig.shape[0] - 1:
        raise Exception(
            f"k needs to be >= 5 and <= number of observations in X - 1 (got k={k}, n={X_orig.shape[0]})"
        )
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `dynamicviz/score.py:644` </location>
<code_context>
-    
-    return(ensemble_metric, pointwise_metrics_list)
\ No newline at end of file
+    weighted_metrics_list = [
+        pointwise_metrics_list[i] * np.abs(pc1_score)[i]
+        for i in range(len(pointwise_metrics_list))
+    ]
+    ensemble_metric = np.sum(weighted_metrics_list, axis=0) / np.sum(np.abs(pc1_score))
+
+    return (ensemble_metric, pointwise_metrics_list)
</code_context>

<issue_to_address>
No normalization for negative or zero sum of weights in ensemble_concordance.

Division by zero will occur if all pc1_score values are zero, and negative values may cause ambiguous results. Please add a check or normalization to handle these cases.
</issue_to_address>
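The degenerate-weights case raised here can be handled with an explicit check before dividing; a sketch under assumed shapes (`pointwise_metrics` as (n_methods, n_points), `pc1_score` as (n_methods,)), not the actual ensemble_concordance body:

```python
import numpy as np

def pc1_weighted_ensemble(pointwise_metrics, pc1_score, eps=1e-12):
    """Weight per-method pointwise metrics by |pc1_score| and average.
    Raises instead of dividing by a ~0 sum of weights; taking absolute
    values also removes the sign ambiguity of negative loadings."""
    w = np.abs(np.asarray(pc1_score, dtype=float))
    total = w.sum()
    if total < eps:
        raise ValueError("All PC1 loadings are ~0; ensemble weights undefined.")
    M = np.asarray(pointwise_metrics, dtype=float)
    return (M * w[:, None]).sum(axis=0) / total
```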

### Comment 6
<location> `dynamicviz/viz.py:95` </location>
<code_context>
-                     })
-
-    if isinstance(save, str): # save as HTML
+    if df[label].dtype == np.float:  # continuous labels
+        fig = px.scatter(
+            df_px,
+            x="x1",
+            y="x2",
+            animation_frame="bootstrap_number",
+            animation_group="original_index",
+            color=label,
+            hover_name=label,
+            title=title,
+            range_x=[-dim, dim],
+            range_y=[-dim, dim],
+            range_color=(vmin, vmax),
+            opacity=alpha,
+            width=width,
+            height=height,
+            labels={"x1": xlabel, "x2": ylabel, label: legend_title},
+        )
+    elif isinstance(colors, list):  # if colors is specified for discrete labels
</code_context>

<issue_to_address>
Use of deprecated np.float dtype check.

Use np.issubdtype(df[label].dtype, np.floating) instead for compatibility with newer NumPy versions.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    if df[label].dtype == np.float:  # continuous labels
        fig = px.scatter(
            df_px,
            x="x1",
            y="x2",
            animation_frame="bootstrap_number",
            animation_group="original_index",
            color=label,
            hover_name=label,
            title=title,
            range_x=[-dim, dim],
            range_y=[-dim, dim],
            range_color=(vmin, vmax),
            opacity=alpha,
            width=width,
            height=height,
            labels={"x1": xlabel, "x2": ylabel, label: legend_title},
        )
=======
    if np.issubdtype(df[label].dtype, np.floating):  # continuous labels
        fig = px.scatter(
            df_px,
            x="x1",
            y="x2",
            animation_frame="bootstrap_number",
            animation_group="original_index",
            color=label,
            hover_name=label,
            title=title,
            range_x=[-dim, dim],
            range_y=[-dim, dim],
            range_color=(vmin, vmax),
            opacity=alpha,
            width=width,
            height=height,
            labels={"x1": xlabel, "x2": ylabel, label: legend_title},
        )
>>>>>>> REPLACE

</suggested_fix>

### Comment 7
<location> `dynamicviz/viz.py:419` </location>
<code_context>
-        
+
     # automatically set alpha based on heuristic
     if alpha is None:
-        alpha = 0.2/(df.shape[0]/1000)
+        alpha = 0.2 / (df.shape[0] / 1000)
+
+    fig = plt.figure(figsize=(width, height))
</code_context>

<issue_to_address>
Heuristic for alpha may produce values >1 for small datasets.

Consider capping alpha at 1.0 or using a more robust scaling method for small datasets (df.shape[0] < 200), as the current heuristic may yield unintended values above 1.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    # automatically set alpha based on heuristic
    if alpha is None:
        alpha = 0.2 / (df.shape[0] / 1000)

    fig = plt.figure(figsize=(width, height))
=======
    # automatically set alpha based on heuristic
    if alpha is None:
        if df.shape[0] < 200:
            alpha = 1.0
        else:
            alpha = 0.2 / (df.shape[0] / 1000)
        alpha = min(alpha, 1.0)

    fig = plt.figure(figsize=(width, height))
>>>>>>> REPLACE

</suggested_fix>

### Comment 8
<location> `dynamicviz/utils.py:31` </location>
<code_context>
+    """
     # get X from adata.obsm or adata.X
     obsm_keys = list(adata.obsm.keys())
     if obsm in obsm_keys:
         X = adata.obsm[obsm]
         if obsm == "X_pca":
-            X = X[:,:n_components]
+            X = X[:, :n_components]
     else:
         X = adata.X
</code_context>

<issue_to_address>
No check for None value of n_components in convert_anndata.

Add a condition to only slice X if n_components is not None to avoid returning an empty array.
</issue_to_address>
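The suggested guard amounts to slicing only when n_components is an explicit integer; a minimal sketch of that condition, separate from the real convert_anndata:

```python
import numpy as np

def take_leading_components(X, n_components=None):
    """Slice the PCA matrix only when n_components is given, so a None
    default can never silently truncate X."""
    if n_components is not None:
        return X[:, :n_components]
    return X
```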


### CONCORDANCE SCORES -- see our publication for mathematical details

def get_jaccard(X_orig, X_red, k, precomputed=[False, False]):

issue (code-quality): Replace mutable default arguments with None (default-mutable-arg)

    return np.asarray(jaccards, dtype=float)

def get_distortion(X_orig, X_red, k, precomputed=[False, False]):

issue (code-quality): Replace mutable default arguments with None (default-mutable-arg)
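The standard fix for both signatures is the None-sentinel idiom; a sketch mirroring the quoted signature, where the body is a stub rather than the real Jaccard computation:

```python
def get_jaccard(X_orig, X_red, k, precomputed=None):
    """Default-mutable-arg fix: use None as the sentinel and build the list
    per call, so callers can never share (and mutate) one module-level
    default list. Stub body; the real function returns per-point scores."""
    if precomputed is None:
        precomputed = [False, False]
    return precomputed
```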

Comment on lines 87 to +90
color_dict = {}
for li, lab in enumerate(np.unique(df_px[label])):
    color_dict[lab] = colors[li]


suggestion (code-quality): Replace the explicit loop building color_dict with a dictionary comprehension:

Suggested change

    color_dict = {}
    for li, lab in enumerate(np.unique(df_px[label])):
        color_dict[lab] = colors[li]

becomes

    color_dict = {
        lab: colors[li] for li, lab in enumerate(np.unique(df_px[label]))
    }


Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

return fig


def animated(
issue (code-quality): Low code quality found in animated


Explanation

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

return images


def stacked(

issue (code-quality): Low code quality found in stacked - 7% (low-code-quality)


Explanation

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

marioernestovaldes changed the title from "Dev" to "Merge dev with main branch" on Jun 18, 2025