Conversation
…core.variance-performance Add variance speed benchmark test
…-calculation-in-populate_distance_dict Improve distance dict build
…mean_variance_distance-function Improve distance variance calculation
…-handling-with-numpy Improve distance calculations
…for-requested-task [dynamicviz] Run formatting and linting
…-in-list-and-update-tests [dynamicviz] optimize bootstrap concatenation
…1-from-compute_mean_distance [dynamicviz] remove artificial delay
…and-pytest-checks [dynamicviz] Fix test imports
…_distance_dict-for-performance [dynamicviz] optimize distance computation
…e_distance_dict-for-performance [dynamicviz] optimize distance computation
…nce-score-calculation [dynamicviz] Fix concordance metrics and add tests
Reviewer's Guide

This PR merges the development branch into main by refactoring core scoring and bootstrap algorithms for vectorized performance, standardizing code style and docstrings across modules, modernizing the visualization APIs, optimizing DataFrame assembly in bootstrapping, unifying distance metric implementations, updating CI lint configurations, and significantly expanding test coverage and contributor documentation.

Class diagram for refactored scoring and concordance functions in score.py

```mermaid
classDiagram
    class ScoreUtils {
        +variance(df, method, k, X_orig, neighborhoods, normalize_pairwise_distance, include_original, return_times)
        +stability(df, method, alpha, k, X_orig, neighborhoods, normalize_pairwise_distance, include_original, return_times)
        +stability_from_variance(mean_variance_distances, alpha)
        +get_neighborhood_dict(method, k, keys, neighborhoods, X_orig)
        +populate_distance_dict(neighborhood_dict, embeddings, bootstrap_indices)
        +compute_mean_distance(dist_dict, normalize_pairwise_distance)
        +compute_mean_variance_distance(dist_dict, normalize_pairwise_distance, mean_pairwise_distance)
    }
    class ConcordanceUtils {
        +concordance(df, X_orig, method, k, bootstrap_number)
        +ensemble_concordance(df, X_orig, methods, k, bootstrap_number, verbose)
        +get_jaccard(X_orig, X_red, k, precomputed)
        +get_distortion(X_orig, X_red, k, precomputed)
        +get_mean_projection_error(X_orig, X_red)
        +get_stretch(X_orig, X_red)
    }
    ScoreUtils <.. ConcordanceUtils : uses
```
Class diagram for refactored bootstrapping and DR pipeline in boot.py

```mermaid
classDiagram
    class BootUtils {
        +generate(X, method, Y, B, sigma_noise, no_bootstrap, random_seed, save, num_jobs, use_n_pcs, subsample, return_times, **kwargs)
        +bootstrap(X, method, B, sigma_noise, no_bootstrap, random_seed, num_jobs, use_n_pcs, subsample, **kwargs)
        +run_one_bootstrap(X, method, sigma_noise, no_bootstrap, random_seeded_sequence, b, use_n_pcs, subsample, **kwargs)
        +dimensionality_reduction(X, method, **kwargs)
    }
```
Class diagram for modernized visualization API in viz.py

```mermaid
classDiagram
    class VizUtils {
        +interactive(df, label, show, save, colors, xlabel, ylabel, title, legend_title, vmin, vmax, alpha, width, height, dpi, dim)
        +animated(df, label, save, get_frames, colors, cmap, width, height, dpi, xlabel, ylabel, title, title_fontsize, major_fontsize, minor_fontsize, vmin, vmax, marker, alpha, solid_cbar, show_legend, solid_legend, legend_fontsize, dim, **kwargs)
        +stacked(df, label, show, save, colors, cmap, width, height, dpi, xlabel, ylabel, title, title_fontsize, major_fontsize, minor_fontsize, vmin, vmax, marker, alpha, solid_cbar, show_legend, solid_legend, legend_fontsize, dim, frame, return_fig, **kwargs)
        +fig2img(fig, dpi)
    }
```
Class diagram for AnnData conversion utilities in utils.py

```mermaid
classDiagram
    class AnnDataUtils {
        +convert_anndata(adata, obs, obsm, n_components)
        +regenerate_anndata(adata, df, bootstrap_number, obsm)
    }
```
File-Level Changes
Tips and commands

Interacting with Sourcery

Customizing Your Experience

Access your dashboard to:
Getting Help
Hey @marioernestovaldes - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `dynamicviz/score.py:305` </location>
<code_context>
+ dist = 0.0
+ dist_dict[key1][key2].append(dist)
+
+ for key1 in dist_dict.keys():
+ for key2 in dist_dict[key1].keys():
+ dist_dict[key1][key2] = np.asarray(dist_dict[key1][key2], dtype=float)
+
+ return dist_dict
</code_context>
<issue_to_address>
Potential for inconsistent array lengths in distance lists.
Arrays may have varying lengths due to differing numbers of co-occurrences. Please document or handle cases where (i, j) pairs have few or no co-occurrences to prevent downstream issues.
</issue_to_address>
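One way to make the varying-length behavior explicit is NaN-padding (an illustrative sketch, not the dynamicviz implementation: `pad_to_equal_length` is a hypothetical helper, and the only assumption taken from the diff is that `dist_dict` maps `key1 -> key2 -> list` of per-bootstrap distances):

```python
import numpy as np

def pad_to_equal_length(dist_dict):
    # Pairs that co-occur in fewer bootstraps produce shorter lists;
    # NaN-padding gives every (key1, key2) entry the same length so
    # downstream code can rely on np.nanmean / np.nanvar.
    maxlen = max(
        (len(v) for inner in dist_dict.values() for v in inner.values()),
        default=0,
    )
    for inner in dist_dict.values():
        for key2, vals in inner.items():
            arr = np.full(maxlen, np.nan)
            arr[: len(vals)] = vals
            inner[key2] = arr
    return dist_dict

d = pad_to_equal_length({0: {1: [0.5, 0.7], 2: [0.4]}})
print(d[0][2])              # [0.4 nan]
print(np.nanmean(d[0][2]))  # 0.4 -- padding is ignored by nan-aware stats
```

This keeps pairs with few co-occurrences in the result while making their reduced sample size visible to the caller.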
### Comment 2
<location> `dynamicviz/score.py:330` </location>
<code_context>
+ for key2 in dist_dict[key1].keys()
+ ]
+
+ if not arrays:
+ return np.nan
+
+ maxlen = max(len(a) for a in arrays)
</code_context>
<issue_to_address>
Returning np.nan for empty arrays may mask upstream issues.
Consider raising an exception or warning instead, unless np.nan is explicitly expected and handled.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
if not arrays:
return np.nan
=======
if not arrays:
raise ValueError("No arrays found in dist_dict; cannot compute mean pairwise distance.")
>>>>>>> REPLACE
</suggested_fix>
### Comment 3
<location> `dynamicviz/score.py:532` </location>
<code_context>
- stretchs = 1-stretchs
-
- return(stretchs)
+ stretchs = (U - np.min(U)) / (np.max(U) - np.min(U))
+ stretchs = 1 - stretchs
+
+ return stretchs
</code_context>
<issue_to_address>
Division by zero if all U values are equal in stretch calculation.
Handle the case where np.max(U) == np.min(U) to avoid division by zero.
</issue_to_address>
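A guarded version of the normalization could look like the sketch below (illustrative only; the all-equal fallback value of 1.0 is a convention chosen for this example, not dynamicviz's documented behavior):

```python
import numpy as np

def inverted_minmax(U):
    U = np.asarray(U, dtype=float)
    rng = np.max(U) - np.min(U)
    if rng == 0:
        # Degenerate case: every value is identical, so min-max scaling
        # is undefined. Collapsing all stretches to 1.0 is one convention.
        return np.ones_like(U)
    stretchs = (U - np.min(U)) / rng
    return 1 - stretchs

print(inverted_minmax([3.0, 3.0, 3.0]))  # all ones, no ZeroDivision/NaN
print(inverted_minmax([0.0, 2.0, 4.0]))  # [1.  0.5 0. ]
```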
### Comment 4
<location> `dynamicviz/score.py:565` </location>
<code_context>
+ ), "Error: number of observations are not consistent"
+
# set k to a globally relevant value if None
if k is None:
- k = round(X_orig.shape[0]/2-1)
+ k = round(X_orig.shape[0] / 2 - 1)
if k < 5:
- raise Exception('k needs to be >= 5 or number of observations in X is too small')
</code_context>
<issue_to_address>
Default k value may be too large for small datasets.
Consider capping k or choosing a default that better handles small datasets to avoid errors or meaningless results.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# set k to a globally relevant value if None
if k is None:
k = round(X_orig.shape[0] / 2 - 1)
if k < 5:
raise Exception(
"k needs to be >= 5 or number of observations in X is too small"
)
=======
# set k to a globally relevant value if None, with sensible capping for small datasets
if k is None:
max_k = min(20, X_orig.shape[0] - 1)
k = max(5, min(round(X_orig.shape[0] / 2 - 1), max_k))
if k < 5 or k > X_orig.shape[0] - 1:
raise Exception(
f"k needs to be >= 5 and <= number of observations in X - 1 (got k={k}, n={X_orig.shape[0]})"
)
>>>>>>> REPLACE
</suggested_fix>
### Comment 5
<location> `dynamicviz/score.py:644` </location>
<code_context>
-
- return(ensemble_metric, pointwise_metrics_list)
\ No newline at end of file
+ weighted_metrics_list = [
+ pointwise_metrics_list[i] * np.abs(pc1_score)[i]
+ for i in range(len(pointwise_metrics_list))
+ ]
+ ensemble_metric = np.sum(weighted_metrics_list, axis=0) / np.sum(np.abs(pc1_score))
+
+ return (ensemble_metric, pointwise_metrics_list)
</code_context>
<issue_to_address>
No normalization for negative or zero sum of weights in ensemble_concordance.
Division by zero will occur if all pc1_score values are zero, and negative values may cause ambiguous results. Please add a check or normalization to handle these cases.
</issue_to_address>
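One defensive pattern is shown below (a hypothetical helper mirroring the reviewed logic; the name `weighted_ensemble` and the unweighted-mean fallback are assumptions for illustration, not the package's API):

```python
import numpy as np

def weighted_ensemble(pointwise_metrics_list, pc1_score, eps=1e-12):
    # Weight each method's pointwise metrics by the magnitude of its
    # PC1 loading; taking np.abs also resolves the sign ambiguity.
    weights = np.abs(np.asarray(pc1_score, dtype=float))
    total = weights.sum()
    if total < eps:
        # All loadings (near) zero: dividing would be 0/0, so fall
        # back to an unweighted mean across methods.
        return np.mean(pointwise_metrics_list, axis=0)
    weighted = [m * w for m, w in zip(pointwise_metrics_list, weights)]
    return np.sum(weighted, axis=0) / total

metrics = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
print(weighted_ensemble(metrics, [0.0, 0.0]))  # [2. 3.]  (fallback mean)
print(weighted_ensemble(metrics, [1.0, 3.0]))  # [2.5 3.5]
```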
### Comment 6
<location> `dynamicviz/viz.py:95` </location>
<code_context>
- })
-
- if isinstance(save, str): # save as HTML
+ if df[label].dtype == np.float: # continuous labels
+ fig = px.scatter(
+ df_px,
+ x="x1",
+ y="x2",
+ animation_frame="bootstrap_number",
+ animation_group="original_index",
+ color=label,
+ hover_name=label,
+ title=title,
+ range_x=[-dim, dim],
+ range_y=[-dim, dim],
+ range_color=(vmin, vmax),
+ opacity=alpha,
+ width=width,
+ height=height,
+ labels={"x1": xlabel, "x2": ylabel, label: legend_title},
+ )
+ elif isinstance(colors, list): # if colors is specified for discrete labels
</code_context>
<issue_to_address>
Use of deprecated np.float dtype check.
Use np.issubdtype(df[label].dtype, np.floating) instead for compatibility with newer NumPy versions.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
if df[label].dtype == np.float: # continuous labels
fig = px.scatter(
df_px,
x="x1",
y="x2",
animation_frame="bootstrap_number",
animation_group="original_index",
color=label,
hover_name=label,
title=title,
range_x=[-dim, dim],
range_y=[-dim, dim],
range_color=(vmin, vmax),
opacity=alpha,
width=width,
height=height,
labels={"x1": xlabel, "x2": ylabel, label: legend_title},
)
=======
if np.issubdtype(df[label].dtype, np.floating): # continuous labels
fig = px.scatter(
df_px,
x="x1",
y="x2",
animation_frame="bootstrap_number",
animation_group="original_index",
color=label,
hover_name=label,
title=title,
range_x=[-dim, dim],
range_y=[-dim, dim],
range_color=(vmin, vmax),
opacity=alpha,
width=width,
height=height,
labels={"x1": xlabel, "x2": ylabel, label: legend_title},
)
>>>>>>> REPLACE
</suggested_fix>
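For context (an editorial sketch, not part of the review itself): `np.float` was removed in NumPy 1.24, so the old comparison raises AttributeError on current NumPy, while `np.issubdtype` remains the portable way to test for floating dtypes, including those of pandas columns:

```python
import numpy as np

# np.issubdtype accepts any dtype object, so it works directly on
# df[label].dtype for a pandas DataFrame as well.
print(np.issubdtype(np.dtype(np.float64), np.floating))  # True
print(np.issubdtype(np.dtype(np.int64), np.floating))    # False
print(np.issubdtype(np.dtype(object), np.floating))      # False
```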
### Comment 7
<location> `dynamicviz/viz.py:419` </location>
<code_context>
-
+
# automatically set alpha based on heuristic
if alpha is None:
- alpha = 0.2/(df.shape[0]/1000)
+ alpha = 0.2 / (df.shape[0] / 1000)
+
+ fig = plt.figure(figsize=(width, height))
</code_context>
<issue_to_address>
Heuristic for alpha may produce values >1 for small datasets.
Consider capping alpha at 1.0 or using a more robust scaling method for small datasets (df.shape[0] < 200), as the current heuristic may yield unintended values above 1.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# automatically set alpha based on heuristic
if alpha is None:
alpha = 0.2 / (df.shape[0] / 1000)
fig = plt.figure(figsize=(width, height))
=======
# automatically set alpha based on heuristic
if alpha is None:
if df.shape[0] < 200:
alpha = 1.0
else:
alpha = 0.2 / (df.shape[0] / 1000)
alpha = min(alpha, 1.0)
fig = plt.figure(figsize=(width, height))
>>>>>>> REPLACE
</suggested_fix>
### Comment 8
<location> `dynamicviz/utils.py:31` </location>
<code_context>
+ """
# get X from adata.obsm or adata.X
obsm_keys = list(adata.obsm.keys())
if obsm in obsm_keys:
X = adata.obsm[obsm]
if obsm == "X_pca":
- X = X[:,:n_components]
+ X = X[:, :n_components]
else:
X = adata.X
</code_context>
<issue_to_address>
No check for None value of n_components in convert_anndata.
Add a condition to only slice X when n_components is not None; `X[:, :None]` silently selects all columns, so an explicit guard makes the intended behavior clear.
</issue_to_address>
<code_context>
### CONCORDANCE SCORES -- see our publication for mathematical details


def get_jaccard(X_orig, X_red, k, precomputed=[False, False]):
</code_context>

issue (code-quality): Replace mutable default arguments with None (default-mutable-arg)
<code_context>
return np.asarray(jaccards, dtype=float)


def get_distortion(X_orig, X_red, k, precomputed=[False, False]):
</code_context>

issue (code-quality): Replace mutable default arguments with None (default-mutable-arg)
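The standard remedy for both functions is the None-default idiom (a sketch: only the `precomputed=[False, False]` signature comes from the diff; `get_jaccard_fixed` and the trailing mutation are illustrative):

```python
def get_jaccard_fixed(X_orig, X_red, k, precomputed=None):
    # A default like precomputed=[False, False] is evaluated once at
    # definition time and shared by every call; any in-place mutation
    # then leaks into later calls. Defaulting to None avoids that.
    if precomputed is None:
        precomputed = [False, False]
    precomputed.append("touched")  # mutate to demonstrate isolation
    return precomputed

print(get_jaccard_fixed(None, None, 5))  # [False, False, 'touched']
print(get_jaccard_fixed(None, None, 5))  # same again -- no shared state
```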
<code_context>
color_dict = {}
for li, lab in enumerate(np.unique(df_px[label])):
    color_dict[lab] = colors[li]
</code_context>

suggestion (code-quality): We've found these issues:

- Convert for loop into dictionary comprehension (dict-comprehension)
- Low code quality found in interactive - 19% (low-code-quality)

<suggested_fix>
<<<<<<< SEARCH
color_dict = {}
for li, lab in enumerate(np.unique(df_px[label])):
    color_dict[lab] = colors[li]
=======
color_dict = {
    lab: colors[li] for li, lab in enumerate(np.unique(df_px[label]))
}
>>>>>>> REPLACE
</suggested_fix>

Explanation: The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this? It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do - ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
<code_context>
return fig


def animated(
</code_context>

issue (code-quality): We've found these issues:

- Use f-string instead of string concatenation [×2] (use-fstring-for-concatenation)
- Extract code out into function (extract-method)
- Low code quality found in animated - 3% (low-code-quality)
<code_context>
return images


def stacked(
</code_context>

issue (code-quality): Low code quality found in stacked - 7% (low-code-quality)
…rkflow-before-merge [dynamicviz] Add workflow for pytest
Merge dev with main branch
Summary by Sourcery
Merge development into main branch, applying code style standardizations, performance optimizations in scoring routines, expanded test coverage, and updated documentation and project configuration.
Enhancements:
Build:
Documentation:
Tests:
Chores: