Skip to content

Conversation

sanjaynagi
Copy link
Collaborator

@sanjaynagi sanjaynagi commented Sep 5, 2025

This PR adds advanced haplotype clustering, allowing users to cut the tree and assign clusters, and to plot haplotype data (amino acid mutations by default) below the tree.

TODO

  • Add tests, notebook
  • Currently the code loads the SNP genotypes to get SNP effects, this is unnecessary and ideally we can load the SNP effects straight from the haplotypes

@sanjaynagi
Copy link
Collaborator Author

We now use the snp_effects() function... didnt know that existed, but makes sense

@sanjaynagi
Copy link
Collaborator Author

I also add grey lines to the amino acid plot of the diplotype clustering functions - it looks a lot neater.

And the haplotype function can now support multiple transcripts.

@ahernank
Copy link
Collaborator

This is super cool, @sanjaynagi. Thanks very much. I am struggling with bandwidth this week but have asked @leehart and @jonbrenas to take a look so we can merge asap. :)

@sanjaynagi
Copy link
Collaborator Author

Awesome, thanks @ahernank

@leehart
Copy link
Collaborator

leehart commented Sep 18, 2025

Just adding a corresponding issue for this #819

@leehart
Copy link
Collaborator

leehart commented Sep 19, 2025

Many thanks @sanjaynagi - This is great!

Pylint is telling me there's a potential glitch around the following code in the transcript_haplotypes function in the AnophelesHapClustAnalysis class:

        # Get SNP genotype allele counts for the transcript, applying snp_query
        df_eff = (
            self.snp_effects(
                transcript=transcript,
            )
            .query(snp_query)
            .reset_index(drop=True)
        )

Raises problem:

Instance of 'AnophelesHapClustAnalysis' has no 'snp_effects' member

I gather the issue is that AnophelesHapClustAnalysis inherits from AnophelesHapData and AnophelesSnpData, which both inherit from (AnophelesSampleMetadata, AnophelesGenomeFeaturesData, AnophelesGenomeSequenceData) but none of those have snp_effects, which is in AnophelesSnpFrequencyAnalysis instead.

class AnophelesSnpFrequencyAnalysis(AnophelesSnpData, AnophelesFrequencyAnalysis):

Obviously all the tests pass, and the examples in the notebooks appear to work, so I don't quite understand why something like an AttributeError is not already being raised somewhere.

We should probably check to see whether this warning from Pylint is right, anyhow. We might need to modify the inheritance chain, e.g. adding AnophelesSnpFrequencyAnalysis to AnophelesHapClustAnalysis, or move snp_effects into AnophelesSnpData, or maybe there's a different way to call snp_effects, etc.


# Get SNP genotype allele counts for the transcript, applying snp_query
df_eff = (
self.snp_effects(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pylint reports:

Instance of 'AnophelesHapClustAnalysis' has no 'snp_effects' member

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. also not clear why the code was working - but i have added the Frequency class to hapclust class now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Lee

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sanjaynagi . The snp_effects function is in AnophelesSnpFrequencyAnalysis rather than AnophelesFrequencyAnalysis, and AnophelesFrequencyAnalysis only inherits from AnophelesBase, so probably won't make snp_effects available to AnophelesHapClustAnalysis, if I understand correctly.

Also, since AnophelesSnpFrequencyAnalysis inherits from AnophelesSnpData, I suspect we might need to put it before AnophelesSnpData in the AnophelesHapClustAnalysis base classes list. That made we wonder whether we need AnophelesSnpData there at all, but when I tried removing it locally, I got a method resolution error.

Unfortunately, I'm starting to suspect that there might be a larger architectural issue here, around inheritance, namely https://en.wikipedia.org/wiki/Multiple_inheritance#The_diamond_problem

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad. I will replace the import of AnophelesFrequencyAnalysis and see what happens

@sanjaynagi
Copy link
Collaborator Author

Also, here is a notebook with a demo of the function https://colab.research.google.com/drive/1HqDJL0cz9gAKaeXdwYjcGXx-X1GYJznL?usp=sharing

@leehart
Copy link
Collaborator

leehart commented Sep 30, 2025

Looks like we're still waiting to try using AnophelesSnpFrequencyAnalysis (which has snp_effects) instead of AnophelesFrequencyAnalysis for the AnophelesHapClustAnalysis class, to satisfy Pylint.

I believe AnophelesSnpFrequencyAnalysis will need to be placed before AnophelesSnpData in the list, because AnophelesSnpFrequencyAnalysis inherits from AnophelesSnpData, e.g.

class AnophelesHapClustAnalysis(AnophelesSnpFrequencyAnalysis, AnophelesHapData, AnophelesSnpData):

We might be able to address the potential diamond inheritance problem in a separate issue.

Otherwise, I reckon this looks good to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants