Conversation
|
TODO:
|
alexanian
left a comment
There was a problem hiding this comment.
I know we're still drafting and i apparently have lots of thoughts but support merging in a version of this that handles basically all the list-importing logic, then leaving the commec business logic changes for another PR
|
|
||
| # Seperate versioning for the output JSON. | ||
| JSON_COMMEC_FORMAT_VERSION = "0.3" | ||
| JSON_COMMEC_FORMAT_VERSION = "0.4" |
There was a problem hiding this comment.
Something we should prioritise in the next few weeks is better support for previous versions of the JSON output, since right now we just crash out about it 😱
There was a problem hiding this comment.
Just trying to understand the rationale for versioning of JSON output - is it important for downstream parsing using commec flag? In other words, is commec flag not backwards compatible with older JSON outputs?
commec/config/result.py
Outdated
| time_taken: str = "" | ||
| date_run: str = "" | ||
| search_tool_info: SearchToolInfo = field(default_factory=SearchToolInfo) | ||
| regulation_list_info : list[RegulationList] = field(default_factory=list) |
There was a problem hiding this comment.
I think the repo that we use as a source for this is called pathogen-lists; I think probably regulation-lists is clearer, maybe we should rename the repo though?
There was a problem hiding this comment.
I have convinced myself over the course of this conversation that we should call these "Control Lists" both here and in that repo, since we want to be able to include things that are not just national regulations, a company could have a cute lil personal control list
| # --pretty? | ||
| # --markdown? csv tsv etc |
There was a problem hiding this comment.
I think it's fine that we just provide a CSV for now
commec/control_list/cli.py
Outdated
|
|
||
| return parser_obj | ||
|
|
||
| def regulation_list_information(): |
There was a problem hiding this comment.
some name to indicate that this is a string / summary?
commec/control_list/containers.py
Outdated
| GENBANK = "GenbankProtein" | ||
| UNIPROT = "UniprotID" |
There was a problem hiding this comment.
Let's remove these since we don't actually support them yet, and I don't want anyone (including me in 3 months) to be misled
| * Region : Region, in which case its acronym is returned in the set. | ||
| * Custom Region : str, i.e. EU, returns the set() of all containing alpha-2 codes | ||
| * Arbitrary String : str, uses pycountry to fuzzy search the region, returns the alpha-2 code. |
There was a problem hiding this comment.
I am in favour of our efforts to let users type in whatever their hearts desire
| def get_regulation(accession : str, accession_fmt : data.AccessionFormat) -> list[tuple[RegulationList, TaxidRegulation]]: | ||
| """ | ||
| Check the given Accession against all imported regulated lists. | ||
| The input Accession can be a TaxID, GenBank protein, or Uniprot ID. |
There was a problem hiding this comment.
going to continue to beat my drum of "but we only know about taxids right now" but otherwise I am supportive
There was a problem hiding this comment.
Is there a strong use case for including support for sequence accessions from multiple databases in the control list functionality?
There was a problem hiding this comment.
Yeah, @manu-script would love your take on this; at present, we're only supporting lists of taxa, but we think at some point we should expand this to load up annotations for the biorisk... but as @MichaelBarnett and I are talking about it right now, we're not sure that makes sense.
I continue to think we should rephrase this as "taxid" rather than "accession" but I am willing to be outvoted
There was a problem hiding this comment.
I'm also in favor of limiting the scope of commec list to taxids for this PR to keep it simple.
IMO, there are usually multiple GenBank/UniProt accessions for a given controlled toxin, but only the representative accession is annotated in the control lists for practical reasons (based on my limited understanding of how the control lists were annotated with accessions). It would not be ideal to return "accession not found in the control list" if a user queries a different but valid accession for a controlled toxin.
commec/tools/blast_tools.py
Outdated
| logger.debug("Checking %s unique taxids", len(unique_taxids)) | ||
| # Build a mapping {taxid: truthiness} | ||
| taxid_to_regulated = { | ||
| taxid: bool(get_regulation(int(taxid), AccessionFormat.TAXID)) |
There was a problem hiding this comment.
so this is where we'd potentially use the list mode to change the logic? or maybe this should just be a different thing that's like exists_in_control_lists (but doesn't make a claim about regulation yet)
| TAXID_SYNTHETIC_CONSTRUCTS = 32630 | ||
| TAXID_VECTORS = 29278 |
There was a problem hiding this comment.
should we move these to constants while we're here?
| dest="regions", | ||
| nargs="+", | ||
| default=[], | ||
| help="A list of countries or regions to add context to list compliance i.e. NZ US CH", |
There was a problem hiding this comment.
could include more about valid region_group values for this (e.g. Australia Group) somewhere...
2e756b5 to
aba1d0e
Compare
d7e7251 to
47d6857
Compare
…xid retrieval recursive should work.
…egulated taxid. Ready for implementing cli.
…bug fixes regarding that process - change the dtype of taxid on import, should always be dealing with ints.
…tureWarning with the child taxa LUT file panda dataframe query by assigning types to all columns, columns with NaNs now affecting the same warning for the regulation taxids dataframe.
…hard coded column names from multiple places, integrated --regions as a way to pass regional context for list import. Working as intended.
…rting to search, included 3 limit in the fuzzy search allowance, which in particular allows United States as a search term.
…rlap, include tests in overlap checks. General code tidying.
…xid info printing. Tidied outputs.
…all is lower case
…annotations list.
…ch term for country search.
…genbank, uniprot, and taxid accesions. Is easily extendable for other accesion types in the future.
…s, updates the system for better use with mutliple accession formats.
…ns extended, minor bug fixes and logging support during load, additional tests for duplicate list entrants.
…t int inputs and assumes taxid.
…e list information, fixed pycountry dependancy not added to environment.yaml
…json version, and addition of list info to the functional json. Updated the regulation tests to use a valid ListMode instead of int.
5fbe4c4 to
865f757
Compare
…gic, as well as to generate the output string for the control list, updates ControlLists to have a single region, as multiple regions are handled by custom regions, like EU, AG. Which also simplifies several region grabbing code snippets. Better handles error generation on ControlList parsing of region and use data.
…cronym conversions for a control list.
alexanian
left a comment
There was a problem hiding this comment.
A few tiny comments, working on a full review but wanted to submit these instead of leaving them until Monday
| #species: str = "" | ||
| #genus : str = "" | ||
| #superkingdom: str = "" |
There was a problem hiding this comment.
| #species: str = "" | |
| #genus : str = "" | |
| #superkingdom: str = "" |
best to delete? but we should add this info to our lists in a future version?
| parser_obj.add_argument( | ||
| "-l", | ||
| "--list", | ||
| dest="showlists", | ||
| action="store_true", | ||
| help="Print a summary of all imported Control Lists", | ||
| ) |
There was a problem hiding this comment.
I think this is just what the default should be if no flags are provided (vs. the current)
(/mnt/data/conda-envs/commec-dev) [ec2-user@ip-172-31-93-20 cm-dbs]$ commec list -d ~/cm-dbs/
The Common Mechanism : List
────────┐
ERROR │ commec list requires --lists/-l or --accessions/-a as input.
|
|
||
| # Seperate versioning for the output JSON. | ||
| JSON_COMMEC_FORMAT_VERSION = "0.3" | ||
| JSON_COMMEC_FORMAT_VERSION = "0.4" |
There was a problem hiding this comment.
Just trying to understand the rationale for versioning of JSON output - is it important for downstream parsing using commec flag? In other words, is commec flag not backwards compatible with older JSON outputs?
| def get_regulation(accession : str, accession_fmt : data.AccessionFormat) -> list[tuple[RegulationList, TaxidRegulation]]: | ||
| """ | ||
| Check the given Accession against all imported regulated lists. | ||
| The input Accession can be a TaxID, GenBank protein, or Uniprot ID. |
There was a problem hiding this comment.
Is there a strong use case for including support for sequence accessions from multiple databases in the control list functionality?
commec/screen-default-config.yaml
Outdated
| regulated_lists: | ||
| path: "{default}regulated_lists" |
There was a problem hiding this comment.
Should we use control_lists instead of regulated_lists? Are they the same or different by definition?
| regulated_lists: | |
| path: "{default}regulated_lists" | |
| control_lists: | |
| path: "{default}control_lists" |
There was a problem hiding this comment.
Ah yes, there's a rename in progress for all instances of regulated lists, to be converted to control lists. due to some file name expectations, this is one of the last things changed.
| def add_args(parser_obj: argparse.ArgumentParser) -> argparse.ArgumentParser: | ||
| """ | ||
| Add Control List module arguments to an ArgumentParser object. | ||
| """ |
There was a problem hiding this comment.
We could also add -y, --config CONFIG_YAML option to read the commec-dbs base path if we are going to distribute the control_lists folder in the core database from the next release.
There was a problem hiding this comment.
Yes I like this idea - originally I wanted to pass the config yaml, but then figured it was overkill and just point to the folder (what if someone was comparing outputs from two different control lists or versions, for example, and pointed it to different directories) But it should really support interpolating from a config file too for some users ease of use.
…ith main. Test have been manually checked for consistency.
…ol_lists in the default yaml inputs.
…erly unregulated and uncontrolled runs where nothing is controlled and everything passes, minor updates and tidying to the config file, and addition of fake control lists folder for functional tests to pass.
…s, to allow users to pass -y to commec list, in lieu of -d (databases dir).
…luded Fungi as a valid control list category, and updated logic for setting preferred name to treat empty strings as None as intended.
… gracefully the exits needed for when control lists are not found, and controlling for whether skip tx or not is used. Control list is deferred in the case of skip-tx to minimise verbosity of error handling during control list import failure.
…gestion safety, and Toxin label for Category. Minor additional debug comments.
… taxid text parsing. Fixed a domain vs domains grabbing bug from the annotations json output, removed the check for the existance of the no longer used biorisk controlled taxids csv path.
Background
This update gives Commec a new suite of functionality - the ability to parse Regulation Annotations with a regional context. This allows the Taxonomy search to FLAG based on TaxID, GenBank, or Uniprot accession on a per Regulation List basis. Lists can be excluded based on
Regional Contextand their input completely ignored, or used democratically to WARN, or FLAG hits from Taxonomy Searches.Furthermore, this functionality can be explored and interrogated using the add
commec listcli commands.Changes
Regulationmodule added, a standalone module that handles list ingestion, and API for interrogating processed lists.commec listcommand line interface allows quick interrogation of any accession when directed to the annotated regulation list database location. As well as providing regional context inputs, and .csv summary outputs.Example:
New features
Breaking changes