Dynamic Annotations #88

Open
MichaelBarnett wants to merge 70 commits into develop from dynamic-annotations

@MichaelBarnett
Contributor

Background

This update gives Commec a new capability: parsing regulation annotations with a regional context. The taxonomy search can now FLAG hits by TaxID, GenBank accession, or UniProt accession on a per-regulation-list basis. Depending on the regional context, a list can be excluded so that its input is ignored completely, or it can be used democratically with the other lists to WARN or FLAG hits from taxonomy searches.

Furthermore, this functionality can be explored and interrogated using the new commec list CLI command.

Changes

  • Adds the Commec Regulation module, a standalone module that handles list ingestion and provides an API for interrogating processed lists.
  • Adds the commec list command-line interface, which allows quick interrogation of any accession when pointed at the annotated regulation list database location. It also accepts regional context inputs and can write a .csv summary.

Example:

(commec-dev) root@Michael-Laptop:~/repo/working/common-mechanism# commec list -d ../../commec-databases/commec-dbs/regulated_lists/ -o ../list.csv -l -a 11084
 The Common Mechanism : List
────────┐
ERROR   │ 51 imported regulated annotations were bad entries with no TaxID, Genbank, or Uniprot Accession:
        │                                                         name  category list_acronym
        │                                Paracoccidioides brasiliensis Eukaryota        COSHH
        │                     Sporadic Creutzfeldt-Jakob disease agent  Proteins        COSHH
        │                                Sporadic fatal insomnia agent  Proteins        COSHH
        │                Variably protease-resistant prionopathy agent  Proteins        COSHH

 ... manually truncated for brevity ...

        │                                           Neosaxitoxin (NEO)    Toxins       UKSECL
        │ Novel coronaviridae (eg bat coronaviruses WIVI or SHC011)...   Viruses        COSHH
        │  Run in --verbose mode for raw row input details.
WARNING │ The following imported regulated annotations were duplicates with differing metadata:
        │ accession                                           name  category list_acronym
        │     54388                         Salmonella paratyphi A  Bacteria        ATCSA
        │      5667                        Leishmania brasiliensis Eukaryota        COSHH
        │      5666                              Naegleria fowleri Eukaryota        COSHH
        │    565995                              Reston ebolavirus   Viruses        COSHH
        │     11084 Central European tick-borne encephalitis virus   Viruses        COSHH
        │     11084                                Hanzalova virus   Viruses        COSHH
        │     11084                                     Hypr virus   Viruses        COSHH
        │     11084         Siberian tick-borne encephalitis virus   Viruses        COSHH
        │     11276                   Vesicular stomatitis Alagoas   Viruses        COSHH
        │     11276                   Vesicular stomatitis Indiana   Viruses        COSHH
        │     11276                Vesicular stomatitis New Jersey   Viruses        COSHH
INFO    │  *----------* REGULATION LISTS *----------*
INFO    │ The following Regulation Lists have been identified:
        │ [SAPO] Specified Animal Pathogen Order - United Kingdom
        │ (https://www.hse.gov.uk/pubns/books/hsg280.htm)
        │ Regulated Taxid Entries: 34, Status : Compliance
        │ [ATCSA] Schedule 5 of the Anti-terrorism, Crime and Security Act 2001 - United Kingdom
        │ (https://www.legislation.gov.uk/ukpga/2001/24/2002-05-31/data.pdf)
        │ Regulated Taxid Entries: 66, Status : Compliance
        │ [COSHH] Control of Substances Hazardous to Health Regulations 2002 - United Kingdom
        │ (https://www.hse.gov.uk/pubns/misc208.htm)
        │ Regulated Taxid Entries: 172, Status : Compliance
        │ [UKSECL] UK Strategic Export Control List - United Kingdom
        │ (https://assets.publishing.service.gov.uk/media/660d281067958c001f365abe/uk-strategic-export-control-list.pdf)
        │ Regulated Taxid Entries: 155, Status : Compliance
        │ [CLBA] Combined list of biological agents update August 2024 - The Netherlands
        │ (https://www.bureaubiosecurity.nl/documenten/combined-list-of-biological-agents-update-august-2024)
        │ Regulated Taxid Entries: 0, Status : Compliance
        │ [DUIECLPRC] Dual-Use Items Export Control List of the People’s Republic of China - People's Republic of China
        │ (https://www.china-briefing.com/news/china-issues-new-export-control-regulations/)
        │ Regulated Taxid Entries: 0, Status : Compliance
        │ [SCOMET] India Export Controls (SCOMET List) - India
        │ (https://www.mea.gov.in/Portal/Images/SCOMET-List-2021.pdf)
        │ Regulated Taxid Entries: 0, Status : Compliance
        │     [Total number of Taxid Relationships:17058]
INFO    │  *----------* REGULATED TAXIDS *----------*
INFO    │ Regulation Annotations for supplied taxids (#1):
INFO    │    > Taxid 11084: Regulated by the following lists:
        │    > Viruses Tick-borne encephalitis virus (Russian Spring-Summer encephalitis virus) regulated by Schedule 5
        │ of the Anti-terrorism, Crime and Security Act 2001 [ATCSA]
        │    > Viruses Absettarov virus regulated by Control of Substances Hazardous to Health Regulations 2002 [COSHH]
INFO    │ Writing output list summary to "../list.csv.csv" ...
────────┘
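
The ERROR and WARNING blocks in the log above report two ingestion checks: rows with no TaxID, GenBank, or UniProt accession, and accessions that appear more than once with differing metadata. The following is a minimal sketch of such checks, not the module's actual implementation. The column names and the choice to group rows by accession and list acronym are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical column names; the real list files may use different headers.
ACCESSION_FIELDS = ("taxid", "genbank", "uniprot")


def audit_rows(rows):
    """Split rows into those lacking any accession and duplicated accessions.

    Returns (bad_rows, duplicates), where duplicates maps
    (accession, list_acronym) to the rows sharing that key under different names.
    """
    bad_rows = []
    grouped = defaultdict(list)
    for row in rows:
        # Take the first non-empty accession field, if there is one.
        accession = next((row[f] for f in ACCESSION_FIELDS if row.get(f)), None)
        if accession is None:
            bad_rows.append(row)
        else:
            grouped[(str(accession), row["list_acronym"])].append(row)
    # Assumption: "differing metadata" means the same key with more than one name.
    duplicates = {
        key: group
        for key, group in grouped.items()
        if len({r["name"] for r in group}) > 1
    }
    return bad_rows, duplicates


# Rows abbreviated from the log output above.
rows = [
    {"taxid": "", "name": "Sporadic fatal insomnia agent", "list_acronym": "COSHH"},
    {"taxid": "11084", "name": "Hanzalova virus", "list_acronym": "COSHH"},
    {"taxid": "11084", "name": "Hypr virus", "list_acronym": "COSHH"},
    {"taxid": "5666", "name": "Naegleria fowleri", "list_acronym": "COSHH"},
]
bad_rows, duplicates = audit_rows(rows)
```

Here the prion entry lands in bad_rows, and the two rows for taxid 11084 are reported as a duplicate group, mirroring the shape of the log.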

New features

  • Adds regulation annotations to logging during the taxonomic screening steps.
  • Lets users interrogate lists directly through the commec CLI, either by reading its output or by generating a .csv containing the condensed list information.

Breaking changes

  • Adds yaml configuration inputs for regional context and for the regulation annotation database location. The defaults will work if the user has the latest updates from commec-databases.
  • Updates the JSON output version, adding regulation list information at the commec info and taxonomic hit levels.

@MichaelBarnett
Contributor Author

TODO:

  • Merge with Extra Taxonomic Info branch and synchronize new testing environment.
  • Add Taxa list acronym info to each taxonomy hit output.

Member

@alexanian left a comment

I know we're still drafting, and I apparently have lots of thoughts, but I support merging in a version of this that handles basically all the list-importing logic, then leaving the commec business logic changes for another PR.


# Separate versioning for the output JSON.
- JSON_COMMEC_FORMAT_VERSION = "0.3"
+ JSON_COMMEC_FORMAT_VERSION = "0.4"
Member

Something we should prioritise in the next few weeks is better support for previous versions of the JSON output, since right now we just crash out about it 😱

Contributor

Just trying to understand the rationale for versioning of JSON output - is it important for downstream parsing using commec flag? In other words, is commec flag not backwards compatible with older JSON outputs?
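
One way the version string could be used to avoid crashing on older files is to upgrade known older versions in memory when reading. This is a rough sketch only: the key names `commec_info` and `json_output_version`, the set of upgradable versions, and the function name are hypothetical, and the only grounded details are the "0.3" to "0.4" bump and the new `regulation_list_info` field from the diff.

```python
import json

# Current format version, taken from the diff above.
JSON_COMMEC_FORMAT_VERSION = "0.4"
# Hypothetical: older versions this reader is willing to upgrade in memory.
UPGRADABLE_VERSIONS = {"0.3"}


def load_screen_json(text: str) -> dict:
    """Parse screen output, upgrading known older versions instead of failing."""
    data = json.loads(text)
    # Hypothetical layout: the version lives under commec_info.json_output_version.
    info = data.setdefault("commec_info", {})
    version = info.get("json_output_version")
    if version == JSON_COMMEC_FORMAT_VERSION:
        return data
    if version in UPGRADABLE_VERSIONS:
        # 0.3 files predate the list information, so supply an empty default.
        info.setdefault("regulation_list_info", [])
        info["json_output_version"] = JSON_COMMEC_FORMAT_VERSION
        return data
    raise ValueError(f"Unsupported JSON output version: {version!r}")


old_file = json.dumps({"commec_info": {"json_output_version": "0.3"}})
upgraded = load_screen_json(old_file)
```

Unknown versions still raise, but with a message that names the version rather than failing somewhere deeper in the parser.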

time_taken: str = ""
date_run: str = ""
search_tool_info: SearchToolInfo = field(default_factory=SearchToolInfo)
regulation_list_info : list[RegulationList] = field(default_factory=list)
Member

I think the repo that we use as a source for this is called pathogen-lists; I think probably regulation-lists is clearer, maybe we should rename the repo though?

Member

I have convinced myself over the course of this conversation that we should call these "Control Lists" both here and in that repo, since we want to be able to include things that are not just national regulations, a company could have a cute lil personal control list

Comment on lines +66 to +67
# --pretty?
# --markdown? csv tsv etc
Member

I think it's fine that we just provide a CSV for now


return parser_obj

def regulation_list_information():
Member

some name to indicate that this is a string / summary?

Comment on lines 34 to 35
GENBANK = "GenbankProtein"
UNIPROT = "UniprotID"
Member

Let's remove these since we don't actually support them yet, and I don't want anyone (including me in 3 months) to be misled

Comment on lines +80 to +82
* Region : Region, in which case its acronym is returned in the set.
* Custom Region : str, i.e. EU, returns the set() of all containing alpha-2 codes
* Arbitrary String : str, uses pycountry to fuzzy search the region, returns the alpha-2 code.
Member

I am in favour of our efforts to let users type in whatever their hearts desire
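
The three input cases in the docstring (a custom region, an alpha-2 code, and an arbitrary string resolved by fuzzy search) could be sketched roughly as below. To stay self-contained, this sketch swaps pycountry for a tiny hard-coded name table and uses `difflib` from the standard library for the fuzzy step. The table contents, the truncated EU membership, and the function name are illustrative assumptions, not the PR's code.

```python
import difflib

# Hypothetical custom grouping, deliberately truncated for illustration.
CUSTOM_REGIONS = {"EU": {"DE", "FR", "NL"}}
# Tiny stand-in for the country database that pycountry provides.
COUNTRY_NAMES = {
    "New Zealand": "NZ",
    "United States": "US",
    "Switzerland": "CH",
    "United Kingdom": "GB",
    "Netherlands": "NL",
}


def resolve_region(text: str) -> set[str]:
    """Resolve user input to a set of alpha-2 codes."""
    key = text.strip()
    # Case 1: a custom region expands to all of its member codes.
    if key.upper() in CUSTOM_REGIONS:
        return set(CUSTOM_REGIONS[key.upper()])
    # Case 2: already a known alpha-2 code.
    if key.upper() in COUNTRY_NAMES.values():
        return {key.upper()}
    # Case 3: fuzzy match against country names (difflib stands in for pycountry).
    lowered = {name.lower(): code for name, code in COUNTRY_NAMES.items()}
    matches = difflib.get_close_matches(key.lower(), list(lowered), n=1, cutoff=0.6)
    if not matches:
        raise ValueError(f"Could not resolve region: {text!r}")
    return {lowered[matches[0]]}
```

With this table, `resolve_region("nz")`, `resolve_region("united states")`, and a misspelling such as `resolve_region("Switzerlnd")` all resolve to a single code, while unrecognisable input raises instead of guessing.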

def get_regulation(accession : str, accession_fmt : data.AccessionFormat) -> list[tuple[RegulationList, TaxidRegulation]]:
"""
Check the given Accession against all imported regulated lists.
The input Accession can be a TaxID, GenBank protein, or Uniprot ID.
Member

going to continue to beat my drum of "but we only know about taxids right now" but otherwise I am supportive

Contributor

Is there a strong use case for including support for sequence accessions from multiple databases in the control list functionality?

Member

Yeah, @manu-script would love your take on this; at present, we're only supporting lists of taxa, but we think at some point we should expand this to load up annotations for the biorisk... but as @MichaelBarnett and I are talking about it right now, we're not sure that makes sense.

I continue to think we should rephrase this as "taxid" rather than "accession" but I am willing to be outvoted

Contributor

I'm also in favor of limiting the scope of commec list to taxids for this PR to keep it simple.

IMO, there are usually multiple GenBank/UniProt accessions for a given controlled toxin, but only the representative accession is annotated in the control lists for practical reasons (based on my limited understanding of how the control lists were annotated with accessions). It would not be ideal to return "accession not found in the control list" if a user queries a different but valid accession for a controlled toxin.

logger.debug("Checking %s unique taxids", len(unique_taxids))
# Build a mapping {taxid: truthiness}
taxid_to_regulated = {
taxid: bool(get_regulation(int(taxid), AccessionFormat.TAXID))
Member

so this is where we'd potentially use the list mode to change the logic? or maybe this should just be a different thing that's like exists_in_control_lists (but doesn't make a claim about regulation yet)
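
A membership-only helper of the kind suggested here might look like the sketch below. It reports whether a taxid appears in any imported list without claiming anything about regulation. The in-memory table and `lists_containing` are hypothetical stand-ins for the real lookup; the two listed taxids and their acronyms come from the example log in the PR description.

```python
# Hypothetical in-memory stand-in for the imported control lists:
# taxid -> acronyms of the lists that mention it.
CONTROL_LIST_TAXIDS = {11084: ["ATCSA", "COSHH"], 54388: ["ATCSA"]}


def lists_containing(taxid: int) -> list[str]:
    """Stand-in for get_regulation(): which lists mention this taxid?"""
    return CONTROL_LIST_TAXIDS.get(taxid, [])


def exists_in_control_lists(taxids) -> dict[int, bool]:
    """Map each unique taxid to whether any control list mentions it."""
    # Normalise to int first so "11084" and 11084 count as the same taxid.
    unique_taxids = {int(t) for t in taxids}
    return {taxid: bool(lists_containing(taxid)) for taxid in unique_taxids}


result = exists_in_control_lists(["11084", 11084, "9606"])
```

Keeping this separate from any list-mode logic means the decision to WARN or FLAG can be layered on top later.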

Comment on lines 20 to 21
TAXID_SYNTHETIC_CONSTRUCTS = 32630
TAXID_VECTORS = 29278
Member

should we move these to constants while we're here?

dest="regions",
nargs="+",
default=[],
help="A list of countries or regions to add context to list compliance i.e. NZ US CH",
Member

could include more about valid region_group values for this (e.g. Australia Group) somewhere...
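
As a sketch of how the argument behaves and how the help text could mention region groups: the snippet below reuses the `dest`, `nargs`, and `default` values from the diff, while the expanded help wording is hypothetical. The `--regions` flag name and the EU and AG custom regions are taken from the commit messages in this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="commec list")
    parser.add_argument(
        "--regions",
        dest="regions",
        nargs="+",  # one or more values after the flag
        default=[],  # no regional context when the flag is omitted
        # Hypothetical help text that also names the custom region groups.
        help=(
            "Countries or regions that give context to list compliance, "
            "e.g. NZ US CH. Custom region groups such as EU or AG are also accepted."
        ),
    )
    return parser


args = build_parser().parse_args(["--regions", "NZ", "United States", "EU"])
print(args.regions)
```

Because `nargs="+"` splits on shell words, a multi-word name such as United States has to be quoted on the command line to arrive as a single value.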

Michael Barnett added 26 commits November 17, 2025 22:18
…egulated taxid. Ready for implementing cli.
…bug fixes regarding that process - change the dtype of taxid on import, should always be dealing with ints.
…tureWarning with the child taxa LUT file panda dataframe query by assigning types to all columns, columns with NaNs now affecting the same warning for the regulation taxids dataframe.
…hard coded column names from multiple places, integrated --regions as a way to pass regional context for list import. Working as intended.
…rting to search, included 3 limit in the fuzzy search allowance, which in particular allows United States as a search term.
…rlap, include tests in overlap checks. General code tidying.
…genbank, uniprot, and taxid accessions. Is easily extendable for other accession types in the future.
…s, updates the system for better use with multiple accession formats.
…ns extended, minor bug fixes and logging support during load, additional tests for duplicate list entrants.
…e list information, fixed pycountry dependency not added to environment.yaml
…json version, and addition of list info to the functional json. Updated the regulation tests to use a valid ListMode instead of int.
@MichaelBarnett MichaelBarnett changed the base branch from main to develop November 17, 2025 09:53
Michael Barnett added 3 commits November 18, 2025 10:18
…gic, as well as to generate the output string for the control list, updates ControlLists to have a single region, as multiple regions are handled by custom regions, like EU, AG. Which also simplifies several region grabbing code snippets. Better handles error generation on ControlList parsing of region and use data.
@MichaelBarnett MichaelBarnett marked this pull request as ready for review November 18, 2025 22:30
Member

@alexanian left a comment

A few tiny comments, working on a full review but wanted to submit these instead of leaving them until Monday

Comment on lines +224 to +226
#species: str = ""
#genus : str = ""
#superkingdom: str = ""
Member

Suggested change
#species: str = ""
#genus : str = ""
#superkingdom: str = ""

best to delete? but we should add this info to our lists in a future version?

Comment on lines +36 to +42
parser_obj.add_argument(
"-l",
"--list",
dest="showlists",
action="store_true",
help="Print a summary of all imported Control Lists",
)
Member

I think this is just what the default should be if no flags are provided (vs. the current)

(/mnt/data/conda-envs/commec-dev) [ec2-user@ip-172-31-93-20 cm-dbs]$ commec list -d ~/cm-dbs/
 The Common Mechanism : List
────────┐
ERROR   │ commec list requires --lists/-l or --accessions/-a as input.
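
The default suggested in this comment could be sketched as follows: when neither flag is supplied, fall back to printing the list summary instead of exiting with an error. The `-l`/`--list` definition is copied from the diff; the `-a`/`--accessions` definition is an assumption based on the error message above, and `parse_list_args` is a hypothetical helper name.

```python
import argparse


def parse_list_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="commec list")
    parser.add_argument(
        "-l",
        "--list",
        dest="showlists",
        action="store_true",
        help="Print a summary of all imported Control Lists",
    )
    # Assumed shape of the accession argument; only the flag names are from the log.
    parser.add_argument("-a", "--accessions", dest="accessions", nargs="+", default=[])
    args = parser.parse_args(argv)
    # With no flags at all, behave as if --list had been passed.
    if not args.showlists and not args.accessions:
        args.showlists = True
    return args


no_flags = parse_list_args([])
only_accessions = parse_list_args(["-a", "11084"])
```

Passing only `-a` leaves the summary off, so existing accession queries keep their current output.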


Comment on lines 26 to 27
regulated_lists:
path: "{default}regulated_lists"
Contributor

Should we use control_lists instead of regulated_lists? Are they the same or different by definition?

Suggested change
- regulated_lists:
-   path: "{default}regulated_lists"
+ control_lists:
+   path: "{default}control_lists"

Contributor Author

Ah yes, there's a rename in progress to convert all instances of regulated lists to control lists. Due to some file name expectations, this is one of the last things to be changed.

def add_args(parser_obj: argparse.ArgumentParser) -> argparse.ArgumentParser:
"""
Add Control List module arguments to an ArgumentParser object.
"""
Contributor

We could also add -y, --config CONFIG_YAML option to read the commec-dbs base path if we are going to distribute the control_lists folder in the core database from the next release.

Contributor Author

Yes, I like this idea. Originally I wanted to pass the config yaml, but then figured it was overkill and just pointed to the folder (what if someone was comparing outputs from two different control lists or versions, for example, and pointed it to different directories?). But it should really support interpolating from a config file too, for some users' ease of use.
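
A sketch of how the two inputs might coexist: an explicit directory from `-d` takes precedence, and otherwise the path is interpolated from an already-loaded config. The key layout (`base_paths.default` and `databases.regulated_lists.path`), the precedence rule, and the function name are assumptions for illustration. Only the `{default}regulated_lists` template comes from the diff above.

```python
from pathlib import Path
from typing import Optional


def control_list_dir(db_dir: Optional[str], config: Optional[dict]) -> Path:
    """Pick the control list directory from -d if given, else from a config."""
    if db_dir:
        # An explicit directory wins, so two list versions can be compared.
        return Path(db_dir)
    if config:
        # Hypothetical key layout for the loaded yaml.
        base = config["base_paths"]["default"]
        template = config["databases"]["regulated_lists"]["path"]
        # The template has no separator after {default}, so make sure the base ends with one.
        if not base.endswith("/"):
            base += "/"
        return Path(template.replace("{default}", base))
    raise ValueError("Provide a databases directory (-d) or a config file (-y).")


config = {
    "base_paths": {"default": "commec-dbs"},
    "databases": {"regulated_lists": {"path": "{default}regulated_lists"}},
}
from_config = control_list_dir(None, config)
from_flag = control_list_dir("../other-dbs/regulated_lists", config)
```

Reading the yaml itself is left out here; the function only needs the parsed mapping.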

Michael Barnett added 4 commits January 10, 2026 12:38
…ith main. Tests have been manually checked for consistency.
…erly unregulated and uncontrolled runs where nothing is controlled and everything passes, minor updates and tidying to the config file, and addition of fake control lists folder for functional tests to pass.
@alexanian alexanian mentioned this pull request Jan 22, 2026
Michael Barnett added 8 commits January 26, 2026 16:28
…s, to allow users to pass -y to commec list, in lieu of -d (databases dir).
…luded Fungi as a valid control list category, and updated logic for setting preferred name to treat empty strings as None as intended.
… gracefully the exits needed for when control lists are not found, and controlling for whether skip tx or not is used. Control list is deferred in the case of skip-tx to minimise verbosity of error handling during control list import failure.
…gestion safety, and Toxin label for Category. Minor additional debug comments.
… taxid text parsing. Fixed a domain vs domains grabbing bug from the annotations json output, removed the check for the existence of the no longer used biorisk controlled taxids csv path.