diff --git a/README.md b/README.md index 95ee7e9..55be25e 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,15 @@ # clinvar-gk-python -Project for reading and normalizing ClinVar variants into GA4GH GKS forms +Project for reading and normalizing ClinVar variants into GA4GH GKS forms. + +## Setup + +### Prerequisites + +1. **Docker** (or podman) - Required to run the variation-normalization services +2. **Python 3.11+** - Required for the main application +3. **SeqRepo database** - Local sequence repository +4. **UTA database** - Local Universal Transcript Archive (only needed for liftover) ## Installation @@ -16,46 +25,142 @@ cd clinvar-gk-python pip install -e '.[dev]' ``` -## Configuration +### Database Services Setup -A SeqRepo and UTA instance should be downloaded and/or running locally. +This project requires several database services that can be easily set up using the Docker compose configuration from the variation-normalization project. -The preferred method is to download a SeqRepo DB dir to your local filesystem, and to run the UTA docker image. +1. Download the compose.yaml file from variation-normalization v0.15.0 (matching the version in pyproject.toml): -See: https://github.com/biocommons/anyvar?tab=readme-ov-file#required-dependencies +```bash +curl -o variation-normalizer-compose.yaml https://raw.githubusercontent.com/cancervariants/variation-normalization/0.15.0/compose.yaml +``` + +2. Start the required services: -The docker compose file in `vrs-python` or `anyvar` can be used / trimmed down to run a UTA container from a snapshot and bind it to an available local port. If you have postgresql running natively on your system on port 5432 you need to modify the compose file in order to ensure the left hand side of the UTA port field is some other port like 5433. And then make sure you use that port number in the `UTA_DB_URL` variable below. +```bash +docker compose -f variation-normalizer-compose.yaml up -d +``` +(*or `uvx podman-compose` for podman*) -e.g. In the file below change the left hand side of `.services.uta.ports[0]` to `127.0.0.1:5433`. +This will start: +- **UTA database** (port 5432): Universal Transcript Archive for transcript mapping +- **Gene Normalizer database** (port 8000): Gene normalization service +- **Variation Normalizer API** (port 8001): Variation normalization service -https://github.com/ga4gh/vrs-python/blob/main/docker-compose.yml +**Note on Port Conflicts**: If you already have services running on these ports, you can modify the port mappings in `variation-normalizer-compose.yaml`: +- For UTA database: Change `5432:5432` to `5433:5432` (or another available port) +- For Gene Normalizer: Change `8000:8000` to `8002:8000` (or another available port) +- For Variation Normalizer API: Change `8001:80` to `8003:80` (or another available port) -Then run with podman (or docker) compose: +Verify containers are running on the desired ports, e.g. the UTA postgres is running on host port 5433 and the gene normalizer db is on port 8000: ``` -uvx podman-compose -f ./docker-compose.yml up +docker ps -a | grep 'uta\|gene-norm' ``` -Verify the UTA postgres is running on host port 5433: +### Environment Configuration + +Set up the required environment variables. 
You can use the provided `env.sh` as a reference: + +```bash +# SeqRepo configuration - Update path to your local SeqRepo installation +export SEQREPO_ROOT_DIR=/usr/local/share/seqrepo/2024-12-20 +export SEQREPO_DATAPROXY_URL=seqrepo+file://${SEQREPO_ROOT_DIR} + +# Database URLs (using the Docker compose services) +export UTA_DB_URL=postgresql://anonymous:anonymous@localhost:5432/uta/uta_20241220 +export GENE_NORM_DB_URL=http://localhost:8000 ``` -podman ps -a | grep uta + +**Important**: If you modified the ports in the compose file, update the corresponding environment variables accordingly (e.g., change `5432` to `5433` in `UTA_DB_URL` if you changed the UTA port). + +### Python Installation + +Install the project and its dependencies: + +```bash +pip install -e '.[dev]' ``` -## Using +## Running -Point the tool at a SeqRepo database directory at a `seqrepo-rest-service` HTTP URL. +### Basic Usage + +The `clinvar_gk_pilot` main entrypoint can automatically handle downloading `gs://` URLs. It places the file in a directory called `buckets`, with the bucket name and the same path prefix. e.g. `gs://clinvar-gks/2025-07-06/dev/vi.json.gz` gets automatically downloaded to `buckets/clinvar-gks/2025-07-06/dev/vi.json.gz`. The input file is expected to be compressed with GZIP and in JSONL/NDJSON format with each line being a JSON object. + +The output is written to the same path as the local input file, but under an `output` directory in the current working directory. e.g. for the input filename `gs://clinvar-gks/2025-07-06/dev/vi.json.gz`, the file will be auto-cached to `buckets/clinvar-gks/2025-07-06/dev/vi.json.gz` and the output will be written to `output/buckets/clinvar-gks/2025-07-06/dev/vi.json.gz` -And to a postgresql server containing the UTA database. -The `clinvar_gk_pilot` main entrypoint can automatically handle downloading `gs://` URLs. It places the file in a directory called `buckets`, with the bucket name and the same path prefix. e.g. `gs://clinvar-gks/2025-07-06/dev/vi.json.gz` gets automatically downloaded to `buckets/clinvar-gks/2025-07-06/dev/vi.json.gz`. +Process a ClinVar variant-identity file: -```sh -export UTA_DB_URL=postgresql://anonymous@localhost:5433/uta/uta_20241220 -export SEQREPO_DATAPROXY_URL='seqrepo+file:///Users/kferrite/dev/data/seqrepo/2024-12-20' -python clinvar_gk_pilot/main.py --filename gs://clinvar-gks/2025-07-06/dev/vi.json.gz --parallelism 4 2>&1 | tee 2025-07-06.log +```bash +python clinvar_gk_pilot/main.py --filename gs://clinvar-gks/2025-07-06/dev/vi.json.gz --parallelism 4 ``` -Parallelism is configurable and uses python multiprocessing and multiprocessing queues. Some parallelism is significantly beneficial but since there is interprocess communication overhead and they are hitting the same database there can be diminishing returns. On a Macbook Pro with 16 cores, setting parallelism to 4-6 provides clear benefit, but exceeding 10 saturates the machine and may be counterproductive. The code will partition the input file into `` number of files and each worker will process one, and then the outputs will be combined. +### Command Line Options -If parallelism is enabled, each worker also monitors its child process and terminates excessively long tasks. 
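+
+A typical invocation combines these flags; as a rough synopsis (the bracketed parts are placeholders, not literal syntax):
+
+```bash
+python clinvar_gk_pilot/main.py --filename <local file or gs:// URL> [--parallelism N] [--liftover]
+```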
+
+- `--filename`: Input file path (supports local files and gs:// URLs)
+- `--parallelism`: Number of worker processes for parallel processing (default: 1)
+- `--liftover`: Enable liftover functionality for genomic coordinate conversion
-The output is written to the same path as the local input file, but under an `output` directory in the current working directory. e.g. for the input filename `gs://clinvar-gks/2025-07-06/dev/vi.json.gz`, the file will be auto-cached to `buckets/clinvar-gks/2025-07-06/dev/vi.json.gz` and the output will be written to `output/buckets/clinvar-gks/2025-07-06/dev/vi.json.gz`
+
+### Example Commands
+
+Process a local file:
+```bash
+clinvar-gk-pilot --filename sample-input.ndjson.gz --parallelism 4
+```
+
+Process a file from Google Cloud Storage:
+```bash
+clinvar-gk-pilot --filename gs://clinvar-gks/2025-07-06/dev/vi.json.gz --parallelism 4
+```
+
+### Parallelism
+
+Parallelism is configurable and uses Python multiprocessing and multiprocessing queues. Some parallelism is significantly beneficial, but because there is interprocess communication overhead and all workers hit the same filesystem, there are diminishing returns. On a MacBook Pro with 16 cores, setting parallelism to 4-6 provides a clear benefit, but exceeding 10 saturates the machine and may be counterproductive. The code partitions the input file into `--parallelism` files, each worker processes one partition, and the outputs are then combined.
+
+If parallelism is enabled, each worker also monitors its child process, terminates excessively long tasks, and adds an error annotation to the output record for that variant indicating that it exceeded the time limit.
+
+### Important Notes on Liftover
+
+When using the `--liftover` option, the application sends queries to the UTA PostgreSQL database for genomic coordinate conversion. Due to Docker's default shared memory constraints, high parallelism combined with liftover can cause out-of-memory errors.
+
+**Recommendations:**
+- Keep `--parallelism` on the lower side (2-4) when using `--liftover` with UTA running in Docker
+- Alternatively, increase the `shm_size` for the UTA container in `variation-normalizer-compose.yaml`:
+
+```yaml
+services:
+  uta:
+    # ... other configuration
+    shm_size: 256m  # Add this line to increase shared memory to 256MB
+```
+
+## Development
+
+### Testing
+
+Run the test suite:
+```bash
+pytest
+```
+
+Run specific tests:
+```bash
+pytest test/test_cli.py::test_parse_args
+```
+
+### Code Quality
+
+Check and fix code quality issues:
+```bash
+# Check code quality
+./lint.sh
+
+# Apply automatic fixes
+./lint.sh apply
+```
+
+The lint script runs:
+- black, isort, ruff, pylint
diff --git a/clinvar_gk_pilot/cli.py b/clinvar_gk_pilot/cli.py
index ee9e6b1..7f818c1 100644
--- a/clinvar_gk_pilot/cli.py
+++ b/clinvar_gk_pilot/cli.py
@@ -18,4 +18,9 @@ def parse_args(args: List[str]) -> dict:
             "Set to 0 to run in main thread."
), ) + parser.add_argument( + "--liftover", + action="store_true", + help="Enable attempting to liftover non-GRCh38 genomic variants to GRCh38", + ) return vars(parser.parse_args(args)) diff --git a/clinvar_gk_pilot/main.py b/clinvar_gk_pilot/main.py index b8e479c..f68e1cb 100644 --- a/clinvar_gk_pilot/main.py +++ b/clinvar_gk_pilot/main.py @@ -1,16 +1,22 @@ +import asyncio import contextlib import gzip +import importlib.util import json +import logging import multiprocessing import os import pathlib import queue import sys +import tempfile from functools import partial from typing import List +import requests from ga4gh.vrs.dataproxy import create_dataproxy from ga4gh.vrs.extras.translator import AlleleTranslator, CnvTranslator +from ga4gh.vrs.models import CopyChange from clinvar_gk_pilot.cli import parse_args from clinvar_gk_pilot.gcs import ( @@ -42,30 +48,35 @@ } -def process_line(line: str) -> str: +def process_line(line: str, opts: dict = None) -> str: """ Takes a line of JSON, processes it, and returns the result as a JSON string. """ clinvar_json = json.loads(line) result = None - if clinvar_json.get("issue") is None: + # if clinvar_json.get("issue") is None: + if True: cls = clinvar_json["vrs_class"] if cls == "Allele": - result = allele(clinvar_json) + result = allele(clinvar_json, opts or {}) elif cls == "CopyNumberChange": - result = copy_number_change(clinvar_json) + result = copy_number_change(clinvar_json, opts or {}) elif cls == "CopyNumberCount": - result = copy_number_count(clinvar_json) + result = copy_number_count(clinvar_json, opts or {}) content = {"in": clinvar_json, "out": result} return json.dumps(content) def _task_worker( - task_queue: multiprocessing.Queue, return_queue: multiprocessing.Queue + task_queue: multiprocessing.Queue, return_queue: multiprocessing.Queue, init_fn=None ): """ Worker function that processes tasks from a queue. """ + # Run any per-process initialization + if init_fn: + init_fn() + while True: task = task_queue.get() if task is None: @@ -73,11 +84,48 @@ def _task_worker( return_queue.put(task()) -def worker(file_name_gz: str, output_file_name: str) -> None: +# Define init function to set up QueryHandler and event loop in this process +def init_query_handler(): + from variation.query import QueryHandler + + global query_handler, event_loop + query_handler = QueryHandler() + + # Create a persistent event loop for this worker process + event_loop = asyncio.new_event_loop() + asyncio.set_event_loop(event_loop) + + +def run_async_with_persistent_loop(coro): + """ + Run an async coroutine using the persistent event loop for this worker process. + This replaces asyncio.run() calls to avoid creating/destroying event loops repeatedly. + """ + global event_loop + return event_loop.run_until_complete(coro) + + +def worker(file_name_gz: str, output_file_name: str, opts: dict = None) -> None: """ Takes an input file (a GZIP file of newline delimited), runs `process_line` on each line, and writes the output to a new GZIP file called `output_file_name`. 
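+
+    Each input line should be a JSON object carrying at least a `vrs_class`
+    field ("Allele", "CopyNumberChange", or "CopyNumberCount") plus the fields
+    read by the matching handler, e.g. (illustrative record, not real ClinVar
+    data):
+
+        {"vrs_class": "Allele", "fmt": "spdi", "source": "NC_000001.11:11796320:G:A"}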
""" + + # Set up file-specific logger + log_file_name = f"{file_name_gz}.log" + file_logger = logging.getLogger(f"worker_{os.path.basename(file_name_gz)}") + file_logger.setLevel(logging.INFO) + + # Create file handler with the same format as log_conf.json + file_handler = logging.FileHandler(log_file_name) + formatter = logging.Formatter( + fmt="%(asctime)s - %(module)s - %(levelname)s - %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", + ) + file_handler.setFormatter(formatter) + file_logger.addHandler(file_handler) + file_logger.propagate = False # Prevent duplicate logs + with ( gzip.open(file_name_gz, "rt", encoding="utf-8") as input_file, gzip.open(output_file_name, "wt", encoding="utf-8") as output_file, @@ -89,15 +137,19 @@ def worker(file_name_gz: str, output_file_name: str) -> None: def make_background_process(): p = multiprocessing.Process( target=_task_worker, - args=(task_queue, return_queue), + args=(task_queue, return_queue, init_query_handler), ) return p + print("Making background process _task_worker") background_process = make_background_process() background_process.start() + line_number = -1 for line in input_file: - task_queue.put(partial(process_line, line)) + line_number += 1 + file_logger.info(f"Processing line (index: {line_number}): {line}") + task_queue.put(partial(process_line, line, opts)) try: ret = return_queue.get(timeout=task_timeout) except queue.Empty: @@ -121,18 +173,24 @@ def make_background_process(): task_queue.put(None) background_process.join() + # Clean up logger handler + file_handler.close() + file_logger.removeHandler(file_handler) + -def process_as_json_single_thread(input_file_name: str, output_file_name: str) -> None: +def process_as_json_single_thread( + input_file_name: str, output_file_name: str, opts: dict = None +) -> None: with gzip.open(input_file_name, "rt", encoding="utf-8") as f_in: with gzip.open(output_file_name, "wt", encoding="utf-8") as f_out: for line in f_in: - f_out.write(process_line(line)) + f_out.write(process_line(line, opts)) f_out.write("\n") print(f"Output written to {output_file_name}") def process_as_json( - input_file_name: str, output_file_name: str, parallelism: int + input_file_name: str, output_file_name: str, parallelism: int, opts: dict = None ) -> None: """ Process `input_file_name` in parallel and write the results to `output_file_name`. 
@@ -141,17 +199,34 @@ def process_as_json( part_input_file_names = partition_file_lines_gz(input_file_name, parallelism) part_output_file_names = [f"{ofn}.out" for ofn in part_input_file_names] + print(f"Partitioned filenames: {part_output_file_names}") workers = [] + worker_info = [] # Start a worker per file name - for part_ifn, part_ofn in zip(part_input_file_names, part_output_file_names): - w = multiprocessing.Process(target=worker, args=(part_ifn, part_ofn)) + for i, (part_ifn, part_ofn) in enumerate( + zip(part_input_file_names, part_output_file_names) + ): + w = multiprocessing.Process(target=worker, args=(part_ifn, part_ofn, opts)) w.start() workers.append(w) + worker_info.append((i, w, part_ifn)) + + print(f"Started {len(workers)} workers", flush=True) # Wait for all workers to finish - for w in workers: - w.join() + remaining_workers = worker_info.copy() + while remaining_workers: + for worker_idx, w, part_ifn in remaining_workers.copy(): + w.join(timeout=5) + if not w.is_alive(): + remaining_workers.remove((worker_idx, w, part_ifn)) + + if remaining_workers: + still_running = [ + f"Worker {idx} ({part_ifn})" for idx, w, part_ifn in remaining_workers + ] + print(f"Still running: {', '.join(still_running)}", flush=True) with gzip.open(output_file_name, "wt", encoding="utf-8") as f_out: for part_ofn in part_output_file_names: @@ -168,44 +243,123 @@ def process_as_json( print(f"Output written to {output_file_name}") -def allele(clinvar_json: dict) -> dict: +def allele(clinvar_json: dict, opts: dict) -> dict: try: - tlr = allele_translators[clinvar_json.get("assembly_version", "38")] - vrs = tlr.translate_from(var=clinvar_json["source"], fmt=clinvar_json["fmt"]) - return vrs.model_dump(exclude_none=True) + assembly_version = clinvar_json.get("assembly_version", "38") + source = clinvar_json["source"] + fmt = clinvar_json["fmt"] + + if fmt == "spdi" or not opts.get("liftover", False): + if fmt == "spdi" and assembly_version != "38": + raise ValueError( + f"Unexpected assembly '{assembly_version}' for SPDI expression {source}" + ) + vrs_variant = query_handler.vrs_python_tlr.translate_from(source, fmt=fmt) + if vrs_variant.location.sequence: + vrs_variant.location.sequence = None + return vrs_variant.model_dump(exclude_none=True) + elif fmt == "hgvs": + if opts.get("liftover", False): + # do /normalize. This also automatically tries to liftover to GRCh38 + result = run_async_with_persistent_loop( + query_handler.normalize_handler.normalize( + q=source, + ) + ) + if result.variation: + vrs_variant = result.variation + if vrs_variant.location.sequence: + vrs_variant.location.sequence = None + return vrs_variant.model_dump(exclude_none=True) + else: + return {"errors": json.dumps(result.warnings)} + else: + raise ValueError(f"Unsupported format: {fmt}") except Exception as e: - logger.error(f"Exception raised in 'allele' processing: {clinvar_json}: {e}") - return {"errors": str(e)} + error_msg = f"Unexpected error: {repr(e)}" + logger.error(f"Exception in allele: {clinvar_json}: {error_msg}") + return {"errors": error_msg} -def copy_number_count(clinvar_json: dict) -> dict: +def copy_number_change(clinvar_json: dict, opts: dict) -> dict: + """ + Create a VRS CopyNumberChange variation using the variation-normalization module. 
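+
+    Expects `clinvar_json` to provide `source` (an HGVS expression) and
+    `variation_type`; "Deletion"/"copy number loss" map to `CopyChange.LOSS`,
+    and "Duplication"/"copy number gain" map to `CopyChange.GAIN`.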
+ + Returns: + Dictionary containing VRS representation or error information + """ try: - tlr = cnv_translators[clinvar_json.get("assembly_version", "38")] - kwargs = {"copies": clinvar_json["absolute_copies"]} - vrs = tlr.translate_from( - var=clinvar_json["source"], fmt=clinvar_json["fmt"], **kwargs + # Extract required parameters from clinvar_json + hgvs_expr = clinvar_json["source"] + + # Get baseline_copies by offsetting by one from absolute_copies + if clinvar_json["variation_type"] in ["Deletion", "copy number loss"]: + copy_change = CopyChange.LOSS + elif clinvar_json["variation_type"] in ["Duplication", "copy number gain"]: + copy_change = CopyChange.GAIN + else: + return {"errors": f"Unknown variation_type: {clinvar_json}"} + + result = run_async_with_persistent_loop( + query_handler.to_copy_number_handler.hgvs_to_copy_number_change( + hgvs_expr=hgvs_expr, + copy_change=copy_change, + do_liftover=opts.get("liftover", False), + ) ) - return vrs.model_dump(exclude_none=True) + if result.copy_number_change: + vrs_variant = result.copy_number_change + if vrs_variant.location.sequence: + vrs_variant.location.sequence = None + return vrs_variant.model_dump(exclude_none=True) + else: + return {"errors": json.dumps(result.warnings)} + except Exception as e: - logger.error( - f"Exception raised in 'copy_number_count' processing: {clinvar_json}: {e}" - ) - return {"errors": str(e)} + error_msg = f"Unexpected error: {repr(e)}" + logger.error(f"Exception in copy_number_change: {clinvar_json}: {error_msg}") + return {"errors": error_msg} -def copy_number_change(clinvar_json: dict) -> dict: +def copy_number_count(clinvar_json: dict, opts: dict) -> dict: + """ + Create a VRS CopyNumberCount variation using the variation-normalization service. + + Returns: + Dictionary containing VRS representation or error information + """ try: - tlr = cnv_translators[clinvar_json.get("assembly_version", "38")] - kwargs = {"copy_change": clinvar_json["copy_change_type"]} - vrs = tlr.translate_from( - var=clinvar_json["source"], fmt=clinvar_json["fmt"], **kwargs + # Extract required parameters from clinvar_json + hgvs_expr = clinvar_json["source"] + absolute_copies = int(clinvar_json["absolute_copies"]) + + # Get baseline_copies by offsetting by one from absolute_copies + if clinvar_json["variation_type"] in ["Deletion", "copy number loss"]: + baseline_copies = absolute_copies + 1 + elif clinvar_json["variation_type"] in ["Duplication", "copy number gain"]: + baseline_copies = absolute_copies - 1 + else: + return {"errors": f"Unknown variation_type: {clinvar_json}"} + + result = run_async_with_persistent_loop( + query_handler.to_copy_number_handler.hgvs_to_copy_number_count( + hgvs_expr=hgvs_expr, + baseline_copies=baseline_copies, + do_liftover=opts.get("liftover", False), + ) ) - return vrs.model_dump(exclude_none=True) + if result.copy_number_count: + vrs_variant = result.copy_number_count + if vrs_variant.location.sequence: + vrs_variant.location.sequence = None + return vrs_variant.model_dump(exclude_none=True) + else: + return {"errors": json.dumps(result.warnings)} + except Exception as e: - logger.error( - f"Exception raised in 'copy_number_change' processing: {clinvar_json}: {e}" - ) - return {"errors": str(e)} + error_msg = f"Unexpected error: {repr(e)}" + logger.error(f"Exception in copy_number_count: {clinvar_json}: {error_msg}") + return {"errors": error_msg} def partition_file_lines_gz(local_file_path_gz: str, partitions: int) -> List[str]: @@ -215,7 +369,7 @@ def 
partition_file_lines_gz(local_file_path_gz: str, partitions: int) -> List[st Return a list of `partitions` file names that are a roughly equal number of lines from `local_file_path_gz`. """ - filenames = [f"{local_file_path_gz}.part_{i+1}" for i in range(partitions)] + filenames = [f"{local_file_path_gz}.part_{i + 1}" for i in range(partitions)] # Read the file and write each line to a file, looping through the output files with gzip.open(local_file_path_gz, "rt") as f: @@ -232,6 +386,36 @@ def partition_file_lines_gz(local_file_path_gz: str, partitions: int) -> List[st return filenames +def initialize_variation_normalizer_ref_data(): + """Download and import the variation normalizer reference data script at runtime""" + # URL to the script + script_url = "https://raw.githubusercontent.com/GenomicMedLab/variation-normalizer-manuscript/issue-116/analysis/download_cool_seq_tool_files.py" + + # Download the script + response = requests.get(script_url, timeout=30) + response.raise_for_status() + + # Create a temporary file and write the script content + with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as temp_file: + temp_file.write(response.text) + temp_file_path = temp_file.name + + try: + # Import the module from the temporary file + spec = importlib.util.spec_from_file_location( + "download_cool_seq_tool_files", temp_file_path + ) + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + + # Call the download function + module.download_cool_seq_tool_files(is_docker_env=False) + + finally: + # Clean up the temporary file + os.unlink(temp_file_path) + + def main(argv=sys.argv[1:]): """ Process the --filename argument (expected as 'gs://..../filename.json.gz') @@ -251,30 +435,51 @@ def main(argv=sys.argv[1:]): # Make parents os.makedirs(os.path.dirname(outfile), exist_ok=True) + # Initialize the variation-normalizer to use specific snapshotted reference data. + initialize_variation_normalizer_ref_data() + if opts["parallelism"] == 0: - process_as_json_single_thread(local_file_name, outfile) + process_as_json_single_thread(local_file_name, outfile, opts) else: - process_as_json(local_file_name, outfile, opts["parallelism"]) + process_as_json(local_file_name, outfile, opts["parallelism"], opts) if __name__ == "__main__": + # Importing and initializing the variation-normalizer QueryHandler + # requires env vars to be set and dynamodb jar to be run locally and pointed to with GENE_NORM_DB_URL + # https://github.com/clingen-data-model/architecture/tree/master/helm/charts/clingen-vicc/docker/dynamodb + # In the `dynamodb` directory above, build: + # podman build -t gene-normalizer-dynamodb:latest . + # Then run it (uses host gcloud config to authenticate to our bucket which has a snapshot of the gene database) + # podman run -it -p 8001:8000 -v $HOME/.config/gcloud:/config/gcloud -v dynamodb:/data -e DATA_DIR=/data -e GOOGLE_APPLICATION_CREDENTIALS=/config/gcloud/application_default_credentials.json -e CLOUDSDK_CONFIG=/config/gcloud gene-normalizer-dynamodb:latest + creds_contents = """[default] + aws_access_key_id = asdf + aws_secret_access_key = asdf""" + aws_fake_creds_filename = "aws_fake_creds" + with open(aws_fake_creds_filename, "w") as f: + f.write(creds_contents) + os.environ["AWS_SHARED_CREDENTIALS_FILE"] = str( + pathlib.Path.cwd() / aws_fake_creds_filename + ) + if "GENE_NORM_DB_URL" not in os.environ: + raise RuntimeError("Must set GENE_NORM_DB_URL (e.g. 
http://localhost:8001)") + if "SEQREPO_ROOT_DIR" not in os.environ: + raise RuntimeError( + "Must set SEQREPO_ROOT_DIR (e.g. /Users/kferrite/dev/data/seqrepo/2024-12-20)" + ) + if "UTA_DB_URL" not in os.environ: + raise RuntimeError( + "Must set UTA_DB_URL (e.g. postgresql://anonymous@localhost:5433/uta/uta_20241220)" + ) + if len(sys.argv) == 1: main( [ "--filename", "gs://clinvar-gk-pilot/2025-03-23/dev/vi.json.gz", "--parallelism", - "10", + "2", ] ) - - # main( - # [ - # "--filename", - # "vi-100000.json.gz", - # "--parallelism", - # "1", - # ] - # ) else: main(sys.argv[1:]) diff --git a/pyproject.toml b/pyproject.toml index 09a728d..e1da952 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -9,9 +9,12 @@ license = { text = "MIT" } classifiers = ["Programming Language :: Python :: 3"] dependencies = [ "google-cloud-storage~=2.13.0", - "ga4gh.vrs[extras] @ git+https://github.com/ga4gh/vrs-python@2.1.1", + "ga4gh.vrs[extras] @ git+https://github.com/ga4gh/vrs-python@2.1.3", "gunicorn==22.0.0", "flask~=3.0.3", + "requests~=2.0", + "variation-normalizer @ git+https://github.com/cancervariants/variation-normalization@0.15.0", + # "variation-normalizer-manuscript @ git+https://github.com/GenomicMedLab/variation-normalizer-manuscript@issue-116", ] dynamic = ["version"] diff --git a/test/test_cli.py b/test/test_cli.py index 8505376..4d7a7ab 100644 --- a/test/test_cli.py +++ b/test/test_cli.py @@ -6,4 +6,5 @@ def test_parse_args(): opts = parse_args(argv) assert opts["filename"] == "test.txt" assert opts["parallelism"] == 1 - assert len(opts) == 2 + assert opts["liftover"] is False + assert len(opts) == 3
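
For illustration, a minimal `parse_args` call now yields a three-key options dictionary, matching the assertions in `test_parse_args` (the filename here is just an example value, and the defaults assume the definitions in `cli.py`):

```python
from clinvar_gk_pilot.cli import parse_args

# Only --filename is supplied; --parallelism and --liftover fall back to their defaults.
opts = parse_args(["--filename", "test.txt"])
print(opts)  # e.g. {'filename': 'test.txt', 'parallelism': 1, 'liftover': False}
```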