151 changes: 128 additions & 23 deletions README.md
@@ -1,6 +1,15 @@
# clinvar-gk-python

Project for reading and normalizing ClinVar variants into GA4GH GKS forms.

## Setup

### Prerequisites

1. **Docker** (or podman) - Required to run the variation-normalization services
2. **Python 3.11+** - Required for the main application
3. **SeqRepo database** - Local sequence repository
4. **UTA database** - Local Universal Transcript Archive (only needed for liftover)
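
A quick, illustrative way to confirm the first two prerequisites are in place (SeqRepo and UTA setup are covered below):

```bash
docker --version    # or: podman --version
python3 --version   # should report 3.11 or newer
```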

## Installation

@@ -16,46 +25,142 @@ cd clinvar-gk-python
pip install -e '.[dev]'
```

### Database Services Setup

This project requires several database services that can be easily set up using the Docker compose configuration from the variation-normalization project.

1. Download the compose.yaml file from variation-normalization v0.15.0 (matching the version in pyproject.toml):

```bash
curl -o variation-normalizer-compose.yaml https://raw.githubusercontent.com/cancervariants/variation-normalization/0.15.0/compose.yaml
```

2. Start the required services:

```bash
docker compose -f variation-normalizer-compose.yaml up -d
```
(*or `uvx podman-compose` for podman*)

This will start:
- **UTA database** (port 5432): Universal Transcript Archive for transcript mapping
- **Gene Normalizer database** (port 8000): Gene normalization service
- **Variation Normalizer API** (port 8001): Variation normalization service

**Note on Port Conflicts**: If you already have services running on these ports, you can modify the port mappings in `variation-normalizer-compose.yaml` (see the sketch after this list):
- For UTA database: Change `5432:5432` to `5433:5432` (or another available port)
- For Gene Normalizer: Change `8000:8000` to `8002:8000` (or another available port)
- For Variation Normalizer API: Change `8001:80` to `8003:80` (or another available port)
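
As an illustration, the remapped entries might look like the excerpt below. The service names here are assumptions and may not match the actual compose file, so check `variation-normalizer-compose.yaml` for the real keys:

```yaml
# Hypothetical excerpt; verify service names against the downloaded compose file
services:
  uta:
    ports:
      - "5433:5432"   # host port 5433 avoids a locally running PostgreSQL on 5432
  gene-normalizer:
    ports:
      - "8002:8000"
  variation-normalizer:
    ports:
      - "8003:80"
```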

Verify the containers are running on the expected host ports (e.g. UTA Postgres on 5432, or 5433 if you remapped it, and the gene normalizer database on 8000):
```bash
docker ps -a | grep 'uta\|gene-norm'
```

### Environment Configuration

Set up the required environment variables. You can use the provided `env.sh` as a reference:

```bash
# SeqRepo configuration - Update path to your local SeqRepo installation
export SEQREPO_ROOT_DIR=/usr/local/share/seqrepo/2024-12-20
export SEQREPO_DATAPROXY_URL=seqrepo+file://${SEQREPO_ROOT_DIR}

# Database URLs (using the Docker compose services)
export UTA_DB_URL=postgresql://anonymous:anonymous@localhost:5432/uta/uta_20241220
export GENE_NORM_DB_URL=http://localhost:8000
```

**Important**: If you modified the ports in the compose file, update the corresponding environment variables accordingly (e.g., change `5432` to `5433` in `UTA_DB_URL` if you changed the UTA port).
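
For example, if the UTA host port was remapped to 5433, the connection URL becomes:

```bash
export UTA_DB_URL=postgresql://anonymous:anonymous@localhost:5433/uta/uta_20241220
```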

### Python Installation

Install the project and its dependencies:

```bash
pip install -e '.[dev]'
```

## Running

### Basic Usage

The `clinvar_gk_pilot` main entrypoint can automatically download `gs://` URLs. It caches the file in a local directory called `buckets`, preserving the bucket name and path prefix; for example, `gs://clinvar-gks/2025-07-06/dev/vi.json.gz` is downloaded to `buckets/clinvar-gks/2025-07-06/dev/vi.json.gz`. The input file is expected to be GZIP-compressed and in JSONL/NDJSON format, with each line being a JSON object.

The output is written to the same path as the local input file, but under an `output` directory in the current working directory. For example, for the input `gs://clinvar-gks/2025-07-06/dev/vi.json.gz`, the file is cached to `buckets/clinvar-gks/2025-07-06/dev/vi.json.gz` and the output is written to `output/buckets/clinvar-gks/2025-07-06/dev/vi.json.gz`.
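
Since both input and output are GZIP-compressed NDJSON, a single record can be inspected with standard tools (illustrative command, assuming the cached file above exists):

```bash
gzip -dc buckets/clinvar-gks/2025-07-06/dev/vi.json.gz | head -n 1 | python3 -m json.tool
```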


Process a ClinVar variant-identity file:

```bash
python clinvar_gk_pilot/main.py --filename gs://clinvar-gks/2025-07-06/dev/vi.json.gz --parallelism 4
```

### Command Line Options

- `--filename`: Input file path (supports local files and gs:// URLs)
- `--parallelism`: Number of worker processes for parallel processing (default: 1)
- `--liftover`: Enable liftover functionality for genomic coordinate conversion

### Example Commands

Process a local file:
```bash
clinvar-gk-pilot --filename sample-input.ndjson.gz --parallelism 4
```

Process a file from Google Cloud Storage:
```bash
clinvar-gk-pilot --filename gs://clinvar-gks/2025-07-06/dev/vi.json.gz --parallelism 4
```

### Parallelism

Parallelism is configurable and uses Python multiprocessing and multiprocessing queues. Some parallelism is significantly beneficial, but interprocess communication overhead and contention on the same filesystem lead to diminishing returns. On a MacBook Pro with 16 cores, setting parallelism to 4-6 provides a clear benefit, while exceeding 10 saturates the machine and may be counterproductive. The input file is partitioned into `<parallelism>` files, each worker processes one partition, and the outputs are then combined.

If parallelism is enabled, each worker also monitors its child process, terminates excessively long tasks, and adds an error annotation to the output record for that variant indicating that it exceeded the time limit.
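
The sketch below illustrates the partition / worker / combine pattern described above. It is a minimal, hypothetical example: the function and file names are invented, and it is not the project's actual implementation, which also handles per-record time limits and error annotations.

```python
# Minimal sketch of partition -> parallel workers -> combined output.
# All names here are illustrative, not the project's real code.
import gzip
import multiprocessing as mp
from pathlib import Path


def partition(input_path: str, n: int) -> list[Path]:
    """Split a gzipped NDJSON file into n roughly equal gzipped parts."""
    parts = [Path(f"part-{i}.ndjson.gz") for i in range(n)]
    writers = [gzip.open(p, "wt") for p in parts]
    with gzip.open(input_path, "rt") as fin:
        for i, line in enumerate(fin):
            writers[i % n].write(line)
    for w in writers:
        w.close()
    return parts


def process_part(part: Path, out_path: Path) -> None:
    """Worker: process one partition and write its own output file."""
    with gzip.open(part, "rt") as fin, gzip.open(out_path, "wt") as fout:
        for line in fin:
            fout.write(line)  # real code would normalize the record here


if __name__ == "__main__":
    parts = partition("vi.json.gz", 4)
    outs = [Path(f"out-{i}.ndjson.gz") for i in range(len(parts))]
    workers = [
        mp.Process(target=process_part, args=(p, o)) for p, o in zip(parts, outs)
    ]
    for w in workers:
        w.start()
    for w in workers:
        # A monitor could instead use w.join(timeout=...) and w.terminate()
        # to cut off excessively long tasks, annotating the affected record.
        w.join()
    with gzip.open("combined.ndjson.gz", "wt") as fout:  # combine worker outputs
        for o in outs:
            with gzip.open(o, "rt") as fin:
                fout.write(fin.read())
```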


### Important Notes on Liftover

When using the `--liftover` option, the application will send queries to the UTA PostgreSQL database for genomic coordinate conversion. Due to Docker's default shared memory constraints, high parallelism combined with liftover can cause out-of-memory errors.

**Recommendations:**
- Keep `--parallelism` low (2-4) when using `--liftover` with UTA running in Docker (see the example command below)
- Alternatively, increase the `shm_size` for the UTA container in `variation-normalizer-compose.yaml`:

```yaml
services:
  uta:
    # ... other configuration
    shm_size: 256m  # Add this line to increase shared memory to 256MB
```
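
For example, a liftover run with conservative parallelism (reusing the input file from above) might look like:

```bash
python clinvar_gk_pilot/main.py --filename gs://clinvar-gks/2025-07-06/dev/vi.json.gz --parallelism 2 --liftover
```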

## Development


### Testing

Run the test suite:
```bash
pytest
```

Run specific tests:
```bash
pytest test/test_cli.py::test_parse_args
```

### Code Quality

Check and fix code quality issues:
```bash
# Check code quality
./lint.sh

# Apply automatic fixes
./lint.sh apply
```

The lint script runs black, isort, ruff, and pylint.
5 changes: 5 additions & 0 deletions clinvar_gk_pilot/cli.py
@@ -18,4 +18,9 @@ def parse_args(args: List[str]) -> dict:
            "Set to 0 to run in main thread."
        ),
    )
    parser.add_argument(
        "--liftover",
        action="store_true",
        help="Enable attempting to liftover non-GRCh38 genomic variants to GRCh38",
    )
    return vars(parser.parse_args(args))
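
For reference, a minimal (hypothetical) call to the parser shown above; the exact set of required arguments may differ:

```python
from clinvar_gk_pilot.cli import parse_args

# --liftover is a store_true flag, so it appears as a boolean in the result dict
opts = parse_args(["--filename", "vi.json.gz", "--liftover"])
assert opts["liftover"] is True
```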