Conversation

Contributor

@project-defiant project-defiant commented Nov 27, 2025

✨ Context

This PR is a set of unrelated changes (refactoring) that I collected after the eQTL Catalogue GTEx v10 credible set harmonization.

🛠 What does this PR implement

Running tests with non-shared spark session

The test target is now split into two targets, test-no-shared-spark-session and test-shared-spark-session. Both are dependencies of the test target, which later combines the coverage from both pytest runs.

  • test-no-shared-spark-session - the first pass runs all tests that need to set up the Session with custom parameters (the fixture automatically cleans up the Spark session if one exists; a minimal sketch follows this list). These tests cannot run in parallel by design, as they cannot attach to the same global SparkSession object. They are marked with the no_shared_spark pytest mark.
  • test-shared-spark-session - all other tests, which can use the default session (currently all implemented tests); these can safely run in parallel.
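
A minimal sketch of the cleanup idea behind the no_shared_spark mark (the fixture name and wiring are hypothetical, not the PR's actual conftest):

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(autouse=True)
def _stop_active_spark(request: pytest.FixtureRequest) -> None:
    """Stop any active SparkSession before a test marked no_shared_spark runs."""
    if request.node.get_closest_marker("no_shared_spark"):
        active = SparkSession.getActiveSession()
        if active is not None:
            active.stop()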

Running tests with external jar dependencies to spark

An additional test target, test-no-shared-spark-session-web-dependencies, is not run by default.

This target includes tests marked with both the download_jars_from_web and no_shared_spark pytest marks. These tests require additional jars (like enhanced_bgzip_codec) that need to be pre-fetched along with their dependencies from Maven before the tests can run successfully; since they require internet access, they are not run by default.
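
For illustration, such a test might carry both marks (a hedged sketch; the test name and body are hypothetical):

import pytest


@pytest.mark.no_shared_spark
@pytest.mark.download_jars_from_web
def test_read_enhanced_bgzip_summary_statistics() -> None:
    """Runs only in the web-dependencies target, after the extra jars have been fetched from Maven."""
    ...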

Session object refactoring

The changes to the Session object mainly make it easy to recreate the session object without passing it to downstream functions. For example, running

def transform_dataset(path: str, session: Session | None = None) -> DataFrame:
    session = session or Session.find()
    df = session.spark.read.parquet(path)
    ...
    return df

does not require passing the Session by reference; the function can rely on the Session.find() constructor to recreate the Session object from the existing SparkSession, or raise an exception if no SparkSession is found.

Changes

  • Introduction of the Session.find() method, which searches for an existing SparkSession and recreates the Session object from it. If no SparkSession is found, the method raises an exception.
  • Session now sets all attributes (like write_mode or partition_number) directly on the SparkConf object through the spark.gentropy.* parameters, for example spark.gentropy.write_mode. This keeps the Session a lightweight wrapper with little overhead when it is recreated (see the sketch after this list).
  • The Session default constructor now passes all of the attributes to SparkConf during the first startup.
  • The Hail jar path defaults to the pip-installed jar if no hail_home is provided.
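
A hedged sketch of the intent; argument and configuration key names follow this description and may differ slightly in the final code:

from gentropy.common.session import Session

# Attributes passed to the constructor end up in the Spark conf under spark.gentropy.*,
# so a recreated wrapper can read them back instead of carrying its own state.
session = Session(write_mode="overwrite")

# Elsewhere, without the original object in scope:
recreated = Session.find()  # raises if no SparkSession is active
assert recreated.spark.conf.get("spark.gentropy.write_mode") == "overwrite"  # key name per the description above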

External dependency composition

The refactored session allows setting up multiple jar dependencies by composing them in:

  • spark.jars
  • spark.jars.packages
  • spark.driver.extraClassPath
  • spark.executor.extraClassPath

This is required because adding a new dependency via extended_spark_conf could previously override the Hail jar configuration: SparkConf.set overwrites the existing value instead of composing it into a separator-joined string.
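
A minimal sketch of the composition idea, assuming comma-separated values for spark.jars and spark.jars.packages (the helper and jar names are illustrative, not the PR's exact code):

from pyspark import SparkConf


def compose_conf_value(conf: SparkConf, key: str, value: str, separator: str = ",") -> SparkConf:
    """Append `value` to an existing conf entry instead of overwriting it."""
    existing = conf.get(key, "")
    return conf.set(key, separator.join(filter(None, [existing, value])))


conf = SparkConf().set("spark.jars", "hail-all-spark.jar")
conf = compose_conf_value(conf, "spark.jars", "enhanced-bgzip-codec.jar")
# conf.get("spark.jars") is now "hail-all-spark.jar,enhanced-bgzip-codec.jar"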

Default constructor behavior

The refactored Session default constructor behaves as follows:

  1. Create default values for the expected parameters.
  2. Build the expected configuration.
  3. Check whether a SparkSession already exists:
    3.1 If it exists, reuse it and set up the logger and conf attributes.
    3.2 Compare the existing config to the expected config and warn if something is unexpected.
  4. If none exists, build a new session with the expected config and start Hail if requested.
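
A minimal sketch of this flow in plain pyspark terms (the helper and warning text are illustrative, and the Hail startup step is omitted):

import warnings

from pyspark import SparkConf
from pyspark.sql import SparkSession


def get_or_create_spark(expected_conf: SparkConf) -> SparkSession:
    """Reuse an active SparkSession when present, otherwise build one from the expected conf."""
    active = SparkSession.getActiveSession()
    if active is not None:
        # 3.1 reuse the existing session; 3.2 warn if its config differs from what was expected
        for key, expected in expected_conf.getAll():
            current = active.conf.get(key, None)
            if current is not None and current != expected:
                warnings.warn(f"Spark conf {key}={current!r} differs from expected {expected!r}")
        return active
    # 4. no active session: start a new one with the expected config
    return SparkSession.builder.config(conf=expected_conf).getOrCreate()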

Startup logging

The startup logging for the SparkSession, for example:

In [2]: Session()
26/02/11 07:58:45 WARN Utils: Your hostname, mindos resolves to a loopback address: 127.0.1.1; using 192.168.0.100 instead (on interface eno1)
26/02/11 07:58:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/02/11 07:58:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

is now silenced via the configuration in log4j.properties; the default level of INFO is set on the log4j logger only after the Session is created, which prevents the startup logs.
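
For reference, a minimal sketch of raising the JVM log level back to INFO once the Session exists (illustrative; the exact mechanism in the PR may differ):

from gentropy.common.session import Session

session = Session()
# Suppress only the startup noise: the JVM-side log level is raised back to INFO
# after the session has been created.
session.spark.sparkContext.setLogLevel("INFO")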

Self-contained _setup_*_config methods

The Session._setup_*_config methods are now self-contained and allow for SparkConf composition.

Data loading

Session.load_data is now refactored to allow loading native file formats (the NativeFileFormat enum). For tsv or csv files we also allow reading via urllib as a fallback, since the SparkFiles-based solution never worked on Dataproc.
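
A minimal sketch of the urllib fallback for remote tsv/csv files (the function name and details are illustrative, assuming the behavior described above):

from urllib.request import urlopen

from pyspark.sql import DataFrame, SparkSession


def load_delimited_from_url(spark: SparkSession, url: str, sep: str = "\t") -> DataFrame:
    """Fetch a small delimited file over HTTP(S) with urllib and parallelize it directly."""
    with urlopen(url) as response:
        lines = response.read().decode("utf-8").splitlines()
    header, *rows = (line.split(sep) for line in lines)
    return spark.createDataFrame(rows, schema=header)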

Others

This PR also includes the changes needed for processing the latest eQTL Catalogue data in our buckets.

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. uv run pre-commit run --all-files)?

@github-actions github-actions bot added documentation and size-L and removed size-M labels Feb 4, 2026
@github-actions github-actions bot added size-XL and removed size-L labels Feb 7, 2026
@project-defiant project-defiant changed the title WIP Gtex v10 feat: Session refactoring Feb 7, 2026
@project-defiant project-defiant self-assigned this Feb 7, 2026
Copilot AI left a comment

Pull request overview

This PR refactors gentropy.common.session.Session into a more feature-rich SparkSession wrapper (including file loading helpers, log4j defaults, dynamic allocation, and enums), updates multiple datasources/steps to use the new APIs, and reorganizes parts of the test suite (including “no_spark” tests).

Changes:

  • Major rewrite of Session (new enums, config composition, URL loading, log4j assets, updated write-mode handling).
  • eQTL Catalogue ingestion updates (metadata path parameterization, refactors in study index/finemapping readers, chromosome normalization).
  • Test suite updates (new no_spark tests, fixture changes, Session/load_data tests, Hail init test adjustments).

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.

Summary per file:

  • tests/gentropy/no_spark/test_no_spark.py: Adds “no_spark” tests for Session creation/config and a web-dependent BGZIP→Parquet test.
  • tests/gentropy/datasource/gnomad/test_gnomad_ld.py: Switches Hail init to use session.spark.sparkContext instead of the raw SparkSession.
  • tests/gentropy/datasource/finngen_meta/test_finngen_meta_summary_statistics.py: Removes a long BGZIP codec test and introduces a new (currently unused) Spark-active flag.
  • tests/gentropy/datasource/finngen/test_finngen_finemapping.py: Switches the test to use the Session fixture and updates Hail init + Spark argument passing.
  • tests/gentropy/datasource/eqtl_catalogue/test_eqtl_catalogue.py: Fixes the _setup fixture return annotation (now None).
  • tests/gentropy/conftest.py: Refactors the Spark fixture lifecycle and adjusts several fixtures for new datasource APIs/types.
  • tests/gentropy/common/test_session.py: Adds tests for Session.load_data and HTTP/HTTPS loading via monkeypatching urlopen.
  • src/utils/spark.py: Minor docstring formatting tweak.
  • src/gentropy/l2g.py: Updates the feature matrix load to explicitly pass the parquet format.
  • src/gentropy/eqtl_catalogue.py: Adds a configurable metadata path, removes explicit Session passing to readers, and coalesces the study index output.
  • src/gentropy/datasource/finngen_meta/summary_statistics.py: Updates enhanced-BGZIP gating, forces threadpool map evaluation, and reformats assertions.
  • src/gentropy/datasource/eqtl_catalogue/study_index.py: Removes the pandas dependency, validates blacklist methods, loads metadata via Session.load_data, and introduces enums for mappings.
  • src/gentropy/datasource/eqtl_catalogue/finemapping.py: Moves to Session.find()-based reads, adds chromosome normalization, and uses NativeFileFormat.
  • src/gentropy/datasource/eqtl_catalogue/__init__.py: Introduces QuantificationMethod and StudyType StrEnums.
  • src/gentropy/dataset/dataset.py: Adds a generic Dataset.read() powered by Session.load_data.
  • src/gentropy/dataset/colocalisation.py: Defers imports to function scope to avoid top-level dependencies.
  • src/gentropy/config.py: Extends SessionConfig and adds eqtl_catalogue_metadata_path to the config.
  • src/gentropy/common/session.py: Large refactor: enums, config assembly, log4j integration, URL loading, runtime-conf updating, and the new find().
  • src/gentropy/assets/log4j.properties: Adds a log4j properties asset for Spark driver logging configuration.
  • src/gentropy/__init__.py: Expands pyspark pandas-on-spark warning suppression.
  • pyproject.toml: Updates the pytest xdist distribution and adds the no_spark marker (but not webtest).
  • docs/python_api/common/session.md: Updates the API docs export list (now includes SparkWriteMode).
  • Makefile: Splits test targets into no-spark/spark/web and combines coverage outputs.


Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 8 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/gentropy/eqtl_catalogue.py:59

  • read_credible_set_from_source / read_lbf_from_source are invoked without the session parameter, so they rely on Session.find() internally. Since this step already has an explicit session, pass it through to avoid depending on global active Spark state and to ensure consistent Spark config is used.
        credible_sets_df = EqtlCatalogueFinemapping.read_credible_set_from_source(
            credible_set_path=[
                f"{eqtl_catalogue_paths_imported}/{qtd_id}.credible_sets.tsv"
                for qtd_id in studies_to_ingest
            ],
        )
        lbf_df = EqtlCatalogueFinemapping.read_lbf_from_source(
            lbf_path=[
                f"{eqtl_catalogue_paths_imported}/{qtd_id}.lbf_variable.txt"
                for qtd_id in studies_to_ingest
            ],
        )


Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Contributor

@DSuveges DSuveges left a comment

It's a very elaborate PR with a number of updates. I see the value in how handling the session evolved and requires more and more sophisticated machinery. My only actionable comment is about the use of format='tsv', which fails in my local tests. It might also be an issue that, upon reading tsv/csv, the header is lost without header=True.

PARQUET = "parquet"
CSV = "csv"
TSV = "tsv"
JSON = "json"
Contributor

When we have json files, that usually means jsonl, isn't that parallelizable?

... "spark.executor.cores": "4",
... "spark.executor.memory": "8g",
... },
... ) # doctest: +SKIP
Contributor

Does this notation mean these examples are excluded from tests?

Contributor Author

yes, these are non-runnable examples.

return SparkWriteMode(
    self.conf.get(
        "spark.gentropy.writeMode", SparkWriteMode.ERROR_IF_EXISTS.value
    )
)
Contributor

I'm not sure how this works... Isn't the type of SparkWriteMode.ERROR_IF_EXISTS.value a string? However, the type of self.conf["spark.gentropy.writeMode"] is SparkWriteMode? See row 160:

self._write_mode = write_mode or SparkWriteMode.ERROR_IF_EXISTS

Contributor Author

This is a bit confusing. The enum was there so that the type hints point to a defined set of allowed options rather than an arbitrary string, which can get more confusing. SparkSession requires the str value, but we could get away with specifying the Enum as the type while passing it as a string: the implicit str(EnumType) falls back to the actual EnumType.value. I have removed the Enum from the type hints to make it clearer.
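
For illustration, the StrEnum behaviour being described (hypothetical enum and values, Python 3.11+):

from enum import StrEnum


class WriteMode(StrEnum):
    ERROR_IF_EXISTS = "errorifexists"
    OVERWRITE = "overwrite"


# str() of a StrEnum member is its value, so it can be passed wherever Spark expects a string,
# and the raw string read back from the conf can be turned into the enum member again.
assert str(WriteMode.ERROR_IF_EXISTS) == "errorifexists"
assert WriteMode("errorifexists") is WriteMode.ERROR_IF_EXISTS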

from enum import StrEnum


class QuantificationMethod(StrEnum):
Contributor

This is a very generic comment, but I'm wondering if the study types and the other enums are specific to the eQTL Catalogue data ingestion? I assume the same enums can be re-used in different parsers, e.g. UKB PPP maybe? Shouldn't this be placed somewhere shared?

Contributor Author

I agree; I did not want to move it out in this PR though, as it breaks a lot of stuff.

sep="\t",
header=True,
schema=cls.raw_credible_set_schema,
fmt=NativeFileFormat.TSV.value,
Contributor

See my other comments about tsv. My experience:

In [3]: spark.read.load('/Users/dsuveges/project_data/releases/25.12/input/evidence/ot_crispr/config.tsv', format='tsv').show()

Fails. However this works:

spark.read.load('/Users/dsuveges/project_data/releases/25.12/input/evidence/ot_crispr/config.tsv', format='csv', sep='\t', header=True).show()

Contributor

Also, besides specifying the field separator, it seems the header defaults to False, so this needs to be adjusted in the session.load method.

sep="\t",
header=True,
schema=cls.raw_lbf_schema,
fmt=NativeFileFormat.TSV.value,
Contributor

Same.

eqtl_catalogue_paths_imported: str = MISSING
eqtl_catalogue_study_index_out: str = MISSING
eqtl_catalogue_credible_sets_out: str = MISSING
eqtl_catalogue_metadata_path: str = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/fe3c4b4ed911b3a184271a6aadcd8c8769a66aba/data_tables/dataset_metadata.tsv"
Contributor

Seeing a hardcoded URL in the config looks very strange. Is it intentional?

Contributor Author

No, I guess this should not be the default, I forgot about it, thanks!

[
pytest.param("test_path.parquet", "parquet", {}, id="parquet"),
pytest.param("test_path.csv", "csv", {}, id="csv"),
pytest.param("test_path.tsv", "tsv", {}, id="tsv"),
Contributor

So, you are testing for tsv... how does it work here? When tsv and csv data is read, there's no test that the header is there, right?

_stop_active_spark()


@pytest.mark.no_shared_spark
Contributor

I don't really know how these marks work in pytest.

Contributor Author

You can use the marks to filter the tests you want to run, or mark tests to behave differently. This is just a marker that indicates these tests are supposed to run without a shared Spark session; the actual implementation is written elsewhere (typically with the pytest request fixture, which lets you access the test and modify its behavior).

On another note, pytest -m 'no_shared_spark' will run only the tests with this mark.

Contributor

Thanks for explaining!
