Conversation

Contributor

@project-defiant project-defiant commented Nov 27, 2025

✨ Context

This PR is a set of unrelated changes (refactoring) that I collected after the eQTL Catalogue GTEx v10 credible set harmonization.

🛠 What does this PR implement

Running tests with non-shared spark session

The test target is now split into two targets, test-no-shared-spark-session and test-shared-spark-session. Both are dependencies of the test target, which later combines the coverage from both pytest runs.

  • test-no-shared-spark-session - the first pass runs all tests that need to set up the Session with custom parameters (the fixture automatically cleans up the Spark session if one exists; a minimal sketch follows this list). These tests cannot run in parallel by design, as they cannot attach to the same global SparkSession object. They are marked with the no_shared_spark pytest mark.
  • test-shared-spark-session - all other tests, which can use the default session (currently all implemented tests); these can safely run in parallel.
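
A minimal sketch of the cleanup idea behind the no_shared_spark mark (the fixture name and wiring are hypothetical, not the PR's actual conftest):

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(autouse=True)
def _stop_active_spark(request: pytest.FixtureRequest) -> None:
    """Stop any active SparkSession before a test marked no_shared_spark runs."""
    if request.node.get_closest_marker("no_shared_spark"):
        active = SparkSession.getActiveSession()
        if active is not None:
            active.stop()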

Running tests with external jar dependencies to spark

An additional test target, test-no-shared-spark-session-web-dependencies, is not run by default.

This target includes tests marked with both the download_jars_from_web and no_shared_spark pytest marks. These tests require additional jars (like enhanced_bgzip_codec) that need to be pre-fetched along with their dependencies from Maven before the tests can run successfully; since they require internet access, they are not run by default.
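
For illustration, such a test might carry both marks (a hedged sketch; the test name and body are hypothetical):

import pytest


@pytest.mark.no_shared_spark
@pytest.mark.download_jars_from_web
def test_read_enhanced_bgzip_summary_statistics() -> None:
    """Runs only in the web-dependencies target, after the extra jars have been fetched from Maven."""
    ...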

Session object refactoring

The changes to the Session object mainly make it easy to recreate the session object without passing it to downstream functions. For example, running

def transform_dataset(path: str, session: Session | None = None) -> DataFrame:
    session = session or Session.find()
    df = session.spark.read.parquet(path)
    ...
    return df

does not require passing the Session by reference; the function can rely on the Session.find() constructor to recreate the Session object from the existing SparkSession, or raise an exception if no SparkSession is found.

Changes

  • Introduction of the Session.find() method, which searches for an existing SparkSession and recreates the Session object from it. If no SparkSession is found, the method raises an exception.
  • Session now sets all attributes (like write_mode or partition_number) directly on the SparkConf object through the spark.gentropy.* parameters, for example spark.gentropy.write_mode. This keeps the Session a lightweight wrapper with little overhead when it is recreated (see the sketch after this list).
  • The Session default constructor now passes all of the attributes to SparkConf during the first startup.
  • The Hail jar path defaults to the pip-installed jar if no hail_home is provided.
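
A hedged sketch of the intent; argument and configuration key names follow this description and may differ slightly in the final code:

from gentropy.common.session import Session

# Attributes passed to the constructor end up in the Spark conf under spark.gentropy.*,
# so a recreated wrapper can read them back instead of carrying its own state.
session = Session(write_mode="overwrite")

# Elsewhere, without the original object in scope:
recreated = Session.find()  # raises if no SparkSession is active
assert recreated.spark.conf.get("spark.gentropy.write_mode") == "overwrite"  # key name per the description above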

External dependency composition

The refactored session allows setting up multiple jar dependencies by composing them in:

  • spark.jars
  • spark.jars.packages
  • spark.driver.extraClassPath
  • spark.executor.extraClassPath

This is required because adding a new dependency via extended_spark_conf could previously override the Hail jar configuration: SparkConf.set overwrites the existing value instead of composing it into a separator-joined string.
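
A minimal sketch of the composition idea, assuming comma-separated values for spark.jars and spark.jars.packages (the helper and jar names are illustrative, not the PR's exact code):

from pyspark import SparkConf


def compose_conf_value(conf: SparkConf, key: str, value: str, separator: str = ",") -> SparkConf:
    """Append `value` to an existing conf entry instead of overwriting it."""
    existing = conf.get(key, "")
    return conf.set(key, separator.join(filter(None, [existing, value])))


conf = SparkConf().set("spark.jars", "hail-all-spark.jar")
conf = compose_conf_value(conf, "spark.jars", "enhanced-bgzip-codec.jar")
# conf.get("spark.jars") is now "hail-all-spark.jar,enhanced-bgzip-codec.jar"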

Default constructor behavior

The refactored Session default constructor behaves as follows:

  1. Create default values for the expected parameters.
  2. Build the expected configuration.
  3. Check whether a SparkSession already exists:
    3.1 If it exists, reuse it and set up the logger and conf attributes.
    3.2 Compare the existing config to the expected config and warn if something is unexpected.
  4. If none exists, build a new session with the expected config and start Hail if requested.
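
A minimal sketch of this flow in plain pyspark terms (the helper and warning text are illustrative, and the Hail startup step is omitted):

import warnings

from pyspark import SparkConf
from pyspark.sql import SparkSession


def get_or_create_spark(expected_conf: SparkConf) -> SparkSession:
    """Reuse an active SparkSession when present, otherwise build one from the expected conf."""
    active = SparkSession.getActiveSession()
    if active is not None:
        # 3.1 reuse the existing session; 3.2 warn if its config differs from what was expected
        for key, expected in expected_conf.getAll():
            current = active.conf.get(key, None)
            if current is not None and current != expected:
                warnings.warn(f"Spark conf {key}={current!r} differs from expected {expected!r}")
        return active
    # 4. no active session: start a new one with the expected config
    return SparkSession.builder.config(conf=expected_conf).getOrCreate()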

Startup logging

The startup logging for the SparkSession, for example:

In [2]: Session()
26/02/11 07:58:45 WARN Utils: Your hostname, mindos resolves to a loopback address: 127.0.1.1; using 192.168.0.100 instead (on interface eno1)
26/02/11 07:58:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/02/11 07:58:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

is now silenced via the configuration in log4j.properties; the default level of INFO is set on the log4j logger only after the Session is created, which prevents the startup logs.
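
For reference, a minimal sketch of raising the JVM log level back to INFO once the Session exists (illustrative; the exact mechanism in the PR may differ):

from gentropy.common.session import Session

session = Session()
# Suppress only the startup noise: the JVM-side log level is raised back to INFO
# after the session has been created.
session.spark.sparkContext.setLogLevel("INFO")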

Self-contained _setup_*_config methods

The Session._setup_*_config methods are now self-contained and allow for SparkConf composition.

Data loading

Session.load_data is now refactored to allow loading native file formats (the NativeFileFormat enum). For tsv or csv files we also allow reading via urllib as a fallback, since the SparkFiles-based solution never worked on Dataproc.
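
A minimal sketch of the urllib fallback for remote tsv/csv files (the function name and details are illustrative, assuming the behavior described above):

from urllib.request import urlopen

from pyspark.sql import DataFrame, SparkSession


def load_delimited_from_url(spark: SparkSession, url: str, sep: str = "\t") -> DataFrame:
    """Fetch a small delimited file over HTTP(S) with urllib and parallelize it directly."""
    with urlopen(url) as response:
        lines = response.read().decode("utf-8").splitlines()
    header, *rows = (line.split(sep) for line in lines)
    return spark.createDataFrame(rows, schema=header)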

Others

This PR also includes the changes needed for processing the latest eQTL Catalogue data in our buckets.

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. uv run pre-commit run --all-files)?

@github-actions github-actions bot added documentation and size-L and removed size-M labels Feb 4, 2026
@github-actions github-actions bot added size-XL and removed size-L labels Feb 7, 2026
@project-defiant project-defiant changed the title WIP Gtex v10 feat: Session refactoring Feb 7, 2026
@project-defiant project-defiant self-assigned this Feb 7, 2026
Copilot AI left a comment

Pull request overview

This PR refactors gentropy.common.session.Session into a more feature-rich SparkSession wrapper (including file loading helpers, log4j defaults, dynamic allocation, and enums), updates multiple datasources/steps to use the new APIs, and reorganizes parts of the test suite (including “no_spark” tests).

Changes:

  • Major rewrite of Session (new enums, config composition, URL loading, log4j assets, updated write-mode handling).
  • eQTL Catalogue ingestion updates (metadata path parameterization, refactors in study index/finemapping readers, chromosome normalization).
  • Test suite updates (new no_spark tests, fixture changes, Session/load_data tests, Hail init test adjustments).

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.

Summary per file:

  • tests/gentropy/no_spark/test_no_spark.py: Adds “no_spark” tests for Session creation/config and a web-dependent BGZIP→Parquet test.
  • tests/gentropy/datasource/gnomad/test_gnomad_ld.py: Switches Hail init to use session.spark.sparkContext instead of the raw SparkSession.
  • tests/gentropy/datasource/finngen_meta/test_finngen_meta_summary_statistics.py: Removes a long BGZIP codec test and introduces a new (currently unused) Spark-active flag.
  • tests/gentropy/datasource/finngen/test_finngen_finemapping.py: Switches the test to use the Session fixture and updates Hail init + Spark argument passing.
  • tests/gentropy/datasource/eqtl_catalogue/test_eqtl_catalogue.py: Fixes the _setup fixture return annotation (now None).
  • tests/gentropy/conftest.py: Refactors the Spark fixture lifecycle and adjusts several fixtures for new datasource APIs/types.
  • tests/gentropy/common/test_session.py: Adds tests for Session.load_data and HTTP/HTTPS loading via monkeypatching urlopen.
  • src/utils/spark.py: Minor docstring formatting tweak.
  • src/gentropy/l2g.py: Updates the feature matrix load to explicitly pass the parquet format.
  • src/gentropy/eqtl_catalogue.py: Adds a configurable metadata path, removes explicit Session passing to readers, and coalesces the study index output.
  • src/gentropy/datasource/finngen_meta/summary_statistics.py: Updates enhanced-BGZIP gating, forces threadpool map evaluation, and reformats assertions.
  • src/gentropy/datasource/eqtl_catalogue/study_index.py: Removes the pandas dependency, validates blacklist methods, loads metadata via Session.load_data, and introduces enums for mappings.
  • src/gentropy/datasource/eqtl_catalogue/finemapping.py: Moves to Session.find()-based reads, adds chromosome normalization, and uses NativeFileFormat.
  • src/gentropy/datasource/eqtl_catalogue/__init__.py: Introduces QuantificationMethod and StudyType StrEnums.
  • src/gentropy/dataset/dataset.py: Adds a generic Dataset.read() powered by Session.load_data.
  • src/gentropy/dataset/colocalisation.py: Defers imports to function scope to avoid top-level dependencies.
  • src/gentropy/config.py: Extends SessionConfig and adds eqtl_catalogue_metadata_path to the config.
  • src/gentropy/common/session.py: Large refactor: enums, config assembly, log4j integration, URL loading, runtime-conf updating, and the new find().
  • src/gentropy/assets/log4j.properties: Adds a log4j properties asset for Spark driver logging configuration.
  • src/gentropy/__init__.py: Expands pyspark pandas-on-spark warning suppression.
  • pyproject.toml: Updates the pytest xdist distribution and adds the no_spark marker (but not webtest).
  • docs/python_api/common/session.md: Updates the API docs export list (now includes SparkWriteMode).
  • Makefile: Splits test targets into no-spark/spark/web and combines coverage outputs.


Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 8 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/gentropy/eqtl_catalogue.py:59

  • read_credible_set_from_source / read_lbf_from_source are invoked without the session parameter, so they rely on Session.find() internally. Since this step already has an explicit session, pass it through to avoid depending on global active Spark state and to ensure consistent Spark config is used.
        credible_sets_df = EqtlCatalogueFinemapping.read_credible_set_from_source(
            credible_set_path=[
                f"{eqtl_catalogue_paths_imported}/{qtd_id}.credible_sets.tsv"
                for qtd_id in studies_to_ingest
            ],
        )
        lbf_df = EqtlCatalogueFinemapping.read_lbf_from_source(
            lbf_path=[
                f"{eqtl_catalogue_paths_imported}/{qtd_id}.lbf_variable.txt"
                for qtd_id in studies_to_ingest
            ],
        )


Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Copilot AI left a comment

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Contributor

@DSuveges DSuveges left a comment

It's a very elaborate PR with a number of updates. I see the value in how handling the session evolved and requires more and more sophisticated machinery. My only actionable comment is about the use of format='tsv', which fails in my local tests. It might also be an issue that, upon reading tsv/csv, the header is lost without header=True.

PARQUET = "parquet"
CSV = "csv"
TSV = "tsv"
JSON = "json"
Contributor

When we have json files, that usually means jsonl, isn't that parallelizable?

... "spark.executor.cores": "4",
... "spark.executor.memory": "8g",
... },
... ) # doctest: +SKIP
Contributor

Does this notation mean these examples are excluded from tests?

Contributor Author

yes, these are non-runnable examples.

return SparkWriteMode(
    self.conf.get(
        "spark.gentropy.writeMode", SparkWriteMode.ERROR_IF_EXISTS.value
    )
)
Contributor

I'm not sure how this works... Isn't the type of SparkWriteMode.ERROR_IF_EXISTS.value a string? However, the type of self.conf["spark.gentropy.writeMode"] is SparkWriteMode? See row 160:

self._write_mode = write_mode or SparkWriteMode.ERROR_IF_EXISTS

Contributor Author

This is a bit confusing. The enum was there so that the type hints point to a defined set of allowed options rather than an arbitrary string, which can get more confusing. SparkSession requires the str value, but we could get away with specifying the Enum as the type while passing it as a string: the implicit str(EnumType) falls back to the actual EnumType.value. I have removed the Enum from the type hints to make it clearer.
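
For illustration, the StrEnum behaviour being described (hypothetical enum and values, Python 3.11+):

from enum import StrEnum


class WriteMode(StrEnum):
    ERROR_IF_EXISTS = "errorifexists"
    OVERWRITE = "overwrite"


# str() of a StrEnum member is its value, so it can be passed wherever Spark expects a string,
# and the raw string read back from the conf can be turned into the enum member again.
assert str(WriteMode.ERROR_IF_EXISTS) == "errorifexists"
assert WriteMode("errorifexists") is WriteMode.ERROR_IF_EXISTS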

from enum import StrEnum


class QuantificationMethod(StrEnum):
Contributor

This is a very generic comment, but I'm wondering if the study types and the other enums are specific to the eQTL Catalogue data ingestion? I assume the same enums can be re-used in different parsers, e.g. UKB PPP maybe? Shouldn't this be placed somewhere shared?

Contributor Author

I agree; I did not want to move it out in this PR though, as it breaks a lot of stuff.

sep="\t",
header=True,
schema=cls.raw_credible_set_schema,
fmt=NativeFileFormat.TSV.value,
Contributor

See my other comments about tsv. My experience:

In [3]: spark.read.load('/Users/dsuveges/project_data/releases/25.12/input/evidence/ot_crispr/config.tsv', format='tsv').show()

Fails. However this works:

spark.read.load('/Users/dsuveges/project_data/releases/25.12/input/evidence/ot_crispr/config.tsv', format='csv', sep='\t', header=True).show()

Contributor

Also, besides specifying the field separator, it seems the header defaults to False, so this needs to be adjusted in the session.load method.

sep="\t",
header=True,
schema=cls.raw_lbf_schema,
fmt=NativeFileFormat.TSV.value,
Contributor

Same.

eqtl_catalogue_paths_imported: str = MISSING
eqtl_catalogue_study_index_out: str = MISSING
eqtl_catalogue_credible_sets_out: str = MISSING
eqtl_catalogue_metadata_path: str = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/fe3c4b4ed911b3a184271a6aadcd8c8769a66aba/data_tables/dataset_metadata.tsv"
Contributor

Seeing a hardcoded URL in the config looks very strange. Is it intentional?

Contributor Author

No, I guess this should not be the default, I forgot about it, thanks!

[
pytest.param("test_path.parquet", "parquet", {}, id="parquet"),
pytest.param("test_path.csv", "csv", {}, id="csv"),
pytest.param("test_path.tsv", "tsv", {}, id="tsv"),
Contributor

So, you are testing for tsv... how does it work here? When tsv and csv data is read, there's no test that the header is there, right?

_stop_active_spark()


@pytest.mark.no_shared_spark
Contributor

I don't really know how these marks work in pytest.

Contributor Author

You can use the marks to filter the tests you want to run, or mark tests to behave differently. This is just a marker that indicates these tests are supposed to run without a shared Spark session; the actual implementation is written elsewhere (typically with the pytest request fixture, which lets you access the test and modify its behavior).

On another note, pytest -m 'no_shared_spark' will run only the tests with this mark.

Contributor

Thanks for explaining!
