feat: Session refactoring #1174
base: dev
Conversation
Pull request overview
This PR refactors gentropy.common.session.Session into a more feature-rich SparkSession wrapper (including file loading helpers, log4j defaults, dynamic allocation, and enums), updates multiple datasources/steps to use the new APIs, and reorganizes parts of the test suite (including “no_spark” tests).
Changes:
- Major rewrite of `Session` (new enums, config composition, URL loading, log4j assets, updated write-mode handling).
- eQTL Catalogue ingestion updates (metadata path parameterization, refactors in study index/finemapping readers, chromosome normalization).
- Test suite updates (new `no_spark` tests, fixture changes, Session/load_data tests, Hail init test adjustments).
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| tests/gentropy/no_spark/test_no_spark.py | Adds “no_spark” tests for Session creation/config and a web-dependent BGZIP→Parquet test. |
| tests/gentropy/datasource/gnomad/test_gnomad_ld.py | Switches Hail init to use session.spark.sparkContext instead of raw SparkSession. |
| tests/gentropy/datasource/finngen_meta/test_finngen_meta_summary_statistics.py | Removes a long BGZIP codec test and introduces a new (currently unused) Spark-active flag. |
| tests/gentropy/datasource/finngen/test_finngen_finemapping.py | Switches test to use Session fixture and updates Hail init + Spark argument passing. |
| tests/gentropy/datasource/eqtl_catalogue/test_eqtl_catalogue.py | Fixes _setup fixture return annotation (now None). |
| tests/gentropy/conftest.py | Refactors Spark fixture lifecycle and adjusts several fixtures for new datasource APIs/types. |
| tests/gentropy/common/test_session.py | Adds tests for Session.load_data and HTTP/HTTPS loading via monkeypatching urlopen. |
| src/utils/spark.py | Minor docstring formatting tweak. |
| src/gentropy/l2g.py | Updates feature matrix load to explicitly pass parquet format. |
| src/gentropy/eqtl_catalogue.py | Adds configurable metadata path, removes explicit Session passing to readers, and coalesces study index output. |
| src/gentropy/datasource/finngen_meta/summary_statistics.py | Updates enhanced-BGZIP gating, forces threadpool map evaluation, and reformats assertions. |
| src/gentropy/datasource/eqtl_catalogue/study_index.py | Removes pandas dependency, validates blacklist methods, loads metadata via Session.load_data, and introduces enums for mappings. |
| src/gentropy/datasource/eqtl_catalogue/finemapping.py | Moves to Session.find()-based reads, adds chromosome normalization, and uses NativeFileFormat. |
| src/gentropy/datasource/eqtl_catalogue/__init__.py | Introduces QuantificationMethod and StudyType StrEnums. |
| src/gentropy/dataset/dataset.py | Adds generic Dataset.read() powered by Session.load_data. |
| src/gentropy/dataset/colocalisation.py | Defers imports to function scope to avoid top-level dependencies. |
| src/gentropy/config.py | Extends SessionConfig and adds eqtl_catalogue_metadata_path to config. |
| src/gentropy/common/session.py | Large refactor: enums, config assembly, log4j integration, URL loading, runtime-conf updating, and new find(). |
| src/gentropy/assets/log4j.properties | Adds log4j properties asset for Spark driver logging configuration. |
| src/gentropy/__init__.py | Expands pyspark pandas-on-spark warning suppression. |
| pyproject.toml | Updates pytest xdist distribution, adds no_spark marker (but not webtest). |
| docs/python_api/common/session.md | Updates API docs export list (now includes SparkWriteMode). |
| Makefile | Splits test targets into no-spark/spark/web and combines coverage outputs. |
Comments suppressed due to low confidence (1)
src/gentropy/eqtl_catalogue.py:59
`read_credible_set_from_source`/`read_lbf_from_source` are invoked without the `session` parameter, so they rely on `Session.find()` internally. Since this step already has an explicit `session`, pass it through to avoid depending on global active Spark state and to ensure a consistent Spark config is used.
```python
credible_sets_df = EqtlCatalogueFinemapping.read_credible_set_from_source(
    credible_set_path=[
        f"{eqtl_catalogue_paths_imported}/{qtd_id}.credible_sets.tsv"
        for qtd_id in studies_to_ingest
    ],
)
lbf_df = EqtlCatalogueFinemapping.read_lbf_from_source(
    lbf_path=[
        f"{eqtl_catalogue_paths_imported}/{qtd_id}.lbf_variable.txt"
        for qtd_id in studies_to_ingest
    ],
)
```
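For reference, the suggested change would look roughly like this (a sketch only; it assumes the reader functions still accept an optional `session` argument):

```python
credible_sets_df = EqtlCatalogueFinemapping.read_credible_set_from_source(
    credible_set_path=[
        f"{eqtl_catalogue_paths_imported}/{qtd_id}.credible_sets.tsv"
        for qtd_id in studies_to_ingest
    ],
    # Pass the step's session explicitly instead of relying on Session.find().
    session=session,
)
```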
DSuveges left a comment:
It's a very elaborate update with a number of changes. I see the value in how handling the session has evolved and requires more and more sophisticated machinery. My only actionable comment is the use of `format='tsv'`, which fails in my local tests. It might also be an issue that, upon reading tsv/csv, the header is lost without `header=True`.
```python
PARQUET = "parquet"
CSV = "csv"
TSV = "tsv"
JSON = "json"
```
When we have JSON files, that usually means JSONL; isn't that parallelizable?
| ... "spark.executor.cores": "4", | ||
| ... "spark.executor.memory": "8g", | ||
| ... }, | ||
| ... ) # doctest: +SKIP |
Does this notation mean these examples are excluded from tests?
Yes, these are non-runnable examples.
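For context, a minimal illustration of the directive (a hypothetical snippet, not taken from the PR): the example inside the docstring is parsed by doctest but never executed.

```python
def connect(url: str) -> str:
    """Open a connection to a remote service.

    The example below is illustrative only, so it is excluded from doctest runs:

    >>> connect("https://example.org")  # doctest: +SKIP
    'connected'
    """
    return "connected"
```

Running the module through `python -m doctest` (or pytest's `--doctest-modules`) still collects the example but skips executing it.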
```python
return SparkWriteMode(
    self.conf.get(
        "spark.gentropy.writeMode", SparkWriteMode.ERROR_IF_EXISTS.value
    )
)
```
I'm not sure how this works... Isn't the type of `SparkWriteMode.ERROR_IF_EXISTS.value` a string, while the type of `self.conf["spark.gentropy.writeMode"]` is `SparkWriteMode`? See row 160:

`self._write_mode = write_mode or SparkWriteMode.ERROR_IF_EXISTS`

This is a bit confusing. The enum was there to ensure the type hints point to a defined set of options rather than an arbitrary string, which could get even more confusing. The `SparkSession` requires the `str` value, but we can get away with specifying the enum as the type and passing it as a string: the implicit `str(EnumType)` falls back to the actual `EnumType.value`. I have removed the enum from the type hints to make this clearer.
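A small standalone illustration of the `StrEnum` behavior described above (the member names and values here are assumptions mirroring the PR's write modes):

```python
from enum import StrEnum


class SparkWriteMode(StrEnum):
    """Assumed write-mode enum; values match Spark's DataFrameWriter modes."""

    ERROR_IF_EXISTS = "errorifexists"
    OVERWRITE = "overwrite"
    APPEND = "append"


# A StrEnum member *is* a str, so it can be stored directly in a string-based config...
conf_value: str = SparkWriteMode.ERROR_IF_EXISTS
assert conf_value == "errorifexists"

# ...and the raw string read back from the config can be promoted to the enum again.
assert SparkWriteMode("errorifexists") is SparkWriteMode.ERROR_IF_EXISTS
```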
```python
from enum import StrEnum


class QuantificationMethod(StrEnum):
```
This is a very generic comment, but I'm wondering whether the study types and the other enums are specific to eQTL Catalogue data ingestion. I assume the same enums can be reused in different parsers, e.g. UKB PPP maybe? Shouldn't this be placed somewhere shared?
I agree, but I did not want to move it out in this PR, as it breaks a lot of stuff.
| sep="\t", | ||
| header=True, | ||
| schema=cls.raw_credible_set_schema, | ||
| fmt=NativeFileFormat.TSV.value, |
See my other comments about `tsv`. My experience:

```python
In [3]: spark.read.load('/Users/dsuveges/project_data/releases/25.12/input/evidence/ot_crispr/config.tsv', format='tsv').show()
```

Fails. However, this works:

```python
spark.read.load('/Users/dsuveges/project_data/releases/25.12/input/evidence/ot_crispr/config.tsv', format='csv', sep='\t', header=True).show()
```
Also, besides specifying the field separator, it seems `header` defaults to `False`, so this needs to be adjusted in the `Session.load_data` method.
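A sketch of how the loader could normalize this (a hypothetical helper, not the PR's actual implementation): treat `tsv` as Spark's `csv` format with a tab separator and an explicit header option.

```python
from pyspark.sql import DataFrame, SparkSession


def load_table(spark: SparkSession, path: str, fmt: str, **options: str) -> DataFrame:
    """Load a table, mapping the 'tsv' pseudo-format onto Spark's csv reader.

    Spark has no native 'tsv' source, so format='tsv' fails; csv with a tab
    separator and header=True is the working equivalent.
    """
    if fmt == "tsv":
        fmt = "csv"
        options.setdefault("sep", "\t")
    if fmt == "csv":
        # Without header=True the first row is read as data and column names are lost.
        options.setdefault("header", "true")
    return spark.read.load(path, format=fmt, **options)
```

With such a mapping, `load_table(spark, 'config.tsv', 'tsv')` would behave like the working `format='csv', sep='\t', header=True` call shown above.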
| sep="\t", | ||
| header=True, | ||
| schema=cls.raw_lbf_schema, | ||
| fmt=NativeFileFormat.TSV.value, |
Same.
```python
eqtl_catalogue_paths_imported: str = MISSING
eqtl_catalogue_study_index_out: str = MISSING
eqtl_catalogue_credible_sets_out: str = MISSING
eqtl_catalogue_metadata_path: str = "https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/fe3c4b4ed911b3a184271a6aadcd8c8769a66aba/data_tables/dataset_metadata.tsv"
```
Seeing a hardcoded URL in the config looks very strange. Is it intentional?
No, I guess this should not be the default. I forgot about it, thanks!
```python
[
    pytest.param("test_path.parquet", "parquet", {}, id="parquet"),
    pytest.param("test_path.csv", "csv", {}, id="csv"),
    pytest.param("test_path.tsv", "tsv", {}, id="tsv"),
```
So, you are testing for tsv... how does it work here? When tsv and csv data is read, there is no check that the header is there, right?
```python
_stop_active_spark()


@pytest.mark.no_shared_spark
```
I don't really know how these marks work in pytest.
You can use marks to filter the tests you want to run, or to make tests behave differently. This is just a marker indicating that these tests are supposed to run without a shared Spark session; the actual implementation is written elsewhere (typically with the pytest `request` fixture, which allows you to access the test and modify its behavior).
On another note, `pytest -m no_shared_spark` will run only the tests with this mark.
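For illustration, a minimal `conftest.py` sketch of how such a marker can be wired up (hypothetical; the repository's actual fixtures differ):

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(autouse=True)
def _isolated_spark(request: pytest.FixtureRequest):
    """For tests marked 'no_shared_spark', stop any active SparkSession first."""
    if request.node.get_closest_marker("no_shared_spark"):
        active = SparkSession.getActiveSession()
        if active is not None:
            # Tear down the shared session so the test can build its own.
            active.stop()
    yield
```

Registering the marker (e.g. in `pyproject.toml`) and selecting with `pytest -m no_shared_spark` then runs only those tests.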
Thanks for explaining!
✨ Context
This PR is a set of unrelated changes (refactoring) that I collected after the eQTL Catalogue GTEx v10 credible set harmonization.
🛠 What does this PR implement
Running tests with a non-shared Spark session
The `test` target is now split into two targets, `test-no-shared-spark-session` and `test-shared-spark-session`. Both are dependencies of the `test` target, which later combines the coverage from both pytest runs.
- `test-no-shared-spark-session` - the first pass runs all tests that need to set up the `Session` with custom parameters (the fixture automatically cleans up the Spark session if one exists). These tests cannot run in parallel by design, as they cannot attach to the same global `SparkSession` object. The tests are referenced by the `no_shared_spark` pytest mark.
- `test-shared-spark-session` - all other tests, which can use the default session (all currently implemented ones); these tests can safely run in parallel.
Running tests with external jar dependencies to Spark
An additional test target, `test-no-shared-spark-session-web-dependencies`, is not run by default. This target includes tests marked with both the `download_jars_from_web` and `no_shared_spark` pytest marks. These tests require additional jars (like `enhanced_bgzip_codec`) that need to be pre-fetched from Maven along with their dependencies before the tests can run successfully; hence they are not run by default, as they require internet access.
Session object refactoring
The changes to the `Session` object are done mainly to make it easy to recreate the session object without needing to pass it to downstream functions. For example, running a downstream reader no longer requires passing the Session by reference; the function should be able to use the `Session.find()` constructor to recreate the Session object from an existing `SparkSession`, or throw an exception if no `SparkSession` is found.
Changes
- New `Session.find()` method that searches for a `SparkSession` and recreates the Session object with it. If no `SparkSession` is found, the method throws an exception.
- Session-specific attributes (like `write_mode` or `partition_number`) are moved directly to the `SparkConf` object, passed through `spark.gentropy.*` parameters, for example `spark.gentropy.write_mode`. This allows the `Session` to be just a lightweight wrapper that does not carry a lot of overhead when recreating it.
- The `Session` default constructor now passes all of the attributes to `SparkConf` during first startup.
- Hail-specific configuration is composed only if `hail_home` is provided.
External dependency composition
The refactored session allows setting up multiple jar dependencies by composing them in:
- `spark.jars`
- `spark.jars.packages`
- `spark.driver.extraClassPath`
- `spark.executor.extraClassPath`
This is required because adding a new dependency via `extended_spark_conf` can override the Hail jar configuration (`SparkConf.set` overrides rather than composing values into a separator-delimited string).
Default constructor behavior
Refactored session default constructor behavior:
- check whether a `SparkSession` exists:
  - 3.1 if it exists, use it and set up the `logger` and `conf` attributes
  - 3.2 compare the existing config to the expected config and warn if there is something unexpected
Startup logging
The startup logging for the `SparkSession` is now silenced via the configuration in `log4j.properties`; the default level of `INFO` is later set on the log4j logger (after the Session is created), which prevents the startup logs.
Self-contained `_setup_*_config` methods
The `Session._setup_*_config` methods should allow for `SparkConf` composition.
Data loading
`Session.load_data` is now refactored to allow loading native file formats (the `NativeFileFormat` enum). Also, if the file is `tsv` or `csv`, we allow reading via urllib as a fallback, since the `SparkFiles` solution never worked on Dataproc.
Others
This PR also includes the changes needed for eQTL Catalogue processing of the latest data in our buckets.
🙈 Missing
🚦 Before submitting
- Is this PR made against the `dev` branch?
- Do the tests pass locally (`make test`)?
- Do the pre-commit checks pass (`uv run pre-commit run --all-files`)?
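As a closing illustration of the `Session.find()` pattern described under Changes above (a simplified, hypothetical sketch; the PR's actual implementation also handles configuration, logging, and write modes):

```python
from pyspark.sql import SparkSession


class Session:
    """Lightweight wrapper around an existing SparkSession."""

    def __init__(self, spark: SparkSession) -> None:
        self.spark = spark
        self.conf = spark.conf

    @classmethod
    def find(cls) -> "Session":
        """Recreate the wrapper from the active SparkSession, or fail loudly."""
        spark = SparkSession.getActiveSession()
        if spark is None:
            raise RuntimeError("No active SparkSession found; create a Session first.")
        return cls(spark)
```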