PhenoXtract

PhenoXtract is a configurable ETL (Extract-Transform-Load) pipeline and crate written in Rust for converting tabular data sources (e.g. CSV or Excel) into Phenopackets v2.0. The config can be written in YAML, TOML, or JSON. For an explanation of how to write a config.yaml, see YAML_README.

How PhenoXtract works

PhenoXtract begins by extracting the data sources into a Polars DataFrame. In the config file, the user specifies which Phenopacket element each column of the data corresponds to. This is done by providing a SeriesContext for each column. See Contexts and series_contexts for more information on Series Contexts.

Once the data has been extracted, Strategies are applied, which transform the data into a format that the application can understand. See here for a list of all current strategies: Strategies. The user can decide which strategies should be applied in the config file.

After strategies have been applied, the Collection stage of the program begins. PhenoXtract creates Phenopackets for each patient in the data, and then goes through the data cell-by-cell and inserts the data into the correct Phenopacket.

Finally, the Phenopackets are loaded to .json files in a directory of the user's choice.
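
The Transform and Collection stages described above can be pictured with the following toy sketch. It is not PhenoXtract's API: Table, Strategy, Uppercase, apply_strategies and collect are hypothetical names invented for illustration, a plain map of vectors stands in for the Polars DataFrame, and lists of (column, value) pairs stand in for Phenopackets.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the extracted table: column name -> one cell per row.
type Table = BTreeMap<String, Vec<Option<String>>>;

/// Hypothetical strategy interface: a strategy rewrites the table in place.
trait Strategy {
    fn transform(&self, table: &mut Table);
}

/// Toy strategy that upper-cases every non-null cell of one column.
struct Uppercase {
    column: String,
}

impl Strategy for Uppercase {
    fn transform(&self, table: &mut Table) {
        if let Some(cells) = table.get_mut(&self.column) {
            for value in cells.iter_mut().flatten() {
                *value = value.to_uppercase();
            }
        }
    }
}

/// Transform stage: apply the configured strategies in order.
fn apply_strategies(table: &mut Table, strategies: &[Box<dyn Strategy>]) {
    for strategy in strategies {
        strategy.transform(table);
    }
}

/// Collection stage: walk the table cell by cell and file every non-null
/// value under the patient named in the `subject_id` column of the same row.
fn collect(table: &Table) -> BTreeMap<String, Vec<(String, String)>> {
    let mut records: BTreeMap<String, Vec<(String, String)>> = BTreeMap::new();
    let Some(ids) = table.get("subject_id") else {
        return records;
    };
    for (row, id) in ids.iter().enumerate() {
        let Some(id) = id else { continue };
        let entry = records.entry(id.clone()).or_default();
        for (column, cells) in table {
            if column == "subject_id" {
                continue;
            }
            if let Some(Some(value)) = cells.get(row) {
                entry.push((column.clone(), value.clone()));
            }
        }
    }
    records
}

fn main() {
    let mut table: Table = BTreeMap::new();
    table.insert("subject_id".into(), vec![Some("P1".into()), Some("P2".into())]);
    table.insert("subject_sex".into(), vec![Some("male".into()), None]);

    let strategies: Vec<Box<dyn Strategy>> = vec![Box::new(Uppercase {
        column: "subject_sex".into(),
    })];
    apply_strategies(&mut table, &strategies);

    for (patient, values) in collect(&table) {
        println!("{patient}: {values:?}");
    }
}
```

The real pipeline additionally uses the SeriesContext of each column to decide where in the Phenopacket a value belongs; the sketch only groups values by patient.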

What format does PhenoXtract expect data to be in?

Before the Collection stage begins, the data must be in a format that PhenoXtract can understand. The expected layout of each column is explained in the following sections:

Extracting Individual Data
Extracting Phenotypes
Extracting Diseases
Extracting Interpretations
Extracting Measurements
Extracting Medical Actions

Running PhenoXtract in Rust

Once a config file has been written (see YAML_README for information on how to write a config.yaml), PhenoXtract can be run as follows:

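use std::path::PathBuf;
use phenoxtract::phenoxtract::Phenoxtract;
// NOTE: PipelineError must also be brought into scope from the phenoxtract crate.

fn main() -> Result<(), PipelineError> {
    // Path to the config file.
    let config_path = PathBuf::from("path/to/config.yaml");
    // Build the pipeline from the config; unwrap() panics if this fails.
    let mut phenoxtract = Phenoxtract::try_from(config_path).unwrap();
    // Run the pipeline and propagate any error to the caller.
    phenoxtract.run()?;
    Ok(())
}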

Extracting Individual Data

(TODO)

Extracting Phenotypes

(TODO)

Extracting Diseases

(TODO)

Extracting Interpretations

(TODO)

Extracting Measurements

(TODO)

Extracting Medical Actions

(TODO)

Contexts

In order for PhenoXtract to understand what is inside a column, the user must specify a SeriesContext for that column. For each SeriesContext, the user can specify a header_context, which describes what is in the header of the column, and a data_context which describes what is in the cells of the column. How one configures a SeriesContext for a column (or multiple) is described in YAML_README.

Here is the list of possible values that header_context or data_context can take:

Individual data

subject_id
subject_sex
date_of_birth
vital_status
time_at_last_encounter: time_element_type
time_of_death: time_element_type
cause_of_death
survival_time_days

Phenotypes and Disease

hpo
disease
multi_hpo_id
onset: time_element_type

Genetics

hgvs
hgnc

Measurements

quantitative_measurement (assay_id: String, unit_ontology_id: String)
qualitative_measurement (assay_id: String)
time_of_measurement: time_element_type
reference_range: boundary

Medical Actions

treatment_target
treatment_intent
response_to_treatment
treatment_termination_reason
procedure
procedure_body_site
time_of_procedure: time_element_type
observation_status
None

In the above, time_element_type can currently be one of

date
age

and boundary can be one of

lower
upper
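
As a rough illustration of how parameterised contexts such as onset: time_element_type fit together, here is a minimal sketch covering only a handful of the values listed above. Every type and function name in it (Context, TimeElementType, Boundary, SeriesContext, parse_context) is hypothetical and not taken from the PhenoXtract crate, and it parses the compact name:parameter notation used in this README's prose; the actual config syntax is described in YAML_README.

```rust
/// Hypothetical model of the `time_element_type` parameter.
#[derive(Debug, PartialEq)]
enum TimeElementType {
    Date,
    Age,
}

/// Hypothetical model of the `boundary` parameter.
#[derive(Debug, PartialEq)]
enum Boundary {
    Lower,
    Upper,
}

/// Hypothetical model of a few of the context values listed above.
#[derive(Debug, PartialEq)]
enum Context {
    SubjectId,
    Hpo,
    Onset(TimeElementType),
    TimeOfDeath(TimeElementType),
    ReferenceRange(Boundary),
    NoContext, // stands for the value `None` in the list above
}

/// A column is described by one context for its header and one for its cells.
#[derive(Debug, PartialEq)]
struct SeriesContext {
    header_context: Context,
    data_context: Context,
}

fn parse_time(text: &str) -> Option<TimeElementType> {
    match text {
        "date" => Some(TimeElementType::Date),
        "age" => Some(TimeElementType::Age),
        _ => None,
    }
}

fn parse_boundary(text: &str) -> Option<Boundary> {
    match text {
        "lower" => Some(Boundary::Lower),
        "upper" => Some(Boundary::Upper),
        _ => None,
    }
}

/// Parses strings such as "subject_id" or "onset:age"; returns None for
/// anything this sketch does not recognise.
fn parse_context(text: &str) -> Option<Context> {
    let (name, param) = match text.split_once(':') {
        Some((name, param)) => (name.trim(), Some(param.trim())),
        None => (text.trim(), None),
    };
    match (name, param) {
        ("subject_id", None) => Some(Context::SubjectId),
        ("hpo", None) => Some(Context::Hpo),
        ("onset", Some(p)) => parse_time(p).map(Context::Onset),
        ("time_of_death", Some(p)) => parse_time(p).map(Context::TimeOfDeath),
        ("reference_range", Some(p)) => parse_boundary(p).map(Context::ReferenceRange),
        ("None", None) => Some(Context::NoContext),
        _ => None,
    }
}

fn main() {
    // A column with no header context whose cells hold ages of onset.
    let series = SeriesContext {
        header_context: Context::NoContext,
        data_context: parse_context("onset:age").expect("known context"),
    };
    println!("{series:?}");
}
```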

Strategies

Here is a list of the strategies currently supported by PhenoXtract:

age_to_iso8601

Given a column whose cells contain ages (e.g. with context subject_age, time_of_death:age or onset:age), this strategy converts integer entries to ISO8601 durations, e.g. 47 -> P47Y.

NOTE: the integers must be between 0 and 150.

If an entry is already in ISO8601 duration format, it will be left unchanged. If there are cell values which are neither ISO8601 durations nor integers, an error will be returned.
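
The rules above can be sketched for a single cell as follows. This is an illustration, not the crate's implementation: the function and error names are hypothetical, the 0 to 150 range is assumed to be inclusive, out-of-range integers are assumed to produce an error, and the duration check is simplified to date components only (Y, M, W, D).

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum AgeError {
    OutOfRange(i64),
    NotAnAge(String),
}

/// Simplified check: 'P' followed by one or more <digits><designator> groups,
/// with designators Y, M, W, D appearing at most once and in that order.
/// Time components (after 'T') are not handled by this sketch.
fn is_iso8601_date_duration(s: &str) -> bool {
    let Some(rest) = s.strip_prefix('P') else {
        return false;
    };
    if rest.is_empty() {
        return false;
    }
    let mut digits = 0;
    let mut last_rank = 0;
    for c in rest.chars() {
        if c.is_ascii_digit() {
            digits += 1;
            continue;
        }
        let rank = match c {
            'Y' => 1,
            'M' => 2,
            'W' => 3,
            'D' => 4,
            _ => return false,
        };
        // A designator needs digits before it and must respect the order.
        if digits == 0 || rank <= last_rank {
            return false;
        }
        digits = 0;
        last_rank = rank;
    }
    // Trailing digits without a designator are invalid.
    digits == 0
}

/// Converts one cell: durations pass through, integers in range become P<n>Y.
fn age_to_iso8601(cell: &str) -> Result<String, AgeError> {
    let cell = cell.trim();
    if is_iso8601_date_duration(cell) {
        return Ok(cell.to_string());
    }
    match cell.parse::<i64>() {
        Ok(years) if (0..=150).contains(&years) => Ok(format!("P{years}Y")),
        Ok(years) => Err(AgeError::OutOfRange(years)),
        Err(_) => Err(AgeError::NotAnAge(cell.to_string())),
    }
}

fn main() {
    for cell in ["47", "P47Y", "P3Y6M", "200", "forty-seven"] {
        println!("{cell:>12} -> {:?}", age_to_iso8601(cell));
    }
}
```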

alias_map

This strategy applies all the aliases found in the SeriesContexts. For example, if a table (called a ContextualisedDataframe in PhenoXtract) has a SeriesContext for a subject_sex column with a ToString AliasMap that converts "M" to "Male" and "F" to "Female", then the strategy applies those aliases to each cell of that column.

NOTE

This does not transform the headers of the table.
Only non-null cells may be aliased.
Non-null cells may be aliased to null.
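
The three notes above can be illustrated on a single column. This is a minimal sketch under stated assumptions: apply_aliases is a hypothetical helper, a HashMap stands in for the AliasMap, and Option<String> stands in for a nullable cell.

```rust
use std::collections::HashMap;

/// Applies aliases to the cells of one column (headers are not touched).
/// A target of `None` means "alias this value to null".
fn apply_aliases(cells: &mut [Option<String>], aliases: &HashMap<String, Option<String>>) {
    for cell in cells.iter_mut() {
        // Null cells are never looked up, so they always stay null.
        let Some(value) = cell.as_ref() else { continue };
        if let Some(alias) = aliases.get(value) {
            *cell = alias.clone();
        }
        // Values without an alias are left unchanged.
    }
}

fn main() {
    let mut aliases = HashMap::new();
    aliases.insert("M".to_string(), Some("Male".to_string()));
    aliases.insert("F".to_string(), Some("Female".to_string()));
    aliases.insert("unknown".to_string(), None);

    let mut subject_sex = vec![
        Some("M".to_string()),
        Some("F".to_string()),
        None,
        Some("unknown".to_string()),
    ];
    apply_aliases(&mut subject_sex, &aliases);
    println!("{subject_sex:?}");
}
```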

date_to_age

This strategy finds columns whose cells contain dates, and converts these dates to a certain age of the patient, by leveraging the patient's date of birth.

If there is no data on a certain patient's date of birth, yet there is a date corresponding to this patient, then an error will be returned.
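
The conversion and its error case can be sketched as follows. The sketch makes several assumptions that the text above does not confirm: the age is expressed as an ISO8601 duration of completed years and months, a date earlier than the date of birth is treated as an error, and the names Date, DateToAgeError and date_to_age are hypothetical.

```rust
/// Minimal calendar date; field order makes the derived ordering chronological.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Date {
    year: i32,
    month: u32,
    day: u32,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum DateToAgeError {
    MissingDateOfBirth,
    DateBeforeBirth,
}

/// Converts one date cell to the patient's age at that date.
/// A null date stays null; a date without a date of birth is an error.
fn date_to_age(
    date_of_birth: Option<Date>,
    date: Option<Date>,
) -> Result<Option<String>, DateToAgeError> {
    let Some(date) = date else {
        return Ok(None);
    };
    let birth = date_of_birth.ok_or(DateToAgeError::MissingDateOfBirth)?;
    if date < birth {
        return Err(DateToAgeError::DateBeforeBirth);
    }
    // Count completed months, then split them into years and months.
    let mut months =
        (date.year - birth.year) * 12 + date.month as i32 - birth.month as i32;
    if date.day < birth.day {
        months -= 1; // the current month is not yet complete
    }
    Ok(Some(format!("P{}Y{}M", months / 12, months % 12)))
}

fn main() {
    let birth = Date { year: 1980, month: 6, day: 15 };
    let onset = Date { year: 2020, month: 3, day: 10 };
    println!("{:?}", date_to_age(Some(birth), Some(onset)));
    println!("{:?}", date_to_age(None, Some(onset)));
}
```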

mapping

A strategy for mapping string values to standardized terms using a synonym dictionary.

MappingStrategy transforms data by replacing cell values with their corresponding mapped values from a synonym map. It's commonly used for data normalization tasks such as standardizing gender/sex values, categorical data, or controlled vocabulary.

multi_hpo_col_expansion

A strategy for converting columns whose cells contain HPO IDs into several columns whose headers are exactly those HPO IDs and whose cells contain the observation_status for each patient.

The columns are created on a "block by block" basis so that building blocks are preserved after the transformation. A new SeriesContext will be added for each block of new columns. The old columns and contexts will be removed.
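
The core of the expansion, for a single column, might look like the sketch below. It deliberately ignores building blocks and SeriesContexts, and it rests on assumptions not stated above: IDs within a cell are separated by a delimiter chosen by the caller, an observed term is recorded as Some(true), and a term not mentioned in a row stays null. The function name expand_multi_hpo is hypothetical.

```rust
use std::collections::BTreeMap;

/// Expands one multi-HPO column into one column per HPO ID.
/// Each new column has the same number of rows as the input.
fn expand_multi_hpo(
    cells: &[Option<String>],
    delimiter: char,
) -> BTreeMap<String, Vec<Option<bool>>> {
    let mut columns: BTreeMap<String, Vec<Option<bool>>> = BTreeMap::new();
    for (row, cell) in cells.iter().enumerate() {
        let Some(text) = cell else { continue };
        for id in text.split(delimiter).map(str::trim).filter(|id| !id.is_empty()) {
            // Create the column on first sight of this ID, filled with nulls.
            let column = columns
                .entry(id.to_string())
                .or_insert_with(|| vec![None; cells.len()]);
            column[row] = Some(true);
        }
    }
    columns
}

fn main() {
    let cells = vec![
        Some("HP:0001250, HP:0000252".to_string()),
        None,
        Some("HP:0001250".to_string()),
    ];
    for (header, status) in expand_multi_hpo(&cells, ',') {
        println!("{header}: {status:?}");
    }
}
```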

ontology_normaliser

A strategy that converts ontology labels in cells (or synonyms of them) to the corresponding IDs. It is case-insensitive.

This strategy processes string columns in data tables by looking up values in an ontology bidirectional dictionary and replacing labels with their corresponding IDs. It only operates on columns that have no header context and match the specified data context.
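
A case-insensitive label-to-ID lookup of the kind described above can be sketched like this. The LabelIndex type and its methods are hypothetical, the dictionary entries are a toy example rather than real ontology data loaded by PhenoXtract, and returning None for unknown labels is an assumption; the text above does not say how unmatched cells are handled.

```rust
use std::collections::HashMap;

/// Toy stand-in for the ontology dictionary: lower-cased label or synonym -> ID.
struct LabelIndex {
    by_label: HashMap<String, String>,
}

impl LabelIndex {
    /// Builds the index from (label or synonym, ID) pairs.
    fn new(entries: &[(&str, &str)]) -> Self {
        let by_label = entries
            .iter()
            .map(|(label, id)| (label.to_lowercase(), id.to_string()))
            .collect();
        Self { by_label }
    }

    /// Looks up a cell value, ignoring case and surrounding whitespace.
    fn normalise(&self, cell: &str) -> Option<&str> {
        self.by_label
            .get(&cell.trim().to_lowercase())
            .map(String::as_str)
    }
}

fn main() {
    // Illustrative entries only.
    let index = LabelIndex::new(&[
        ("Seizure", "HP:0001250"),
        ("Microcephaly", "HP:0000252"),
    ]);
    for cell in ["seizure", "MICROCEPHALY", "not a term"] {
        println!("{cell} -> {:?}", index.normalise(cell));
    }
}
```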

hpo_disease_splitter_strategy

This strategy will find every column whose context is hpo_or_disease and split it into two separate columns: an hpo column and a disease column.

HPO is prioritised: the strategy will find all HPO labels and IDs, and then put them into the HPO column. All other cells will be assumed to refer to disease.
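
The prioritisation rule can be sketched for one column as follows. The function name is hypothetical, the set of known HPO labels and IDs is passed in by the caller, and matching is exact; whether PhenoXtract matches case-insensitively here is not stated above.

```rust
use std::collections::HashSet;

/// Splits one hpo_or_disease column into (hpo column, disease column).
/// Cells found in `hpo_terms` go to the HPO column; every other non-null
/// cell is assumed to be a disease. Null cells stay null in both columns.
fn split_hpo_or_disease(
    cells: &[Option<String>],
    hpo_terms: &HashSet<String>,
) -> (Vec<Option<String>>, Vec<Option<String>>) {
    let mut hpo = Vec::with_capacity(cells.len());
    let mut disease = Vec::with_capacity(cells.len());
    for cell in cells {
        match cell {
            Some(value) if hpo_terms.contains(value) => {
                hpo.push(Some(value.clone()));
                disease.push(None);
            }
            Some(value) => {
                hpo.push(None);
                disease.push(Some(value.clone()));
            }
            None => {
                hpo.push(None);
                disease.push(None);
            }
        }
    }
    (hpo, disease)
}

fn main() {
    // Toy set of known HPO labels and IDs.
    let hpo_terms: HashSet<String> =
        ["Seizure", "HP:0001250"].iter().map(|s| s.to_string()).collect();
    let cells = vec![
        Some("Seizure".to_string()),
        Some("OMIM:154700".to_string()),
        None,
    ];
    let (hpo, disease) = split_hpo_or_disease(&cells, &hpo_terms);
    println!("hpo:     {hpo:?}");
    println!("disease: {disease:?}");
}
```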

Authors

Rouven Reuter
Patrick Simon Nairne
Adam Graefe
Varenya Jain
Peter Robinson
