PhenoXtract is a configurable ETL (Extract-Transform-Load) pipeline and crate written in Rust for converting
tabular data sources (e.g. CSV or Excel)
into Phenopackets v2.0. The config can be written in YAML,
TOML, or JSON formats. For an explanation of how to write a config.yaml, see
here: YAML_README.
PhenoXtract begins by extracting the data sources into a Polars DataFrame. In
the config file, the user specifies which Phenopacket element each column of the data corresponds to. This is
done by providing a SeriesContext for each column. See Contexts
and series_contexts for more information on SeriesContexts.
Once the data has been extracted, Strategies are applied, which transform the data into a format that the application
can understand. See here for a list of all current strategies: Strategies. The user can decide
in the config file which strategies should be applied.
After strategies have been applied, the Collection stage of the program begins. PhenoXtract creates Phenopackets for
each patient in the data, and then goes through the data cell-by-cell and inserts the data into the correct Phenopacket.
Finally, the Phenopackets are loaded to .json files in a directory of the user's choice.
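The Collection stage can be pictured with the minimal sketch below. It is not PhenoXtract's implementation: `PatientRecord` and `collect_by_patient` are hypothetical names, a plain struct stands in for a real Phenopacket, and a single phenotype column is assumed.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a Phenopacket; only phenotype IDs are stored.
#[derive(Debug, Default, PartialEq)]
struct PatientRecord {
    phenotypes: Vec<String>,
}

/// Walks the rows cell by cell and files each non-null cell under its patient.
/// Each row is (subject_id, cell); a `None` cell is an empty cell.
fn collect_by_patient(rows: &[(&str, Option<&str>)]) -> BTreeMap<String, PatientRecord> {
    let mut records = BTreeMap::new();
    for (subject_id, cell) in rows {
        // Create the record the first time a patient is seen.
        let record: &mut PatientRecord = records.entry(subject_id.to_string()).or_default();
        if let Some(value) = cell {
            record.phenotypes.push(value.to_string());
        }
    }
    records
}
```

Calling `collect_by_patient` on rows for two patients yields one record per patient, including patients whose cells are all empty.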
Before the Collection stage of the program, the data must be in a certain format so that it can be understood by PhenoXtract. How each column should look will be explained in the sections:
- Extracting Individual Data
- Extracting Phenotypes
- Extracting Diseases
- Extracting Interpretations
- Extracting Measurements
- Extracting Medical Actions
Once a config file has been written (see YAML_README for information on how to write a config.yaml), PhenoXtract can be run as follows:
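    use std::path::PathBuf;
    use phenoxtract::phenoxtract::Phenoxtract;
    // PipelineError must also be in scope; import it from wherever the crate exposes it.

    fn main() -> Result<(), PipelineError> {
        let config_path = PathBuf::from("path/to/config.yaml");
        // Build the pipeline from the config file; this panics if the config cannot be loaded.
        let mut phenoxtract = Phenoxtract::try_from(config_path).unwrap();
        // Run extraction, strategies, collection and loading.
        phenoxtract.run()?;
        Ok(())
    }
(TODO)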
(TODO)
(TODO)
(TODO)
(TODO)
(TODO)
In order for PhenoXtract to understand what is inside a column, the user must specify a SeriesContext for that
column. For each SeriesContext, the user can specify a header_context, which describes what is in the header of the
column, and a data_context, which describes what is in the cells of the column. How to configure a SeriesContext
for a column (or several columns) is described in YAML_README.
Here is the list of possible values that header_context or data_context can take:
Individual data
- subject_id
- subject_sex
- date_of_birth
- vital_status
- time_at_last_encounter: time_element_type
- time_of_death: time_element_type
- cause_of_death
- survival_time_days
Phenotypes and Disease
- hpo
- disease
- multi_hpo_id
- onset: time_element_type
Genetics
- hgvs
- hgnc
Measurements
- quantitative_measurement (assay_id: String, unit_ontology_id: String)
- qualitative_measurement (assay_id: String)
- time_of_measurement: time_element_type
- reference_range: boundary
Medical Actions
- treatment_target
- treatment_intent
- response_to_treatment
- treatment_termination_reason
- procedure
- procedure_body_site
- time_of_procedure: time_element_type
- observation_status
- None
In the above, time_element_type can currently be one of
- date
- age
and boundary can be one of
- lower
- upper
Here is a list of the strategies currently supported by PhenoXtract:
Given a column whose cells contain ages (e.g. with context subject_age, time_of_death:age or onset:age), this
strategy converts integer entries to ISO8601 durations: 47 -> P47Y
NOTE: the integers must be between 0 and 150.
If an entry is already in ISO8601 duration format, it will be left unchanged. If there are cell values which are neither ISO8601 durations nor integers, an error will be returned.
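The rule above can be sketched for a single cell as follows. This is only an illustration: `age_to_iso8601` and `is_simple_iso8601_duration` are hypothetical names, the bounds 0 and 150 are assumed to be inclusive, and the duration check accepts only a simplified subset of ISO8601 (P, then optional year, month and day components in that order).

```rust
/// Hypothetical check for a simplified subset of ISO8601 durations:
/// "P" followed by at least one of nY, nM, nD, in that order.
fn is_simple_iso8601_duration(s: &str) -> bool {
    let rest = match s.strip_prefix('P') {
        Some(rest) => rest,
        None => return false,
    };
    let mut remaining = "YMD".chars(); // designators still allowed, in order
    let mut digits = 0;
    let mut components = 0;
    for c in rest.chars() {
        if c.is_ascii_digit() {
            digits += 1;
            continue;
        }
        // A designator needs preceding digits and must respect the Y, M, D order.
        if digits == 0 || !remaining.any(|d| d == c) {
            return false;
        }
        digits = 0;
        components += 1;
    }
    digits == 0 && components > 0
}

/// Converts one cell: integers in 0..=150 become "P{n}Y", durations pass
/// through unchanged, anything else is an error.
fn age_to_iso8601(cell: &str) -> Result<String, String> {
    if is_simple_iso8601_duration(cell) {
        return Ok(cell.to_string());
    }
    match cell.parse::<i64>() {
        Ok(years) if (0..=150).contains(&years) => Ok(format!("P{}Y", years)),
        Ok(years) => Err(format!("age {} is outside the range 0 to 150", years)),
        Err(_) => Err(format!("'{}' is neither an integer nor an ISO8601 duration", cell)),
    }
}
```

With this sketch, `age_to_iso8601("47")` gives `Ok("P47Y")`, `"P47Y"` is returned unchanged, and `"151"` or `"abc"` produce an error.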
This strategy will apply all the aliases found in the SeriesContexts.
For example, if a table (in PhenoXtract this is called a ContextualisedDataframe) has a SeriesContext consisting of a
subject_sex column and a ToString AliasMap which converts "M" to "Male" and "F" to "Female", then the strategy will
apply those aliases to each cell.
NOTE
- This does not transform the headers of the table.
- Only non-null cells may be aliased.
- Non-null cells may be aliased to null.
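The aliasing rules in the notes can be illustrated with the sketch below. It does not use PhenoXtract's AliasMap type: a plain `HashMap` stands in for it, `apply_aliases` is a hypothetical name, and cells without a matching alias are assumed to be left unchanged.

```rust
use std::collections::HashMap;

/// Applies aliases to a column of nullable cells.
/// - Null cells (`None`) are never aliased.
/// - A non-null cell may be aliased to another value or to null (`None`).
/// - Cells without an alias are kept as they are (an assumption of this sketch).
fn apply_aliases(
    column: &[Option<String>],
    aliases: &HashMap<String, Option<String>>,
) -> Vec<Option<String>> {
    column
        .iter()
        .map(|cell| match cell {
            None => None,
            Some(value) => match aliases.get(value) {
                Some(alias) => alias.clone(),
                None => Some(value.clone()),
            },
        })
        .collect()
}
```

For a subject_sex column, an alias map with "M" to "Male" and "F" to "Female" rewrites those cells and leaves null cells untouched.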
This strategy finds columns whose cells contain dates, and converts these dates to a certain age of the patient, by leveraging the patient's date of birth.
If there is no data on a certain patient's date of birth, yet there is a date corresponding to this patient, then an error will be thrown.
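A minimal sketch of the date-to-age idea follows. The output format is an assumption: this sketch reports completed years and months as an ISO8601 duration, which may differ from what PhenoXtract emits. `date_to_age` is a hypothetical name, dates are plain (year, month, day) tuples, and calendar validity is not checked.

```rust
/// Age at `date` for a patient born on `dob`, in completed years and months.
/// Dates are (year, month, day) tuples. A missing date of birth is an error.
fn date_to_age(
    dob: Option<(i32, u32, u32)>,
    date: (i32, u32, u32),
) -> Result<String, String> {
    let (birth_year, birth_month, birth_day) =
        dob.ok_or_else(|| "no date of birth available for this patient".to_string())?;
    let (year, month, day) = date;
    // Tuples compare lexicographically, so this is a chronological comparison.
    if (year, month, day) < (birth_year, birth_month, birth_day) {
        return Err("date lies before the date of birth".to_string());
    }
    // Count whole months, dropping the current month if its day has not been reached.
    let mut total_months = (year - birth_year) * 12 + (month as i32 - birth_month as i32);
    if day < birth_day {
        total_months -= 1;
    }
    Ok(format!("P{}Y{}M", total_months / 12, total_months % 12))
}
```

For example, a patient born on 1980-05-20 is 39 years and 9 months old on 2020-03-10, so the sketch returns `P39Y9M`.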
A strategy for mapping string values to standardized terms using a synonym dictionary.
MappingStrategy transforms data by replacing cell values with their corresponding mapped values from a synonym map.
It's commonly used for data normalization tasks such as standardizing gender/sex values, categorical data, or controlled
vocabulary.
A strategy for converting columns whose cells contain HPO IDs into several columns whose headers are exactly those HPO
IDs and whose cells contain the observation_status for each patient.
The columns are created on a "block by block" basis so that building blocks are preserved after the transformation. A new SeriesContext will be added for each block of new columns. The old columns and contexts will be removed.
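The expansion of one multi-HPO column can be sketched as below. Several details are assumptions made for illustration: IDs within a cell are taken to be comma-separated, a listed ID is recorded as `Some(true)` and an unlisted one as null, and the block-by-block handling and SeriesContext updates are not modelled. `expand_multi_hpo` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Expands one column of comma-separated HPO IDs (one cell per patient) into
/// one column per distinct ID. Returns the new headers (sorted) and, for each
/// patient, one observation status per header.
fn expand_multi_hpo(cells: &[Option<&str>]) -> (Vec<String>, Vec<Vec<Option<bool>>>) {
    // Parse every cell into the set of IDs it mentions; null cells give an empty set.
    let parsed: Vec<BTreeSet<String>> = cells
        .iter()
        .map(|cell| {
            cell.unwrap_or("")
                .split(',')
                .map(str::trim)
                .filter(|id| !id.is_empty())
                .map(String::from)
                .collect()
        })
        .collect();
    // The new headers are exactly the distinct IDs seen in the column.
    let headers: Vec<String> = parsed
        .iter()
        .flatten()
        .cloned()
        .collect::<BTreeSet<String>>()
        .into_iter()
        .collect();
    // One row per patient, one cell per header.
    let rows = parsed
        .iter()
        .map(|ids| {
            headers
                .iter()
                .map(|header| if ids.contains(header) { Some(true) } else { None })
                .collect()
        })
        .collect();
    (headers, rows)
}
```

A column with cells "HP:0001250, HP:0000252", null and "HP:0000252" becomes two columns headed HP:0000252 and HP:0001250, with one row of statuses per patient.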
A strategy that converts ontology labels in cells (or synonyms of them) to the corresponding IDs. It is case-insensitive.
This strategy processes string columns in data tables by looking up values in an ontology bidirectional dictionary and replacing labels with their corresponding IDs. It only operates on columns that have no header context and match the specified data context.
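The case-insensitive lookup can be illustrated with the sketch below. It uses a plain one-directional `HashMap` rather than PhenoXtract's bidirectional dictionary, `build_lookup` and `labels_to_ids` are hypothetical names, and cells that match no label are assumed to be left unchanged.

```rust
use std::collections::HashMap;

/// Builds a lookup from lowercased label (or synonym) to ontology ID.
fn build_lookup(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(label, id)| (label.to_lowercase(), id.to_string()))
        .collect()
}

/// Replaces each label in the column by its ID, ignoring case.
/// Null cells and cells without a match are kept as they are.
fn labels_to_ids(
    column: &[Option<String>],
    lookup: &HashMap<String, String>,
) -> Vec<Option<String>> {
    column
        .iter()
        .map(|cell| {
            cell.as_ref().map(|value| {
                lookup
                    .get(&value.to_lowercase())
                    .cloned()
                    .unwrap_or_else(|| value.clone())
            })
        })
        .collect()
}
```

With a lookup containing "Seizure" and "Microcephaly", the cells "seizure" and "MICROCEPHALY" are both replaced by their HPO IDs.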
This strategy will find every column whose context is hpo_or_disease and split it into two separate columns: an hpo
column and a disease column.
HPO is prioritised: the strategy will find all HPO labels and IDs, and then put them into the HPO column. All other cells will be assumed to refer to disease.
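The HPO-first split can be sketched as follows. The function name `split_hpo_or_disease` is hypothetical, and the decision of what counts as an HPO label or ID is passed in as a predicate, since the real strategy would consult the HPO ontology.

```rust
/// Splits one hpo_or_disease column into (hpo column, disease column).
/// `is_hpo` decides whether a cell is an HPO label or ID; every other
/// non-null cell is assumed to refer to a disease.
fn split_hpo_or_disease(
    column: &[Option<String>],
    is_hpo: impl Fn(&str) -> bool,
) -> (Vec<Option<String>>, Vec<Option<String>>) {
    column
        .iter()
        .map(|cell| match cell {
            // HPO takes priority: anything recognised as HPO goes to the HPO column.
            Some(value) if is_hpo(value.as_str()) => (Some(value.clone()), None),
            Some(value) => (None, Some(value.clone())),
            None => (None, None),
        })
        .unzip()
}
```

Both output columns have the same length as the input, so each row keeps its position and at most one of the two cells in a row is non-null.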
- Rouven Reuter
- Patrick Simon Nairne
- Adam Graefe
- Varenya Jain
- Peter Robinson