This repository was archived by the owner on Sep 16, 2024. It is now read-only.

Introduction

This script allows the curation of metadata by transforming incorrect metadata values to correct ones based on controlled vocabularies (for example, “NCBI Taxonomy”) and manually prepared files with rules for matching. Synonyms, if provided, can assist in the matching process.

Requirements

Curation script usage

To explore the full list of supported arguments use the following command:

odm-curate-study -h

Curation is carried out by running the Python metadata curation script odm-curate-study, which performs automated curation of metadata associated with experiments and assays on Genestack. The script takes as input the accession(s) of one or more Genestack studies containing the samples to be curated. You must also supply a rules file, which specifies the mapping rules.

odm-curate-study --rules <rules.json> <study accession(s)> -H GENESTACK_ENDPOINT_ADDR

Where GENESTACK_ENDPOINT_ADDR is the URL of the Genestack platform.

Space is used as a separator in case of multiple studies. Example:

odm-curate-study --rules rules.json GSF000100 GSF000200 -H GENESTACK_ENDPOINT_ADDR

You can test a curation rule before applying it with the --dry-run parameter. This will connect to the server and report matches in the task output log (see later) but not actually change any data.

odm-curate-study --rules <rules.json> --dry-run <study accession> -H GENESTACK_ENDPOINT_ADDR

By default, if the target key already contains a value, it is not overwritten and a warning is written to the logs. You can force overwriting with the --overwrite flag. This does not affect attributes set as 'read-only' in the template.

odm-curate-study --rules <rules.json> --overwrite <study accession> -H GENESTACK_ENDPOINT_ADDR

Example usage for acting on study GSF123456:

odm-curate-study --rules rules.json --overwrite GSF123456 -H GENESTACK_ENDPOINT_ADDR
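
When the script is driven from automation, the command line can be assembled programmatically. The following is a minimal sketch that only builds the argument list from the options shown above (--rules, --dry-run, --overwrite, -H). The helper name build_curation_command is hypothetical, and whether --dry-run and --overwrite may be combined is not stated in this document.

```python
import shlex


def build_curation_command(rules_path, accessions, host, dry_run=False, overwrite=False):
    """Assemble an odm-curate-study argument list (hypothetical helper)."""
    if not accessions:
        raise ValueError("at least one study accession is required")
    command = ["odm-curate-study", "--rules", rules_path]
    if dry_run:
        command.append("--dry-run")    # report matches without changing data
    if overwrite:
        command.append("--overwrite")  # replace values already present in the target key
    command.extend(accessions)         # multiple studies are separated by spaces
    command.extend(["-H", host])
    return command


command = build_curation_command(
    "rules.json", ["GSF000100", "GSF000200"], "GENESTACK_ENDPOINT_ADDR", dry_run=True
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run once GENESTACK_ENDPOINT_ADDR is replaced with the real platform URL.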

Metadata mapping rules

The curation script reads its mapping rules from a JSON file (for example, rules.json) passed with the --rules <rule_file> argument. The file must contain an array of objects, where each object describes a metadata mapper. The script searches for raw, uncurated metadata values in a list of input keys (also called "raw keys"), tries to map them to a "curated" value (using a synonym-aware dictionary or custom rules), and stores the curated value in a target key. Values are case-sensitive. The valid attributes for the metainfo mapper are as follows:

  • dictionary (string, optional) - the name of a public Genestack dictionary used as a source of valid terms. If a mapped value is not found in the supplied dictionary, a warning is written to the logs;
  • genestack_key (string, mandatory) - target metainfo key (e.g. ‘Sex’);
  • object_type (string, mandatory) - the type of data object being targeted (e.g. 'study' or 'sample'); must be lowercase;
  • raw_keys (list of strings, mandatory) - comma-separated list of names of the raw (i.e. as imported) metadata keys in which raw values will be looked up (e.g. ‘sourceData:ae.sample.Characteristics [Sex]’);
  • rules (object of strings, optional) - rules to map raw values to terms from dictionaries/ontologies.

For example, below is a JSON rules file that defines a mapper for the "Sex" metainfo attribute. It copies data from a raw column (sourceData:ae.sample.Characteristics [Sex]) to the correct one (Sex):

[
    {
        "genestack_key": "Sex",
        "object_type": "sample",
        "raw_keys": [
            "sourceData:ae.sample.Characteristics [Sex]"
        ]
    }
]

To do the same, but also map data values to specific terms ("m" to "male", "f" to "female", "?" to "unknown"), use the following:

[
    {
        "genestack_key": "Sex",
        "object_type": "sample",
        "raw_keys": [
            "sourceData:ae.sample.Characteristics [Sex]"
        ],
        "rules": {
            "m": "male",
            "f": "female",
            "?": "unknown"
        }
    }
]

The end result is as per the below tables.

Before:

Sex
m
f
?

After:

Sex
male
female
unknown
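
The effect of the "rules" object on individual values can be illustrated with a plain dictionary lookup. This is a sketch of the behaviour described above, not the script's actual implementation. The function name apply_rules is hypothetical, and leaving unmatched values unchanged is an assumption.

```python
def apply_rules(raw_value, rules):
    """Return the mapped value, or the raw value unchanged when no rule matches.

    Matching is case-sensitive, as stated for the rules file.
    Passing unmatched values through unchanged is an assumption.
    """
    return rules.get(raw_value, raw_value)


rules = {"m": "male", "f": "female", "?": "unknown"}
curated = [apply_rules(value, rules) for value in ["m", "f", "?", "M"]]
print(curated)
```

Because values are case-sensitive, "M" does not match the "m" rule in this sketch and is passed through as-is.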

Finally, a dictionary can be specified in the mapper. If the value (after any rules have been applied) matches one of a term's synonyms, it is replaced with the preferred term.

[
    {
        "genestack_key": "Sex",
        "object_type": "sample",
        "raw_keys": [
            "sourceData:ae.sample.Characteristics [Sex]"
        ],
        "dictionary": "Sex"
    }
]
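
Synonym-aware matching can be sketched as a lookup from synonyms to a preferred term. The dictionary layout and its contents below are invented for illustration, and resolve_with_dictionary is a hypothetical name. The real Genestack dictionaries and their lookup logic are not described in this document.

```python
def resolve_with_dictionary(value, dictionary):
    """Return (term, found) for a value checked against a synonym dictionary.

    `dictionary` maps each preferred term to a list of its synonyms
    (an assumed layout, used only for this illustration).
    """
    for preferred, synonyms in dictionary.items():
        if value == preferred or value in synonyms:
            return preferred, True
    return value, False


# Invented dictionary content, for illustration only.
sex_dictionary = {"male": ["man", "boy"], "female": ["woman", "girl"]}

for raw in ["man", "female", "n/a"]:
    term, found = resolve_with_dictionary(raw, sex_dictionary)
    if not found:
        # The document states that a warning is written to the logs in this case.
        print(f"warning: '{raw}' not found in dictionary")
    print(raw, "->", term)
```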

Key_with_unit Mapper

If samples store a value together with its unit in a single attribute, e.g. “Time”=“7 days”, the values can be curated so that they are displayed in two separate attributes in ODM: “Time/value”=“7”, “Time/unit”=“days”.

To split the value from its unit, the script uses whitespace as the delimiter. If the raw value contains a space, the part before the first space goes to the Value attribute and everything after the first space goes to the Unit attribute (even if it contains further spaces). Genestack keys must be specified for both attributes, as a comma-separated list. If there are no spaces, the whole raw value goes to the Value attribute and the Unit attribute is left empty; such cases have to be curated manually. Example of using the mapper with units:

[
    {
        "object_type": "sample",
        "genestack_key": [
            "Treatment/dose/value",
            "Treatment/dose/unit"
        ],
        "raw_keys": [
            "Value[Dose]"
        ],
        "dictionary": "Units - Dose/Mass/Volume"
    }
]
  • Sample A:

    • Before: "Parameter Value[Dose]"=7 ug/ml
    • After: "Treatment/dose/value"=7, "Treatment/dose/unit"=microgram per millilitre
  • Sample B:

    • Before: "Parameter Value[Dose]"=7 ug per ml
    • After: "Treatment/dose/value"=7, "Treatment/dose/unit"=ug per ml
  • Sample C:

    • Before: "Parameter Value[Dose]"=7ug/ml
    • After: "Treatment/dose/value"=7ug/ml, "Treatment/dose/unit"=null

Before:

| Sample Name | Dose |
| --- | --- |
| Sample A | 7 ug/ml |
| Sample B | 7 ug per ml |
| Sample C | 7ug/ml |

After:

| Sample Name | Dose | Dose Unit |
| --- | --- | --- |
| Sample A | 7 | ug/ml |
| Sample B | 7 | ug per ml |
| Sample C | 7ug/ml | |
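
The splitting rule described for this mapper (split at the first run of whitespace, leave the unit empty when there is none) can be sketched as follows. The function name split_value_and_unit is hypothetical and this is not the script's actual code. The optional unit_synonyms argument stands in for the dictionary lookup and contains only the one mapping shown for Sample A.

```python
def split_value_and_unit(raw, unit_synonyms=None):
    """Split 'value unit' at the first run of whitespace.

    Returns (value, unit). The unit is None when the raw string has no whitespace.
    `unit_synonyms` is an assumed stand-in for a units dictionary.
    """
    parts = raw.strip().split(None, 1)  # at most one split, on any whitespace
    if not parts:
        return None, None
    value = parts[0]
    unit = parts[1] if len(parts) == 2 else None
    if unit is not None and unit_synonyms:
        unit = unit_synonyms.get(unit, unit)  # unknown units are kept as-is
    return value, unit


unit_synonyms = {"ug/ml": "microgram per millilitre"}
for raw in ["7 ug/ml", "7 ug per ml", "7ug/ml"]:
    print(repr(raw), "->", split_value_and_unit(raw, unit_synonyms))
```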

Supported case: the “key_with_unit” mapper supports curation of multiple values with units for a single sample. For example, Sample X has Attribute_1=A|B and Attribute_2=X|Y, where A=Paracetamol, X=5 mg, B=Analgin and Y=0.5 g.

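Before:

| Sample Name | Medicine | Dose |
| --- | --- | --- |
| Sample | Paracetamol | 5 mg |
| | Analgin | 0.5 g |

After:

| Sample Name | Medicine | Dose | Dose Unit |
| --- | --- | --- | --- |
| Sample | Paracetamol | 5 | mg |
| | Analgin | 0.5 | g |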

Unsupported case: changing a single value into multiple values is not supported by this mapper. For example, Sample X has Attribute_1=A and Attribute_2=X|Y, where A=Paracetamol, X=2.5 g and Y=2.5 g.

Before:

| Sample Name | Medicine | Dose |
| --- | --- | --- |
| Sample | Paracetamol | 5mg |

After:

| Sample Name | Medicine | Dose | Dose Unit |
| --- | --- | --- | --- |
| Sample | Paracetamol | 2.5 | g |
| | | 2.5 | g |

Reassigning attributes

When it comes to matching attribute names, the script works similarly to the “Reassign” feature of the Metadata Editor application. That means that if a raw_key from the rules file is detected among the attributes, the values are reassigned to the corresponding genestack_key, and the original attribute raw_key is deleted.

Rule:

[
    {
        "genestack_key": "Disease",
        "object_type": "sample",
        "raw_keys": [
            "Illness",
            "DISEASE"
        ]
    }
]

Case 1

raw_key is a non-template attribute, genestack_key is a template attribute. After curation values are reassigned to genestack_key, and the original attribute raw_key is deleted.

Before:

| | Disease | Illness |
| --- | --- | --- |
| Sample 1 | | A |
| Sample 2 | | B |

After:

| | Disease |
| --- | --- |
| Sample 1 | A |
| Sample 2 | B |

The above is valid in most cases, but the behaviour can differ for some specific cases which are described below.

Case 2

If multiple raw keys are defined for the same attribute, the value is taken from the first non-empty raw key found for the sample, and the remaining raw keys are ignored.

In the example below, multiple raw_keys from the rules are found, and all of them are non-template attributes. The first raw_key attribute (Illness) has a value for Sample 1 but not for Sample 2, so the value from the second raw_key attribute (DISEASE) is taken for Sample 2. The raw_key attribute whose values were reassigned for all samples (Illness) is deleted. The partially reassigned attribute (DISEASE) remains in the table, but only with the values that were not reassigned.

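Before:

| | Disease | Illness | DISEASE |
| --- | --- | --- | --- |
| Sample 1 | | A | A1 |
| Sample 2 | | | B1 |

After:

| | Disease | DISEASE |
| --- | --- | --- |
| Sample 1 | A | A1 |
| Sample 2 | B1 | |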

Note: The described case is quite rare since usually attributes with the same meaning will have the same name across all samples in one study. Multiple raw keys are provided mostly for running the script on multiple studies from different sources where attributes with the same meaning can have different names.

Case 3

raw_key is a template attribute. This case is also quite rare, since the main purpose of the curation script is to match non-template attributes of non-harmonised metadata to template attributes. The “Reassign” behaviour cannot be applied to template attributes, since they cannot be deleted. Hence, the values are copied from raw_key to genestack_key, and raw_key is left in the same state as before the curation.

Before:

| | Disease | Illness |
| --- | --- | --- |
| Sample 1 | | A |
| Sample 2 | | B |

After:

| | Disease | Illness |
| --- | --- | --- |
| Sample 1 | A | A |
| Sample 2 | B | B |
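
The three cases above can be summarised in a short sketch that works on one sample at a time. It models a sample as a plain dict of attribute names to values, which is an assumption made for illustration. The function name reassign and the template_keys parameter are hypothetical, and the --overwrite and read-only checks are deliberately left out.

```python
def reassign(sample, genestack_key, raw_keys, template_keys=frozenset()):
    """Move the first non-empty raw value of a sample into genestack_key.

    Raw keys are tried in the order given in the rule. A non-template raw key
    is deleted from the sample once its value has been used (Cases 1 and 2).
    A template raw key is kept, so its value is copied instead (Case 3).
    """
    for raw_key in raw_keys:
        value = sample.get(raw_key)
        if value in (None, ""):
            continue                     # empty: try the next raw key
        sample[genestack_key] = value
        if raw_key not in template_keys:
            del sample[raw_key]          # reassigned, so the raw attribute goes away
        break                            # remaining raw keys are ignored
    return sample


raw_keys = ["Illness", "DISEASE"]
# Case 2: Sample 1 has both raw keys, Sample 2 only has DISEASE.
print(reassign({"Illness": "A", "DISEASE": "A1"}, "Disease", raw_keys))
print(reassign({"DISEASE": "B1"}, "Disease", raw_keys))
# Case 3: Illness is a template attribute, so it is preserved.
print(reassign({"Illness": "A"}, "Disease", raw_keys, template_keys={"Illness"}))
```

In this model, an attribute column disappears from the table only when no sample holds a value for it any longer, which matches the Illness and DISEASE columns in Case 2.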

Read only attributes

Attributes set as read-only in the template associated with the study cannot be curated, and a warning is displayed in the logs. The --overwrite flag does not affect this behaviour.

Multiple rules for a single attribute

If multiple rules are found for the same attribute, a warning message is displayed in the logs:

Multiple curation rules were detected for the attribute X.
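
A rules file can be checked for this situation before it is submitted. The sketch below counts how many mappers target each attribute. The function name find_duplicate_targets is hypothetical, and treating rules as duplicates only when both object_type and genestack_key match is an assumption about how the script groups them.

```python
import json
from collections import Counter


def find_duplicate_targets(mappers):
    """Return the (object_type, genestack_key) pairs targeted by more than one mapper."""
    counts = Counter()
    for mapper in mappers:
        key = mapper["genestack_key"]
        # Mappers with units list two target keys; plain mappers use a single string.
        targets = key if isinstance(key, list) else [key]
        counts.update((mapper["object_type"], target) for target in targets)
    return sorted(pair for pair, n in counts.items() if n > 1)


rules_text = """
[
    {"genestack_key": "Sex", "object_type": "sample", "raw_keys": ["gender"]},
    {"genestack_key": "Sex", "object_type": "sample", "raw_keys": ["sex"]},
    {"genestack_key": "Disease", "object_type": "sample", "raw_keys": ["Illness"]}
]
"""
duplicates = find_duplicate_targets(json.loads(rules_text))
print(duplicates)
```

The raw key names "gender" and "sex" in this sample input are made up for the example.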

Progress, Logs

You can track the progress of the curation process in the Genestack Task Manager.

The results of mapping are shown in the output logs.