Analyze
=======

.. autofunction:: association_study
.. autofunction:: interaction_study
.. autofunction:: add_corrected_pvalues

.. autofunction:: association_study(data: pd.DataFrame, outcomes: Union[str, List[str]], regression_variables: Optional[Union[str, List[str]]] = None, covariates: Optional[Union[str, List[str]]] = None, regression_kind: Optional[Union[str, List[str]]] = None, encoding: str = "additive", edge_encoding_info: Optional[pd.DataFrame] = None, **kwargs,)
   :noindex:

    Run an association study (EWAS, PheWAS, GWAS, GxEWAS, etc.).

    Individual regression classes selected with `regression_kind` may work
    slightly differently. Results are sorted in order of increasing `pvalue`.

    Parameters
    ----------
    data: pd.DataFrame
        Contains all outcomes, regression_variables, and covariates.
    outcomes: str or List[str]
        The endogenous (dependent) variable (str) or variables (List) to be
        used as the output of each regression.
    regression_variables: str, List[str], or None
        The exogenous (independent) variable (str) or variables (List) to be
        used individually as inputs into regression.
        If None, use all variables in `data` that aren't an outcome or a
        covariate.
    covariates: str, List[str], or None (default)
        The variable (str) or variables (List) to be used as covariates in
        each regression.
    regression_kind: None, str or subclass of Regression
        This can be 'glm', 'weighted_glm', or 'r_survey' for built-in
        Regression types, or a custom subclass of Regression.  If None, it is
        set to 'glm' if a survey design is not specified and 'weighted_glm'
        if it is.
    encoding: str, default "additive"
        Encoding method to use for any genotype data.  One of {'additive',
        'dominant', 'recessive', 'codominant', or 'edge'}
    edge_encoding_info: Optional pd.DataFrame, default None
        If edge encoding is used, this must be provided.
        See Pandas-Genomics documentation on edge encoding.
    kwargs: Keyword arguments specific to the Regression being used

    Returns
    -------
    df: pd.DataFrame
        Association Study results DataFrame with at least these columns: ['N',
        'pvalue', 'error', 'warnings'].
        Indexed by the outcome variable and the variable being assessed in
        each regression.

    Examples
    --------

    >>> import igem
    >>> results = igem.epc.analyze.association_study(
    ...     outcomes="HI_CHOL",
    ...     covariates=["race", "agecat"],
    ...     data=df,
    ...     standardize_data=True,
    ... )
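
The default for ``regression_variables`` described above (every column that is
neither an outcome nor a covariate) can be sketched in plain Python. This is
only an illustration, not the library's implementation: the helper name
``select_regression_variables`` and the column names ``lead`` and ``cadmium``
are hypothetical.

```python
from typing import List, Optional, Sequence, Union

StrOrList = Union[str, Sequence[str]]


def _as_list(value: Optional[StrOrList]) -> List[str]:
    """Normalize None, a single name, or a sequence of names to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def select_regression_variables(
    columns: Sequence[str],
    outcomes: StrOrList,
    covariates: Optional[StrOrList] = None,
    regression_variables: Optional[StrOrList] = None,
) -> List[str]:
    """Hypothetical helper: decide which columns are regressed individually."""
    if regression_variables is not None:
        # An explicit selection is used as given.
        return _as_list(regression_variables)
    # Default: everything that is not an outcome or a covariate.
    excluded = set(_as_list(outcomes)) | set(_as_list(covariates))
    return [name for name in columns if name not in excluded]


# Hypothetical column names standing in for data.columns.
columns = ["HI_CHOL", "race", "agecat", "lead", "cadmium"]
tested = select_regression_variables(
    columns, outcomes="HI_CHOL", covariates=["race", "agecat"]
)
print(tested)  # ['lead', 'cadmium']
```

Each name in ``tested`` would then be the input to its own regression, with the
covariates included every time.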

.. autofunction:: ewas(outcome: str, covariates: List[str], data: Any, regression_kind: Optional[Union[str, List[str]]] = None, **kwargs)
   :noindex:

    Run an Environment-Wide Association Study.

    All variables in `data` other than the outcome and covariates are tested
    individually. Individual regression classes selected with
    `regression_kind` may work slightly differently.
    Results are sorted in order of increasing `pvalue`.

    Parameters
    ----------
    outcome: str
        The variable to be used as the output of the regressions.
    covariates: List[str]
        The variables to be used as covariates.  Any variables in `data`
        that are neither the outcome nor a covariate are regressed.
    data: Any, usually pd.DataFrame
        The data to be analyzed, including the outcome, covariates, and any
        variables to be regressed.
    regression_kind: None, str or subclass of Regression
        This can be 'glm', 'weighted_glm', or 'r_survey' for built-in
        Regression types, or a custom subclass of Regression.
        None by default to maintain the existing API ('glm' unless a
        SurveyDesignSpec exists, in which case 'weighted_glm').
    kwargs: Keyword arguments specific to the Regression being used

    Returns
    -------
    df: pd.DataFrame
        EWAS results DataFrame with at least these columns: ['N', 'pvalue',
        'error', 'warnings'].
        Indexed by the outcome and the variable being assessed in each row.

    Examples
    --------
    >>> ewas_discovery = igem.epc.analyze.ewas(
    ...     "logBMI", covariates, nhanes_discovery
    ... )
    Running on a continuous variable
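
The fallback for ``regression_kind=None`` described above amounts to a small
decision rule. The sketch below is an assumption about how that rule could be
written, not the library's code; the function name ``resolve_regression_kind``
and the ``has_survey_design`` flag are hypothetical stand-ins for however the
survey design is actually passed.

```python
from typing import Any, Optional


def resolve_regression_kind(
    regression_kind: Optional[Any] = None,
    has_survey_design: bool = False,
) -> Any:
    """Hypothetical helper: pick the regression type when none is given."""
    if regression_kind is not None:
        # An explicit name or custom Regression subclass is passed through.
        return regression_kind
    # Default: weighted regression only when a survey design is present.
    return "weighted_glm" if has_survey_design else "glm"


print(resolve_regression_kind())                        # glm
print(resolve_regression_kind(has_survey_design=True))  # weighted_glm
print(resolve_regression_kind("r_survey"))              # r_survey
```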

.. autofunction:: interaction_study(data: pd.DataFrame, outcomes: Union[str, List[str]], interactions: Optional[Union[List[Tuple[str, str]], str]] = None, covariates: Optional[Union[str, List[str]]] = None, encoding: str = "additive", edge_encoding_info: Optional[pd.DataFrame] = None, report_betas: bool = False, min_n: int = 200, process_num: Optional[int] = None)
   :noindex:

    Perform likelihood ratio tests (LRT) comparing a model with interaction
    terms to one without.

    An intercept, covariates, and main effects of the variables used in the
    interactions are included in both the full and restricted models.
    All variables in `data` other than the outcomes and covariates are
    potential interaction variables.
    All pairwise interactions are tested unless specific interactions are
    given.
    Results are sorted in order of increasing `pvalue`.

    Parameters
    ----------
    data: pd.DataFrame
        The data to be analyzed, including the outcome, covariates, and any
        variables to be regressed.
    outcomes: str or List[str]
        The endogenous (dependent) variable (str) or variables (List) to be
        used as the output of each regression.
    interactions: list(tuple(strings)), str, or None
        Valid variables are those in the data that are not an outcome variable
        or a covariate.
        None: Test all pairwise interactions between valid variables
        String: Test all interactions of this valid variable with other valid
        variables
        List of tuples: Test specific interactions of valid variables
    covariates: str, List[str], or None (default)
        The variable (str) or variables (List) to be used as covariates in
        each regression.
    encoding: str, default "additive"
        Encoding method to use for any genotype data.  One of {'additive',
        'dominant', 'recessive', 'codominant', or 'edge'}
    edge_encoding_info: Optional pd.DataFrame, default None
        If edge encoding is used, this must be provided.
        See Pandas-Genomics documentation on edge encoding.
    report_betas: bool, default False
        If True, the results will contain one row for each interaction term
        and will include the beta value, standard error (SE), and beta
        pvalue for that specific interaction. The number of terms increases
        with the number of categories in each interacting variable.
    min_n: int or None
        Minimum number of complete-case observations (no NA values for outcome,
        covariates, or variable). Defaults to 200.
    process_num: Optional[int]
        Number of processes to use when running the analysis. Default is None
        (use the number of cores).

    Returns
    -------
    df: pd.DataFrame
        DataFrame with these columns: ['Test_Number', 'Converged', 'N', 'Beta',
        'SE', 'Beta_pvalue', 'LRT_pvalue'].
        Indexed by the interaction terms ("Term1", "Term2") and the outcome
        variable ("Outcome").

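
The three accepted forms of ``interactions`` and the likelihood ratio test can
be illustrated with the standard library alone. This is a sketch under stated
assumptions, not the library's implementation: ``expand_interactions`` and
``lrt_pvalue_1df`` are hypothetical names, the column names are made up, and
the p-value helper only covers an interaction that adds a single parameter
(1 degree of freedom); categorical variables add more terms and would need a
general chi-square survival function.

```python
import math
from itertools import combinations


def expand_interactions(columns, outcomes, covariates=None, interactions=None):
    """Hypothetical helper: list the (Term1, Term2) pairs that would be tested."""

    def as_list(value):
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    # Valid variables are those that are not an outcome or a covariate.
    excluded = set(as_list(outcomes)) | set(as_list(covariates))
    valid = [name for name in columns if name not in excluded]

    if interactions is None:
        # None: all pairwise interactions between valid variables.
        return list(combinations(valid, 2))
    if isinstance(interactions, str):
        # String: this variable paired with every other valid variable.
        if interactions not in valid:
            raise ValueError(f"{interactions!r} is not a valid variable")
        return [(interactions, other) for other in valid if other != interactions]
    # List of tuples: specific interactions, checked against valid variables.
    pairs = [tuple(pair) for pair in interactions]
    for pair in pairs:
        unknown = [term for term in pair if term not in valid]
        if unknown:
            raise ValueError(f"not valid variables: {unknown}")
    return pairs


def lrt_pvalue_1df(llf_restricted, llf_full):
    """LRT p-value from two log-likelihoods, assuming 1 degree of freedom."""
    statistic = 2.0 * (llf_full - llf_restricted)
    # Chi-square survival function with 1 df: P(Z**2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))


# Hypothetical column names standing in for data.columns.
columns = ["logBMI", "age", "lead", "cadmium", "smoker"]
all_pairs = expand_interactions(columns, "logBMI", covariates=["age"])
one_var = expand_interactions(columns, "logBMI", ["age"], interactions="cadmium")
print(all_pairs)
print(one_var)
print(lrt_pvalue_1df(llf_restricted=-100.0, llf_full=-98.0))
```

The restricted model contains the intercept, covariates, and main effects; the
full model adds the interaction term, so a small p-value indicates that the
interaction improves the fit.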

.. autofunction:: add_corrected_pvalues(data: pd.DataFrame, pvalue: str = "pvalue", groupby: Optional[Union[str, List[str]]] = None)
   :noindex:

    Calculate Bonferroni and FDR pvalues and sort by increasing FDR (in-place).
    Rows with a missing pvalue are not counted as a test.

    Parameters
    ----------
    data: pd.DataFrame
        A DataFrame that will be modified in-place to add corrected pvalues.
    pvalue: str, default "pvalue"
        Name of a column in data that the calculations will be based on.
    groupby: str, List[str], or None (default)
        A name or list of names of columns (including index columns) that will
        be used to group rows before performing calculations. This is meant to
        be used when multiple rows are present with repeated pvalues based on
        the same test. This will reduce the number of tests.  For example,
        grouping by ["Term1", "Term2"] in interaction results to apply
        corrections to the LRT_pvalue when betas are reported (which creates
        more rows than the number of tests).

    Returns
    -------
    None

    Examples
    --------
    >>> igem.epc.analyze.add_corrected_pvalues(ewas_discovery)

    >>> igem.epc.analyze.add_corrected_pvalues(
    ...     interaction_result,
    ...     pvalue='Beta_pvalue'
    ... )

    >>> igem.epc.analyze.add_corrected_pvalues(
    ...     interaction_result,
    ...     pvalue='LRT_pvalue',
    ...     groupby=["Term1", "Term2"]
    ... )
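
The arithmetic behind the two corrections can be sketched without any
dependencies. This is an illustration only: the source does not say which FDR
procedure is used, so the Benjamini-Hochberg step-up method here is an
assumption, and the helper names and sample rows are hypothetical. Missing
pvalues are represented as ``None`` and are not counted as tests, and the
grouping step keeps one pvalue per ("Term1", "Term2") pair so that repeated
rows do not inflate the number of tests.

```python
def bonferroni(pvalues):
    """Multiply each pvalue by the number of tests, capped at 1."""
    n_tests = sum(p is not None for p in pvalues)
    return [None if p is None else min(1.0, p * n_tests) for p in pvalues]


def fdr_bh(pvalues):
    """Benjamini-Hochberg adjusted pvalues (assumed FDR method)."""
    tested = sorted((p, i) for i, p in enumerate(pvalues) if p is not None)
    n_tests = len(tested)
    adjusted = [None] * len(pvalues)
    running_min = 1.0
    # Walk from the largest pvalue down, enforcing monotonicity.
    for rank in range(n_tests, 0, -1):
        p, index = tested[rank - 1]
        running_min = min(running_min, p * n_tests / rank)
        adjusted[index] = running_min
    return adjusted


pvalues = [0.01, 0.04, None, 0.03, 0.005]  # None marks a missing pvalue
bonf = bonferroni(pvalues)
fdr = fdr_bh(pvalues)
print(bonf)
print(fdr)

# Grouping: rows that repeat the same test contribute a single pvalue.
rows = [
    {"Term1": "lead", "Term2": "smoker", "LRT_pvalue": 0.01},
    {"Term1": "lead", "Term2": "smoker", "LRT_pvalue": 0.01},
    {"Term1": "lead", "Term2": "cadmium", "LRT_pvalue": 0.04},
]
unique = {}
for row in rows:
    unique.setdefault((row["Term1"], row["Term2"]), row["LRT_pvalue"])
grouped_bonf = dict(zip(unique, bonferroni(list(unique.values()))))
print(grouped_bonf)
```

With grouping, the three rows count as two tests, so each LRT pvalue is
multiplied by 2 rather than 3.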