Description
Now that we have the Cardea class, it would be beneficial to add a layer of functional interfaces that allows using Cardea in as few steps as possible. The functional API would be problem-centric: there would be one function for each prediction problem.
The functional API hides the details of composing a Cardea pipeline. It is designed to return a Cardea instance whose pipeline has been fitted on a given raw dataset. The user can then use the Cardea instance to:
- make predictions on new source data (not necessarily future data).
- make predictions on future data.
- save/load the cardea instance.
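The save/load and predict workflow listed above can be sketched roughly as follows. This is only an illustration: the fitted state is a plain dictionary standing in for a real Cardea instance, and `save_state`, `load_state`, `predict` and the file name are hypothetical names, not part of Cardea. Only the use of `pickle` for storage mirrors what the proposal describes.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Store the fitted state on disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_state(path):
    """Load a previously pickled fitted state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict(state, values):
    """Toy rule: flag every value above the learned threshold."""
    return [int(v > state["threshold"]) for v in values]


# Hypothetical fitted state standing in for a trained Cardea instance.
fitted = {"threshold": 0.5}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_state(fitted, path)

# Later, possibly in another session: reload and predict on new data.
restored = load_state(path)
predictions = predict(restored, [0.2, 0.7, 0.9])
```

The point of the round trip is that predictions on new data only need the stored object, not the original raw dataset.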
Design
from typing import List, Union

from mlblocks import MLPipeline


def model_pred_prob(data_path: str,
                    fhir: bool = True,
                    pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE,
                    hyperparameters: Union[str, dict] = None,
                    max_depth: int = 1,
                    max_features: int = -1,
                    n_jobs: int = 1,
                    test_size: float = 0.2,
                    shuffle: bool = True,
                    tune: bool = False,
                    max_evals: int = 10,
                    scoring: str = None,
                    evaluate: bool = False,
                    metrics: List[str] = DEFAULT_METRICS,
                    return_lt: bool = False,
                    return_fm: bool = False,
                    return_pred: bool = False,
                    verbose: bool = False,
                    save_path: str = None) -> Cardea:
    """Create and train a Cardea instance on a specific prediction problem.

    Return a Cardea instance that has been trained on the given
    dataset. The function loads the data, extracts label times, generates
    features, then trains the pipeline, all in one command.

    Args:
        data_path (str):
            A directory containing all the .csv files that should be loaded.
        fhir (bool):
            Whether the data follows the FHIR schema (``True``) or the
            MIMIC schema (``False``).
        pipeline (str or MLPipeline or dict):
            Pipeline to use. It can be passed as:
                * An ``str`` with a path to a JSON file.
                * An ``str`` with the name of a registered pipeline.
                * An ``str`` with the path to a pickle file.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set on the pipeline. They can be passed as
            a hyperparameters ``dict`` in the ``mlblocks`` format or as
            a path to the corresponding JSON file. Defaults to ``None``.
        max_depth (int):
            Maximum allowed depth of features.
        max_features (int):
            Cap on the number of generated features. If -1, no limit.
        n_jobs (int):
            Number of parallel processes to use when calculating the
            feature matrix.
        test_size (float):
            The proportion of the dataset to include in the test dataset.
        shuffle (bool):
            Whether or not to shuffle the data before splitting.
        tune (bool):
            Whether to optimize the hyperparameters of the pipeline.
        max_evals (int):
            Maximum number of hyperparameter optimization iterations.
        scoring (str):
            The name of the scoring function used in the hyperparameter
            optimization.
        evaluate (bool):
            Whether to evaluate the performance of the pipeline. If
            ``True``, performance is evaluated on the test data or, if
            no test data is available, on the training data.
        metrics (list):
            A list of scoring function names. The scoring functions should
            be consistent with the problem type.
        return_lt (bool):
            Whether to return ``label_times``.
        return_fm (bool):
            Whether to return the calculated feature matrix.
        return_pred (bool):
            Whether to return the predictions of the pipeline.
        verbose (bool):
            Whether to show information during processing.
        save_path (str):
            Path to the file where the fitted pipeline will be stored
            using ``pickle``.

    Returns:
        Cardea, dict:
            * A fitted Cardea instance.
            * Intermediary outputs, when requested through the
              ``return_*`` flags.
    """
    pass
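
To make the intended control flow more concrete, here is a minimal, self-contained sketch of how such a function could chain splitting, fitting, optional evaluation and optional intermediary outputs. It is not the Cardea implementation: the "pipeline" is a toy mean threshold, and `toy_functional_api`, `train_test_split`, `fit_mean_threshold` and `accuracy` are hypothetical helpers written for this example. Only the roles of `test_size`, `shuffle`, `evaluate` and `return_fm` follow the docstring above.

```python
import random


def train_test_split(rows, test_size=0.2, shuffle=True, seed=0):
    """Split rows into train and test parts, optionally shuffling first."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_size))
    return rows[n_test:], rows[:n_test]


def fit_mean_threshold(train):
    """Toy 'pipeline': learn the mean feature value as a decision threshold."""
    return sum(x for x, _ in train) / len(train)


def accuracy(threshold, rows):
    """Fraction of rows whose label matches the thresholded feature."""
    hits = sum(int(x > threshold) == y for x, y in rows)
    return hits / len(rows)


def toy_functional_api(rows, test_size=0.2, shuffle=True,
                       evaluate=False, return_fm=False):
    """Fit the toy model and optionally return intermediary outputs."""
    train, test = train_test_split(rows, test_size, shuffle)
    model = fit_mean_threshold(train)

    extras = {}
    if evaluate:
        # Score on the test split, falling back to train when it is empty.
        extras["accuracy"] = accuracy(model, test if test else train)
    if return_fm:
        extras["feature_matrix"] = rows

    # Return only the model unless intermediary outputs were requested.
    return (model, extras) if extras else model


# Hypothetical (feature, label) rows standing in for a feature matrix.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1),
        (0.15, 0), (0.85, 1), (0.25, 0), (0.75, 1)]

model_only = toy_functional_api(data)
model, extras = toy_functional_api(data, evaluate=True, return_fm=True)
```

One design question this sketch surfaces is the return shape: returning either a bare model or a `(model, dict)` pair depending on the flags is convenient for callers, but it means the return annotation should probably reflect both cases.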