This issue is to track the development of the Cardea class. Previous updates were also mentioned in #73.
The Cardea class is responsible for handling and interacting with all the components (data_assembler, data_labeler, featurizer, and modeler).
Overall, the Cardea class:
- Provides simple user-facing abstractions
  - label: generate label times
  - featurize: generate feature matrix
  - fit/predict
  - evaluate
  - save/load
- Hides away the interaction with other systems
  - Entityset
  - Featuretools DeepFeatureSynthesis
  - ComposeML
  - MLBlocks Pipelines
  - Pipeline Selection and Tuning
Design choices:
- remove load_entityset and make the assumption that a Cardea instance only deals with one data source. The data is loaded upon instantiation.
- change generate_label_time -> label.
- change generate_feature_matrix -> featurize.
- allow the user to inspect label_times and feature_matrix.
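To make these design choices concrete, here is a minimal sketch using only the standard library. It is not the proposed implementation: `CardeaSketch`, `load_records`, and `long_stay` are hypothetical stand-ins, and plain lists of dicts take the place of the dataframes that Featuretools and ComposeML would produce. It only illustrates loading a single data source at instantiation, the shorter `label` / `featurize` names, and keeping `label_times` and `feature_matrix` inspectable.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]


def load_records(path: str) -> List[Record]:
    """Toy loader (assumption): pretend `path` points at three hospital stays."""
    return [
        {"id": 0, "length_of_stay": 2.0, "age": 34.0},
        {"id": 1, "length_of_stay": 9.0, "age": 71.0},
        {"id": 2, "length_of_stay": 5.0, "age": 52.0},
    ]


def long_stay(record: Record) -> bool:
    """Toy labeling function: a stay longer than 4 days is a positive label."""
    return record["length_of_stay"] > 4


class CardeaSketch:
    """Hypothetical stand-in for the proposed class, not the real Cardea."""

    def __init__(self, data: str = "demo",
                 labeler: Callable[[Record], bool] = long_stay):
        # One data source per instance, loaded once here (no load_entityset).
        self._records = load_records(data)
        self._labeler = labeler
        # Intermediate results stay available for inspection after each step.
        self.label_times: Optional[List[dict]] = None
        self.feature_matrix: Optional[List[dict]] = None

    def label(self, labeler: Optional[Callable[[Record], bool]] = None) -> List[dict]:
        """Apply the labeling function to every record and store the result."""
        labeler = labeler or self._labeler
        self.label_times = [
            {"id": r["id"], "label": labeler(r)} for r in self._records
        ]
        return self.label_times

    def featurize(self, label_times: List[dict]) -> List[dict]:
        """Build one (toy) feature row per labeled instance and store it."""
        by_id = {r["id"]: r for r in self._records}
        self.feature_matrix = [
            {"id": lt["id"], "age": by_id[lt["id"]]["age"]} for lt in label_times
        ]
        return self.feature_matrix


cardea = CardeaSketch()
label_times = cardea.label()
feature_matrix = cardea.featurize(label_times)
```

Because both steps store their output on the instance, a user can check `cardea.label_times` or `cardea.feature_matrix` before moving on to `fit`.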
This should be the class public interface:
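from types import FunctionType
from typing import List, Union

import numpy as np
import pandas as pd
from mlblocks import MLPipeline

# DEFAULT_DATA, DEFAULT_LABELER, DEFAULT_PIPELINE and DEFAULT_METRICS are
# assumed to be module-level constants defined elsewhere in the package.


class Cardea:

    def __init__(self,
                 data: str = DEFAULT_DATA,
                 labeler: FunctionType = DEFAULT_LABELER,
                 pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE,
                 hyperparameters: dict = None):
        pass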

    def label(self,
              labeler: FunctionType = None,
              parameter: dict = None) -> pd.DataFrame:
        """Create label times using the data labeler.

        Args:
            labeler (function):
                Function that defines the prediction task. It should return a
                tuple of the labeling function, the dataframe, and the name of
                the target entity.
            parameter (dict):
                Variables to change the default parameters, if any.

        Returns:
            pandas.DataFrame:
                A dataframe of cutoff times and their target labels.
        """
        pass

    def featurize(self,
                  label_times: pd.DataFrame,
                  verbose: bool = False) -> pd.DataFrame:
        """Return the calculated feature matrix.

        Args:
            label_times (pandas.DataFrame):
                A dataframe that indicates the cutoff time for each instance.
            verbose (bool):
                Indicate verbosity of the featurization.

        Returns:
            pandas.DataFrame:
                Generated feature matrix.
        """
        pass

    def fit(self,
            X: Union[np.ndarray, pd.DataFrame],
            y: Union[np.ndarray, pd.Series, list],
            tune: bool = False,
            max_evals: int = 10,
            scoring: str = None,
            verbose: bool = False) -> None:
        """Train the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter
                optimization.
            verbose (bool):
                Whether to log information during processing.
        """
        pass

    def predict(self, X: Union[np.ndarray, pd.DataFrame]) -> Union[np.ndarray, list]:
        """Get predictions from the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.

        Returns:
            numpy.ndarray or list:
                Predictions for the input data.
        """
        pass

    def fit_predict(self,
                    X: Union[np.ndarray, pd.DataFrame],
                    y: Union[np.ndarray, pd.Series, list],
                    tune: bool = False,
                    max_evals: int = 10,
                    scoring: str = None,
                    verbose: bool = False) -> Union[np.ndarray, list]:
        """Train a cardea pipeline then make predictions.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter
                optimization.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            numpy.ndarray or list:
                Predictions for the input data.
        """
        pass

    def evaluate(self,
                 X: Union[np.ndarray, pd.DataFrame],
                 y: Union[np.ndarray, pd.Series, list],
                 test_size: float = 0.2,
                 shuffle: bool = True,
                 fit: bool = False,
                 tune: bool = False,
                 max_evals: int = 10,
                 scoring: str = None,
                 metrics: List[str] = DEFAULT_METRICS,
                 verbose: bool = False) -> pd.Series:
        """Evaluate the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            test_size (float):
                The proportion of the dataset to include in the test dataset.
            shuffle (bool):
                Whether or not to shuffle the data before splitting.
            fit (bool):
                Whether to fit the pipeline before evaluating it.
                Defaults to ``False``.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter
                optimization.
            metrics (list):
                A list of scoring function names. The scoring functions should
                be consistent with the problem type.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            pandas.Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

    def save(self, path: str) -> None:
        """Save this object using pickle.

        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass

    @classmethod
    def load(cls, path: str) -> 'Cardea':
        """Load a Cardea instance from a pickle file.

        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.

        Returns:
            Cardea:
                A Cardea instance.

        Raises:
            ValueError:
                If the serialized object is not a Cardea instance.
        """
        pass

In addition to the main APIs, there will be helper functions such as set_pipeline and train_test_split, which give users some flexibility in modifying the MLPipeline and its hyperparameters.
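As a rough illustration of what such helpers might look like, the sketch below uses only the standard library. The signatures are assumptions, not the planned API: `train_test_split` here mirrors the `test_size` and `shuffle` arguments of `evaluate`, the `seed` argument is added purely for reproducibility, and `PipelineHolder` is a hypothetical stand-in for the part of Cardea that owns the pipeline.

```python
import random
from typing import List, Optional, Sequence, Tuple


def train_test_split(X: Sequence, y: Sequence, test_size: float = 0.2,
                     shuffle: bool = True, seed: Optional[int] = None
                     ) -> Tuple[List, List, List, List]:
    """Split X and y into train and test parts (hypothetical signature)."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")

    indices = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(indices)

    # Hold out at least one row; the test rows are taken from the end.
    n_test = max(1, round(len(indices) * test_size))
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


class PipelineHolder:
    """Hypothetical stand-in for the pipeline-owning part of Cardea."""

    def __init__(self, pipeline: str = "default",
                 hyperparameters: Optional[dict] = None):
        self.set_pipeline(pipeline, hyperparameters)

    def set_pipeline(self, pipeline: str,
                     hyperparameters: Optional[dict] = None) -> None:
        """Swap the pipeline and its hyperparameters in one call."""
        self.pipeline = pipeline
        self.hyperparameters = dict(hyperparameters or {})
        # A newly set pipeline has not been trained yet.
        self.fitted = False


X = list(range(10))
y = [value % 2 for value in X]
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

holder = PipelineHolder()
holder.set_pipeline("other_pipeline", {"max_depth": 3})
```

Resetting `fitted` inside `set_pipeline` is one possible design: it keeps a stale model from being used for predictions after the pipeline changes.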