Redesigning the Cardea Class

This issue is to track the development of the Cardea class. Previous updates were also mentioned in #73.

The Cardea class is responsible for handling and interacting with all the components (`data_assembler`, `data_labeler`, `featurizer`, and `modeler`). 

Overall, the Cardea class:
* Provides simple user-facing abstractions
    - label: generate label times
    - featurize: generate feature matrix
    - fit/predict
    - evaluate
    - save/load
* Hides away the interaction with other systems
    - Entityset
    - Featuretools DeepFeatureSynthesis
    - ComposeML
    - MLBlocks Pipelines
    - Pipeline Selection and Tuning

**design choices:**
1. remove `load_entityset` and make the assumption that a cardea instance only deals with one data source. The data is loaded upon instantiation.
2. change `generate_label_time` -> `label`.
3. change `generate_feature_matrix` -> `featurize`.
4. allow the user to inspect `label_times` and `feature_matrix`.


This should be the class public interface:
```python3
class Cardea:

    def __init__(self, 
                 data: str = DEFAULT_DATA, 
                 labeler: FunctionType = DEFAULT_LABELER,
                 pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE, 
                 hyperparameters: dict = None):
        pass

    def label(self, 
              labeler: FunctionType = None,
              parameter: dict = None) -> pd.DataFrame:
        """Create label times using the data labeler.

        Args:
            labeler (function):
                Function that defines the prediction task, it should return a
                tuple of labeling function, the dataframe, and the name of the
                target entity.
            parameter (dict):
                Variables to change the default parameters, if any.

        Returns:
            pandas.DataFrame:
                A dataframe of cutoff times and their target labels.
        """
        pass

    def featurize(self, 
                  label_times: pd.DataFrame,
                  verbose: bool = False) -> pd.DataFrame:
        """Returns a the calculated feature matrix.

        Args:
            label_times (pandas.DataFrame):
                A dataframe that indicates cutoff time for each instance.
            verbose (bool):
                Indicate verbosity of the featurization.

        Returns:
            pandas.DataFrame:
                Generated feature matrix.
        """
        pass

    def fit(self, 
            X: Union[np.ndarray, pd.DataFrame], 
            y: Union[np.ndarray, pd.Series, list],
            tune: bool = False, 
            max_evals: int = 10, 
            scoring: str = None,
            verbose: bool = False) -> None:
        """Train the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.
        """
        pass

    def predict(self, X: Union[np.ndarray, pd.DataFrame]) -> Union[np.ndarray, list]:
        """Get predictions from the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.

        Returns:
            numpy.ndarray or list:
                Predictions to the input data.
        """
        pass

    def fit_predict(self, 
                    X: Union[np.ndarray, pd.DataFrame],
                    y: Union[np.ndarray, pd.Series, list], 
                    tune: bool = False,
                    max_evals: int = 10, 
                    scoring: str = None,
                    verbose: bool = False) -> Union[np.ndarray, list]:
        """Train a cardea pipeline then make predictions.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            numpy.ndarray:
                Predictions to the input data.
        """
        pass

    def evaluate(self, 
                 X: Union[np.ndarray, pd.DataFrame], 
                 y: Union[np.ndarray, pd.Series, list],
                 test_size: float = 0.2, 
                 shuffle: bool = True, fit: bool = False,
                 tune: bool = False, 
                 max_evals: int = 10, 
                 scoring: str = None,
                 metrics: List[str] = DEFAULT_METRICS, 
                 verbose: bool = False) -> pd.Series:
        """Evaluate the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            test_size (float):
                The proportion of the dataset to include in the test dataset.
            shuffle (bool):
                Whether or not to shuffle the data before splitting.
            fit (bool):
                Whether to fit the pipeline before evaluating it.
                Defaults to ``False``.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            metrics (list):
                A list of scoring function names. The scoring functions should be consistent
                with the problem type.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            pandas.Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

    def save(self, path: str) -> None:
        """Save this object using pickle.

        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass

    def load(cls, path: str) -> Cardea:
        """Load an Cardea instance from a pickle file.

        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.

        Returns:
            Cardea:
                A Cardea instance

        Raises:
            ValueError:
                If the serialized object is not an Cardea instance.
        """
        pass
```

In addition to the main APIs, there will be helper functions such as `set_pipeline`, and `train_test_split` to allow the users to have a bit of flexibility in modifying the MLPipeline and hyperparameters.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Redesigning the Cardea Class #85

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Redesigning the Cardea Class #85

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions