Redesigning the Cardea Class #85

@sarahmish

This issue tracks the development of the Cardea class. Earlier updates were also mentioned in #73.

The Cardea class is responsible for handling and interacting with all the components (data_assembler, data_labeler, featurizer, and modeler).

Overall, the Cardea class:

  • Provides simple user-facing abstractions
    • label: generate label times
    • featurize: generate feature matrix
    • fit/predict
    • evaluate
    • save/load
  • Hides away the interaction with other systems
    • Entityset
    • Featuretools DeepFeatureSynthesis
    • ComposeML
    • MLBlocks Pipelines
    • Pipeline Selection and Tuning

Design choices:

  1. Remove load_entityset and assume that a Cardea instance deals with only one data source. The data is loaded upon instantiation.
  2. Rename generate_label_time -> label.
  3. Rename generate_feature_matrix -> featurize.
  4. Allow the user to inspect label_times and feature_matrix.
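To make design choices 1 and 4 concrete, here is a minimal sketch using plain Python lists in place of DataFrames. Everything in it is an assumption for illustration: `CardeaSketch`, `long_stay_labeler`, the toy `admissions` rows, and the column names are hypothetical and not part of the proposed implementation. The labeler follows the contract described in the `label` docstring (it returns a labeling function, the data, and the target entity name).

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


def long_stay_labeler(rows: List[Row]) -> Tuple[Callable[[Row], bool], List[Row], str]:
    """Hypothetical labeler: returns (labeling function, data, target entity name)."""
    def is_long_stay(row: Row) -> bool:
        return row["length_of_stay"] > 3
    return is_long_stay, rows, "admissions"


class CardeaSketch:
    """Stand-in for the proposed class; not the real implementation."""

    def __init__(self, data: List[Row], labeler=long_stay_labeler):
        # Design choice 1: a single data source, loaded at instantiation.
        self.data = data
        self.labeler = labeler
        # Design choice 4: intermediate results stay inspectable as attributes.
        self.label_times: Optional[List[Row]] = None
        self.feature_matrix: Optional[List[Row]] = None

    def label(self, labeler=None) -> List[Row]:
        labeling_function, rows, _target = (labeler or self.labeler)(self.data)
        self.label_times = [
            {"instance_id": row["id"],
             "cutoff_time": row["admitted"],
             "label": labeling_function(row)}
            for row in rows
        ]
        return self.label_times

    def featurize(self, label_times: List[Row]) -> List[Row]:
        by_id = {row["id"]: row for row in self.data}
        # A single pass-through feature stands in for Deep Feature Synthesis.
        self.feature_matrix = [
            {"instance_id": lt["instance_id"],
             "age": by_id[lt["instance_id"]]["age"]}
            for lt in label_times
        ]
        return self.feature_matrix


admissions = [
    {"id": 1, "admitted": "2020-01-01", "age": 64, "length_of_stay": 5},
    {"id": 2, "admitted": "2020-01-03", "age": 41, "length_of_stay": 2},
]
cardea = CardeaSketch(admissions)
label_times = cardea.label()
feature_matrix = cardea.featurize(label_times)
```

After these calls, `cardea.label_times` and `cardea.feature_matrix` hold the same objects that were returned, so a user can inspect them at any point without recomputing.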

This should be the class public interface:

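from types import FunctionType
from typing import List, Union

import numpy as np
import pandas as pd
from mlblocks import MLPipeline

# DEFAULT_DATA, DEFAULT_LABELER, DEFAULT_PIPELINE and DEFAULT_METRICS
# are assumed to be module-level constants defined elsewhere.

class Cardea: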

    def __init__(self, 
                 data: str = DEFAULT_DATA, 
                 labeler: FunctionType = DEFAULT_LABELER,
                 pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE, 
                 hyperparameters: dict = None):
        pass

    def label(self, 
              labeler: FunctionType = None,
              parameter: dict = None) -> pd.DataFrame:
        """Create label times using the data labeler.

        Args:
            labeler (function):
                Function that defines the prediction task. It should return a
                tuple of the labeling function, the dataframe, and the name
                of the target entity.
            parameter (dict):
                Variables to change the default parameters, if any.

        Returns:
            pandas.DataFrame:
                A dataframe of cutoff times and their target labels.
        """
        pass

    def featurize(self, 
                  label_times: pd.DataFrame,
                  verbose: bool = False) -> pd.DataFrame:
        """Return the calculated feature matrix.

        Args:
            label_times (pandas.DataFrame):
                A dataframe that indicates cutoff time for each instance.
            verbose (bool):
                Indicate verbosity of the featurization.

        Returns:
            pandas.DataFrame:
                Generated feature matrix.
        """
        pass

    def fit(self, 
            X: Union[np.ndarray, pd.DataFrame], 
            y: Union[np.ndarray, pd.Series, list],
            tune: bool = False, 
            max_evals: int = 10, 
            scoring: str = None,
            verbose: bool = False) -> None:
        """Train the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.
        """
        pass

    def predict(self, X: Union[np.ndarray, pd.DataFrame]) -> Union[np.ndarray, list]:
        """Get predictions from the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.

        Returns:
            numpy.ndarray or list:
                Predictions for the input data.
        """
        pass

    def fit_predict(self, 
                    X: Union[np.ndarray, pd.DataFrame],
                    y: Union[np.ndarray, pd.Series, list], 
                    tune: bool = False,
                    max_evals: int = 10, 
                    scoring: str = None,
                    verbose: bool = False) -> Union[np.ndarray, list]:
        """Train a cardea pipeline then make predictions.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            numpy.ndarray or list:
                Predictions for the input data.
        """
        pass

    def evaluate(self, 
                 X: Union[np.ndarray, pd.DataFrame], 
                 y: Union[np.ndarray, pd.Series, list],
                 test_size: float = 0.2, 
                 shuffle: bool = True, 
                 fit: bool = False,
                 tune: bool = False, 
                 max_evals: int = 10, 
                 scoring: str = None,
                 metrics: List[str] = DEFAULT_METRICS, 
                 verbose: bool = False) -> pd.Series:
        """Evaluate the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            test_size (float):
                The proportion of the dataset to include in the test dataset.
            shuffle (bool):
                Whether or not to shuffle the data before splitting.
            fit (bool):
                Whether to fit the pipeline before evaluating it.
                Defaults to ``False``.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            metrics (list):
                A list of scoring function names. The scoring functions should be consistent
                with the problem type.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            pandas.Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

    def save(self, path: str) -> None:
        """Save this object using pickle.

        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass

    @classmethod
    def load(cls, path: str) -> 'Cardea':
        """Load a Cardea instance from a pickle file.

        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.

        Returns:
            Cardea:
                A Cardea instance

        Raises:
            ValueError:
                If the serialized object is not a Cardea instance.
        """
        pass
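The `save`/`load` pair described in the docstrings above could look roughly like the following. This is a minimal sketch, assuming plain `pickle` serialization and a stripped-down stand-in class whose only attribute (`pipeline`) is a placeholder; the real class would carry more state.

```python
import os
import pickle
import tempfile


class Cardea:
    """Stand-in carrying only what save/load need (hypothetical)."""

    def __init__(self, pipeline: str = "default"):
        self.pipeline = pipeline

    def save(self, path: str) -> None:
        # Serialize the whole instance to the given file.
        with open(path, "wb") as pickle_file:
            pickle.dump(self, pickle_file)

    @classmethod
    def load(cls, path: str) -> "Cardea":
        with open(path, "rb") as pickle_file:
            instance = pickle.load(pickle_file)
        # Reject files that contain some other kind of object.
        if not isinstance(instance, cls):
            raise ValueError("Serialized object is not a Cardea instance")
        return instance


with tempfile.TemporaryDirectory() as tmp_dir:
    # Round trip: save an instance and load it back.
    path = os.path.join(tmp_dir, "cardea.pkl")
    Cardea(pipeline="demo").save(path)
    restored = Cardea.load(path)

    # A pickle holding something else should raise ValueError.
    other_path = os.path.join(tmp_dir, "other.pkl")
    with open(other_path, "wb") as pickle_file:
        pickle.dump({"not": "cardea"}, pickle_file)
    try:
        Cardea.load(other_path)
        rejected = False
    except ValueError:
        rejected = True
```

Making `load` a classmethod lets users call `Cardea.load(path)` without first building an instance. Note that unpickling runs arbitrary code from the file, so `load` should only be used on files from a trusted source.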

In addition to the main APIs, there will be helper functions such as set_pipeline and train_test_split, which give users some flexibility in modifying the MLPipeline and its hyperparameters.
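As a rough illustration of how a `train_test_split` helper and `evaluate` might fit together, here is a dependency-free sketch. The signatures, the `seed` parameter, the `METRICS` registry, and the use of a plain dict in place of `pandas.Series` are all assumptions made for this example. It also skips the `fit` and `tune` options and takes an already-trained `predict` callable.

```python
import random
from typing import Callable, Dict, Sequence, Tuple


def train_test_split(X: Sequence, y: Sequence, test_size: float = 0.2,
                     shuffle: bool = True, seed: int = 0
                     ) -> Tuple[list, list, list, list]:
    """Split X and y into train and test parts (hypothetical helper)."""
    indices = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Metric names map to scoring functions, mirroring the `metrics` argument.
METRICS: Dict[str, Callable[[Sequence, Sequence], float]] = {"accuracy": accuracy}


def evaluate(predict: Callable[[list], list], X: Sequence, y: Sequence,
             test_size: float = 0.2, shuffle: bool = True,
             metrics: Sequence[str] = ("accuracy",)) -> Dict[str, float]:
    """Score predictions on a held-out split, one entry per metric name."""
    _, X_test, _, y_test = train_test_split(X, y, test_size, shuffle)
    y_pred = predict(X_test)
    return {name: METRICS[name](y_test, y_pred) for name in metrics}


X = list(range(10))
y = [x >= 5 for x in X]
X_train, X_test, y_train, y_test = train_test_split(X, y)
scores = evaluate(lambda rows: [x >= 5 for x in rows], X, y)
```

Keying the result by metric name matches the return value described for `evaluate`, where each metric applied becomes one labeled entry.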

Labels: enhancement (New feature or request)