Usage


Creating a simple dataset

# Assumes cc is an instance of CategoricalClassification.
# Creates a simple dataset of 10 features and 10,000 samples, where every feature has cardinality 35
X = cc.generate_data(10, 
                     10000, 
                     cardinality=35, 
                     ensure_rep=True, 
                     random_values=True, 
                     low=0, 
                     high=40)

# Creates target labels via clustering
y = cc.generate_labels(X, n=2, class_relation='cluster')

Documentation


CategoricalClassification.dataset_info

print(CategoricalClassification.dataset_info)

Stores a formatted dictionary describing the operations performed so far. Calling CategoricalClassification.generate_data resets its contents; each subsequent function call adds information to it.


CategoricalClassification.generate_data

CategoricalClassification.generate_data(n_features, 
                                        n_samples, 
                                        cardinality=5, 
                                        structure=None, 
                                        ensure_rep=False, 
                                        random_values=False, 
                                        low=0, 
                                        high=1000,
                                        k=10,
                                        seed=42)

Generates dataset of shape (n_samples, n_features), based on given parameters.

  • n_features: int The number of features in a generated dataset.
  • n_samples: int The number of samples in a generated dataset.
  • cardinality: int, default=5. Sets the default cardinality of a generated dataset.
  • structure: list, numpy.ndarray, default=None. Sets the structure of a generated dataset. Offers more control over feature value domains and value distributions. Follows the format [tuple, tuple, ...], where:
    • tuple can either be:
      • (int or list, int): the first element is the index or list of indexes of features, the second element their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as the peak. The feature values will be integers in the range [0, second element of tuple].
      • (int or list, list): the first element is the index or list of indexes of features. The second element offers two options:
        • list: a list of values to be used in the feature or features,
        • [list, list]: the first list is the set of values the feature or features possess, the second the frequencies or probabilities of the individual values.
  • ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
  • random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
  • low: int Sets lower bound of value domain of feature.
  • high: int Sets upper bound of value domain of feature. Only used when random_values is True.
  • k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. The higher the value, the narrower the peak.
  • seed: int, default=42. Controls numpy.random.seed

Returns: a numpy.ndarray dataset with n_features features and n_samples samples.
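
The three tuple shapes accepted by structure can be easier to read side by side. The snippet below is a minimal sketch in plain Python: the feature indices and values are made up for illustration, and describe_entry is a hypothetical helper, not part of the library. It only shows how each tuple shape would be interpreted according to the description above.

```python
# Illustrative `structure` value; indices and values are invented for this example.
structure = [
    (0, 5),                                  # feature 0: cardinality 5
    ([1, 2], 3),                             # features 1 and 2: cardinality 3 each
    (3, ["a", "b", "c"]),                    # feature 3: explicit value domain
    ([4, 5], [[0, 1, 2], [0.7, 0.2, 0.1]]),  # features 4 and 5: values and probabilities
]


def describe_entry(entry):
    """Classify one structure tuple by the shape of its second element.

    Hypothetical helper for illustration only; not part of the library.
    """
    indices, spec = entry
    indices = [indices] if isinstance(indices, int) else list(indices)
    if isinstance(spec, int):
        kind = "cardinality"
    elif len(spec) == 2 and all(isinstance(s, (list, tuple)) for s in spec):
        kind = "values+probabilities"
    else:
        kind = "values"
    return indices, kind


for entry in structure:
    print(describe_entry(entry))
```

A list like this would be passed as the structure argument of generate_data; when probabilities are given, they should sum to 1.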


CategoricalClassification._configure_generate_feature

CategoricalClassification._configure_generate_feature(feature_attributes, 
                                                      n_samples, 
                                                      ensure_rep=False, 
                                                      random_values=False, 
                                                      low=0, 
                                                      high=1000,
                                                      k=10)

Helper function that configures _generate_feature() with the proper parameters, based on feature_attributes.

  • feature_attributes: int or list or numpy.ndarray Attributes of feature. Can be just cardinality (int), value domain (list), or value domain and their respective probabilities (list).
  • n_samples: int Number of samples in dataset. Determines generated feature vector size.
  • ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
  • random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
  • low: int Sets lower bound of value domain of feature.
  • high: int Sets upper bound of value domain of feature. Only used when random_values is True.
  • k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. The higher the value, the narrower the peak.

Returns: a numpy.ndarray feature array.


CategoricalClassification._generate_feature

CategoricalClassification._generate_feature(size, 
                                            vec=None, 
                                            cardinality=5, 
                                            ensure_rep=False, 
                                            random_values=False, 
                                            low=0, 
                                            high=1000,
                                            k=10,
                                            p=None)

Generates a feature array of length size using numpy.random.choice. Called by CategoricalClassification.generate_data. If no probabilities array is given, the value density of the generated feature array will be roughly normal, with a randomly chosen peak. The peak is chosen from the value array.

  • size: int Length of generated feature array.
  • vec: list or numpy.ndarray, default=None List of feature values, value domain of feature.
  • cardinality: int, default=5 Cardinality of feature to use when generating its value domain. If vec is not None, vec is used instead.
  • ensure_rep: bool, default=False Control flag. If True, all possible values will appear in the feature array.
  • random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
  • low: int Sets lower bound of value domain of feature.
  • high: int Sets upper bound of value domain of feature. Only used when random_values is True.
  • k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. The higher the value, the narrower the peak.
  • p: list or numpy.ndarray, default=None Array of frequencies or probabilities. Must have the same length as vec.

Returns: a numpy.ndarray feature array.
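
To make the "roughly normal density with a randomly chosen peak" behaviour concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the Gaussian weighting formula is an assumption, random.choices stands in for numpy.random.choice, and peaked_feature is a hypothetical name. It does reflect the documented roles of k (larger means narrower) and ensure_rep (every value appears).

```python
import math
import random


def peaked_feature(size, values, k=10, ensure_rep=False, seed=42):
    """Sample `size` items from `values` with a bell-shaped density.

    Assumed weighting (not the library's exact formula): a Gaussian bump
    centred on a randomly chosen position in `values`; larger k = narrower.
    """
    rng = random.Random(seed)
    n = len(values)
    if ensure_rep and size < n:
        raise ValueError("ensure_rep needs size >= number of values")
    peak = rng.randrange(n)  # position of the most frequent value
    weights = [math.exp(-k * ((i - peak) / n) ** 2) for i in range(n)]
    sample = rng.choices(values, weights=weights, k=size)
    if ensure_rep:
        # Overwrite the first n draws so every value appears at least once,
        # then shuffle so the forced values are not clustered at the start.
        sample[:n] = values
        rng.shuffle(sample)
    return sample


feature = peaked_feature(1000, list(range(35)), k=10, ensure_rep=True)
print(len(feature), len(set(feature)))
```

Passing an explicit probability array instead of relying on the peak would correspond to the p parameter described above.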


CategoricalClassification.generate_combinations

CategoricalClassification.generate_combinations(X, 
                                                feature_indices, 
                                                combination_function=None, 
                                                combination_type='linear')

Generates and adds a new column to given dataset X. The column is the result of a combination of features selected with feature_indices. Combinations can be linear, nonlinear, or custom defined functions.

  • X: list or numpy.ndarray: Dataset to perform the combinations on.
  • feature_indices: list or numpy.ndarray: List of feature (column) indices to be combined.
  • combination_function: function, default=None: Custom or user-defined combination function. The function parameter must be a list or numpy.ndarray of features to be combined. The function must return a list or numpy.ndarray column or columns, to be added to given dataset X using numpy.column_stack.
  • combination_type: str either linear or nonlinear, default='linear': Selects which built-in combination type is used.
    • If 'linear', the combination is a sum of selected features.
    • If 'nonlinear', the combination is the sine value of the sum of selected features.

Returns: a numpy.ndarray dataset X with added feature combinations.
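
The built-in combination types and the custom-function hook can be sketched as follows. This is an illustration in plain Python lists rather than numpy arrays; combine is a hypothetical stand-in that mirrors the documented behaviour (sum for 'linear', sine of the sum for 'nonlinear', and a user function that receives the selected features), not the library's code.

```python
import math


def combine(X, feature_indices, combination_function=None, combination_type="linear"):
    """Append one combined column to X (a list of row lists).

    Sketch only: mirrors the documented behaviour without numpy.
    """
    # Gather the selected columns as a list of column vectors.
    columns = [[row[i] for row in X] for i in feature_indices]
    if combination_function is not None:
        new_col = combination_function(columns)
    else:
        sums = [sum(vals) for vals in zip(*columns)]
        if combination_type == "linear":
            new_col = sums
        elif combination_type == "nonlinear":
            new_col = [math.sin(s) for s in sums]
        else:
            raise ValueError(f"unknown combination_type: {combination_type!r}")
    return [row + [v] for row, v in zip(X, new_col)]


def product(cols):
    """Example custom combination: element-wise product of the columns."""
    return [math.prod(vals) for vals in zip(*cols)]


X = [[1, 2, 3], [4, 5, 6]]
print(combine(X, [0, 2]))                                # [[1, 2, 3, 4], [4, 5, 6, 10]]
print(combine(X, [0, 1], combination_function=product))  # [[1, 2, 3, 2], [4, 5, 6, 20]]
```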


CategoricalClassification._xor

CategoricalClassification._xor(arr)

Performs bitwise XOR on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_xor(a,b) on given columns in arr.


CategoricalClassification._and

CategoricalClassification._and(arr)

Performs bitwise AND on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_and(a,b) on given columns in arr.


CategoricalClassification._or

CategoricalClassification._or(arr)

Performs bitwise OR on given vectors and returns result.

  • arr: list or numpy.ndarray List of features to perform the combination on.

Returns: a numpy.ndarray result of numpy.bitwise_or(a,b) on given columns in arr.
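
The three bitwise helpers above share one pattern: apply a binary operator element by element across the given columns. The sketch below shows that pattern in plain Python with functools.reduce. Folding the operator over more than two columns is an assumption made for illustration, and bitwise_combine is a hypothetical name.

```python
from functools import reduce
from operator import and_, or_, xor


def bitwise_combine(arr, op):
    """Fold the binary operator `op` over the columns, element by element."""
    return [reduce(op, values) for values in zip(*arr)]


columns = [[0, 1, 1, 5], [0, 1, 0, 3], [1, 1, 0, 6]]
print(bitwise_combine(columns, xor))   # [1, 1, 1, 0]
print(bitwise_combine(columns, and_))  # [0, 1, 0, 0]
print(bitwise_combine(columns, or_))   # [1, 1, 1, 7]
```

A function with this shape could also be passed as combination_function to generate_combinations, since it accepts a list of features and returns a single column.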


CategoricalClassification.generate_correlated

CategoricalClassification.generate_correlated(X, 
                                              feature_indices, 
                                              r=0.8)

Generates and adds new columns to given dataset X, correlated to the selected features, by a Pearson correlation coefficient of r. For vectors with mean 0, their correlation equals the cosine of their angle.

  • X: list or numpy.ndarray: Dataset to add correlated features to.
  • feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to generate correlated features to.
  • r: float, default=0.8: Desired correlation coefficient.

Returns: a numpy.ndarray dataset X with added correlated features.
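
The remark about mean-zero vectors suggests a geometric construction: centre the feature, take a random direction orthogonal to it, and mix the two so that the cosine of the angle equals r. The following is a minimal sketch of that idea in plain Python, under the assumption that this is the intended construction; correlated_vector and the helper names are hypothetical. The result is real-valued, and how the library maps it back to categorical values is not covered here.

```python
import math
import random


def center(v):
    """Subtract the mean so the vector has mean zero."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def correlated_vector(x, r=0.8, seed=42):
    """Build a mean-zero vector whose Pearson correlation with x is r."""
    rng = random.Random(seed)
    xc = center(x)
    noise = center([rng.gauss(0, 1) for _ in x])
    # Remove the part of the noise that points along xc (Gram-Schmidt step).
    scale = dot(noise, xc) / dot(xc, xc)
    perp = [n - scale * a for n, a in zip(noise, xc)]
    x_norm, p_norm = math.sqrt(dot(xc, xc)), math.sqrt(dot(perp, perp))
    s = math.sqrt(1 - r * r)
    # Unit vector at angle arccos(r) from xc.
    return [r * a / x_norm + s * p / p_norm for a, p in zip(xc, perp)]


def pearson(a, b):
    ac, bc = center(a), center(b)
    return dot(ac, bc) / math.sqrt(dot(ac, ac) * dot(bc, bc))


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
y = correlated_vector(x, r=0.8)
print(round(pearson(x, y), 6))  # 0.8
```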


CategoricalClassification.generate_duplicates

CategoricalClassification.generate_duplicates(X, 
                                              feature_indices)

Duplicates selected feature (column) indices, and adds the duplicated columns to the given dataset X.

  • X: list or numpy.ndarray: Dataset whose features are duplicated.
  • feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to duplicate.

Returns: a numpy.ndarray dataset X with added duplicated features.


CategoricalClassification.generate_labels

CategoricalClassification.generate_labels(X, 
                                          n=2, 
                                          p=0.5, 
                                          k=2, 
                                          decision_function=None, 
                                          class_relation='linear', 
                                          balance=False,
                                          random_state=42,
                                          feature_vector=None)

Generates a vector of labels. Classes are assigned using a decision boundary produced by a linear, nonlinear, or custom-defined function of the features. Depending on class_relation, labels can also be derived by clustering the samples or from a given feature vector.

  • X: list or numpy.ndarray: Dataset to generate labels for.
  • n: int, default=2: Number of classes.
  • p: float or list, default=0.5: Class distribution.
  • k: int or float, default=2: Constant to be used in the linear or nonlinear combination used to set class values.
  • decision_function: function, default=None: Custom defined function to use for setting class values. Must accept dataset X as input and return a list or numpy.ndarray decision boundary.
  • class_relation: str, either 'linear', 'nonlinear', 'cluster', or 'feature-like', default='linear': Sets relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, by clustering the samples in X, or by generating a target vector based on a given feature.
  • balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.
  • random_state: int, default=42: Random state seed for KMeans clustering.
  • feature_vector: list or numpy.ndarray, default=None: Feature vector to base labels on.

Returns: numpy.ndarray y of class labels.
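
A custom decision_function only has to accept the dataset and return one value per sample. The sketch below shows such a function together with one possible way of turning its output into two classes according to p. Both pieces are hypothetical: the scoring rule is invented, and the thresholding step (lowest-scoring fraction p becomes class 0) is an assumption about how a decision boundary might be applied, not a description of the library's internals.

```python
def decision_function(X):
    """Hypothetical custom function: one score per sample (row)."""
    return [2 * row[0] - row[1] for row in X]


def labels_from_scores(scores, p=0.5):
    """Assumed thresholding step for n=2 classes.

    The lowest-scoring fraction p of samples gets class 0, the rest class 1.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    cutoff = round(p * len(scores))
    labels = [1] * len(scores)
    for i in order[:cutoff]:
        labels[i] = 0
    return labels


X = [[0, 3], [1, 1], [2, 0], [3, 5], [4, 1], [5, 2]]
scores = decision_function(X)
print(scores)                             # [-3, 1, 4, 1, 7, 8]
print(labels_from_scores(scores, p=0.5))  # [0, 0, 1, 0, 1, 1]
```

In the library itself, a function like decision_function would be passed through the decision_function parameter of generate_labels.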


CategoricalClassification._cluster_data

CategoricalClassification._cluster_data(X, 
                                        n, 
                                        p=1.0, 
                                        balance=False)

Clusters given data using KMeans clustering.

  • X: list or numpy.ndarray: Dataset to cluster.
  • n: int: Number of clusters.
  • p: float or list or numpy.ndarray: To be used when balance=True, sets class distribution - number of samples per cluster.
  • balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.

Returns: numpy.ndarray cluster_labels of clustering labels.


CategoricalClassification.generate_noise

CategoricalClassification.generate_noise(X, 
                                         y, 
                                         p=0.2, 
                                         type="categorical", 
                                         missing_val=float('-inf'))

Generates categorical noise or simulates missing data on a given dataset.

  • X: list or numpy.ndarray: Dataset to generate noise for.
  • y: list or numpy.ndarray: Labels of samples in dataset X. Required for generating categorical noise.
  • p: float, p <=1.0, default=0.2: Amount of noise to generate.
  • type: str, either "categorical" or "missing", default="categorical": Type of noise to generate.
  • missing_val: default=float('-inf'): Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.

Returns: numpy.ndarray X with added noise.
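
For the "missing" noise type, the idea can be sketched as replacing a share of cells with missing_val. The example below is an illustration in plain Python, not the library's code: add_missing is a hypothetical name, and treating p as an independent per-cell probability is an assumption. Categorical noise, which also depends on y, is not sketched here.

```python
import random


def add_missing(X, p=0.2, missing_val=float("-inf"), seed=42):
    """Return a copy of X with roughly a fraction p of cells replaced.

    Assumption: p is applied as an independent per-cell probability.
    """
    rng = random.Random(seed)
    return [[missing_val if rng.random() < p else v for v in row] for row in X]


X = [[1, 2, 3], [4, 5, 6]]
print(add_missing(X, p=0.5, missing_val=-1))
```

The original X is left unchanged, so noisy and clean versions can be compared side by side.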


CategoricalClassification.downsample_dataset

CategoricalClassification.downsample_dataset(X, 
                                             y, 
                                             n=None, 
                                             seed=42, 
                                             reshuffle=False)

Downsamples the given dataset to n samples per class or, if n is not given, to the number of samples in the minority class, resulting in a balanced dataset.

  • X: list or numpy.ndarray: Dataset to downsample.
  • y: list or numpy.ndarray: Labels corresponding to X.
  • n: int, default=None: Optional number of samples per class to downsample to.
  • seed: int, default=42: Seed for random state of resample function.
  • reshuffle: boolean, default=False: Reshuffle the dataset after downsampling.

Returns: Balanced, downsampled numpy.ndarray X and numpy.ndarray y.
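
The balancing step can be sketched as: group sample indices by class, pick the target size, and draw that many indices from each class without replacement. The version below uses only the standard library; downsample is a hypothetical stand-in, random.sample replaces the resample function mentioned above, and capping n at the minority class size is an assumption.

```python
import random
from collections import defaultdict


def downsample(X, y, n=None, seed=42, reshuffle=False):
    """Return a class-balanced subset of (X, y)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    # Default target: size of the smallest class.
    target = min(len(idx) for idx in by_class.values())
    if n is not None:
        target = min(n, target)  # assumption: n cannot exceed the minority size
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, target))
    if reshuffle:
        rng.shuffle(keep)
    else:
        keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[i] for i in range(10)]
y = [0] * 7 + [1] * 3
X_bal, y_bal = downsample(X, y)
print(y_bal.count(0), y_bal.count(1))  # 3 3
```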


CategoricalClassification.print_dataset

CategoricalClassification.print_dataset(X, y)

Prints given dataset in a readable format.

  • X: list or numpy.ndarray: Dataset to print.
  • y: list or numpy.ndarray: Class labels corresponding to samples in given dataset.

CategoricalClassification.summarize

CategoricalClassification.summarize()

Prints stored dataset information dictionary in a digestible manner.