# Creates a simple dataset of 10 features and 10k samples, with every feature having a cardinality of 35
X = cc.generate_data(10,
    10000,
    cardinality=35,
    ensure_rep=True,
    random_values=True,
    low=0,
    high=40)
# Creates target labels via clustering
y = cc.generate_labels(X, n=2, class_relation='cluster')

print(CategoricalClassification.dataset_info)

Stores a formatted dictionary of the operations performed. CategoricalClassification.generate_data resets its contents; each subsequent function call adds information to it.

CategoricalClassification.generate_data(n_features,
    n_samples,
    cardinality=5,
    structure=None,
    ensure_rep=False,
    random_values=False,
    low=0,
    high=1000,
    k=10,
    seed=42)

Generates a dataset of shape (n_samples, n_features), based on the given parameters.
- n_features: int The number of features in a generated dataset.
- n_samples: int The number of samples in a generated dataset.
- cardinality: int, default=5. Sets the default cardinality of a generated dataset.
- structure: list or numpy.ndarray, default=None. Sets the structure of a generated dataset. Offers more control over feature value domains and value distributions. Follows the format [tuple, tuple, ...], where each tuple can be either:
  - (int or list, int): the first element is the index or list of indices of features; the second element is their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as the peak. The feature values will be integers in the range [0, second element of tuple].
  - (int or list, list): the first element is the index or list of indices of features. The second element offers two options:
    - list: a list of values to be used in the feature or features,
    - [list, list]: the first list is the set of values the feature or features possess; the second gives the frequencies or probabilities of the individual values.
- ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
- random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
- low: int Sets lower bound of value domain of feature.
- high: int Sets upper bound of value domain of feature. Only used when random_values is True.
- k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. Higher the value, narrower the peak.
- seed: int, default=42. Controls numpy.random.seed
Returns: a numpy.ndarray dataset with n_features features and n_samples samples.
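
To make the structure format more concrete, here is a small sketch of a list covering the documented tuple forms. The feature indices and values are illustrative assumptions, and describe_entry is a hypothetical helper written for this example only; it is not part of the library and merely reports which documented form an entry takes.

```python
# A `structure` list in the documented format. The indices, cardinalities,
# and values below are illustrative assumptions.
structure = [
    (0, 5),                                  # feature 0: cardinality 5
    ([1, 2], 12),                            # features 1 and 2: cardinality 12
    (3, [10, 20, 30]),                       # feature 3: explicit value domain
    ([4, 5], [[0, 1, 2], [0.7, 0.2, 0.1]]),  # features 4 and 5: values with probabilities
]


def describe_entry(entry):
    """Hypothetical helper: report which documented form an entry takes."""
    indices, spec = entry
    indices = [indices] if isinstance(indices, int) else list(indices)
    if isinstance(spec, int):
        return indices, "cardinality"
    # [list, list] form: exactly two equally long sequences (values, probabilities).
    if (len(spec) == 2
            and all(isinstance(part, (list, tuple)) for part in spec)
            and len(spec[0]) == len(spec[1])):
        return indices, "values+probabilities"
    return indices, "values"


for entry in structure:
    print(describe_entry(entry))
```

A list of this shape is what the structure parameter of generate_data expects, according to the description above.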
CategoricalClassification._feature_builder(feature_attributes,
    n_samples,
    ensure_rep=False,
    random_values=False,
    low=0,
    high=1000,
    k=10)

Helper function used to configure _generate_feature() with the proper parameters, based on feature_attributes.
- feature_attributes: int or list or numpy.ndarray Attributes of feature. Can be just cardinality (int), value domain (list), or value domain and their respective probabilities (list).
- n_samples: int Number of samples in dataset. Determines generated feature vector size.
- ensure_rep: bool, default=False: Control flag. If True, all possible values will appear in the feature.
- random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
- low: int Sets lower bound of value domain of feature.
- high: int Sets upper bound of value domain of feature. Only used when random_values is True.
- k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. Higher the value, narrower the peak.
Returns: a numpy.ndarray feature array.
CategoricalClassification._generate_feature(size,
    vec=None,
    cardinality=5,
    ensure_rep=False,
    random_values=False,
    low=0,
    high=1000,
    k=10,
    p=None)

Generates a feature array of length size by utilizing numpy.random.choice. Called by CategoricalClassification.generate_data. If no probabilities array is given, the value density of the generated feature array will be roughly normal, with a randomly chosen peak. The peak is chosen from the value array.
- size: int Length of generated feature array.
- vec: list or numpy.ndarray, default=None List of feature values, value domain of feature.
- cardinality: int, default=5 Cardinality of feature to use when generating its value domain. If vec is not None, vec is used instead.
- ensure_rep: bool, default=False Control flag. If True, all possible values will appear in the feature array.
- random_values: bool, default=False: Control flag. If True, value domain of feature will be random on interval [low, high].
- low: int Sets lower bound of value domain of feature.
- high: int Sets upper bound of value domain of feature. Only used when random_values is True.
- k: int or float, default=10. Constant, sets width of feature (normal) distribution peak. Higher the value, narrower the peak.
- p: list or numpy.ndarray, default=None Array of frequencies or probabilities. Must have the same length as vec.
Returns: a numpy.ndarray feature array.
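
The sampling idea described above can be sketched in plain Python, independent of the library. The weighting formula below is an assumption chosen only to reproduce the documented behaviour (a roughly bell-shaped density around a random peak that narrows as k grows); sample_feature is a hypothetical name, and the value domain is assumed to be the integers 0 to cardinality - 1.

```python
import math
import random


def sample_feature(size, cardinality=5, k=10, ensure_rep=False, seed=42):
    """Sketch: draw `size` integers from range(cardinality) with a bell-shaped density."""
    rng = random.Random(seed)
    values = list(range(cardinality))
    peak = rng.choice(values)  # randomly chosen peak, taken from the value domain
    # Assumed weighting: a Gaussian-like bump whose width shrinks as k grows.
    weights = [math.exp(-k * ((v - peak) / cardinality) ** 2) for v in values]
    if ensure_rep and size >= cardinality:
        # Place every value once, then fill the remainder by weighted sampling.
        feature = values + rng.choices(values, weights=weights, k=size - cardinality)
        rng.shuffle(feature)
        return feature
    return rng.choices(values, weights=weights, k=size)


feature = sample_feature(1000, cardinality=35, ensure_rep=True)
print(len(feature), len(set(feature)))  # 1000 35
```

With ensure_rep=True every value of the domain appears at least once, which is why the number of distinct values equals the cardinality.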
CategoricalClassification.generate_combinations(X,
    feature_indices,
    combination_function=None,
    combination_type='linear')

Generates and adds a new column to the given dataset X. The column is the result of a combination of the features selected with feature_indices. Combinations can be linear, nonlinear, or custom-defined functions.
- X: list or numpy.ndarray: Dataset to perform the combinations on.
- feature_indices: list or numpy.ndarray: List of feature (column) indices to be combined.
- combination_function: function, default=None: Custom or user-defined combination function. The function parameter must be a list or numpy.ndarray of features to be combined. The function must return a list or numpy.ndarray column or columns, to be added to given dataset X using numpy.column_stack.
- combination_type: str, either 'linear' or 'nonlinear', default='linear': Selects which built-in combination type is used.
  - If 'linear', the combination is a sum of selected features.
  - If 'nonlinear', the combination is the sine value of the sum of selected features.
Returns: a numpy.ndarray dataset X with added feature combinations.
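
As an illustration of the two built-in combination types and of a custom combination function, here is a minimal plain-Python sketch. The combine helper is hypothetical and works on lists of columns; the library itself operates on numpy arrays and appends the result to X.

```python
import math


def combine(columns, combination_function=None, combination_type="linear"):
    """Sketch: combine equally long feature columns into one new column."""
    if combination_function is not None:
        return list(combination_function(columns))
    sums = [sum(row) for row in zip(*columns)]  # element-wise sum across columns
    if combination_type == "linear":
        return sums
    if combination_type == "nonlinear":
        return [math.sin(s) for s in sums]
    raise ValueError(f"unknown combination_type: {combination_type!r}")


f0 = [1, 2, 3]
f1 = [4, 5, 6]
print(combine([f0, f1]))  # [5, 7, 9]
# Custom function: element-wise product of the selected columns.
product = lambda cols: [a * b for a, b in zip(*cols)]
print(combine([f0, f1], combination_function=product))  # [4, 10, 18]
```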
CategoricalClassification._xor(arr)

Performs bitwise XOR on the given vectors and returns the result.
- arr: list or numpy.ndarray List of features to perform the combination on.
Returns: a numpy.ndarray result of numpy.bitwise_xor(a,b) on given columns in arr.
CategoricalClassification._and(arr)

Performs bitwise AND on the given vectors and returns the result.
- arr: list or numpy.ndarray List of features to perform the combination on.
Returns: a numpy.ndarray result of numpy.bitwise_and(a,b) on given columns in arr.
CategoricalClassification._or(arr)

Performs bitwise OR on the given vectors and returns the result.
- arr: list or numpy.ndarray List of features to perform the combination on.
Returns: a numpy.ndarray result of numpy.bitwise_or(a,b) on given columns in arr.
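
The three bitwise helpers share the same pattern, which can be sketched without numpy as follows. How the library treats more than two columns is not stated here, so folding the operator from left to right is an assumption, and bitwise_columns is a hypothetical name.

```python
from functools import reduce
from operator import and_, or_, xor


def bitwise_columns(op, columns):
    """Sketch: fold a bitwise operator element-wise across integer columns."""
    return [reduce(op, row) for row in zip(*columns)]


a = [0b1100, 0b1010, 7]
b = [0b1010, 0b0110, 2]
print(bitwise_columns(xor, [a, b]))   # [6, 12, 5]
print(bitwise_columns(and_, [a, b]))  # [8, 2, 2]
print(bitwise_columns(or_, [a, b]))   # [14, 14, 7]
```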
CategoricalClassification.generate_correlated(X,
    feature_indices,
    r=0.8)

Generates and adds new columns to the given dataset X, correlated to the selected features with a Pearson correlation coefficient of r. For vectors with mean 0, their correlation equals the cosine of the angle between them.
- X: list or numpy.ndarray: Dataset to add correlated features to.
- feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to generate correlated features to.
- r: float, default=0.8: Desired correlation coefficient.
Returns: a numpy.ndarray dataset X with added correlated features.
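
The cosine property mentioned above suggests one way to construct a vector with a prescribed correlation: centre the feature, build a centred noise vector orthogonal to it, and mix the two unit vectors at an angle whose cosine is r. The sketch below follows that idea in plain Python; it is not necessarily how the library does it, the helper names are hypothetical, and the result is real-valued rather than categorical.

```python
import math
import random


def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pearson(u, v):
    cu, cv = center(u), center(v)
    return dot(cu, cv) / math.sqrt(dot(cu, cu) * dot(cv, cv))


def correlated_to(x, r=0.8, seed=42):
    """Sketch: return a vector whose Pearson correlation with x is r."""
    rng = random.Random(seed)
    cx = center(x)
    noise = center([rng.gauss(0, 1) for _ in x])
    # Remove the component of the noise that lies along cx (Gram-Schmidt step).
    scale = dot(noise, cx) / dot(cx, cx)
    ortho = [n - scale * c for n, c in zip(noise, cx)]
    nx, no = math.sqrt(dot(cx, cx)), math.sqrt(dot(ortho, ortho))
    # Mix two orthogonal unit vectors so that the cosine of the angle to cx is r.
    return [r * c / nx + math.sqrt(1 - r * r) * o / no for c, o in zip(cx, ortho)]


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
y = correlated_to(x, r=0.8)
print(round(pearson(x, y), 6))
```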
CategoricalClassification.generate_duplicates(X,
    feature_indices)

Duplicates the selected features (columns) and adds the duplicated columns to the given dataset X.
- X: list or numpy.ndarray: Dataset to add duplicated features to.
- feature_indices: int or list or numpy.ndarray: Index of feature (column) or list of feature (column) indices to duplicate.
Returns: a numpy.ndarray dataset X with added duplicated features.
CategoricalClassification.generate_nonlinear_labels(X,
    n=2,
    p=0.5,
    k=2,
    decision_function=None,
    class_relation='linear',
    balance=False,
    random_state=42,
    feature_vector=None)

Generates a vector of labels. Labels are (currently) generated from a linear, nonlinear, or custom-defined function, which provides the decision boundary used to assign classes.
- X: list or numpy.ndarray: Dataset to generate labels for.
- n: int, default=2: Number of classes.
- p: float or list, default=0.5: Class distribution.
- k: int or float, default=2: Constant to be used in the linear or nonlinear combination used to set class values.
- decision_function: function, default=None: Custom-defined function to use for setting class values. Must accept dataset X as input and return a list or numpy.ndarray decision boundary.
- class_relation: str, either 'linear', 'nonlinear', 'cluster', or 'feature-like', default='linear': Sets relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, by clustering the samples in X, or by generating a target vector based on a given feature.
- balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.
- random_state: int, default=42: Random state seed for KMeans clustering.
- feature_vector: list or numpy.ndarray: Feature vector to base labels on.
Returns: numpy.ndarray y of class labels.
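
To illustrate the decision-boundary idea for the two-class case, the following sketch scores every sample with a decision function and splits the scores at a cutoff derived from p. The default scoring (row sum) and the thresholding rule are assumptions made for this example, and labels_from_decision is a hypothetical name; the library's actual rules may differ.

```python
def labels_from_decision(X, decision_function=None, p=0.5):
    """Sketch: two-class labels from a per-sample decision value."""
    if decision_function is None:
        # Assumed default: a plain linear combination (row sum).
        decision_function = lambda rows: [sum(row) for row in rows]
    scores = list(decision_function(X))
    # Assumed thresholding: roughly the lowest-scoring fraction p gets class 0.
    cutoff_index = min(int(p * len(scores)), len(scores) - 1)
    cutoff = sorted(scores)[cutoff_index]
    return [0 if s < cutoff else 1 for s in scores]


X = [[0, 1], [2, 2], [5, 1], [3, 3]]
print(labels_from_decision(X))  # [0, 0, 1, 1]
# Custom decision function: score each sample by its first feature only.
first_feature = lambda rows: [row[0] for row in rows]
print(labels_from_decision(X, decision_function=first_feature))  # [0, 0, 1, 1]
```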
CategoricalClassification._cluster_data(X,
    n,
    p=1.0,
    balance=False)

Clusters the given data using KMeans clustering.
- X: list or numpy.ndarray: Dataset to cluster.
- n: int: Number of clusters.
- p: float or list or numpy.ndarray: To be used when balance=True, sets class distribution - number of samples per cluster.
- balance: boolean, default=False: Whether to naively balance clusters generated by KMeans clustering.
Returns: numpy.ndarray cluster_labels of clustering labels.
CategoricalClassification.generate_noise(X,
    y,
    p=0.2,
    type="categorical",
    missing_val=float('-inf'))

Generates categorical noise or simulates missing data on a given dataset.
- X: list or numpy.ndarray: Dataset to generate noise for.
- y: list or numpy.ndarray: Labels of samples in dataset X. Required for generating categorical noise.
- p: float, p <=1.0, default=0.2: Amount of noise to generate.
- type: str, either "categorical" or "missing", default="categorical": Type of noise to generate.
- missing_val: default=float('-inf'): Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.
Returns: numpy.ndarray X with added noise.
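
The "missing" noise type can be pictured with the plain-Python sketch below, which overwrites a fraction p of the cells with missing_val. Treating p as a fraction of all cells (rather than of samples) is an assumption, and add_missing is a hypothetical helper, not the library function. Categorical noise, which depends on y, is not covered here.

```python
import random


def add_missing(X, p=0.2, missing_val=float("-inf"), seed=42):
    """Sketch: replace a fraction p of all cells with missing_val."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(X)) for j in range(len(X[0]))]
    noisy = [list(row) for row in X]  # copy so that X itself is untouched
    for i, j in rng.sample(cells, int(p * len(cells))):
        noisy[i][j] = missing_val
    return noisy


X = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
noisy = add_missing(X, p=0.2)
print(sum(v == float("-inf") for row in noisy for v in row))  # 2
```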
CategoricalClassification.downsample_dataset(X,
    y,
    n=None,
    seed=42,
    reshuffle=False)

Downsamples the given dataset according to n or, if n is not given, the number of samples in the minority class, resulting in a balanced dataset.
- X: list or numpy.ndarray: Dataset to downsample.
- y: list or numpy.ndarray: Labels corresponding to X.
- n: int, default=None: Optional number of samples per class to downsample to.
- seed: int, default=42: Seed for random state of resample function.
- reshuffle: boolean, default=False: Reshuffle the dataset after downsampling.
Returns: Balanced, downsampled numpy.ndarray X and numpy.ndarray y.
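
Balanced downsampling can be sketched in a few lines of plain Python. The version below is an illustration only: downsample is a hypothetical name, and capping n at the minority class size (so the result stays balanced) is an assumption rather than documented behaviour.

```python
import random
from collections import defaultdict


def downsample(X, y, n=None, seed=42, reshuffle=False):
    """Sketch: keep the same number of samples from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    minority = min(len(indices) for indices in by_class.values())
    # Assumption: n is capped at the minority class size to keep classes balanced.
    per_class = minority if n is None else min(n, minority)
    keep = []
    for label in sorted(by_class):
        keep.extend(rng.sample(by_class[label], per_class))
    if reshuffle:
        rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[i] for i in range(6)]
y = [0, 0, 0, 0, 1, 1]
X_ds, y_ds = downsample(X, y)
print(sorted(y_ds))  # [0, 0, 1, 1]
```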
CategoricalClassification.print_dataset(X, y)

Prints the given dataset in a readable format.
- X: list or numpy.ndarray: Dataset to print.
- y: list or numpy.ndarray: Class labels corresponding to samples in given dataset.
CategoricalClassification.summarize()

Prints the stored dataset information dictionary in a digestible manner.