This repository was archived by the owner on Jun 17, 2024. It is now read-only.
Perform the last action with the column name modified. #8
Open
Labels
bug: Something isn't working
Description
Describe the bug
After special characters in a column name of the input data are corrected, there is no step that restores the original column name.
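The renaming can be seen in isolation with a minimal sketch that applies the same regular expression and `rename` call as the generated pipeline shown later in this issue. It does not use SapientML itself; the two-row DataFrame is only an illustration.

```python
import re

import pandas as pd

# Same pattern as in the generated pipeline in this issue.
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")

# Illustrative data only: one plain column and one column with brackets.
df = pd.DataFrame({"a": [1, 2], "[y]": [1, 0]})
cols_has_symbols = ["[y]"]

# Strip the inhibited symbols from the listed columns, as the pipeline does.
renamed = df.rename(
    columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col
)

print(list(renamed.columns))  # ['a', 'y']
```

From this point on the name `[y]` no longer appears in the data, so anything written out afterwards uses `y`.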
To Reproduce
Steps to reproduce the behavior:
- Show your code calling generate_code().
script
import numpy as np
import pandas as pd
from sapientml import SapientML
df = pd.DataFrame({'a': [1,2]*10, 'b': ["moji"]*20, '[y]': [1,0]*10})
cls_ = SapientML(
    target_columns=['[y]'],
    task_type='classification',
)
cls_.fit(
    training_data=df,
)
- Attach the datasets or dataframes input to generate_code() if possible.
- Show the generated code such as 1_default.py when it was generated.
generated code
# GENERATED PIPELINE
# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"C:\work\workspace\sapientml\outputs\training.pkl")
# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state)
    return train_dataset, test_dataset
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)
# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['[y]'],
    task_type='classification'
)
test_dataset = validation_dataset
# Remove special symbols that interfere with visualization and model training
import re
cols_has_symbols = ['[y]']
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")
train_dataset = train_dataset.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col)
test_dataset = test_dataset.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col)
# DISCARD COLUMNS WITH ONE VALUE ONLY
cols_one_value_only = ['b']
train_dataset = train_dataset.drop(cols_one_value_only, axis=1, errors="ignore")
test_dataset = test_dataset.drop(cols_one_value_only, axis=1, errors="ignore")
# DETACH TARGET
TARGET_COLUMNS = ['y']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()
# MODEL
import numpy as np
from xgboost import XGBClassifier
random_state_model = 42
model = XGBClassifier(random_state=random_state_model, )
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
target_train = pd.DataFrame(label_encoder.fit_transform(target_train), columns=TARGET_COLUMNS)
model.fit(feature_train, target_train.values.ravel())
y_pred = model.predict(feature_test)
y_pred = label_encoder.inverse_transform(y_pred).reshape(-1, 1)
#EVALUATION
from sklearn import metrics
f1 = metrics.f1_score(target_test, y_pred, average='macro')
print('RESULT: F1 Score: ' + str(f1))
Expected behavior
The original column name should be restored before file output. At present, file output is performed with the modified column name.
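One possible way to get there, given only as a sketch and not as SapientML's actual fix, is to record the mapping when the symbols are stripped and apply the inverse mapping just before output. The helper names `sanitize_columns` and `restore_columns` are hypothetical and do not exist in SapientML.

```python
import re

import pandas as pd

# Same pattern as in the generated pipeline in this issue.
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")


def sanitize_columns(df, cols_has_symbols):
    """Hypothetical helper: strip symbols and remember how to undo it."""
    forward = {col: inhibited_symbol_pattern.sub("", col) for col in cols_has_symbols}
    restore = {new: old for old, new in forward.items()}
    return df.rename(columns=forward), restore


def restore_columns(df, restore):
    """Hypothetical helper: map sanitized names back to the original names."""
    return df.rename(columns=restore)


df = pd.DataFrame({"a": [1, 2], "[y]": [1, 0]})
sanitized, restore_map = sanitize_columns(df, ["[y]"])

# Training and prediction would use the sanitized names here.
predictions = sanitized[["y"]].copy()

# Restore the original names just before writing any output.
output = restore_columns(predictions, restore_map)
print(list(output.columns))  # ['[y]']
```

This sketch assumes that no two columns collapse to the same sanitized name; if they did, the inverse mapping would be ambiguous and would need separate handling.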
Environment (please complete the following information):
- OS: [e.g. Ubuntu 20.04]
- SapientML Version: [e.g. 2.3.4]