This repository was archived by the owner on Jun 17, 2024. It is now read-only.

Perform the last action with the column name modified. #8

@tashiro-akira


Describe the bug
When a column name in the input data contains special characters, the generated code strips those characters from the name, but no later step restores the original column name.
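
The renaming can be seen in isolation with the same pattern and rename call that appear in the generated code further down. This is only a minimal sketch: the two-row frame is illustrative and is not the data from the report.

```python
import re

import pandas as pd

# Pattern and rename logic copied from the generated pipeline in this report.
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")
cols_has_symbols = ["[y]"]

# Illustrative frame (an assumption, not the reporter's dataset).
df = pd.DataFrame({"a": [1, 2], "[y]": [1, 0]})

renamed = df.rename(
    columns=lambda col: inhibited_symbol_pattern.sub("", col)
    if col in cols_has_symbols
    else col
)

print(list(df.columns))       # ['a', '[y]']
print(list(renamed.columns))  # ['a', 'y']
```

After this step nothing in the pipeline refers to '[y]' again, so everything downstream sees 'y'.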

To Reproduce
Steps to reproduce the behavior:

  1. Show your code calling generate_code().
script
import numpy as np
import pandas as pd
from sapientml import SapientML

df = pd.DataFrame({'a': [1,2]*10, 'b': ["moji"]*20, '[y]': [1,0]*10})
cls_ = SapientML(
    target_columns=['[y]'],
    task_type='classification',
)
cls_.fit(
    training_data=df,
)
  2. Attach the datasets or dataframes input to generate_code() if possible.
  3. Show the generated code such as 1_default.py when it was generated.
generated code
#  GENERATED PIPELINE

# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"C:\work\workspace\sapientml\outputs\training.pkl")

# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state)
    return train_dataset, test_dataset	
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)

# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['[y]'],
    task_type='classification'
)

test_dataset = validation_dataset

# Remove special symbols that interfere with visualization and model training
import re
cols_has_symbols = ['[y]']
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")
train_dataset = train_dataset.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col)
test_dataset = test_dataset.rename(columns=lambda col: inhibited_symbol_pattern.sub("", col) if col in cols_has_symbols else col)

# DISCARD COLUMNS WITH ONE VALUE ONLY
cols_one_value_only = ['b']
train_dataset = train_dataset.drop(cols_one_value_only, axis=1, errors="ignore")
test_dataset = test_dataset.drop(cols_one_value_only, axis=1, errors="ignore")


# DETACH TARGET
TARGET_COLUMNS = ['y']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()

# MODEL
import numpy as np
from xgboost import XGBClassifier
random_state_model = 42
model = XGBClassifier(random_state=random_state_model, )
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
target_train = pd.DataFrame(label_encoder.fit_transform(target_train), columns=TARGET_COLUMNS)
model.fit(feature_train, target_train.values.ravel())
y_pred = model.predict(feature_test)
y_pred = label_encoder.inverse_transform(y_pred).reshape(-1, 1)

#EVALUATION
from sklearn import metrics
f1 = metrics.f1_score(target_test, y_pred, average='macro')
print('RESULT: F1 Score: ' + str(f1))

Expected behavior
The original column names should be restored before file output. Currently, file output is performed with the modified column name ('y' instead of '[y]').
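
One possible way to get this behavior is to record the mapping at the time of renaming and apply it in reverse just before output. The sketch below is not SapientML's implementation: `sanitize_columns`, `restore_map`, and the `prediction` frame are hypothetical names introduced for illustration only.

```python
import re

import pandas as pd

# Same pattern as in the generated pipeline.
inhibited_symbol_pattern = re.compile(r"[\{\}\[\]\",:<'\\]+")


def sanitize_columns(df, cols_has_symbols):
    """Hypothetical helper: strip inhibited symbols and remember the originals.

    Returns the renamed frame and a dict mapping sanitized name -> original name.
    """
    rename_map = {
        col: inhibited_symbol_pattern.sub("", col)
        for col in cols_has_symbols
        if col in df.columns
    }
    restore_map = {new: old for old, new in rename_map.items()}
    return df.rename(columns=rename_map), restore_map


# Illustrative input (an assumption, not the reporter's dataset).
df = pd.DataFrame({"a": [1, 2], "[y]": [1, 0]})
clean, restore_map = sanitize_columns(df, ["[y]"])

# ... training and prediction would run on `clean` here ...
prediction = pd.DataFrame({"y": [1, 0]})  # stand-in for model output

# Restore the original column name just before writing results out.
restored = prediction.rename(columns=restore_map)
print(list(restored.columns))  # ['[y]']
```

A reverse mapping like this is only unambiguous when no two original names collapse to the same sanitized name (for example '[y]' and '{y}'); a real fix would need to detect or prevent that case.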

Environment (please complete the following information):

  • OS: [e.g. Ubuntu 20.04]
  • SapientML Version: [e.g. 2.3.4]

Labels: bug
