Add the new feature of customized initial population #1352

t-harden · 2024-06-29T09:05:28Z

What does this PR do?

Adds a new feature that lets users specify a customized initial pipeline population for TPOT, via the new customized_initial_population parameter.

Where should the reviewer start?

tpot/tests/test_custom_iniPop.py and tpot/tpot/base.py

How should this PR be tested?

The test code is at tpot/tests/test_custom_iniPop.py:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

individual_str1 = 'MultinomialNB(input_matrix, MultinomialNB__alpha=0.1, MultinomialNB__fit_prior=True)'
individual_str2 = 'GaussianNB(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=4, DecisionTreeClassifier__min_samples_leaf=17, DecisionTreeClassifier__min_samples_split=13))'
individual_str3 = 'GaussianNB(SelectFwe(CombineDFs(input_matrix, ZeroCount(input_matrix))))'

est = TPOTClassifier(generations=3, population_size=5, verbosity=2, random_state=42, config_dict=None,
                     customized_initial_population=[individual_str1, individual_str2, individual_str3])
est.fit(X_train, y_train)
print(est.score(X_test, y_test))

You can test it by:

cd tpot
nosetests tests/test_custom_iniPop.py -s

Any background context you want to provide?

With this change, users can supply a well-defined initial pipeline population themselves, as a list of pipeline strings. Seeding the search this way has the potential to improve the algorithm's performance and reduce evolution time.

A few tips:

1. These string pipelines can be obtained in two ways:

Start from the examples in test_custom_iniPop.py and modify them according to TPOT's config_dict.
Take the keys of self.evaluated_individuals_ from a previous TPOT run. This is particularly useful for building initial pipelines that give the evolution a good starting point.
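
As a minimal sketch of the second approach, the helper below picks the best-scoring pipeline strings from a dictionary shaped like evaluated_individuals_. The helper name top_pipeline_strings is hypothetical, the example dictionary and its scores are hand-written placeholders rather than real results, and the 'internal_cv_score' field is an assumption about how each entry stores its score.

```python
def top_pipeline_strings(evaluated_individuals, k):
    """Return the k pipeline strings with the highest internal CV score."""
    # Assumption: each value is a dict holding an 'internal_cv_score' entry.
    scored = [
        (info["internal_cv_score"], pipeline_str)
        for pipeline_str, info in evaluated_individuals.items()
    ]
    # Best score first; ties are broken by the pipeline string for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pipeline_str for _, pipeline_str in scored[:k]]


# Hand-written stand-in for est.evaluated_individuals_ (placeholder scores).
example_evaluated = {
    "GaussianNB(input_matrix)": {"internal_cv_score": 0.81},
    "MultinomialNB(input_matrix, MultinomialNB__alpha=0.1, MultinomialNB__fit_prior=True)": {"internal_cv_score": 0.90},
    "BernoulliNB(input_matrix, BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=False)": {"internal_cv_score": 0.85},
}
seed_pipelines = top_pipeline_strings(example_evaluated, k=2)
```

The resulting list could then be passed as customized_initial_population, provided its length does not exceed population_size.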

2. The number of customized initial pipelines must not exceed the population size. If fewer are supplied, the remaining slots are filled with randomly generated pipelines:

# Check that #customized initial pipelines <= #population
if len(iniPop) <= self.population_size:
    for _ in range(self.population_size - len(iniPop)):
        individual_rand = self._toolbox.individual()
        iniPop.append(individual_rand)
    print(len(customized_initial_population), "customized pipelines +", self.population_size - len(customized_initial_population), "randomized pipelines as initial population.")
else:
    raise Exception("the number of customized initial pipelines > the number of population size!")
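
The same padding rule can be sketched as a standalone function, which makes it easy to check outside TPOT. This is an illustration under stated assumptions, not the code in base.py: build_initial_population and make_random_individual are hypothetical names (the latter stands in for self._toolbox.individual), and the sketch raises ValueError where the PR raises a plain Exception.

```python
def build_initial_population(custom_individuals, population_size, make_random_individual):
    """Pad the user-supplied individuals with random ones up to population_size."""
    n_custom = len(custom_individuals)
    if n_custom > population_size:
        # Assumption: ValueError is used here; the PR raises a generic Exception.
        raise ValueError(
            "number of customized initial pipelines (%d) exceeds population size (%d)"
            % (n_custom, population_size)
        )
    # Copy so the caller's list is not mutated while padding.
    population = list(custom_individuals)
    n_random = population_size - n_custom
    for _ in range(n_random):
        population.append(make_random_individual())
    return population, n_custom, n_random


# Usage with a stub generator in place of the real toolbox.
population, n_custom, n_random = build_initial_population(
    ["pipeline_a", "pipeline_b", "pipeline_c"], 5, lambda: "random_pipeline"
)
```

Counting the customized pipelines before padding keeps the reported split unambiguous even when the input list and the padded list are the same object.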

3. In this version, the configurations (i.e., operators and parameters) used in customized initial pipelines must be a subset of those specified by the config_dict parameter. This limitation can either be noted in the documentation or addressed in a follow-up change if you agree with this PR.
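
A rough pre-check for this constraint could look like the sketch below. It is not part of the PR: the function names are hypothetical, it only compares operator names (not parameter values), it assumes config_dict keys are dotted import paths whose last component is the operator name, and it assumes CombineDFs is available regardless of config_dict.

```python
import re


def operator_names(pipeline_str):
    """Collect every identifier that is immediately followed by '('."""
    return set(re.findall(r"([A-Za-z_]\w*)\(", pipeline_str))


def operators_missing_from_config(pipeline_str, config_dict, builtin_operators=("CombineDFs",)):
    """Return the operators used in pipeline_str that config_dict does not list."""
    # Assumption: keys look like 'sklearn.naive_bayes.GaussianNB'.
    allowed = {key.rsplit(".", 1)[-1] for key in config_dict}
    # Assumption: these operators are always available, whatever config_dict says.
    allowed.update(builtin_operators)
    return sorted(operator_names(pipeline_str) - allowed)


# Illustrative config with empty parameter grids (hypothetical, not TPOT's default).
example_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "tpot.builtins.ZeroCount": {},
}
missing = operators_missing_from_config(
    "GaussianNB(SelectFwe(CombineDFs(input_matrix, ZeroCount(input_matrix))))",
    example_config,
)
```

An empty result means every operator in the string appears in the config; parameter values would still need a separate check.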

What are the relevant issues?

#1321

Questions:

Do the docs need to be updated?
Yes
Does this PR add new (Python) dependencies?
No

add the feature of customized initial population

b90d1fc

t-harden marked this pull request as draft June 29, 2024 11:44

t-harden changed the base branch from development to master June 29, 2024 11:45

t-harden marked this pull request as ready for review June 29, 2024 11:47
