@t-harden

What does this PR do?

Adds a feature that lets users specify a customized initial pipeline population for TPOT.

Where should the reviewer start?

tpot/tests/test_custom_iniPop.py and tpot/tpot/base.py

How should this PR be tested?

The test code is at tpot/tests/test_custom_iniPop.py:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

individual_str1 = 'MultinomialNB(input_matrix, MultinomialNB__alpha=0.1, MultinomialNB__fit_prior=True)'
individual_str2 = 'GaussianNB(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=4, DecisionTreeClassifier__min_samples_leaf=17, DecisionTreeClassifier__min_samples_split=13))'
individual_str3 = 'GaussianNB(SelectFwe(CombineDFs(input_matrix, ZeroCount(input_matrix))))'

est = TPOTClassifier(generations=3, population_size=5, verbosity=2, random_state=42, config_dict=None,
                     customized_initial_population=[individual_str1, individual_str2, individual_str3])
est.fit(X_train, y_train)
print(est.score(X_test, y_test))

You can test it by:

cd tpot
nosetests tests/test_custom_iniPop.py -s

Any background context you want to provide?

With this change, users can supply a well-defined initial pipeline population themselves, as a list of pipeline strings. Seeding the search this way has the potential to improve the algorithm's performance and shorten the evolutionary run.

Several Tips:

1. These string pipelines can be obtained in two ways:

  • Start from the examples in test_custom_iniPop.py and adapt them so they match TPOT's config_dict.
  • Take the keys of self.evaluated_individuals_ from a previous TPOT run. This is particularly useful for building good initial pipelines that give the next run a better starting point.
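
A minimal sketch of the second approach, for illustration only. It assumes each value in evaluated_individuals_ is a dict that carries an `internal_cv_score` entry; `top_pipeline_strings` is a hypothetical helper, not part of TPOT, and the scores below are made up.

```python
def top_pipeline_strings(evaluated_individuals, k=3):
    """Return the k pipeline strings with the highest internal CV score.

    `evaluated_individuals` is expected to look like TPOT's
    `evaluated_individuals_`: pipeline string -> dict of metadata.
    The 'internal_cv_score' key is an assumption about that metadata.
    """
    ranked = sorted(
        evaluated_individuals.items(),
        key=lambda item: item[1]["internal_cv_score"],
        reverse=True,
    )
    return [pipeline_str for pipeline_str, _ in ranked[:k]]


# Stand-in for est.evaluated_individuals_ from an earlier fitted run.
# The scores are invented for this example.
example_individuals = {
    "GaussianNB(input_matrix)": {"internal_cv_score": 0.81},
    "MultinomialNB(input_matrix, MultinomialNB__alpha=0.1, MultinomialNB__fit_prior=True)": {
        "internal_cv_score": 0.90
    },
    "BernoulliNB(input_matrix, BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=False)": {
        "internal_cv_score": 0.85
    },
}

seeds = top_pipeline_strings(example_individuals, k=2)
print(seeds)
```

The resulting list could then be passed as customized_initial_population, provided every operator and parameter in it is still covered by the config_dict of the new run.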

2. The number of customized initial pipelines must not exceed population_size. Any remaining slots are filled with randomly generated pipelines:

"check if #customized initial pipelines <= #population"
if len(iniPop) <= self.population_size:
    for _ in range(self.population_size - len(iniPop)):
        individual_rand = self._toolbox.individual()
        iniPop.append(individual_rand)
    print(len(customized_initial_population), "customized pipelines +", self.population_size - len(customized_initial_population), "randomized pipelines as initial population.")
else:
    raise Exception("the number of customized initial pipelines > the number of population size!")
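
The same rule as a standalone sketch that can be run without TPOT. `pad_initial_population` and `make_random` are hypothetical names used only here; `make_random` stands in for self._toolbox.individual, and ValueError is used instead of a bare Exception as a suggestion.

```python
def pad_initial_population(custom, population_size, make_random):
    """Fill the population with random individuals after the custom ones.

    Raises ValueError when more custom pipelines are supplied than the
    population can hold.
    """
    if len(custom) > population_size:
        raise ValueError(
            "the number of customized initial pipelines exceeds population_size"
        )
    n_random = population_size - len(custom)
    # Custom pipelines come first, random ones fill the remaining slots.
    return list(custom) + [make_random() for _ in range(n_random)]


# With 3 custom pipelines and population_size=5, 2 random ones are added.
population = pad_initial_population(
    ["custom_1", "custom_2", "custom_3"],
    population_size=5,
    make_random=lambda: "random",
)
print(population)
```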

3. In this version, the configuration (i.e., the operators and parameters) of each customized initial pipeline must be a subset of what the config_dict parameter specifies. This limitation can be noted in the documentation, or addressed in a follow-up if you agree with this PR.
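
A rough pre-check for this constraint could look like the sketch below. It is not part of this PR: `unknown_operators` is a hypothetical helper, it compares operator names only (not parameter values), and it assumes that config_dict keys are dotted import paths ending in the operator name and that CombineDFs is always available.

```python
import re

# Assumption: CombineDFs is provided by TPOT itself rather than by config_dict.
BUILT_IN_OPERATORS = {"CombineDFs"}


def unknown_operators(pipeline_str, config_dict):
    """Return operator names used in pipeline_str but absent from config_dict."""
    # 'sklearn.naive_bayes.GaussianNB' -> 'GaussianNB'
    allowed = {key.rsplit(".", 1)[-1] for key in config_dict} | BUILT_IN_OPERATORS
    # An operator is any identifier immediately followed by '('.
    used = set(re.findall(r"([A-Za-z_]\w*)\(", pipeline_str))
    return used - allowed


# Small illustrative config; a real config_dict would list parameter grids.
example_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.tree.DecisionTreeClassifier": {},
}

missing = unknown_operators(
    "GaussianNB(SelectFwe(CombineDFs(input_matrix, ZeroCount(input_matrix))))",
    example_config,
)
print(sorted(missing))
```

An empty result does not guarantee that the pipeline is accepted, since parameter values are not checked, but a non-empty one points to operators that would need to be added to config_dict.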

What are the relevant issues?

#1321

Questions:

  • Do the docs need to be updated?
    Yes
  • Does this PR add new (Python) dependencies?
    No

@t-harden t-harden marked this pull request as draft June 29, 2024 11:44
@t-harden t-harden changed the base branch from development to master June 29, 2024 11:45
@t-harden t-harden marked this pull request as ready for review June 29, 2024 11:47