Conversation

@MashAliK (Contributor) commented Jun 9, 2025

Adds parallel iterations along with a couple of bug fixes and improvements. I consider this a pretty important feature because of the significant speedup it provides to training. This is the implementation I tried, and it has worked for my use cases.

Primary changes:

  • Use concurrent.futures in controller.py to spawn worker processes that perform the training iterations (see the sketch after this list)
  • Moved most of the iteration logic into the function run_iteration_sync in a new file, iteration.py, which the workers run to allow concurrent execution
  • Pass the config into the spawned processes to initialize new instances of LLMEnsemble and ProgramDatabase (these classes are not pickleable, so this is the best approach I could think of for using them in each worker process)
  • As a consequence of this approach, a snapshot of the database needs to be saved by the root process every iteration, and each new process needs to load it
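
Roughly, the control flow looks like the sketch below. This is a simplified illustration only; the signatures, the snapshot handling, and the run_iterations_in_parallel wrapper are placeholders, not the exact code in controller.py or iteration.py:

  from concurrent.futures import ProcessPoolExecutor

  from openevolve.iteration import run_iteration_sync  # new worker entry point (module path assumed)

  def run_iterations_in_parallel(config, database, num_iterations, num_workers, snapshot_dir):
      for batch_start in range(0, num_iterations, num_workers):
          # ProgramDatabase is not pickleable, so save a snapshot that each
          # worker process reloads instead of passing the object directly.
          database.save(snapshot_dir)

          with ProcessPoolExecutor(max_workers=num_workers) as executor:
              futures = [
                  # Each worker rebuilds LLMEnsemble and ProgramDatabase from
                  # the config and the snapshot, then runs one iteration.
                  executor.submit(run_iteration_sync, config, snapshot_dir, batch_start + i)
                  for i in range(num_workers)
              ]
              results = [future.result() for future in futures]

          # Fold the programs produced by the workers back into the root database.
          for child_program in results:
              if child_program is not None:
                  database.add(child_program)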

Minor changes:

  • The calculate_edit_distance function was crashing the database when I used it, and since there are already libraries for this routine I ended up using one of them (Levenshtein); see the sketch after this list
  • Replaced the use of the Levenshtein distance with a ratio in _calculate_island_diversity since it's easier to read
  • Replaced the use of the Levenshtein distance with a ratio in _calculate_feature_coords since it's normalized for code length
  • Introduced allowed_population_overflow since otherwise the database was adding and removing a program every iteration once it reached the allowed program limit
  • Added logging for MAP-Elites features
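
A minimal sketch of the ratio-based comparison (assuming the Levenshtein package; the helper names are illustrative, not the actual ones in database.py):

  import Levenshtein

  def _code_similarity(code_a: str, code_b: str) -> float:
      # ratio() is normalized to [0, 1] regardless of code length, which is why
      # it replaces the raw edit distance in the diversity and feature-coordinate
      # calculations.
      return Levenshtein.ratio(code_a, code_b)

  def _code_diversity(code_a: str, code_b: str) -> float:
      # Higher value means the two programs are more different.
      return 1.0 - Levenshtein.ratio(code_a, code_b)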

@MashAliK (Contributor, Author) commented Jun 9, 2025

Solves: #32
I believe it also contains the solution to #60, because I remember that moving the call to _enforce_population_limit later in add fixed an error where a program was removed from the list while it was still being added.
Taking a closer look, this bug looks different than the one I saw (I encountered mine when calling add, not sample).

@codelion (Member) commented:

Thanks for contributing. Can you rebase from main? Then I can test and review this PR.

@SuhailB (Contributor) commented Jul 3, 2025

Hello @codelion, thank you for this great project, and thank you @MashAliK for the parallelization effort.

I saw the request to rebase this PR to the latest main. I needed the parallel evaluation as well, so I rebased it and resolved the conflicts in my fork:
https://github.com/SuhailB/openevolve/tree/updated-parallel-iterations

One thing I am not sure about, which was causing loading/storing of large artifacts to fail, is this code snippet:

  # Create directory and remove old path if it exists
  if os.path.exists(save_path):
      shutil.rmtree(save_path)

Lines 361-363 in the save method in database.py

The code now passes all the tests, but I am not sure what I did is correct.

Let me know if you'd like me to open a new PR, or if @MashAliK prefers to pull from this branch to update this one.

Thank you

@codelion (Member) commented Jul 3, 2025

I don't think that code is in main? It was added in this PR, perhaps to handle multiple processes updating the DB:

As a consequence of this approach a snapshot of the database needs to be saved in the root process every iteration and each new process needs to load it

How have you solved it? Happy to look at the PR if you can send it across. Also, can you test some of the existing examples with and without parallel execution to see if there is a speed-up and whether both reach similar convergence in terms of the best_program found?

@SuhailB (Contributor) commented Jul 3, 2025

@codelion yes, it's not in main; I was referring to the PR version done by @MashAliK. Erasing the path was causing a failure when loading large artifacts, which are stored in that path, so I commented it out in my rebased version:

https://github.com/SuhailB/openevolve/tree/updated-parallel-iterations

I am not sure whether this breaks any dependencies; @MashAliK would probably know more about this.
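
Concretely, the change in my branch is essentially the following (a sketch, not the exact diff; the makedirs line is an assumption about how the directory still gets created):

  # The rmtree in the save method is disabled so that large artifacts already
  # stored under save_path are not erased before saving.

  # if os.path.exists(save_path):
  #     shutil.rmtree(save_path)
  os.makedirs(save_path, exist_ok=True)  # assumption: just ensure the directory exists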

I will try to run examples tomorrow.

@MashAliK (Contributor, Author) commented Jul 3, 2025

Hello, sorry I wasn't able to get to updating this PR earlier. I will rebase it soon.

@SuhailB I originally deleted the folder containing the programs each time to prevent duplicate programs from being added, but now I'm seeing that the files are just overwritten inside _save_program, so you're right that this probably isn't necessary.
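
For context, the overwriting behavior I mean is roughly this (a sketch of _save_program that assumes each program is written to a file named after its id; not the actual implementation):

  import json
  import os

  def _save_program(self, program, save_path: str) -> None:
      # Sketch only: because the file name is derived from the program id,
      # saving the database again overwrites the existing files rather than
      # accumulating duplicates, so deleting the folder first isn't needed.
      os.makedirs(save_path, exist_ok=True)
      program_path = os.path.join(save_path, f"{program.id}.json")
      with open(program_path, "w") as f:
          json.dump(program.to_dict(), f, indent=2)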

The only other benefit I can think of for deleting the folder is that if a new database is saved at the same location, the previous programs will all be cleared. Again, this doesn't matter for how it's being used right now because controller.py creates a temp folder each time.

Since you're saying that deleting tends to cause issues for larger databases, I'd prefer to just remove those lines for now, so I will do that here. Thanks for pointing this out!

@SuhailB (Contributor) commented Jul 3, 2025

@MashAliK Sounds good. And to clarify, deleting it is causing an issue for large artifacts (a feature added after your PR), not large databases. Also, feel free to use my branch for rebasing (if it has no issues).

@SuhailB (Contributor) commented Jul 3, 2025

Hi @codelion and @MashAliK,

I've tested the updated-parallel-iteration version for the circle_packing_with_artifacts example for 50 iterations for each stage instead of 100 to reduce API costs. I used gemini-2.0-flash-lite and gemini-2.0-flash-lite.

Results:
Sequential:
Total Runtime: 30 minutes
Best Score (sum_radii): 1.88
Stage1:

{
  "id": "788596e9-6d2b-4134-ade9-c309fdf812c2",
  "generation": 2,
  "iteration": 6,
  "timestamp": 1751574058.320835,
  "parent_id": "6eb73296-9b3c-42e1-91da-2063ce7acfaf",
  "metrics": {
    "validity": 1.0,
    "sum_radii": 1.8801588665719113,
    "target_ratio": 0.7135327766876325,
    "combined_score": 0.7135327766876325,
    "eval_time": 0.11612915992736816
  },
  "language": "python",
  "saved_at": 1751574423.6830008
}

Stage2:

{
  "id": "084465de-efce-4e42-81b0-79f6a044e819",
  "generation": 0,
  "iteration": 0,
  "timestamp": 1751574425.2020102,
  "parent_id": null,
  "metrics": {
    "validity": 1.0,
    "sum_radii": 1.8801588665719113,
    "target_ratio": 0.7135327766876325,
    "combined_score": 0.7135327766876325,
    "eval_time": 0.11661648750305176
  },
  "language": "python",
  "saved_at": 1751575856.2608302
}

Parallel (25 cores):
Total Runtime: 2 minutes
Best Score (sum_radii): 2.038
Stage1:

{
  "id": "b96d6e67-76bb-4765-9f51-584d92e5d2c2",
  "generation": 1,
  "iteration": 20,
  "timestamp": 1751577546.0471964,
  "parent_id": "49307652-3ead-485f-8c21-1ce11a16ec29",
  "metrics": {
    "validity": 1.0,
    "sum_radii": 1.8598312591600656,
    "target_ratio": 0.7058183146717517,
    "combined_score": 0.7058183146717517,
    "eval_time": 0.22881340980529785
  },
  "language": "python",
  "saved_at": 1751577556.0780315
}

Stage2:

{
  "id": "1a044db7-811c-4ea6-92e3-1918c34b28a1",
  "generation": 1,
  "iteration": 16,
  "timestamp": 1751577569.0512633,
  "parent_id": "74a8ac96-f519-49e5-b3da-f4b9058cdc2e",
  "metrics": {
    "validity": 1.0,
    "sum_radii": 2.038107874942713,
    "target_ratio": 0.7734754743615609,
    "combined_score": 0.7734754743615609,
    "eval_time": 0.3030838966369629
  },
  "language": "python",
  "saved_at": 1751577604.0499
}

The rebased version is in here: https://github.com/SuhailB/openevolve/tree/updated-parallel-iterations

@MashAliK could you rebase for @codelion to review, or should I open a new PR?

@MashAliK (Contributor, Author) commented Jul 3, 2025

@SuhailB Got it, thanks.
Since you've already fixed and rebased it, you could open a new PR from your branch and we can test things there. I will then close this one.

@codelion closed this Jul 7, 2025