| 1 | +--- |
| 2 | +title: Development of an auto-tuning tool for the CLUEstering library |
| 3 | +layout: gsoc_proposal |
| 4 | +project: Patatrack |
| 5 | +year: 2025 |
| 6 | +organization: CERN |
| 7 | +--- |
| 8 | + |
| 9 | +## Description |
| 10 | +[CLUE][clue] is a fast and fully parallelizable density-based clustering algorithm, optimized for high- |
| 11 | +occupancy scenarios, where the number of clusters is much larger than the average number of hits |
| 12 | +in a cluster ([Rovere et al. 2020][cluepaper]). The algorithm uses a grid spatial index for fast querying of |
| 13 | +neighbors and its timing scales linearly with the number of points within the range considered. It is |
| 14 | +currently used in the CMS and CLIC event reconstruction software for clustering calorimetric hits in |
| 15 | +two dimensions based on their energy. The CLUE algorithm has been generalized to an arbitrary |
| 16 | +number of dimensions and to a wider range of applications in [CLUEstering][cluestering], a general-purpose |
| 17 | +clustering library with a backend implemented in C++ and a Python interface for |
| 18 | +easier use. The library can run on multiple backends (serial, TBB, GPUs, etc.) thanks |
| 19 | +to the [Alpaka][alpakapaper] performance portability library. One feature that CLUEstering currently lacks, |
| 20 | +and that would be extremely useful for every user, is automatic tuning of the parameters: given |
| 21 | +the expected number of clusters, it would compute the combination of input parameters that results in the best |
| 22 | +clustering. |
| 23 | +For this task, one of the options to be explored is “The Optimizer”, a Python library developed by |
| 24 | +the Patatrack group of the CMS experiment, which provides a collection of optimization algorithms, |
| 25 | +in particular MOPSO (Multi-Objective Particle Swarm Optimization). |
| 26 | + |
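To make the tuning goal in the Description concrete, here is a minimal sketch of the idea, under loud assumptions: `toy_cluster_count` is a hypothetical one-dimensional stand-in for a clustering run (it is not the CLUEstering API), `dc` is an illustrative distance parameter, and the exhaustive scan over candidates is only a placeholder for whichever optimizer (for example MOPSO) the project ends up using.

```python
import random


def toy_cluster_count(points, dc):
    """Hypothetical stand-in for a clustering run (NOT the CLUEstering API).

    Sorts 1D points and starts a new cluster whenever the gap between
    two consecutive points is larger than dc.
    """
    ordered = sorted(points)
    if not ordered:
        return 0
    gaps = zip(ordered, ordered[1:])
    return 1 + sum(1 for left, right in gaps if right - left > dc)


def tune_dc(points, expected_clusters, candidates):
    """Return the candidate dc whose cluster count is closest to the target.

    Ties are resolved in favour of the earliest candidate in the sequence.
    """
    return min(
        candidates,
        key=lambda dc: abs(toy_cluster_count(points, dc) - expected_clusters),
    )


# Three well-separated blobs of 50 points each, generated with a fixed seed.
rng = random.Random(0)
centers = [0.0, 10.0, 20.0]
points = [c + rng.uniform(-1.0, 1.0) for c in centers for _ in range(50)]

# Candidate values 0.05, 0.10, ..., 5.00 for the illustrative parameter.
candidates = [0.05 * k for k in range(1, 101)]
best_dc = tune_dc(points, expected_clusters=3, candidates=candidates)
```

In this toy setting the objective is just the mismatch between the obtained and the expected number of clusters. A real tool would likely search several parameters at once and could combine more than one objective, which is where a multi-objective optimizer becomes relevant.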
| 27 | +## Expected results |
| 28 | +* Consider the best techniques and tools for the task |
| 29 | +* Develop an auto-tuning tool for the parameters of CLUEstering |
| 30 | +* Test it on a wide range of commonly used datasets |
| 31 | +* Benchmark and profile to identify the bottlenecks of the tool and optimize it |
| 32 | + |
| 33 | +## Evaluation Task |
| 34 | +Interested students please contact [email protected] |
| 35 | + |
| 36 | +## Technologies |
| 37 | +* C++, Python |
| 38 | + |
| 39 | +## Desirable skills |
| 40 | +* Experience with development in C++17/20 |
| 41 | +* Experience with GPU computing |
| 42 | +* Experience with machine learning and optimization techniques |
| 43 | +* Experience with development of Python libraries |
| 44 | + |
| 45 | +## Additional information |
| 46 | +* Difficulty level (low, medium, hard): medium |
| 47 | +* Duration: 350 hours |
| 48 | +* Mentor availability: June-October |
| 49 | + |
| 50 | +## Mentors |
| 51 | + * **[Simone Balducci](mailto:[email protected]) (CERN, UNIBO)** |
| 52 | + * [Felice Pantaleo](mailto:[email protected]) (CERN) |
| 53 | + |
| 54 | +## Links |
| 55 | + * [CLUE][clue] |
| 56 | + * [CLUEstering][cluestering] |
| 57 | + * [Alpaka][alpaka] |
| 58 | + |
| 59 | +[clue]: https://gitlab.cern.ch/kalos/clue |
| 60 | +[cluestering]: https://github.com/cms-patatrack/CLUEstering |
| 61 | +[cluepaper]: https://www.frontiersin.org/articles/10.3389/fdata.2020.591315/full |
| 62 | +[alpakapaper]: https://arxiv.org/abs/1602.08477 |
| 63 | +[alpaka]: https://github.com/alpaka-group/alpaka |