From 5d7886603a6b03aadb08371e00f0b7217fdcd3ff Mon Sep 17 00:00:00 2001 From: Valentin Volkl Date: Mon, 10 Feb 2025 13:58:37 +0100 Subject: [PATCH 1/3] gsoc25: add cvmfs proposal --- _gsocprojects/2025/project_CernVM-FS.md | 10 +++ .../proposal_CVMFS_DistributeModelFiles.md | 61 +++++++++++++++++++ gsoc/2025/mentors.md | 1 + 3 files changed, 72 insertions(+) create mode 100644 _gsocprojects/2025/project_CernVM-FS.md create mode 100644 _gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md diff --git a/_gsocprojects/2025/project_CernVM-FS.md b/_gsocprojects/2025/project_CernVM-FS.md new file mode 100644 index 000000000..a72802a65 --- /dev/null +++ b/_gsocprojects/2025/project_CernVM-FS.md @@ -0,0 +1,10 @@ +--- +title: CernVM-FS +project: CernVM-FS +layout: default +logo: cernvmfs-logo.png +description: | + The CernVM-File System ([CVMFS](https://cernvm.cern.ch/fs/)) is a global, read-only POSIX file system that provides the universal namespace /cvmfs. It is based on content-addressable storage, Merkle trees, and HTTP data transport. CernVM-FS provides a mission critical infrastructure to small and large HEP collaborations. +--- + +{% include gsoc_project.ext %} diff --git a/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md b/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md new file mode 100644 index 000000000..f596c7e8a --- /dev/null +++ b/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md @@ -0,0 +1,61 @@ +--- +title: Evaluate Distribution of ML model files on CVMFS +layout: gsoc_proposal +project: CernVM-FS +year: 2025 +organization: + - CERN +difficulty: medium +duration: 175 +mentor_avail: June-October +--- + +# Description + +Particle physicists studying nature at highest energy scales at the Large Hadron Collider rely on simulations and data processing for their experiments. +These workloads run on the "computing grid", a massive globally distributed computing infrastructure. 
+Deploying software efficiently and reliable to this grid is an important and challenging task. +CVMFS is an optimised shared file system developed specifically for this purpose: It is implemented as a POSIX read-only file system in user space (a FUSE module). +Files and directories are hosted on standard web servers and mounted in the universal namespace `/cvmfs`. +In many cases, it replaces package managers and shared software areas on cluster file systems as means to distribute the software used to process experiment data. + +## Task idea + +CVMFS is optimized for the distribution of software (header files, scripts and libraries), taking advantage of the repeated access pattern for its caching, and the possibility to deduplicate files present in several versions. +CVMFS is capable to provide a general read-only POSIX file system view on data in external storage. A very common usecache is to make conditions databases available to workloads running in distributed computing infrastructure, but various datasets have been published in CVMFS. +How efficient CVMFS can be always depends on the details in these usecases - often the benefit for the users is simply in leveraging the existing server and proxy infrastructure. + + +In this project proposal, we'd like to evaluate CVMFS as a means to distribute machine learning model files used in inference, for example .onnx files. The main focus will be on creating a test deployment and benchmarking the access, as well as possible coding utilities and scripts to aid in the deployment of models on CVMFS. We'd also like to contrast CVMFS to existing inference servers like KServe, and see if it could integrate as a backend storage. + + + + +## Expected results and milestones + + * Familiarization with the CVMFS server infrastructure + * Familiarization with the ML model usage at CERN, Survey of different common inference model file formats. 
+ * + * Test deployment of models relevant to ML4EP + * Benchmark and evaluation of inference using models served from CVMFS + * Addition of the benchmark to the CVMFS continuous benchmarking infrastructure + * Writing a best practices document for the CVMFS documentation + + +## Requirements + + * UNIX/Linux + * Interest in scientific computing devops + * Familiarity with common ML libraries, in particular ONNX + + +## Mentors + + * **[Valentin Volkl](mailto:valentin.volkl@cern.ch)** + * [Lorenzo Moneta](mailto:lorenzo.moneta@cern.ch) + + +## Links + + * [CVMFS](https://cernvm.cern.ch/fs/) + * [KServe](https://kserve.github.io/website) diff --git a/gsoc/2025/mentors.md b/gsoc/2025/mentors.md index dd8269b7a..ebae681d7 100644 --- a/gsoc/2025/mentors.md +++ b/gsoc/2025/mentors.md @@ -17,6 +17,7 @@ layout: plain * Stephan Lachnit [stephan.lachnit@desy.de](mailto:stephan.lachnit@desy.de) DESY * David Lange [david.lange@cern.ch](mailto:david.lange@cern.ch) CompRes * Serguei Linev [S.Linev@gsi.de](mailto:S.Linev@gsi.de) GSI +* Lorenzo Moneta [lorenzo.moneta@cern.ch](mailto:lorenzo.moneta@cern.ch) CERN * Giacomo Parolini [giacomo.parolini@cern.ch](mailto:giacomo.parolini@cern.ch) CERN * Alexander Penev [alexander.p.penev@gmail.com](mailto:alexander.p.penev@gmail.com) CompRes/University of Plovdiv, BG * Mayank Sharma [mayank.sharma@cern.ch](mailto:mayank.sharma@cern.ch) UMich From 59151bde471cf389e3b4197975f4c38c9aee4b75 Mon Sep 17 00:00:00 2001 From: Valentin Volkl Date: Tue, 11 Feb 2025 09:51:08 +0100 Subject: [PATCH 2/3] Update _gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md --- _gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md b/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md index f596c7e8a..0c5bf059b 100644 --- a/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md +++ 
b/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md @@ -22,7 +22,7 @@ In many cases, it replaces package managers and shared software areas on cluster ## Task idea CVMFS is optimized for the distribution of software (header files, scripts and libraries), taking advantage of the repeated access pattern for its caching, and the possibility to deduplicate files present in several versions. -CVMFS is capable to provide a general read-only POSIX file system view on data in external storage. A very common usecache is to make conditions databases available to workloads running in distributed computing infrastructure, but various datasets have been published in CVMFS. +CVMFS is capable to provide a general read-only POSIX file system view on data in external storage. A very common usecase is to make conditions databases available to workloads running in distributed computing infrastructure, but various datasets have been published in CVMFS. How efficient CVMFS can be always depends on the details in these usecases - often the benefit for the users is simply in leveraging the existing server and proxy infrastructure. 
From 1b5f61a5e5f4e59211d98624dff685c153af754f Mon Sep 17 00:00:00 2001 From: Valentin Volkl Date: Fri, 14 Feb 2025 09:54:25 +0100 Subject: [PATCH 3/3] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Maciej Szymański --- _gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md b/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md index 0c5bf059b..fab6e5288 100644 --- a/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md +++ b/_gsocproposals/2025/proposal_CVMFS_DistributeModelFiles.md @@ -15,15 +15,15 @@ mentor_avail: June-October Particle physicists studying nature at highest energy scales at the Large Hadron Collider rely on simulations and data processing for their experiments. These workloads run on the "computing grid", a massive globally distributed computing infrastructure. Deploying software efficiently and reliable to this grid is an important and challenging task. -CVMFS is an optimised shared file system developed specifically for this purpose: It is implemented as a POSIX read-only file system in user space (a FUSE module). +CVMFS is an optimised shared file system developed specifically for this purpose: it is implemented as a POSIX read-only file system in user space (a FUSE module). Files and directories are hosted on standard web servers and mounted in the universal namespace `/cvmfs`. In many cases, it replaces package managers and shared software areas on cluster file systems as means to distribute the software used to process experiment data. ## Task idea CVMFS is optimized for the distribution of software (header files, scripts and libraries), taking advantage of the repeated access pattern for its caching, and the possibility to deduplicate files present in several versions. 
-CVMFS is capable to provide a general read-only POSIX file system view on data in external storage. A very common usecase is to make conditions databases available to workloads running in distributed computing infrastructure, but various datasets have been published in CVMFS. -How efficient CVMFS can be always depends on the details in these usecases - often the benefit for the users is simply in leveraging the existing server and proxy infrastructure. +CVMFS is capable to provide a general read-only POSIX file system view on data in external storage. A very common use case is to make conditions databases available to workloads running in distributed computing infrastructure, but various datasets have been published in CVMFS. +How efficient CVMFS can be always depends on the details in these use cases - often the benefit for the users is simply in leveraging the existing server and proxy infrastructure. In this project proposal, we'd like to evaluate CVMFS as a means to distribute machine learning model files used in inference, for example .onnx files. The main focus will be on creating a test deployment and benchmarking the access, as well as possible coding utilities and scripts to aid in the deployment of models on CVMFS. We'd also like to contrast CVMFS to existing inference servers like KServe, and see if it could integrate as a backend storage.