ATLAS blog post for GSoC 2025 #1755
Conversation
@maszyman Hi Maciej - I think as I'm external to this repo, I can't directly request a reviewer. Pinging you here so that you can add yourself as a reviewer to this draft PR and provide any feedback prior to marking this ready for review!
Thanks a lot Yolanne for this nice report (and your contribution in general)!
Please have a look at my comments inline.
project: ATLAS
title: Neural (De)
author: Jane Doe
photo: blog_authors/JaneDoe.jpg # Upload your square photo here
avatar: https://avatars.githubusercontent.com/u/12345678?s=400&v=4 # Replace with your GitHub avatar URL
Please update this information.
## Introduction
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasability for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations. I had the opportunity to work on this topic with the support and supervision of Maciej Szymański and Peter Van Gemmeren with the ATLAS Software & Computing group.
It would be nice if you could mention Argonne National Laboratory here as well :-)
| --- | --- |
| Name | [Yolanne Lee](https://maths4dl.ac.uk/team-member/yolanne-lee) |
| Organisation | [CERN ATLAS Project](https://atlas.cern/) |
| Mentor | [Dr. Maciej Szymański](https://www.anl.gov/profile/maciej-pawel-szymanski), [Dr. Peter Van Gemmeren](https://www.anl.gov/profile/maciej-pawel-szymanski) |
| Mentor | [Dr. Maciej Szymański](https://www.anl.gov/profile/maciej-pawel-szymanski), [Dr. Peter Van Gemmeren](https://www.anl.gov/profile/maciej-pawel-szymanski) |
| Mentor | [Dr. Maciej Szymański](https://www.anl.gov/profile/maciej-pawel-szymanski), [Dr. Peter Van Gemmeren](https://www.anl.gov/profile/peter-van-gemmeren) |
layout: blog_post
logo: hsf_logo_angled.png # Match the logo file listed in your project’s metadata
intro: |
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasability for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations.
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasability for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations.
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasibility for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations. |
## Introduction
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasability for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations. I had the opportunity to work on this topic with the support and supervision of Maciej Szymański and Peter Van Gemmeren with the ATLAS Software & Computing group.
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasability for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations. I had the opportunity to work on this topic with the support and supervision of Maciej Szymański and Peter Van Gemmeren with the ATLAS Software & Computing group.
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasibility for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations. I had the opportunity to work on this topic with the support and supervision of Maciej Szymański and Peter Van Gemmeren with the ATLAS Software & Computing group. |
As part of exploratory work, I have implemented autoencoders, variational autoencoders, and flow matching models to successfully reconstruct the distributions of the data of interest (currently, the momentum, eta, and phi of electrons as a minimal test set), demonstrating that such models are sufficiently complex to capture the characteristics of the data. These models also carry additional benefits such as retrievable statistical characteristics and densities, which could benefit downstream usage.
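One way such models expose "retrievable statistical characteristics" is through an explicit latent density, as in a variational autoencoder. The sketch below is illustrative only: the layer widths, latent size, and names such as `FeatureVAE` are assumptions for illustration, not the project's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a variational autoencoder for the three electron
# features (pt, eta, phi). Sizes and names are assumptions, not the
# project's exact configuration.
class FeatureVAE(nn.Module):
    def __init__(self, n_features=3, hidden=64, latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation trick: sample the latent while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def elbo_loss(recon, target, mu, logvar):
    """Reconstruction (MSE) term plus KL term against a unit Gaussian prior."""
    mse = nn.functional.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kl
```

The learned `mu` and `logvar` give each event an explicit Gaussian latent density, which is the kind of retrievable statistic mentioned above.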
In the context of the proposed pipeline, I had first attempted to train an autoencoder (taking the simplest model to create a 'minimum viable product', as it were) under a denoising workflow wherein I have as input to the model the compressed data, optionally with some added noise. The output of the model, then, is trained to be the 'denoised' data (where if no noise was added, one can consider the mantissa truncation to add 'quantisation noise') and MSE loss is taken of the model output versus the original, uncompressed data. The addition of some small amount of Gaussian noise is a common technique which I use in my day-to-day work and can often encourage the model to learn more effectively. Models at this scale are easily and quickly trained on an NVIDIA RTX4080, with 100 epochs taking on average ~15 minutes, or during testing, converging sufficiently within 30 epochs to establish a rough performance measure. All models were implemented using pytorch, with additional functionalities used for evaluation using scikit-learn statistics and numpy operations where necessary.
Nitpicking: backticks for the software names (please check also the rest of the document)
In the context of the proposed pipeline, I had first attempted to train an autoencoder (taking the simplest model to create a 'minimum viable product', as it were) under a denoising workflow wherein I have as input to the model the compressed data, optionally with some added noise. The output of the model, then, is trained to be the 'denoised' data (where if no noise was added, one can consider the mantissa truncation to add 'quantisation noise') and MSE loss is taken of the model output versus the original, uncompressed data. The addition of some small amount of Gaussian noise is a common technique which I use in my day-to-day work and can often encourage the model to learn more effectively. Models at this scale are easily and quickly trained on an NVIDIA RTX4080, with 100 epochs taking on average ~15 minutes, or during testing, converging sufficiently within 30 epochs to establish a rough performance measure. All models were implemented using pytorch, with additional functionalities used for evaluation using scikit-learn statistics and numpy operations where necessary.
In the context of the proposed pipeline, I had first attempted to train an autoencoder (taking the simplest model to create a 'minimum viable product', as it were) under a denoising workflow wherein I have as input to the model the compressed data, optionally with some added noise. The output of the model, then, is trained to be the 'denoised' data (where if no noise was added, one can consider the mantissa truncation to add 'quantisation noise') and MSE loss is taken of the model output versus the original, uncompressed data. The addition of some small amount of Gaussian noise is a common technique which I use in my day-to-day work and can often encourage the model to learn more effectively. Models at this scale are easily and quickly trained on an NVIDIA RTX4080, with 100 epochs taking on average ~15 minutes, or during testing, converging sufficiently within 30 epochs to establish a rough performance measure. All models were implemented using `pytorch`, with additional functionalities used for evaluation using `scikit-learn` statistics and `numpy` operations where necessary.
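To make the denoising workflow described in that paragraph concrete, here is a minimal sketch of such a training step in `pytorch`. It is illustrative only: the layer widths, learning rate, noise scale, and the name `train_step` are assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the denoising setup: compressed values (plus optional
# Gaussian noise) go in, and the loss is MSE against the original, uncompressed
# values. Sizes and names are assumptions for illustration.
autoencoder = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 3),
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def train_step(compressed, original, noise_std=0.01):
    """One optimisation step of the denoising objective."""
    noisy = compressed + noise_std * torch.randn_like(compressed)
    recon = autoencoder(noisy)
    loss = nn.functional.mse_loss(recon, original)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With `noise_std=0`, the quantisation error introduced by mantissa truncation itself plays the role of the noise being removed.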
I had first approached my mentors and this project with the excitement of applying some of my current research in generative models for scientific machine learning directly to this precision upsampling problem, drawing parallels between "making the numbers more precise" and "superresolving medical images", for example. Although the extensive data exploration we performed and the theoretical work I ended up doing stalled the direct application of deep learning models to these PHYSLITE file data, I found it both satisfying to be able to derive a clear reason for the phenomena I observed in the data and extremely informative for designing (as a work in progress) a more rigorous deep learning system. I have also had the opportunity to present my work to the S&C group, which generated some useful discussion that I look forward to applying to my work, and I will be presenting it further during the ATLAS S&C Week workshop in the fall.
Working on HEP data was fascinating both for the subject matter and for the problem area being tackled, and I hope that my continued work can benefit not only this specific upsampling target for PHYSLITE data, but lossy compression as a whole. My time working with my mentors was fruitful and inspiring, and I felt integrated into the group via group meetings and the strong support that both mentors provided throughout the project. I am thankful for such an exciting opportunity as well as my mentors' insightful feedback each week, and I look forward to continuing my work beyond this summer with them!
👍 😊
## My thoughts on GSoC
I had first approached my mentors and this project with the excitement of applying some of my current research in generative models for scientific machine learning directly to this precision upsampling problem, drawing parallels between "making the numbers more precise" and "superresolving medical images", for example. Although the extensive data exploration we performed and the theoretical work I ended up doing stalled the direct application of deep learning models to these PHYSLITE file data, I found it both satisfying to be able to derive a clear reason for the phenomena I observed in the data and extremely informative for designing (as a work in progress) a more rigorous deep learning system. I have also had the opportunity to present my work to the S&C group, which generated some useful discussion that I look forward to applying to my work, and I will be presenting it further during the ATLAS S&C Week workshop in the fall.
I had first approached my mentors and this project with the excitement of applying some of my current research in generative models for scientific machine learning directly to this precision upsampling problem, drawing parallels between "making the numbers more precise" and "superresolving medical images", for example. Although the extensive data exploration we performed and the theoretical work I ended up doing stalled the direct application of deep learning models to these PHYSLITE file data, I found it both satisfying to be able to derive a clear reason for the phenomena I observed in the data and extremely informative for designing (as a work in progress) a more rigorous deep learning system. I have also had the opportunity to present my work to the S&C group, which generated some useful discussion that I look forward to applying to my work, and I will be presenting it further during the ATLAS S&C Week workshop in the fall.
I had first approached my mentors and this project with the excitement of applying some of my current research in generative models for scientific machine learning directly to this precision upsampling problem, drawing parallels between "making the numbers more precise" and "superresolving medical images", for example. Although the extensive data exploration we performed and the theoretical work I ended up doing stalled the direct application of deep learning models to these PHYSLITE file data, I found it both satisfying to be able to derive a clear reason for the phenomena I observed in the data and extremely informative for designing (as a work in progress) a more rigorous deep learning system. I have also had the opportunity to present my work to the ANL Software & Computing group, which generated some useful discussion that I look forward to applying to my work, and I will be presenting it further during the ATLAS S&C Week workshop in the fall.
## Theoretical Framework and Proposed Method
My project took a short journey through the impact that mantissa truncation has on the resulting data. What exactly does it mean to truncate $n$ bits, realised in base-10 representation?
My project
Perhaps it's a good place to add the link to your GitHub repository?
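To illustrate what truncating $n$ mantissa bits does to a concrete base-10 value, here is a minimal sketch assuming IEEE 754 `float32` and `numpy` bit manipulation; `truncate_mantissa` is a hypothetical helper for illustration, not code from the project repository.

```python
import numpy as np

def truncate_mantissa(x, n_bits):
    """Zero the lowest n_bits of the 23-bit float32 mantissa (illustrative sketch)."""
    x = np.asarray(x, dtype=np.float32)
    bits = x.view(np.uint32)
    mask = np.uint32((0xFFFFFFFF >> n_bits) << n_bits)  # clear the low n_bits
    return (bits & mask).view(np.float32)

x = np.float32(3.14159265)
for n in (7, 10, 13):
    t = truncate_mantissa(x, n)
    print(n, float(t), float(x - t))  # error is at most 2**n units in the last place
```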
Another approach under development is to treat the data as an inpainting problem, commonly seen within image generation where some part of an image may be blacked out; an inpainting model is designed to "fill in the blanks". In our case, we not only have the new theoretical bounds but also the first $23-n$ bits of data that are retained after truncation: this is valuable information which, in statistical tests, is also often a 'good-enough' approximation of the uncompressed data to begin with. Then, the challenge is only to "fill in" the remaining truncated $n$ bits, which represents an even more bounded problem space and would minimize unexpected upsampling artifacts by constraining any correction terms to be within the allowable $n$ bits of change.
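A small sketch of that constraint, under the assumption that truncation simply zeroes the lowest $n$ mantissa bits of a positive `float32` value (the helper name `truncation_interval` is hypothetical, not from the project code):

```python
import numpy as np

def truncation_interval(truncated, n_bits):
    """Half-open interval [low, high) guaranteed to contain the original
    positive float32 value whose lowest n_bits mantissa bits were zeroed."""
    low = np.float32(truncated)
    bits = low.view(np.uint32)
    high = np.uint32(bits + np.uint32(1 << n_bits)).view(np.float32)
    return float(low), float(high)

low, high = truncation_interval(3.140625, 7)
print(low, high)
```

Clipping any model-proposed correction into this interval is one way to guarantee that the reconstruction never moves a value further than the truncation itself could have.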
While this project has not yet conclusively found a candidate model to precision upsample, ongoing work is being performed and is to continue beyond the timeline of the GSoC project toward proposing a working pipeline based on the work performed up to this point. In short, autoencoders, variational autoencoders, and some simple flow matching models have been implemented and tested, with performance measured using simple MSE loss as well as distribution-based metrics such as KL divergence. The pipeline of the model was being tested in Jupyter notebook files, but I have begun to move them to modular python files to facilitate further work.
I would still add the plots showing the results for the models you implemented, even if they are not as good as one could hope for.
I think it's important to clearly show what has been achieved, try explaining why, and propose the next steps (which you did in the previous paragraph basically).
but I have begun to move them to modular python files to facilitate further work.
I would drop that. Instead you may say a few words about your repo, what's there, how to use it, etc.
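As one possible realisation of the distribution-based evaluation mentioned above (the binning choice and the pure-`numpy` recipe here are illustrative assumptions, not the project's exact settings), a discrete KL divergence can be computed from shared histograms of the original and reconstructed features:

```python
import numpy as np

def histogram_kl(original, reconstructed, bins=100, eps=1e-12):
    """Discrete KL(P_original || Q_reconstructed) over a shared binning (illustrative)."""
    lo = float(min(original.min(), reconstructed.min()))
    hi = float(max(original.max(), reconstructed.max()))
    p, _ = np.histogram(original, bins=bins, range=(lo, hi))
    q, _ = np.histogram(reconstructed, bins=bins, range=(lo, hi))
    p = (p + eps) / (p + eps).sum()  # small epsilon avoids log(0)
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Example with synthetic stand-ins for an original and a reconstructed feature:
rng = np.random.default_rng(0)
original = rng.normal(size=10_000)
reconstructed = original + rng.normal(scale=0.05, size=10_000)
print(histogram_kl(original, reconstructed))
```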
An interesting phenomenon occurs when observing the data plotted on a log-scaled $x$-axis. To further investigate, we plotted this with some light Gaussian smoothing. The resulting peaks appeared to have some regular spacing (specifically, $\Delta \log_{10} \approx 0.28 \;\Rightarrow\; \text{ratio} \approx 1.9$). In fact, investigating the residuals in the same log $x$ scaling resulted in distinct patterns.
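For reference, the conversion from the quoted log-spacing to a multiplicative ratio is simply $10^{\Delta \log_{10}} = 10^{0.28} \approx 1.9$; for comparison, peaks spaced by exact factors of two would correspond to $\Delta \log_{10} = \log_{10} 2 \approx 0.301$.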
<img src="https://raw.githubusercontent.com/yolannel/ATLAS_decompression/refs/heads/master/figures/combined_residual_density.png" alt="Residual densities with theoretical model overlayed." width="60%"> |
<img src="https://raw.githubusercontent.com/yolannel/ATLAS_decompression/refs/heads/master/figures/combined_residual_density.png" alt="Residual densities with theoretical model overlayed." width="60%">
<img src="https://raw.githubusercontent.com/yolannel/ATLAS_decompression/refs/heads/master/figures/combined_residual_density.png" alt="Residual densities with theoretical model overlaid." width="60%"> |
## Introduction
In high-energy physics experiments such as those at CERN’s ATLAS project, immense volumes of data are generated. This project explores the feasability for “precision upsampling” using deep generative models to be used to reconstruct high-precision floating-point data from aggressively compressed representations. I had the opportunity to work on this topic with the support and supervision of Maciej Szymański and Peter Van Gemmeren with the ATLAS Software & Computing group.
Please consider adding the references to e.g. CERN, ATLAS Open Data, IEEE 754, ML models, software you used, etc. Either inline links or collectively at the end, as you prefer.
First draft proposed for the GSoC 2025 blog post for Neural (De)compression for High Energy Physics under the ATLAS project proposal.