
Commit bcd4d6f

Merge branch 'main' of github.com:yolannel/hsf.github.io
2 parents eee5105 + df574ab

1 file changed

_gsocblogs/2025/blog_ATLAS_YolanneLee.md

Lines changed: 40 additions & 11 deletions
@@ -1,9 +1,9 @@
 ---
 project: ATLAS
-title: Neural (De)
-author: Jane Doe
+title: Neural (De)compression for High Energy Physics
+author: Yolanne Lee
 photo: blog_authors/JaneDoe.jpg # Upload your square photo here
-avatar: https://avatars.githubusercontent.com/u/12345678?s=400&v=4 # Replace with your GitHub avatar URL
+avatar: https://avatars.githubusercontent.com/yolannel # Replace with your GitHub avatar URL
 date: 31.08.2025 # Use the date you wrote the article
 year: 2025
 layout: blog_post
@@ -15,17 +15,39 @@ intro: |
 | | |
 | --- | --- |
 | Name | [Yolanne Lee](https://maths4dl.ac.uk/team-member/yolanne-lee) |
-| Organisation | [CERN ATLAS Project](https://atlas.cern/) |
-| Mentor | [Dr. Maciej Szymański](https://www.anl.gov/profile/maciej-pawel-szymanski), [Dr. Peter Van Gemmeren](https://www.anl.gov/profile/maciej-pawel-szymanski)|
+| Organisation | [CERN ATLAS Project](https://atlas.cern/), [Argonne National Laboratory](https://www.anl.gov/) |
+| Mentor | [Dr. Maciej Szymański](https://www.anl.gov/profile/maciej-pawel-szymanski), [Dr. Peter Van Gemmeren](https://www.anl.gov/profile/peter-van-gemmeren)|
 | Project | [Neural (De)compression for High Energy Physics](https://summerofcode.withgoogle.com/programs/2025/projects/W5UjfwMc) |
 
 ## Introduction
 
 In high-energy physics experiments such as those at [CERN’s ATLAS project](https://atlas.cern/), immense volumes of data are generated. This project explores the feasibility of “precision upsampling”: using deep generative models to reconstruct high-precision floating-point data from aggressively compressed representations. I had the opportunity to work on this topic under the support and supervision of Maciej Szymański and Peter Van Gemmeren, with the ATLAS Software & Computing group at CERN and Argonne National Laboratory.
 
-While lossless compression is already employed to manage this data, lossy compression (specifically of floating-point precision) offers more aggressive reductions, potentially decreasing file sizes by over 30%. However, this comes at the cost of irreversibly discarding information, raising the challenge of how to recover or approximate full-precision data for downstream analysis.
+In preparation for the High-Luminosity Large Hadron Collider (HL-LHC), two more streamlined data formats are being
+developed and refined for the ATLAS project: DAOD_PHYS and DAOD_PHYSLITE, requiring approximately [50 kB/event and 10 kB/event respectively](https://agenda.infn.it/event/39076/contributions/218802/attachments/114572/164461/21.XII.2023.ATLAS_Italia_Calcolo.pdf). While lossless compression is already employed to manage this data, lossy compression (specifically of floating-point precision) offers more aggressive reductions, potentially decreasing file sizes by over 30%; these gains are limited in part by the other data types present in the DAOD files. However, lossy compression comes at the cost of irreversibly discarding information, raising the challenge of how to recover or approximate full-precision data for downstream analysis.
 
-This project proposes a novel approach using deep probabilistic models to reconstruct high-precision floating-point data from aggressively compressed representations, a problem coined here as precision upsampling. The goal is to explore and compare the capabilities of three classes of generative models: autoencoders, diffusion models, and normalizing flows. Each model type offers distinct advantages: autoencoders are well-studied in neural compression, diffusion models are robust to noise and excel in reconstructing multi-scale structures, and normalizing flows offer exact likelihood estimation and invertible mappings that align well with the structured nature of physical data.
+<details>
+<summary>Contents of a DAOD_PHYSLITE file</summary>
+
+
+| Data Type | % Branches (comp.) | % Branches (orig.) | % Size (comp.) | % Size (orig.) | Count (comp.) | Count (orig.) | Size (MB, comp.) | Size (MB, orig.) |
+|------------------|-----------------|-----------------|-------------|-------------|------------|------------|----------------|----------------|
+| AsDtype | 3.15 | 3.15 | 0.15 | 0.12 | 25.00 | 25.00 | 0.22 | 0.24 |
+| float32 (>f4) | 34.92 | 34.92 | 57.96 | 67.88 | 276.50 | 276.50 | 91.53 | 147.30 |
+| group | 0.12 | 0.12 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 |
+| int32 (>i4) | 2.22 | 2.22 | 2.94 | 2.20 | 17.50 | 17.50 | 4.73 | 4.81 |
+| int64 (>i8) | 0.12 | 0.12 | 0.02 | 0.02 | 1.00 | 1.00 | 0.04 | 0.06 |
+| jagged_array | 13.72 | 13.72 | 8.50 | 6.41 | 108.50 | 108.50 | 13.32 | 13.66 |
+| object_container | 21.00 | 21.13 | 15.96 | 12.74 | 218.50 | 219.50 | 30.72 | 32.95 |
+| strided_object | 12.93 | 12.93 | 6.22 | 4.69 | 133.00 | 133.00 | 13.37 | 13.44 |
+| uint32 (>u4) | 10.66 | 10.54 | 7.68 | 5.52 | 84.50 | 83.50 | 12.27 | 11.90 |
+| uint64 (>u8) | 0.76 | 0.76 | 0.57 | 0.42 | 6.00 | 6.00 | 0.90 | 0.90 |
+| unreadable_branch | 0.38 | 0.38 | 0.00 | 0.00 | 3.00 | 3.00 | 0.00 | 0.00 |
+
+</details>
+
+
+This project proposes a novel approach using deep probabilistic models to reconstruct high-precision floating-point data from aggressively compressed representations, a problem coined here as precision upsampling. The goal is to explore and compare the capabilities of three classes of generative models: autoencoders, diffusion models, and normalising flows. Each model type offers distinct advantages: autoencoders are well-studied in neural compression, diffusion models are robust to noise and excel in reconstructing multi-scale structures, and normalising flows offer exact likelihood estimation and invertible mappings that align well with the structured nature of physical data.
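
As a rough illustration of how precision upsampling can be framed as a learning problem, the sketch below trains a small autoencoder-style network to map lossy values back towards their full-precision counterparts. PyTorch is assumed here; the class name, layer sizes, and the random placeholder data are purely illustrative and not the project's actual model.

```python
import torch
import torch.nn as nn

class PrecisionUpsampler(nn.Module):
    """Toy MLP mapping lossy (pt, eta, phi) triplets to full-precision ones."""
    def __init__(self, n_features: int = 3, hidden: int = 64, latent: int = 16):
        super().__init__()
        # The encoder squeezes the lossy inputs into a small latent code...
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent))
        # ...and the decoder tries to recover the full-precision values.
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Placeholder data: x_full stands in for original values and x_lossy for their
# mantissa-truncated counterparts (simulated here with small additive noise).
x_full = torch.randn(1024, 3)
x_lossy = x_full + 1e-3 * torch.randn_like(x_full)

model = PrecisionUpsampler()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(100):
    optimiser.zero_grad()
    loss = loss_fn(model(x_lossy), x_full)
    loss.backward()
    optimiser.step()
```

Diffusion models and normalising flows slot into the same setup by replacing the direct regression with a learned conditional distribution over the full-precision values.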
 
 ## Context
 
@@ -35,12 +57,17 @@ Currently, the implemented lossy compression reduces the precision of a 32-bit I
 
 A 32-bit IEEE 754 float consists of:
 
-$\text{bits} = \underbrace{s}_{1 \text{ bit}} \ \underbrace{e}_{8 \text{ bits}} \ \underbrace{m}_{23 \text{ bits}}$
+```text
+bits = [ s | eeeeeeee | mmmmmmmmmmmmmmmmmmmmmmm ]
+         ^   ^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^
+       1 bit  8 bits          23 bits
+```
 
 where:
+
 - $s$ is the **sign bit**,
 - $e$ is the **exponent in biased form** ($e \in [0, 255]$),
-- $m$ is the **mantissa (fraction)** with an implicit leading 1 for normalized values.
+- $m$ is the **mantissa (fraction)** with an implicit leading 1 for normalised values.
 
 The lossy compression currently implemented in ATLAS operates by breaking a float number into its sign, exponent, and mantissa bits. The mantissa is then rounded to the nearest representable value by adding a rounding bit and truncating the lowest $k$ bits. Finally, the number is reconstructed using the same sign and exponent but with the shortened mantissa, producing a lower-precision approximation of the original.
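
As a rough sketch of that round-and-truncate step (illustrative NumPy code operating on the raw bit patterns, not the actual ATLAS implementation):

```python
import numpy as np

def truncate_mantissa(x: np.ndarray, k: int = 10) -> np.ndarray:
    """Round, then zero out, the lowest k mantissa bits of float32 values."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding = np.uint32(1 << (k - 1))        # add a rounding bit...
    mask = np.uint32((0xFFFFFFFF >> k) << k)  # ...then truncate the lowest k bits
    return ((bits + rounding) & mask).view(np.float32)

x = np.array([0.15625, 3.1415927, 1234.5678], dtype=np.float32)
print(truncate_mantissa(x, k=10))  # lower-precision approximations of x
```

If the rounding bit carries all the way through the mantissa, the carry simply propagates into the exponent, which is exactly the behaviour wanted for round-to-nearest.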
 
@@ -55,7 +82,9 @@ When truncating the last 10 mantissa bits, the compression observed is non-negli
 
 My project took a short journey through the impact that mantissa truncation has on the resulting data. What exactly does it mean to truncate $n$ bits, realised in base-10 representation?
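
Before looking at plots, it helps to put a rough number on this: rounding away the lowest $k = 10$ of the 23 mantissa bits leaves 13 explicit mantissa bits, so the worst-case relative error of a stored value grows from about $2^{-24}$ to

$$\left|\frac{\hat{x} - x}{x}\right| \lesssim 2^{-(23 - k) - 1} = 2^{-14} \approx 6 \times 10^{-5},$$

i.e. roughly four significant decimal figures survive the truncation.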
 
-Firstly, I visualised the original and lossy-compressed data, focusing largely on the momentum, eta, and phi values within the PHYSLITE files.
+[GitHub repository: see the journey from analysis to exploratory approaches here!](https://github.com/yolannel/ATLAS_decompression)
+
+I visualised the original and lossy-compressed data, focusing largely on the momentum, eta, and phi values of electrons within the PHYSLITE files. These record the motion and direction of the electrons: $\vec{p}$ (momentum) tells how fast they are moving, $\eta$ (eta) describes the angle at which they travel relative to the main particle beams, and $\phi$ (phi) gives the direction around those beams.
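
A minimal sketch of this kind of comparison, assuming the files are read with uproot (the file paths below are placeholders, and the tree and electron branch names are assumptions to be adjusted to the actual PHYSLITE contents):

```python
import awkward as ak
import matplotlib.pyplot as plt
import uproot

# Assumed PHYSLITE electron branches; adjust to the file at hand.
branches = {
    "pt":  "AnalysisElectronsAuxDyn.pt",
    "eta": "AnalysisElectronsAuxDyn.eta",
    "phi": "AnalysisElectronsAuxDyn.phi",
}

def load(path):
    """Read the electron kinematics from a PHYSLITE file into flat NumPy arrays."""
    with uproot.open(path) as f:
        tree = f["CollectionTree"]
        return {label: ak.to_numpy(ak.flatten(tree[name].array(library="ak")))
                for label, name in branches.items()}

orig = load("DAOD_PHYSLITE.original.root")   # placeholder paths
lossy = load("DAOD_PHYSLITE.lossy.root")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, label in zip(axes, branches):
    ax.hist(orig[label], bins=100, histtype="step", label="original")
    ax.hist(lossy[label], bins=100, histtype="step", label="lossy")
    ax.set_xlabel(label)
    ax.legend()
fig.tight_layout()
plt.show()
```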
 
 <p float="left">
 <img src="https://raw.githubusercontent.com/yolannel/ATLAS_decompression/master/figures/original_data_distributions.png"
@@ -118,6 +147,6 @@ This hybrid model performs significantly better than the autoencoder approach; h
 
 ## My thoughts on GSoC
 
-I had first approached my mentors and this project with the excitement of applying some of my current research in generative models for scientific machine learning directly to this precision upsampling problem, drawing parallels between "making the numbers more precise" and "superresolving medical images", for example. However, despite the extensive data exploration we performed and the theoretical work I ended up doing stalling the work directly applying deep learning models to these PHYSLITE file data, I found it both satisfying in being able to derive a clear reason to the phenomena I observe in the data but also extremely informative in terms of designing (as a work in progress) a more rigorous deep learning system. I have also had the opportunity to present my work to the S&C group, which generated some useful discussion which I look forward to applying to my work, and will be further presenting it during the ATLAS S&C Week workshop in the fall.
+I first approached my mentors and this project with the excitement of applying some of my current research in generative models for scientific machine learning directly to this precision upsampling problem, drawing parallels between 'making the numbers more precise' and 'super-resolving medical images', for example. Although the extensive data exploration we performed and the theoretical work I ended up doing stalled the direct application of deep learning models to these PHYSLITE data, I found it satisfying to derive a clear explanation for the phenomena I observed in the data, and extremely informative for designing (as a work in progress) a more rigorous deep learning system. I have also had the opportunity to present my work to the ANL Software & Computing group, which generated some useful discussion that I look forward to applying to my work, and I will present it again during the ATLAS S&C Week workshop in the fall.
 
 Working on HEP data was fascinating both for the subject matter and for the problem area being tackled, and I hope that my continued work can benefit not only this specific upsampling target for PHYSLITE data, but lossy compression as a whole. My time working with my mentors was fruitful and inspiring, and I felt integrated into the group via group meetings and the strong support that both mentors provided throughout the project. I am thankful for such an exciting opportunity as well as my mentors' insightful feedback each week, and I look forward to continuing my work beyond this summer with them!
