
Commit d955ab4

Merge pull request #251 from JeffersonLab/aaust_comp_mod_py3
Update computing model based on 2023_01 production and 2025_01 data taking
2 parents: 2400175 + 08b5367

File tree

2 files changed: +337 additions, -48 deletions


comp_mod/RunPeriod-2025-01.xml

Lines changed: 288 additions & 0 deletions
@@ -0,0 +1,288 @@
<!--

Spring 2025: GlueX-II

=====================================================================
triggerRate

80kHz

=====================================================================
runningTimeOnFloor

Total number of days: 150

=====================================================================
runningEfficiency

Total running efficiency: 50%

=====================================================================
eventsize

Running hdevio_scan on a raw data file gives:

hdevio_scan /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio
No run number given, trying to extract from filename: /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio
Processing file 1/1 : /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio
Mapping EVIO file ...


EVIO Statistics for /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio :
Nblocks: 301
Nevents: 35445
Nerrors: 0
Nbad_blocks: 0
Nbad_events: 0

EVIO file size: 19073 MB
EVIO block map size: 1682 kB
first event: 5660241
last event: 8495640

block levels = 40
events per block = 1,116-
Nsync = 0
Nprestart = 0
Ngo = 0
Npause = 0
Nend = 0
Nepics = 1
Nbor = 1
Nphysics = 1417720
Nunknown = 0
blocks with unknown tags = 0

which gives for the avg. event size: 19073/1417720 = 0.0135 MB/event or 13.5kB

=====================================================================
eventsPerRun

Number of events (in millions) in a production run: average 400M

=====================================================================
RESTfraction

This is based on looking at the REST file sizes for 2023: 6391425

The raw data files are all very similar in size to: 19570207

thus: 32.7%

=====================================================================
goodRunFraction

This represents the fraction of the full dataset considered good production
runs. We get this from the ratio of the CPU used for the two recon passes
from the record (https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html)
to that calculated assuming all beamtime was used to collect production
data:

(1.5743+1.3669+1.7672+1.5498)/7.4 = 0.85

=====================================================================
reconstructionRate

Directly measured on gluons gives something close to 5.2Hz/core. The
5.0 number is from memory of a calculation I did based on some numbers
from one of the launches documented here:
https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html

I assume the discrepancy is due to inclusion of hyperthreads in the
farm number.

=====================================================================
reconPasses

Number of reconstruction passes. We did 2 full recon passes of the
2017 data.

https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html

=====================================================================
analysisRate

This is estimated by looking at the total CPU of the first analysis
pass of 2017 data compared to the total CPU of the first recon pass
and using that to scale the 5Hz recon rate:

(5Hz)*(1.5743+1.3669)/(0.1954) = 75Hz

Note that this will depend on what channels are included in the pass.
Some passes only added channels and therefore took less time. This
number represents the rate for the first pass, which would have been the
slowest rate.

=====================================================================
analysisPasses

For 2017 there were 8 versions, but only 5 had data at
https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html
Presumably the other 3 were minor enough not to warrant bookkeeping.

As noted above, not all passes were the same. The first was the most
inclusive, but others only added some channels and therefore used
much less CPU. The final analysis launch looks to have taken the same
amount of time as the first, but was only run on about half of the files.

The value here is empirical to represent an equivalent number of passes
to match the total CPU of the 5 recorded passes.

(0.551 Mhr)/(0.1954) = 2.82

=====================================================================
cores

The average number of cores available to us varied from the different
launches/batches due to competition for the farm at the time. The
number of threads per job was 24. The number of jobs active varied
from 100-300 which would correspond to 2400 to 7200 cores. This would
include some hyperthreading.

The number of 4500 was based partially on the above and partially
on an estimate of the time jobs were active in each batch. The
following are taken from eyeballing the "active" curve on the plot:
"Number of jobs in each stage since launch"

410 + 350 + 350 + 320 = 1430 hr = 8.5 weeks

=====================================================================
incomingData

proportional to number of runs

Number of files per run analyzed for the "incoming data" jobs. This
is always 5.

=====================================================================
calibRate

proportional to time on floor

This value represents the number of Mhr of CPU used per week of running
to calibrate the detector. For 2017 data, the gxproj3 account (Sean)
used 2Mhr. Additional time was used by individual accounts for
calibration that is not as easy to categorize. Tegan B. was the biggest
user with 7.4% of the 26.3Mhr, some fraction of this for calibration.
For this value we assume 3Mhr/5.7 weeks = 0.526

It should be noted that during the discussion on this at the Offline
meeting on 2018-06-15 there was general thinking that we should be
able to calibrate with far less CPU in the future. This number is
higher partly because we were still developing technique and partly
because the farm resource was not freely available at the time.

=====================================================================
offlineMonitoring

proportional to number of runs

A total of about 2.3 Mhr was used for Offline Monitoring jobs of 2017
data. This consisted of a couple of dozen runs with various conditions
and amounts of data for each. If we took 289 production runs (based
on 0.893PB total data, 24TB/run, and 85% good run fraction) then the
offline monitoring used about 0.00800Mhr per run.

=====================================================================
miscUserStudies

proportional to time to process all files of a single run

This value is used to capture the CPU usage by all of the various users
that is attributed to the gluex project. Some of this should probably
go under calibRate, but it is very hard to categorize which parts of
this should go there.

It is assumed here that these are jobs that run over all files from a
small number of runs in order to do special studies. The amount of CPU
required is therefore proportional to the time it takes to process a
single production run. This number is empirical based on 2017 CPU usage.
There is about a 9 Mhr discrepancy between the total usage (26.3Mhr) and the
shared account usage (16.4Mhr). We attribute 1Mhr of that to Tegan's
calibrations in the calibRate value above.

9Mhr/( (200M events)/(5Hz)/(3600s/hr) ) = 810

Note that this is not to say that there were 810 studies, but rather,
this is the proportionality constant for the CPU usage that is
proportional to processing a single run.


"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

=====================================================================
Actual Farm CPU usage

By way of comparison of the calculation to the actual farm usage for
the recon launches, recon numbers are obtained from:
https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html

Full Recon.
ver01: 3.19Mcore-hr
batch1: mean CPU/job=68.55hr Njobs=23337
batch2: mean CPU/job=71.16hr Njobs=22411

ver02: 3.37Mcore-hr
batch1: mean CPU/job=76.95hr Njobs=23262
batch2: mean CPU/job=80.68hr Njobs=19569

Total: 6.56 Mcore-hr n.b. this will include hyperthreads and failed jobs

=====================================================================
Actual Tape usage

The total amount of raw data was 911TB (from memory since the scicomp
page is down). This number includes special runs, including some
tests by Sasha after the beam was gone. Other non-production running
was also mixed in that would cause this number to be higher than the
estimate calculated by the model.

=====================================================================
simulationRate

This is based on a very rough value Thomas B. gave of 40ms/event for
bggen events with real data background mixed in. Note that adding the
background this way significantly reduced the compute time required
from previous models.

=====================================================================
simulationpasses

Number of times we will need to repeat simulation. This value of
2 is an old estimate.

=====================================================================
simulatedPerRawEvent

Number of simulated events needed for each raw data event (production
runs only). This is assumed to be 2 simulated events for each signal
event in the raw data stream. We estimate about 20% of the raw data
is reconstructable (see "GlueX at High Intensity" talk slide 10
here: https://halldweb.jlab.org/wiki/index.php/GlueX-II_and_DIRC_ERR )


-->
<compMod>
<parameter name="triggerRate" value="80e3" units="Hz"/>
<parameter name="runningTimeOnFloor" value="150.0" units="days"/>
<parameter name="runningEfficiency" value="0.50"/>
<parameter name="eventsize" value="13.5" units="kB"/>
<parameter name="eventsPerRun" value="400" units="Mevent"/>
<parameter name="compressionFactor" value="1.0"/>
<parameter name="RESTfraction" value="0.33"/>

<parameter name="reconstructionRate" value="10.0" units="Hz"/>
<parameter name="reconPasses" value="1.0"/>
<parameter name="goodRunFraction" value="0.85"/>
<parameter name="analysisRate" value="75.0" units="Hz"/>
<parameter name="analysisPasses" value="2.82"/>
<parameter name="cores" value="4500"/>
<parameter name="incomingData" value="5" units="files"/>
<parameter name="calibRate" value="0.530" units="Mhr/week"/>
<parameter name="offlineMonitoring" value="0.00800" units="Mhr/run"/>
<parameter name="miscUserStudies" value="810"/>

<parameter name="simulationRate" value="25" units="Hz"/>
<parameter name="simulationpasses" value="2"/>
<parameter name="simulatedPerRawEvent" value="0.4"/>
</compMod>
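For orientation, the sketch below shows one way parameters like these can be combined into the data-volume figures that the model reports. It is a rough illustration only: the formulas (for example, whether goodRunFraction enters the raw-data volume) are assumptions for this sketch and may differ from what comp_mod.py actually computes.

```python
# Minimal sketch (not comp_mod.py itself): read the parameter file above and
# combine a few values the way the comment text describes.
import xml.etree.ElementTree as ET

root = ET.parse("comp_mod/RunPeriod-2025-01.xml").getroot()
p = {e.get("name"): float(e.get("value")) for e in root.findall("parameter")}

# Live beam seconds: days on the floor times running efficiency.
beam_seconds = p["runningTimeOnFloor"] * 86400 * p["runningEfficiency"]

# Raw data volume for good production runs (assumed formula):
# trigger rate [Hz] * event size [bytes] * live seconds * good-run fraction.
raw_bytes = p["triggerRate"] * p["eventsize"] * 1e3 * beam_seconds * p["goodRunFraction"]
print("raw data volume:  %.1f PB" % (raw_bytes / 1e15))   # ~5.9 PB with the values above

# REST volume scales the raw volume by RESTfraction and the number of recon passes.
print("REST data volume: %.1f PB" % (raw_bytes * p["RESTfraction"] * p["reconPasses"] / 1e15))
```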

comp_mod/comp_mod.py

Lines changed: 49 additions & 48 deletions
@@ -42,7 +42,7 @@
 
 # defaults values for variable that may not be present in all files
 #NERSC_unitsPerFile = 880
-NERSC_unitsPerFile = 3. # Average node hours per file determined by Igal for GlueX data in 2022
+NERSC_unitsPerFile = 0.33 # Average node hours per file determined by Igal for 2023 GlueX data in 2025
 PSC_unitsPerFile = 156.8
 
 # input values (with unit checks
@@ -133,53 +133,54 @@ class bcolors:
 if HIGHLIGHT : BD = bcolors.BOLD + bcolors.FAIL
 
 # Print report
-print ''
-print ' GlueX Computing Model'
-print ' '*(25 - len(INPUTFILE)/2) + INPUTFILE
-print '=============================================='
-print ' PAC Time: ' + '%3.1f' % runningTimePac_weeks + ' weeks'
-print ' Running Time: ' + '%3.1f' % (runningTimeOnFloor/7.0) + ' weeks'
-print ' Running Efficiency: ' + str(int(runningEfficiency*100.0)) + '%'
-print ' --------------------------------------'
-print ' Trigger Rate: ' + str(triggerRate/1000.0) + ' kHz'
-print ' Raw Data Num. Events: ' + '%3.1f' % numberProductionEvents_billions + ' billion (good production runs only)'
-print ' Raw Data compression: ' + '%3.2f' % compressionFactor
-print ' Raw Data Event Size: ' + str(eventsize) + ' kB ' + uncompressed_str
-print ' Front End Raw Data Rate: ' + '%3.2f' % rawDataRateUncompressed_GBps + ' GB/s ' + uncompressed_str
-print ' Disk Raw Data Rate: ' + '%3.2f' % rawDataRateCompressed_GBps + ' GB/s ' + compressed_str
-print ' Raw Data Volume: ' + '%3.3f' % rawDataVolume_PB + ' PB ' + compressed_str
-print ' Bandwidth to offsite: ' + '%3.0f' % rawDataOffsite1month_MBps + ' MB/s (all raw data in 1 month)'
-print ' REST/Raw size frac.: ' + '%3.2f' % (RESTfractionCompressed*100.0) + '%'
-print ' REST Data Volume: ' + '%3.3f' % RESTDataVolume_PB + ' PB (for ' + str(reconPasses) + ' passes)'
-print ' Analysis Data Volume: ' + '%3.3f' % AnalysisDataVolume_PB + ' PB (ROOT Trees for ' + str(analysisPasses) + ' passes)'
-print BD+' Total Real Data Volume: ' + '%3.1f' % (rawDataVolume_PB + RESTDataVolume_PB + AnalysisDataVolume_PB) + ' PB' + bcolors.ENDC
-print ' --------------------------------------'
-print ' Recon. time/event: ' + '%3.0f' % reconstructionTimePerEvent_ms + ' ms (' + str(reconstructionRate) + ' Hz/core)'
-print ' Available CPUs: ' + str(cores) + ' cores (full)'
-print ' Time to process: ' + '%3.1f' % reconstructionTimeAllCores_weeks + ' weeks (all passes)'
-print ' Good run fraction: ' + str(goodRunFraction)
-print ' Number of recon passes: ' + str(reconPasses)
-print 'Number of analysis passes: ' + str(analysisPasses)
-print ' Reconstruction CPU: ' + '%3.1f' % reconstructionTimeAllCores_Mhr + ' Mhr' + ' (=%2.0fk NERSC units or %2.0fM PSC units)' %(NERSC_units_total, PSC_units_total)
-print ' Analysis CPU: ' + '%3.3f' % analysisCPU_Mhr + ' Mhr'
-print ' Calibration CPU: ' + '%3.1f' % calibCPU_Mhr + ' Mhr'
-print ' Offline Monitoring CPU: ' + '%3.1f' % offlineMonitoring_Mhr + ' Mhr'
-print ' Misc User CPU: ' + '%3.1f' % miscUserStudies_Mhr + ' Mhr'
-print ' Incoming Data CPU: ' + '%3.3f' % incomingData_Mhr + ' Mhr'
-print BD+' Total Real Data CPU: ' + '%3.1f' % TOTAL_CPU_REAL_DATA + ' Mhr' + bcolors.ENDC
-print ' --------------------------------------'
-print ' MC generation Rate: ' + '%3.1f' % simulationRate + ' Hz/core'
-print ' MC Number of passes: ' + '%3.1f' % simulationpasses
-print ' MC events/raw event: ' + '%3.2f' % simulatedPerRawEvent
-print BD+' MC data volume: ' + '%3.3f' % simulationDataVolume_PB + ' PB (REST only)' + bcolors.ENDC
-print ' MC Generation CPU: ' + '%3.1f' % simulationTimeGeneration_Mhr + ' Mhr'
-print ' MC Reconstruction CPU: ' + '%3.1f' % simulationTimeReconstruction_Mhr + ' Mhr'
-print BD+' Total MC CPU: ' + '%3.1f' % simulationTimeTotal_Mhr + ' Mhr' + bcolors.ENDC
-print ' ---------------------------------------'
-print ' TOTALS:'
-print BD+' CPU: ' + '%3.1f' % TOTAL_CPU_Mhr + ' Mhr' + bcolors.ENDC
-print BD+' TAPE: ' + '%3.1f' % TOTAL_TAPE_PB + ' PB' + bcolors.ENDC
-print ''
+print( '')
+print( ' GlueX Computing Model')
+print( ' '*(25 - int(len(INPUTFILE)/2)) + INPUTFILE)
+print( '==============================================')
+print( ' PAC Time: ' + '%3.1f' % runningTimePac_weeks + ' weeks')
+print( ' Running Time: ' + '%3.1f' % (runningTimeOnFloor/7.0) + ' weeks')
+print( ' Running Efficiency: ' + str(int(runningEfficiency*100.0)) + '%')
+print( ' --------------------------------------')
+print( ' Trigger Rate: ' + str(triggerRate/1000.0) + ' kHz')
+print( ' Raw Data Num. Events: ' + '%3.1f' % numberProductionEvents_billions + ' billion (good production runs only)')
+print( ' Raw Data compression: ' + '%3.2f' % compressionFactor)
+print( ' Raw Data Event Size: ' + str(eventsize) + ' kB ' + uncompressed_str)
+print( ' Front End Raw Data Rate: ' + '%3.2f' % rawDataRateUncompressed_GBps + ' GB/s ' + uncompressed_str)
+print( ' Disk Raw Data Rate: ' + '%3.2f' % rawDataRateCompressed_GBps + ' GB/s ' + compressed_str)
+print( ' Raw Data Volume: ' + '%3.3f' % rawDataVolume_PB + ' PB ' + compressed_str)
+print( ' Bandwidth to offsite: ' + '%3.0f' % rawDataOffsite1month_MBps + ' MB/s (all raw data in 1 month)')
+print( ' REST/Raw size frac.: ' + '%3.2f' % (RESTfractionCompressed*100.0) + '%')
+print( ' REST Data Volume: ' + '%3.3f' % RESTDataVolume_PB + ' PB (for ' + str(reconPasses) + ' passes)')
+print( ' Analysis Data Volume: ' + '%3.3f' % AnalysisDataVolume_PB + ' PB (ROOT Trees for ' + str(analysisPasses) + ' passes)')
+print(BD+' Total Real Data Volume: ' + '%3.1f' % (rawDataVolume_PB + RESTDataVolume_PB + AnalysisDataVolume_PB) + ' PB' + bcolors.ENDC)
+print( ' --------------------------------------')
+print( ' Recon. time/event: ' + '%3.0f' % reconstructionTimePerEvent_ms + ' ms (' + str(reconstructionRate) + ' Hz/core)')
+print( ' Available CPUs: ' + str(cores) + ' cores (full)')
+print( ' Time to process: ' + '%3.1f' % reconstructionTimeAllCores_weeks + ' weeks (all passes)')
+print( ' Good run fraction: ' + str(goodRunFraction))
+print( ' Number of recon passes: ' + str(reconPasses))
+print( 'Number of analysis passes: ' + str(analysisPasses))
+#print( ' Reconstruction CPU: ' + '%3.1f' % reconstructionTimeAllCores_Mhr + ' Mhr' + ' (=%2.0fk NERSC units or %2.0fM PSC units)' %(NERSC_units_total, PSC_units_total))
+print( ' Reconstruction CPU: ' + '%3.1f' % reconstructionTimeAllCores_Mhr + ' Mhr' + ' (=%2.0fk NERSC units)' %(NERSC_units_total))
+print( ' Analysis CPU: ' + '%3.3f' % analysisCPU_Mhr + ' Mhr')
+print( ' Calibration CPU: ' + '%3.1f' % calibCPU_Mhr + ' Mhr')
+print( ' Offline Monitoring CPU: ' + '%3.1f' % offlineMonitoring_Mhr + ' Mhr')
+print( ' Misc User CPU: ' + '%3.1f' % miscUserStudies_Mhr + ' Mhr')
+print( ' Incoming Data CPU: ' + '%3.3f' % incomingData_Mhr + ' Mhr')
+print(BD+' Total Real Data CPU: ' + '%3.1f' % TOTAL_CPU_REAL_DATA + ' Mhr' + bcolors.ENDC)
+print( ' --------------------------------------')
+print( ' MC generation Rate: ' + '%3.1f' % simulationRate + ' Hz/core')
+print( ' MC Number of passes: ' + '%3.1f' % simulationpasses)
+print( ' MC events/raw event: ' + '%3.2f' % simulatedPerRawEvent)
+print(BD+' MC data volume: ' + '%3.3f' % simulationDataVolume_PB + ' PB (REST only)' + bcolors.ENDC)
+print( ' MC Generation CPU: ' + '%3.1f' % simulationTimeGeneration_Mhr + ' Mhr')
+print( ' MC Reconstruction CPU: ' + '%3.1f' % simulationTimeReconstruction_Mhr + ' Mhr')
+print(BD+' Total MC CPU: ' + '%3.1f' % simulationTimeTotal_Mhr + ' Mhr' + bcolors.ENDC)
+print( ' ---------------------------------------')
+print( ' TOTALS:')
+print(BD+' CPU: ' + '%3.1f' % TOTAL_CPU_Mhr + ' Mhr' + bcolors.ENDC)
+print(BD+' TAPE: ' + '%3.1f' % TOTAL_TAPE_PB + ' PB' + bcolors.ENDC)
+print( '')
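Besides wrapping print in parentheses, this hunk wraps len(INPUTFILE)/2 in int(): in Python 3, / is true division and returns a float, which cannot be used as a string repeat count. A standalone illustration (the filename below is just an example, not taken from the script):

```python
# Python 3 behaviour that motivates the int() cast in the title line above.
INPUTFILE = "RunPeriod-2025-01.xml"   # example filename, for illustration only

# len(INPUTFILE)/2 is 10.5 here (true division); ' ' * 10.5 would raise
# TypeError, so the result is cast to int before being used as a repeat count.
pad = ' ' * (25 - int(len(INPUTFILE) / 2))
print(pad + INPUTFILE)
```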