<!--

Spring 2025: GlueX-II

=====================================================================
triggerRate

80kHz

=====================================================================
runningTimeOnFloor

Total number of days: 150

=====================================================================
runningEfficiency

Total running efficiency: 50%

=====================================================================
eventsize

Running hdevio_scan on a raw data file gives:

hdevio_scan /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio
No run number given, trying to extract from filename: /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio
Processing file 1/1 : /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio
Mapping EVIO file ...


EVIO Statistics for /cache/halld/RunPeriod-2025-01/rawdata/Run133298/hd_rawdata_133298_004.evio :
 Nblocks: 301
 Nevents: 35445
 Nerrors: 0
Nbad_blocks: 0
Nbad_events: 0

EVIO file size: 19073 MB
EVIO block map size: 1682 kB
first event: 5660241
last event: 8495640

 block levels = 40
 events per block = 1,116-
 Nsync = 0
 Nprestart = 0
 Ngo = 0
 Npause = 0
 Nend = 0
 Nepics = 1
 Nbor = 1
 Nphysics = 1417720
 Nunknown = 0
 blocks with unknown tags = 0

which gives for the avg. event size: 19073/1417720 = 0.0135 MB/event or 13.5 kB

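As a quick cross-check of that arithmetic, a minimal Python sketch (variable
names are ours; the numbers are read directly off the hdevio_scan output above):

  file_size_MB = 19073        # "EVIO file size"
  n_physics    = 1417720      # "Nphysics" (physics events in the file)
  print(file_size_MB / n_physics * 1000.0)   # ~13.5 kB/event
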
=====================================================================
eventsPerRun

Number of events (in millions) in a production run: average 400M

=====================================================================
RESTfraction

This is based on looking at the REST file sizes for 2023: 6391425

The raw data files are all very similar in size to: 19570207

thus: 32.7%

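A one-line Python check of that ratio (the two file sizes as quoted above,
in the same unspecified units):

  print(6391425.0 / 19570207.0)   # ~0.327, i.e. 32.7%
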
=====================================================================
goodRunFraction

This represents the fraction of the full dataset considered good production
runs. We get this from the ratio of the CPU used for the two recon passes
from the record (https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html)
to that calculated assuming all beamtime was used to collect production
data:

  (1.5743+1.3669+1.7672+1.5498)/7.4 = 0.85

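The same ratio as a Python sketch (our reading: the four terms are the
recorded CPU of the two recon passes, two batches each, in Mhr; 7.4 Mhr is
the CPU the passes would have needed had all beamtime yielded production data):

  recon_cpu_Mhr        = 1.5743 + 1.3669 + 1.7672 + 1.5498   # recorded recon CPU
  all_beamtime_cpu_Mhr = 7.4                                  # if all beamtime were production data
  print(recon_cpu_Mhr / all_beamtime_cpu_Mhr)                 # ~0.85
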
=====================================================================
reconstructionRate

Directly measured on gluons gives something close to 5.2Hz/core. The
5.0 number is from memory of a calculation I did based on some numbers
from one of the launches documented here:
https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html

I assume the discrepancy is due to inclusion of hyperthreads in the
farm number.

=====================================================================
reconPasses

Number of reconstruction passes. We did 2 full recon passes of the
2017 data.

https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html

=====================================================================
analysisRate

This is estimated by comparing the total CPU of the first recon pass
of 2017 data to the total CPU of the first analysis pass and using
that ratio to scale the 5Hz recon rate:

  (5Hz)*(1.5743+1.3669)/(0.1954) = 75Hz

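In Python (our naming; 1.5743+1.3669 is taken to be the first recon pass CPU
and 0.1954 the first analysis pass CPU, both in Mhr, as the text implies):

  recon_rate_Hz      = 5.0
  recon_pass1_Mhr    = 1.5743 + 1.3669   # first recon pass, both batches
  analysis_pass1_Mhr = 0.1954            # first analysis pass
  print(recon_rate_Hz * recon_pass1_Mhr / analysis_pass1_Mhr)   # ~75 Hz
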
Note that this will depend on what channels are included in the pass.
Some passes only added channels and therefore took less time. This
number represents the rate for the first pass, which would have been the
slowest rate.

=====================================================================
analysisPasses

For 2017 there were 8 versions, but only 5 had data at
https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html
Presumably the other 3 were minor enough not to warrant bookkeeping.

As noted above, not all passes were the same. The first was the most
inclusive, but others only added some channels and therefore used
much less CPU. The final analysis launch looks to have taken the same
amount of time as the first, but was only run on about half of the files.

The value here is empirical; it represents an equivalent number of passes
that matches the total CPU of the 5 recorded passes.

  (0.551 Mhr)/(0.1954 Mhr) = 2.82

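Equivalently, in Python (0.551 Mhr being the summed CPU of the 5 recorded
analysis launches and 0.1954 Mhr that of the first, most inclusive, launch):

  analysis_total_Mhr = 0.551    # sum over the 5 recorded analysis launches
  analysis_pass1_Mhr = 0.1954   # first (most inclusive) launch
  print(analysis_total_Mhr / analysis_pass1_Mhr)   # ~2.82 equivalent passes
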
=====================================================================
cores

The average number of cores available to us varied between the different
launches/batches due to competition for the farm at the time. The
number of threads per job was 24. The number of active jobs varied
from 100-300, which would correspond to 2400 to 7200 cores. This would
include some hyperthreading.

The number of 4500 was based partially on the above and partially
on an estimate of the time jobs were active in each batch. The
following values are taken from eyeballing the "active" curve on the plot
"Number of jobs in each stage since launch":

  410 + 350 + 350 + 320 = 1430 hr = 8.5 weeks

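A short Python sketch of both estimates (values as quoted above; 168 hr per week):

  threads_per_job = 24
  print(100*threads_per_job, 300*threads_per_job)   # 2400 to 7200 cores (includes hyperthreads)
  active_hr = 410 + 350 + 350 + 320                 # eyeballed active time of the four batches
  print(active_hr, active_hr/168.0)                 # 1430 hr, ~8.5 weeks
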
=====================================================================
incomingData

proportional to number of runs

Number of files per run analyzed for the "incoming data" jobs. This
is always 5.

=====================================================================
calibRate

proportional to time on floor

This value represents the number of Mhr of CPU used per week of running
to calibrate the detector. For 2017 data, the gxproj3 account (Sean)
used 2Mhr. Additional time was used by individual accounts for
calibration that is not as easy to categorize. Tegan B. was the biggest
user with 7.4% of the 26.3Mhr, some fraction of which was for calibration.
For this value we assume 3Mhr/5.7 weeks = 0.526

It should be noted that during the discussion on this at the Offline
meeting on 2018-06-15 there was general thinking that we should be
able to calibrate with far less CPU in the future. This number is
higher partly because we were still developing the technique and partly
because the farm resource was not freely available at the time.

=====================================================================
offlineMonitoring

proportional to number of runs

A total of about 2.3 Mhr was used for Offline Monitoring jobs of 2017
data. This consisted of a couple of dozen runs with various conditions
and amounts of data for each. If we assume 289 production runs (based
on 0.893PB total data, 24TB/run, and 85% good run fraction), then the
offline monitoring used about 0.00800Mhr per run.

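In Python (taking the 289 runs as given from the estimate above):

  monitoring_Mhr    = 2.3
  n_production_runs = 289
  print(monitoring_Mhr / n_production_runs)   # ~0.008 Mhr per run
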
=====================================================================
miscUserStudies

proportional to time to process all files of a single run

This value is used to capture the CPU usage by all of the various users
that is attributed to the gluex project. Some of this should probably
go under calibRate, but it is very hard to categorize which parts of
this should go there.

It is assumed here that these are jobs that run over all files from a
small number of runs in order to do special studies. The amount of CPU
required is therefore proportional to the time it takes to process a
single production run. This number is empirical, based on 2017 CPU usage.
There is about a 9 Mhr discrepancy between the total usage (26.3 Mhr) and
the shared account usage (16.4 Mhr). We attribute 1 Mhr of that to Tegan's
calibrations in the calibRate value above.

  9Mhr/( (200M events)/(5Hz)/(3600s/hr) ) = 810

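The same calculation as a Python sketch (200M events presumably being a
typical 2017 production run and 5 Hz the reconstruction rate assumed above):

  misc_Mhr       = 9.0     # unattributed CPU after removing ~1 Mhr of calibration
  events_per_run = 200e6
  recon_rate_Hz  = 5.0
  run_time_Mhr   = events_per_run / recon_rate_Hz / 3600.0 / 1e6   # ~0.0111 Mhr to recon one run
  print(misc_Mhr / run_time_Mhr)   # ~810
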
Note that this is not to say that there were 810 studies, but rather,
this is the proportionality constant for the CPU usage that scales
with the time to process a single run.


"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

=====================================================================
Actual Farm CPU usage

For comparison of the calculation with the actual farm usage for the
recon launches, the recon numbers are obtained from:
https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html

Full Recon.
ver01: 3.19Mcore-hr
  batch1: mean CPU/job=68.55hr Njobs=23337
  batch2: mean CPU/job=71.16hr Njobs=22411

ver02: 3.37Mcore-hr
  batch1: mean CPU/job=76.95hr Njobs=23262
  batch2: mean CPU/job=80.68hr Njobs=19569

Total: 6.56 Mcore-hr (n.b. this includes hyperthreads and failed jobs)

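A quick Python check that the per-version totals add up (mean CPU/job times
Njobs, summed over the two batches):

  ver01 = 68.55*23337 + 71.16*22411
  ver02 = 76.95*23262 + 80.68*19569
  print(ver01/1e6, ver02/1e6, (ver01 + ver02)/1e6)   # ~3.19, ~3.37, ~6.56 Mcore-hr
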
=====================================================================
Actual Tape usage

The total amount of raw data was 911TB (from memory, since the scicomp
page is down). This number includes special runs, among them some
tests by Sasha after the beam was gone. Other non-production running
was also mixed in, which would cause this number to be higher than the
estimate calculated by the model.

=====================================================================
simulationRate

This is based on a very rough value Thomas B. gave of 40ms/event for
bggen events with real data background mixed in. Note that adding the
background this way significantly reduces the compute time required
compared to previous models.

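The simulationRate parameter below (25 Hz) appears to be simply the inverse
of that per-event time; a one-line Python check:

  print(1.0 / 40e-3)   # 40 ms/event -> 25 events/s
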
=====================================================================
simulationpasses

Number of times we will need to repeat the simulation. This value of
2 is an old estimate.

=====================================================================
simulatedPerRawEvent

Number of simulated events needed for each raw data event (production
runs only). This is assumed to be 2 simulated events for each signal
event in the raw data stream. We estimate about 20% of the raw data
is reconstructable (see "GlueX at High Intensity" talk slide 10
here: https://halldweb.jlab.org/wiki/index.php/GlueX-II_and_DIRC_ERR )

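That gives the 0.4 used below; in Python (assumptions as stated above):

  sim_per_signal  = 2.0    # simulated events per reconstructable signal event
  reconstructable = 0.20   # fraction of raw data that is reconstructable
  print(sim_per_signal * reconstructable)   # 0.4 simulated events per raw event
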

-->
<compMod>
<parameter name="triggerRate" value="80e3" units="Hz"/>
<parameter name="runningTimeOnFloor" value="150.0" units="days"/>
<parameter name="runningEfficiency" value="0.50"/>
<parameter name="eventsize" value="13.5" units="kB"/>
<parameter name="eventsPerRun" value="400" units="Mevent"/>
<parameter name="compressionFactor" value="1.0"/>
<parameter name="RESTfraction" value="0.33"/>

<parameter name="reconstructionRate" value="10.0" units="Hz"/>
<parameter name="reconPasses" value="1.0"/>
<parameter name="goodRunFraction" value="0.85"/>
<parameter name="analysisRate" value="75.0" units="Hz"/>
<parameter name="analysisPasses" value="2.82"/>
<parameter name="cores" value="4500"/>
<parameter name="incomingData" value="5" units="files"/>
<parameter name="calibRate" value="0.530" units="Mhr/week"/>
<parameter name="offlineMonitoring" value="0.00800" units="Mhr/run"/>
<parameter name="miscUserStudies" value="810"/>

<parameter name="simulationRate" value="25" units="Hz"/>
<parameter name="simulationpasses" value="2"/>
<parameter name="simulatedPerRawEvent" value="0.4"/>
</compMod>