
Commit 0c9661e

Authored by Benedikt Volkel (benedikt-voelkel) and co-authors
Apply additional custom cuts (#642)
* cuts are loaded from the corresponding database analysis section in the `Processer` parent class
* the flag `use_cuts` in the analysis section switches usage on or off
* applied in derived classes via the helper method `Processer.apply_cuts_ptbin(df, ipt)` (at the moment implemented in processerdhadrons_mult.py)
* foreseen to place additional cuts ONLY in `Processer_derived.process_histomass_single()`
* remove `process_histomass` from processerdhadrons_mult since it is a duplicate of the one in processer.py

Co-authored-by: Benedikt Volkel <[email protected]>
1 parent dc2cf1c commit 0c9661e

File tree: 4 files changed (+87 −20 lines)

machine_learning_hep/analysis/README.md

Lines changed: 24 additions & 0 deletions
@@ -5,6 +5,30 @@
 First of all, everything in here is basically an **Analyzer**. These objects can be handled by an `AnalysisManager`.
 
+## Applying additional analysis cuts
+
+In order to apply additional cuts before a mass histogram is filled, they have to be set in the corresponding analysis section of the database, one cut per analysis pT bin. If no cut should be applied for a bin, just put `Null`. The flag `use_cuts` controls whether the cuts are applied at all. Cuts are formulated as strings which are passed directly to `pandas.DataFrame.query`, meaning that every name used **must** exist as a column in the dataframe used in the analysis. An example implementation in the database could look like
+
+```yaml
+# within an analysis section, assuming 4 pT bins
+use_cuts: True
+cuts:
+  - "p_prong0 > 2 or p_prong1 < 1"
+  - Null
+  - "abs(eta_cand) < 1.2"
+  - Null
+```
+
+The cuts can then be accessed in `processer_<type>.process_histomass_single`. The database flag `use_cuts` is translated into the member `self.do_custom_analysis_cuts`; check that it is `True` before cutting so as not to circumvent its purpose. There is a helper function in `Processer`, so if you have a dataframe corresponding to a certain pT bin, you can just do
+
+```python
+if self.do_custom_analysis_cuts:
+    df = self.apply_cuts_ptbin(df, ipt)
+```
+
+which applies the cuts defined for the `ipt`'th bin and returns the filtered dataframe. If no cut was defined for that bin, the dataframe is returned unchanged.
+
 ## Analysis and systematic implementation and workflow
 
 A specific analysis or systematics is derived from `Analyzer`. This `AnalyzerDerived` can then implement any analysis step method. Note that passing arguments to those methods is at the moment not supported. However, as they have access to the entire configuration via the database dictionary, this will probably not be needed, as all specifics can be derived from that database.
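Since these cut strings are passed verbatim to `pandas.DataFrame.query`, a cut from the database can be tried out in isolation. A minimal sketch, assuming toy values for the column names used in the example above:

```python
import pandas as pd

# Toy candidate dataframe; the columns must carry exactly the names
# referenced in the cut strings
df = pd.DataFrame({
    "p_prong0": [2.5, 1.0, 3.0],
    "p_prong1": [0.5, 2.0, 1.5],
    "eta_cand": [0.3, -1.5, 0.9],
})

# A cut string exactly as it would appear in the database
cut = "p_prong0 > 2 or p_prong1 < 1"
df_cut = df.query(cut)
print(len(df_cut))  # → 2 (rows 0 and 2 survive)
```

A column referenced in a cut string but missing from the dataframe raises an error at query time, which is why the column-existence requirement above matters.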

machine_learning_hep/data/data_prod_20200417/database_ml_parameters_D0pp_0417.yml

Lines changed: 10 additions & 0 deletions
@@ -737,6 +737,16 @@ D0pp:
     nevents: null
     dodoublecross: false
 
+    # Additional cuts applied before the mass histogram is filled
+    use_cuts: False
+    cuts:
+      - Null
+      - Null
+      - Null
+      - Null
+      - Null
+      - Null
+
     systematics:
       # For now don't do these things per pT bin
       max_chisquare_ndf: 2.
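Since the number of entries under `cuts` must match the number of analysis pT bins, a quick consistency check on the parsed section can catch mismatches early. A minimal sketch, using a plain dict standing in for the parsed YAML (a YAML loader turns `Null` entries into Python `None`; the bin count here is assumed):

```python
# Stand-in for the parsed analysis section of the YAML above
analysis_section = {
    "use_cuts": False,
    "cuts": [None, None, None, None, None, None],
}

n_pt_bins = 6  # must equal the number of analysis pT bins in the database
cuts = analysis_section.get("cuts") or [None] * n_pt_bins
if len(cuts) != n_pt_bins:
    raise ValueError(f"expected {n_pt_bins} cuts, got {len(cuts)}")
print("cuts active:", analysis_section.get("use_cuts", False))  # → cuts active: False
```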

machine_learning_hep/processer.py

Lines changed: 48 additions & 0 deletions
@@ -16,6 +16,7 @@
 main script for doing data processing, machine learning and analysis
 """
 import sys
+from copy import deepcopy
 import multiprocessing as mp
 import pickle
 import os
@@ -143,6 +144,10 @@ def __init__(self, case, datap, run_param, mcordata, p_maxfiles,
         self.lpt_anbinmin = datap["sel_skim_binmin"]
         self.lpt_anbinmax = datap["sel_skim_binmax"]
         self.p_nptbins = len(self.lpt_anbinmin)
+        # Analysis pT bins
+        self.lpt_finbinmin = datap["analysis"][self.typean]["sel_an_binmin"]
+        self.lpt_finbinmax = datap["analysis"][self.typean]["sel_an_binmax"]
+        self.p_nptfinbins = len(self.lpt_finbinmin)
         self.lpt_model = datap["mlapplication"]["modelsperptbin"]
         self.dirmodel = datap["ml"]["mlout"]
         self.lpt_model = appendmainfoldertolist(self.dirmodel, self.lpt_model)
@@ -203,6 +208,11 @@ def __init__(self, case, datap, run_param, mcordata, p_maxfiles,
         # if os.path.exists(self.d_root) is False:
         #     self.logger.warning("ROOT tree folder is not there. Is it intentional?")
 
+        # Analysis cuts (loaded in self.process_histomass)
+        self.analysis_cuts = None
+        # Flag whether they should be used
+        self.do_custom_analysis_cuts = datap["analysis"][self.typean].get("use_cuts", False)
+
     def unpack(self, file_index):
         treeevtorig = uproot.open(self.l_root[file_index])[self.n_treeevt]
         try:
@@ -392,6 +402,41 @@ def process_mergedec(self):
             if self.mcordata == "mc":
                 merge_method(self.mptfiles_gensk[ipt], self.lpt_gendecmerged[ipt])
 
+    def load_cuts(self):
+        """Load custom analysis cuts from the database
+        """
+        # Expect a list with one entry per analysis pT bin (self.p_nptfinbins)
+        raw_cuts = self.datap["analysis"][self.typean].get("cuts", None)
+        if not raw_cuts:
+            print("No custom cuts given, hence not cutting...")
+            self.analysis_cuts = [None] * self.p_nptfinbins
+            return
+
+        if len(raw_cuts) != self.p_nptfinbins:
+            print(f"You have {self.p_nptfinbins} pT bins but passed {len(raw_cuts)} cuts. Exit...")
+            sys.exit(1)
+
+        self.analysis_cuts = deepcopy(raw_cuts)
+
+    def apply_cuts_ptbin(self, df_, ipt):
+        """Helper function to cut a dataframe with the cuts for a given pT bin
+
+        Args:
+            df_: dataframe
+            ipt: int
+                i'th pT bin
+        Returns:
+            dataframe
+        """
+        if not self.analysis_cuts[ipt]:
+            return df_
+
+        return df_.query(self.analysis_cuts[ipt])
+
     # pylint: disable=no-member
     def process_histomass(self):
         print("Doing masshisto", self.mcordata, self.period)
@@ -402,6 +447,9 @@ def process_histomass(self):
         else:
             print("No extra selection needed since we are doing std analysis")
 
+        # Load potential custom cuts
+        self.load_cuts()
+
         create_folder_struc(self.d_results, self.l_path)
         arguments = [(i,) for i in range(len(self.l_root))]
         self.parallelizer(self.process_histomass_single, arguments, self.p_chunksizeunp)
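The interplay of the two new helpers — falling back to a per-bin `None` list when no cuts are configured, refusing a wrong-length list, and passing dataframes through untouched for `Null` bins — can be illustrated with a small stand-in class. `CutHelper` and the column name `pt_cand` are hypothetical illustrations, not part of the package:

```python
from copy import deepcopy

import pandas as pd


class CutHelper:
    """Hypothetical stand-in mimicking the cut logic added to Processer."""

    def __init__(self, analysis_config, n_pt_bins):
        self.p_nptfinbins = n_pt_bins
        self.do_custom_analysis_cuts = analysis_config.get("use_cuts", False)
        raw_cuts = analysis_config.get("cuts", None)
        if not raw_cuts:
            # No cuts configured: one no-op entry per pT bin
            self.analysis_cuts = [None] * self.p_nptfinbins
        elif len(raw_cuts) != self.p_nptfinbins:
            raise ValueError(f"{self.p_nptfinbins} pT bins but {len(raw_cuts)} cuts")
        else:
            self.analysis_cuts = deepcopy(raw_cuts)

    def apply_cuts_ptbin(self, df_, ipt):
        # Null/None in the database means: leave this pT bin untouched
        if not self.analysis_cuts[ipt]:
            return df_
        return df_.query(self.analysis_cuts[ipt])


config = {"use_cuts": True, "cuts": ["pt_cand > 2", None]}
helper = CutHelper(config, 2)
df = pd.DataFrame({"pt_cand": [1.0, 3.0, 5.0]})
print(len(helper.apply_cuts_ptbin(df, 0)))  # bin 0: cut applied → 2
print(len(helper.apply_cuts_ptbin(df, 1)))  # bin 1: Null → unchanged → 3
```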

machine_learning_hep/processerdhadrons_mult.py

Lines changed: 5 additions & 20 deletions
@@ -65,9 +65,6 @@ def __init__(self, case, datap, run_param, mcordata, p_maxfiles,
         self.v_var2_binning_gen = datap["analysis"][self.typean]["var_binning2_gen"]
         self.corr_eff_mult = datap["analysis"][self.typean]["corrEffMult"]
 
-        self.lpt_finbinmin = datap["analysis"][self.typean]["sel_an_binmin"]
-        self.lpt_finbinmax = datap["analysis"][self.typean]["sel_an_binmax"]
-        self.p_nptfinbins = len(self.lpt_finbinmin)
         self.bin_matching = datap["analysis"][self.typean]["binning_matching"]
         #self.sel_final_fineptbins = datap["analysis"][self.typean]["sel_final_fineptbins"]
         self.s_evtsel = datap["analysis"][self.typean]["evtsel"]
@@ -177,6 +174,7 @@ def process_histomass_single(self, index):
         hvtxoutmult.Write()
 
         list_df_recodtrig = []
+
         for ipt in range(self.p_nptfinbins):
             bin_id = self.bin_matching[ipt]
             df = pickle.load(openfile(self.mptfiles_recoskmldec[bin_id][index], "rb"))
@@ -192,6 +190,10 @@ def process_histomass_single(self, index):
             list_df_recodtrig.append(df)
             df = seldf_singlevar(df, self.v_var_binning, \
                                  self.lpt_finbinmin[ipt], self.lpt_finbinmax[ipt])
+
+            if self.do_custom_analysis_cuts:
+                df = self.apply_cuts_ptbin(df, ipt)
+
             for ibin2 in range(len(self.lvar2_binmin)):
                 suffix = "%s%d_%d_%.2f%s_%.2f_%.2f" % \
                     (self.v_var_binning, self.lpt_finbinmin[ipt],
@@ -251,23 +253,6 @@ def process_histomass_single(self, index):
             df_recodtrig[df_recodtrig[self.v_ismcsignal] == 1], "MC"
         ).write()
 
-    def process_histomass(self):
-        print("Doing masshisto", self.mcordata, self.period)
-        print("Using run selection for mass histo", \
-              self.runlistrigger, "for period", self.period)
-        if self.doml is True:
-            print("Doing ml analysis")
-        else:
-            print("No extra selection needed since we are doing std analysis")
-
-        create_folder_struc(self.d_results, self.l_path)
-        arguments = [(i,) for i in range(len(self.l_root))]
-        self.parallelizer(self.process_histomass_single, arguments, self.p_chunksizeunp)
-        tmp_merged = \
-            f"/data/tmp/hadd/{self.case}_{self.typean}/mass_{self.period}/{get_timestamp_string()}/"
-        mergerootfiles(self.l_histomass, self.n_filemass, tmp_merged)
-
     def get_reweighted_count(self, dfsel):
         filename = os.path.join(self.d_mcreweights, self.n_mcreweights)
         weight_file = TFile.Open(filename, "read")
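The per-pT-bin flow in `process_histomass_single` — select the pT window, then optionally apply the custom cut — can be sketched on a toy dataframe. Column names, bin edges, and cut strings here are made up, and `seldf_singlevar` is approximated by a plain range query:

```python
import pandas as pd

# Made-up analysis configuration: 2 pT bins, one custom cut
lpt_finbinmin = [1.0, 3.0]
lpt_finbinmax = [3.0, 6.0]
analysis_cuts = ["abs(eta_cand) < 0.8", None]
do_custom_analysis_cuts = True

df_all = pd.DataFrame({
    "pt_cand": [1.5, 2.5, 3.5, 5.0],
    "eta_cand": [0.2, -1.0, 0.5, 0.9],
})

counts = []
for ipt in range(len(lpt_finbinmin)):
    # Approximation of seldf_singlevar: keep candidates inside the pT bin
    df = df_all.query(
        f"pt_cand >= {lpt_finbinmin[ipt]} and pt_cand < {lpt_finbinmax[ipt]}")
    # Custom cut, skipped for Null bins (mirrors apply_cuts_ptbin)
    if do_custom_analysis_cuts and analysis_cuts[ipt]:
        df = df.query(analysis_cuts[ipt])
    counts.append(len(df))
print(counts)  # → [1, 2]
```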
