Skip to content

datasets excluded from NNPDF4.0: CMS_2JET_8TEV_3D#2418

Open
andrpie wants to merge 11 commits intomasterfrom
implement_CMS_2JET_8TEV_3D
Open

datasets excluded from NNPDF4.0: CMS_2JET_8TEV_3D#2418
andrpie wants to merge 11 commits intomasterfrom
implement_CMS_2JET_8TEV_3D

Conversation

@andrpie
Copy link
Contributor

@andrpie andrpie commented Jan 13, 2026

Addresses #2408 #2390

Copy link
Contributor

@achiefa achiefa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below a few comments for this PR.

- observable_name: 3D
process_type: DIJET
tables: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ndata: [31, 26, 14, 23, 17, 11] # listed per (y*, yb) bin, sum 122
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validphys cannot parse a list for the ndata entry. Please, change it to the total number of points

Suggested change
ndata: [31, 26, 14, 23, 17, 11] # listed per (y*, yb) bin, sum 122
ndata: 122


implemented_observables:
- observable_name: 3D
process_type: DIJET
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently @enocera has already implemented the process for the 3D dijet distribution in the branch CMS_2JET_13TEV, that has not been merged yet. Try to use the same.

"DIJET_3D": DIJET_3D,

kinematics:
file: kinematics.yaml
variables: # need to fix inline math here because load_yaml breaks down
pTavg: {description: "average transverse momentum of the two leading jets", label: '\(p_{\text{T,avg}}\)', units: "GeV"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use available kinematic variables. If the kinematics that you need is missing, you need to implement it. See here the available kinematics. Again, maybe @enocera has implemented new variables in his branch.

class _Vars:
x = "x"
Q2 = "Q2"
Q = "Q"
y = "y"
abs_y = "abs_y"
pT = "pT"
ET = "ET"
sqrts = "sqrts"
ystar = "ystar"
ydiff = "ydiff"
ymax = "ymax"
yb = "yb"
m_jj = "m_jj"
pT2 = "pT2"
y_t = "y_t"
y_ttBar = "y_ttBar"
m_t2 = "m_t2"
m_t = "m_t"
pT_t = "pT_t"
m_ttBar = "m_ttBar"
eta = "eta"
abs_eta = "abs_eta"
m_W2 = "m_W2"
m_Z2 = "m_Z2"
m_V2 = "m_V2"
m_W = "m_W"
m_Z = "m_Z"
M2 = "M2"
abs_eta_1 = "abs_eta_1"
abs_eta_2 = "abs_eta_2"
eta_1 = "eta_1"
eta_2 = "eta_2"
m_ll = "m_ll"
m_ll2 = "m_ll2"
abs_y = "abs_y"

variables: # need to fix inline math here because load_yaml breaks down
pTavg: {description: "average transverse momentum of the two leading jets", label: '\(p_{\text{T,avg}}\)', units: "GeV"}
ystar: {description: "half rapidity separation of the two leading jets", label: "$y^*$", units: ""}
yboost: {description: "the boost of the two leading jets", label: "$y_\text{b}$", units: ""}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: yboost does not exist in process_options.py.

pTavg: {description: "average transverse momentum of the two leading jets", label: '\(p_{\text{T,avg}}\)', units: "GeV"}
ystar: {description: "half rapidity separation of the two leading jets", label: "$y^*$", units: ""}
yboost: {description: "the boost of the two leading jets", label: "$y_\text{b}$", units: ""}
sqrts: {description: "centre-of-mass energy", label: '\sqrt{s}', units: "TeV"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove sqrts from the kinematics. It's a reduntant information that is already incorporated in the name of the dataset. No need to keep it.

Suggested change
sqrts: {description: "centre-of-mass energy", label: '\sqrt{s}', units: "TeV"}

ystar: {description: "half rapidity separation of the two leading jets", label: "$y^*$", units: ""}
yboost: {description: "the boost of the two leading jets", label: "$y_\text{b}$", units: ""}
sqrts: {description: "centre-of-mass energy", label: '\sqrt{s}', units: "TeV"}
kinematic_coverage: [pTavg, ystar, yboost, sqrts]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kinematic_coverage can only handle three variables. Remove sqrt

Suggested change
kinematic_coverage: [pTavg, ystar, yboost, sqrts]
kinematic_coverage: [pTavg, ystar, yboost]

Comment on lines +1 to +17
bins:
- pTavg:
min: 133.0
mid: 143.0
max: 153.0
yboost:
min: 0.0
mid: 0.5
max: 1.0
ystar:
min: 0.0
mid: 0.5
max: 1.0
sqrts:
min: null
mid: 8
max: null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not prescription for the ordering of the kinematic variables, as long as the ordering matches with kinematic_coverage in metadata.yaml. However, pT is usually in the last position, see here

bins:
- y:
min: 0.0
mid: 0.25
max: 0.5
pT:
min: 21.0
mid: 22.5
max: 24.0
sqrts:
min: null
mid: 8000.0
max: null

If you change this, make sure you also change the ordering in kinematic_coverage.

Suggested change
bins:
- pTavg:
min: 133.0
mid: 143.0
max: 153.0
yboost:
min: 0.0
mid: 0.5
max: 1.0
ystar:
min: 0.0
mid: 0.5
max: 1.0
sqrts:
min: null
mid: 8
max: null
bins:
yboost:
min: 0.0
mid: 0.5
max: 1.0
ystar:
min: 0.0
mid: 0.5
max: 1.0
pTavg:
min: 133.0
mid: 143.0
max: 153.0

@andrpie andrpie force-pushed the implement_CMS_2JET_8TEV_3D branch from 78069ab to e330dd7 Compare February 3, 2026 12:08
@andrpie andrpie force-pushed the implement_CMS_2JET_8TEV_3D branch from af142f5 to c799acb Compare February 3, 2026 18:34
- CMS_2JET_8TEV_3D_3
- CMS_2JET_8TEV_3D_4
- CMS_2JET_8TEV_3D_5
operation: 'null' # from TeV to GeV
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the comment because it's not pertinent to the operation key

@achiefa
Copy link
Contributor

achiefa commented Feb 25, 2026

Hi @andrpie , what's the status of this? Cam you post data-theory comparisons here so that wa keep track for future records? Also, mind that the branch has to be rebased. Note that I'm about to merge onto master the PR Emanuele has worked on. This implements 3D kinematics for dijets. So it's likely that there will be conflicts.

@andrpie
Copy link
Contributor Author

andrpie commented Feb 25, 2026

Hi @achiefa,

what's the status of this? Cam you post data-theory comparisons here so that wa keep track for future records?

I believe my treatment of the uncertainties (especially) systematics needs to be reassessed. Also, the covmat is highly singular here. I will post data-theory in the next comment.

Also, mind that the branch has to be rebased. Note that I'm about to merge onto master the PR Emanuele has worked on. This implements 3D kinematics for dijets. So it's likely that there will be conflicts.

Indeed, I will resolve the conflicts.

@andrpie
Copy link
Contributor Author

andrpie commented Feb 25, 2026

Here is the data-theory comparison as it stands now. Quite similar to what we've seen in Amsterdam. The (experimental) chi2 is 3.52.

Screenshot 2026-02-25 at 4 31 45 pm

@achiefa
Copy link
Contributor

achiefa commented Feb 25, 2026

Thanks for this. However, you need to rebase on top of master to get the correct data-theory plots. The reason is that last week Emanuele and I realised that there was a bug in the implementation of the systematic shift, which made these plots wrong. Nonetheless, the chi2 was not affected by this bug. Given the value that you obtained, I'd say that the predictions are fairly good. Of course we need to see the correct comparisons that depict a clear picture.

@achiefa
Copy link
Contributor

achiefa commented Feb 25, 2026

Also, I see that the uncertainties in the data are basically zero. I suspect that for this dataset experimentalists only provided the full covariance matrix with all the uncertainties correlated. If this is the case, the whole uncertainty structure of the data flows into the shift, leaving the data without uncertainty. As a sanity check, could you please plot the same comparison wihtout the shift? You can just set the flag with_shift=false in the plot_fancy function.

@achiefa
Copy link
Contributor

achiefa commented Feb 25, 2026

As a reference, a similar problem has occurred here: https://github.com/NNPDF/theories_slim/pull/67

version: 1

nnpdf_metadata:
nnpdf31_process: "DIJET_3D"
Copy link
Member

@scarlehoff scarlehoff Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

theory uncertainties

Suggested change
nnpdf31_process: "DIJET_3D"
nnpdf31_process: "DIJET"


implemented_observables:
- observable_name: 3D
process_type: DIJET
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
process_type: DIJET
process_type: DIJET_3D

@scarlehoff scarlehoff added the Done PRs that are done but waiting on something else to merge/approve label Mar 10, 2026
@scarlehoff
Copy link
Member

Please @andrpie update the labels and upload the right plots (with the fixes to the shift)

@scarlehoff
Copy link
Member

scarlehoff commented Mar 12, 2026

@andrpie to get the plots with the right shifts you need to rebase on top of master (or merge from master, although the rebase is much preferred)

The rebase should be easy as the conflict should just the line in process_option that has been updated also from a different pull request.

@andrpie andrpie force-pushed the implement_CMS_2JET_8TEV_3D branch from fe0d052 to 308718a Compare March 12, 2026 15:16
@andrpie
Copy link
Contributor Author

andrpie commented Mar 12, 2026

Dear @scarlehoff , @achiefa. Below is the data-theory comparison after the rebase.

With shifts: Screenshot 2026-03-12 at 3 48 07 pm
Without shifts: Screenshot 2026-03-12 at 3 52 56 pm

What worries me the most is the treatment of statistical uncertainties. If I understand correctly, validphys recognises uncertainty as statistical if its name starts with "stat". For each datapoint, there are 122 artificial statistical uncertainties that come from correlation matrices. As @achiefa pointed out, the covmat is already quite singular (and this is when those statistical uncertainties are named "art_unc_x"). When I change the name of statistical uncertainties to "stat_art_unc_x", the covmat is not positive definite anymore; below is the error I get when running a data-theory validphys report.

numpy.linalg.LinAlgError: 32-th leading minor of the array is not positive definite

Traceback (most recent call last): File "/opt/miniconda3/envs/nnpdf_dev/bin/validphys", line 6, in <module> sys.exit(main()) ^^^^^^ File "/Users/s2850353/Documents/nnpdf/validphys2/src/validphys/scripts/main.py", line 10, in main vp.main() File "/Users/s2850353/Documents/nnpdf/validphys2/src/validphys/app.py", line 155, in main a.main() File "/opt/miniconda3/envs/nnpdf_dev/lib/python3.12/site-packages/reportengine/app.py", line 421, in main self.run() File "/Users/s2850353/Documents/nnpdf/validphys2/src/validphys/app.py", line 150, in run super().run() File "/opt/miniconda3/envs/nnpdf_dev/lib/python3.12/site-packages/reportengine/app.py", line 406, in run rb.execute_sequential() File "/opt/miniconda3/envs/nnpdf_dev/lib/python3.12/site-packages/reportengine/resourcebuilder.py", line 210, in execute_sequential result = self.get_result(callspec.function, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/miniconda3/envs/nnpdf_dev/lib/python3.12/site-packages/reportengine/resourcebuilder.py", line 380, in get_result fres = function(**kwdict) ^^^^^^^^^^^^^^^^^^ File "/Users/s2850353/Documents/nnpdf/validphys2/src/validphys/covmats.py", line 649, in sqrt_covmat decomp = la.cholesky(correlation_matrix) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/miniconda3/envs/nnpdf_dev/lib/python3.12/site-packages/scipy/_lib/_util.py", line 1181, in wrapper return f(*arrays, *other_args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/miniconda3/envs/nnpdf_dev/lib/python3.12/site-packages/scipy/linalg/_decomp_cholesky.py", line 106, in cholesky c, lower = _cholesky(a, lower=lower, overwrite_a=overwrite_a, clean=True, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/miniconda3/envs/nnpdf_dev/lib/python3.12/site-packages/scipy/linalg/_decomp_cholesky.py", line 39, in _cholesky raise LinAlgError( numpy.linalg.LinAlgError: 32-th leading minor of the array is not positive definite

A hint is that the first (y*, yb) bin has 31 points, so the failure happens immediately after the first bin. I am not sure how to fix this.

@achiefa
Copy link
Contributor

achiefa commented Mar 12, 2026

@enocera

@enocera
Copy link
Contributor

enocera commented Mar 12, 2026

Dear @andrpie I'm afraid I don't understand the problem. The shifted data-theory comparison plots look very sensible to me.
As far as I know, validphys doesn't care whether an uncertainty is statistical or systematic. it only cares about th efact it is CORR/UNCORR or ADD/MULT. Insofar as the name is kept consistent nothing should change. In other words, if you rename each "artificial uncertainty" from art_unc_x to stat_art_unc nothing should change. of course oyu have to change all the names consistently. perhaps I misunderstood your problem?

@andrpie
Copy link
Contributor Author

andrpie commented Mar 12, 2026

Hi @enocera, thank you for the clarification. You didn't misunderstand my problem. If the treatment of the uncertainty does not depend on its name, then I am also happy with the data-theory plots and consider the implementation of this dataset done.

Yet it still is the case that if I name the uncertainties 'stat_art_unc', then the Cholesky decomposition breaks down, as opposed to when they are just called 'art_unc' (of course, in a consistent way). I have no idea what kind of evil force can be causing this issue...

@scarlehoff
Copy link
Member

Please have a look whether the problem persists if you change

[i for i in uncertainties_df.columns.get_level_values(0) if not i.startswith("stat")]
and the lines below

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data toolchain Done PRs that are done but waiting on something else to merge/approve regenerate-data

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants