@mwestphall
Contributor

The probe picked up ~30 records with no SiteName from a (possibly misconfigured) machine at ULAR between June 11th and 14th, which broke later processing. Removing these records allowed processing to resume. This PR adds a check that at least one of site or resource is present in ingested records.
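
As a rough illustration of the check described above, here is a minimal Python sketch. It is not the probe's code: records are modeled as plain dicts, a missing key or `None` stands in for a ClassAd `UNDEFINED` attribute, and the helper names `has_site_or_resource` and `split_records` are hypothetical. Only the attribute names come from this PR's filter.

```python
# Sketch only: a missing key or None stands in for ClassAd UNDEFINED.
# Helper names are hypothetical; attribute names are taken from the PR's filter.
from typing import Iterable, List, Tuple

SITE_ATTRS = ("GLIDEIN_ResourceName", "GLIDEIN_Site")


def has_site_or_resource(record: dict) -> bool:
    """Return True if at least one of the site/resource attributes is defined."""
    return any(record.get(attr) is not None for attr in SITE_ATTRS)


def split_records(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Partition records into (kept, dropped) so dropped ones can be counted or logged."""
    kept: List[dict] = []
    dropped: List[dict] = []
    for record in records:
        (kept if has_site_or_resource(record) else dropped).append(record)
    return kept, dropped


if __name__ == "__main__":
    sample = [
        {"GLIDEIN_Site": "SiteA", "GLIDEIN_ResourceName": "ResA"},
        {"GLIDEIN_Site": "SiteB"},
        {"SlotType": "Partitionable"},  # neither attribute defined
    ]
    kept, dropped = split_records(sample)
    print(f"kept={len(kept)} dropped={len(dropped)}")
```

Returning the dropped records instead of silently discarding them is a deliberate choice in this sketch: it leaves room to log or count them. As with `=!= UNDEFINED`, an attribute that is defined but empty still counts as present here.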

@osg-cat

osg-cat commented Jul 15, 2025

With this change, how will anyone know if records are being dropped if there’s no site or resource?

@mwestphall
Contributor Author

mwestphall commented Jul 15, 2025

@osg-cat Another approach we could take here is to replace an empty site/resource with "Unknown" or some similar indicator. Would that be preferable? Records would still not be accounted to the correct site, but they would at least make it out to GRACC (what GRACC does with them after that point is another question).
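
To make the alternative above concrete, here is a minimal Python sketch of the "Unknown" substitution. It is an illustration under assumptions, not proposed probe code: records are plain dicts, the placeholder string and the `fill_unknown` helper are hypothetical, and a counter is included only to show one way the substitutions could be made visible.

```python
# Sketch only: placeholder value and helper name are assumptions for illustration.
from collections import Counter

UNKNOWN = "Unknown"
SITE_ATTRS = ("GLIDEIN_ResourceName", "GLIDEIN_Site")


def fill_unknown(record: dict, stats: Counter) -> dict:
    """Return a copy of record; if both attributes are missing or empty,
    set them to the placeholder and count the substitution."""
    filled = dict(record)
    if all(not filled.get(attr) for attr in SITE_ATTRS):
        for attr in SITE_ATTRS:
            filled[attr] = UNKNOWN
        stats["unknown_site"] += 1
    return filled


if __name__ == "__main__":
    stats = Counter()
    print(fill_unknown({"SlotType": "Partitionable"}, stats))
    print(dict(stats))
```

Records that already carry either attribute pass through unchanged, so the counter reflects only records that would otherwise have been dropped.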

@osg-cat

osg-cat commented Jul 15, 2025

I don’t know. It’s probably a Derek question. I guess my main point is that this could use some design thinking.


- filter_cond = 'SlotType != "Static"'
+ # Need at least one defined from Site and ResourceName for proper accounting
+ filter_cond = 'SlotType != "Static" && (GLIDEIN_ResourceName =!= UNDEFINED || GLIDEIN_Site =!= UNDEFINED) '
Contributor

Not related to your code change, but do you know why we're excluding static slots?

@brianhlin
Member

> With this change, how will anyone know if records are being dropped if there’s no site or resource?

I vote that we configure the OSPool CMs to reject any EPs that are missing GLIDEIN_Site and GLIDEIN_ResourceName. If we're getting capacity like this, the EPs are clearly broken: they'll pollute our accounting and downstream reporting, and they'll require hacky GRACC / reporting fixes every time we get a batch of bad records.

@rynge @djw8605 thoughts?

@djw8605
Member

djw8605 commented Jul 16, 2025

Rejecting the EPs works for me. Though, I guess it's the same question from Tim... will we know they are rejected somewhere?

@brianhlin
Member

> Rejecting the EPs works for me. Though, I guess it's the same question from Tim... will we know they are rejected somewhere?

I think we can set up the container images so that they bail and exit non-zero if they fail to advertise to a CM. For factory-submitted glideins, I imagine they will show up in the monitoring somehow so that the operators can fix their reconfig / go down the troubleshooting path.

To me, that's all strictly better than finding out when someone happens to look at a report where the damage is already done.
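
For what the "bail and exit non-zero" idea above might look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: `wait_for_advertisement` is a hypothetical helper, and `is_advertised` is a caller-supplied callable standing in for whatever check a container entrypoint would actually use to confirm that the EP shows up in the CM. It only demonstrates the poll-then-fail pattern.

```python
# Sketch only: how the EP's advertisement is actually detected is left to the
# caller-supplied is_advertised callable; nothing here talks to HTCondor.
import time
from typing import Callable


def wait_for_advertisement(
    is_advertised: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until is_advertised() returns True.

    Returns 0 on success, or 1 if the timeout elapses first, so the result
    can be passed directly to sys.exit().
    """
    deadline = clock() + timeout_s
    while True:
        if is_advertised():
            return 0
        if clock() >= deadline:
            return 1
        sleep(poll_s)


if __name__ == "__main__":
    import sys

    # Placeholder check that always fails, to show the non-zero exit path.
    sys.exit(wait_for_advertisement(lambda: False, timeout_s=0.2, poll_s=0.1))
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting; a non-zero exit is what would let a container runtime or an operator notice the failure.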

@osg-cat

osg-cat commented Jul 17, 2025

GitHub comments are not the place to make a real design decision for the OSPool. Let’s pause any policy changes here and get a real (and ideally brief!) design doc going.

@brianhlin brianhlin merged commit 81aa005 into opensciencegrid:2.x Oct 14, 2025
1 check passed
5 participants