Skip to content

HPCC-34746 Allow cost definitions for agent, manager and workers in Thor's yaml#20683

Open
shamser wants to merge 6 commits intohpcc-systems:candidate-10.0.xfrom
shamser:issue34746
Open

HPCC-34746 Allow cost definitions for agent, manager and workers in Thor's yaml#20683
shamser wants to merge 6 commits intohpcc-systems:candidate-10.0.xfrom
shamser:issue34746

Conversation

@shamser
Copy link
Contributor

@shamser shamser commented Nov 26, 2025

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Copilot AI review requested due to automatic review settings November 26, 2025 12:19
@github-actions
Copy link

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-34746

Jirabot Action Result:
Workflow Transition To: Merge Pending
Additional PR: #20683

@shamser
Copy link
Contributor Author

shamser commented Nov 26, 2025

Documentation jira: https://hpccsystems.atlassian.net/browse/HPCC-35425

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements granular cost configuration for Thor subcomponents (manager, worker, and eclagent), allowing each component to have its own cost definition with automatic inheritance from global cost settings when not explicitly specified.

Key Changes:

  • Added per-subcomponent cost configuration support in Thor's helm templates with inheritance fallback to global.cost.perCpu
  • Updated C++ cost calculation functions to query subcomponent-specific cost values
  • Enhanced getResourcedCpus() to handle both containerized and bare-metal deployments consistently

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
helm/hpcc/values.yaml Added example configuration comments for manager, worker, and eclagent cost definitions
helm/hpcc/values.schema.json Defined thorSubComponent schema with cost properties and marked top-level thor.cost as deprecated
helm/hpcc/templates/thor.yaml Implemented setThorComponentCost helper and cost inheritance logic for Thor subcomponents
helm/hpcc/templates/_helpers.tpl Updated generateGlobalConfigMap to accept optional excludePerCpuCost parameter for selective cost exclusion
helm/hpcc/templates/_warnings.tpl Added validation warning for inconsistent Thor subcomponent cost configurations
helm/hpcc/templates/{roxie,esp,dali,etc}.yaml Updated all template calls to generateGlobalConfigMap with new dict-based signature
common/workunit/workunit.cpp Enhanced cost calculation functions to support subcomponent-specific cost lookups
system/jlib/jutil.cpp Updated getResourcedCpus() to handle bare-metal deployments using affinity CPUs
testing/helm/tests/thor-*.yaml Added test cases for explicit costs, inheritance, and conflict scenarios

You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.


You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.


You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser some questions.
Also, the jira does not have a conclusion - a summary of the changes and how to configure them - or a link to the documentation jira.

VStringBuffer subComponentSection("%s/cost/@perCpu", subComponentName);
double costCpuHour = getComponentConfigSP()->getPropReal(subComponentSection.str(), -1.0);
if (costCpuHour < 0.0)
return 0.0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a similar question to copilot. This now means that the manager will default to 0 cost, when previously it would have been the default cost - that looks like a backward compatibility problem. Should it only use that cost if that property exists, otherwise use the default?

Is the section now always filled in?

What is the situation for bare-metal?

Copy link
Contributor Author

@shamser shamser Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backward compatibility is handled by the Thor templates as suggested here: #20263 (comment) .

Again, the thor templates will ensure that the cost section is always created.

In bare-metal, the global perCPU rate will continue to be used which I believe is reasonable and make sense for bare-metal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser can you clarify this comment. "In bare-metal...", the getThorManagerRate() and getThorWorkerRate() don't seem to have any containerised specific logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser can you clarify this comment. "In bare-metal...", the getThorManagerRate() and getThorWorkerRate() don't seem to have any containerised specific logic.

In a bare‑metal environment, there is no resource section that specifies the number of CPUs. Additionally, different components can be deployed to different node pools, meaning they may run on different VM types with varying CPU costs. On bare metal, CPU costs are less likely to vary, so using a single per‑CPU cost is generally sufficient. @ghalliday

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am reading the code correctly, if getThorManagerRate() is called in bare-metal it will always return 0, because getCostCpuHour() will return 0.
I don't see how this returns non-zero thor costs for bare-metakl (but I haven't tested to verify).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the issue. I'll re-introduce the drop-through to use the legacy cost values. (It was removed following a code review comment suggesting it wasn't required for containerized. I'll add back legacy support as it's the cleanest way to address this.)

}

static double getCostCpuHour()
static double getCostCpuHour(const char * subComponentName = nullptr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could remove the default parameter, so you explicitly pass null for the default.

}
else
{
return (double) getAffinityCpus();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: better to allow resource configuration in containerized, and only return this if no resourceTree. i.e. aim to keep containerized and bare-metal as similar as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a containerized environment, the Thor manager calculates the job cost. If we use getAffinityCpus(), the Thor worker, agent, and manager costs will all be derived from the number of CPUs assigned to the manager container. This results in Thor job costs that are almost always incorrect, and it can be very difficult to diagnose why the values are so inaccurate.
Given that, it seems more reasonable for Thor costs to default to zero in a misconfigured containerized setup, rather than producing cost values that are consistently and obviously wrong.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - will pass back to you for now until GH's comments are looked at.

@shamser
Copy link
Contributor Author

shamser commented Feb 12, 2026

ummary of the changes and how to configure them - or a link to the documentation jira.

I have add a description of the changes. The documentation jira is in the parent jira: https://hpccsystems.atlassian.net/browse/HPCC-35425.

@shamser shamser requested a review from ghalliday February 12, 2026 15:13
Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser one question/clarification

VStringBuffer subComponentSection("%s/cost/@perCpu", subComponentName);
double costCpuHour = getComponentConfigSP()->getPropReal(subComponentSection.str(), -1.0);
if (costCpuHour < 0.0)
return 0.0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser can you clarify this comment. "In bare-metal...", the getThorManagerRate() and getThorWorkerRate() don't seem to have any containerised specific logic.

@shamser
Copy link
Contributor Author

shamser commented Feb 16, 2026

@ghalliday In a bare‑metal environment, there is no resource section that specifies the number of CPUs. Additionally, different components can be deployed to different node pools, meaning they may run on different VM types with varying CPU costs. On bare metal, CPU costs are less likely to vary, so using a single per‑CPU cost is generally sufficient.

@shamser shamser requested a review from ghalliday February 16, 2026 16:30
…hor's yaml

- Added cost definitions for agent, manager and workers in Thor's yaml
- The new component level cost definition overrides existing thor cost definition
- Warn if the thor level cost definition is not consistent with the component level cost definition
- Warn if thor level cost definition is being used
- Updated the values.yaml with the new cost definitions (commented out)

REDBOOK NOTE:
- Breaking change: thor.cost has been depreciated and is no longer supported. Users must migrate to subcomponent-level cost definitions.

Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
@shamser shamser changed the base branch from candidate-9.14.x to candidate-10.0.x February 20, 2026 10:08
@shamser shamser requested a review from Copilot February 20, 2026 10:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Comment on lines +46 to +49
{{- $eclAgentCostConfig := dict -}}
{{- $_ := include "setThorComponentCost" (dict "root" .root "me" .me "subComponent" "eclagent" "cost" $eclAgentCostConfig) -}}
{{- $eclAgentSpecifications := pick .me "eclagent" | default (dict "eclagent" (dict "cost" $eclAgentCostConfig.cost)) -}}
{{- $eclAgentScope := dict "name" .eclAgentName "type" $eclAgentType "useChildProcesses" .eclAgentUseChildProcesses "replicas" .eclAgentReplicas "maxActive" .me.maxJobs | merge (pick .me "auxQueues" "expert" "keepJobs" "logging" "tracing" "schedulingTimeoutSecs") ((hasKey .me "holdAgent") | ternary (dict "hold" .me.holdAgent) dict) $eclAgentSpecifications.eclagent }}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inheritance helper populates $eclAgentCostConfig.cost, but if .me.eclagent exists (even without a cost key), pick .me \"eclagent\" is non-empty and the default(...) won’t run—so the computed/inherited cost never gets merged in. Similar logic applies to manager/worker. Consider always merging the computed cost into the picked subcomponent map (e.g., deepCopy the picked map, set/merge cost when missing, then wrap back into {eclagent: ...}) so inheritance works whether the subcomponent section exists or not.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +14221 to +14230
static double getCostCpuHour(const char * subComponentName)
{
if (!isEmptyString(subComponentName))
{
VStringBuffer subComponentSection("%s/cost/@perCpu", subComponentName);
double costCpuHour = getComponentConfigSP()->getPropReal(subComponentSection.str(), -1.0);
if (costCpuHour < 0.0)
return 0.0;
return costCpuHour;
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When subComponentName is provided but no subcomponent perCpu is configured, this returns 0.0 rather than falling back to the legacy/default sources (top-level cost/@perCpu or global cost). That’s a behavior change that will silently zero out Thor rates when users rely on deprecated thor.cost or only configure global cost. A safer approach is: if the subcomponent value is missing/negative, fall back to the same logic used for the nullptr case.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,27 @@
# Test inheritance from global.cost to subcomponents without explicit override
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'inconsistant' to 'inconsistent' (also consider renaming the file to match).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +1005 to +1014
"thorSubComponent": {
"description": "Thor sub-component with resource and cost configuration",
"type": "object",
"properties": {
"cost": {
"$ref": "#/definitions/componentCost"
}
},
"additionalProperties": false
},
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema description says the sub-component supports 'resource and cost configuration', but the schema only allows cost (and disallows all other keys via additionalProperties: false). Either update the description to match (cost-only), or add the intended resource-related properties if they’re meant to be supported.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema will be modified in subsequent PR to include resources. When that is done, the new "resource" property will be included in thorSubComponent's definition

Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
@shamser shamser requested a review from jakesmith February 20, 2026 10:25
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… not available

Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - please see comments/questions

cost:
{{ toYaml .Values.global.cost | indent 2 }}
{{- if $excludePerCpuCost }}
{{ toYaml (omit $root.Values.global.cost "perCpu") | indent 2 }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: we try to indent the template scopes for readibiltiy and chomp leading whitespace.

global:
{{ include "hpcc.generateGlobalConfigMap" .root| indent 6 }}
{{/* Thor uses sub-component costs (manager, worker, eclagent), so exclude global perCpu */}}
{{ include "hpcc.generateGlobalConfigMap" (dict "root" .root "excludePerCpuCost" true) | indent 6 }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be excluded everywhere that defines it's own cost.perCpu too?
and/or what's the difference (I may be missing something), if Thor per sub component costs are missing, then it would (without excludePerCpuCost=true) drop through to global?
but how is that different really from e.g. eclagent having no cost.perCpu and using global?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR relates to Thor. I don't want to expand the scope of this PR to other components cost configuration.

@ghalliday ghalliday closed this Feb 26, 2026
@ghalliday ghalliday reopened this Feb 26, 2026
@github-actions
Copy link

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-34746

Jirabot Action Result:
Workflow Transition To: Merge Pending

Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants