HPCC-34746 Allow cost definitions for agent, manager and workers in Thor's yaml#20683
HPCC-34746 Allow cost definitions for agent, manager and workers in Thor's yaml#20683shamser wants to merge 6 commits intohpcc-systems:candidate-10.0.xfrom
Conversation
|
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-34746 Jirabot Action Result: |
|
Documentation jira: https://hpccsystems.atlassian.net/browse/HPCC-35425 |
There was a problem hiding this comment.
Pull request overview
This pull request implements granular cost configuration for Thor subcomponents (manager, worker, and eclagent), allowing each component to have its own cost definition with automatic inheritance from global cost settings when not explicitly specified.
Key Changes:
- Added per-subcomponent cost configuration support in Thor's helm templates with inheritance fallback to
global.cost.perCpu - Updated C++ cost calculation functions to query subcomponent-specific cost values
- Enhanced
getResourcedCpus()to handle both containerized and bare-metal deployments consistently
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
helm/hpcc/values.yaml |
Added example configuration comments for manager, worker, and eclagent cost definitions |
helm/hpcc/values.schema.json |
Defined thorSubComponent schema with cost properties and marked top-level thor.cost as deprecated |
helm/hpcc/templates/thor.yaml |
Implemented setThorComponentCost helper and cost inheritance logic for Thor subcomponents |
helm/hpcc/templates/_helpers.tpl |
Updated generateGlobalConfigMap to accept optional excludePerCpuCost parameter for selective cost exclusion |
helm/hpcc/templates/_warnings.tpl |
Added validation warning for inconsistent Thor subcomponent cost configurations |
helm/hpcc/templates/{roxie,esp,dali,etc}.yaml |
Updated all template calls to generateGlobalConfigMap with new dict-based signature |
common/workunit/workunit.cpp |
Enhanced cost calculation functions to support subcomponent-specific cost lookups |
system/jlib/jutil.cpp |
Updated getResourcedCpus() to handle bare-metal deployments using affinity CPUs |
testing/helm/tests/thor-*.yaml |
Added test cases for explicit costs, inheritance, and conflict scenarios |
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
c742cdc to
67540ec
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
common/workunit/workunit.cpp
Outdated
| VStringBuffer subComponentSection("%s/cost/@perCpu", subComponentName); | ||
| double costCpuHour = getComponentConfigSP()->getPropReal(subComponentSection.str(), -1.0); | ||
| if (costCpuHour < 0.0) | ||
| return 0.0; |
There was a problem hiding this comment.
I have a similar question to copilot. This now means that the manager will default to 0 cost, when previously it would have been the default cost - that looks like a backward compatibility problem. Should it only use that cost if that property exists, otherwise use the default?
Is the section now always filled in?
What is the situation for bare-metal?
There was a problem hiding this comment.
The backward compatibility is handled by the Thor templates as suggested here: #20263 (comment) .
Again, the thor templates will ensure that the cost section is always created.
In bare-metal, the global perCPU rate will continue to be used which I believe is reasonable and make sense for bare-metal.
There was a problem hiding this comment.
@shamser can you clarify this comment. "In bare-metal...", the getThorManagerRate() and getThorWorkerRate() don't seem to have any containerised specific logic.
There was a problem hiding this comment.
@shamser can you clarify this comment. "In bare-metal...", the getThorManagerRate() and getThorWorkerRate() don't seem to have any containerised specific logic.
In a bare‑metal environment, there is no resource section that specifies the number of CPUs. Additionally, different components can be deployed to different node pools, meaning they may run on different VM types with varying CPU costs. On bare metal, CPU costs are less likely to vary, so using a single per‑CPU cost is generally sufficient. @ghalliday
There was a problem hiding this comment.
If I am reading the code correctly, if getThorManagerRate() is called in bare-metal it will always return 0, because getCostCpuHour() will return 0.
I don't see how this returns non-zero thor costs for bare-metakl (but I haven't tested to verify).
There was a problem hiding this comment.
I see the issue. I'll re-introduce the drop-through to use the legacy cost values. (It was removed following a code review comment suggesting it wasn't required for containerized. I'll add back legacy support as it's the cleanest way to address this.)
common/workunit/workunit.cpp
Outdated
| } | ||
|
|
||
| static double getCostCpuHour() | ||
| static double getCostCpuHour(const char * subComponentName = nullptr) |
There was a problem hiding this comment.
You could remove the default parameter, so you explicitly pass null for the default.
system/jlib/jutil.cpp
Outdated
| } | ||
| else | ||
| { | ||
| return (double) getAffinityCpus(); |
There was a problem hiding this comment.
minor: better to allow resource configuration in containerized, and only return this if no resourceTree. i.e. aim to keep containerized and bare-metal as similar as possible.
There was a problem hiding this comment.
In a containerized environment, the Thor manager calculates the job cost. If we use getAffinityCpus(), the Thor worker, agent, and manager costs will all be derived from the number of CPUs assigned to the manager container. This results in Thor job costs that are almost always incorrect, and it can be very difficult to diagnose why the values are so inaccurate.
Given that, it seems more reasonable for Thor costs to default to zero in a misconfigured containerized setup, rather than producing cost values that are consistently and obviously wrong.
I have add a description of the changes. The documentation jira is in the parent jira: https://hpccsystems.atlassian.net/browse/HPCC-35425. |
common/workunit/workunit.cpp
Outdated
| VStringBuffer subComponentSection("%s/cost/@perCpu", subComponentName); | ||
| double costCpuHour = getComponentConfigSP()->getPropReal(subComponentSection.str(), -1.0); | ||
| if (costCpuHour < 0.0) | ||
| return 0.0; |
There was a problem hiding this comment.
@shamser can you clarify this comment. "In bare-metal...", the getThorManagerRate() and getThorWorkerRate() don't seem to have any containerised specific logic.
|
@ghalliday In a bare‑metal environment, there is no resource section that specifies the number of CPUs. Additionally, different components can be deployed to different node pools, meaning they may run on different VM types with varying CPU costs. On bare metal, CPU costs are less likely to vary, so using a single per‑CPU cost is generally sufficient. |
…hor's yaml - Added cost definitions for agent, manager and workers in Thor's yaml - The new component level cost definition overrides existing thor cost definition - Warn if the thor level cost definition is not consistent with the component level cost definition - Warn if thor level cost definition is being used - Updated the values.yaml with the new cost definitions (commented out) REDBOOK NOTE: - Breaking change: thor.cost has been depreciated and is no longer supported. Users must migrate to subcomponent-level cost definitions. Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
| {{- $eclAgentCostConfig := dict -}} | ||
| {{- $_ := include "setThorComponentCost" (dict "root" .root "me" .me "subComponent" "eclagent" "cost" $eclAgentCostConfig) -}} | ||
| {{- $eclAgentSpecifications := pick .me "eclagent" | default (dict "eclagent" (dict "cost" $eclAgentCostConfig.cost)) -}} | ||
| {{- $eclAgentScope := dict "name" .eclAgentName "type" $eclAgentType "useChildProcesses" .eclAgentUseChildProcesses "replicas" .eclAgentReplicas "maxActive" .me.maxJobs | merge (pick .me "auxQueues" "expert" "keepJobs" "logging" "tracing" "schedulingTimeoutSecs") ((hasKey .me "holdAgent") | ternary (dict "hold" .me.holdAgent) dict) $eclAgentSpecifications.eclagent }} |
There was a problem hiding this comment.
The inheritance helper populates $eclAgentCostConfig.cost, but if .me.eclagent exists (even without a cost key), pick .me \"eclagent\" is non-empty and the default(...) won’t run—so the computed/inherited cost never gets merged in. Similar logic applies to manager/worker. Consider always merging the computed cost into the picked subcomponent map (e.g., deepCopy the picked map, set/merge cost when missing, then wrap back into {eclagent: ...}) so inheritance works whether the subcomponent section exists or not.
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
| static double getCostCpuHour(const char * subComponentName) | ||
| { | ||
| if (!isEmptyString(subComponentName)) | ||
| { | ||
| VStringBuffer subComponentSection("%s/cost/@perCpu", subComponentName); | ||
| double costCpuHour = getComponentConfigSP()->getPropReal(subComponentSection.str(), -1.0); | ||
| if (costCpuHour < 0.0) | ||
| return 0.0; | ||
| return costCpuHour; | ||
| } |
There was a problem hiding this comment.
When subComponentName is provided but no subcomponent perCpu is configured, this returns 0.0 rather than falling back to the legacy/default sources (top-level cost/@perCpu or global cost). That’s a behavior change that will silently zero out Thor rates when users rely on deprecated thor.cost or only configure global cost. A safer approach is: if the subcomponent value is missing/negative, fall back to the same logic used for the nullptr case.
| @@ -0,0 +1,27 @@ | |||
| # Test inheritance from global.cost to subcomponents without explicit override | |||
There was a problem hiding this comment.
Corrected spelling of 'inconsistant' to 'inconsistent' (also consider renaming the file to match).
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
| "thorSubComponent": { | ||
| "description": "Thor sub-component with resource and cost configuration", | ||
| "type": "object", | ||
| "properties": { | ||
| "cost": { | ||
| "$ref": "#/definitions/componentCost" | ||
| } | ||
| }, | ||
| "additionalProperties": false | ||
| }, |
There was a problem hiding this comment.
The schema description says the sub-component supports 'resource and cost configuration', but the schema only allows cost (and disallows all other keys via additionalProperties: false). Either update the description to match (cost-only), or add the intended resource-related properties if they’re meant to be supported.
There was a problem hiding this comment.
The schema will be modified in subsequent PR to include resources. When that is done, the new "resource" property will be included in thorSubComponent's definition
Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… not available Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
helm/hpcc/templates/_helpers.tpl
Outdated
| cost: | ||
| {{ toYaml .Values.global.cost | indent 2 }} | ||
| {{- if $excludePerCpuCost }} | ||
| {{ toYaml (omit $root.Values.global.cost "perCpu") | indent 2 }} |
There was a problem hiding this comment.
minor: we try to indent the template scopes for readibiltiy and chomp leading whitespace.
| global: | ||
| {{ include "hpcc.generateGlobalConfigMap" .root| indent 6 }} | ||
| {{/* Thor uses sub-component costs (manager, worker, eclagent), so exclude global perCpu */}} | ||
| {{ include "hpcc.generateGlobalConfigMap" (dict "root" .root "excludePerCpuCost" true) | indent 6 }} |
There was a problem hiding this comment.
Shouldn't be excluded everywhere that defines it's own cost.perCpu too?
and/or what's the difference (I may be missing something), if Thor per sub component costs are missing, then it would (without excludePerCpuCost=true) drop through to global?
but how is that different really from e.g. eclagent having no cost.perCpu and using global?
There was a problem hiding this comment.
This PR relates to Thor. I don't want to expand the scope of this PR to other components cost configuration.
|
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-34746 Jirabot Action Result: |
Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexisrisk.com>
Type of change:
Checklist:
Smoketest:
Testing: