Sonic dynamic model loading #25

trevin-lee · 2025-10-13T19:07:18Z

PR description:

This PR introduces dynamic model loading/unloading capabilities and server health monitoring to the SONIC Triton integration in CMSSW. The main features include:

1. Dynamic Model Loading and Unloading:

Adds loadModel() and unloadModel() methods to TritonService for managing model lifecycle at runtime
Implements thread-safe model operations using mutex protection
Introduces DynamicModelLoadingProducer test module to validate dynamic model management
Models can be loaded on-demand and unloaded when no longer needed, improving resource utilization

4. Code Improvements:

Moves retry configuration options to customize.py for better configurability
Updates TritonClient with new constructor for testing and enhanced server connection methods
Improves logging for model operations and server health status
Refactors code for better maintainability and documentation

Expected Output Changes:

Users can now dynamically load and unload models during job execution
Improved resilience through automatic server health monitoring and failover
Better error handling and retry logic for transient server failures
Enhanced logging messages for model operations and server health

Dependencies:

Based on CMSSW_15_1_0_pre6
No external PR dependencies

Files Modified:

HeterogeneousCore/SonicCore/src/SonicClientBase.cc
HeterogeneousCore/SonicCore/plugins/BuildFile.xml
HeterogeneousCore/SonicTriton/interface/TritonService.h
HeterogeneousCore/SonicTriton/interface/TritonClient.h
HeterogeneousCore/SonicTriton/src/TritonService.cc
HeterogeneousCore/SonicTriton/src/TritonClient.cc
HeterogeneousCore/SonicTriton/src/RetryActionDiffServer.cc
HeterogeneousCore/SonicTriton/test/BuildFile.xml
HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py

New Files:

HeterogeneousCore/SonicTriton/test/DynamicModelLoadingProducer.cc
HeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.cc

Removed Files:

HeterogeneousCore/SonicTriton/test/RetryActionDiffServer.cc (replaced with unit test)

PR validation:

Integration Tests:

Compiled successfully with scram b -j 8
No compilation warnings or errors
All modified code follows CMS coding standards

Known Issues:

Dynamic model loading tests require polling to be disabled in configuration
Some unit tests need mock server environment adjustments
Will be addressed in follow-up commits or configuration updates

Documentation:

Code includes inline documentation for new methods
Test configurations demonstrate usage patterns
README updates may be needed (can be done in follow-up)

Backport Information:

This PR is NOT a backport. It is intended for CMSSW_15_1_X release cycle.

If backporting becomes necessary, it would target future release cycles after initial integration and validation in CMSSW_15_1_X.

…r method in TritonClient. Update BuildFile.xml and fix formatting in header files.

…tructor for TritonClient, and update BuildFile.xml to include Catch2 for testing.

…tests; remove old cfg

…lection; remove unused parameters and improve documentation.

- Introduced `loadModel` and `unloadModel` methods for managing model lifecycle. - Added mutex for thread safety during model operations. - Updated `TritonService` header and implementation to support dynamic model management. - Enhanced logging for model loading and unloading processes. - Updated test configurations to include dynamic model loading tests.

…requirements - Modified input handling to utilize actual model input for "x" instead of dummy data. - Adjusted shape and data allocation for input to meet base class expectations. - Updated parameter set description method to use TritonClient for configuration.

HeterogeneousCore/SonicTriton/src/TritonService.cc

kpedro88 · 2025-10-17T10:11:20Z

I have not looked at the test code yet because I think the logic in the TritonService needs to be addressed first.

Another general point: part of the idea for dynamic loading with the fallback server would be to get rid of the unservedModels_ list that is currently formed at the start of the job. The model repository folder for the fallback server should be created with all models known to the job included, but the fallback server should be launched in explicit model control mode (modifying the cmsTriton script). Then, whenever a module switches over to the fallback server, it should ask the fallback server to load its model. (Dynamic loading with remote servers may reuse some of the logic, but will be somewhat different and needs more thought.)

- Applied scram b code-format to fix formatting issues - Added TRITON_THROW_IF_ERROR for error handling in loadModel/unloadModel - Fixed unload semantics to not erase tracking data on failure - Added comments explaining dynamic loading test requirements - Removed spurious formatting changes

- Removed loadedModels_ set, now derive loaded status from modelRefCount_ > 0 - Simplified loadModel() and unloadModel() to only work with fallback server - Removed server searching logic since only fallback is supported - Added documentation clarifying fallback-only limitation - Updated error messages to reflect this architectural decision - Fixed XML comment syntax in BuildFile.xml

…ading - Updated retry logic to switch to the fallback server directly for model loading. - Removed previous best server selection logic, simplifying the retry process. - Added logging for fallback loading failures and re-evaluation after loading. - Ensured dynamic loading is explicitly handled in TritonService with updated command options.

…in TritonService - Introduced FallbackModelState to track dynamic model state, including reference count and model path. - Updated loadModel and unloadModel methods to utilize FallbackModelState for managing model lifecycle. - Enhanced logging to reflect model state changes and operations. - Adjusted test configuration to reflect changes in module handling for dynamic model loading.

… handling - Eliminated unservedModels_ from TritonService, simplifying model management. - Updated addModel and preBeginJob methods to work directly with models_. - Enhanced error handling in loadModel for fallback server scenarios. - Improved comments for clarity on model path handling and fallback server usage.

kpedro88 · 2025-11-04T22:25:54Z

HeterogeneousCore/SonicTriton/src/TritonService.cc

-  }
+  // Ensure model exists in declared models; preserve non-empty path if provided
+  auto& modelInfo(models_.emplace(modelName, path).first->second);
+  if (modelInfo.path.empty() && !path.empty()) modelInfo.path = path;


why is this line necessary? path is used to construct the Model object on the previous line.

kpedro88 · 2025-11-04T22:35:51Z

HeterogeneousCore/SonicTriton/src/TritonService.cc

    fallbackOpts_.command += " -w " + std::to_string(fallbackOpts_.wait);
-  for (const auto& [modelName, model] : unservedModels_) {
+  // Explicit model control mode is required for dynamic loading
+  fallbackOpts_.command += " --model-control-mode explicit";


this does not go here. it has to be inserted in the correct places in scripts/cmsTriton. probably a short option should be provided to toggle the behavior (e.g., if we make this the default, then -L could disable it).

Here is a patch showing how I had briefly tested this in cmsTriton recently (without adding a toggle):

diff --git a/HeterogeneousCore/SonicTriton/scripts/cmsTriton b/HeterogeneousCore/SonicTriton/scripts/cmsTriton index ecf84107bac7..a775ed872697 100755 --- a/HeterogeneousCore/SonicTriton/scripts/cmsTriton +++ b/HeterogeneousCore/SonicTriton/scripts/cmsTriton @@ -245,7 +245,7 @@ handle_ports(){ start_docker(){ # mount all model repositories MOUNTARGS="" - REPOARGS="" + REPOARGS="--model-control-mode=explicit" for REPO in ${REPOS[@]}; do MOUNTARGS="$MOUNTARGS -v$REPO:$REPO" REPOARGS="$REPOARGS --model-repository=${REPO}" @@ -265,7 +265,7 @@ start_docker(){ start_podman(){ # mount all model repositories MOUNTARGS="" - REPOARGS="" + REPOARGS="--model-control-mode=explicit" for REPO in ${REPOS[@]}; do MOUNTARGS="$MOUNTARGS --volume $REPO:$REPO" REPOARGS="$REPOARGS --model-repository=${REPO}" @@ -285,7 +285,7 @@ start_podman(){ start_apptainer(){ # mount all model repositories MOUNTARGS="" - REPOARGS="" + REPOARGS="--model-control-mode=explicit" for REPO in ${REPOS[@]}; do MOUNTARGS="$MOUNTARGS -B $REPO" REPOARGS="$REPOARGS --model-repository=${REPO}"

kpedro88 · 2025-11-04T22:38:46Z

HeterogeneousCore/SonicTriton/src/TritonService.cc

  descriptions.addWithDefaultLabel(desc);
 }
+
+bool TritonService::loadModel(const std::string& modelName, const std::string& path) {


why does this function allow a different path? the model path is part of the Model object, which is already stored.

kpedro88 · 2025-11-04T22:52:40Z

HeterogeneousCore/SonicTriton/src/TritonService.cc

+  // if already loaded, bump refcount
+  if (state.refCount > 0) {
+    ++state.refCount;
+    modelRefCount_[state.modelName] = state.refCount;


it seems like there is unnecessary duplication between Model and FallbackModelState objects, and additionally modelRefCount_ and fallbackModels_. Having to maintain two copies of the refCount info is not preferable.

Instead, I would incorporate the refCount member variable into Model itself, and then only deal directly with Model objects from the models_ map. (Possibly keeping a list of model names currently used by the fallback server, and then getting the objects out of the map by name.)

kpedro88 · 2025-11-04T22:55:04Z

HeterogeneousCore/SonicTriton/src/TritonService.cc

+  if (state.modelName.empty()) state.modelName = modelName;
+  if (state.path.empty() && !path.empty()) state.path = path;
+
+  return loadModel(state);


The lock still applies to everything that happens in the delegated internal function, so this change does not actually address the underlying point. Again, this is fine for a first implementation, but it will need further examination later.

kpedro88 · 2025-11-04T23:03:06Z

HeterogeneousCore/SonicTriton/src/TritonService.cc

+                        false);
+
+  auto err = client->UnloadModel(state.modelName);
+  TRITON_THROW_IF_ERROR(err, "unloadModel: failed to unload model " + state.modelName + " from fallback server",


wrap call (as above)

kpedro88 · 2025-11-04T23:05:36Z

HeterogeneousCore/SonicTriton/src/RetryActionDiffServer.cc

+    edm::Service<TritonService> ts;
+
+    // Fallback-only: update client to fallback server and dynamically load the model there
+    const std::string& fallbackName = TritonService::Server::fallbackName;


there is no reason for this variable to exist. just pass TritonService::Server::fallbackName directly to the function below.

kpedro88 · 2025-11-04T23:08:56Z

HeterogeneousCore/SonicTriton/src/RetryActionDiffServer.cc

+
+    // Fallback-only: update client to fallback server and dynamically load the model there
+    const std::string& fallbackName = TritonService::Server::fallbackName;
+    tritonClient->updateServer(fallbackName);


the recent changes in this file seem to have completely eliminated the use of getBestServer() to find another remote server. that is not correct. please restore the original functionality.

kpedro88 · 2025-11-04T23:10:46Z

HeterogeneousCore/SonicTriton/test/DynamicModelLoadingProducer.cc

+        testModelName_(cfg.getParameter<std::string>("testModelName")),
+        testModelPath_(cfg.getParameter<std::string>("testModelPath")),


why are separate model name and path parameters necessary for this test producer? Triton producers already have these parameters through the client. Just use the existing parameters. (this should also eliminate the need for the extra "path" argument in loadModel())

kpedro88 · 2025-11-04T23:13:05Z

HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py


 parser = getParser()
-parser.add_argument("--modules", metavar=("MODULES"), default=["TritonGraphProducer"], nargs='+', type=str, choices=list(models), help="list of modules to run (choices: %(choices)s)")
+parser.add_argument("--modules", metavar=("MODULES"), default=["DynamicModelLoadingProducer"], nargs='+', type=str, choices=list(models), help="list of modules to run (choices: %(choices)s)")


the default should not be changed

Martin and others added 21 commits October 12, 2025 22:17

RetryAction compiles

2e55676

Include RetryAction in SonicClientBase

7831591

Update PR comments

d7d40b0

PR comments, fix fillDescriptions

5cb6e8e

rebase 15_1_0_pre6

a10ee94

Add update server function for client

8843bbc

Add test for Triton retry action in BuildFile.xml

2f06ebe

Implement retry logic in RetryActionDiffServer and add connectToServe…

6d61227

…r method in TritonClient. Update BuildFile.xml and fix formatting in header files.

Add RetryActionDiffServer class documentation, implement testing cons…

c5d4a93

…tructor for TritonClient, and update BuildFile.xml to include Catch2 for testing.

SonicTriton: implement retry action against different server; update …

91f5cb3

…tests; remove old cfg

Refactor RetryActionDiffServer to utilize TritonService for server se…

00cfd4d

…lection; remove unused parameters and improve documentation.

rebase 15_1_0_pre6

aec18e6

Fixes to compile

d07ba13

PR comments

ccc6607

Move retry options to customize.py

5fee179

more clean ups

2670dee

First draft for server health

143be1e

remove redundant

9d264df

Use getBestServer in RetryDiffServerAction

965cae1

kpedro88 reviewed Oct 17, 2025

View reviewed changes

trevin-lee added 5 commits November 2, 2025 13:42

kpedro88 reviewed Nov 4, 2025

View reviewed changes

		testModelName_(cfg.getParameter<std::string>("testModelName")),
		testModelPath_(cfg.getParameter<std::string>("testModelPath")),

Sonic dynamic model loading #25

Are you sure you want to change the base?

Sonic dynamic model loading #25

Uh oh!

Conversation

trevin-lee commented Oct 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR description:

PR validation:

Backport Information:

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kpedro88 commented Oct 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

trevin-lee commented Oct 13, 2025 •

edited

Loading