Skip to content

Enhance service lifecycle management, cohort handling, and status synchronization:#77

Merged
fvilla merged 5 commits into
developfrom
hotfix/service-startup-logic
Jun 11, 2026
Merged

Enhance service lifecycle management, cohort handling, and status synchronization:#77
fvilla merged 5 commits into
developfrom
hotfix/service-startup-logic

Conversation

@fvilla

@fvilla fvilla commented Jun 10, 2026

Copy link
Copy Markdown
Member

No description provided.

…chronization:

- Add `registerCohortUpdates` for tracking geometry changes in cohorts.
- Refactor `ServiceMonitor` with improved client status handling and local service shutdown coordination.
- Introduce `stopRequested` flag for process stop management in `LocalInstanceImpl`.
- Add methods for geometry retrieval and validation across services.
- Improve `refreshStatus` and `refreshClientStatus` synchronization logic.
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR enhances service lifecycle management across the k.LAB engine: it introduces an async shutdown flow with a WAITING intermediate state, adds auxiliary-service auto-start for DATABASE and AMQP_BROKER when a local runtime is detected, improves engine status synchronization, and stores cohort geometry in the Neo4j knowledge graph.

  • Service lifecycle overhaul (LocalInstanceImpl, ServiceMonitor, EngineImpl): stop now transitions through WAITING before STOPPED; waitForStop() blocks restarts until the process exits; stopLocalServices is fully async using virtual threads; new ensureRuntimeAuxiliariesForLocalRuntime auto-provisions DATABASE/AMQP_BROKER.
  • Status synchronization improvements (ServiceClientCatalog, KlabService, EngineImpl): remote services are marked offline only after 3 consecutive failed polls; hasChangedComparedTo now tracks shutdown-state changes; engine status diff expanded to include condition and auxiliary-service changes.
  • Cohort geometry persistence (AbstractKnowledgeGraph, KnowledgeGraphNeo4j): cohort geometry is now written to and read from Neo4j nodes, with a backward-compatibility null check on read.

Confidence Score: 3/5

Not safe to merge as-is: LSP restarts will silently fail, cohort commit transactions will NPE when geometry is null, and restart failures after a stop timeout are swallowed with no diagnostic.

Three separate defects are present on active code paths: the LSP server restart sequence calls stop() then start() without waiting for the WAITING state to clear, so every restart returns false silently. Cohort geometry serialization calls .encode() on a potentially-null getGeometry(), crashing the whole Neo4j commit transaction. And forceRestart() abandons the restart after a 10-second timeout with no log or exception, leaving callers with no information about the failure.

ServiceMonitor.java (LSP restart), AbstractKnowledgeGraph.java (cohort geometry NPE), LocalInstanceImpl.java (silent restart failure after timeout), and WorldviewValidationScope.java (transient empty scope registration).

Important Files Changed

Filename Overview
klab.core.api/src/main/java/org/integratedmodelling/klab/api/configuration/Setting.java Adds LIST_LOCAL_COMMIT_OPERATIONS debug setting and reduces POLLING_INTERVAL_REMOTE default from 20s to 10s.
klab.core.api/src/main/java/org/integratedmodelling/klab/api/services/KlabService.java Adds shutdown-state comparison to hasChangedComparedTo so callers detect service shutdown transitions.
klab.core.common/src/main/java/org/integratedmodelling/common/distribution/LocalInstanceImpl.java Overhauls stop/start lifecycle: introduces stopRequested flag, WAITING intermediate state, waitForStop(), and markStopped/markError helpers. forceRestart now blocks until stop completes, but a timeout expiry leaves status in WAITING and start() silently returns false.
klab.core.common/src/main/java/org/integratedmodelling/common/services/client/BaseServiceClient.java Splits shutdown() into shutdown() and requestShutdown(); adds refreshStatus() passthrough. Clean refactor.
klab.core.common/src/main/java/org/integratedmodelling/common/services/client/ServiceClientCatalog.java Adds consecutive-failure back-off before marking remote services offline; splits timedTasks into refreshStatus(bool) with optional listener notification; switches registeredClients to a concurrent set.
klab.core.common/src/main/java/org/integratedmodelling/common/services/client/engine/EngineImpl.java Adds ensureRuntimeAuxiliariesForLocalRuntime to auto-start DATABASE/AMQP_BROKER when a local runtime is detected; delegates aux-service lifecycle to ServiceMonitor; renames stopAuxServices to stopApplicationAuxiliaryServices.
klab.core.common/src/main/java/org/integratedmodelling/common/services/client/engine/ServiceMonitor.java Major rewrite: async shutdown, auxiliary-service lifecycle, LSP shutdown hook, stoppingLocalServices flag, status snapshots. startLSPServer still calls stop() then immediately start() while status is WAITING, causing silent restart failure.
klab.core.common/src/main/java/org/integratedmodelling/common/utils/Utils.java isAlive() now catches Throwable instead of Exception for broader socket-error resilience.
klab.services.resources/src/main/java/org/integratedmodelling/klab/services/resources/lang/WorldviewValidationScope.java Implements ReasoningValidationScope; parametrized constructor now delegates to no-arg this(), which may register an empty scope as the global validation scope via the parent constructor side effect.
klab.services.runtime/src/main/java/org/integratedmodelling/klab/services/runtime/digitaltwin/DigitalTwinImpl.java Adds registerCohortUpdates() call inside the commit path, but the method body is an empty loop that reads cohort geometry and immediately discards it — no actual update is registered.
klab.services.runtime/src/main/java/org/integratedmodelling/klab/services/runtime/neo4j/AbstractKnowledgeGraph.java Persists cohort geometry in the node properties, but cohort.getGeometry().encode() will NPE if getGeometry() returns null, crashing the entire commit transaction.
klab.services.runtime/src/main/java/org/integratedmodelling/klab/services/runtime/neo4j/KnowledgeGraphNeo4j.java Reads cohort geometry from the node with a null-safe check; backward-compatible for nodes written before this PR. TODO comment acknowledges tentative nature.

Sequence Diagram

sequenceDiagram
    participant E as EngineImpl
    participant SM as ServiceMonitor
    participant LI as LocalInstanceImpl
    participant CC as ServiceClientCatalog
    participant SC as BaseServiceClient

    Note over E,SC: Service Startup
    E->>SM: startLocalServices(stack, tag, user)
    SM->>SM: publishTransitionStatus(false)
    SM->>LI: startAuxiliaryService(DATABASE)
    SM->>LI: startAuxiliaryService(AMQP_BROKER)
    SM->>LI: product.start() for each primary service
    SM->>SM: refreshLocalClientStatusesAsync()
    SM-->>CC: timedTasks() poll on schedule

    Note over E,SC: Status Refresh Flow
    CC->>CC: "refreshStatus(notifyListeners=true)"
    CC->>SC: HTTP GET /status
    SC-->>CC: ServiceStatusImpl
    CC->>CC: notify statusListeners
    SM->>SM: handleStatus() → recomputeEngineStatus()
    SM-->>E: engineConsumers.accept(engineStatus)
    E->>E: ensureRuntimeAuxiliariesForLocalRuntime(status)

    Note over E,SC: Service Shutdown
    E->>SM: stopLocalServices()
    SM->>SM: "stoppingLocalServices=true"
    SM->>SM: publishTransitionStatus(true)
    loop parallel virtual threads
        SM->>SC: requestShutdown()
    end
    loop join shutdown threads
        SM->>SM: wait for each thread
    end
    loop primary service instances
        SM->>LI: service.stop()
        LI->>LI: "stopRequested=true, status=WAITING"
        LI->>LI: monitorAlreadyRunningProcess(pid)
        LI->>LI: watchdog.destroyProcess()
        Note over LI: onProcessFailed → markStopped()
    end
    SM->>SM: refreshLocalClientStatuses()
    SM->>SM: recomputeEngineStatus()
    E->>SM: stopApplicationAuxiliaryServices()
    SM->>SM: stopLanguageServer()
    SM->>SM: stopRuntimeAuxiliaryServices()
Loading

Comments Outside Diff (2)

  1. klab.core.common/src/main/java/org/integratedmodelling/common/services/client/engine/ServiceMonitor.java, line 576-590 (link)

    P1 startLSPServer calls languageServer.stop() and then immediately calls languageServer.start(). stop() now sets the status to WAITING, and start() returns false unconditionally when status is WAITING. The waitForStop() helper added in LocalInstanceImpl is never called here, so every invocation of this code path (the comment even says it is "required even if the server already runs") will silently fail to start the new process.

  2. klab.core.common/src/main/java/org/integratedmodelling/common/distribution/LocalInstanceImpl.java, line 152-162 (link)

    P1 Silent restart failure after waitForStop timeout

    waitForStop() exits after 10 seconds regardless of whether the process actually stopped. When it returns with status still WAITING, forceRestart() calls start(), which hits the status.get() == Status.WAITING guard and returns false silently. The caller receives false with no log, no exception, and no indication that the restart was abandoned mid-flight. Adding a log or throwing when the deadline is exceeded would make the failure visible.

Reviews (4): Last reviewed commit: "Add runtime auxiliary service monitoring..." | Re-trigger Greptile

fvilla added 4 commits June 11, 2026 12:31
- Improve polling logic with failure threshold handling for remote services.
- Add shutdown hooks for auxiliary services, including language server.
- Enhance auxiliary service startup and stop logic with better status synchronization.
- Adjust service configuration defaults for compatibility and performance.
- Add backward-compatibility checks in geometry handling logic.
…tion. Simplify geometry null check in `KnowledgeGraphNeo4j`.
@fvilla fvilla merged commit 7e0a42e into develop Jun 11, 2026
1 check passed
ret.put("childrenCount", cohort.getChildrenCount());
ret.put("urn", cohort.getObservable().getUrn() + "_cohort");
ret.put("id", cohort.getId());
ret.put("geometry", cohort.getGeometry().encode());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 cohort.getGeometry() can return null for a freshly constructed cohort whose geometry has not been set, which would make .encode() throw a NullPointerException and crash the entire commit transaction. A null guard (falling back to an empty string or skipping the property) is needed before calling encode().

Suggested change
ret.put("geometry", cohort.getGeometry().encode());
ret.put("geometry", cohort.getGeometry() != null ? cohort.getGeometry().encode() : null);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant