Enhance service lifecycle management, cohort handling, and status synchronization: #77
…chronization:
- Add `registerCohortUpdates` for tracking geometry changes in cohorts.
- Refactor `ServiceMonitor` with improved client status handling and local service shutdown coordination.
- Introduce `stopRequested` flag for process stop management in `LocalInstanceImpl`.
- Add methods for geometry retrieval and validation across services.
- Improve `refreshStatus` and `refreshClientStatus` synchronization logic.
Greptile Summary
This PR enhances service lifecycle management across the k.LAB engine: it introduces an async shutdown flow with a…
Confidence Score: 3/5
Not safe to merge as-is: LSP restarts will silently fail, cohort commit transactions will NPE when geometry is null, and restart failures after a stop timeout are swallowed with no diagnostic.
Three separate defects are present on active code paths:
- The LSP server restart sequence calls stop() then start() without waiting for the WAITING state to clear, so every restart returns false silently.
- Cohort geometry serialization calls .encode() on a potentially-null getGeometry(), crashing the whole Neo4j commit transaction.
- forceRestart() abandons the restart after a 10-second timeout with no log or exception, leaving callers with no information about the failure.
ServiceMonitor.java (LSP restart), AbstractKnowledgeGraph.java (cohort geometry NPE), LocalInstanceImpl.java (silent restart failure after timeout), and WorldviewValidationScope.java (transient empty scope registration).
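The two restart defects described in the summary share one remedy: wait for the stop to actually complete before starting again, and surface a timeout instead of returning quietly. The following is a minimal sketch of that idea, not the PR's implementation. `RestartSketch`, `awaitCondition`, and `restart` are hypothetical names, and the `stopped` supplier stands in for whatever check reports that the WAITING state has cleared.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of the k.LAB codebase. */
final class RestartSketch {

    /** Polls until the condition holds or the timeout elapses; returns whether it held. */
    static boolean awaitCondition(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /**
     * Stops the service, waits until it reports stopped, then starts it again.
     * Throws rather than silently abandoning the restart when the stop times out.
     */
    static void restart(Runnable stop, BooleanSupplier stopped, Runnable start, Duration timeout) {
        stop.run();
        if (!awaitCondition(stopped, timeout, Duration.ofMillis(10))) {
            throw new IllegalStateException(
                    "service did not stop within " + timeout + "; restart abandoned");
        }
        start.run();
    }
}
```

Throwing an unchecked exception is only one option. Logging the timeout and returning a failure result would also give callers the diagnostic that the review says is currently missing.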
Sequence Diagram
sequenceDiagram
participant E as EngineImpl
participant SM as ServiceMonitor
participant LI as LocalInstanceImpl
participant CC as ServiceClientCatalog
participant SC as BaseServiceClient
Note over E,SC: Service Startup
E->>SM: startLocalServices(stack, tag, user)
SM->>SM: publishTransitionStatus(false)
SM->>LI: startAuxiliaryService(DATABASE)
SM->>LI: startAuxiliaryService(AMQP_BROKER)
SM->>LI: product.start() for each primary service
SM->>SM: refreshLocalClientStatusesAsync()
SM-->>CC: timedTasks() poll on schedule
Note over E,SC: Status Refresh Flow
CC->>CC: "refreshStatus(notifyListeners=true)"
CC->>SC: HTTP GET /status
SC-->>CC: ServiceStatusImpl
CC->>CC: notify statusListeners
SM->>SM: handleStatus() → recomputeEngineStatus()
SM-->>E: engineConsumers.accept(engineStatus)
E->>E: ensureRuntimeAuxiliariesForLocalRuntime(status)
Note over E,SC: Service Shutdown
E->>SM: stopLocalServices()
SM->>SM: "stoppingLocalServices=true"
SM->>SM: publishTransitionStatus(true)
loop parallel virtual threads
SM->>SC: requestShutdown()
end
loop join shutdown threads
SM->>SM: wait for each thread
end
loop primary service instances
SM->>LI: service.stop()
LI->>LI: "stopRequested=true, status=WAITING"
LI->>LI: monitorAlreadyRunningProcess(pid)
LI->>LI: watchdog.destroyProcess()
Note over LI: onProcessFailed → markStopped()
end
SM->>SM: refreshLocalClientStatuses()
SM->>SM: recomputeEngineStatus()
E->>SM: stopApplicationAuxiliaryServices()
SM->>SM: stopLanguageServer()
SM->>SM: stopRuntimeAuxiliaryServices()
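The shutdown portion of the diagram has two phases: shutdown requests are sent to all clients in parallel and joined, and only then are the local instances stopped one by one. A rough sketch of that ordering follows. The `Client` and `Instance` interfaces are stand-ins invented for the example, and plain platform threads are used here where the diagram mentions virtual threads.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the two-phase shutdown ordering; not the PR's code. */
final class ShutdownSketch {

    /** Stand-in for a service client that can be asked to shut down. */
    interface Client {
        void requestShutdown();
    }

    /** Stand-in for a locally managed service process. */
    interface Instance {
        void stop();
    }

    static void stopAll(List<? extends Client> clients, List<? extends Instance> instances) {
        // Phase 1: send every shutdown request on its own thread.
        List<Thread> threads = new ArrayList<>();
        for (Client client : clients) {
            Thread thread = new Thread(client::requestShutdown);
            thread.start();
            threads.add(thread);
        }
        // Join all request threads before touching the processes.
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                // Restore the flag and stop waiting; the instances are still stopped below.
                Thread.currentThread().interrupt();
                break;
            }
        }
        // Phase 2: stop each local instance in order.
        for (Instance instance : instances) {
            instance.stop();
        }
    }
}
```

Joining before the second phase gives each service a chance to shut down cleanly before its process is destroyed.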
- Improve polling logic with failure threshold handling for remote services.
- Add shutdown hooks for auxiliary services, including language server.
- Enhance auxiliary service startup and stop logic with better status synchronization.
- Adjust service configuration defaults for compatibility and performance.
- Add backward-compatibility checks in geometry handling logic.
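The commit message does not say how the failure threshold for remote polling works. One common approach, sketched here purely as an assumption, is to count consecutive failed polls and treat the service as unavailable only once the count reaches a limit. `FailureThresholdSketch` and `record` are hypothetical names.

```java
/** Hypothetical consecutive-failure tracker; the PR's actual polling logic may differ. */
final class FailureThresholdSketch {

    private final int threshold;
    private int consecutiveFailures;

    FailureThresholdSketch(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records the outcome of one poll and returns whether the service
     * should still be considered available.
     */
    synchronized boolean record(boolean pollSucceeded) {
        if (pollSucceeded) {
            // Any success resets the streak.
            consecutiveFailures = 0;
        } else if (consecutiveFailures < threshold) {
            consecutiveFailures++;
        }
        return consecutiveFailures < threshold;
    }
}
```

With a threshold of 3, a single dropped status request does not flip the service to unavailable, while three in a row do.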
…tion. Simplify geometry null check in `KnowledgeGraphNeo4j`.
…eImpl` and `ServiceMonitor`.
ret.put("childrenCount", cohort.getChildrenCount());
ret.put("urn", cohort.getObservable().getUrn() + "_cohort");
ret.put("id", cohort.getId());
ret.put("geometry", cohort.getGeometry().encode());
cohort.getGeometry() can return null for a freshly constructed cohort whose geometry has not been set, which would make .encode() throw a NullPointerException and crash the entire commit transaction. A null guard (falling back to an empty string or skipping the property) is needed before calling encode().
Suggested change:
-ret.put("geometry", cohort.getGeometry().encode());
+ret.put("geometry", cohort.getGeometry() != null ? cohort.getGeometry().encode() : null);
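The other option the comment mentions, skipping the property when no geometry is set, can be sketched as below. This is an illustration only: `CohortPropertiesSketch` and its nested `Geometry` interface are stand-ins for the real k.LAB types, and the method takes the values directly instead of a cohort object.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the null guard; not the actual AbstractKnowledgeGraph code. */
final class CohortPropertiesSketch {

    /** Stand-in for the real geometry type, reduced to the one method used here. */
    interface Geometry {
        String encode();
    }

    static Map<String, Object> properties(String id, Geometry geometry) {
        Map<String, Object> ret = new HashMap<>();
        ret.put("id", id);
        // Read the geometry once and omit the property entirely when it is unset.
        if (geometry != null) {
            ret.put("geometry", geometry.encode());
        }
        return ret;
    }
}
```

Holding the geometry in a local variable also avoids calling the getter twice, as the ternary form in the suggested change does.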