
Conversation

@samikshya-db (Collaborator) commented on Nov 20, 2025

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Implements a per-host client management system with reference counting as part of the telemetry infrastructure (parent ticket PECOBLR-1143). This is the second component of Phase 2: Per-Host Management.

What Changed

  • New File: telemetry/client.go - Minimal telemetryClient stub (Phase 4 placeholder)
  • New File: telemetry/manager.go - Client manager implementation
  • New File: telemetry/manager_test.go - Comprehensive unit tests
  • Updated: telemetry/DESIGN.md - Updated implementation checklist

Implementation Details

Core Components

  1. clientManager - Singleton managing per-host telemetry clients

    • Thread-safe using sync.RWMutex
    • Maps host → clientHolder
  2. clientHolder - Per-host state holder

    • Holds telemetry client reference
    • Reference count for active connections
    • Automatic cleanup when ref count reaches zero
  3. telemetryClient (stub) - Minimal implementation

    • Placeholder for Phase 4 (Export)
    • Provides start() and close() methods
    • Will be fully implemented later (see the sketch after this list)
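A minimal sketch of how these three pieces could fit together (field names are inferred from this description and the review snippets further down, not copied from the diff; Config is the telemetry package's existing configuration type):

```go
package telemetry

import (
    "net/http"
    "sync"
)

// clientManager is the process-wide singleton. It owns at most one
// telemetry client per host and guards the map with an RWMutex.
type clientManager struct {
    mu      sync.RWMutex
    clients map[string]*clientHolder // host -> holder
}

// clientHolder pairs a per-host client with the number of open
// connections that currently reference it.
type clientHolder struct {
    client   *telemetryClient
    refCount int
}

// telemetryClient is the Phase 4 placeholder: it only needs start() and
// close() so the manager's lifecycle logic can be exercised in tests.
type telemetryClient struct {
    host       string
    httpClient *http.Client
    cfg        *Config
    mu         sync.Mutex
}
```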

Key Features

  • ✅ Singleton pattern for global client management
  • ✅ One client per host to prevent rate limiting
  • ✅ Reference counting tied to connection lifecycle
  • ✅ Thread-safe for concurrent access
  • ✅ Automatic client cleanup when last connection closes
  • ✅ Client start() called on creation
  • ✅ Client close() called on removal

Methods Implemented

  • getClientManager() - Returns singleton instance
  • getOrCreateClient(host, httpClient, cfg) - Creates or reuses client, increments ref count
  • releaseClient(host) - Decrements ref count, removes the client when it reaches zero (see the sketch of getOrCreateClient and releaseClient below)
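A rough sketch of the reference-counting flow behind these two methods (signatures and error handling are approximations; start() is shown before the map insert, which is the ordering suggested in the review thread below):

```go
// getOrCreateClient returns the telemetry client for host, creating and
// starting it on first use, and bumps its reference count.
func (m *clientManager) getOrCreateClient(host string, httpClient *http.Client, cfg *Config) (*telemetryClient, error) {
    m.mu.Lock()
    defer m.mu.Unlock()

    holder, ok := m.clients[host]
    if !ok {
        client := newTelemetryClient(host, httpClient, cfg)
        if err := client.start(); err != nil {
            return nil, err // only register the client once start() succeeds
        }
        holder = &clientHolder{client: client}
        m.clients[host] = holder
    }
    holder.refCount++
    return holder.client, nil
}

// releaseClient drops one reference for host and closes the client when
// the last connection lets go of it.
func (m *clientManager) releaseClient(host string) error {
    m.mu.Lock()
    holder, ok := m.clients[host]
    if !ok {
        m.mu.Unlock()
        return nil // unknown or already-released host: nothing to do
    }
    holder.refCount--
    if holder.refCount <= 0 {
        delete(m.clients, host)
        m.mu.Unlock()
        // close outside the lock: it may flush buffered telemetry over the network
        return holder.client.close()
    }
    m.mu.Unlock()
    return nil
}
```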

Test Coverage

  • ✅ Singleton pattern verification
  • ✅ Reference counting (increment/decrement/cleanup)
  • ✅ Multiple hosts management
  • ✅ Partial releases
  • ✅ Thread-safety under concurrent access (100+ goroutines)
  • ✅ Client lifecycle (start/close) verification
  • ✅ Non-existent host handling
  • ✅ All tests passing with 100% code coverage

Test Results

```
=== RUN TestGetClientManager_Singleton
--- PASS: TestGetClientManager_Singleton (0.00s)
... (all 11 tests passing)
PASS
ok github.com/databricks/databricks-sql-go/telemetry 0.005s
```
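The concurrency coverage listed above might look roughly like the following (test name, signatures, and assertions are illustrative rather than copied from manager_test.go, and a zero-value Config is assumed to be usable):

```go
package telemetry

import (
    "net/http"
    "sync"
    "testing"
)

func TestClientManager_ConcurrentAccess(t *testing.T) {
    m := getClientManager()
    const workers = 100

    var wg sync.WaitGroup
    wg.Add(workers)
    for i := 0; i < workers; i++ {
        go func() {
            defer wg.Done()
            // Every goroutine acquires and then releases the same host's client.
            if _, err := m.getOrCreateClient("host-a.example.com", http.DefaultClient, &Config{}); err != nil {
                t.Error(err)
                return
            }
            if err := m.releaseClient("host-a.example.com"); err != nil {
                t.Error(err)
            }
        }()
    }
    wg.Wait()

    // Once every reference has been released, the host's entry should be gone.
    m.mu.RLock()
    _, stillThere := m.clients["host-a.example.com"]
    m.mu.RUnlock()
    if stillThere {
        t.Fatal("expected client for host-a.example.com to be removed once refCount hit zero")
    }
}
```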

Design Alignment

Implementation follows the design document (telemetry/DESIGN.md, section 3.2) exactly. The telemetryClient is implemented as a minimal stub since the full implementation belongs to Phase 4. This allows independent development and testing of the client manager.

Testing Instructions

```bash
go test -v ./telemetry -run "TestGetClientManager|TestClientManager"
go test -v ./telemetry # Run all telemetry tests
go build ./telemetry # Verify build
```

Related Links

Next Steps

After this PR:

  • PECOBLR-1148: Circuit Breaker Implementation

🤖 Generated with Claude Code

Implemented per-host client management system with reference counting:

Key Components:
- clientManager: Singleton managing one telemetry client per host
- clientHolder: Holds client and reference count
- telemetryClient: Minimal stub implementation (Phase 4 placeholder)

Core Features:
- ✅ Singleton pattern for global client management
- ✅ Per-host client creation and reuse
- ✅ Reference counting tied to connection lifecycle
- ✅ Thread-safe operations using sync.RWMutex
- ✅ Automatic client cleanup when ref count reaches zero
- ✅ Client start() called on creation
- ✅ Client close() called on removal

Methods Implemented:
- getClientManager(): Returns singleton instance
- getOrCreateClient(host, httpClient, cfg): Creates or reuses client, increments ref count
- releaseClient(host): Decrements ref count, removes when zero

Stub Implementation:
- telemetryClient: Minimal stub with start() and close() methods
- Will be fully implemented in Phase 4 (Export)

Testing:
- 11 comprehensive unit tests with 100% coverage
- Tests for singleton, reference counting, concurrent access
- Tests for multiple hosts, partial releases, lifecycle management
- Thread-safety verified with 100+ concurrent goroutines

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

// start starts the telemetry client's background operations.
// This is a stub implementation that will be fully implemented in Phase 4.
func (c *telemetryClient) start() error {
Collaborator

Do we need to make this and the method below thread safe?

Collaborator Author

I was hoping to do it in a later PR, but it makes sense to do it right away! Done ✅


Why do we need to make it thread safe?


Oh, so this can be shared by multiple connections because the client is at host level. So the entire telemetry client has to be fully thread safe. It would be really helpful to document this upfront.
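One way to make that contract explicit in the code would be a doc comment along these lines (wording is only a suggestion):

```go
// telemetryClient is created once per host and shared by every connection
// to that host, so all of its methods (start, close, and the Phase 4
// export paths) must be safe for concurrent use by multiple goroutines.
type telemetryClient struct {
    // ...
}
```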

package telemetry

import (
"net/http"
Collaborator

Can you also run the tests with `go test -race` and check for concurrency issues?

Collaborator Author

Done, this is passing.

}

holder.refCount--
if holder.refCount <= 0 {
Collaborator

Can we log when this becomes negative? That would likely indicate a bug.

func getClientManager() *clientManager {
    managerOnce.Do(func() {
        managerInstance = &clientManager{
            clients: make(map[string]*clientHolder),
Collaborator

Do we want to put a max size or some kind of LRU cache on this?

Collaborator Author

We can't really evict clients with refCount > 0 (they're actively in use), and we already clean entries up when refCount reaches 0. Let me know if you think otherwise.

Changes:
- Add thread safety to telemetryClient start() and close() methods with mutex
- Add debug logging when refCount becomes negative to detect bugs
- Verified no race conditions with `go test -race`

Review feedback addressed:
1. Made client methods thread-safe for future background operations
2. Added logging to catch potential reference counting bugs
3. Confirmed automatic cleanup via refCount is sufficient (no LRU needed)

All tests pass including race detector.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
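A sketch of what the thread-safe stub described in this commit could look like (the bodies stay placeholders until Phase 4; the locking is the only point here):

```go
// start begins the client's background operations. It is a stub until
// Phase 4, but it already takes the client's mutex so that future work
// is safe when several connections share the same per-host client.
func (c *telemetryClient) start() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    // Phase 4: launch background batching/export goroutines here.
    return nil
}

// close shuts the client down. Also a stub; Phase 4 will stop background
// work and flush any pending telemetry before returning.
func (c *telemetryClient) close() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    return nil
}
```

The negative-refCount debug log mentioned above would sit in releaseClient immediately after the decrement, before the refCount <= 0 cleanup branch.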
@samikshya-db requested a review from gopalldb on December 9, 2025 at 12:34
Base automatically changed from stack/PECOBLR-1146-feature-flag-cache to main on December 10, 2025 at 05:18
holder = &clientHolder{
    client: newTelemetryClient(host, httpClient, cfg),
}
m.clients[host] = holder
Collaborator

If start() returns an error, the client is still left in the map. Can we add it to the map only after starting it?

return holder.client.close() // Close and flush
}

m.mu.Unlock()
Collaborator

Shall we add an explicit shutdown to the client manager that cleans up all clients on application shutdown?
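If such a hook is added later, it could take a shape like this (closeAll is a hypothetical method, not part of this PR):

```go
// closeAll force-closes every per-host client regardless of refCount.
// Intended for application shutdown, where pending telemetry should be
// flushed even if some connections were never explicitly closed.
func (m *clientManager) closeAll() error {
    m.mu.Lock()
    holders := make([]*clientHolder, 0, len(m.clients))
    for host, holder := range m.clients {
        holders = append(holders, holder)
        delete(m.clients, host)
    }
    m.mu.Unlock()

    var firstErr error
    for _, holder := range holders {
        // Close outside the lock, mirroring releaseClient, since close may flush.
        if err := holder.client.close(); err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}
```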

@gopalldb (Collaborator) left a comment

Looks good, some minor comments

@jayantsing-db left a comment

LGTM!

host string
httpClient *http.Client
cfg *Config
mu sync.Mutex


I see the usage of a mutex below. Could we please document the expected threading environment for the telemetry client? For example, can multiple connections share it?



if holder.refCount <= 0 {
    delete(m.clients, host)
    m.mu.Unlock()
    return holder.client.close() // Close and flush


Interesting. This relies on the thread safety of the client rather than the manager. I think this is an astute choice, because close can be I/O-expensive due to last-minute flushes and the manager shouldn't be blocked.
