Skip to content

Conversation

@mrproliu
Copy link
Contributor

@mrproliu mrproliu commented Jan 7, 2026

@mrproliu mrproliu added this to the 0.10.0 milestone Jan 7, 2026
@mrproliu mrproliu requested review from Copilot and hanahmily January 7, 2026 07:35
@mrproliu mrproliu added the enhancement New feature or request label Jan 7, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds file-based node discovery as a third discovery mechanism to BanyanDB, alongside the existing etcd-based and DNS-based discovery modes. This enables static node configuration through YAML files, suitable for development, testing, and simple deployment scenarios.

Key Changes:

  • Implements a new file-based node discovery service that reads node configurations from a YAML file and periodically reloads them
  • Adds comprehensive documentation for file-based discovery configuration and usage
  • Refactors DNS discovery into a dedicated discovery/dns package (previously under metadata/dns)

Reviewed changes

Copilot reviewed 6 out of 11 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
docs/operation/node-discovery.md Adds documentation for the new file-based discovery mode, including configuration examples, error handling, and usage guidance
banyand/metadata/discovery/file/file.go Core implementation of file-based node discovery with YAML config parsing, periodic reloading, and gRPC metadata fetching
banyand/metadata/discovery/file/metrics.go Defines metrics for monitoring file discovery operations (load count, failures, duration, node count)
banyand/metadata/discovery/file/file_test.go Comprehensive test coverage for file discovery including configuration validation, node caching, handler notifications, and concurrent access
banyand/metadata/discovery/dns/dns.go DNS discovery implementation moved to new package structure with TLS support and certificate reloading
banyand/metadata/discovery/dns/metrics.go Metrics definitions for DNS discovery operations
banyand/metadata/discovery/dns/export_test.go Test exports for DNS discovery internal methods
banyand/metadata/discovery/dns/dns_test.go Updates import path to reflect new package structure
banyand/metadata/discovery/dns/testdata/* TLS test certificates for DNS discovery testing
banyand/metadata/client.go Integrates file discovery mode with metadata client, adds configuration flags and lifecycle management

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 288 to 296
The service monitors the configuration file for changes and automatically updates the node registry when the file is modified.

### How it Works

1. Read node configurations from a YAML file on startup
2. Attempt to connect to each node via gRPC to fetch full metadata
3. Successfully connected nodes are added to the cache
4. Failed nodes are tracked separately and retried periodically
5. Watch the file for changes and reload automatically
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation describes retry behavior for failed nodes (line 296, 312, 364), but the actual implementation uses FetchInterval for periodic file reloading in periodicFetch function, not specifically for retrying failed nodes. Failed nodes are simply skipped and will be retried on the next periodic file load. Consider clarifying that this interval controls how often the file is reloaded and all nodes are re-attempted, rather than implying a specific retry queue mechanism.

Suggested change
The service monitors the configuration file for changes and automatically updates the node registry when the file is modified.
### How it Works
1. Read node configurations from a YAML file on startup
2. Attempt to connect to each node via gRPC to fetch full metadata
3. Successfully connected nodes are added to the cache
4. Failed nodes are tracked separately and retried periodically
5. Watch the file for changes and reload automatically
The service periodically reloads the configuration file and automatically updates the node registry when changes are detected.
### How it Works
1. Read node configurations from a YAML file on startup
2. Attempt to connect to each node via gRPC to fetch full metadata
3. Successfully connected nodes are added to the cache
4. Nodes that fail to connect are skipped and will be attempted again on the next periodic file reload
5. Reload the file at a configured interval (FetchInterval), reprocessing all nodes (including previously failed ones)

Copilot uses AI. Check for mistakes.
Comment on lines 414 to 422
if s.dnsDiscovery != nil {
return s.dnsDiscovery
}
// If file discovery is enabled, use it instead of etcd
if s.fileDiscovery != nil {
return s.fileDiscovery
}
// Otherwise use etcd schema registry
return s.schemaRegistry
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of checking discovery services in NodeRegistry() is inconsistent with the order in RegisterHandler(). In RegisterHandler, file discovery is checked before DNS discovery (lines 373-379), but in NodeRegistry, DNS discovery is checked before file discovery (lines 414-422). This inconsistency could lead to confusion. Consider using a consistent priority order across all methods.

Copilot uses AI. Check for mistakes.
--node-discovery-grpc-timeout=5s # Timeout for metadata fetch (default: 5s)
# Retry settings
--node-discovery-file-retry-interval=20s # Retry interval for failed nodes (default: 20s)
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value shown in the comment (default: 20s) doesn't match the flag name which suggests it should be for retry interval, but the configuration flag on line 312 is actually for retry-interval while line 306 shows file-path. The documentation should clarify that this is the periodic fetch/reload interval for checking file changes, not just retry interval for failed nodes.

Suggested change
--node-discovery-file-retry-interval=20s # Retry interval for failed nodes (default: 20s)
--node-discovery-file-retry-interval=20s # Interval to poll the discovery file and retry failed nodes (default: 20s)

Copilot uses AI. Check for mistakes.
assert.Equal(t, nodeName, nodes[0].GetMetadata().GetName())
assert.Equal(t, serverAddr, nodes[0].GetGrpcAddress())

nodeFromCache, getErr := svc.GetNode(ctx, serverAddr)
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test passes "addr" (an address like "127.0.0.1:12345") to GetNode but based on the interface contract established by the etcd and DNS implementations, GetNode should accept a node name, not an address. This test is passing because the file implementation incorrectly uses address as the cache key, but this is actually testing incorrect behavior. The test should be updated to call GetNode with the node name once the implementation is fixed to properly look up by name.

Suggested change
nodeFromCache, getErr := svc.GetNode(ctx, serverAddr)
nodeFromCache, getErr := svc.GetNode(ctx, nodeName)

Copilot uses AI. Check for mistakes.
Comment on lines 307 to 309
// file mode doesn't support role filtering as roles are not stored in config
// return all nodes when ROLE_UNSPECIFIED, otherwise return empty list
if role == databasev1.Role_ROLE_UNSPECIFIED {
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states that file mode doesn't support role filtering and returns an empty list for non-UNSPECIFIED roles. However, nodes fetched via gRPC (fetchNodeMetadata) include role information in their metadata. Consider fetching and storing role information from the nodes, then implementing proper role filtering like DNS discovery does, rather than silently returning empty results which could be confusing for users expecting role-based filtering to work.

Suggested change
// file mode doesn't support role filtering as roles are not stored in config
// return all nodes when ROLE_UNSPECIFIED, otherwise return empty list
if role == databasev1.Role_ROLE_UNSPECIFIED {
// return all nodes when ROLE_UNSPECIFIED, otherwise filter by node role
if role == databasev1.Role_ROLE_UNSPECIFIED || node.GetRole() == role {

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link

codecov-commenter commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 58.43829% with 165 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.00%. Comparing base (3530dd9) to head (b8e92b9).
⚠️ Report is 101 commits behind head on main.

Files with missing lines Patch % Lines
banyand/metadata/discovery/file/file.go 66.96% 53 Missing and 20 partials ⚠️
banyand/metadata/client.go 26.53% 31 Missing and 5 partials ⚠️
banyand/metadata/discovery/dns/dns.go 73.95% 20 Missing and 5 partials ⚠️
banyand/metadata/discovery/file/metrics.go 0.00% 19 Missing ⚠️
banyand/metadata/discovery/dns/metrics.go 0.00% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
+ Coverage   45.97%   47.00%   +1.02%     
==========================================
  Files         328      382      +54     
  Lines       55505    59106    +3601     
==========================================
+ Hits        25520    27780    +2260     
- Misses      27909    28727     +818     
- Partials     2076     2599     +523     
Flag Coverage Δ
banyand 49.41% <58.43%> (?)
bydbctl 81.91% <ø> (?)
fodc 80.17% <ø> (?)
integration-distributed 80.00% <ø> (?)
pkg 29.14% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@wu-sheng
Copy link
Member

wu-sheng commented Jan 7, 2026

We need documents about three service discovery ways.

  1. dns based(default)
  2. file based(optional)
  3. etcd based(legacy).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 11 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 203 to 221
s.cacheMutex.Lock()
if _, alreadyAdded := s.nodeCache[n.Address]; !alreadyAdded {
s.nodeCache[n.Address] = node

// notify handlers after releasing lock
s.notifyHandlers(schema.Metadata{
TypeMeta: schema.TypeMeta{
Kind: schema.KindNode,
Name: node.GetMetadata().GetName(),
},
Spec: node,
}, true)

s.log.Debug().
Str("address", n.Address).
Str("name", node.GetMetadata().GetName()).
Msg("New node discovered and added to cache")
}
s.cacheMutex.Unlock()
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The notifyHandlers call at line 208-214 is made while holding the cacheMutex lock. If a handler performs slow operations or calls back into the service, this could cause deadlocks or performance issues. The handlers should be notified after releasing the lock, similar to how deletion notifications are handled at lines 250-258.

Copilot uses AI. Check for mistakes.
Comment on lines 144 to 145
fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-fetch-interval", 20*time.Second,
"Fetch file interval for nodes in file discovery mode")
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag name uses "fetch-interval" but the documentation calls it "retry-interval". The flag should be renamed to "node-discovery-file-retry-interval" to match the documentation at line 311, or the documentation should be updated to use "fetch-interval".

Suggested change
fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-fetch-interval", 20*time.Second,
"Fetch file interval for nodes in file discovery mode")
fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-retry-interval", 20*time.Second,
"Retry interval for reading node configuration file in file discovery mode")

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be changed to this one: "Interval to poll the discovery file and retry failed nodes in file discovery mode".

Comment on lines 333 to 376
func TestConcurrentAccess(t *testing.T) {
listener, grpcServer, nodeServer := startMockGRPCServer(t)
defer grpcServer.Stop()
defer listener.Close()
addr := listener.Addr().String()
nodeServer.setNode(newTestNode("concurrent-node", addr))

configFile := createTempConfigFile(t, fmt.Sprintf(`
nodes:
- name: concurrent-node
grpc_address: %s
`, addr))
defer os.Remove(configFile)

svc, err := NewService(Config{
FilePath: configFile,
GRPCTimeout: testGRPCTimeout,
FetchInterval: testFetchInterval,
})
require.NoError(t, err)
defer svc.Close()

ctx, cancel := context.WithCancel(context.Background())
defer cancel()
require.NoError(t, svc.Start(ctx))

var wg sync.WaitGroup
for i := 0; i < 10; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for j := 0; j < 100; j++ {
if _, errList := svc.ListNode(ctx, databasev1.Role_ROLE_UNSPECIFIED); errList != nil {
t.Errorf("list node: %v", errList)
}
if _, errGet := svc.GetNode(ctx, "concurrent-node"); errGet != nil {
t.Errorf("get node: %v", errGet)
}
}
}()
}

wg.Wait()
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses small intervals (200ms) for fetchInterval, but doesn't verify that the file is actually reloaded at that interval. Consider adding a test that verifies the periodic reload behavior by modifying the file after initial load and checking that changes are detected within the expected interval.

Copilot uses AI. Check for mistakes.
Comment on lines 83 to 148
func TestStartWithInvalidConfig(t *testing.T) {
ctx := context.Background()

t.Run("invalid yaml", func(t *testing.T) {
configFile := createTempConfigFile(t, `
nodes:
- name: node1
grpc_address: [invalid
`)
defer os.Remove(configFile)

svc, err := NewService(Config{
FilePath: configFile,
GRPCTimeout: testGRPCTimeout,
FetchInterval: testFetchInterval,
})
require.NoError(t, err)
defer svc.Close()

err = svc.Start(ctx)
require.Error(t, err)
assert.Contains(t, err.Error(), "failed to parse YAML")
})

t.Run("missing address", func(t *testing.T) {
configFile := createTempConfigFile(t, `
nodes:
- name: node1
`)
defer os.Remove(configFile)

svc, err := NewService(Config{
FilePath: configFile,
GRPCTimeout: testGRPCTimeout,
FetchInterval: testFetchInterval,
})
require.NoError(t, err)
defer svc.Close()

err = svc.Start(ctx)
require.Error(t, err)
assert.Contains(t, err.Error(), "missing required field: grpc_address")
})

t.Run("tls enabled without ca cert", func(t *testing.T) {
configFile := createTempConfigFile(t, `
nodes:
- name: node1
grpc_address: 127.0.0.1:17912
tls_enabled: true
`)
defer os.Remove(configFile)

svc, err := NewService(Config{
FilePath: configFile,
GRPCTimeout: testGRPCTimeout,
FetchInterval: testFetchInterval,
})
require.NoError(t, err)
defer svc.Close()

err = svc.Start(ctx)
require.Error(t, err)
assert.Contains(t, err.Error(), "missing ca_cert_path")
})
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation at line 294 states "Nodes that fail to connect are skipped and will be attempted again on the next periodic file reload". However, there's no test coverage for this retry behavior. Consider adding a test that starts with an unreachable node, then makes it reachable, and verifies it gets added to the cache after the retry interval.

Copilot uses AI. Check for mistakes.
Comment on lines 186 to 221
for _, n := range newNodes {
s.cacheMutex.RLock()
_, exists := s.nodeCache[n.Address]
s.cacheMutex.RUnlock()

if !exists {
// fetch node metadata from gRPC
node, fetchErr := s.fetchNodeMetadata(ctx, n)
if fetchErr != nil {
s.log.Warn().
Err(fetchErr).
Str("node", n.Name).
Str("address", n.Address).
Msg("Failed to fetch node metadata, will skip")
continue
}

s.cacheMutex.Lock()
if _, alreadyAdded := s.nodeCache[n.Address]; !alreadyAdded {
s.nodeCache[n.Address] = node

// notify handlers after releasing lock
s.notifyHandlers(schema.Metadata{
TypeMeta: schema.TypeMeta{
Kind: schema.KindNode,
Name: node.GetMetadata().GetName(),
},
Spec: node,
}, true)

s.log.Debug().
Str("address", n.Address).
Str("name", node.GetMetadata().GetName()).
Msg("New node discovered and added to cache")
}
s.cacheMutex.Unlock()
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a race condition between the RLock check at line 187-189 and the Lock at line 203. Two goroutines could both pass the existence check and both attempt to fetch metadata for the same node. Consider using a single Lock instead of RLock-then-Lock, or implement proper double-checked locking with a second existence check after acquiring the write lock.

Copilot uses AI. Check for mistakes.
Comment on lines +269 to +271
// skip node registration if DNS/file mode is enabled or node registration is disabled
if !s.toRegisterNode || s.nodeDiscoveryMode == NodeDiscoveryModeDNS ||
s.nodeDiscoveryMode == NodeDiscoveryModeFile {
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "skip node registration if DNS/file mode is enabled" but the actual condition only skips if the mode is DNS or File AND toRegisterNode is true. The comment should be more precise: "skip node registration if DNS/file mode is enabled (these modes don't support manual registration) or node registration is disabled".

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong comment, for now, the judgment is correct. If there is no need to register a node, or if it is DNS discovery, or if it is file discovery, the register would be ignored.

mrproliu and others added 4 commits January 8, 2026 13:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 11 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return errors.New("manual node update not supported in file discovery mode")
}

func (s *Service) periodicFetch(ctx context.Context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use "github.com/fsnotify/fsnotify" to check the change of this file? We have examples to implement authN and TLS file reloading.

return resp.GetNode(), nil
}

func (s *Service) updateNodeCache(ctx context.Context, newNodes []NodeConfig) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems to be the same as the one used in DNS discovery. Could you extract the same logic to reuse in those cases?

@hanahmily hanahmily merged commit 8a880f5 into apache:main Jan 12, 2026
40 of 41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Remove Etcd Dependency via DNS-Based Node Discovery

4 participants