Support file base node discovery #928

mrproliu · 2026-01-07T07:35:05Z

If this pull request closes/resolves/fixes an existing issue, replace the issue number. Fixes [Feature] Remove Etcd Dependency via DNS-Based Node Discovery skywalking#13621.
Update the CHANGES log.

Copilot

Pull request overview

This PR adds file-based node discovery as a third discovery mechanism to BanyanDB, alongside the existing etcd-based and DNS-based discovery modes. This enables static node configuration through YAML files, suitable for development, testing, and simple deployment scenarios.

Key Changes:

Implements a new file-based node discovery service that reads node configurations from a YAML file and periodically reloads them
Adds comprehensive documentation for file-based discovery configuration and usage
Refactors DNS discovery into a dedicated discovery/dns package (previously under metadata/dns)

Reviewed changes

Copilot reviewed 6 out of 11 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
`docs/operation/node-discovery.md`	Adds documentation for the new file-based discovery mode, including configuration examples, error handling, and usage guidance
`banyand/metadata/discovery/file/file.go`	Core implementation of file-based node discovery with YAML config parsing, periodic reloading, and gRPC metadata fetching
`banyand/metadata/discovery/file/metrics.go`	Defines metrics for monitoring file discovery operations (load count, failures, duration, node count)
`banyand/metadata/discovery/file/file_test.go`	Comprehensive test coverage for file discovery including configuration validation, node caching, handler notifications, and concurrent access
`banyand/metadata/discovery/dns/dns.go`	DNS discovery implementation moved to new package structure with TLS support and certificate reloading
`banyand/metadata/discovery/dns/metrics.go`	Metrics definitions for DNS discovery operations
`banyand/metadata/discovery/dns/export_test.go`	Test exports for DNS discovery internal methods
`banyand/metadata/discovery/dns/dns_test.go`	Updates import path to reflect new package structure
`banyand/metadata/discovery/dns/testdata/*`	TLS test certificates for DNS discovery testing
`banyand/metadata/client.go`	Integrates file discovery mode with metadata client, adds configuration flags and lifecycle management

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-07T07:41:53Z

docs/operation/node-discovery.md

+The service monitors the configuration file for changes and automatically updates the node registry when the file is modified.
+
+### How it Works
+
+1. Read node configurations from a YAML file on startup
+2. Attempt to connect to each node via gRPC to fetch full metadata
+3. Successfully connected nodes are added to the cache
+4. Failed nodes are tracked separately and retried periodically
+5. Watch the file for changes and reload automatically


The documentation describes retry behavior for failed nodes (line 296, 312, 364), but the actual implementation uses FetchInterval for periodic file reloading in periodicFetch function, not specifically for retrying failed nodes. Failed nodes are simply skipped and will be retried on the next periodic file load. Consider clarifying that this interval controls how often the file is reloaded and all nodes are re-attempted, rather than implying a specific retry queue mechanism.

Suggested change

The service monitors the configuration file for changes and automatically updates the node registry when the file is modified.

### How it Works

1. Read node configurations from a YAML file on startup

2. Attempt to connect to each node via gRPC to fetch full metadata

3. Successfully connected nodes are added to the cache

4. Failed nodes are tracked separately and retried periodically

5. Watch the file for changes and reload automatically

The service periodically reloads the configuration file and automatically updates the node registry when changes are detected.

### How it Works

1. Read node configurations from a YAML file on startup

2. Attempt to connect to each node via gRPC to fetch full metadata

3. Successfully connected nodes are added to the cache

4. Nodes that fail to connect are skipped and will be attempted again on the next periodic file reload

5. Reload the file at a configured interval (FetchInterval), reprocessing all nodes (including previously failed ones)

Copilot · 2026-01-07T07:41:54Z

banyand/metadata/client.go

 	if s.dnsDiscovery != nil {
 		return s.dnsDiscovery
 	}
+	// If file discovery is enabled, use it instead of etcd
+	if s.fileDiscovery != nil {
+		return s.fileDiscovery
+	}
 	// Otherwise use etcd schema registry
 	return s.schemaRegistry


The order of checking discovery services in NodeRegistry() is inconsistent with the order in RegisterHandler(). In RegisterHandler, file discovery is checked before DNS discovery (lines 373-379), but in NodeRegistry, DNS discovery is checked before file discovery (lines 414-422). This inconsistency could lead to confusion. Consider using a consistent priority order across all methods.

Copilot · 2026-01-07T07:41:54Z

docs/operation/node-discovery.md

+--node-discovery-grpc-timeout=5s             # Timeout for metadata fetch (default: 5s)
+
+# Retry settings
+--node-discovery-file-retry-interval=20s     # Retry interval for failed nodes (default: 20s)


The default value shown in the comment (default: 20s) doesn't match the flag name which suggests it should be for retry interval, but the configuration flag on line 312 is actually for retry-interval while line 306 shows file-path. The documentation should clarify that this is the periodic fetch/reload interval for checking file changes, not just retry interval for failed nodes.

Suggested change

--node-discovery-file-retry-interval=20s # Retry interval for failed nodes (default: 20s)

--node-discovery-file-retry-interval=20s # Interval to poll the discovery file and retry failed nodes (default: 20s)

docs/operation/node-discovery.md

Copilot · 2026-01-07T07:41:55Z

banyand/metadata/discovery/file/file_test.go

+	assert.Equal(t, nodeName, nodes[0].GetMetadata().GetName())
+	assert.Equal(t, serverAddr, nodes[0].GetGrpcAddress())
+
+	nodeFromCache, getErr := svc.GetNode(ctx, serverAddr)


The test passes "addr" (an address like "127.0.0.1:12345") to GetNode but based on the interface contract established by the etcd and DNS implementations, GetNode should accept a node name, not an address. This test is passing because the file implementation incorrectly uses address as the cache key, but this is actually testing incorrect behavior. The test should be updated to call GetNode with the node name once the implementation is fixed to properly look up by name.

Suggested change

nodeFromCache, getErr := svc.GetNode(ctx, serverAddr)

nodeFromCache, getErr := svc.GetNode(ctx, nodeName)

docs/operation/node-discovery.md

banyand/metadata/discovery/file/file.go

Copilot · 2026-01-07T07:41:56Z

banyand/metadata/discovery/file/file.go

+		// file mode doesn't support role filtering as roles are not stored in config
+		// return all nodes when ROLE_UNSPECIFIED, otherwise return empty list
+		if role == databasev1.Role_ROLE_UNSPECIFIED {


The comment states that file mode doesn't support role filtering and returns an empty list for non-UNSPECIFIED roles. However, nodes fetched via gRPC (fetchNodeMetadata) include role information in their metadata. Consider fetching and storing role information from the nodes, then implementing proper role filtering like DNS discovery does, rather than silently returning empty results which could be confusing for users expecting role-based filtering to work.

Suggested change

// file mode doesn't support role filtering as roles are not stored in config

// return all nodes when ROLE_UNSPECIFIED, otherwise return empty list

if role == databasev1.Role_ROLE_UNSPECIFIED {

// return all nodes when ROLE_UNSPECIFIED, otherwise filter by node role

if role == databasev1.Role_ROLE_UNSPECIFIED || node.GetRole() == role {

banyand/metadata/discovery/file/file.go

codecov-commenter · 2026-01-07T08:15:28Z

Codecov Report

❌ Patch coverage is 58.43829% with 165 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.00%. Comparing base (3530dd9) to head (b8e92b9).
⚠️ Report is 101 commits behind head on main.

Files with missing lines	Patch %	Lines
banyand/metadata/discovery/file/file.go	66.96%	53 Missing and 20 partials ⚠️
banyand/metadata/client.go	26.53%	31 Missing and 5 partials ⚠️
banyand/metadata/discovery/dns/dns.go	73.95%	20 Missing and 5 partials ⚠️
banyand/metadata/discovery/file/metrics.go	0.00%	19 Missing ⚠️
banyand/metadata/discovery/dns/metrics.go	0.00%	12 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
+ Coverage   45.97%   47.00%   +1.02%     
==========================================
  Files         328      382      +54     
  Lines       55505    59106    +3601     
==========================================
+ Hits        25520    27780    +2260     
- Misses      27909    28727     +818     
- Partials     2076     2599     +523

Flag	Coverage Δ
banyand	`49.41% <58.43%> (?)`
bydbctl	`81.91% <ø> (?)`
fodc	`80.17% <ø> (?)`
integration-distributed	`80.00% <ø> (?)`
pkg	`29.14% <ø> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

wu-sheng · 2026-01-07T08:33:31Z

We need documents about three service discovery ways.

dns based(default)
file based(optional)
etcd based(legacy).

Copilot

Pull request overview

Copilot reviewed 6 out of 11 changed files in this pull request and generated 9 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-07T11:32:28Z

banyand/metadata/discovery/file/file.go

+			s.cacheMutex.Lock()
+			if _, alreadyAdded := s.nodeCache[n.Address]; !alreadyAdded {
+				s.nodeCache[n.Address] = node
+
+				// notify handlers after releasing lock
+				s.notifyHandlers(schema.Metadata{
+					TypeMeta: schema.TypeMeta{
+						Kind: schema.KindNode,
+						Name: node.GetMetadata().GetName(),
+					},
+					Spec: node,
+				}, true)
+
+				s.log.Debug().
+					Str("address", n.Address).
+					Str("name", node.GetMetadata().GetName()).
+					Msg("New node discovered and added to cache")
+			}
+			s.cacheMutex.Unlock()


The notifyHandlers call at line 208-214 is made while holding the cacheMutex lock. If a handler performs slow operations or calls back into the service, this could cause deadlocks or performance issues. The handlers should be notified after releasing the lock, similar to how deletion notifications are handled at lines 250-258.

banyand/metadata/discovery/file/file.go

banyand/metadata/client.go

Copilot · 2026-01-07T11:32:29Z

banyand/metadata/client.go

+	fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-fetch-interval", 20*time.Second,
+		"Fetch file interval for nodes in file discovery mode")


The flag name uses "fetch-interval" but the documentation calls it "retry-interval". The flag should be renamed to "node-discovery-file-retry-interval" to match the documentation at line 311, or the documentation should be updated to use "fetch-interval".

Suggested change

fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-fetch-interval", 20*time.Second,

"Fetch file interval for nodes in file discovery mode")

fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-retry-interval", 20*time.Second,

"Retry interval for reading node configuration file in file discovery mode")

Should be changed to this one: "Interval to poll the discovery file and retry failed nodes in file discovery mode".

Copilot · 2026-01-07T11:32:30Z

banyand/metadata/discovery/file/file_test.go

+func TestConcurrentAccess(t *testing.T) {
+	listener, grpcServer, nodeServer := startMockGRPCServer(t)
+	defer grpcServer.Stop()
+	defer listener.Close()
+	addr := listener.Addr().String()
+	nodeServer.setNode(newTestNode("concurrent-node", addr))
+
+	configFile := createTempConfigFile(t, fmt.Sprintf(`
+nodes:
+  - name: concurrent-node
+    grpc_address: %s
+`, addr))
+	defer os.Remove(configFile)
+
+	svc, err := NewService(Config{
+		FilePath:      configFile,
+		GRPCTimeout:   testGRPCTimeout,
+		FetchInterval: testFetchInterval,
+	})
+	require.NoError(t, err)
+	defer svc.Close()
+
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
+	require.NoError(t, svc.Start(ctx))
+
+	var wg sync.WaitGroup
+	for i := 0; i < 10; i++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			for j := 0; j < 100; j++ {
+				if _, errList := svc.ListNode(ctx, databasev1.Role_ROLE_UNSPECIFIED); errList != nil {
+					t.Errorf("list node: %v", errList)
+				}
+				if _, errGet := svc.GetNode(ctx, "concurrent-node"); errGet != nil {
+					t.Errorf("get node: %v", errGet)
+				}
+			}
+		}()
+	}
+
+	wg.Wait()
+}


The test uses small intervals (200ms) for fetchInterval, but doesn't verify that the file is actually reloaded at that interval. Consider adding a test that verifies the periodic reload behavior by modifying the file after initial load and checking that changes are detected within the expected interval.

Copilot · 2026-01-07T11:32:30Z

banyand/metadata/discovery/file/file_test.go

+func TestStartWithInvalidConfig(t *testing.T) {
+	ctx := context.Background()
+
+	t.Run("invalid yaml", func(t *testing.T) {
+		configFile := createTempConfigFile(t, `
+nodes:
+  - name: node1
+    grpc_address: [invalid
+`)
+		defer os.Remove(configFile)
+
+		svc, err := NewService(Config{
+			FilePath:      configFile,
+			GRPCTimeout:   testGRPCTimeout,
+			FetchInterval: testFetchInterval,
+		})
+		require.NoError(t, err)
+		defer svc.Close()
+
+		err = svc.Start(ctx)
+		require.Error(t, err)
+		assert.Contains(t, err.Error(), "failed to parse YAML")
+	})
+
+	t.Run("missing address", func(t *testing.T) {
+		configFile := createTempConfigFile(t, `
+nodes:
+  - name: node1
+`)
+		defer os.Remove(configFile)
+
+		svc, err := NewService(Config{
+			FilePath:      configFile,
+			GRPCTimeout:   testGRPCTimeout,
+			FetchInterval: testFetchInterval,
+		})
+		require.NoError(t, err)
+		defer svc.Close()
+
+		err = svc.Start(ctx)
+		require.Error(t, err)
+		assert.Contains(t, err.Error(), "missing required field: grpc_address")
+	})
+
+	t.Run("tls enabled without ca cert", func(t *testing.T) {
+		configFile := createTempConfigFile(t, `
+nodes:
+  - name: node1
+    grpc_address: 127.0.0.1:17912
+    tls_enabled: true
+`)
+		defer os.Remove(configFile)
+
+		svc, err := NewService(Config{
+			FilePath:      configFile,
+			GRPCTimeout:   testGRPCTimeout,
+			FetchInterval: testFetchInterval,
+		})
+		require.NoError(t, err)
+		defer svc.Close()
+
+		err = svc.Start(ctx)
+		require.Error(t, err)
+		assert.Contains(t, err.Error(), "missing ca_cert_path")
+	})
+}


The documentation at line 294 states "Nodes that fail to connect are skipped and will be attempted again on the next periodic file reload". However, there's no test coverage for this retry behavior. Consider adding a test that starts with an unreachable node, then makes it reachable, and verifies it gets added to the cache after the retry interval.

banyand/metadata/discovery/file/file.go

Copilot · 2026-01-07T11:32:31Z

banyand/metadata/discovery/file/file.go

+	for _, n := range newNodes {
+		s.cacheMutex.RLock()
+		_, exists := s.nodeCache[n.Address]
+		s.cacheMutex.RUnlock()
+
+		if !exists {
+			// fetch node metadata from gRPC
+			node, fetchErr := s.fetchNodeMetadata(ctx, n)
+			if fetchErr != nil {
+				s.log.Warn().
+					Err(fetchErr).
+					Str("node", n.Name).
+					Str("address", n.Address).
+					Msg("Failed to fetch node metadata, will skip")
+				continue
+			}
+
+			s.cacheMutex.Lock()
+			if _, alreadyAdded := s.nodeCache[n.Address]; !alreadyAdded {
+				s.nodeCache[n.Address] = node
+
+				// notify handlers after releasing lock
+				s.notifyHandlers(schema.Metadata{
+					TypeMeta: schema.TypeMeta{
+						Kind: schema.KindNode,
+						Name: node.GetMetadata().GetName(),
+					},
+					Spec: node,
+				}, true)
+
+				s.log.Debug().
+					Str("address", n.Address).
+					Str("name", node.GetMetadata().GetName()).
+					Msg("New node discovered and added to cache")
+			}
+			s.cacheMutex.Unlock()


There's a race condition between the RLock check at line 187-189 and the Lock at line 203. Two goroutines could both pass the existence check and both attempt to fetch metadata for the same node. Consider using a single Lock instead of RLock-then-Lock, or implement proper double-checked locking with a second existence check after acquiring the write lock.

Copilot · 2026-01-07T11:32:31Z

banyand/metadata/client.go

+	// skip node registration if DNS/file mode is enabled or node registration is disabled
+	if !s.toRegisterNode || s.nodeDiscoveryMode == NodeDiscoveryModeDNS ||
+		s.nodeDiscoveryMode == NodeDiscoveryModeFile {


The comment says "skip node registration if DNS/file mode is enabled" but the actual condition only skips if the mode is DNS or File AND toRegisterNode is true. The comment should be more precise: "skip node registration if DNS/file mode is enabled (these modes don't support manual registration) or node registration is disabled".

Wrong comment, for now, the judgment is correct. If there is no need to register a node, or if it is DNS discovery, or if it is file discovery, the register would be ignored.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 11 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

banyand/metadata/discovery/file/file.go

hanahmily · 2026-01-09T04:33:41Z

banyand/metadata/discovery/file/file.go

+	return errors.New("manual node update not supported in file discovery mode")
+}
+
+func (s *Service) periodicFetch(ctx context.Context) {


Could you use "github.com/fsnotify/fsnotify" to check the change of this file? We have examples to implement authN and TLS file reloading.

hanahmily · 2026-01-09T04:38:14Z

banyand/metadata/discovery/file/file.go

+	return resp.GetNode(), nil
+}
+
+func (s *Service) updateNodeCache(ctx context.Context, newNodes []NodeConfig) {


This method seems to be the same as the one used in DNS discovery. Could you extract the same logic to reuse in those cases?

Support file base node discovery

1e7fa4a

mrproliu added this to the 0.10.0 milestone Jan 7, 2026

mrproliu requested review from Copilot and hanahmily January 7, 2026 07:35

mrproliu added the enhancement New feature or request label Jan 7, 2026

Copilot started reviewing on behalf of mrproliu January 7, 2026 07:35 View session

Merge branch 'main' into node-discovery-file

9f9397f

Copilot AI reviewed Jan 7, 2026

View reviewed changes

fix issue

5d31afd

fix issue

f086261

hanahmily requested a review from Copilot January 7, 2026 11:26

Copilot started reviewing on behalf of hanahmily January 7, 2026 11:27 View session

Copilot AI reviewed Jan 7, 2026

View reviewed changes

mrproliu and others added 4 commits January 8, 2026 13:56

Merge branch 'main' into node-discovery-file

ba56d0f

fix test

3fe3550

update doc

8845442

Apply suggestions from code review

967b0ea

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

wu-sheng requested a review from Copilot January 8, 2026 09:01

Copilot started reviewing on behalf of wu-sheng January 8, 2026 09:01 View session

Copilot AI reviewed Jan 8, 2026

View reviewed changes

banyand/metadata/discovery/file/file.go Outdated Show resolved Hide resolved

banyand/metadata/discovery/file/file.go Outdated Show resolved Hide resolved

Merge branch 'main' into node-discovery-file

b568d16

hanahmily requested changes Jan 9, 2026

View reviewed changes

mrproliu and others added 6 commits January 9, 2026 09:48

add back-off for the node metadata fetch

afe56c5

update the handler

4817136

fix lint

3b89d83

Merge branch 'main' into node-discovery-file

218e3e3

add common package for smilier code

a4fea10

Merge branch 'main' into node-discovery-file

32050db

mrproliu added 3 commits January 12, 2026 00:20

fix issue

02e5e9e

ajust common package

e4d8815

update docs

b8e92b9

hanahmily approved these changes Jan 12, 2026

View reviewed changes

hanahmily merged commit 8a880f5 into apache:main Jan 12, 2026
40 of 41 checks passed

	--node-discovery-file-retry-interval=20s # Retry interval for failed nodes (default: 20s)
	--node-discovery-file-retry-interval=20s # Interval to poll the discovery file and retry failed nodes (default: 20s)

	nodeFromCache, getErr := svc.GetNode(ctx, serverAddr)
	nodeFromCache, getErr := svc.GetNode(ctx, nodeName)

		fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-fetch-interval", 20*time.Second,
		"Fetch file interval for nodes in file discovery mode")

Support file base node discovery #928

Support file base node discovery #928

Uh oh!

Conversation

mrproliu commented Jan 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

wu-sheng commented Jan 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

mrproliu Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

mrproliu Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

hanahmily Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

hanahmily Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jan 7, 2026 •

edited

Loading