Support file base node discovery #928
Conversation
Pull request overview
This PR adds file-based node discovery as a third discovery mechanism to BanyanDB, alongside the existing etcd-based and DNS-based discovery modes. This enables static node configuration through YAML files, suitable for development, testing, and simple deployment scenarios.
Key Changes:
- Implements a new file-based node discovery service that reads node configurations from a YAML file and periodically reloads them (a sketch of the file format follows this list)
- Adds comprehensive documentation for file-based discovery configuration and usage
- Refactors DNS discovery into a dedicated `discovery/dns` package (previously under `metadata/dns`)
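For orientation, the test fixtures in this PR imply a discovery file with a top-level `nodes` list whose entries carry `name`, `grpc_address`, and optional TLS fields. The sketch below is inferred from those fixtures; the struct and YAML tag names are assumptions, not a copy of the actual `discovery/file` code.

```go
package file

// NodeConfig describes one statically configured node. The fields and YAML
// tags are inferred from the test fixtures in file_test.go
// (name, grpc_address, tls_enabled, ca_cert_path) and may not match the
// real implementation exactly.
type NodeConfig struct {
	Name       string `yaml:"name"`
	Address    string `yaml:"grpc_address"`
	TLSEnabled bool   `yaml:"tls_enabled"`
	CACertPath string `yaml:"ca_cert_path"`
}

// fileConfig is the assumed top-level document: a single "nodes" list.
type fileConfig struct {
	Nodes []NodeConfig `yaml:"nodes"`
}
```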
Reviewed changes
Copilot reviewed 6 out of 11 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| docs/operation/node-discovery.md | Adds documentation for the new file-based discovery mode, including configuration examples, error handling, and usage guidance |
| banyand/metadata/discovery/file/file.go | Core implementation of file-based node discovery with YAML config parsing, periodic reloading, and gRPC metadata fetching |
| banyand/metadata/discovery/file/metrics.go | Defines metrics for monitoring file discovery operations (load count, failures, duration, node count) |
| banyand/metadata/discovery/file/file_test.go | Comprehensive test coverage for file discovery, including configuration validation, node caching, handler notifications, and concurrent access |
| banyand/metadata/discovery/dns/dns.go | DNS discovery implementation moved to the new package structure, with TLS support and certificate reloading |
| banyand/metadata/discovery/dns/metrics.go | Metrics definitions for DNS discovery operations |
| banyand/metadata/discovery/dns/export_test.go | Test exports for DNS discovery internal methods |
| banyand/metadata/discovery/dns/dns_test.go | Updates import path to reflect the new package structure |
| banyand/metadata/discovery/dns/testdata/* | TLS test certificates for DNS discovery testing |
| banyand/metadata/client.go | Integrates file discovery mode with the metadata client, adds configuration flags and lifecycle management |
docs/operation/node-discovery.md (Outdated)
```md
The service monitors the configuration file for changes and automatically updates the node registry when the file is modified.

### How it Works

1. Read node configurations from a YAML file on startup
2. Attempt to connect to each node via gRPC to fetch full metadata
3. Successfully connected nodes are added to the cache
4. Failed nodes are tracked separately and retried periodically
5. Watch the file for changes and reload automatically
```
Copilot AI, Jan 7, 2026
The documentation describes retry behavior for failed nodes (lines 296, 312, 364), but the actual implementation uses FetchInterval for periodic file reloading in the periodicFetch function, not specifically for retrying failed nodes. Failed nodes are simply skipped and will be retried on the next periodic file load. Consider clarifying that this interval controls how often the file is reloaded and all nodes are re-attempted, rather than implying a specific retry-queue mechanism.
Suggested change:

```md
The service periodically reloads the configuration file and automatically updates the node registry when changes are detected.

### How it Works

1. Read node configurations from a YAML file on startup
2. Attempt to connect to each node via gRPC to fetch full metadata
3. Successfully connected nodes are added to the cache
4. Nodes that fail to connect are skipped and will be attempted again on the next periodic file reload
5. Reload the file at a configured interval (FetchInterval), reprocessing all nodes (including previously failed ones)
```
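To make the clarified behavior concrete, here is a minimal, self-contained sketch of a fetch-interval loop; the function and parameter names are illustrative and not taken from the PR.

```go
package file

import (
	"context"
	"log"
	"time"
)

// reloadLoop sketches the behavior described above: every interval the file
// is re-read and all nodes, including ones that previously failed, are
// re-attempted. The reload callback stands in for the service's own
// load-and-update logic.
func reloadLoop(ctx context.Context, interval time.Duration, reload func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := reload(ctx); err != nil {
				log.Printf("periodic file reload failed: %v", err)
			}
		}
	}
}
```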
banyand/metadata/client.go (Outdated)
```go
if s.dnsDiscovery != nil {
	return s.dnsDiscovery
}
// If file discovery is enabled, use it instead of etcd
if s.fileDiscovery != nil {
	return s.fileDiscovery
}
// Otherwise use etcd schema registry
return s.schemaRegistry
```
Copilot AI, Jan 7, 2026
The order of checking discovery services in NodeRegistry() is inconsistent with the order in RegisterHandler(). In RegisterHandler, file discovery is checked before DNS discovery (lines 373-379), but in NodeRegistry, DNS discovery is checked before file discovery (lines 414-422). This inconsistency could lead to confusion. Consider using a consistent priority order across all methods.
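One way to enforce a single priority order, sketched below, is to route both NodeRegistry() and RegisterHandler() through one selector. The receiver, return type, and field names are placeholders based on the quoted diff, not the actual types in banyand/metadata/client.go.

```go
// activeNodeSource picks the discovery backend in one fixed order
// (file, then DNS, then the etcd-backed schema registry), so every caller
// agrees on the precedence. nodeSource is a placeholder for whatever
// interface NodeRegistry() actually returns.
func (s *clientService) activeNodeSource() nodeSource {
	switch {
	case s.fileDiscovery != nil:
		return s.fileDiscovery
	case s.dnsDiscovery != nil:
		return s.dnsDiscovery
	default:
		return s.schemaRegistry
	}
}
```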
docs/operation/node-discovery.md (Outdated)
```
--node-discovery-grpc-timeout=5s             # Timeout for metadata fetch (default: 5s)

# Retry settings
--node-discovery-file-retry-interval=20s     # Retry interval for failed nodes (default: 20s)
```
Copilot AI, Jan 7, 2026
The flag is documented here as a retry interval for failed nodes, but the implementation treats it as the periodic fetch/reload interval for the whole file. The documentation should clarify that this setting controls how often the file is re-read and all nodes are re-attempted, not just how failed nodes are retried.
Suggested change:

```
--node-discovery-file-retry-interval=20s     # Interval to poll the discovery file and retry failed nodes (default: 20s)
```
```go
assert.Equal(t, nodeName, nodes[0].GetMetadata().GetName())
assert.Equal(t, serverAddr, nodes[0].GetGrpcAddress())

nodeFromCache, getErr := svc.GetNode(ctx, serverAddr)
```
Copilot AI, Jan 7, 2026
The test passes "addr" (an address like "127.0.0.1:12345") to GetNode but based on the interface contract established by the etcd and DNS implementations, GetNode should accept a node name, not an address. This test is passing because the file implementation incorrectly uses address as the cache key, but this is actually testing incorrect behavior. The test should be updated to call GetNode with the node name once the implementation is fixed to properly look up by name.
Suggested change:

```go
nodeFromCache, getErr := svc.GetNode(ctx, nodeName)
```
```go
// file mode doesn't support role filtering as roles are not stored in config
// return all nodes when ROLE_UNSPECIFIED, otherwise return empty list
if role == databasev1.Role_ROLE_UNSPECIFIED {
```
Copilot AI, Jan 7, 2026
The comment states that file mode doesn't support role filtering and returns an empty list for non-UNSPECIFIED roles. However, nodes fetched via gRPC (fetchNodeMetadata) include role information in their metadata. Consider fetching and storing role information from the nodes, then implementing proper role filtering like DNS discovery does, rather than silently returning empty results which could be confusing for users expecting role-based filtering to work.
Suggested change:

```go
// return all nodes when ROLE_UNSPECIFIED, otherwise filter by node role
if role == databasev1.Role_ROLE_UNSPECIFIED || node.GetRole() == role {
```
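If role information is kept from the fetched metadata, the filtering could look roughly like the fragment below. Whether `databasev1.Node` exposes a single `GetRole()` or a repeated roles accessor needs checking against the proto; this sketch follows the suggested diff and assumes the former.

```go
// ListNode body sketch: filter the cached nodes by role instead of
// returning an empty list for non-UNSPECIFIED roles.
s.cacheMutex.RLock()
defer s.cacheMutex.RUnlock()
nodes := make([]*databasev1.Node, 0, len(s.nodeCache))
for _, node := range s.nodeCache {
	if role == databasev1.Role_ROLE_UNSPECIFIED || node.GetRole() == role {
		nodes = append(nodes, node)
	}
}
return nodes, nil
```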
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
+ Coverage   45.97%   47.00%   +1.02%
==========================================
  Files         328      382      +54
  Lines       55505    59106    +3601
==========================================
+ Hits        25520    27780    +2260
- Misses      27909    28727     +818
- Partials     2076     2599     +523
```

View full report in Codecov by Sentry.
We need documentation covering all three service discovery methods.
Pull request overview
Copilot reviewed 6 out of 11 changed files in this pull request and generated 9 comments.
```go
s.cacheMutex.Lock()
if _, alreadyAdded := s.nodeCache[n.Address]; !alreadyAdded {
	s.nodeCache[n.Address] = node

	// notify handlers after releasing lock
	s.notifyHandlers(schema.Metadata{
		TypeMeta: schema.TypeMeta{
			Kind: schema.KindNode,
			Name: node.GetMetadata().GetName(),
		},
		Spec: node,
	}, true)

	s.log.Debug().
		Str("address", n.Address).
		Str("name", node.GetMetadata().GetName()).
		Msg("New node discovered and added to cache")
}
s.cacheMutex.Unlock()
```
Copilot AI, Jan 7, 2026
The notifyHandlers call at line 208-214 is made while holding the cacheMutex lock. If a handler performs slow operations or calls back into the service, this could cause deadlocks or performance issues. The handlers should be notified after releasing the lock, similar to how deletion notifications are handled at lines 250-258.
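A possible restructuring, using only the names visible in the quoted diff: record the addition while holding the lock, release it, then notify. This is a sketch, not the final code.

```go
s.cacheMutex.Lock()
var added []*databasev1.Node
if _, alreadyAdded := s.nodeCache[n.Address]; !alreadyAdded {
	s.nodeCache[n.Address] = node
	added = append(added, node)
}
s.cacheMutex.Unlock()

// Handlers are invoked without holding cacheMutex, so a slow handler
// (or one that calls back into the service) cannot block or deadlock it.
for _, node := range added {
	s.notifyHandlers(schema.Metadata{
		TypeMeta: schema.TypeMeta{Kind: schema.KindNode, Name: node.GetMetadata().GetName()},
		Spec:     node,
	}, true)
}
```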
banyand/metadata/client.go (Outdated)
```go
fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-fetch-interval", 20*time.Second,
	"Fetch file interval for nodes in file discovery mode")
```
Copilot AI, Jan 7, 2026
The flag name uses "fetch-interval" but the documentation calls it "retry-interval". The flag should be renamed to "node-discovery-file-retry-interval" to match the documentation at line 311, or the documentation should be updated to use "fetch-interval".
Suggested change:

```go
fs.DurationVar(&s.fileFetchInterval, "node-discovery-file-retry-interval", 20*time.Second,
	"Retry interval for reading node configuration file in file discovery mode")
```
The flag description should instead read: "Interval to poll the discovery file and retry failed nodes in file discovery mode".
```go
func TestConcurrentAccess(t *testing.T) {
	listener, grpcServer, nodeServer := startMockGRPCServer(t)
	defer grpcServer.Stop()
	defer listener.Close()
	addr := listener.Addr().String()
	nodeServer.setNode(newTestNode("concurrent-node", addr))

	configFile := createTempConfigFile(t, fmt.Sprintf(`
nodes:
  - name: concurrent-node
    grpc_address: %s
`, addr))
	defer os.Remove(configFile)

	svc, err := NewService(Config{
		FilePath:      configFile,
		GRPCTimeout:   testGRPCTimeout,
		FetchInterval: testFetchInterval,
	})
	require.NoError(t, err)
	defer svc.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	require.NoError(t, svc.Start(ctx))

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				if _, errList := svc.ListNode(ctx, databasev1.Role_ROLE_UNSPECIFIED); errList != nil {
					t.Errorf("list node: %v", errList)
				}
				if _, errGet := svc.GetNode(ctx, "concurrent-node"); errGet != nil {
					t.Errorf("get node: %v", errGet)
				}
			}
		}()
	}

	wg.Wait()
}
```
Copilot AI, Jan 7, 2026
The test uses small intervals (200ms) for fetchInterval, but doesn't verify that the file is actually reloaded at that interval. Consider adding a test that verifies the periodic reload behavior by modifying the file after initial load and checking that changes are detected within the expected interval.
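A sketch of such a test, meant to live in file_test.go and reusing the helpers already visible in this PR (startMockGRPCServer, createTempConfigFile, newTestNode, testGRPCTimeout, testFetchInterval); require.Eventually bounds the wait for the reload to happen. The exact assertions are assumptions about the service's behavior, not copied from the PR.

```go
func TestPeriodicReloadAddsNewNode(t *testing.T) {
	// Two mock servers: node-1 is in the initial config, node-2 is added
	// to the file only after the service has started.
	l1, s1, n1 := startMockGRPCServer(t)
	defer s1.Stop()
	defer l1.Close()
	addr1 := l1.Addr().String()
	n1.setNode(newTestNode("node-1", addr1))

	l2, s2, n2 := startMockGRPCServer(t)
	defer s2.Stop()
	defer l2.Close()
	addr2 := l2.Addr().String()
	n2.setNode(newTestNode("node-2", addr2))

	configFile := createTempConfigFile(t, fmt.Sprintf("nodes:\n  - name: node-1\n    grpc_address: %s\n", addr1))
	defer os.Remove(configFile)

	svc, err := NewService(Config{
		FilePath:      configFile,
		GRPCTimeout:   testGRPCTimeout,
		FetchInterval: testFetchInterval,
	})
	require.NoError(t, err)
	defer svc.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	require.NoError(t, svc.Start(ctx))

	// Rewrite the file with node-2 added; the periodic reload should pick
	// it up within a few fetch intervals.
	newContent := fmt.Sprintf("nodes:\n  - name: node-1\n    grpc_address: %s\n  - name: node-2\n    grpc_address: %s\n", addr1, addr2)
	require.NoError(t, os.WriteFile(configFile, []byte(newContent), 0o600))

	require.Eventually(t, func() bool {
		nodes, listErr := svc.ListNode(ctx, databasev1.Role_ROLE_UNSPECIFIED)
		return listErr == nil && len(nodes) == 2
	}, 10*testFetchInterval, testFetchInterval/4)
}
```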
```go
func TestStartWithInvalidConfig(t *testing.T) {
	ctx := context.Background()

	t.Run("invalid yaml", func(t *testing.T) {
		configFile := createTempConfigFile(t, `
nodes:
  - name: node1
    grpc_address: [invalid
`)
		defer os.Remove(configFile)

		svc, err := NewService(Config{
			FilePath:      configFile,
			GRPCTimeout:   testGRPCTimeout,
			FetchInterval: testFetchInterval,
		})
		require.NoError(t, err)
		defer svc.Close()

		err = svc.Start(ctx)
		require.Error(t, err)
		assert.Contains(t, err.Error(), "failed to parse YAML")
	})

	t.Run("missing address", func(t *testing.T) {
		configFile := createTempConfigFile(t, `
nodes:
  - name: node1
`)
		defer os.Remove(configFile)

		svc, err := NewService(Config{
			FilePath:      configFile,
			GRPCTimeout:   testGRPCTimeout,
			FetchInterval: testFetchInterval,
		})
		require.NoError(t, err)
		defer svc.Close()

		err = svc.Start(ctx)
		require.Error(t, err)
		assert.Contains(t, err.Error(), "missing required field: grpc_address")
	})

	t.Run("tls enabled without ca cert", func(t *testing.T) {
		configFile := createTempConfigFile(t, `
nodes:
  - name: node1
    grpc_address: 127.0.0.1:17912
    tls_enabled: true
`)
		defer os.Remove(configFile)

		svc, err := NewService(Config{
			FilePath:      configFile,
			GRPCTimeout:   testGRPCTimeout,
			FetchInterval: testFetchInterval,
		})
		require.NoError(t, err)
		defer svc.Close()

		err = svc.Start(ctx)
		require.Error(t, err)
		assert.Contains(t, err.Error(), "missing ca_cert_path")
	})
}
```
Copilot AI, Jan 7, 2026
The documentation at line 294 states "Nodes that fail to connect are skipped and will be attempted again on the next periodic file reload". However, there's no test coverage for this retry behavior. Consider adding a test that starts with an unreachable node, then makes it reachable, and verifies it gets added to the cache after the retry interval.
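A possible shape for that test. It assumes two things not present in the quoted code: that Start succeeds even when every configured node is unreachable (as the documentation implies), and a helper, called startMockGRPCServerOn here, that can start the mock server on a pre-chosen address. Both are assumptions, not existing code.

```go
func TestUnreachableNodeIsRetried(t *testing.T) {
	// Hypothetical fixed address: nothing listens here when the service starts.
	addr := "127.0.0.1:17999"

	configFile := createTempConfigFile(t, fmt.Sprintf("nodes:\n  - name: late-node\n    grpc_address: %s\n", addr))
	defer os.Remove(configFile)

	svc, err := NewService(Config{
		FilePath:      configFile,
		GRPCTimeout:   testGRPCTimeout,
		FetchInterval: testFetchInterval,
	})
	require.NoError(t, err)
	defer svc.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	require.NoError(t, svc.Start(ctx))

	// The node cannot be reached yet, so the cache should start empty.
	nodes, listErr := svc.ListNode(ctx, databasev1.Role_ROLE_UNSPECIFIED)
	require.NoError(t, listErr)
	require.Empty(t, nodes)

	// Bring the node up on the configured address; the next periodic file
	// reload should fetch its metadata and add it to the cache.
	grpcServer, nodeServer := startMockGRPCServerOn(t, addr) // hypothetical helper
	defer grpcServer.Stop()
	nodeServer.setNode(newTestNode("late-node", addr))

	require.Eventually(t, func() bool {
		got, getErr := svc.ListNode(ctx, databasev1.Role_ROLE_UNSPECIFIED)
		return getErr == nil && len(got) == 1
	}, 10*testFetchInterval, testFetchInterval/4)
}
```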
```go
for _, n := range newNodes {
	s.cacheMutex.RLock()
	_, exists := s.nodeCache[n.Address]
	s.cacheMutex.RUnlock()

	if !exists {
		// fetch node metadata from gRPC
		node, fetchErr := s.fetchNodeMetadata(ctx, n)
		if fetchErr != nil {
			s.log.Warn().
				Err(fetchErr).
				Str("node", n.Name).
				Str("address", n.Address).
				Msg("Failed to fetch node metadata, will skip")
			continue
		}

		s.cacheMutex.Lock()
		if _, alreadyAdded := s.nodeCache[n.Address]; !alreadyAdded {
			s.nodeCache[n.Address] = node

			// notify handlers after releasing lock
			s.notifyHandlers(schema.Metadata{
				TypeMeta: schema.TypeMeta{
					Kind: schema.KindNode,
					Name: node.GetMetadata().GetName(),
				},
				Spec: node,
			}, true)

			s.log.Debug().
				Str("address", n.Address).
				Str("name", node.GetMetadata().GetName()).
				Msg("New node discovered and added to cache")
		}
		s.cacheMutex.Unlock()
```
Copilot AI, Jan 7, 2026
There's a race condition between the RLock check at line 187-189 and the Lock at line 203. Two goroutines could both pass the existence check and both attempt to fetch metadata for the same node. Consider using a single Lock instead of RLock-then-Lock, or implement proper double-checked locking with a second existence check after acquiring the write lock.
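One way to narrow that window, again using only the names from the quoted diff: decide what is missing under a single read lock, fetch outside any lock, then re-check under the write lock before inserting, and notify only when this goroutine actually performed the insert. A sketch:

```go
s.cacheMutex.RLock()
missing := make([]NodeConfig, 0, len(newNodes))
for _, n := range newNodes {
	if _, ok := s.nodeCache[n.Address]; !ok {
		missing = append(missing, n)
	}
}
s.cacheMutex.RUnlock()

for _, n := range missing {
	node, fetchErr := s.fetchNodeMetadata(ctx, n)
	if fetchErr != nil {
		continue // retried on the next periodic reload
	}
	s.cacheMutex.Lock()
	_, alreadyAdded := s.nodeCache[n.Address]
	if !alreadyAdded {
		s.nodeCache[n.Address] = node
	}
	s.cacheMutex.Unlock()
	if !alreadyAdded {
		s.notifyHandlers(schema.Metadata{
			TypeMeta: schema.TypeMeta{Kind: schema.KindNode, Name: node.GetMetadata().GetName()},
			Spec:     node,
		}, true)
	}
}
```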
```go
// skip node registration if DNS/file mode is enabled or node registration is disabled
if !s.toRegisterNode || s.nodeDiscoveryMode == NodeDiscoveryModeDNS ||
	s.nodeDiscoveryMode == NodeDiscoveryModeFile {
```
Copilot AI, Jan 7, 2026
The comment says "skip node registration if DNS/file mode is enabled" but the actual condition only skips if the mode is DNS or File AND toRegisterNode is true. The comment should be more precise: "skip node registration if DNS/file mode is enabled (these modes don't support manual registration) or node registration is disabled".
This review comment is mistaken; the current judgment is correct: registration is skipped if there is no need to register a node, or if DNS discovery or file discovery is enabled.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 6 out of 11 changed files in this pull request and generated 2 comments.
```go
	return errors.New("manual node update not supported in file discovery mode")
}

func (s *Service) periodicFetch(ctx context.Context) {
```
Could you use "github.com/fsnotify/fsnotify" to watch for changes to this file? We already have examples that implement authN and TLS file reloading.
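For reference, a minimal fsnotify-based watcher could look like the sketch below; the reload and logErr callbacks are placeholders, and the existing authN/TLS reloading code mentioned above would be the authoritative pattern to follow.

```go
package file

import (
	"context"

	"github.com/fsnotify/fsnotify"
)

// watchFile triggers reload() whenever the discovery file is written or
// recreated, instead of (or in addition to) polling on a timer.
func watchFile(ctx context.Context, path string, reload func() error, logErr func(error)) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()
	if err := watcher.Add(path); err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return nil
		case ev, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			// Editors and config reloaders often replace the file, so watch
			// for Create and Rename as well as Write.
			if ev.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Rename) != 0 {
				if err := reload(); err != nil {
					logErr(err)
				}
			}
		case werr, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			logErr(werr)
		}
	}
}
```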
```go
	return resp.GetNode(), nil
}

func (s *Service) updateNodeCache(ctx context.Context, newNodes []NodeConfig) {
```
This method seems to be essentially the same as the one used in DNS discovery. Could you extract the shared logic so both implementations can reuse it?
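A possible shape for the shared helper, assuming both DNS and file discovery cache *databasev1.Node values keyed by the discovered target and can express their per-node metadata fetch as a callback. This is only a sketch of the requested refactor; names are illustrative and package/import boilerplate is omitted.

```go
// nodeCacheUpdater captures the add-new-nodes logic shared by DNS and file
// discovery: fetch metadata for unknown targets, cache them, and notify
// handlers. A companion helper (not shown) would handle removals.
type nodeCacheUpdater struct {
	mu     sync.RWMutex
	cache  map[string]*databasev1.Node
	fetch  func(ctx context.Context, target string) (*databasev1.Node, error)
	notify func(node *databasev1.Node, added bool)
}

func (u *nodeCacheUpdater) addNew(ctx context.Context, targets []string) {
	for _, target := range targets {
		u.mu.RLock()
		_, exists := u.cache[target]
		u.mu.RUnlock()
		if exists {
			continue
		}
		node, err := u.fetch(ctx, target)
		if err != nil {
			continue // retried on the next discovery pass
		}
		u.mu.Lock()
		u.cache[target] = node
		u.mu.Unlock()
		u.notify(node, true)
	}
}
```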
Please also update the CHANGES log.