
Leon/dev#1

Merged
zluudg merged 21 commits into main from leon/dev
Feb 11, 2026

Conversation


@zluudg zluudg commented Feb 10, 2026

Summary by CodeRabbit

  • New Features

    • Switched to NATS-driven observation processing and added LibTapir-based message generation.
    • API now reports "nats_in" counts (replacing previous "pings").
  • Documentation

    • Sample configuration converted to TOML and includes NATS and LibTapir sections.
  • Chores

    • Added CI workflows (format, test, container build), Ko build config, Go toolchain bump, module updates, and Dependabot config.
  • Tests

    • New unit tests added for NATS-related logic.


coderabbitai bot commented Feb 10, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

Adds NATS-driven observation processing and a LibTapir message generator, refactors the app from TCP to NATS, introduces new NATS/LibTapir packages and common types, updates build/config (Go, Ko), adds CI workflows, and converts README sample config from JSON to TOML.

Changes

Cohort / File(s) Summary
CI/CD & Dependabot
.github/workflows/base-pipeline.yaml, .github/workflows/container.yaml, .github/dependabot.yml
Adds base CI (format + test), a ko-based container build workflow triggered after CI or on v* tags, and Dependabot gomod config.
Build config & modules
.ko.yaml, go.mod
Adds Ko defaults (platforms, ldflags) and bumps Go to 1.25.6; expands module dependencies (nats, dnstapir/tapir, transitive deps).
Documentation
README.md
Replaces example JSON with TOML and adds [nats] and [libtapir] sample config sections.
CLI / startup wiring
cmd/observation-encoder/main.go
Initializes NATS and LibTapir subsystems, creates per-component loggers, extends Conf with subsystem handles, and reorganizes startup/run sequence.
Core app (NATS-driven)
internal/app/app.go, internal/app/app_test.go
Replaces TCP listener with NATS-based workflow: job messages from NATS KV watch, subject parsing, observation aggregation, libtapir JSON generation, southbound publish, and renamed metrics/state (ping→nats_in). Test adjusted to non-fatal log.
API & cert config
internal/api/api.go, internal/cert/cert.go
Adds Debug to Conf; renames API method/route and metric from ping/GetPingCount → nats_in/GetNatsInCount and updates JSON key.
Common types & errors
internal/common/nats.go, internal/common/observations.go, internal/common/error.go
Adds NATS constants, NatsMsg type, observation constants/OBS_MAP, and new errors (ErrBadFlag, ErrBadKey).
NATS client package
internal/nats/nats.go, internal/nats/nats_test.go
New nats package with Conf and Create(): NATS/JetStream/KV init, WatchObservations, RemovePrefix, GetObservations (aggregate flags), SendSouthboundObservation, Shutdown; tests for key parsing and helpers.
LibTapir wrapper
internal/libtapir/libtapir.go
New libtapir package: Conf, Create(), and GenerateObservationMsg(domain, flags) producing JSON Tapir messages.

Sequence Diagram

sequenceDiagram
    participant KV as NATS KV Store
    participant App as Observation Encoder
    participant LibTapir as LibTapir
    participant SB as Southbound Subject

    KV->>App: WatchObservations() stream NatsMsg
    Note over App: receive NatsMsg (Headers, Subject, Data)
    App->>App: RemovePrefix & parse domain
    App->>KV: GetObservations(domain)
    KV-->>App: aggregated flags (uint32)
    App->>LibTapir: GenerateObservationMsg(domain, flags)
    LibTapir-->>App: JSON observation
    App->>KV: SendSouthboundObservation(json)
    KV->>SB: deliver published message

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Leon/dev #1 — Appears to contain the same NATS/LibTapir integration, CI/workflow additions, and file set changes as this PR.

Poem

🐇 I hopped through keys and subjects bright,

NATS sang softly in the night,
LibTapir wrapped each JSON byte,
Workflows woke to build and flight,
Carrots coded, everything just right.

🚥 Pre-merge checks: 1 passed, 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: the title 'Leon/dev' is vague and does not convey meaningful information about the changeset. Resolution: use a descriptive title that summarizes the main change, such as 'Add NATS and LibTapir integration with GitHub Actions CI/CD workflows'.

✅ Passed checks (1 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cmd/observation-encoder/main.go (2)

89-91: ⚠️ Potential issue | 🔴 Critical

Config decode error is silently discarded — app will run with zero-value config.

confDecoder.Decode(&mainConf) returns an error that is never checked. If the TOML file is malformed or has type mismatches, the application proceeds with default zero-values, leading to confusing downstream failures.

Proposed fix
 	confDecoder.DisallowUnknownFields()
-	confDecoder.Decode(&mainConf)
-	file.Close() // TODO okay to close here while also using defer above?
+	if err := confDecoder.Decode(&mainConf); err != nil {
+		log.Error("Error decoding config file '%s': %s", configFile, err)
+		os.Exit(-1)
+	}
+	file.Close()

285-289: ⚠️ Potential issue | 🟠 Major

Closing exitCh while goroutines may still write to it — potential panic.

After the timeout, close(exitCh) is called on line 287, but appHandle.Run, apiHandle.Run, and certHandle.Run goroutines may still be alive and attempting to send on exitCh, causing a panic on send to a closed channel.

Either remove the close(exitCh) (let the channel be GC'd) or ensure all goroutines have returned before closing.

internal/api/api.go (1)

111-112: ⚠️ Potential issue | 🟡 Minor

Context cancel function discarded — leaks timer resources.

context.WithTimeout returns a cancel func that should be deferred to release the timer immediately after the shutdown completes, rather than waiting for the timeout to expire.

Proposed fix
-	shutdownCtx, _ := context.WithTimeout(context.Background(), time.Second*2)
+	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), time.Second*2)
+	defer shutdownCancel()
🤖 Fix all issues with AI agents
In `@.github/workflows/container.yaml`:
- Around line 18-19: The job-level if on the "container" job currently only
checks github.event.workflow_run.conclusion and therefore skips tag push
triggers; update the job's if expression on the container job to allow both
successful workflow_run triggers and push-tag triggers by combining checks for
github.event_name == 'workflow_run' && github.event.workflow_run.conclusion ==
'success' OR github.event_name == 'push' && startsWith(github.ref,
'refs/tags/'); modify the existing if that references
github.event.workflow_run.conclusion so it covers both cases.
- Around line 31-32: The workflow currently runs "ko build --bare
./cmd/observation-encoder" without registry config or auth so images are only
built locally; add a registry login step (e.g., use docker/login-action@v3)
before the ko steps, set the KO_DOCKER_REPO environment variable to your GHCR
registry (for example ghcr.io/${{ github.repository }}) in the job or step env,
and invoke ko build with the push option (or add --push) to ensure the image is
pushed to GHCR rather than discarded; update the existing uses:
ko-build/setup-ko@v0.7 and the ko build invocation accordingly.

In `@go.mod`:
- Line 3: go.mod declares "go 1.25.6" but .github/workflows/container.yaml sets
GO_VERSION to "1.25.5", causing a mismatch; update the workflow to use
GO_VERSION: "1.25.6" (or alternatively change the go directive in go.mod to
1.25.5) so the container build toolchain meets the module's minimum version;
locate the GO_VERSION variable in .github/workflows/container.yaml and make the
value match the go directive in go.mod.

In `@internal/app/app.go`:
- Around line 108-121: The main loop is spinning when natsInCh is closed because
the select's receive doesn't check for closure; change the natsInCh receive in
the MAIN_APP_LOOP to use the two-value receive (msg, ok := <-natsInCh) and if
ok==false handle graceful shutdown of that input (e.g., log and break
MAIN_APP_LOOP or stop sending jobs) so you don't push zero-value job structs
onto jobChan; update references to job creation (job{msg: msg}) and the
a.pm.natsInCount.Add increment to only occur when ok is true.
- Around line 138-151: The domain extraction in handleJob relies on RemovePrefix
leaving a leading delimiter and drops two elements via
domainSplit[:len(domainSplit)-2], which will break if RemovePrefix is changed;
update handleJob to explicitly remove only the observation-type component rather
than assuming an empty label: after calling a.natsHandle.RemovePrefix and
splitting into domainSplit, filter out any empty strings or explicitly pop the
single trailing observation-type element (use domainSplit[:len(domainSplit)-1]
after verifying non-empty), or locate the observation-type by index/name and
remove it, then join the remaining labels into domain; ensure this logic
references RemovePrefix, domainSplit, slices.Reverse and c_NATS_DELIM so it
stays correct if RemovePrefix is fixed.
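The extraction described above can be sketched as a standalone helper. The key layout ("com.example.globally_new", i.e. reversed labels followed by one observation-type segment) is an assumption for illustration, as is the function name.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

const natsDelim = "." // stand-in for c_NATS_DELIM

// extractDomain filters out empty labels (tolerating a leading
// delimiter left behind by RemovePrefix), pops exactly one trailing
// observation-type segment, then reverses and rejoins the labels.
func extractDomain(key string) (string, bool) {
	var labels []string
	for _, s := range strings.Split(key, natsDelim) {
		if s != "" { // drop artifacts of a leading/doubled delimiter
			labels = append(labels, s)
		}
	}
	if len(labels) < 2 {
		return "", false // need at least one label plus the obs type
	}
	labels = labels[:len(labels)-1] // remove only the observation type
	slices.Reverse(labels)
	return strings.Join(labels, natsDelim), true
}

func main() {
	d, ok := extractDomain(".com.example.globally_new")
	fmt.Println(d, ok)
}
```

Because empty labels are filtered explicitly, the same result is produced whether or not RemovePrefix leaves a leading delimiter.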

In `@internal/common/observations.go`:
- Around line 6-9: OBS_MAP currently hardcodes numeric values for the
"globally_new" and "looptest" entries which can drift if the corresponding
constants change; update OBS_MAP to use the declared observation constants
instead of the literals (replace 1 and 1024 with the existing named constants
for the "globally_new" and "looptest" observation values), importing or
referencing those constant names in this file so the map stays in sync with the
canonical definitions.

In `@internal/nats/nats.go`:
- Around line 196-199: The Shutdown method on natsClient currently does nothing
and leaks connections; update natsClient.Shutdown to check if nc.conn is non-nil
and then gracefully close the connection (prefer nc.conn.Drain() if you want to
flush pending messages, falling back to nc.conn.Close()), return any error from
Drain/Close, and set nc.conn = nil (and optionally nc.opts/state) to avoid
double-close; ensure you reference the natsClient type and its nc.conn field and
propagate the error from Drain/Close instead of always returning nil.
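The Drain-then-nil pattern can be sketched without a NATS server by hiding the connection behind a small interface whose methods mirror nats.Conn's Drain/Close. The interface and fake are for illustration only; the real code would operate on *nats.Conn directly.

```go
package main

import (
	"errors"
	"fmt"
)

// conn abstracts the subset of nats.Conn that Shutdown needs.
type conn interface {
	Drain() error
	Close()
}

type natsClient struct {
	conn conn
}

// Shutdown drains the connection (flushing pending messages), falls
// back to Close on a drain error, and nils out the field so a second
// Shutdown is a harmless no-op rather than a double-close.
func (nc *natsClient) Shutdown() error {
	if nc.conn == nil {
		return nil // already shut down
	}
	err := nc.conn.Drain()
	if err != nil {
		nc.conn.Close() // immediate close as a fallback
	}
	nc.conn = nil
	return err
}

type fakeConn struct {
	drains, closes int
	failDrain      bool
}

func (f *fakeConn) Drain() error {
	f.drains++
	if f.failDrain {
		return errors.New("drain failed")
	}
	return nil
}
func (f *fakeConn) Close() { f.closes++ }

func main() {
	f := &fakeConn{}
	nc := &natsClient{conn: f}
	err1 := nc.Shutdown()
	err2 := nc.Shutdown() // second call must be a no-op
	fmt.Println(err1 == nil, err2 == nil, f.drains, f.closes)
}
```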
- Around line 82-89: RemovePrefix currently leaves a leading delimiter when
nc.subjectPrefix is removed (e.g., "obs.foo" -> ".foo"), which causes downstream
splitting in handleJob to produce an empty first element; update
natsClient.RemovePrefix to, after calling strings.CutPrefix(subject,
nc.subjectPrefix), also strip a leading delimiter (e.g., '.' or the configured
delimiter) from subjectCut before returning it, preserving the existing warning
log when the prefix is missing so downstream code in handleJob gets a clean
subject string for domain extraction.
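The suggested behavior can be sketched as a pure string helper; the prefix and delimiter values below are illustrative, and the real method would keep its warning log on the miss path.

```go
package main

import (
	"fmt"
	"strings"
)

// removeSubjectPrefix cuts the configured prefix and then also strips
// the delimiter it leaves behind, so "obs.example.com" with prefix
// "obs" yields "example.com" rather than ".example.com". When the
// prefix is absent the subject is returned unchanged (the real code
// logs a warning here).
func removeSubjectPrefix(subject, prefix, delim string) string {
	cut, found := strings.CutPrefix(subject, prefix)
	if !found {
		return subject
	}
	return strings.TrimPrefix(cut, delim)
}

func main() {
	fmt.Println(removeSubjectPrefix("obs.example.com", "obs", "."))
}
```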
- Around line 139-143: Protect against an index panic by checking the length of
kSplit before accessing kSplit[1] (in the section that assigns flag) and skip or
handle entries that don't contain the delimiter; then include the flag variable
in the warning log call (the nc.log.Warning call that currently uses
"Unrecognized flag '%s', ignoring..." should be passed the flag value) so the
format string has its argument. Specifically, update the logic around kSplit,
flag, and the lookup into common.OBS_MAP to first verify len(kSplit) > 1 (and
continue or set a safe default if not), then perform the map lookup using flag,
and call nc.log.Warning with both the format string and flag.
- Around line 163-169: Replace the use of LimitMarkerTTL with TTL in the
jetstream.KeyValueConfig passed to js.CreateKeyValue so the actual KV entries
expire after nc.ttl; update the KeyValueConfig for the CreateKeyValue call
(referenced in js.CreateKeyValue and jetstream.KeyValueConfig, using nc.bucket
and nc.ttl) to set TTL to nc.ttl and remove LimitMarkerTTL unless you
intentionally want to control tombstone retention separately.
🧹 Nitpick comments (11)
internal/common/observations.go (1)

3-4: Non-idiomatic Go naming: OBS_GLOBALLY_NEW, OBS_LOOPTEST, OBS_MAP.

Go convention uses MixedCaps for exported identifiers (e.g., ObsGloballyNew, ObsLooptest, ObsMap). Screaming snake case is unconventional in Go.

internal/cert/cert.go (1)

20-20: Debug field is declared but never used within this package.

The Debug field is added to Conf on line 20 but is never extracted in Create() or used by certHandle. While this field is consistently present across multiple subsystem configs (libtapir, nats, api, app), it's not wired into the cert subsystem's logging. If the intent is to control debug-level logging for certificates, the field should be passed to certHandle and used to configure logging behavior; otherwise, it's dead configuration specific to this package.

internal/libtapir/libtapir.go (3)

12-15: Debug field in Conf is unused.

conf.Debug is declared and deserialized from TOML, but never read by Create or stored on the libtapir struct. Either remove it or wire it into the struct for conditional debug logging.


32-59: Hardcoded metadata and dual time.Now() calls.

Two concerns:

  1. time.Now() is called separately at lines 35 and 49, producing slightly different timestamps for TimeAdded vs TimeStamp. Capture a single now and reuse it for consistency.
  2. TTL, SrcName, Creator, MsgType, and ListType are all hardcoded. Consider making these configurable via Conf or at least defining them as named constants so they're discoverable and easy to change.
Proposed fix for consistent timestamps
 func (lt *libtapir) GenerateObservationMsg(domainStr string, flags uint32) (string, error) {
+	now := time.Now()
 	domain := tapir.Domain{
 		Name:         domainStr,
-		TimeAdded:    time.Now(),
+		TimeAdded:    now,
 		TTL:          3600,
 		TagMask:      tapir.TagMask(flags),
 		ExtendedTags: []string{},
 	}
 
 	tapirMsg := tapir.TapirMsg{
 		SrcName:   "dns-tapir",
 		Creator:   "tapir-analyse-new-qname",
 		MsgType:   "observation",
 		ListType:  "doubtlist",
 		Added:     []tapir.Domain{domain},
 		Removed:   []tapir.Domain{},
 		Msg:       "",
-		TimeStamp: time.Now(),
+		TimeStamp: now,
 		TimeStr:   "",
 	}

21-30: Exported function returns unexported type *libtapir.

Create returns *libtapir, which is unexported. While this works because callers use the libtapir interface defined in internal/app/app.go, it means callers cannot declare variables of the concrete type. This is an acceptable pattern within internal packages, but be aware it prevents direct type assertions outside this package.

internal/nats/nats.go (2)

151-180: NATS connection has no options — no timeouts, no reconnect policy, no auth.

nats.Connect(nc.url) uses bare defaults. Consider adding connect timeout, reconnect settings, and error/disconnect handlers for production resilience. Also, context.Background() on line 163 means KV creation has no timeout or cancellation — consider threading a context through from Create.


100-116: Unbuffered channel may block the KV watcher goroutine.

outCh is unbuffered (line 100). If downstream consumers (the MAIN_APP_LOOP in app.go) are busy or slow to drain, the goroutine will block on outCh <- natsMsg, stalling further KV updates. Consider adding a buffer to match or exceed c_N_HANDLERS.

cmd/observation-encoder/main.go (1)

100-106: Logger creation failures don't halt execution — leads to confusing errors downstream.

On lines 104-106, if logger.Create fails, the error is only logged but execution continues. natslog will be nil, causing nats.Create to fail with a generic "nil logger" error rather than the real root cause. The same pattern repeats for libtapir (lines 124-126), app (lines 144-146), cert (lines 166-168), and API (lines 186-188) loggers.

Consider failing fast on logger creation errors, consistent with how subsystem creation errors are handled.

Example fix for the nats logger block
 	natslog, err := logger.Create(
 		logger.Conf{
 			Debug: debugFlag || mainConf.Nats.Debug,
 		})
 	if err != nil {
 		log.Error("Error creating nats log: %s", err)
+		os.Exit(-1)
 	}
internal/app/app_test.go (1)

7-9: Placeholder test provides no coverage.

TestAppBasic does nothing beyond logging. With the new NATS and libtapir interfaces, you could inject mock implementations to test Create validation, handleJob domain extraction, and the observation aggregation flow without a real NATS server.

Would you like me to open an issue to track adding unit tests with mock NATS/libtapir implementations?

internal/api/api.go (1)

120-127: Manual JSON construction is fragile.

fmt.Sprintf("{\"nats_in\": %d}", n) works for an int64 but doesn't scale and bypasses proper encoding. Consider using json.NewEncoder(rw).Encode(...) or json.Marshal with a small struct/map for consistency and safety.
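A minimal sketch of the json.Marshal variant; the struct and helper names are illustrative, with the tag matching the "nats_in" key the handler emits.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// natsInResponse carries the counter under the same JSON key the
// handler currently hand-builds with fmt.Sprintf.
type natsInResponse struct {
	NatsIn int64 `json:"nats_in"`
}

// encodeNatsIn lets encoding/json handle quoting and escaping; in the
// real handler one would write the bytes (or use json.NewEncoder(rw))
// instead of returning a string.
func encodeNatsIn(n int64) (string, error) {
	b, err := json.Marshal(natsInResponse{NatsIn: n})
	return string(b), err
}

func main() {
	s, _ := encodeNatsIn(42)
	fmt.Println(s) // {"nats_in":42}
}
```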

internal/app/app.go (1)

123-126: Pending jobs are discarded without processing on shutdown.

The drain loop on lines 123-125 reads and discards buffered jobs. Meanwhile, close(jobChan) signals workers to exit. Any in-flight observations are silently lost. If at-least-once processing matters, consider draining by letting workers finish rather than discarding.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/app/app.go (1)

88-98: ⚠️ Potential issue | 🟡 Minor

NATS watch failure sends exit but doesn't close the channel or clean up.

If WatchObservations fails, Run returns immediately after signaling on exitCh, but jobChan is never closed. The handler goroutines spawned at Lines 100–105 will leak, blocked on range jobChan forever.

🛡️ Proposed fix — close jobChan before early return
 	natsInCh, err := a.natsHandle.WatchObservations(ctx)
 	if err != nil {
 		a.log.Error("Error connecting to NATS: %s", err)
+		close(jobChan)
 		a.exitCh <- common.Exit{ID: a.id, Err: err}
 		return
 	}

Note: move the goroutine spawning after the watch call to avoid this entirely.

🤖 Fix all issues with AI agents
In `@internal/nats/nats.go`:
- Around line 155-185: initNats opens a connection with nats.Connect but doesn't
close it on subsequent failures (jetstream.New or js.CreateKeyValue), leaking
resources; modify initNats to ensure the conn is closed on any error after
nats.Connect succeeds (either by deferring conn.Close immediately after
successful nats.Connect and clearing the defer when storing nc.conn, or by
explicitly calling conn.Close() in each error branch after nats.Connect),
referencing the symbols nats.Connect, conn, jetstream.New, js.CreateKeyValue,
nc.conn and ensure successful path preserves the open connection by assigning
nc.conn before cancelling the close.
- Line 140: The log call uses the wrong receiver variable name: replace the
undefined `a.log` with the `natsClient` receiver `nc.log` in the method where
the warning is emitted (the natsClient method that currently contains the line
"Badly formatted key '%s'. Skipping..."); update the call to use
`nc.log.Warning(...)` so it references the correct receiver and resolves the
undefined identifier.
🧹 Nitpick comments (2)
internal/nats/nats.go (1)

100-100: Unbuffered channel may block the KV watcher goroutine.

outCh is unbuffered, so if the consumer (main loop → job channel → handlers) can't keep up, the watcher goroutine blocks and NATS KV updates stall. Consider adding a small buffer to decouple the watcher from downstream processing.

♻️ Suggested change
-	outCh := make(chan common.NatsMsg)
+	outCh := make(chan common.NatsMsg, 64)
internal/app/app.go (1)

127-130: Drain loop discards pending jobs on shutdown.

The for len(jobChan) > 0 loop races with the still-running handler goroutines and silently drops buffered jobs before closing the channel. If this is intentional (fast shutdown, discard in-flight work), a brief comment would clarify intent. If pending jobs should be processed, close the channel first and let the handlers drain via their range loop.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@internal/nats/nats.go`:
- Around line 64-66: The current check only rejects zero TTL but allows negative
values via conf.Ttl; change the validation around conf.Ttl to reject any
non-positive value (<= 0) and return a clearer error that includes the provided
TTL (use conf.Ttl in the message) so callers see the invalid value; update the
error returned from the block that currently returns errors.New("zero ttl") to
something like a formatted error stating the TTL is invalid/non-positive, and
ensure any downstream conversion to time.Duration remains safe.
- Around line 136-150: The loop that extracts flags from subscription keys
assumes the flag is always at kSplit[1], which breaks when the subject prefix
contains dots; update the extraction in the for loop over ls.Keys() to compute
the flag index relative to the prefix length instead of using a hardcoded 1:
split the key by c_NATS_DELIM, determine the number of segments in the
configured subjectPrefix (or count prefix segments from ls/lookup), then use
that offset to read the flag segment (before looking it up in common.OBS_MAP),
preserving current validation and accumulation into obs; adjust logging messages
in nc.log.Warning accordingly when key is malformed or flag is unrecognized.
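The prefix-relative lookup can be sketched as follows. The assumed key layout (prefix segments, then one domain segment, then the flag) and the map contents are illustrative; with an empty prefix the computed index reduces to the original hardcoded kSplit[1].

```go
package main

import (
	"fmt"
	"strings"
)

var obsMap = map[string]uint32{"globally_new": 1, "looptest": 1024}

// flagFromKey derives the flag's segment index from the number of
// segments in the configured prefix instead of hardcoding 1, so a
// prefix containing dots (e.g. "obs.new") still resolves correctly.
func flagFromKey(key, subjectPrefix, delim string) (uint32, bool) {
	prefixSegs := 0
	if subjectPrefix != "" {
		prefixSegs = len(strings.Split(subjectPrefix, delim))
	}
	kSplit := strings.Split(key, delim)
	idx := prefixSegs + 1 // prefix segments, then domain, then flag
	if idx >= len(kSplit) {
		return 0, false // malformed key for this prefix
	}
	bit, ok := obsMap[kSplit[idx]]
	return bit, ok
}

func main() {
	bit, ok := flagFromKey("obs.new.example.looptest", "obs.new", ".")
	fmt.Println(bit, ok)
}
```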
🧹 Nitpick comments (3)
internal/nats/nats.go (3)

20-40: Unused fields: Conf.Debug and natsClient.queue.

Debug in Conf is declared and tagged but never read anywhere in this file. Similarly, queue in natsClient is never assigned or referenced. Consider removing them to avoid confusion, or wire them up if they're intended for future use.

#!/bin/bash
# Verify whether Debug or queue are used elsewhere in the codebase
echo "=== Searching for 'Debug' usage ==="
rg -n --type=go '\bDebug\b' -g '!vendor/**'
echo ""
echo "=== Searching for '.queue' usage ==="
rg -n --type=go '\.queue\b' -g '!vendor/**'

91-121: Watch goroutine lacks ctx.Done() select and the channel is unbuffered.

If w.Updates() blocks indefinitely (e.g., on a NATS reconnect or edge case where the context cancellation doesn't propagate to the watcher channel), the goroutine will leak. Adding a select on ctx.Done() inside the loop would make this more robust. The unbuffered outCh also means a slow consumer stalls the watch loop — a small buffer could help decouple them.

Proposed sketch
 	go func() {
+		defer close(outCh)
 		nc.log.Info("Starting NATS listener loop")
-		for val := range w.Updates() {
-			if val == nil {
-				continue
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case val, ok := <-w.Updates():
+				if !ok {
+					return
+				}
+				if val == nil {
+					continue
+				}
+				nc.log.Debug("Incoming NATS KV update on '%s'!", val.Key())
+				natsMsg := common.NatsMsg{
+					Headers: nil,
+					Data:    val.Value(),
+					Subject: val.Key(),
+				}
+				outCh <- natsMsg
 			}
-			nc.log.Debug("Incoming NATS KV update on '%s'!", val.Key())
-			natsMsg := common.NatsMsg{
-				Headers: nil,
-				Data:    val.Value(),
-				Subject: val.Key(),
-			}
-			outCh <- natsMsg
 		}
-		close(outCh)
 	}()

187-199: Non-idiomatic else after early return.

In Go, when the if block ends with a return, the else is unnecessary. Removing it reduces nesting.

Proposed fix
 	err := nc.conn.Publish(nc.subjectSouthbound, outMsg)
 	if err != nil {
 		nc.log.Error("Couldn't publish %d bytes msg on %s", len(outMsg), nc.subjectSouthbound)
 		return err
-	} else {
-		nc.log.Debug("Successful publish on '%s'", nc.subjectSouthbound)
 	}
+	nc.log.Debug("Successful publish on '%s'", nc.subjectSouthbound)
 
 	return nil

@zluudg zluudg merged commit 82b27ef into main Feb 11, 2026
2 of 3 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Feb 26, 2026
