refactor(platform): migrate tracer to lib-commons v4 and harden observability #58
fredcamaral wants to merge 15 commits into develop from
Conversation
Move tracer to lib-auth v2.5.0-beta.7 and lib-commons v4.0.0 while refreshing module versions for compatibility with the new APIs.
Remove internal logging/otel adapters and switch handlers, services, and repositories to direct lib-commons logging, tracing, metrics, and postgres interfaces to reduce migration debt.
Align PROJECT_RULES import and usage snippets with the new lib-commons v4 module paths used by tracer.
Replace duplicated pairs/withTrace helpers with pkg/logging utilities to keep trace enrichment and structured field handling consistent across layers. Update coding rules examples to reflect logger.With(...).Log(...) usage with lib-commons v4 APIs.
Pass caller context through DB adapters/repositories so tenant resolution and cancellation are preserved, and add coverage for adapter constructors/context flow. Harden bootstrap and readiness behavior by adding migration/connect timeouts, conditional schema migration execution, and nil-DB readiness guards.
Replace high-cardinality struct span attributes with explicit safe fields, add metric-initialization failure logging, and avoid logging raw panic payloads. Strengthen middleware/migration/test utilities with new coverage for no-op metrics paths, route assertions, timing metrics behavior, and mock logger contract cleanup.
Consider updating CHANGELOG.md to document this change. If this change doesn't need a changelog entry, add the
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the
Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough

This PR migrates the codebase from lib-commons v2 → v4, bumps Go to 1.25.7, and upgrades many dependencies. Logging is moved to pkgLogging.Pairs + Log(ctx, level, ...), tracing switches to value-based spans and SetSpanAttributesFromValue, and Postgres connection handling now uses a Client with GetDB(ctx). Multiple internal signatures, tests, mocks, metrics initialization, and migration helpers were updated; public business APIs are largely unchanged.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTPServer as "HTTP Server"
    participant Handler
    participant Service
    participant Repository
    participant PostgresClient as "Postgres Client"
    participant Logger
    participant Telemetry

    Client->>HTTPServer: send HTTP request
    HTTPServer->>Handler: dispatch request (with ctx)
    Handler->>Telemetry: start span (span)
    Handler->>Logger: logger = pkgLogging.WithTrace(ctx, logger)
    Handler->>Service: call operation(ctx, span, logger, ...)
    Service->>Telemetry: SetSpanAttributesFromValue(span, ...)
    Service->>Repository: repoMethod(ctx, span, ...)
    Repository->>PostgresClient: GetDB(ctx)
    PostgresClient-->>Repository: DB connection / error
    alt DB error
        Repository->>Telemetry: HandleSpanError(span, ...)
        Repository->>Logger: Log(ctx, LevelError, "db error", ...)
        Repository-->>Service: return error
    else DB success
        Repository-->>Service: return result
    end
    Service->>Logger: Log(ctx, LevelInfo, "operation complete", ...)
    alt business error
        Service->>Telemetry: HandleSpanBusinessErrorEvent(span, ...)
    end
    Service-->>Handler: response
    Handler->>HTTPServer: send response
    HTTPServer-->>Client: HTTP response
```
This PR is very large (91 files, 5911 lines changed). Consider breaking it into smaller PRs for easier review.
🔒 Security Scan Results —
Actionable comments posted: 28
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
pkg/resilience/circuit_breaker.go (1)
125-139: ⚠️ Potential issue | 🟠 Major
The recovered panic still leaks the raw payload.

The warning log is sanitized, but Line 131 still returns fmt.Errorf("panic recovered in circuit breaker: %v", r). If r contains secrets or a large struct, that value can bubble into upstream logs/responses and undo the hardening in this PR.

🛡️ Suggested fix

```diff
 go func() {
     defer func() {
         if r := recover(); r != nil {
+            panicType := fmt.Sprintf("%T", r)
             // Use select with default to avoid blocking if channel is full
             // (shouldn't happen with buffer of 1, but defensive)
             select {
-            case fnDone <- result{nil, fmt.Errorf("panic recovered in circuit breaker: %v", r)}:
+            case fnDone <- result{nil, fmt.Errorf("panic recovered in circuit breaker (%s)", panicType)}:
             default:
-                panicType := fmt.Sprintf("%T", r)
                 c.logger.With(pkgLogging.Pairs(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/resilience/circuit_breaker.go` around lines 125 - 139, The recovered panic payload is being included verbatim in the error sent on fnDone (result{nil, fmt.Errorf("panic recovered in circuit breaker: %v", r)}), which can leak secrets; change that to a sanitized error that does not include the raw value — e.g. send result{nil, fmt.Errorf("panic recovered in circuit breaker: %T", r)} or a generic fmt.Errorf("panic recovered in circuit breaker") in the goroutine's defer, keeping the existing detailed panic type/logging only in the c.logger.With(...).Log call (referencing fnDone, result, the goroutine, c.logger and circuit_breaker.execute) so no raw panic payload is propagated out of the circuit breaker.

internal/services/command/update_rule.go (1)
55-64: ⚠️ Potential issue | 🔴 Critical
Reject nil update payloads before Line 81 dereferences them.

input.Expression will panic when callers pass a nil body. CreateLimitCommand.Execute already guards its pointer input; this path needs the same protection.

Proposed fix

```diff
 func (c *UpdateRuleCommand) Execute(ctx context.Context, id uuid.UUID, input *UpdateRuleInput) (*model.Rule, error) {
     logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)

     ctx, span := tracer.Start(ctx, "service.rule.update")
     defer span.End()

     logger = pkgLogging.WithTrace(ctx, logger)

+    if input == nil {
+        err := errors.New("nil UpdateRuleInput")
+        libOpentelemetry.HandleSpanBusinessErrorEvent(span, "Nil input provided", err)
+
+        return nil, err
+    }
+
     logger.With(pkgLogging.Pairs("operation", "service.rule.update", "rule.id", id.String())...).Log(ctx, libLog.LevelInfo, "Updating rule")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/services/command/update_rule.go` around lines 55 - 64, Reject nil UpdateRuleInput at the start of UpdateRuleCommand.Execute to avoid panics when dereferencing fields like input.Expression: add a nil-check for the input pointer in UpdateRuleCommand.Execute (similar to CreateLimitCommand.Execute) and return a clear error (e.g., validation/invalid-argument) if input == nil before any use of input.Expression or other fields so the method never dereferences a nil pointer.

internal/adapters/postgres/usage_counter_repository_test.go (1)
31-49: 🛠️ Refactor suggestion | 🟠 Major
Add ExpectationsWereMet() verification to the cleanup function.

Tests throughout this file set SQL expectations via mockSetup but never verify them. A test can still pass when the repository skips an expected SQL call, which defeats the purpose of the mocked expectations. The cleanup function runs on every test via defer cleanup(), making it the ideal place to assert all expectations were consumed.

Suggested fix

```diff
 cleanup := func() {
+    require.NoError(t, sqlMock.ExpectationsWereMet())
+
     if err := db.Close(); err != nil {
         t.Logf("failed to close mock db: %v", err)
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/adapters/postgres/usage_counter_repository_test.go` around lines 31 - 49, The cleanup function returned by setupUsageCounterRepositoryMockDB currently only closes the mock DB; update it to also call sqlMock.ExpectationsWereMet() and fail the test if it returns an error (e.g., using require.NoError(t, err) or t.Fatalf) so every test verifies all SQL expectations were consumed; modify the cleanup closure inside setupUsageCounterRepositoryMockDB to invoke sqlMock.ExpectationsWereMet() and handle any returned error before/after closing the DB.

internal/adapters/cel/adapter.go (1)
108-113: ⚠️ Potential issue | 🟠 Major
Use the injected adapter logger as the fallback.

The logger required by NewAdapter never participates here; Compile reads only the tracking-context logger. Calls made with a plain context will therefore lose adapter logs, which makes the constructor contract misleading and weakens observability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/adapters/cel/adapter.go` around lines 108 - 113, Compile currently overwrites the adapter's injected logger with the context-derived logger from libCommons.NewTrackingFromContext, losing the adapter's logger when callers pass a plain context; update Compile (adapter.cel.compile) to prefer the adapter's injected logger as a fallback: call pkgLogging.WithTrace(ctx, a.logger) (or if the context logger exists, use it but fallback to a.logger) when creating the traced logger instead of unconditionally using the logger returned by NewTrackingFromContext, so NewAdapter's provided logger is used whenever the context has no richer logger.

internal/adapters/postgres/rule_repository.go (1)
460-467: ⚠️ Potential issue | 🔴 Critical
Normalize filter.Limit before slicing.

Line 464 only fixes the SQL fetch size. filter.Limit itself stays negative, so Line 503 makes hasMore true and Line 505 slices rules[:filter.Limit], which panics for any negative input.

🛡️ Proposed fix

```diff
-    fetchLimit := filter.Limit + 1
-    if fetchLimit <= 0 {
-        // This should never happen due to upstream validation, but protect against
-        // potential bypass or refactoring. Use default limit + 1 as safe fallback.
-        fetchLimit = constant.DefaultPaginationLimit + 1
-    }
+    limit := filter.Limit
+    if limit <= 0 {
+        limit = constant.DefaultPaginationLimit
+    }
+
+    fetchLimit := limit + 1

     query = query.Limit(uint64(fetchLimit)) // `#nosec` G115 - fetchLimit validated positive above
@@
-    hasMore := len(rules) > filter.Limit
+    hasMore := len(rules) > limit
     if hasMore {
-        rules = rules[:filter.Limit]
+        rules = rules[:limit]
     }
```

Also applies to: 503-505
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/adapters/postgres/rule_repository.go` around lines 460 - 467, Normalize filter.Limit after computing fetchLimit so downstream logic doesn't use a negative limit: after calculating fetchLimit := filter.Limit + 1 (and applying the fallback when fetchLimit <= 0), set filter.Limit = int(fetchLimit - 1) (or the safe default constant.DefaultPaginationLimit if you used that fallback) so that subsequent hasMore calculation and the rules[:filter.Limit] slice use a non-negative, bounded value; update the code around variables fetchLimit, filter.Limit, hasMore and rules to use this normalized filter.Limit before slicing.
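The normalization described above can be sketched in isolation; the constant name and the page helper below are hypothetical stand-ins, but the logic shows why normalizing once keeps the fetch size and the hasMore/slice step in agreement:

```go
package main

import "fmt"

// Hypothetical default, standing in for constant.DefaultPaginationLimit.
const defaultPaginationLimit = 10

// normalizeLimit applies the fallback once so both the SQL fetch size and
// the later hasMore/slice logic agree on the same non-negative value.
func normalizeLimit(limit int) int {
	if limit <= 0 {
		return defaultPaginationLimit
	}
	return limit
}

// page trims an over-fetched result set (fetchLimit = limit + 1 rows) and
// reports whether more rows exist, without ever slicing by a negative index.
func page(rules []string, requestedLimit int) ([]string, bool) {
	limit := normalizeLimit(requestedLimit)

	hasMore := len(rules) > limit
	if hasMore {
		rules = rules[:limit]
	}
	return rules, hasMore
}

func main() {
	rows := []string{"a", "b", "c"}

	// A negative limit no longer panics: it falls back to the default.
	got, hasMore := page(rows, -1)
	fmt.Println(len(got), hasMore) // 3 false

	got, hasMore = page(rows, 2)
	fmt.Println(len(got), hasMore) // 2 true
}
```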
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/CODING_STANDARDS.md`:
- Line 242: Update the example log comment to use the actual field names emitted
by pkg/logging/trace.go (trace.id and span.id) instead of the old snake_case
names; locate the example call to logger.Log (the line showing "operation
started") and replace the comment references to trace_id and span_id with
trace.id and span.id, and make the same change at the other occurrence noted
(around the second example on the page).
In `@go.mod`:
- Around line 7-8: The go.mod currently pins github.com/LerianStudio/lib-auth/v2
to a beta release (v2.5.0-beta.7); update that requirement to the latest stable
release (v2.4.0) or, if the beta is deliberately required, add a short comment
in the repository (e.g., in README or a new docs/DEPENDENCIES.md) explaining why
the pre-release is necessary, the specific features or fixes it provides, and
the risk/rollback plan; locate the lib-auth dependency line in go.mod (the
github.com/LerianStudio/lib-auth/v2 entry) and either change the version string
to v2.4.0 or add the documentation and leave a clear inline comment near that
dependency referencing the docs.
In `@internal/adapters/http/in/middleware/apikey_test.go`:
- Around line 543-555: The test helper mockCounterBuilder.Add is currently
storing MetricCall.Labels as a reference to the builder’s mutable c.labels map,
allowing later WithLabels/CounterAdder reuse to mutate previously recorded
entries; fix by making a shallow copy of c.labels inside Add and assign that
copy to MetricCall.Labels before appending to c.recorder.Calls so each recorded
MetricCall captures its own snapshot of labels (refer to mockCounterBuilder.Add,
MetricCall.Labels, c.labels, c.recorder.Calls, and the CounterAdder/WithLabels
usage to locate the change).
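The snapshot-before-record fix can be sketched in miniature; the types below are simplified stand-ins for the test's mockCounterBuilder and MetricCall, showing why copying the labels map keeps earlier recorded calls immutable:

```go
package main

import "fmt"

// MetricCall records one counter increment and the labels in force at that time.
type MetricCall struct {
	Value  int64
	Labels map[string]string
}

// mockCounter is a simplified stand-in for the test's builder/recorder pair.
type mockCounter struct {
	labels map[string]string
	calls  []MetricCall
}

// Add snapshots the labels before recording, so later mutation of the
// builder's map cannot rewrite previously recorded entries.
func (c *mockCounter) Add(v int64) {
	snapshot := make(map[string]string, len(c.labels))
	for k, val := range c.labels {
		snapshot[k] = val
	}
	c.calls = append(c.calls, MetricCall{Value: v, Labels: snapshot})
}

func main() {
	c := &mockCounter{labels: map[string]string{"route": "/a"}}
	c.Add(1)

	c.labels["route"] = "/b" // builder reused with new labels
	c.Add(1)

	// Each call kept its own snapshot.
	fmt.Println(c.calls[0].Labels["route"], c.calls[1].Labels["route"]) // /a /b
}
```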
In `@internal/adapters/http/in/transaction_validation_handler.go`:
- Around line 165-169: The log currently passes input.Limit as a *int which can
yield inconsistent serialization; since SetDefaults() has populated it, change
the logger.With call that uses "operation" =
"handler.transaction-validation.list" to pass the dereferenced value for
"list.limit" (use *input.Limit) instead of input.Limit so the page size is
explicit and unambiguous when logging; keep the other fields (list.cursor,
list.sort_by, list.sort_order) unchanged.
In `@internal/adapters/http/in/validation_handler.go`:
- Around line 95-100: The current logging call includes err.Error()
(request-derived content) which may leak raw payload fragments; replace the
usage of err.Error() in the logger.With(pkgLogging.Pairs(...)) call with a
sanitized, low-cardinality value (e.g., a constant like
"request_body_parse_error" or the error type string determined via
reflection/errors.As) and keep the raw err only for internal/tracing through
libOpentelemetry.HandleSpanError(span, ... , err). Update the logger invocation
that currently references pkgLogging.Pairs("operation",
"handler.validations.validate", "error.message", err.Error()) so it logs a
sanitized message or error class instead, while leaving the response generation
call to h.parseErrorToUserMessage(err) unchanged.
- Around line 73-76: The tracing context created by tracer.Start in
handler.validations.validate (inside Validate()) isn't being written back to the
Fiber context, so downstream error handlers (handleValidationInputError which
reads c.UserContext()) lose trace/span IDs; fix by calling c.SetUserContext(ctx)
immediately after tracer.Start() (and before logger = pkgLogging.WithTrace(ctx,
logger)) so the derived context with the span is propagated to c.UserContext()
for handlers that log warnings.
In `@internal/adapters/postgres/rule_repository_test.go`:
- Line 39: Create a sentinel test context (e.g., via context.WithValue) and
replace the gomock.Any() matcher on mockConn.EXPECT().GetDB with an exact match
to that sentinel (use gomock.Eq(testCtx)) so the test asserts the repository
forwards the caller's ctx; update the expectations in setupMockDB and the direct
connection-error test (the second occurrence around line 129) to use this
sentinel context and keep the existing Return(db, nil) / Return(nil,
ErrNilContext) behavior so nil-context regressions are caught.
In `@internal/adapters/postgres/rule_repository.go`:
- Around line 58-60: NewRepository currently constructs a Repository even when
the incoming libPostgres.Client is nil; add an explicit nil-check at the start
of NewRepository and fail fast by returning nil (or otherwise returning an error
if you change the signature) when conn == nil so you don't wrap a nil client
with pgdb.NewPostgresConnectionAdapter; reference the NewRepository function,
the conn parameter, and pgdb.NewPostgresConnectionAdapter to locate where to add
the check and early return.
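A minimal sketch of the fail-fast constructor; the type names are simplified stand-ins for the repository's actual types, and the error-returning signature is one of the two options the comment offers:

```go
package main

import (
	"errors"
	"fmt"
)

// Client is a hypothetical stand-in for *libPostgres.Client.
type Client struct{}

type Repository struct{ conn *Client }

var ErrNilClient = errors.New("nil postgres client")

// NewRepository rejects a nil client at construction time instead of
// deferring the failure to the first query.
func NewRepository(conn *Client) (*Repository, error) {
	if conn == nil {
		return nil, ErrNilClient
	}
	return &Repository{conn: conn}, nil
}

func main() {
	if _, err := NewRepository(nil); err != nil {
		fmt.Println("constructor rejected nil client:", err)
	}
	if r, err := NewRepository(&Client{}); err == nil && r != nil {
		fmt.Println("constructor accepted valid client")
	}
}
```

Failing fast here surfaces wiring bugs at bootstrap, where they are cheap to diagnose, rather than as runtime ErrNilConnection errors mid-request.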
In `@internal/adapters/postgres/rule_sync_repository.go`:
- Around line 33-37: NewRuleSyncRepository currently allows a nil
libPostgres.Client to produce a nil connection adapter (via
pgdb.NewPostgresConnectionAdapter) which leads to runtime ErrNilConnection in
methods like GetAllActiveRules and GetRulesUpdatedSince; change the constructor
to perform fail-fast validation: check the incoming conn and/or the result of
pgdb.NewPostgresConnectionAdapter(conn) and return an error when either is nil
(i.e., change signature to NewRuleSyncRepository(conn *libPostgres.Client)
(*RuleSyncRepository, error) or explicitly panic if your codebase prefers), so
callers get an explicit error instead of later ErrNilConnection, and update call
sites to handle the returned error accordingly.
In `@internal/adapters/postgres/transaction_validation_repository.go`:
- Around line 309-312: The code path in applyCursorFilter builds a keyset
predicate using cursor.ID without validating it, so a tampered cursor.ID can
cause a UUID cast error downstream; update applyCursorFilter (and any caller
sites that rely on it) to explicitly validate cursor.ID (e.g., parse it as a
UUID) before constructing the SQL predicate and if parsing fails return
ErrInvalidCursor and call libOtel.HandleSpanBusinessErrorEvent(span, "Invalid
cursor", err) so the function (applyCursorFilter) fails early with
ErrInvalidCursor rather than letting QueryContext hit a UUID cast error.
In `@internal/adapters/postgres/usage_counter_repository.go`:
- Around line 46-49: In NewUsageCounterRepository, validate the incoming conn
before creating an adapter: if conn == nil return nil (or otherwise reject) so
you don't call pgdb.NewPostgresConnectionAdapter(nil) and construct an invalid
UsageCounterRepository; add the nil-check at the top of
NewUsageCounterRepository, referencing the function name and the call to
pgdb.NewPostgresConnectionAdapter(conn).
In `@internal/services/cache/warmup.go`:
- Around line 69-71: Remove the raw compiler error string from the structured
fields: update the call that currently builds logger.With(pkgLogging.Pairs(...,
"error.message", compileErr.Error())...) in the warmup logic so it no longer
includes "error.message" or compileErr.Error() as a field; keep only stable
identifiers like "operation" and "rule.id" (rule.ID.String()) and pass
compileErr as the error payload to the logger (e.g. as the logged
error/exception when calling Log) so the wrapped error carries details while
telemetry fields remain low-cardinality (affects logger.With, pkgLogging.Pairs,
compileErr, Log, libLog.LevelError).
In `@internal/services/command/activate_limit.go`:
- Around line 78-81: The call to libOpentelemetry.SetSpanAttributesFromValue
currently discards its error; capture the returned error and pass it to
HandleSpanError along with the span to preserve the repository's observability
pattern (i.e., replace the `_ =
libOpentelemetry.SetSpanAttributesFromValue(span, ...)` line with code that
assigns the error and calls HandleSpanError(span, err)). Ensure you keep the
same span and the same attribute map ("activate_input" with "limit_id" and
"operation") when invoking SetSpanAttributesFromValue.
In `@internal/services/command/create_limit.go`:
- Around line 103-108: The span currently records high-cardinality request
fields ("name" and "max_amount") via libOpentelemetry.SetSpanAttributesFromValue
for the "create_limit_input" span; remove those raw fields (normalizedInput.Name
and normalizedInput.MaxAmount) from the attribute map and only record
low-cardinality fields such as normalizedInput.LimitType and
normalizedInput.Currency; if presence of those values is important for
debugging, replace the removed entries with boolean flags (e.g., "has_name":
true/false and "has_max_amount": true/false) or a sanitized enum instead,
ensuring the call site (SetSpanAttributesFromValue, span, "create_limit_input")
is updated accordingly.
In `@internal/services/command/deactivate_rule.go`:
- Around line 160-162: Logging uses the key "error" here which is inconsistent
with other statements that use "error.message"; update the logger.With call in
deactivate_rule.go (the pkgLogging.Pairs invocation used when recording the
audit event for "service.rule.deactivate.audit" and referencing updatedRule.ID
and err) to use the key "error.message" and keep err.Error() as the value so
logs match the other entries for consistent search/parsing.
- Around line 75-76: Extract the repeated operation identifier
"service.rule.deactivate" into a local constant (e.g. const operation =
"service.rule.deactivate") at the top of the function in deactivate_rule.go,
then replace all inline occurrences in logger.With(...) calls (the
logger.With(pkgLogging.Pairs(... "operation", "service.rule.deactivate",
...))...Log(...) usages) with the new operation constant so every log statement
in the function uses operation instead of the literal string.
In `@internal/services/command/draft_rule.go`:
- Around line 72-87: After calling s.repository.GetByID(ctx, ruleID) (and the
second GetByID later that assigns updatedRule) defensively check whether the
returned rule/updatedRule is nil before accessing fields like rule.Status,
calling RuleToMap(rule), or using updatedRule.ID; if nil, treat it as
constant.ErrRuleNotFound and return the same validated business error path used
above (call libOpentelemetry.HandleSpanBusinessErrorEvent, log the warning with
logger.With(...).Log(..., "Rule not found"), and return
libCommons.ValidateBusinessError(constant.ErrRuleNotFound, "Rule")). Ensure both
repository call sites (the initial rule and the updatedRule branch) implement
this nil-guard early so no dereference occurs.
- Around line 64-67: Remove the high-cardinality "rule_id" entry from the map
passed to libOpentelemetry.SetSpanAttributesFromValue in the draft_rule.go span
instrumentation and properly handle its returned error instead of discarding it;
call SetSpanAttributesFromValue with only non-sensitive keys (e.g., "operation":
"draft"), capture the error result and log/record it or return it consistent
with the pattern used in create_limit.go (inspect how SetSpanAttributesFromValue
is handled there around the span variable and replicate the same error-checking
and reporting logic for the span in draft_rule.go).
In `@internal/services/command/update_rule.go`:
- Around line 66-75: The repo calls c.repo.GetByID and any subsequent use of
rule or result (e.g., RuleToMap(rule) and result.ID) must be guarded against a
(nil, nil) return to avoid panics; update the logic in the update flow to check
if rule == nil (and similarly result == nil in the later block) after the
GetByID call and treat that as a not-found or technical error (call
libOpentelemetry.HandleSpanBusinessErrorEvent or HandleSpanError accordingly and
return an error such as constant.ErrRuleNotFound or a wrapped error) before
dereferencing or passing the value to RuleToMap or accessing result.ID.
- Around line 129-135: Remove high-cardinality attributes from the OpenTelemetry
span by editing the SetSpanAttributesFromValue call for "rule_update": drop
"rule.id" and "rule.name" from the attributes map and keep only low-cardinality
fields such as "rule.action" (string(rule.Action)), "rule.status"
(string(rule.Status)), and "rule.scopes_count" (len(rule.Scopes)); keep any
identifier logging in application logs instead of span attributes and leave the
span and function name (libOpentelemetry.SetSpanAttributesFromValue, span,
"rule_update") unchanged.
In `@internal/services/limit_service.go`:
- Around line 139-145: The info-level log in the GetLimitUsage path is emitting
sensitive numeric snapshot fields (snapshot.CurrentUsage, snapshot.LimitAmount,
snapshot.UtilizationPercent, snapshot.NearLimit); remove those from the info log
to avoid durable leakage and keep only low-sensitivity fields (e.g., "operation"
and "limit_id") in the logger.With(...) call. For the numeric snapshot, emit it
to a narrower channel (e.g., use libLog.LevelDebug or an audit-specific logger)
by creating a separate log call that includes the pkgLogging.Pairs for
"current_usage", "limit_amount", "utilization_percent", and "near_limit" (or
forward them to the audit logger) so the main info log only contains operation
and limit_id; update the existing logger.With(...) invocation around the
retrieved snapshot accordingly.
In `@internal/services/query/get_transaction_validation.go`:
- Around line 57-80: After calling q.repo.GetByID, defensively check if
validation == nil before accessing validation.ID or validation.Decision; if nil,
call libOpentelemetry.HandleSpanBusinessErrorEvent(span, "Transaction validation
not found", constant.ErrTransactionValidationNotFound) and return nil,
constant.ErrTransactionValidationNotFound (or wrap/return an appropriate
not-found error), and only call libOpentelemetry.SetSpanAttributesFromValue and
logger.With(...) when validation is non-nil to avoid panics. Ensure the
nil-check is placed immediately after the GetByID call and uses the same
validation variable and span/logger calls referenced in this function.
- Around line 69-73: Remove the client-provided request_id from the span
attributes by editing the call to libOpentelemetry.SetSpanAttributesFromValue
inside get_transaction_validation (where span and validation are used) so that
the map no longer includes the "validation.request_id" key; keep "validation.id"
and "validation.decision" as before to preserve server-generated ID and decision
context for span-level aggregation.
In `@internal/services/query/limit_checker.go`:
- Around line 468-476: The code currently ignores the error returned by
counter.Add in the rollback paths; update both places where
metricsFactory.Counter(MetricRollbackFailures) is used (and the resulting
counter.Add(ctx, int64(len(failedLimits))) call) to check and handle the
returned error: after calling counter.Add check if err != nil and emit a
logger.With(...).Log(... LevelWarn) including "operation" set to
"service.limit_checker.rollback_incremented_counters", "metric" set to
MetricRollbackFailures.Name and the error message (err.Error()), so metric write
failures are logged; keep the existing branch that logs when
metricsFactory.Counter(...) fails and add this additional error handling for
counter.Add in both locations that currently discard its error.
In `@internal/services/query/list_transaction_validations.go`:
- Around line 87-103: The code calls q.repo.List and assumes result is non-nil;
normalize a nil page before reading its fields by checking if result == nil
after the List call and replacing it with an empty page value (so that
result.TransactionValidations is an empty slice and result.HasMore is false)
prior to calling libOpentelemetry.SetSpanAttributesFromValue and
logger.With(...).Log; update the block around the q.repo.List call and afterward
references to result.TransactionValidations and result.HasMore to use this
normalized result to avoid panics.
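The nil-page normalization above can be sketched as follows; Page is a simplified stand-in for the repository's result type:

```go
package main

import "fmt"

type Page struct {
	TransactionValidations []string
	HasMore                bool
}

// normalizePage replaces a nil page with an empty one so callers can read
// its fields without repeating a nil check at every site.
func normalizePage(p *Page) *Page {
	if p == nil {
		return &Page{TransactionValidations: []string{}}
	}
	return p
}

func main() {
	result := normalizePage(nil)
	fmt.Println(len(result.TransactionValidations), result.HasMore) // 0 false

	result = normalizePage(&Page{TransactionValidations: []string{"v1"}, HasMore: true})
	fmt.Println(len(result.TransactionValidations), result.HasMore) // 1 true
}
```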
In `@internal/services/validation_service.go`:
- Around line 348-356: The branch that handles transactionValidationRepo.Insert
should use the validation-specific metric instead of MetricAuditPersistFailures:
replace MetricAuditPersistFailures with the metric defined for validation
persistence failures (e.g., MetricValidationPersistFailures) when calling
metricsFactory.Counter and when attaching "metric" to logger pairs; also update
the warnLogger.Log message text to indicate "Failed to initialize validation
persist failure metric" (and keep the existing persistCtx/ nolint behavior).
This ensures the counter and log correspond to transactionValidationRepo.Insert
failures rather than audit persistence.
In `@internal/services/workers/rule_sync_worker.go`:
- Around line 375-385: logMetricInitFailure currently logs with the passed
logger (often w.logger) which drops the trace.id/span.id present on the
contextual runSyncCycle logger; modify logMetricInitFailure(ctx, logger,
metricName, err) to prefer the logger from the context (e.g.,
libLog.FromContext(ctx) or equivalent) if available and fall back to the
provided logger, then attach the same Pairs and call Log so trace fields are
preserved; update references in emitSuccessMetrics/emitSkipMetrics only if they
actually need to pass a different logger, but prefer fixing logMetricInitFailure
to derive the contextual logger so callers like runSyncCycle do not need
changes.
- Around line 288-294: The call to libOtel.SetSpanAttributesFromValue(span,
"sync_result", ...) currently ignores its returned error; change it to capture
the returned error and handle failures instead of discarding them. Specifically,
update the SetSpanAttributesFromValue invocation (the one passing span,
"sync_result", compileErrors, w.cache.Size(), len(changes.New/Updated/Deleted))
to check the error return and then record it (e.g., span.RecordError(err) and/or
set span status) and emit a log via the worker logger (e.g., w.logger.Errorf or
similar) including the error and context ("sync_result", compileErrors, cache
size, counts) so failures are visible. Ensure you use the existing span and
logger fields rather than introducing new globals.
---
Outside diff comments:
In `@internal/adapters/cel/adapter.go`:
- Around line 108-113: Compile currently overwrites the adapter's injected
logger with the context-derived logger from libCommons.NewTrackingFromContext,
losing the adapter's logger when callers pass a plain context; update Compile
(adapter.cel.compile) to prefer the adapter's injected logger as a fallback:
call pkgLogging.WithTrace(ctx, a.logger) (or if the context logger exists, use
it but fallback to a.logger) when creating the traced logger instead of
unconditionally using the logger returned by NewTrackingFromContext, so
NewAdapter's provided logger is used whenever the context has no richer logger.
In `@internal/adapters/postgres/rule_repository.go`:
- Around line 460-467: Normalize filter.Limit after computing fetchLimit so
downstream logic doesn't use a negative limit: after calculating fetchLimit :=
filter.Limit + 1 (and applying the fallback when fetchLimit <= 0), set
filter.Limit = int(fetchLimit - 1) (or the safe default
constant.DefaultPaginationLimit if you used that fallback) so that subsequent
hasMore calculation and the rules[:filter.Limit] slice use a non-negative,
bounded value; update the code around variables fetchLimit, filter.Limit,
hasMore and rules to use this normalized filter.Limit before slicing.
In `@internal/adapters/postgres/usage_counter_repository_test.go`:
- Around line 31-49: The cleanup function returned by
setupUsageCounterRepositoryMockDB currently only closes the mock DB; update it
to also call sqlMock.ExpectationsWereMet() and fail the test if it returns an
error (e.g., using require.NoError(t, err) or t.Fatalf) so every test verifies
all SQL expectations were consumed; modify the cleanup closure inside
setupUsageCounterRepositoryMockDB to invoke sqlMock.ExpectationsWereMet() and
handle any returned error before/after closing the DB.
In `@internal/services/command/update_rule.go`:
- Around line 55-64: Reject nil UpdateRuleInput at the start of
UpdateRuleCommand.Execute to avoid panics when dereferencing fields like
input.Expression: add a nil-check for the input pointer in
UpdateRuleCommand.Execute (similar to CreateLimitCommand.Execute) and return a
clear error (e.g., validation/invalid-argument) if input == nil before any use
of input.Expression or other fields so the method never dereferences a nil
pointer.
In `@pkg/resilience/circuit_breaker.go`:
- Around line 125-139: The recovered panic payload is being included verbatim in
the error sent on fnDone (result{nil, fmt.Errorf("panic recovered in circuit
breaker: %v", r)}), which can leak secrets; change that to a sanitized error
that does not include the raw value — e.g. send result{nil, fmt.Errorf("panic
recovered in circuit breaker: %T", r)} or a generic fmt.Errorf("panic recovered
in circuit breaker")} in the goroutine's defer, keeping the existing detailed
panic type/logging only in the c.logger.With(...).Log call (referencing fnDone,
result, the goroutine, c.logger and circuit_breaker.execute) so no raw panic
payload is propagated out of the circuit breaker.
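The type-only sanitization this comment requests looks like the sketch below (a later iteration in this thread reverses course and wraps the value instead; this shows the `%T` variant). Names are illustrative, not the repository's own:

```go
package main

import "fmt"

// sanitizedPanicErr reports only the dynamic type of the recovered value, so
// secrets inside the panic payload never leave the circuit breaker.
func sanitizedPanicErr(r any) error {
	return fmt.Errorf("panic recovered in circuit breaker: %T", r)
}

// run executes fn and converts any panic into a sanitized error, mimicking
// the goroutine defer that sends result{nil, err} on fnDone.
func run(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = sanitizedPanicErr(r)
		}
	}()
	fn()
	return nil
}

func main() {
	err := run(func() { panic("super-secret-token") })
	fmt.Println(err) // type only, no payload
}
```

The detailed payload can still go to the structured logger (which is presumed to have redaction controls), while the error that crosses the package boundary carries only the type.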
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e08a6673-770c-4fbf-9b81-f306747bf54a
⛔ Files ignored due to path filters (1)
go.sum is excluded by `!**/*.sum`
📒 Files selected for processing (90)
docs/CODING_STANDARDS.md, docs/PROJECT_RULES.md, go.mod, internal/adapters/cel/adapter.go, internal/adapters/cel/compile_test.go, internal/adapters/cel/evaluate_test.go, internal/adapters/http/in/audit_event_handler.go, internal/adapters/http/in/handlers.go, internal/adapters/http/in/health.go, internal/adapters/http/in/health_handler_test.go, internal/adapters/http/in/limit_handler.go, internal/adapters/http/in/middleware/apikey.go, internal/adapters/http/in/middleware/apikey_test.go, internal/adapters/http/in/middleware/auth_guard_test.go, internal/adapters/http/in/middleware/metrics.go, internal/adapters/http/in/routes.go, internal/adapters/http/in/routes_test.go, internal/adapters/http/in/rule_handler.go, internal/adapters/http/in/rule_handler_test.go, internal/adapters/http/in/swagger.go, internal/adapters/http/in/transaction_validation_handler.go, internal/adapters/http/in/validation_handler.go, internal/adapters/postgres/audit_event_repository.go, internal/adapters/postgres/audit_event_repository_test.go, internal/adapters/postgres/db/adapter.go, internal/adapters/postgres/db/adapter_test.go, internal/adapters/postgres/db/interfaces.go, internal/adapters/postgres/db/mocks/interfaces_mock.go, internal/adapters/postgres/limit_repository.go, internal/adapters/postgres/limit_repository_test.go, internal/adapters/postgres/repository_constructors_test.go, internal/adapters/postgres/rule_repository.go, internal/adapters/postgres/rule_repository_test.go, internal/adapters/postgres/rule_sync_repository.go, internal/adapters/postgres/transaction_validation_repository.go, internal/adapters/postgres/transaction_validation_repository_test.go, internal/adapters/postgres/usage_counter_repository.go, internal/adapters/postgres/usage_counter_repository_test.go, internal/bootstrap/config.go, internal/bootstrap/config_test.go, internal/bootstrap/http_server.go, internal/bootstrap/http_server_test.go, internal/bootstrap/service.go, internal/services/cache/warmup.go, internal/services/cache/warmup_test.go, internal/services/command/activate_limit.go, internal/services/command/activate_rule.go, internal/services/command/create_limit.go, internal/services/command/create_rule.go, internal/services/command/deactivate_limit.go, internal/services/command/deactivate_rule.go, internal/services/command/delete_limit.go, internal/services/command/delete_rule.go, internal/services/command/draft_limit.go, internal/services/command/draft_rule.go, internal/services/command/update_limit.go, internal/services/command/update_rule.go, internal/services/limit_service.go, internal/services/metrics.go, internal/services/query/complete_evaluator.go, internal/services/query/evaluate_rules.go, internal/services/query/get_active_rules.go, internal/services/query/get_limit.go, internal/services/query/get_rule.go, internal/services/query/get_transaction_validation.go, internal/services/query/limit_checker.go, internal/services/query/list_limits.go, internal/services/query/list_rules.go, internal/services/query/list_transaction_validations.go, internal/services/query/metrics.go, internal/services/query/rule_evaluator.go, internal/services/transaction_validation_service.go, internal/services/validation_service.go, internal/services/validation_service_processing_time_test.go, internal/services/workers/metrics.go, internal/services/workers/rule_sync_circuit_breaker_test.go, internal/services/workers/rule_sync_polling_integration_test.go, internal/services/workers/rule_sync_worker.go, internal/services/workers/usage_cleanup_worker.go, internal/services/workers/usage_cleanup_worker_test.go, internal/testutil/db.go, internal/testutil/mock_logger.go, internal/testutil/mock_logger_test.go, internal/testutil_integration/testcontainer_suite.go, pkg/logging/trace.go, pkg/logging/trace_test.go, pkg/migration/functions.go, pkg/migration/functions_test.go, pkg/net/http/with_body.go, pkg/resilience/circuit_breaker.go
This PR is very large (94 files, 5920 lines changed). Consider breaking it into smaller PRs for easier review.

Consider updating CHANGELOG.md to document this change. If this change doesn't need a changelog entry, add the appropriate label.

This PR is very large (97 files, 6268 lines changed). Consider breaking it into smaller PRs for easier review.
Reviewed the outside-diff CodeRabbit findings from review Implemented in
Not changing in this PR:
I also replied inline on every actionable CodeRabbit comment with the applied resolution or rationale. |
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/adapters/postgres/transaction_validation_repository.go (1)
65-70: ⚠️ Potential issue | 🟡 Minor — Validate `conn` in the constructor to fail fast.

Similar to rule_repository.go, this constructor allows a nil `*libPostgres.Client` to be wrapped, deferring failure until `GetDB(ctx)` is called at runtime. Consider adding a nil check for fail-fast behavior.

Suggested fix:

    func NewTransactionValidationRepository(conn *libPostgres.Client) *TransactionValidationRepository {
    +	if conn == nil {
    +		return nil
    +	}
     	return &TransactionValidationRepository{
     		conn:      pgdb.NewPostgresConnectionAdapter(conn),
     		tableName: "transaction_validations",
     	}
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/adapters/postgres/transaction_validation_repository.go` around lines 65 - 70, NewTransactionValidationRepository currently accepts a nil *libPostgres.Client and defers failure; add a nil check at the start of NewTransactionValidationRepository to return an error or panic (consistent with other constructors like rule_repository.go) when conn == nil, ensuring TransactionValidationRepository is not constructed with a nil conn (the repository wraps conn via pgdb.NewPostgresConnectionAdapter and later calls GetDB(ctx)); update the constructor to validate conn and fail fast with a clear message referencing NewTransactionValidationRepository and TransactionValidationRepository.
♻️ Duplicate comments (1)
docs/CODING_STANDARDS.md (1)
264: ⚠️ Potential issue | 🟡 Minor — Documentation uses incorrect field names.

Line 264 refers to `trace_id`/`span_id`, but the actual fields emitted by `pkg/logging/trace.go` are `trace.id` and `span.id`. This mismatch will confuse readers verifying log-trace correlation.

📝 Proposed fix:

    -- Logger entries miss trace_id/span_id → broken log-trace correlation
    +- Logger entries miss trace.id/span.id → broken log-trace correlation

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/CODING_STANDARDS.md` at line 264, Documentation incorrectly references `trace_id/span_id`; update the wording in the CODING_STANDARDS.md entry to use the actual emitted field names `trace.id` and `span.id` (as defined in pkg/logging/trace.go) so readers can correctly verify log-trace correlation; search for the phrase `trace_id/span_id` in docs and replace it with `trace.id` / `span.id`, keeping surrounding explanatory text intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/services/query/limit_checker.go`:
- Around line 142-143: The info-level success logs in the hot path (e.g., the
logger.With(...).Log calls in check_limits and inside processLimitAtomic) are
too verbose; change non-error/no-op success logs (including the "Found
applicable limits" and per-limit pass logs that include high-cardinality fields
like limit_id, transaction_amount, period_key) to debug level or apply sampling,
and reserve Info/Warn for actual exceedances and rollback failures; update
logger.With(...) invocations in check_limits, processLimitAtomic, and the other
referenced blocks (around lines 217-218, 270-275, 382-388, 714-717) to use
libLog.LevelDebug or a sampled logger, keeping error/warn logging unchanged for
exceed and rollback error paths.
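The sampling alternative mentioned above can be sketched with a tiny counter-based gate; this is an illustration only, not lib-commons API (a real sampled logger would sit behind the `logger.With(...).Log(...)` call):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sampler admits one out of every n hot-path log calls, a cheap alternative to
// demoting routine success logs to debug level.
type sampler struct {
	n     uint64
	count atomic.Uint64
}

// Allow reports whether this call should be logged. It is safe for
// concurrent use, matching the hot-path nature of check_limits.
func (s *sampler) Allow() bool {
	return s.count.Add(1)%s.n == 1
}

func main() {
	s := &sampler{n: 100}
	logged := 0
	for i := 0; i < 1000; i++ {
		if s.Allow() {
			logged++ // here a real service would emit the debug-level success log
		}
	}
	fmt.Println(logged)
}
```

Either option keeps high-cardinality per-limit fields (`limit_id`, `period_key`, amounts) out of the Info stream while preserving Warn/Error fidelity for exceedances and rollback failures.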
- Around line 407-410: The rollback code creates detached contexts from
context.Background() which strip v4 DB routing values, causing r.conn.GetDB(ctx)
(used by GetForUpdate and DecrementAtomic) to pick the wrong DB; change the
rollback context creation to derive from the original ctx so DB bindings are
preserved (for example use context.WithTimeout(context.WithoutCancel(ctx),
rollbackTimeout) or otherwise propagate the DB binding into the new context) and
keep reattaching trace spans with trace.ContextWithSpan() so compensation
operations (the rollback calls that invoke GetForUpdate and DecrementAtomic on
r.conn) resolve the correct Postgres connection.
- Around line 114-119: The map keys passed to libOtel.SetSpanAttributesFromValue
are incorrectly prefixed with "input." which, combined with the caller prefix
"input", produces "input.input.*"; update the key names in the map literal used
in the call to libOtel.SetSpanAttributesFromValue (the call that sets attributes
on span) to be unprefixed: use "transaction_type", "sub_type", "currency", and
"amount" instead of "input.transaction_type", "input.sub_type",
"input.currency", "input.amount"; apply the same fix to the other call
referenced (the similar map at lines 240-245) so the function correctly emits
"input.transaction_type", etc.
In `@pkg/resilience/circuit_breaker.go`:
- Around line 126-142: The defer's recover currently only records the panic type
(panicType) and drops the actual panic value; update the panic handling inside
the defer so the recovered value r is included in the error sent to fnDone and
in the c.logger metadata: construct an error using fmt.Errorf("panic recovered
in circuit breaker (%s): %v", panicType, r) (or equivalent) when sending into
fnDone (result{nil, ...}) and add a log field like "panic.value" with the
recovered value to the c.logger.With(...) call in the circuit_breaker.execute
defer to ensure both type and value are logged and sent.
---
Outside diff comments:
In `@internal/adapters/postgres/transaction_validation_repository.go`:
- Around line 65-70: NewTransactionValidationRepository currently accepts a nil
*libPostgres.Client and defers failure; add a nil check at the start of
NewTransactionValidationRepository to return an error or panic (consistent with
other constructors like rule_repository.go) when conn == nil, ensuring
TransactionValidationRepository is not constructed with a nil conn (the
repository wraps conn via pgdb.NewPostgresConnectionAdapter and later calls
GetDB(ctx)); update the constructor to validate conn and fail fast with a clear
message referencing NewTransactionValidationRepository and
TransactionValidationRepository.
---
Duplicate comments:
In `@docs/CODING_STANDARDS.md`:
- Line 264: Documentation incorrectly references `trace_id/span_id`; update the
wording in the CODING_STANDARDS.md entry to use the actual emitted field names
`trace.id` and `span.id` (as defined in pkg/logging/trace.go) so readers can
correctly verify log-trace correlation; search for the phrase `trace_id/span_id`
in docs and replace it with `trace.id` / `span.id`, keeping surrounding
explanatory text intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 60d31212-813d-4849-9cac-4a8fb7ce2ab8
📒 Files selected for processing (27)
docs/CODING_STANDARDS.md, internal/adapters/cel/adapter.go, internal/adapters/http/in/middleware/apikey_test.go, internal/adapters/http/in/transaction_validation_handler.go, internal/adapters/http/in/validation_handler.go, internal/adapters/postgres/rule_repository.go, internal/adapters/postgres/rule_repository_test.go, internal/adapters/postgres/transaction_validation_repository.go, internal/adapters/postgres/transaction_validation_repository_test.go, internal/adapters/postgres/usage_counter_repository_test.go, internal/services/cache/warmup.go, internal/services/command/activate_limit.go, internal/services/command/create_limit.go, internal/services/command/deactivate_rule.go, internal/services/command/draft_rule.go, internal/services/command/draft_rule_test.go, internal/services/command/update_rule.go, internal/services/command/update_rule_test.go, internal/services/limit_service.go, internal/services/metrics.go, internal/services/metrics_test.go, internal/services/query/get_transaction_validation.go, internal/services/query/limit_checker.go, internal/services/query/list_transaction_validations.go, internal/services/validation_service.go, internal/services/workers/rule_sync_worker.go, pkg/resilience/circuit_breaker.go
Correct limit-checker telemetry so hot paths stay quiet while rollback compensation keeps DB bindings intact. Include recovered panic values in circuit breaker errors and warning logs for better production diagnosis.
This PR is very large (97 files, 6278 lines changed). Consider breaking it into smaller PRs for easier review.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/services/query/limit_checker.go`:
- Around line 240-252: The span attribute map in the
libOtel.SetSpanAttributesFromValue call is passing limit.MaxAmount as a decimal
value inconsistently with other places that use .String(); update the entry for
"maxAmount" to use limit.MaxAmount.String() so serialization matches
input.Amount.String() and ensure the change is applied in the process where
libOtel.SetSpanAttributesFromValue, span, limit.ID and limit.Name are used
(e.g., in service.limit_checker.process_limit_atomic) to keep decimal values
consistently serialized as strings in span attributes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b0f9851d-56ab-489c-a1f8-f69a518944ba
📒 Files selected for processing (2)
internal/services/query/limit_checker.go, pkg/resilience/circuit_breaker.go
When a function executed by the circuit breaker panics with an error value, the original error is now wrapped. This allows callers to inspect the root cause using `errors.Is` or `errors.As`, improving debuggability and error handling. Additionally, this fixes a logging issue in the limit checker where a decimal type for `maxAmount` was not serialized correctly. It is now converted to a string before logging to prevent panics and ensure the value is human-readable.
This PR is very large (98 files, 6300 lines changed). Consider breaking it into smaller PRs for easier review.
The logic for handling a recovered panic within the circuit breaker is extracted into a dedicated `recoveredPanicError` function. This change improves code readability by giving a descriptive name to the error creation process. It also isolates this logic, making the defer block cleaner and easier to understand at a glance.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/resilience/circuit_breaker_test.go`:
- Around line 253-254: The test currently pins to the stdlib's unexported
concrete error type by asserting the full error string contains
"*errors.errorString"; instead, keep the existing assert.ErrorIs(t, err,
sentinelErr) and replace the brittle assert.Equal check on err.Error() with one
or two assert.Contains assertions that verify stable message fragments (e.g.,
assert.Contains(t, err.Error(), "panic recovered in circuit breaker") and
assert.Contains(t, err.Error(), "sentinel panic")) so the test checks semantics
of the message without depending on Go's concrete error implementation.
In `@pkg/resilience/circuit_breaker.go`:
- Around line 128-131: Insert a blank line between the panicErr assignment and
the type-assertion branch to satisfy the linter: after the line that sets
panicErr (the fmt.Errorf call using panicType and r) add an empty line before
the if recoveredErr, ok := r.(error) { ... } block; this change affects the
variables panicType and panicErr in the panic recovery logic inside the circuit
breaker (the recovered-panic branch) and does not alter the existing error
handling or formatting.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 25f9f4eb-a3f1-47ac-97cd-1f9463bd6170
📒 Files selected for processing (3)
internal/services/query/limit_checker.go, pkg/resilience/circuit_breaker.go, pkg/resilience/circuit_breaker_test.go
This PR is very large (98 files, 6305 lines changed). Consider breaking it into smaller PRs for easier review.
gandalf-at-lerian left a comment
Review: lib-commons v4 migration + hardening
98 files, ~3.4K additions / ~2.9K deletions. This is a significantly cleaner v4 migration compared to the reporter PR — well done.
What stands out ✅
Proper structured logging — uses logger.With(pairs("key", val, ...)).Log(ctx, level, "message") pattern throughout. No fmt.Sprintf wrapping in log calls. This is the idiomatic v4 approach.
Context propagation — 380 out of 413 .Log() calls pass the actual ctx. The ~24 context.Background() usages are in bootstrap/config initialization code where no request context exists — perfectly acceptable.
Full DI migration — zero NewTrackingFromContext calls. Logger and tracer are injected via struct fields.
Zero v3 remnants — no lib-commons/v3 imports, no old-style Infof/Errorf logging (only in mock implementations for test compatibility).
Span error handling — all migrated to value receiver (span not &span).
Minor observations
pairs() helper duplicated 12 times — the func pairs(fields ...any) []libLog.Field helper is defined in 12 different files. Consider extracting to a shared pkg/log/fields.go or similar. Not blocking, but would reduce duplication.
Auth hardening — the startup validation tightening and nil DB rejection are good defensive additions beyond the migration scope.
Summary
Clean migration. Structured logging done right, full DI, proper context propagation. The pairs() duplication is the only thing worth a follow-up cleanup. LGTM.
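The shared helper suggested above might look like this sketch, with a local `Field` type standing in for lib-commons' `libLog.Field` and `pkg/log/fields.go` as the hypothetical home:

```go
package main

import "fmt"

// Field stands in for lib-commons' libLog.Field (assumed key/value shape).
type Field struct {
	Key   string
	Value any
}

// Pairs converts a flat key/value variadic list into structured log fields —
// extracted once here instead of being duplicated across 12 files. A dangling
// trailing value (odd argument count) is dropped rather than panicking.
func Pairs(kv ...any) []Field {
	fields := make([]Field, 0, len(kv)/2)
	for i := 0; i+1 < len(kv); i += 2 {
		key, ok := kv[i].(string)
		if !ok {
			key = fmt.Sprint(kv[i]) // tolerate non-string keys
		}
		fields = append(fields, Field{Key: key, Value: kv[i+1]})
	}
	return fields
}

func main() {
	fields := Pairs("rule_id", "r-42", "tenant", "acme")
	fmt.Println(len(fields), fields[0].Key, fields[1].Value)
}
```

Callers would then write `logger.With(logfields.Pairs("key", val)...).Log(ctx, level, msg)` from one import, so any future change to field construction happens in exactly one place.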
style: format code to adhere to modern Go conventions
style: use explicit '0o' prefix for octal file permission literals
This PR is very large (106 files, 6385 lines changed). Consider breaking it into smaller PRs for easier review.
    attrs := attributesToMap(compileSpan.Attributes)
    assert.Contains(t, attrs, "compile_input", "Should have compile_input attribute")

    // The attribute value is a JSON string containing the struct data
@gandalf-at-lerian, why was this test snippet removed?
Summary
- Migrated to the `lib-commons/v4` and `lib-auth/v2` APIs, including context-aware Postgres access, bootstrap/config updates, tracing helpers, and compatibility updates across handlers, services, workers, and repositories

Testing

- make lint
- make test
- make test-integration