
FFL-1721 Emit flagevaluation EVP Track #2646

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 27 commits into develop from
sameerank/FFL-1721/emit-flagevaluation-evp-track
Feb 6, 2026

Conversation

@sameerank
Contributor

@sameerank sameerank commented Jan 22, 2026

What and why?

Introduces evaluation logging for feature flags.

Evaluation logging vs Exposure logging:

  • Exposure logging captures when a flag evaluation actually impacts the user experience (doLog=true). This is used for experimentation metrics and analysis where you need to know which users were exposed to which variants.
  • Evaluation logging captures all flag evaluations regardless of doLog. This is for operational visibility: monitoring and debugging.

Evaluation data is aggregated (count, first/last timestamps) rather than logged individually, to minimize overhead.

Useful references:

How?

  • EvaluationAggregator: Aggregates flag evaluations by a composite key (flag key, variant key, allocation key, targeting key, error message, and context hash). Tracks evaluationCount, firstEvaluation, and lastEvaluation timestamps per unique combination (a rough sketch follows this list).
  • Flush behavior: Aggregated evaluations are flushed to the backend either:
    • On a configurable timer interval (evaluationFlushInterval)
    • When the max aggregations threshold is reached
    • On aggregator deallocation
  • FlagEvaluationEvent: Event model matching the EVP schema. Fields like variant, allocation, and runtimeDefaultUsed are conditionally included based on EVALLOG spec requirements (e.g., variant/allocation omitted for runtime defaults per EVALLOG.8, runtimeDefaultUsed omitted when false per EVALLOG.13).
  • Integration: FlagsClient records evaluations to the aggregator for all flag resolutions when evaluation logging is enabled, regardless of the doLog flag (unlike exposure logging).
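To make the aggregation model concrete, here is a minimal sketch. The type and method names (AggregationKey, AggregatedEvaluation, sendEvaluations) mirror the description above but are assumptions, not the actual implementation; synchronization and the timer-based flush are omitted.

import Foundation

// Composite key identifying one unique evaluation bucket (assumed shape).
struct AggregationKey: Hashable {
    let flagKey: String
    let variantKey: String?
    let allocationKey: String?
    let targetingKey: String?
    let errorMessage: String?
    let contextHash: Int
}

// Aggregated counters tracked per key (assumed shape).
struct AggregatedEvaluation {
    var evaluationCount: Int
    var firstEvaluation: Date
    var lastEvaluation: Date
}

final class EvaluationAggregatorSketch {
    private var aggregations: [AggregationKey: AggregatedEvaluation] = [:]
    private let maxAggregations: Int
    private let upload: ([AggregationKey: AggregatedEvaluation]) -> Void

    init(maxAggregations: Int, upload: @escaping ([AggregationKey: AggregatedEvaluation]) -> Void) {
        self.maxAggregations = maxAggregations
        self.upload = upload
    }

    // Called for every flag resolution, regardless of doLog.
    func recordEvaluation(key: AggregationKey, at now: Date = Date()) {
        if var existing = aggregations[key] {
            existing.evaluationCount += 1
            existing.lastEvaluation = now
            aggregations[key] = existing
        } else {
            aggregations[key] = AggregatedEvaluation(evaluationCount: 1, firstEvaluation: now, lastEvaluation: now)
        }
        // Flush trigger: max aggregations threshold. The timer interval and
        // shutdown triggers call sendEvaluations() from outside this method.
        if aggregations.count >= maxAggregations {
            sendEvaluations()
        }
    }

    func sendEvaluations() {
        guard !aggregations.isEmpty else { return }
        upload(aggregations)
        aggregations = [:]
    }
}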

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes
  • Add Objective-C interface for public APIs - see our guidelines (internal)
  • Run make api-surface when adding new APIs

@sameerank sameerank force-pushed the sameerank/FFL-1721/emit-flagevaluation-evp-track branch from f653b90 to f11525d on January 22, 2026 05:05
leoromanovsky previously approved these changes Jan 23, 2026
Contributor

@leoromanovsky leoromanovsky left a comment


Approving FFE related code; I generally understand the synchronization changes but didn't do a deep dive of them.

I asked Claude about a race condition, but I'm not going to parrot it back here as my own work; try it out yourself to see if it's relevant.

I appreciate that:

  • You are testing with references to the EVAL spec
  • In your tests you are controlling the advance of time to keep them fast

let service: String
let version: String
let env: String
let rum: RUMInfo?
Contributor

Is the contextual information that's embedded in rum relevant (or possible) to annotate from Swift? The desired field is url, which feels very web-centric. It's optional, so we don't need to add anything in this PR, but I'm wondering if you have insight to take back to augment the schema.

Contributor Author
@sameerank sameerank Jan 26, 2026

Yes, url is not required in the schema, and actually neither is the entire rum object, which is why I had previously set it to nil. I don't see anything comparable to url among the fields available in RUMCoreContext:

public struct RUMCoreContext: AdditionalContext, Equatable {
    /// RUM key in core additional context.
    public static let key = "rum"
    /// Current RUM application ID - standard UUID string, lowercased.
    public let applicationID: String
    /// Current RUM session ID - standard UUID string, lowercased.
    public let sessionID: String
    /// Current RUM view ID - standard UUID string, lowercased. It can be empty when view is being loaded.
    public let viewID: String?
    /// The ID of current RUM action (standard UUID `String`, lowercased).
    public let userActionID: String?
    /// Current view related server time offset
    public let viewServerTimeOffset: TimeInterval?

What I can do with what's currently available is to partially populate RUMInfo with the applicationID, which would still be compliant with the schema: acba889.

This deviates slightly from the Android SDK which is populating both.

My hunch is that a field like path would be a bit more appropriate for mobile, though it's worth noting that the field is referred to as "url" in the Android SDK. This activeViewPath field looks potentially usable to this end, but it's currently internal to DatadogRUM.

RUMCoreContext in DatadogInternal is the bridge that would make activeViewPath accessible to DatadogFlags, but I would need a second opinion about making activeViewPath available publicly before I can add it there.


@sameerank sameerank force-pushed the sameerank/FFL-1721/emit-flagevaluation-evp-track branch 5 times, most recently from 5e36ce8 to a90c741 on January 27, 2026 02:29
for: key,
assignment: defaultAssignment,
evaluationContext: .empty,
flagError: "PROVIDER_NOT_READY"
Contributor Author

Previously, with just exposure tracking, we didn't log anything if context was missing, but for evaluation tracking, we need to handle error cases. The Android SDK does something similar for PROVIDER_NOT_READY: https://github.com/DataDog/dd-sdk-android/blob/9aac19f5083fd137d6a052e6d4fc23a12d6f998f/features/dd-sdk-android-flags/src/main/kotlin/com/datadog/android/flags/internal/DatadogFlagsClient.kt#L258

}
return EvaluationContext.RUMInfo(
application: EvaluationContext.RUMInfo.ApplicationInfo(id: rum.applicationID),
view: nil // viewURL not available in RUMCoreContext
Contributor Author

Should this be a TODO to use activeViewPath? Or should we settle for partially populating the RUM info?

telemetry: telemetry
)
messageReceiver = NOPFeatureMessageReceiver()
performanceOverride = PerformancePresetOverride(maxObjectsInFile: 50)
Contributor Author

This was set to match

performanceOverride = PerformancePresetOverride(maxObjectsInFile: 50)

The default is 500, so my understanding is that 50 means evaluations are batched into smaller files, which means more frequent uploads. I don't recall the exact reason for overriding the default.

@sameerank sameerank marked this pull request as ready for review January 27, 2026 05:17
@sameerank sameerank requested a review from a team as a code owner January 27, 2026 05:17
@sameerank sameerank force-pushed the sameerank/FFL-1721/emit-flagevaluation-evp-track branch 2 times, most recently from 46328aa to 95198df Compare January 27, 2026 18:22
Contributor
@gonzalezreal gonzalezreal left a comment

Thanks for adding evaluation logging. There are a few issues to address before merging:

Concurrency:

  • Replace DispatchQueue with @ReadWriteLock for dictionary synchronization (see inline comments). This is the established pattern in the codebase.
  • Current deinit has a race condition - reads aggregations without synchronization.
  • Flushable conformance is misused - the protocol is for draining async work, not sending data.

Tests:

  • Replace Thread.sleep with XCTestExpectation to avoid flaky tests.
  • Consider consolidating low-value tests (default value checks, single-assertion tests, repetitive runtimeDefaultUsed tests).

Checklist:

  • CHANGELOG entry

private let maxAggregations: Int
private let dateProvider: any DateProvider
private let featureScope: any FeatureScope
private let queue = DispatchQueue(label: "com.datadoghq.flags.evaluation-aggregator")
Contributor

This serial dispatch queue is being used purely for synchronization, which is overkill for protecting a dictionary. The codebase already provides ReadWriteLock, a pthread_rwlock_t-based property wrapper optimized for this exact use case.

Current approach has unnecessary overhead:

  • DispatchWorkItem allocation on every recordEvaluation call
  • GCD scheduler overhead for trivial dictionary operations
  • .barrier flag in deinit is meaningless on a serial queue
  • queue.sync in flush() introduces deadlock risk

Suggested change:

@ReadWriteLock private var aggregations: [AggregationKey: AggregatedEvaluation] = [:]

This is a recurring pattern. If this code was AI-generated, please update the agent's instructions to emphasize following existing codebase conventions for concurrency (see CLAUDE.md which mentions ReadWriteLock). The agent should search for existing patterns before introducing new synchronization mechanisms.
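As a rough illustration of the lock-based approach, here is a self-contained sketch. The Locked wrapper below is a simplified stand-in for DatadogInternal's ReadWriteLock (whose exact API may differ), and the Aggregator type is illustrative rather than the PR's implementation.

import Foundation

// Simplified pthread_rwlock_t-based property wrapper, standing in for ReadWriteLock.
@propertyWrapper
final class Locked<Value> {
    private var value: Value
    private var lock = pthread_rwlock_t()

    init(wrappedValue: Value) {
        self.value = wrappedValue
        pthread_rwlock_init(&lock, nil)
    }

    deinit { pthread_rwlock_destroy(&lock) }

    var wrappedValue: Value {
        get {
            pthread_rwlock_rdlock(&lock)
            defer { pthread_rwlock_unlock(&lock) }
            return value
        }
        set {
            pthread_rwlock_wrlock(&lock)
            defer { pthread_rwlock_unlock(&lock) }
            value = newValue
        }
    }

    // Read-modify-write under the write lock, so concurrent updates can't interleave.
    func mutate(_ body: (inout Value) -> Void) {
        pthread_rwlock_wrlock(&lock)
        defer { pthread_rwlock_unlock(&lock) }
        body(&value)
    }
}

final class Aggregator {
    @Locked private var counts: [String: Int] = [:]

    // No DispatchWorkItem allocation, no queue hop - just a short critical section.
    func record(_ key: String) {
        _counts.mutate { $0[key, default: 0] += 1 }
    }

    // Synchronous read-and-reset, which also keeps tests simple (no sleeps or expectations needed).
    func snapshotAndReset() -> [String: Int] {
        var snapshot: [String: Int] = [:]
        _counts.mutate { current in
            snapshot = current
            current = [:]
        }
        return snapshot
    }
}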

Contributor Author

Thanks for the guidance! I saw both patterns (DispatchQueue and @ReadWriteLock) and wasn't sure which one to use. The new CLAUDE.md is helpful as a best practices guide for humans too 😆. Updated: de532eb

// Timer uses [weak self], so it becomes a no-op after deallocation.
// Copy values before async dispatch - self will be deallocated before the closure runs.
let scope = featureScope
let snapshot = aggregations
Contributor

⚠️ Race condition: aggregations is read here without synchronization, before the async dispatch. If there's a pending recordEvaluation block in the queue, this is a data race. Using ReadWriteLock would fix this.

Contributor Author

Removed in de532eb

}
}

extension EvaluationAggregator: Flushable {
Contributor

The Flushable protocol is documented as "Awaits completion of all asynchronous operations"; other implementations use queue.sync {} with an empty block just to drain pending work.

This implementation uses it to send buffered data, which is a different concern. With ReadWriteLock, there are no async operations to await, making Flushable conformance misleading. Consider removing the conformance and using a dedicated method like sendAggregatedEvaluations() for the timer.

Contributor Author

Removed the Flushable conformance, and the timer and tests now call sendEvaluations() directly! Since all the method calls look like aggregator.sendEvaluations() it seemed self-evident that the evaluations would be aggregated, so I chose to not rename the method: b0e3095

Contributor Author
@sameerank sameerank Feb 3, 2026

After removing Flushable from EvaluationAggregator, I discovered that it is useful for flushing evaluation data without crashing on shutdown: calling sendEvaluations() in deinit caused an "exclusivity conflict" because deinit runs inside stop() while it's modifying the features dictionary. I described the conflict a little more in #2646 (comment)

Flushable.flush(), on the other hand, runs prior to .stop(), so it acts as a pre-shutdown hook that we can use to drain the dictionary behind the ReadWriteLock; I've added it back here: 74c7257

It's possible that I'm still misunderstanding Flushable's purpose so if there are other shutdown-related hooks that I haven't yet considered, I am happy to try those instead.

)

// Then
Thread.sleep(forTimeInterval: 0.1)
Contributor

⚠️ Using Thread.sleep for async synchronization in tests is fragile and can lead to flaky tests, especially under CI load. Prefer using XCTestExpectation with wait(for:timeout:) to properly synchronize on async completion.
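For reference, the expectation-based pattern looks roughly like this (a minimal sketch; the names are illustrative and not from this PR):

import Foundation
import XCTest

final class AsyncFlushExampleTests: XCTestCase {
    func testFlushCompletes_withoutSleeping() {
        let queue = DispatchQueue(label: "example.aggregator")
        let flushed = expectation(description: "aggregations flushed")
        var flushedCount = 0

        queue.async {
            // Simulated async work the test must wait for.
            flushedCount = 42
            flushed.fulfill()
        }

        // Blocks until fulfill() is called or the timeout elapses; no arbitrary Thread.sleep.
        wait(for: [flushed], timeout: 1.0)
        XCTAssertEqual(flushedCount, 42)
    }
}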

Contributor Author

Actually, since the ReadWriteLock operations are synchronous, I can remove these Thread.sleep calls! The previous DispatchQueue approach for managing aggregations was more complicated to test. 832e171

Contributor

The test coverage is extensive but shows signs of over-testing. Consider consolidating:

  • Tests for trivial default values (testDefaultEvaluationFlushInterval, testEvaluationLoggingEnabledByDefaultAndCanBeDisabled) provide minimal value
  • The three testRuntimeDefaultUsed* tests could be one parameterized test
  • testTimestampEqualsFirstEvaluation is a single assertion that could be part of another test

Focus on testing behavior, not spec checkboxes. A smaller set of well-designed tests is easier to maintain and provides better signal when they fail.
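For example, the three runtimeDefaultUsed variants could collapse into one data-driven test along these lines (a sketch; the test name and the encodeEvaluationEvent helper are hypothetical, not code from this PR):

import Foundation
import XCTest

final class RuntimeDefaultUsedTests: XCTestCase {
    func testRuntimeDefaultUsed_isEncodedOnlyWhenTrue() {
        // (flag's runtimeDefaultUsed value, whether the field should appear in the payload)
        let cases: [(runtimeDefaultUsed: Bool?, expectEncoded: Bool)] = [
            (true, true),    // encoded when a runtime default was actually used
            (false, false),  // omitted when false (EVALLOG.13)
            (nil, false)     // omitted when absent
        ]

        for testCase in cases {
            let json = encodeEvaluationEvent(runtimeDefaultUsed: testCase.runtimeDefaultUsed)
            XCTAssertEqual(
                json.contains("runtime_default_used"),
                testCase.expectEncoded,
                "Unexpected encoding for runtimeDefaultUsed = \(String(describing: testCase.runtimeDefaultUsed))"
            )
        }
    }

    // Hypothetical helper; the real test would build and encode a FlagEvaluationEvent.
    private func encodeEvaluationEvent(runtimeDefaultUsed: Bool?) -> String {
        var fields: [String: Any] = ["flag_key": "my-flag"]
        if runtimeDefaultUsed == true {
            fields["runtime_default_used"] = true
        }
        let data = (try? JSONSerialization.data(withJSONObject: fields)) ?? Data()
        return String(data: data, encoding: .utf8) ?? ""
    }
}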

Contributor Author

In this case, I went with a spec checkbox-style of testing to help ensure we stay aligned in behavior across SDKs. I saw a similar pattern in the OpenFeature SDKs and found it useful for understanding each SDK's approach to aligning with the specs, e.g. Ruby does context "Requirement 1.1.2.1" do.

That said, I haven't seen this among the Datadog SDKs, so I see how this is an unusual pattern. I applied your recommended consolidation but haven't removed the spec-to-test mapping (I'm open to removing and consolidating further if this still feels unwieldy to maintain): e55c9ef

@sameerank sameerank force-pushed the sameerank/FFL-1721/emit-flagevaluation-evp-track branch 11 times, most recently from 9254d27 to 7921cc8 on February 3, 2026 16:09

deinit {
    flushTimer?.invalidate()
    // Note: We don't flush here due to exclusivity conflict with DatadogCore.stop()
Contributor Author
@sameerank sameerank Feb 3, 2026

This is the conflict I'm referring to:

[Screenshot attached: 2026-02-02 at 4:26 PM]

Also saw it on CI, e.g. here where FlagsRUMIntegrationTests.testWhenFlagIsEvaluated_itAddsFeatureFlagToRUMView() failed due to the test crashing

Test Suite 'FlagsRUMIntegrationTests' started at 2026-01-31 08:28:26.829.
note: [DatadogSDKTesting] Crash detected! Exiting...

During SDK teardown, if we call sendEvaluations() here, an exclusivity conflict occurs: EvaluationAggregator.deinit tries to flush evaluations while DatadogCore.stop() is modifying the features dictionary. The chain is:

  1. deinit → sendEvaluations() → eventWriteContext → feature(named:type:) → reads features[name]
  2. Meanwhile, stop() is writing features = [:]

So I ended up bringing back the Flushable conformance so that we have access to the Flushable.flush() pre-shutdown hook:

// Next, flush flushable Features - finish current data collection to open "event write contexts":
features.forEach { $0.flush() }

which is called before .stop():

    func flushAndTearDown() {
        flush()
        ...
        stop()
    }

Comment on lines +92 to +94
func flush() {
evaluationAggregator?.sendEvaluations()
}
Contributor Author

Most Flushable implementations use this pattern to drain a DispatchQueue:

func flush() {
    queue.sync {}
}

The implementation here differs because EvaluationAggregator uses @ReadWriteLock, whose operations are all synchronous. Since sendEvaluations() is synchronous, calling it directly in flush() achieves the same goal.


- Implements EVAL.1-14 specs with client-side aggregation and batched JSON format
- Adds EvaluationAggregator with 3 flush triggers (10s timer, 1000 items, shutdown)
- Introduces public API: trackEvaluations (default: true), customEvaluationEndpoint
- Tracks ALL flag evaluations via separate FlagsEvaluationFeature to /api/v2/flagevaluation
…ead.sleep

- Merge EVALLOG.10 assertion into testTracksAggregationFields
- Combine EVALLOG.4 and EVALLOG.12 default config tests
- Consolidate three EVALLOG.13 runtime_default_used tests
- Remove Thread.sleep from test helper method
Flushing from deinit conflicts with DatadogCore.stop() modifying the
features dictionary. Evaluations since last timer flush are lost on
shutdown (EVALLOG.4 compliance gap documented in tests).
@sameerank sameerank force-pushed the sameerank/FFL-1721/emit-flagevaluation-evp-track branch 2 times, most recently from ff8f01a to 38b1172 on February 3, 2026 19:35
@sameerank
Contributor Author

Thank you for the thoughtful review, @gonzalezreal!

I left some comments to explain my implementation decisions (particularly why I continued to use Flushable), but I am happy to iterate further if there are alternative approaches that I haven't yet considered or other opportunities for improvement.

@sameerank sameerank force-pushed the sameerank/FFL-1721/emit-flagevaluation-evp-track branch from 38b1172 to 1e84ef7 on February 3, 2026 20:19
gonzalezreal previously approved these changes Feb 4, 2026
Contributor
@gonzalezreal gonzalezreal left a comment

Thanks for addressing the previous feedback!

The move to @ReadWriteLock is exactly what I was hoping for, and I appreciate the detailed explanation of the exclusivity conflict that led you back to Flushable; good catch.

A few minor things, none of which are blockers:

  1. Test naming uses a different style than the SDK's typical testGivenWhenThen pattern. I'm ok with it if you prefer the spec-aligned approach you mentioned; just flagging it for consistency.
  2. Small nit: the hardcoded "PROVIDER_NOT_READY", "FLAG_NOT_FOUND", "TYPE_MISMATCH" strings could be constants to avoid typos, but not a big deal.
  3. AnyValue: Hashable is now part of the public API. Totally fine, just making sure it's intentional.

Otherwise, this looks good to me!

for: key,
assignment: defaultAssignment,
evaluationContext: .empty,
flagError: "PROVIDER_NOT_READY"
Contributor

🤏🏽 minor: Consider extracting these to constants.

private enum EvaluationError {
   static let providerNotReady = "PROVIDER_NOT_READY"
   static let flagNotFound = "FLAG_NOT_FOUND"
   static let typeMismatch = "TYPE_MISMATCH"
}

Notice that the enum above is just used as a namespace.
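At the call site quoted above (the enclosing function isn't shown in the hunk), the string literal would then read:

for: key,
assignment: defaultAssignment,
evaluationContext: .empty,
flagError: EvaluationError.providerNotReady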

@sameerank
Contributor Author

All good points!

  1. The given-when-then convention is nice and so I’ve renamed the tests: 34b4672
  2. Enum added: eb68983
  3. AnyValue's Hashable and Equatable conformances are both used internally but exposed publicly. The alternative would be to convert a non-hashable public enum AnyValue: Equatable into an internal wrapper (internal struct HashableAnyValue: Hashable) before using evaluationContext as an aggregation key (sketched below), but I don't think the extra code is worth it. We could also have made Equatable internal-only, but similarly didn't. I realized, though, that I needed to run make api-surface: 52181ac
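For context, the wrapper alternative described above would look roughly like this (a sketch only; the AnyValue cases shown are placeholders, not the real definition):

// Public API would stay Equatable-only.
public enum AnyValue: Equatable {
    case string(String)
    case bool(Bool)
    case double(Double)
}

// Internal adapter so the aggregation key can still be Hashable.
internal struct HashableAnyValue: Hashable {
    let value: AnyValue

    static func == (lhs: HashableAnyValue, rhs: HashableAnyValue) -> Bool {
        lhs.value == rhs.value
    }

    func hash(into hasher: inout Hasher) {
        switch value {
        case .string(let string): hasher.combine(string)
        case .bool(let bool): hasher.combine(bool)
        case .double(let double): hasher.combine(double)
        }
    }
}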

The changes dismissed your approval, so please take a look again 🙏 @gonzalezreal

gonzalezreal previously approved these changes Feb 5, 2026
Contributor
@gonzalezreal gonzalezreal left a comment

🚀 LGTM!

@sameerank
Contributor Author

/merge


gh-worker-devflow-routing-ef8351 bot commented Feb 6, 2026

View all feedbacks in Devflow UI.

2026-02-06 12:08:49 UTC ℹ️ Start processing command /merge


2026-02-06 12:08:54 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in develop is approximately 1h (p90).


2026-02-06 12:56:44 UTC ℹ️ MergeQueue: This merge request was merged

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot merged commit 3029b12 into develop Feb 6, 2026
18 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot deleted the sameerank/FFL-1721/emit-flagevaluation-evp-track branch February 6, 2026 12:56
@mariedm mariedm mentioned this pull request Feb 6, 2026