FFL-1721 Emit flagevaluation EVP Track #2646

gh-worker-dd-mergequeue-cf854d[bot] merged 27 commits into develop
Force-pushed from f653b90 to f11525d
leoromanovsky left a comment
Approving FFE-related code; I generally understand the synchronization changes but didn't do a deep dive into them.
I asked Claude about a possible race condition, but I'm not going to parrot it back here as my own work; try it yourself to see if it's relevant.
I appreciate that:
- You are testing with references to the EVAL spec
- In your tests you are controlling the advance of time to keep them fast
```swift
let service: String
let version: String
let env: String
let rum: RUMInfo?
```
Is the contextual information that's embedded in rum relevant (or possible) to annotate from Swift? The desired field is url, which feels very web-centric. It's optional, so we don't need to add anything in this PR, but I'm wondering if you have insight to take back to augment the schema.
Yes, url is not required in the schema, and actually neither is the entire rum object, which is why I had previously set it to nil. I don't see anything comparable to url among the fields available in RUMCoreContext:
dd-sdk-ios/DatadogInternal/Sources/Models/RUM/RUMCoreContext.swift
Lines 10 to 22 in 5467415
What I can do with what's currently available is partially populate RUMInfo with the applicationID, which would still be compliant with the schema: acba889.
This deviates slightly from the Android SDK, which populates both.
My hunch is that a field like path would be a bit more appropriate for mobile, though it's worth noting that the field is referred to as "url" in the Android SDK. The activeViewPath field looks potentially usable to this end, but it's currently internal to DatadogRUM:
RUMCoreContext in DatadogInternal is the bridge that would make activeViewPath accessible to DatadogFlags, but I would need a second opinion about making activeViewPath publicly available before adding it there.
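For illustration, if activeViewPath were surfaced on RUMCoreContext, the bridging could look roughly like this. This is a sketch under assumptions: the activeViewPath property and the ViewInfo(url:) shape are hypothetical; only applicationID is known to exist today.

```swift
import DatadogInternal

extension EvaluationContext.RUMInfo {
    // Hypothetical bridge: assumes RUMCoreContext gains a public
    // `activeViewPath` and that RUMInfo exposes a ViewInfo(url:) shape.
    init(bridging rum: RUMCoreContext) {
        self.init(
            application: EvaluationContext.RUMInfo.ApplicationInfo(id: rum.applicationID),
            view: rum.activeViewPath.map { EvaluationContext.RUMInfo.ViewInfo(url: $0) }
        )
    }
}
```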
Force-pushed from 5e36ce8 to a90c741
```swift
for: key,
assignment: defaultAssignment,
evaluationContext: .empty,
flagError: "PROVIDER_NOT_READY"
```
Previously, with just exposure tracking, we didn't log anything if context was missing, but for evaluation tracking, we need to handle error cases. The Android SDK does something similar for PROVIDER_NOT_READY: https://github.com/DataDog/dd-sdk-android/blob/9aac19f5083fd137d6a052e6d4fc23a12d6f998f/features/dd-sdk-android-flags/src/main/kotlin/com/datadog/android/flags/internal/DatadogFlagsClient.kt#L258
```swift
}
return EvaluationContext.RUMInfo(
    application: EvaluationContext.RUMInfo.ApplicationInfo(id: rum.applicationID),
    view: nil // viewURL not available in RUMCoreContext
```
Should this be a TODO to use activeViewPath? Or should we settle for partially populating the RUM info?
```swift
    telemetry: telemetry
)
messageReceiver = NOPFeatureMessageReceiver()
performanceOverride = PerformancePresetOverride(maxObjectsInFile: 50)
```
This was set to match
The default is 500, so my understanding is that 50 means evaluations are batched into smaller files = more frequent uploads. I don't recall the exact reason for overriding the default.
Force-pushed from 46328aa to 95198df
gonzalezreal left a comment
Thanks for adding evaluation logging. There are a few issues to address before merging:

Concurrency:
- Replace `DispatchQueue` with `@ReadWriteLock` for dictionary synchronization (see inline comments). This is the established pattern in the codebase.
- The current `deinit` has a race condition: it reads aggregations without synchronization.
- `Flushable` conformance is misused: the protocol is for draining async work, not sending data.

Tests:
- Replace `Thread.sleep` with `XCTestExpectation` to avoid flaky tests.
- Consider consolidating low-value tests (default value checks, single-assertion tests, repetitive `runtimeDefaultUsed` tests).

Checklist:
- CHANGELOG entry
```swift
private let maxAggregations: Int
private let dateProvider: any DateProvider
private let featureScope: any FeatureScope
private let queue = DispatchQueue(label: "com.datadoghq.flags.evaluation-aggregator")
```
This serial dispatch queue is being used purely for synchronization, which is overkill for protecting a dictionary. The codebase already provides `ReadWriteLock`, a `pthread_rwlock_t`-based property wrapper optimized for this exact use case.

The current approach has unnecessary overhead:
- `DispatchWorkItem` allocation on every `recordEvaluation` call
- GCD scheduler overhead for trivial dictionary operations
- The `.barrier` flag in `deinit` is meaningless on a serial queue
- `queue.sync` in `flush()` introduces deadlock risk

Suggested change:

```swift
@ReadWriteLock private var aggregations: [AggregationKey: AggregatedEvaluation] = [:]
```

This is a recurring pattern. If this code was AI-generated, please update the agent's instructions to emphasize following existing codebase conventions for concurrency (see CLAUDE.md, which mentions `ReadWriteLock`). The agent should search for existing patterns before introducing new synchronization mechanisms.
Thanks for the guidance! I saw both patterns (DispatchQueue and @ReadWriteLock) and wasn't sure which one to use. The new CLAUDE.md is helpful as a best practices guide for humans too 😆. Updated: de532eb
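For reference, a minimal sketch of the resulting pattern. The types are reduced to essentials, and it assumes `ReadWriteLock` exposes a `mutate(_:)` method that runs its closure under the exclusive lock:

```swift
import Foundation
import DatadogInternal

// Reduced, illustrative types; the PR's real aggregation key also carries
// variant, allocation, targeting key, error message, and context hash.
struct AggregationKey: Hashable {
    let flagKey: String
}

struct AggregatedEvaluation {
    var evaluationCount: Int
    var firstEvaluation: Date
    var lastEvaluation: Date
}

final class EvaluationAggregator {
    // pthread_rwlock_t-based property wrapper from DatadogInternal.
    @ReadWriteLock
    private var aggregations: [AggregationKey: AggregatedEvaluation] = [:]

    func recordEvaluation(for key: AggregationKey, at date: Date = Date()) {
        // Assumed mutate(_:) API: the read-modify-write happens atomically
        // under the write lock, avoiding a check-then-set race.
        _aggregations.mutate { aggregations in
            if aggregations[key] != nil {
                aggregations[key]?.evaluationCount += 1
                aggregations[key]?.lastEvaluation = date
            } else {
                aggregations[key] = AggregatedEvaluation(
                    evaluationCount: 1,
                    firstEvaluation: date,
                    lastEvaluation: date
                )
            }
        }
    }
}
```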
```swift
// Timer uses [weak self], so it becomes a no-op after deallocation.
// Copy values before async dispatch - self will be deallocated before the closure runs.
let scope = featureScope
let snapshot = aggregations
```
aggregations is read here without synchronization, before the async dispatch. If there's a pending recordEvaluation block in the queue, this is a data race. Using ReadWriteLock would fix this.
```swift
    }
}

extension EvaluationAggregator: Flushable {
```
The `Flushable` protocol is documented as "Awaits completion of all asynchronous operations"; other implementations use `queue.sync {}` with an empty block just to drain pending work.
This implementation uses it to send buffered data, which is a different concern. With `ReadWriteLock`, there are no async operations to await, which makes the `Flushable` conformance misleading. Consider removing the conformance and using a dedicated method like `sendAggregatedEvaluations()` for the timer.
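A sketch of the suggested shape, with the timer wired to a dedicated method rather than `Flushable.flush()` (simplified; the send body is elided):

```swift
import Foundation

final class EvaluationAggregator {
    private var flushTimer: Timer?

    func startFlushTimer(interval: TimeInterval) {
        // The periodic flush calls the dedicated, descriptively named method.
        flushTimer = Timer.scheduledTimer(withTimeInterval: interval, repeats: true) { [weak self] _ in
            self?.sendAggregatedEvaluations()
        }
    }

    func sendAggregatedEvaluations() {
        // Drain the aggregated evaluations and write one batched event.
        // (Elided; the real implementation writes via the feature scope.)
    }
}
```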
Removed the Flushable conformance, and the timer and tests now call sendEvaluations() directly! Since all the method calls look like aggregator.sendEvaluations() it seemed self-evident that the evaluations would be aggregated, so I chose to not rename the method: b0e3095
After removing Flushable from EvaluationAggregator, I discovered that it is useful for flushing evaluation data without crashing on shutdown: calling sendEvaluations() in deinit caused an "exclusivity conflict", because deinit runs inside stop() while it's modifying the features dictionary. I described the conflict a little more in #2646 (comment)
Flushable.flush(), on the other hand, runs prior to .stop(), so it acts as a pre-shutdown hook that we can use to clear the ReadWriteLock dictionary; I've added it here: 74c7257
It's possible that I'm still misunderstanding Flushable's purpose, so if there are other shutdown-related hooks that I haven't yet considered, I am happy to try those instead.
```swift
)

// Then
Thread.sleep(forTimeInterval: 0.1)
```
Thread.sleep for async synchronization in tests is fragile and can lead to flaky tests, especially under CI load. Prefer using XCTestExpectation with wait(for:timeout:) to properly synchronize on async completion.
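For illustration, the expectation-based pattern looks like this; the async completion hook here is hypothetical, standing in for whatever signals that the send finished:

```swift
import XCTest

final class AggregatorTimingTests: XCTestCase {
    func testEvaluationsAreSent() {
        let sent = expectation(description: "aggregated evaluations sent")

        // Hypothetical async send with a completion hook; fulfill only once
        // the work has actually finished.
        DispatchQueue.global().async {
            // ... perform the send under test ...
            sent.fulfill()
        }

        // Waits only as long as needed, and fails with a clear timeout
        // message instead of racing against a fixed Thread.sleep interval.
        wait(for: [sent], timeout: 1.0)
    }
}
```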
Actually, since the ReadWriteLock operations are synchronous, I can remove these Thread.sleep calls! The previous DispatchQueue approach for managing aggregations was more complicated to test. 832e171
The test coverage is extensive but shows signs of over-testing. Consider consolidating:
- Tests for trivial default values (`testDefaultEvaluationFlushInterval`, `testEvaluationLoggingEnabledByDefaultAndCanBeDisabled`) provide minimal value.
- The three `testRuntimeDefaultUsed*` tests could be one parameterized test.
- `testTimestampEqualsFirstEvaluation` is a single assertion that could be part of another test.

Focus on testing behavior, not spec checkboxes. A smaller set of well-designed tests is easier to maintain and provides better signal when they fail.
In this case, I went with a spec checkbox-style of testing to help ensure we stay aligned in behavior across SDKs. I saw a similar pattern in the OpenFeature SDKs and found it useful for understanding each SDK's approach to aligning with the specs, e.g. Ruby does context "Requirement 1.1.2.1".
That said, I haven't seen this among the Datadog SDKs, so I see how it's an unusual pattern. I applied your recommended consolidation but kept the spec-to-test mapping (I'm open to removing and consolidating further if it still feels unwieldy to maintain): e55c9ef
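For illustration, the spec-mapped style in Swift could look roughly like this; names are hypothetical, not the PR's actual tests:

```swift
import XCTest

// Spec-mapped test naming, analogous to the Ruby SDK's
// `context "Requirement 1.1.2.1"` blocks.
final class EvaluationLoggingSpecTests: XCTestCase {
    // EVALLOG.8: variant/allocation are omitted for runtime defaults.
    func testEVALLOG8_omitsVariantAndAllocationForRuntimeDefaults() { /* ... */ }

    // EVALLOG.13: runtimeDefaultUsed is omitted when false.
    func testEVALLOG13_omitsRuntimeDefaultUsedWhenFalse() { /* ... */ }
}
```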
Force-pushed from 9254d27 to 7921cc8
```swift
deinit {
    flushTimer?.invalidate()
    // Note: We don't flush here due to exclusivity conflict with DatadogCore.stop()
```
This is the conflict I'm referring to.
Also saw it on CI, e.g. here, where FlagsRUMIntegrationTests.testWhenFlagIsEvaluated_itAddsFeatureFlagToRUMView() failed due to the test crashing:

Test Suite 'FlagsRUMIntegrationTests' started at 2026-01-31 08:28:26.829.
note: [DatadogSDKTesting] Crash detected! Exiting...

During SDK teardown, if we call sendEvaluations() here, an exclusivity conflict occurs: EvaluationAggregator.deinit tries to flush evaluations while DatadogCore.stop() is modifying the features dictionary. The chain is:
- deinit → sendEvaluations() → eventWriteContext → feature(named:type:) → reads features[name]
- Meanwhile, stop() is writing features = [:]
So I ended up bringing back the Flushable conformance, so we have access to the Flushable.flush() pre-shutdown hook:

dd-sdk-ios/DatadogCore/Sources/Core/DatadogCore.swift
Lines 597 to 598 in f9119f5

```swift
func flushAndTearDown() {
    flush()
    ...
    stop()
}
```

where flush() is called before .stop().

```swift
func flush() {
    evaluationAggregator?.sendEvaluations()
}
```
Most Flushable implementations use this pattern to drain a DispatchQueue:

```swift
func flush() {
    queue.sync {}
}
```

The implementation here differs because EvaluationAggregator uses @ReadWriteLock, whose operations are all synchronous. Since sendEvaluations() is synchronous, calling it directly in flush() achieves the same goal.
Commit messages:
- Implements EVAL.1-14 specs with client-side aggregation and batched JSON format
- Adds EvaluationAggregator with 3 flush triggers (10s timer, 1000 items, shutdown)
- Introduces public API: trackEvaluations (default: true), customEvaluationEndpoint
- Tracks ALL flag evaluations via separate FlagsEvaluationFeature to /api/v2/flagevaluation

…ead.sleep:
- Merge EVALLOG.10 assertion into testTracksAggregationFields
- Combine EVALLOG.4 and EVALLOG.12 default config tests
- Consolidate three EVALLOG.13 runtime_default_used tests
- Remove Thread.sleep from test helper method

Flushing from deinit conflicts with DatadogCore.stop() modifying the features dictionary. Evaluations since the last timer flush are lost on shutdown (EVALLOG.4 compliance gap documented in tests).
Force-pushed from ff8f01a to 38b1172
Thank you for the thoughtful review @gonzalezreal ! I left some comments to explain my implementation decisions (particularly why I continued to use Flushable).
Force-pushed from 38b1172 to 1e84ef7
gonzalezreal left a comment
Thanks for addressing the previous feedback!
The move to @ReadWriteLock is exactly what I was hoping for, and I appreciate the detailed explanation of the exclusivity conflict that led you back to Flushable. Good catch.

A few minor things, none of which are blockers:
- Test naming uses a different style than the SDK's typical testGivenWhenThen pattern. I am OK with it if you prefer the spec-aligned approach you mentioned; just flagging for consistency.
- Small nit: the hardcoded "PROVIDER_NOT_READY", "FLAG_NOT_FOUND", "TYPE_MISMATCH" strings could be constants to avoid typos, but not a big deal.
- AnyValue: Hashable is now part of the public API. Totally fine, just making sure it's intentional.

Otherwise, this looks good to me!
```swift
for: key,
assignment: defaultAssignment,
evaluationContext: .empty,
flagError: "PROVIDER_NOT_READY"
```
🤏🏽 minor: Consider extracting these to constants.

```swift
private enum EvaluationError {
    static let providerNotReady = "PROVIDER_NOT_READY"
    static let flagNotFound = "FLAG_NOT_FOUND"
    static let typeMismatch = "TYPE_MISMATCH"
}
```

Notice that the enum above is just used as a namespace.
All good points!
The changes dismissed your approval, so please take a look again 🙏 @gonzalezreal
…/emit-flagevaluation-evp-track

/merge
What and why?
Introduces evaluation logging for feature flags.

Evaluation logging vs exposure logging:
- Exposure logging covers only flags marked for logging (doLog=true). It is used for experimentation metrics and analysis, where you need to know which users were exposed to which variants.
- Evaluation logging covers all flag evaluations, regardless of doLog. It is for operational visibility: monitoring and debugging.

Evaluation data is aggregated (count, first/last timestamps) rather than logged individually, to minimize overhead.
Useful references:
How?
- EvaluationAggregator: aggregates flag evaluations by a composite key (flag key, variant key, allocation key, targeting key, error message, and context hash). Tracks evaluationCount, firstEvaluation, and lastEvaluation timestamps per unique combination.
- Aggregations are flushed on a timer (evaluationFlushInterval); per the commit notes above, flushes are also triggered by an item cap and on shutdown.
- FlagEvaluationEvent: event model matching the EVP schema. Fields like variant, allocation, and runtimeDefaultUsed are conditionally included based on EVALLOG spec requirements (e.g., variant/allocation omitted for runtime defaults per EVALLOG.8, runtimeDefaultUsed omitted when false per EVALLOG.13); see the sketch after this list.
- FlagsClient records evaluations to the aggregator for all flag resolutions when evaluation logging is enabled, regardless of the doLog flag (unlike exposure logging).
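A minimal sketch of how that conditional inclusion can be expressed with Encodable; the field set is reduced for illustration and this is not the PR's actual type:

```swift
import Foundation

// Illustrative subset of the event model described above.
struct FlagEvaluationEvent: Encodable {
    let key: String
    let variant: String?
    let allocation: String?
    let runtimeDefaultUsed: Bool
    let evaluationCount: Int

    enum CodingKeys: String, CodingKey {
        case key, variant, allocation, runtimeDefaultUsed, evaluationCount
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(key, forKey: .key)
        try container.encode(evaluationCount, forKey: .evaluationCount)
        // EVALLOG.8: variant/allocation are omitted for runtime defaults.
        if !runtimeDefaultUsed {
            try container.encodeIfPresent(variant, forKey: .variant)
            try container.encodeIfPresent(allocation, forKey: .allocation)
        }
        // EVALLOG.13: runtimeDefaultUsed is omitted when false.
        if runtimeDefaultUsed {
            try container.encode(runtimeDefaultUsed, forKey: .runtimeDefaultUsed)
        }
    }
}
```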
Review checklist
- Run make api-surface when adding new APIs