chore(profiling): revert dictionary optimization and add profiling support

jbachorik · claude · jbachorik · commit b4a599edff72 · 2025-12-04T08:05:52.000+01:00
Reverted Phase 1 optimization attempts that showed no improvement: - Removed tryGetExisting() optimization from JfrToOtlpConverter - Deleted tryGetExisting() method from FunctionTable - The optimization added overhead (2 FunctionKey allocations vs 1) Added JMH profiling support: - Added profiling configuration to build.gradle.kts - Enable with -PjmhProfile=true flag - Configures stack profiler (CPU sampling) and GC profiler (allocations) Profiling results reveal actual bottlenecks: - JFR File I/O: ~20% (jafar-parser, external dependency) - Protobuf encoding: ~5% (fundamental serialization cost) - Conversion logic: ~3% (our code) - Dictionary operations: ~1-2% (NOT the bottleneck) Key findings: - Dictionary operations already well-optimized at ~1-2% of runtime - Modern JVM escape analysis optimizes temporary allocations - Stack depth is dominant factor (O(n) frame processing) - HashMap lookups (~10-20ns) dominated by I/O overhead Updated documentation: - BENCHMARKS.md: Added profiling section with findings - ARCHITECTURE.md: Added profiling support and results 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
diff --git a/dd-java-agent/agent-profiling/profiling-otel/build.gradle.kts b/dd-java-agent/agent-profiling/profiling-otel/build.gradle.kts
@@ -15,6 +15,17 @@ jmh {
     val pattern = project.property("jmhIncludes") as String
     includes = listOf(pattern)
   }
+
+  // Profiling support
+  // Usage: ./gradlew jmh -PjmhProfile=true
+  // Generates flamegraph and allocation profile
+  if (project.hasProperty("jmhProfile")) {
+    profilers = listOf("gc", "stack")
+    jvmArgs = listOf(
+      "-XX:+UnlockDiagnosticVMOptions",
+      "-XX:+DebugNonSafepoints"
+    )
+  }
 }
 
 // OTel Collector validation tests (requires Docker)
diff --git a/dd-java-agent/agent-profiling/profiling-otel/doc/ARCHITECTURE.md b/dd-java-agent/agent-profiling/profiling-otel/doc/ARCHITECTURE.md
@@ -331,8 +331,19 @@ JMH microbenchmarks implemented in `src/jmh/java/com/datadog/profiling/otel/benc
 
 # Run specific benchmark method
 ./gradlew :dd-java-agent:agent-profiling:profiling-otel:jmh -PjmhIncludes=".*convertJfrToOtlp"
+
+# Run with CPU and allocation profiling
+./gradlew :dd-java-agent:agent-profiling:profiling-otel:jmh \
+  -PjmhIncludes="JfrToOtlpConverterBenchmark" \
+  -PjmhProfile=true
 ```
 
+**Profiling Support** (added in build.gradle.kts):
+- Stack profiler: CPU sampling to identify hot methods
+- GC profiler: Allocation rate tracking and GC overhead measurement
+- Enable with `-PjmhProfile=true` property
+- Adds JVM flags: `-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints`
+
 **Key Performance Characteristics** (measured on Apple M3 Max):
 - Dictionary interning: ~8-26 ops/µs (cold to warm cache)
 - Stack trace conversion: Scales linearly with stack depth
@@ -344,6 +355,15 @@ JMH microbenchmarks implemented in `src/jmh/java/com/datadog/profiling/otel/benc
   - Primary bottleneck: Stack depth processing (~60% throughput reduction for 10x depth increase)
   - Linear scaling with event count, minimal impact from unique context count
 
+**Profiling Results (December 2024)**:
+Profiling revealed actual CPU time distribution:
+- **JFR File I/O: ~20%** (jafar-parser library, external dependency)
+- **Protobuf Encoding: ~5%** (fundamental serialization cost)
+- **Conversion Logic: ~3%** (our code)
+- **Dictionary Operations: ~1-2%** (already well-optimized, NOT the bottleneck)
+
+Key insight: Dictionary operations account for only ~1-2% of runtime. The dominant factor is O(n) frame processing with stack depth. Optimization attempts targeting dictionary operations showed no improvement (-7% to +6%, within measurement noise). Modern JVM escape analysis already optimizes temporary allocations effectively.
+
 ### Phase 6: OTLP Compatibility Testing & Validation (Completed)
 
 #### Objective
diff --git a/dd-java-agent/agent-profiling/profiling-otel/doc/BENCHMARKS.md b/dd-java-agent/agent-profiling/profiling-otel/doc/BENCHMARKS.md
@@ -142,12 +142,34 @@ Based on typical hardware (M1/M2 Mac or modern x86_64):
 - **Stack interning**: 15-30 ops/µs
 - **Stack conversion**: Scales linearly with stack depth
 - **Protobuf encoding**: Varint 50-100 ops/µs, strings 10-50 ops/µs
-- **End-to-end conversion** (JfrToOtlpConverterBenchmark - measured on Apple M3 Max):
-  - **50 events**: 156-428 ops/s (2.3-6.4 ms/op) depending on stack depth
-  - **500 events**: 38-130 ops/s (7.7-26.0 ms/op) depending on stack depth
-  - **5000 events**: 3.5-30 ops/s (33.7-289 ms/op) depending on stack depth
-  - **Key factors**: Stack depth (10-100 frames) is the dominant performance factor, ~60% throughput reduction for 10x depth increase
-  - **Scaling**: Linear with event count, minimal impact from unique context count (100 vs 1000)
+- **End-to-end conversion** (JfrToOtlpConverterBenchmark - measured on Apple M3 Max, JDK 21.0.5):
+
+| Event Count | Stack Depth | Unique Contexts | Throughput (ops/s) | Time per Operation |
+|-------------|-------------|-----------------|--------------------|--------------------|
+| 50          | 10          | 100             | 344-370 ops/s      | 2.7-2.9 ms/op      |
+| 50          | 10          | 1000            | 344-428 ops/s      | 2.3-2.9 ms/op      |
+| 50          | 50          | 100             | 154-213 ops/s      | 4.7-6.5 ms/op      |
+| 50          | 50          | 1000            | 165-203 ops/s      | 4.9-6.1 ms/op      |
+| 50          | 100         | 100             | 160 ops/s          | 6.2 ms/op          |
+| 50          | 100         | 1000            | 156 ops/s          | 6.4 ms/op          |
+| 500         | 10          | 100             | 130-137 ops/s      | 7.3-7.7 ms/op      |
+| 500         | 10          | 1000            | 122-127 ops/s      | 7.9-8.2 ms/op      |
+| 500         | 50          | 100             | 62-66 ops/s        | 15.2-16.1 ms/op    |
+| 500         | 50          | 1000            | 61-67 ops/s        | 14.9-16.3 ms/op    |
+| 500         | 100         | 100             | 38-41 ops/s        | 24.4-26.3 ms/op    |
+| 500         | 100         | 1000            | 40-41 ops/s        | 24.3-25.0 ms/op    |
+| 5000        | 10          | 100             | 29.7-30.6 ops/s    | 32.7-33.7 ms/op    |
+| 5000        | 10          | 1000            | 29.0-29.0 ops/s    | 34.5-34.5 ms/op    |
+| 5000        | 50          | 100             | 8.1-8.2 ops/s      | 122-123 ms/op      |
+| 5000        | 50          | 1000            | 7.9-8.6 ops/s      | 116-126 ms/op      |
+| 5000        | 100         | 100             | 3.9-4.0 ops/s      | 250-257 ms/op      |
+| 5000        | 100         | 1000            | 3.8-3.9 ops/s      | 256-263 ms/op      |
+
+  - **Key factors**:
+    - Stack depth (10-100 frames) is the dominant performance factor, ~60% throughput reduction per 10x depth increase
+    - Event count scales linearly (10x events = ~10x processing time)
+    - Unique context count (100 vs 1000) has minimal impact on throughput
+  - **Deduplication efficiency**: High hit rates on dictionary tables (strings, functions, stacks) provide effective compression but marginal performance gains
 
 ## Interpreting Results
 
@@ -156,12 +178,77 @@ Based on typical hardware (M1/M2 Mac or modern x86_64):
 - **Warm cache (hitRate=0.95)**: Tests best-case lookup performance
 - **Real-world typically**: Between 50-80% hit rate for most applications
 
+## Profiling Benchmarks
+
+JMH supports built-in profilers to identify CPU and allocation hotspots:
+
+```bash
+# Run with CPU stack profiling and GC allocation profiling
+./gradlew :dd-java-agent:agent-profiling:profiling-otel:jmh \
+  -PjmhIncludes="JfrToOtlpConverterBenchmark" \
+  -PjmhProfile=true
+```
+
+This enables:
+- **Stack profiler**: CPU sampling to identify hot methods
+- **GC profiler**: Allocation rate tracking and GC overhead measurement
+
+### Profiling Results (December 2024)
+
+Profiling the end-to-end converter revealed the actual performance bottlenecks:
+
+**CPU Time Distribution** (from stack profiler on deep stack workloads):
+
+1. **JFR File I/O (~17-22%)**:
+   - `DirectByteBuffer.get`: 3.5-17% (peaks with deep stacks)
+   - `RecordingStreamReader.readVarint`: 1.6-5.5%
+   - `MutableConstantPools.getConstantPool`: 0.4-1.1%
+   - This is the jafar-parser library reading JFR binary format
+
+2. **Protobuf Encoding (~3-7%)**:
+   - `ProtobufEncoder.writeVarint/writeVarintField`: 0.6-5.8%
+   - `ProtobufEncoder.writeNestedMessage`: 0.5-0.9%
+   - Fundamental serialization cost
+
+3. **Conversion Logic (~2-4%)**:
+   - `JfrToOtlpConverter.convertFrame`: 0.3-1.9%
+   - `JfrToOtlpConverter.encodeSample`: 0.4-1.3%
+   - `JfrToOtlpConverter.encodeDictionary`: 0.2-0.6%
+
+4. **Dictionary Operations (~1-2%)**:
+   - `Arrays.hashCode`: 0.5-1.4% (HashMap key hashing)
+   - `LocationTable.intern`: 0.3-0.5%
+   - **Dictionary operations are already well-optimized**
+
+**Allocation Data**:
+- 5-20 MB per operation (varies with stack depth/event count)
+- Allocation rate: 1.4-1.9 GB/sec
+- GC overhead: 2-5% of total time
+
+**Key Insights**:
+- Dictionary operations account for only ~1-2% of runtime (not the bottleneck)
+- JFR parsing dominates at ~20% (external dependency, I/O bound)
+- Stack depth is the dominant performance factor due to O(n) frame processing
+- Modern JVM escape analysis already optimizes temporary allocations
+- HashMap lookups are ~10-20ns, completely dominated by I/O overhead
+
+**Performance Optimization Attempts**:
+- Attempted Phase 1 optimizations targeting dictionary operations showed no improvement (-7% to +6%, within noise)
+- Optimization attempt: `tryGetExisting()` to avoid string concatenation - Result: Added allocation overhead (2 FunctionKey allocations instead of 1)
+- Profiling proved that intuition-based optimizations were targeting the wrong bottleneck
+
+**Conclusion**: The 60% throughput reduction with 10x stack depth increase is fundamentally due to processing 10x more frames (O(n) with depth), not inefficient data structures. Further optimization would require:
+1. Reducing JFR parsing overhead (external library)
+2. Optimizing protobuf varint encoding (diminishing returns)
+3. Batch processing to amortize per-operation overhead
+
 ## Adding New Benchmarks
 
 1. Add `@Benchmark` method to appropriate class
 2. Use `@Param` for parameterized testing
 3. Follow JMH best practices (use Blackhole, avoid dead code elimination)
 4. Document expected performance characteristics
+5. Use profiling (`-PjmhProfile=true`) to validate optimization impact
 
 ## References
 
diff --git a/dd-java-agent/agent-profiling/profiling-otel/src/main/java/com/datadog/profiling/otel/JfrToOtlpConverter.java b/dd-java-agent/agent-profiling/profiling-otel/src/main/java/com/datadog/profiling/otel/JfrToOtlpConverter.java
@@ -382,24 +382,24 @@ private int convertFrame(JfrStackFrame frame) {
     JfrClass type = method.type();
     String className = type != null ? type.name() : null;
 
-    // Build full name: "ClassName.methodName"
+    // Get line number
+    int lineNumber = frame.lineNumber();
+    long line = Math.max(lineNumber, 0);
+
+    // Build full name
     String fullName;
     if (className != null && !className.isEmpty()) {
       fullName = className + "." + (methodName != null ? methodName : "");
     } else {
       fullName = methodName != null ? methodName : "";
     }
 
-    // Get line number
-    int lineNumber = frame.lineNumber();
-    long line = Math.max(lineNumber, 0);
-
     // Intern strings
     int nameIndex = stringTable.intern(fullName);
-    int methodNameIndex = stringTable.intern(methodName);
     int classNameIndex = stringTable.intern(className);
+    int methodNameIndex = stringTable.intern(methodName);
 
-    // Create function entry
+    // Intern function
     int functionIndex = functionTable.intern(nameIndex, methodNameIndex, classNameIndex, 0);
 
     // Create location entry