Add cassandra profile command with async-profiler pass-through to Pyroscope#565
Add cassandra profile command with async-profiler pass-through to Pyroscope#565rustyrazorblade wants to merge 7 commits intomainfrom
cassandra profile command with async-profiler pass-through to Pyroscope#565Conversation
…oscope - Add flamegraph-to-pyroscope packer script that runs asprof and pushes collapsed stacks to Pyroscope via HTTP ingest API - Add flamegraph-cassandra.service SystemD one-shot unit for default 30s profiles - Add install_flamegraph_service.sh packer install script - Add Event.Profiling.Starting and Event.Profiling.Complete events - Add ProfilingService and DefaultProfilingService - Add Profile parent command and ProfileCassandra subcommand - ProfileCassandra uses @unmatched to pass all extra args directly to asprof Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
|
test |
Code Coverage
|
Code Review - Full Review Posted Below |
|
Part 1: Shell argument injection - The @unmatched arguments are joined into a raw command string sent over SSH. A user-supplied value like semicolon-rm-rf-slash would execute on the remote host. Args should be validated or risk documented. |
|
Part 2: No tests - ProfilingService has real logic: it emits events, builds a command string, calls remote ops, and propagates errors via Result. Per project conventions, non-trivial code requires tests. A DefaultProfilingServiceTest could verify events order, Result.failure propagation, and command string assembly. |
|
NOTE: Ignore the partial comments below -- full review follows above (comment was split due to tooling constraints). |
|
Part 3: Missing Event.Profiling.Error - runCatching returns Result.failure with no error event emitted. The caller calls .getOrThrow(), so failures surface as unformatted stack traces. Add Event.Profiling.Error(host, message) and emit it in the catch path. |
|
Part 4: error() bypasses event bus - Per CLAUDE.md all user-facing output must go through eventBus.emit(). error() throws IllegalStateException with a raw exception message. This should be an Event.Profiling.Error or shared cluster-level event. Part 5: Sequential execution - profiling blocks per node (default 30s), 3+ minutes on a 6-node cluster. Consider parallel execution as done elsewhere in the codebase. |
|
Part 6: Shell script unquoted variable in trap - OUTFILE is expanded at trap definition time when using double quotes. Use single quotes so the variable is expanded at trap execution time. Part 7: flamegraph-cassandra.service missing [Install] section - no [Install] stanza means systemctl enable silently does nothing. Add a comment if exclusive CLI trigger is intended. Minor: install_flamegraph_service.sh uses emoji in echo output (inconsistent with other scripts). Profile parent command KDoc lists subcommands inline which will drift; the annotation is the authoritative list. |
Code ReviewOverall this is a clean, well-structured addition. Layering is right (Command to Service to RemoteOps), events are properly typed, Koin registration follows established patterns. A few issues worth addressing before merge: 1. Shell argument injection (ProfilingService.kt lines 39-40) 2. No tests 3. Missing Event.Profiling.Error event 4. error() in ProfileCassandra bypasses the event bus 5. Sequential execution 6. Shell script: unquoted variable in trap 7. flamegraph-cassandra.service missing [Install] section Minor: install_flamegraph_service.sh uses an emoji in echo output, inconsistent with other install scripts. The Profile parent command KDoc lists subcommands inline -- this will drift as more are added; the subcommands annotation is the authoritative list. |
|
@claude |
|
Claude finished @rustyrazorblade's task in 2m 54s —— View job Restructuring
|
Restructures from top-level `profile cassandra` to `cassandra profile`. Removes the `profile/` package; Profile command now lives in `commands/cassandra/` alongside other Cassandra subcommands. Each db type can add its own `profile` subcommand independently as profiling support is added. Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
profile cassandra command with async-profiler pass-through to Pyroscopecassandra profile command with async-profiler pass-through to Pyroscope
|
Code Review: Overall this is a clean well-structured addition. A few issues worth addressing before merging. Bug: curl silently swallows HTTP errors - curl -s does not fail on non-2xx responses so a 400 or 500 still exits 0. Add -f flag. Security: Shell injection in ProfilingService.kt - user args are joined and interpolated into remote command string without sanitization. Missing Pyroscope timestamp parameters - ingest URL needs from/until Unix seconds. APP_NAME does not distinguish nodes - all nodes share the same label. No tests for ProfilingService. Sequential profiling scales poorly for large clusters. Minor: [Install] section missing from systemd unit. Minor: profileCassandra method name should be profileNode or profileDatabase per CLAUDE.md naming guidance. |
|
@claude address the review feedback |
|
@claude address the review feedback and make sure this runs as part of bin/end-to-end-test |
|
Claude finished @rustyrazorblade's task in 6m 50s —— View job Review Feedback AddressedAll items from the code review have been fixed in commit Shell script (
|
- flamegraph-to-pyroscope: add -f to curl (fail on HTTP errors), fix trap quoting (single quotes for deferred expansion), add from/until timestamps to Pyroscope ingest URL, include hostname in APP_NAME for per-node profiles - flamegraph-cassandra.service: add comment explaining missing [Install] section - install_flamegraph_service.sh: remove emoji from echo output - Event.kt: add Event.Profiling.Error(host, message) with isError=true - ProfilingService: rename profileCassandra -> profileNode per CLAUDE.md naming guidance, shell-quote args before building remote command to prevent injection, emit Event.Profiling.Error on failure - Profile.kt: replace error() call with eventBus.emit(Event.Profiling.Error), inject HostOperationsService and run profiling in parallel across nodes - ProfilingServiceTest: add tests for event ordering, error propagation, command assembly, and shell-quoting behaviour - bin/end-to-end-test: add step_cassandra_profile step (-d 10) under --cassandra flag Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
|
PR Review: cassandra profile command Overall this is a clean, well-structured addition. The layered design (shell script to service to Kotlin service to command) is solid, events follow project conventions, and the tests cover meaningful behavior. A few things to address: Security: Shell command construction in ProfilingServiceThe shellQuote() implementation is correct and handles single-quote escaping and the empty-string case properly. However, if RemoteOperationsService.executeRemotely accepts an argument list rather than a shell string, prefer that -- it eliminates the quoting step entirely and removes the risk of edge-case gaps in the allowlist. Worth checking whether that API surface exists or could be added. If the string approach stays, the allowlist regex [a-zA-Z0-9_./:=@%+,-]+ is reasonable for asprof args. Minor readability: the hyphen is fine at the end of the character class, but moving it to the front would make intent clearer. Duration inconsistency between CLI and SystemD unitflamegraph-cassandra.service hardcodes -d 30, but when the CLI is invoked as cassandra profile with no args the duration falls back to asprof built-in default (10 seconds). A user running cassandra profile with no -d flag gets a 10 s profile; the standalone service gives 30 s. Consider:
Result from profileNode is not used in Profile.execute()The Result is discarded -- errors are handled via event emission in onFailure, which follows project conventions. But if every node fails the command exits 0 with no final summary. Consider emitting a summary event when any nodes failed, so the user knows the overall outcome without scanning per-node output. TestsThe test suite is solid -- it covers event ordering, command string assembly, shell quoting, and error propagation. Mocking RemoteOperationsService is appropriate given the real SSH side effects. One potential fragility: additionalTestModules() is called during Koin setup (base-class @beforeeach), which runs before setupMocks() sets mockRemoteOps. The single lambda works correctly only because Koin evaluates single lazily and the first get() call happens after setupMocks(). This is fine in practice but worth a brief comment to make the ordering assumption explicit. Minor
|
|
@claude it doesn't look like this will work for continuous profiling. Script uploads after everything is done. I need this to be started then run for hours, and get the data pushed to pyroscope once a minute |
|
Claude finished @rustyrazorblade's task in 7m 18s —— View job Redesigned for Continuous ProfilingAll tasks complete — commit
What changed
Command structure —
|
…yroscope uploads - flamegraph-to-pyroscope: rewrite as infinite loop (profile INTERVAL seconds, upload to Pyroscope, repeat until SIGTERM). Interval and extra asprof args configurable via /etc/default/flamegraph-cassandra. - flamegraph-cassandra.service: change from Type=oneshot to Type=simple with Restart=on-failure so the service stays running until explicitly stopped. - cassandra profile: convert from leaf command to parent command with start/stop subcommands (cassandra profile start / cassandra profile stop). - ProfilingService: replace profileNode with startProfiling (writes env file + systemctl start) and stopProfiling (systemctl stop). - Event.kt: replace Profiling.Complete with Started/Stopping/Stopped events. - ProfilingServiceTest: update tests for new start/stop interface. - bin/end-to-end-test: update step to start, wait 65s for one cycle, then stop. Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
PR Review: Add
|
|
@claude address the review feedback |
|
Claude finished @rustyrazorblade's task in 6m 21s —— View job Review Feedback AddressedAll items from the latest code review have been fixed in commit
Mutable list in parallel ops (low)
Trivial test removed (low)
E2E hardcoded sleep (low)
|
- ProfilingService now extends SystemDServiceManager via AbstractSystemDServiceManager, gaining isRunning/getStatus/restart for free; stop() replaces stopProfiling() - ProfileStart: add --interval option so upload cycle duration is configurable; pass profilerArgs.toList() to avoid sharing mutable list with parallel coroutines - flamegraph-to-pyroscope: wrap curl with || echo so a transient Pyroscope outage doesn't kill the profiling loop under set -e - ProfilingServiceTest: remove trivial isError() constant test; add coverage for --interval env-file write (alone and combined with args) - end-to-end-test: use cassandra profile start --interval 5 and sleep 15 instead of hardcoded 65-second wait Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
There was a problem hiding this comment.
This is not using continuous mode correctly. It should call async profiler with --loop as per the doc:
https://github.com/async-profiler/async-profiler/blob/master/docs/ProfilerOptions.md
There was a problem hiding this comment.
Claude finished @rustyrazorblade's task in 4m 23s —— View job
Fixed in commit c33db2d. Here's what changed in flamegraph-to-pyroscope:
Before: The script called asprof -d $INTERVAL in a shell while true loop. This blocked for the full interval each time and didn't use async-profiler's continuous profiling support.
After: asprof is started once with --loop ${INTERVAL}s, which is async-profiler's native continuous profiling mode — it profiles for the interval, writes the output file, resets, and immediately starts the next interval without stopping. The script runs a polling loop to pick up completed files and upload them to Pyroscope.
Key design points:
--loop ${INTERVAL}s -o collapsed -f ${OUTDIR}/profile-%t.collapsed— async-profiler writes one file per interval, each with a unique%ttimestamp in the name- File mtime is used as the
untiltimestamp for Pyroscope;from = mtime - INTERVAL cleanup()sends SIGTERM to asprof and uploads any remaining files before the service exitsEXTRA_ARGSarray now safely handles the case whereFLAMEGRAPH_EXTRA_ARGSis unset (avoids empty element in previousEXTRA_ARGS=($VAR)form)
| Branch
Replace the shell while-loop that called asprof with -d <interval> repeatedly
with asprof's native --loop mode. asprof now runs in the background and
produces one collapsed-stack file per interval autonomously. A polling loop
detects completed files and uploads them to Pyroscope, then removes them.
Key changes:
- asprof invoked with --loop ${INTERVAL}s instead of -d ${INTERVAL} in a shell loop
- asprof runs as a background process; polling loop uploads completed files
- cleanup() stops the background asprof and uploads any remaining files on exit
- File mtime used as the interval end timestamp; from = mtime - INTERVAL
- EXTRA_ARGS array handled safely when FLAMEGRAPH_EXTRA_ARGS is unset
Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
|
@claude like other services, this should have a start / stop and since it uses the loop, run in the background. The end to end test should do a 5 second profile only, using |
|
Claude finished @rustyrazorblade's task in 2m 44s —— View job Changes applied
Commit
The |
Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
Adds a
profile cassandracommand that runs async-profiler on all Cassandra nodes and sends flamegraph data to Pyroscope.Changes
packer/cassandra/bin/flamegraph-to-pyroscope: script wrapping asprof + Pyroscope HTTP ingestpacker/cassandra/services/flamegraph-cassandra.service: SystemD one-shot unit for standalone usepacker/cassandra/install/install_flamegraph_service.sh: packer install scriptEvent.Profiling.*events added to Event.ktProfilingServiceandDefaultProfilingServiceprofile cassandra [asprof-options]command with full pass-through via @unmatchedCloses #159
Generated with Claude Code