composewell · harendra-kumar · May 10, 2026 · May 8, 2026 · May 8, 2026 · May 8, 2026
diff --git a/README.md b/README.md
@@ -49,9 +49,10 @@ With these tools you can find:
 
 ## Detailed documents
 
-For more details on each of the performance analysis methods see the following
-documents:
-* [GHC RTS Stats](docs/ghc-rts-performance-analysis.md)
-* [threadCPUTime# RTS primitive](docs/thread-cputime-primop.md)
-* [GHC Event logging](docs/eventlog-performance-analysis.md)
+For more details on performance measurement and each of the performance
+analysis methods see the following documents:
+* [Haskell performance measurement](docs/haskell-perf-measurement.md)
+* [Using RTS Stats](docs/ghc-rts-performance-analysis.md)
+* [Using threadCPUTime RTS primitive](docs/thread-cputime-primop.md)
+* [Using Event log based measurement](docs/eventlog-performance-analysis.md)
 * [GHC patches details](dev/ghc-work.md)
diff --git a/docs/ghc-rts-performance-analysis.md b/docs/ghc-rts-performance-analysis.md
@@ -26,46 +26,7 @@ threads) spent between two points, but we cannot tell which Haskell
 thread spent how much time or how much time was actually spent by the
 instructions between those two points.
 
-## Components of a Haskell Process
-
-* An OS level process
-* Multiple OS level threads in the OS process
-* Multiple Haskell green threads that are scheduled on the OS threads. Haskell
-  threads can run on any of the available OS threads every time it is ready to
-  run.
-
-## A prototypical program
-
-[This example](examples/console-loop-multi-thread.hs) is a
-a simple yet comprehensive program to understand different components of
-performance analysis and stats. You can play with this to understand how things
-work, how the stats add up and what they mean.
-
-## How many OS threads do we have?
-
-To see how many OS threads a Haskell process is using on Linux.  Run
-[this example](examples/console-loop-multi-thread.hs), note its pid
-printed in the output.  All of its OS threads can be printed by:
-```
-ls /proc/<PID>/task
-```
-
-Even when compiled without the `-threaded` option we might see two threads
-because the RTS still uses a separate thread for GC and for forking async
-cleanup threads via GC.
-
-One of the tasks will have the same pid as the process pid, this is the main OS
-thread. You can try changing the number of capabilities using +RTS -N and see
-the effect.
-
-GHC may also use independent OS threads for ffi, for GC, for IO manager,
-however it will guarantee that only as many user threads can run at a time as
-specified with the -N rts option.
-
-Usually we see 3 threads plus 2 threads per capability when compiled
-with `-threaded` option.
-
-## GHC RTS stats
+## getRTSStats
 
 The getRTSStats call gives us the CPU time (essentially get_clocktime
 or getrusage under the hood to get the CPU time) of the process and
@@ -121,40 +82,6 @@ in this category.
 Note that the GC cpu time should be computed by adding the `gc_cpu_ns`
 and `nonmoving_gc_cpu_ns` when the non-moving gc is enabled.
 
-## Variability of Measurements
-
-Performance measurement is tricky and there are many factors to take care of if
-you want to get reliable results:
-
-* Disable CPU frequency scaling, can cause run-to-run or variability in the
-  same run.
-* Do not run other things on the same machine. interrupts, kernel
-  activity, background daemons can also affect:
-  * Memory contention can affect the measurement.
-  * cache effects due to context switching can affect it.
-* Discard first runs, first runs are usually outliers because of warm up effects,
-  instruction cache cold, data cache cold, page faults, branch predictor
-  not trained.
-* Use thread affinity. Thread migration to another CPU: causes cache
-  invalidation, different core state, timing noise.
-* Use larger measurements. In smaller one measurement overhead and
-  variance may dominate: timing calls (clock_gettime), counters, RTS
-  stats.
-* Different CPUs running at different frequencies can make the results
-  unpredictable.
-* The clocks of different CPUs may not be perfectly in sync.
-
-To counter the last two factors we should use instruction count or
-allocation count rather than time as a more reliable measure. Even
-the instruction count might vary because of measurement overhead adds
-instruction count, which can vary depending on how many times the thread
-is context switched.
-
-## Haskell specific variability
-
-* Lazy evaluation, may defer work which might get evaluated later in the
-  context of some other measurement window.
-
 ## Using getRTSStats with haskell-perf
 
 In the `haskell-perf` library we do have a convenient way to wrap a

diff --git a/docs/haskell-perf-measurement.md b/docs/haskell-perf-measurement.md
@@ -0,0 +1,200 @@
+## Components of a Haskell Process
+
+* An OS level process
+* Multiple OS level threads within the OS process
+* Multiple Haskell green threads that are scheduled on the OS threads. Haskell
+  threads can run on any of the available OS threads every time it is ready to
+  run.
+
+## How many OS threads do we have?
+
+To see how many OS threads a Haskell process is using on Linux.  Run
+[this example](test/snippets/console-loop-multi-thread.hs), note its pid
+printed in the output.  All of its OS threads can be printed by:
+```
+ls /proc/<PID>/task
+```
+
+## OS threads without -threaded option
+
+Even when compiled without the `-threaded` option we can see two OS
+threads running (ghc 9.10.3 on Linux) because the RTS uses one more
+thread for some other async tasks.  But only one OS thread is used for
+running the user program.
+
+Not using the -threaded option may be more CPU efficient but it can
+lead to higher latencies because of FFI calls blocking the thread. When
+compiled with -threaded we can run with a single capability using -N1
+and take adavantage of offloading FFI calls to other threads.
+
+One of the tasks will have the same pid as the process pid, this is the
+main OS thread used for executing Haskell threads.
+
+## OS threads with -threaded build and -N1 RTS option
+
+When compiled with -threaded option, GHC may also use multiple independent OS
+threads for ffi, for GC, for IO manager.
+
+With -N1 even though only one capability is used for running the Haskell
+code we may see more threads being used by the RTS.  Using ghc 9.10.3 on
+Linux we see five OS threads running with -N1.
+
+## OS threads with -threaded build
+
+You can try changing the number of capabilities using +RTS -N and see
+the effect.  Usually we can see 3 threads plus 2 threads per capability
+when compiled with `-threaded` option.  However, GHC guarantees that
+only as many user threads can run at a time as specified with the -N rts
+option.
+
+## Process Level View
+
+If we consider the entire Haskell process, the elapsed wall-clock time
+of each OS thread consists of:
+
+* CPU execution time,
+* time spent runnable but waiting on the OS scheduler,
+* time spent blocked waiting for I/O or synchronization events.
+
+When a process uses multiple OS threads, `ProcessCPUTime` may exceed elapsed
+wall-clock time because multiple threads can execute simultaneously on
+different CPUs.
+
+As a result, `ProcessCPUTime` is generally not suitable as a delta metric for
+measuring the CPU consumption of a specific piece of user code in a
+multithreaded program. In such cases, `ThreadCPUTime` is usually more
+appropriate, since it measures CPU usage attributable to a single OS thread.
+In a single-threaded process, however, `ProcessCPUTime` can safely be used for
+this purpose.
+
+`ProcessCPUTime` is nevertheless useful as an aggregate CPU utilization
+metric. Its value can approach elapsed wall-clock time multiplied by the
+degree of CPU parallelism available to the process. Lower values indicate
+that the process spent more time not executing, for example waiting for
+scheduling, synchronization, I/O, or other runtime stalls.
+
+Useful reporting metric:
+
+* Total Elapsed time
+* Entire process CPU Time
+    * User time (getrusage)
+    * System time (getrusage)
+* For each OS thread - ThreadCPUTime
+* Overall rusage stats for the process
+* OS Sched run-queue wait time (using sched_wakeup, sched_switch trace events
+  via  perf, libperf, libtraceevent)
+
+The total elapsed time for any OS thread can be decomposed as:
+```
+elapsed =
+    on_cpu
+  + runnable_wait
+  + off_cpu_wait(reason)
+```
+
+```
+reason in {
+    futex,
+    disk_io,
+    network_io,
+    epoll,
+    sleep,
+    paging,
+    pipe_wait,
+    signal_wait,
+    ...
+}
+```
+
+## Useful Metrics
+
+Let's consider the following cases.
+
+### Single Capability, Single Haskell Thread
+
+This is the simplest execution model for performance analysis. In this
+configuration, a single OS thread executes the measured code. If there
+is also only one Haskell thread, then when that thread yields there is
+no other Haskell thread that can run on the same capability. Likewise,
+no parallel runtime worker executes on that capability. Aside from brief
+RTS scheduler bookkeeping, the only substantial additional work that may
+execute on that capability is garbage collection or foreign-function
+(FFI) code.
+
+In this model, the OS `ThreadCPUTime` delta can be used to measure the
+total CPU time consumed by the execution thread between two points. This
+measurement includes both user-code execution and RTS activity such as
+garbage collection and scheduler overhead. To estimate the CPU time
+spent in user code alone, the CPU time attributable to RTS activity must
+be subtracted. Since there is no other Haskell thread, this entire CPU
+time can be attributed to the single Haskell thread.
+
+The CPU time between two points on the OS thread (GHC capability) can be
+decomposed as follows:
+
+* User-code execution time
+* CPU time spent in FFI code
+* Haskell GC CPU time
+* RTS scheduler and bookkeeping CPU time
+
+### Single Capability, Multiple Haskell Threads
+
+In this configuration, `ThreadCPUTime` for the OS thread can no longer be
+used to directly measure the CPU time consumed by a particular Haskell
+thread. A Haskell thread may yield execution between the measurement
+points, allowing another Haskell thread to run on the same capability and
+consume CPU time on the same OS thread.
+
+To measure CPU usage attributable to an individual Haskell thread, the
+measurement must instead be performed at the RTS level. This can be done
+using RTS eventlog tracing or by recording timestamps when the Haskell
+thread is scheduled onto and descheduled from a capability.
+
+The mutator CPU time reported by GHC is the cumulative CPU time spent
+outside garbage collection. It therefore includes both Haskell thread
+execution time and RTS overhead incurred during mutator execution, such
+as scheduler bookkeeping and allocation management.
+
+If we separately measure the cumulative CPU time attributable to
+individual Haskell threads, then the difference between the total
+mutator CPU time and the cumulative Haskell thread CPU time provides an
+estimate of RTS overhead excluding garbage collection.
+
+### Multiple capabilities
+
+Same as above applies in this case as well.
+
+## Variability of Measurements
+
+Performance measurement is tricky and there are many factors to take care of if
+you want to get reliable results:
+
+* Disable CPU frequency scaling, can cause run-to-run or variability in the
+  same run.
+* Do not run other things on the same machine. interrupts, kernel
+  activity, background daemons can also affect:
+  * Memory contention can affect the measurement.
+  * cache effects due to context switching can affect it.
+* Discard first runs, first runs are usually outliers because of warm up effects,
+  instruction cache cold, data cache cold, page faults, branch predictor
+  not trained.
+* Use thread affinity. Thread migration to another CPU: causes cache
+  invalidation, different core state, timing noise.
+* Use larger measurements. In smaller one measurement overhead and
+  variance may dominate: timing calls (clock_gettime), counters, RTS
+  stats.
+* Different CPUs running at different frequencies can make the results
+  unpredictable.
+* The clocks of different CPUs may not be perfectly in sync.
+
+To counter the last two factors we should use instruction count or
+allocation count rather than time as a more reliable measure. Even
+the instruction count might vary because of measurement overhead adds
+instruction count, which can vary depending on how many times the thread
+is context switched.
+
+## Haskell specific variability
+
+* Lazy evaluation, may defer work which might get evaluated later in the
+  context of some other measurement window.
+
diff --git a/examples/traceEventIO.hs b/examples/traceEventIO.hs
@@ -2,13 +2,13 @@
 
 import Control.Monad.IO.Class (MonadIO(..))
 import Debug.Trace (traceEventIO)
-import Foreign.C.Types
-import System.Posix.Signals
+import Foreign.C.Types ( CUInt(..) )
+import GHC.Conc (myThreadId, labelThread)
+import System.Posix.Signals ( blockSignals, fullSignalSet )
 
 foreign import ccall unsafe "unistd.h sleep"
     c_sleep :: CUInt -> IO CUInt
 
-
 {-# INLINE withTracingFlow #-}
 withTracingFlow :: MonadIO m => String -> m a -> m a
 withTracingFlow tag action = do
@@ -17,26 +17,29 @@ withTracingFlow tag action = do
     liftIO $ traceEventIO ("END:" ++ tag)
     pure res
 
+emptyBlock :: IO ()
+emptyBlock = return ()
+
 sleepBlock :: IO ()
 sleepBlock = do
     -- So that signals do not interrupt the sleep
     blockSignals fullSignalSet
     _ <- c_sleep 10
     return ()
 
-{-
-sleepWithEvents :: IO ()
-sleepWithEvents = do
-    traceEventIO "before sleep"
-    sleepBlock
-    traceEventIO "after sleep"
--}
-
-emptyBlock :: IO ()
-emptyBlock = return ()
+{-# INLINE printSumLoop #-}
+printSumLoop :: Int -> Int -> Int -> IO ()
+printSumLoop _ _ 0 = print "All Done!"
+printSumLoop chunksOf from times = do
+    withTracingFlow "SUM" $ print $ sum [from..(from + chunksOf)]
+    printSumLoop chunksOf (from + chunksOf) (times - 1)
 
 main :: IO ()
 main = do
+    tid <- myThreadId
+    labelThread tid "main-thread"
+
     withTracingFlow "EMPTY" emptyBlock
     withTracingFlow "SLEEP" sleepBlock
-    return ()
+    withTracingFlow "LOOP" $ do
+         printSumLoop 10000 1 100
diff --git a/haskell-perf.cabal b/haskell-perf.cabal
@@ -33,8 +33,6 @@ extra-doc-files:
 extra-source-files:
     examples/*.hs
     test/snippets/*.hs
-    lib/Streamly/Metrics/File.hs
-    lib/Streamly/Metrics/Console.hs
 
 source-repository head
     type: git