Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 6 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,9 +49,10 @@ With these tools you can find:

## Detailed documents

For more details on each of the performance analysis methods see the following
documents:
* [GHC RTS Stats](docs/ghc-rts-performance-analysis.md)
* [threadCPUTime# RTS primitive](docs/thread-cputime-primop.md)
* [GHC Event logging](docs/eventlog-performance-analysis.md)
For more details on performance measurement and each of the performance
analysis methods see the following documents:
* [Haskell performance measurement](docs/haskell-perf-measurement.md)
* [Using RTS Stats](docs/ghc-rts-performance-analysis.md)
* [Using threadCPUTime RTS primitive](docs/thread-cputime-primop.md)
* [Using Event log based measurement](docs/eventlog-performance-analysis.md)
* [GHC patches details](dev/ghc-work.md)
75 changes: 1 addition & 74 deletions docs/ghc-rts-performance-analysis.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,46 +26,7 @@ threads) spent between two points, but we cannot tell which Haskell
thread spent how much time or how much time was actually spent by the
instructions between those two points.

## Components of a Haskell Process

* An OS level process
* Multiple OS level threads in the OS process
* Multiple Haskell green threads that are scheduled on the OS threads. Haskell
threads can run on any of the available OS threads every time it is ready to
run.

## A prototypical program

[This example](examples/console-loop-multi-thread.hs) is a
a simple yet comprehensive program to understand different components of
performance analysis and stats. You can play with this to understand how things
work, how the stats add up and what they mean.

## How many OS threads do we have?

To see how many OS threads a Haskell process is using on Linux. Run
[this example](examples/console-loop-multi-thread.hs), note its pid
printed in the output. All of its OS threads can be printed by:
```
ls /proc/<PID>/task
```

Even when compiled without the `-threaded` option we might see two threads
because the RTS still uses a separate thread for GC and for forking async
cleanup threads via GC.

One of the tasks will have the same pid as the process pid, this is the main OS
thread. You can try changing the number of capabilities using +RTS -N and see
the effect.

GHC may also use independent OS threads for ffi, for GC, for IO manager,
however it will guarantee that only as many user threads can run at a time as
specified with the -N rts option.

Usually we see 3 threads plus 2 threads per capability when compiled
with `-threaded` option.

## GHC RTS stats
## getRTSStats

The getRTSStats call gives us the CPU time (essentially get_clocktime
or getrusage under the hood to get the CPU time) of the process and
Expand Down Expand Up @@ -121,40 +82,6 @@ in this category.
Note that the GC cpu time should be computed by adding the `gc_cpu_ns`
and `nonmoving_gc_cpu_ns` when the non-moving gc is enabled.

## Variability of Measurements

Performance measurement is tricky and there are many factors to take care of if
you want to get reliable results:

* Disable CPU frequency scaling, can cause run-to-run or variability in the
same run.
* Do not run other things on the same machine. interrupts, kernel
activity, background daemons can also affect:
* Memory contention can affect the measurement.
* cache effects due to context switching can affect it.
* Discard first runs, first runs are usually outliers because of warm up effects,
instruction cache cold, data cache cold, page faults, branch predictor
not trained.
* Use thread affinity. Thread migration to another CPU: causes cache
invalidation, different core state, timing noise.
* Use larger measurements. In smaller one measurement overhead and
variance may dominate: timing calls (clock_gettime), counters, RTS
stats.
* Different CPUs running at different frequencies can make the results
unpredictable.
* The clocks of different CPUs may not be perfectly in sync.

To counter the last two factors we should use instruction count or
allocation count rather than time as a more reliable measure. Even
the instruction count might vary because of measurement overhead adds
instruction count, which can vary depending on how many times the thread
is context switched.

## Haskell specific variability

* Lazy evaluation, may defer work which might get evaluated later in the
context of some other measurement window.

## Using getRTSStats with haskell-perf

In the `haskell-perf` library we do have a convenient way to wrap a
Expand Down
200 changes: 200 additions & 0 deletions docs/haskell-perf-measurement.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,200 @@
## Components of a Haskell Process

* An OS level process
* Multiple OS level threads within the OS process
* Multiple Haskell green threads that are scheduled on the OS threads. Haskell
threads can run on any of the available OS threads every time it is ready to
run.

## How many OS threads do we have?

To see how many OS threads a Haskell process is using on Linux. Run
[this example](test/snippets/console-loop-multi-thread.hs), note its pid
printed in the output. All of its OS threads can be printed by:
```
ls /proc/<PID>/task
```

## OS threads without -threaded option

Even when compiled without the `-threaded` option we can see two OS
threads running (ghc 9.10.3 on Linux) because the RTS uses one more
thread for some other async tasks. But only one OS thread is used for
running the user program.

Not using the -threaded option may be more CPU efficient but it can
lead to higher latencies because of FFI calls blocking the thread. When
compiled with -threaded we can run with a single capability using -N1
and take adavantage of offloading FFI calls to other threads.

One of the tasks will have the same pid as the process pid, this is the
main OS thread used for executing Haskell threads.

## OS threads with -threaded build and -N1 RTS option

When compiled with -threaded option, GHC may also use multiple independent OS
threads for ffi, for GC, for IO manager.

With -N1 even though only one capability is used for running the Haskell
code we may see more threads being used by the RTS. Using ghc 9.10.3 on
Linux we see five OS threads running with -N1.

## OS threads with -threaded build

You can try changing the number of capabilities using +RTS -N and see
the effect. Usually we can see 3 threads plus 2 threads per capability
when compiled with `-threaded` option. However, GHC guarantees that
only as many user threads can run at a time as specified with the -N rts
option.

## Process Level View

If we consider the entire Haskell process, the elapsed wall-clock time
of each OS thread consists of:

* CPU execution time,
* time spent runnable but waiting on the OS scheduler,
* time spent blocked waiting for I/O or synchronization events.

When a process uses multiple OS threads, `ProcessCPUTime` may exceed elapsed
wall-clock time because multiple threads can execute simultaneously on
different CPUs.

As a result, `ProcessCPUTime` is generally not suitable as a delta metric for
measuring the CPU consumption of a specific piece of user code in a
multithreaded program. In such cases, `ThreadCPUTime` is usually more
appropriate, since it measures CPU usage attributable to a single OS thread.
In a single-threaded process, however, `ProcessCPUTime` can safely be used for
this purpose.

`ProcessCPUTime` is nevertheless useful as an aggregate CPU utilization
metric. Its value can approach elapsed wall-clock time multiplied by the
degree of CPU parallelism available to the process. Lower values indicate
that the process spent more time not executing, for example waiting for
scheduling, synchronization, I/O, or other runtime stalls.

Useful reporting metric:

* Total Elapsed time
* Entire process CPU Time
* User time (getrusage)
* System time (getrusage)
* For each OS thread - ThreadCPUTime
* Overall rusage stats for the process
* OS Sched run-queue wait time (using sched_wakeup, sched_switch trace events
via perf, libperf, libtraceevent)

The total elapsed time for any OS thread can be decomposed as:
```
elapsed =
on_cpu
+ runnable_wait
+ off_cpu_wait(reason)
```

```
reason in {
futex,
disk_io,
network_io,
epoll,
sleep,
paging,
pipe_wait,
signal_wait,
...
}
```

## Useful Metrics

Let's consider the following cases.

### Single Capability, Single Haskell Thread

This is the simplest execution model for performance analysis. In this
configuration, a single OS thread executes the measured code. If there
is also only one Haskell thread, then when that thread yields there is
no other Haskell thread that can run on the same capability. Likewise,
no parallel runtime worker executes on that capability. Aside from brief
RTS scheduler bookkeeping, the only substantial additional work that may
execute on that capability is garbage collection or foreign-function
(FFI) code.

In this model, the OS `ThreadCPUTime` delta can be used to measure the
total CPU time consumed by the execution thread between two points. This
measurement includes both user-code execution and RTS activity such as
garbage collection and scheduler overhead. To estimate the CPU time
spent in user code alone, the CPU time attributable to RTS activity must
be subtracted. Since there is no other Haskell thread, this entire CPU
time can be attributed to the single Haskell thread.

The CPU time between two points on the OS thread (GHC capability) can be
decomposed as follows:

* User-code execution time
* CPU time spent in FFI code
* Haskell GC CPU time
* RTS scheduler and bookkeeping CPU time

### Single Capability, Multiple Haskell Threads

In this configuration, `ThreadCPUTime` for the OS thread can no longer be
used to directly measure the CPU time consumed by a particular Haskell
thread. A Haskell thread may yield execution between the measurement
points, allowing another Haskell thread to run on the same capability and
consume CPU time on the same OS thread.

To measure CPU usage attributable to an individual Haskell thread, the
measurement must instead be performed at the RTS level. This can be done
using RTS eventlog tracing or by recording timestamps when the Haskell
thread is scheduled onto and descheduled from a capability.

The mutator CPU time reported by GHC is the cumulative CPU time spent
outside garbage collection. It therefore includes both Haskell thread
execution time and RTS overhead incurred during mutator execution, such
as scheduler bookkeeping and allocation management.

If we separately measure the cumulative CPU time attributable to
individual Haskell threads, then the difference between the total
mutator CPU time and the cumulative Haskell thread CPU time provides an
estimate of RTS overhead excluding garbage collection.

### Multiple capabilities

Same as above applies in this case as well.

## Variability of Measurements

Performance measurement is tricky and there are many factors to take care of if
you want to get reliable results:

* Disable CPU frequency scaling, can cause run-to-run or variability in the
same run.
* Do not run other things on the same machine. interrupts, kernel
activity, background daemons can also affect:
* Memory contention can affect the measurement.
* cache effects due to context switching can affect it.
* Discard first runs, first runs are usually outliers because of warm up effects,
instruction cache cold, data cache cold, page faults, branch predictor
not trained.
* Use thread affinity. Thread migration to another CPU: causes cache
invalidation, different core state, timing noise.
* Use larger measurements. In smaller one measurement overhead and
variance may dominate: timing calls (clock_gettime), counters, RTS
stats.
* Different CPUs running at different frequencies can make the results
unpredictable.
* The clocks of different CPUs may not be perfectly in sync.

To counter the last two factors we should use instruction count or
allocation count rather than time as a more reliable measure. Even
the instruction count might vary because of measurement overhead adds
instruction count, which can vary depending on how many times the thread
is context switched.

## Haskell specific variability

* Lazy evaluation, may defer work which might get evaluated later in the
context of some other measurement window.

31 changes: 17 additions & 14 deletions examples/traceEventIO.hs
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@

import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)
import Foreign.C.Types
import System.Posix.Signals
import Foreign.C.Types ( CUInt(..) )
import GHC.Conc (myThreadId, labelThread)
import System.Posix.Signals ( blockSignals, fullSignalSet )

foreign import ccall unsafe "unistd.h sleep"
c_sleep :: CUInt -> IO CUInt


{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
Expand All @@ -17,26 +17,29 @@ withTracingFlow tag action = do
liftIO $ traceEventIO ("END:" ++ tag)
pure res

emptyBlock :: IO ()
emptyBlock = return ()

sleepBlock :: IO ()
sleepBlock = do
-- So that signals do not interrupt the sleep
blockSignals fullSignalSet
_ <- c_sleep 10
return ()

{-
sleepWithEvents :: IO ()
sleepWithEvents = do
traceEventIO "before sleep"
sleepBlock
traceEventIO "after sleep"
-}

emptyBlock :: IO ()
emptyBlock = return ()
{-# INLINE printSumLoop #-}
printSumLoop :: Int -> Int -> Int -> IO ()
printSumLoop _ _ 0 = print "All Done!"
printSumLoop chunksOf from times = do
withTracingFlow "SUM" $ print $ sum [from..(from + chunksOf)]
printSumLoop chunksOf (from + chunksOf) (times - 1)

main :: IO ()
main = do
tid <- myThreadId
labelThread tid "main-thread"

withTracingFlow "EMPTY" emptyBlock
withTracingFlow "SLEEP" sleepBlock
return ()
withTracingFlow "LOOP" $ do
printSumLoop 10000 1 100
2 changes: 0 additions & 2 deletions haskell-perf.cabal
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,6 @@ extra-doc-files:
extra-source-files:
examples/*.hs
test/snippets/*.hs
lib/Streamly/Metrics/File.hs
lib/Streamly/Metrics/Console.hs

source-repository head
type: git
Expand Down
Loading
Loading