* Update recipes.mdx
* Update creating.mdx
Line 172: Is this change correct? If not, what is currently available?
* Update enabling.mdx
Lines 115, 116, and 118: Should "modifying ops" be "modifying operations"?
* Update onnx-export.mdx
* Update deepsparse-engine.mdx
Lines 27-29: This link goes to a 404 page.
* Update scheduler.mdx
Line 30: The link goes to a 404 page.
* Update benchmarking.mdx
Line 12: Consistency issue: CLI is spelled out here, but not in other articles.
Line 116: What is the meaning of the icon? How is it entered in Markdown?
Lines 120 and 126: Is the "engine" the DeepSparse Engine? If so, it should be spelled out as such or "engine" should have an initial cap (Engine).
* Update numactl-utility.mdx
* Update benchmarking.mdx
* Update src/content/user-guide/deepsparse-engine.mdx
Co-authored-by: Robert Shaw <[email protected]>
* Update src/content/user-guide/deepsparse-engine/benchmarking.mdx
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Jeannie Finks <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
- In most cases, good performance will be found in the default options so it can be as simple as running the command with a SparseZoo model stub or your local ONNX model.
+ In most cases, good performance will be found in the default options so usage can be as simple as running the command with a SparseZoo model stub or your local ONNX model.

However, if you prefer to customize benchmarking for your personal use case, you can run `deepsparse.benchmark -h` or with `--help` to view your usage options:

CLI Arguments:
@@ -91,23 +91,23 @@ $ deepsparse.benchmark --help

> -x EXPORT_PATH, --export_path EXPORT_PATH
> Store results into a JSON file.
```

- 💡**PRO TIP**💡: save your benchmark results in a convenient JSON file!
+ **PRO TIP:** Save your benchmark results in a convenient JSON file.

- Example CLI command for benchmarking an ONNX model from the SparseZoo and saving the results to a `benchmark.json` file:
+ The following is an example CLI command for benchmarking an ONNX model from the SparseZoo and saving the results to a `benchmark.json` file:
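The command itself is collapsed in this diff view. A minimal sketch of what it can look like, using the `-x`/`--export_path` flag shown in the help output above (the SparseZoo stub below is illustrative; any valid stub or a local ONNX model path works):

```bash
# Illustrative only: the SparseZoo stub is a placeholder example;
# substitute any valid stub or a path to a local ONNX model.
deepsparse.benchmark \
  zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate \
  -x benchmark.json
```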
`src/content/user-guide/deepsparse-engine/numactl-utility.mdx` (8 additions, 8 deletions)
@@ -35,20 +35,20 @@ For more fine-grained control, **numactl** can be used to bind the process runni

Similarly, for a multi-socket system with N sockets and C physical CPUs per socket, the CPUs located on a single socket will range from K*C to ((K+1)*C)-1 where 0<=K<N. For multi-socket, multi-thread systems, the logical threads are separated by N*C. For example, for a two socket, two thread per CPU system with 8 cores per CPU, the logical threads for socket 0 would be numbered 0-7 and 16-23, and the threads for socket 1 would be numbered 8-15 and 24-31.
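As a quick aside (not part of the diff), this numbering can be restated in a few lines of Python; the helper below is hypothetical and simply encodes the arithmetic above:

```python
# Hypothetical helper restating the Intel-style numbering above:
# N sockets, C physical CPUs per socket, T hardware threads per CPU.
def socket_threads(n_sockets, cores_per_socket, threads_per_core, socket):
    """Return the logical thread IDs that live on one socket."""
    ids = []
    for t in range(threads_per_core):
        # Each additional hardware thread is offset by N*C from the previous.
        start = socket * cores_per_socket + t * n_sockets * cores_per_socket
        ids.extend(range(start, start + cores_per_socket))
    return ids

print(socket_threads(2, 8, 2, 0))  # threads 0-7 and 16-23
print(socket_threads(2, 8, 2, 1))  # threads 8-15 and 24-31
```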
- Given the architecture above, to run the DeepSparse Engine on the first four CPUs on the second socket, you would use the following:
+ Given the architecture above, to run the DeepSparse Engine on the first four CPUs on the second socket, you would use:

Appending `--preferred 1` is needed here since the DeepSparse Engine is being bound to CPUs on the second socket.
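The command itself is collapsed in this diff view. A hypothetical sketch of the shape it would take for the layout above, with a placeholder script name:

```bash
# Hypothetical invocation: CPUs 8-11 are the first four cores of socket 1
# in the 8-cores-per-socket layout above; run_model.py is a placeholder.
numactl --physcpubind=8-11 --preferred=1 python run_model.py
```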
- Note: When running on multiple sockets using a batch size that is evenly divisible by the number of sockets will yield the best performance.
+ **Note:** When running on multiple sockets, using a batch size that is evenly divisible by the number of sockets will yield the best performance.
## DeepSparse Engine and Thread Pinning

- When using **numactl** to specify which CPUs/sockets the engine is allowed to run on, there is no restriction as to which CPU a particular computation thread is executed on. A single thread of computation may run on one or more CPUs during the course of execution. This is desirable if the system is being shared between multiple processes so that idle CPU threads are not prevented from doing other work.
+ When using **numactl** to specify the CPUs/sockets on which the engine is allowed to run, there is no restriction as to the CPU on which a particular computation thread is executed. A single thread of computation may run on one or more CPUs during the course of execution. This is desirable if the system is being shared between multiple processes so that idle CPU threads are not prevented from doing other work.

However, the engine works best when threads are pinned (i.e., not allowed to migrate from one CPU to another). Thread pinning can be enabled using the `NM_BIND_THREADS_TO_CORES` environment variable. For example:
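The example is collapsed in this diff view; an illustrative invocation (the script name is a placeholder) would look like:

```bash
# Illustrative: enable thread pinning for one run; run_model.py stands in
# for whatever command launches the DeepSparse Engine.
NM_BIND_THREADS_TO_CORES=1 python run_model.py
```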
@@ -58,20 +58,20 @@ However, the engine works best when threads are pinned (i.e., not allowed to mig

- `NM_BIND_THREADS_TO_CORES` should be used with care since it forces the DeepSparse Engine to run on only the threads it has been allocated at startup. If any other process ends up running on the same threads, it could result in a major degradation of performance.
+ Use `NM_BIND_THREADS_TO_CORES` with care since it forces the DeepSparse Engine to run on only the threads it has been allocated at startup. If any other process ends up running on the same threads, it could result in a major degradation of performance.
- **Note:** The threads-to-cores mappings described above are specific to Intel only. AMD has a different mapping. For AMD, all the threads for a single core are consecutive, i.e., if each core has two threads and there are N cores, the threads for a particular core K are 2*K and 2*K+1. The mapping of cores to sockets is also straightforward, for a N socket system with C cores per socket, the cores for a particular socket S are numbered S*C to ((S+1)*C)-1.
+ **Note:** The threads-to-cores mappings described above are specific to Intel only. AMD has a different mapping. For AMD, all the threads for a single core are consecutive; that is, if each core has two threads and there are N cores, the threads for a particular core K are 2*K and 2*K+1. The mapping of cores to sockets is also straightforward. For an N socket system with C cores per socket, the cores for a particular socket S are numbered S*C to ((S+1)*C)-1.
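For contrast with the Intel sketch earlier, the AMD-style numbering in this note reduces to the following (hypothetical helpers, not part of the diff):

```python
# Hypothetical helpers restating the AMD-style numbering in the note.
def amd_core_threads(core, threads_per_core=2):
    """Threads of one core are consecutive: 2*K and 2*K+1 for two threads."""
    return [threads_per_core * core + t for t in range(threads_per_core)]

def amd_socket_cores(socket, cores_per_socket):
    """Cores of socket S run from S*C to ((S+1)*C)-1."""
    start = socket * cores_per_socket
    return list(range(start, start + cores_per_socket))

print(amd_core_threads(3))     # core 3 -> threads [6, 7]
print(amd_socket_cores(1, 8))  # socket 1 -> cores [8, 9, ..., 15]
```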
## Additional Notes

+ This displays the inventory of available sockets/CPUs on a system:

`numactl --hardware`

- Displays the inventory of available sockets/CPUs on a system.
+ This displays the resources available to the current process:

`numactl --show`

- Displays the resources available to the current process.

For further details about these and other parameters, see the man page on **numactl**:
`src/content/user-guide/deepsparse-engine/scheduler.mdx` (14 additions, 14 deletions)
@@ -9,38 +9,38 @@ index: 2000

This page explains the various settings for DeepSparse, which enable you to tune the performance to your workload.

- Schedulers are special system software which handle the distribution of work across cores in parallel computation.
- The goal of a good scheduler is to ensure that while work is available, cores aren’t sitting idle.
+ Schedulers are special system software, which handle the distribution of work across cores in parallel computation.
+ The goal of a good scheduler is to ensure that, while work is available, cores are not sitting idle.

On the contrary, as long as parallel tasks are available, all cores should be kept busy.
## Single Stream (Default)

In most use cases, the default scheduler is the preferred choice when running inferences with the DeepSparse Engine.

- It's highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets.
+ The default scheduler is highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets.

Often, particularly when working with large batch sizes, the scheduler is able to distribute the workload of a single request across as many cores as it's provided.

*Single-stream scheduling; requests execute serially by default:*
- However, there are circumstances in which more cores does not imply better performance. If the computation can't be divided up to produce enough parallelism (while maximizing use of the CPU cache), then adding more cores simply adds more compute power with little to apply it to.
+ There are circumstances in which more cores does not imply better performance. If the computation can't be divided up to produce enough parallelism (while maximizing use of the CPU cache), then adding more cores simply adds more compute power with little to apply it to.

- An alternative, "multi-stream" scheduler is provided with the software. In cases where parallelism is low, sending multiple requests simultaneously can more adequately saturate the available cores. In other words, if speedup can't be achieved by adding more cores, then perhaps speedup can be achieved by adding more work.
+ An alternative, multi-stream scheduler is provided with the software. In cases where parallelism is low, sending multiple requests simultaneously can more adequately saturate the available cores. In other words, if speedup can't be achieved by adding more cores, then perhaps speedup can be achieved by adding more work.

- If increasing core count doesn't decrease latency, that's a strong indicator that parallelism is low in your particular model/batch-size combination. It may be that total throughput can be increased by making more requests simultaneously. Using the [deepsparse.engine.Scheduler API,](https://docs.neuralmagic.com/deepsparse/api/deepsparse.html) the multi-stream scheduler can be selected, and requests made by multiple Python threads will be handled concurrently.
+ If increasing core count does not decrease latency, that's a strong indicator that parallelism is low in your particular model/batch-size combination. It may be that total throughput can be increased by making more requests simultaneously. Using the [deepsparse.engine.Scheduler API,](https://docs.neuralmagic.com/deepsparse/api/deepsparse.html) the multi-stream scheduler can be selected, and requests made by multiple Python threads will be handled concurrently.

- *Multi-stream scheduling; requests execute in parallel and may utilize HW resources better:*
+ *Multi-stream scheduling; requests execute in parallel and may better utilize hardware resources:*
- Whereas the default scheduler will queue up requests made simultaneously and handle them serially, the multi-stream scheduler allows multiple requests to be run in parallel. The `num_streams` argument to the Engine/Context classes controls how the multi-streams scheduler partitions up the machine. Each stream maps to a contiguous set of hardware threads. By default, only one hyperthread per core is used. There is no sharing amongst the partitions and it is generally good practice make sure that the `num_streams` value evenly divides into your number of cores. By default `num_streams` is set to multiplex requests across L3 caches.
+ Whereas the default scheduler will queue up requests made simultaneously and handle them serially, the multi-stream scheduler allows multiple requests to be run in parallel. The `num_streams` argument to the Engine/Context classes controls how the multi-streams scheduler partitions up the machine. Each stream maps to a contiguous set of hardware threads. By default, only one hyperthread per core is used. There is no sharing amongst the partitions and it is generally good practice to make sure the `num_streams` value evenly divides into your number of cores. By default `num_streams` is set to multiplex requests across L3 caches.
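As an aside (not part of the diff), a minimal sketch of selecting the multi-stream scheduler from Python, assuming `Engine` and `Scheduler` are importable from the top-level `deepsparse` package and that `Engine` accepts the `scheduler` and `num_streams` arguments this paragraph describes; the model path is a placeholder:

```python
from deepsparse import Engine, Scheduler

# Sketch: partition the machine into two streams so two requests
# can run concurrently. "model.onnx" is a placeholder path.
engine = Engine(
    model="model.onnx",
    batch_size=1,
    scheduler=Scheduler.multi_stream,
    num_streams=2,  # good practice: evenly divides the core count
)
```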
- Here's an example: Consider a machine with 2 sockets, each with 8 cores. In this case the multi-stream scheduler will create two streams, one per socket by default. The first stream will contain cores 0-7 and the second stream will contain cores 8-15.
+ Here's an example. Consider a machine with 2 sockets, each with 8 cores. In this case, the multi-stream scheduler will create two streams, one per socket by default. The first stream will contain cores 0-7 and the second stream will contain cores 8-15.
- Manually increasing `num_streams` to 3 will result in the following stream breakdown: threads 0-5 in the first stream, 6-10 in the second, and 11-15 in the last. This is problematic for our two socket system. The second stream (threads 6-10) is straddling both sockets, meaning that each request being serviced by that stream is going to incur a performance penalty each time one of its threads makes a remote memory access. The impact of this penalty will depend on the workload, but it will likely be significant.
+ Manually increasing `num_streams` to 3 will result in the following stream breakdown: threads 0-5 in the first stream, 6-10 in the second, and 11-15 in the last. This is problematic for our 2-socket system. The second stream (threads 6-10) is straddling both sockets, meaning that each request being serviced by that stream is going to incur a performance penalty each time one of its threads makes a remote memory access. The impact of this penalty will depend on the workload, but it will likely be significant.
- Manually increasing `num_streams` to 4 is interesting. Here's the stream breakdown: threads 0-3 in the first stream, 4-7 in the second, 8-11 in the third, and 12-15 in the fourth. Each stream is only making memory accesses that are local to its socket which is good. However, the first two and last two streams are sharing the same L3 cache which can result in worse performance due to cache thrashing. Depending on the workload, the performance gain from the increased parallelism may negate this penalty, though.
+ Manually increasing `num_streams` to 4 is interesting. Here's the stream breakdown: threads 0-3 in the first stream, 4-7 in the second, 8-11 in the third, and 12-15 in the fourth. Each stream is only making memory accesses that are local to its socket, which is good. However, the first two and last two streams are sharing the same L3 cache, which can result in worse performance due to cache thrashing. Depending on the workload, though, the performance gain from the increased parallelism may negate this penalty.
The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them. Implementing a model server may fit such a scenario and be ideal for using multi-stream scheduling.
@@ -52,12 +52,12 @@ Depending on your engine execution strategy, enable one of these options by runn