This repository was archived by the owner on Jun 3, 2025. It is now read-only.

Commit 999a946

Beth-Kosis, jeanniefinks, and robertgshaw2-redhat authored
B kosis user guide (#132)
* Update recipes.mdx
* Update creating.mdx
  Line 172: Is this change correct? If not, what is currently available?
* Update enabling.mdx
  Lines 115, 116, and 118: Should "modifying ops" be "modifying operations"?
* Update onnx-export.mdx
* Update deepsparse-engine.mdx
  Lines 27-29: This link goes to a 404 page.
* Update scheduler.mdx
  Line 30: The link goes to a 404 page.
* Update benchmarking.mdx
  Line 12: Consistency issue: CLI is spelled out here, but not in other articles.
  Line 116: What is the meaning of the icon? How is it entered in Markdown?
  Lines 120 & 126: Is the "engine" the DeepSparse Engine? If so, it should be spelled out as such or "engine" should have an initial cap (Engine).
* Update numactl-utility.mdx
* Update benchmarking.mdx
* Update src/content/user-guide/deepsparse-engine.mdx
  Co-authored-by: Robert Shaw <[email protected]>
* Update src/content/user-guide/deepsparse-engine/benchmarking.mdx
  Co-authored-by: Robert Shaw <[email protected]>

Co-authored-by: Jeannie Finks <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
1 parent 8063962 commit 999a946

8 files changed, +135 -135 lines changed

src/content/user-guide/deepsparse-engine.mdx

Lines changed: 6 additions & 6 deletions
@@ -7,28 +7,28 @@ index: 4000

# User Guides for the DeepSparse Engine

-This user guide offers more information for exploring additional and advanced functionality for the DeepSparse Engine.
+This user guide offers information for exploring additional and advanced functionality for the DeepSparse Engine.

## Guides

<LinkCards>
<LinkCard href="./hardware-support" heading="Supported Hardware">
-Supported hardware for the DeepSparse Engine, including CPU types and instruction sets.
+Lists supported hardware for the DeepSparse Engine, including CPU types and instruction sets.
</LinkCard>

<LinkCard href="./scheduler" heading="Inference Types">
-Inference types and the tradeoffs with the DeepSparse Scheduler, such as single and multi-stream.
+Describes inference types and tradeoffs with the DeepSparse Scheduler, such as single and multi-stream.
</LinkCard>

<LinkCard href="./benchmarking" heading="Benchmarking">
-Benchmarking ONNX models in the DeepSparse Engine.
+Explains how to benchmark ONNX models in the DeepSparse Engine.
</LinkCard>

<LinkCard href="./diagnostics-debugging" heading="Diagnostics/Debugging">
-Logging guidance for diagnosing and debugging any issues.
+Provides logging guidance for diagnosing and debugging any issues.
</LinkCard>

<LinkCard href="./numactl-utility" heading="numactl Utility">
-Controlling resource utilization with the DeepSparse Engine using the numactl utility.
+Explains how to use the numactl utility for controlling resource utilization with the DeepSparse Engine.
</LinkCard>
</LinkCards>

src/content/user-guide/deepsparse-engine/benchmarking.mdx

Lines changed: 8 additions & 8 deletions
@@ -15,19 +15,19 @@ execute the model depending on the chosen scenario. By default, it will choose a

## Installation Requirements

-This page requires the [DeepSparse General Install](/get-started/install/deepsparse).
+Use of the DeepSparse Benchmarking utilities requires installation of the [DeepSparse Community](/get-started/install/deepsparse).

## Quickstart

-To benchmark a dense BERT ONNX model fine-tuned on the SST2 dataset (which is identified by its SparseZoo stub), run the following:
+To benchmark a dense BERT ONNX model fine-tuned on the SST2 dataset (which is identified by its SparseZoo stub), run:

```bash
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none
```

## Usage

-In most cases, good performance will be found in the default options so it can be as simple as running the command with a SparseZoo model stub or your local ONNX model.
+In most cases, good performance will be found in the default options so usage can be as simple as running the command with a SparseZoo model stub or your local ONNX model.
However, if you prefer to customize benchmarking for your personal use case, you can run `deepsparse.benchmark -h` or with `--help` to view your usage options:

CLI Arguments:

@@ -91,23 +91,23 @@ $ deepsparse.benchmark --help
> -x EXPORT_PATH, --export_path EXPORT_PATH
> Store results into a JSON file.
```
-💡**PRO TIP**💡: save your benchmark results in a convenient JSON file!
+**PRO TIP:** Save your benchmark results in a convenient JSON file.

-Example CLI command for benchmarking an ONNX model from the SparseZoo and saving the results to a `benchmark.json` file:
+The following is an example CLI command for benchmarking an ONNX model from the SparseZoo and saving the results to a `benchmark.json` file:

```bash
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none -x benchmark.json
```

### Sample CLI Argument Configurations

-To run a sparse FP32 MobileNetV1 at batch size 16 for 10 seconds for throughput using 8 streams of requests:
+To run a sparse FP32 MobileNetV1 at batch size 16 for 10 seconds for throughput using 8 streams of requests, use:

```bash
deepsparse.benchmark zoo:cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-moderate --batch_size 16 --time 10 --scenario async --num_streams 8
```

-To run a sparse quantized INT8 6-layer BERT at batch size 1 for latency:
+To run a sparse quantized INT8 6-layer BERT at batch size 1 for latency, use:

```bash
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant_6layers-aggressive_96 --batch_size 1 --scenario sync

@@ -131,7 +131,7 @@ The throughput value reported comes from measuring the number of finished infere

**BERT 3-layer FP32 Sparse Throughput**

-No need to add *scenario* argument since `async` is the default option:
+There is no need to add a *scenario* argument since `async` is the default option:
```bash
$ deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
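
As an editorial aside to the `-x/--export_path` change above: once results are exported with `-x benchmark.json`, they can be inspected programmatically. The JSON schema is not documented in this diff, so the following sketch simply loads the file and lists whatever fields it contains; the file name matches the example command.

```python
# Hedged sketch: inspect a results file produced by
# `deepsparse.benchmark ... -x benchmark.json`. The exact field names
# are not specified in the docs above, so we just enumerate them.
import json

with open("benchmark.json") as f:
    results = json.load(f)

# Print the top-level keys and a short preview of each value.
for key, value in results.items():
    print(f"{key}: {str(value)[:60]}")
```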

src/content/user-guide/deepsparse-engine/numactl-utility.mdx

Lines changed: 8 additions & 8 deletions
@@ -35,20 +35,20 @@ For more fine-grained control, **numactl** can be used to bind the process runni

Similarly, for a multi-socket system with N sockets and C physical CPUs per socket, the CPUs located on a single socket will range from K*C to ((K+1)*C)-1 where 0<=K<N. For multi-socket, multi-thread systems, the logical threads are separated by N*C. For example, for a two socket, two thread per CPU system with 8 cores per CPU, the logical threads for socket 0 would be numbered 0-7 and 16-23, and the threads for socket 1 would be numbered 8-15 and 24-31.

-Given the architecture above, to run the DeepSparse Engine on the first four CPUs on the second socket, you would use the following:
+Given the architecture above, to run the DeepSparse Engine on the first four CPUs on the second socket, you would use:

```bash
numactl --physcpubind 8-11 --preferred 1 <deepsparseengine-process>
```

Appending `--preferred 1` is needed here since the DeepSparse Engine is being bound to CPUs on the second socket.

-Note: When running on multiple sockets using a batch size that is evenly divisible by the number of sockets will yield the best performance.
+**Note:** When running on multiple sockets, using a batch size that is evenly divisible by the number of sockets will yield the best performance.

## DeepSparse Engine and Thread Pinning

-When using **numactl** to specify which CPUs/sockets the engine is allowed to run on, there is no restriction as to which CPU a particular computation thread is executed on. A single thread of computation may run on one or more CPUs during the course of execution. This is desirable if the system is being shared between multiple processes so that idle CPU threads are not prevented from doing other work.
+When using **numactl** to specify the CPUs/sockets on which the engine is allowed to run, there is no restriction as to the CPU on which a particular computation thread is executed. A single thread of computation may run on one or more CPUs during the course of execution. This is desirable if the system is being shared between multiple processes so that idle CPU threads are not prevented from doing other work.

However, the engine works best when threads are pinned (i.e., not allowed to migrate from one CPU to another). Thread pinning can be enabled using the `NM_BIND_THREADS_TO_CORES` environment variable. For example:

@@ -58,20 +58,20 @@ However, the engine works best when threads are pinned (i.e., not allowed to mig
export NM_BIND_THREADS_TO_CORES=1 <deepsparseengine-process>
```

-`NM_BIND_THREADS_TO_CORES` should be used with care since it forces the DeepSparse Engine to run on only the threads it has been allocated at startup. If any other process ends up running on the same threads, it could result in a major degradation of performance.
+Use `NM_BIND_THREADS_TO_CORES` with care since it forces the DeepSparse Engine to run on only the threads it has been allocated at startup. If any other process ends up running on the same threads, it could result in a major degradation of performance.

-**Note:** The threads-to-cores mappings described above are specific to Intel only. AMD has a different mapping. For AMD, all the threads for a single core are consecutive, i.e., if each core has two threads and there are N cores, the threads for a particular core K are 2*K and 2*K+1. The mapping of cores to sockets is also straightforward, for a N socket system with C cores per socket, the cores for a particular socket S are numbered S*C to ((S+1)*C)-1.
+**Note:** The threads-to-cores mappings described above are specific to Intel only. AMD has a different mapping. For AMD, all the threads for a single core are consecutive; that is, if each core has two threads and there are N cores, the threads for a particular core K are 2*K and 2*K+1. The mapping of cores to sockets is also straightforward. For an N socket system with C cores per socket, the cores for a particular socket S are numbered S*C to ((S+1)*C)-1.

## Additional Notes

+This displays the inventory of available sockets/CPUs on a system:
+
`numactl --hardware`

-Displays the inventory of available sockets/CPUs on a system.
+This displays the resources available to the current process:

`numactl --show`

-Displays the resources available to the current process.
-
For further details about these and other parameters, see the man page on **numactl**:

```bash

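To make the thread-numbering arithmetic in the numactl diff above concrete, here is a small editorial sketch (not part of the diffed docs). The helper names are hypothetical; the formulas come directly from the text: socket K on an Intel-style system owns logical threads K*C to ((K+1)*C)-1 plus the same range offset by N*C, while AMD numbers a core's two threads consecutively.

```python
# Editorial sketch of the CPU/thread numbering described in
# numactl-utility.mdx. Helper names are hypothetical.

def intel_socket_threads(K: int, N: int, C: int) -> list[int]:
    """Logical threads owned by socket K on an N-socket, C-cores-per-socket
    Intel-style system with two hardware threads per core."""
    first = list(range(K * C, (K + 1) * C))   # first hyperthread of each core
    second = [t + N * C for t in first]       # second hyperthreads, offset by N*C
    return first + second

def amd_core_threads(K: int) -> list[int]:
    """AMD-style numbering: a core's two threads are consecutive."""
    return [2 * K, 2 * K + 1]

# The doc's example: 2 sockets, 8 cores per socket, 2 threads per core.
assert intel_socket_threads(0, N=2, C=8) == list(range(0, 8)) + list(range(16, 24))
assert intel_socket_threads(1, N=2, C=8) == list(range(8, 16)) + list(range(24, 32))
```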
src/content/user-guide/deepsparse-engine/scheduler.mdx

Lines changed: 14 additions & 14 deletions
@@ -9,38 +9,38 @@ index: 2000

This page explains the various settings for DeepSparse, which enable you to tune the performance to your workload.

-Schedulers are special system software which handle the distribution of work across cores in parallel computation.
-The goal of a good scheduler is to ensure that while work is available, cores aren't sitting idle.
+Schedulers are special system software, which handle the distribution of work across cores in parallel computation.
+The goal of a good scheduler is to ensure that, while work is available, cores are not sitting idle.
On the contrary, as long as parallel tasks are available, all cores should be kept busy.

## Single Stream (Default)
In most use cases, the default scheduler is the preferred choice when running inferences with the DeepSparse Engine.
-It's highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets.
+The default scheduler is highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets.
Often, particularly when working with large batch sizes, the scheduler is able to distribute the workload of a single request across as many cores as it's provided.

*Single-stream scheduling; requests execute serially by default:*
<img src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/docs/source/single-stream.png" alt="single stream diagram" />

-## Multi Stream
+## Multi-Stream

-However, there are circumstances in which more cores does not imply better performance. If the computation can't be divided up to produce enough parallelism (while maximizing use of the CPU cache), then adding more cores simply adds more compute power with little to apply it to.
+There are circumstances in which more cores does not imply better performance. If the computation can't be divided up to produce enough parallelism (while maximizing use of the CPU cache), then adding more cores simply adds more compute power with little to apply it to.

-An alternative, "multi-stream" scheduler is provided with the software. In cases where parallelism is low, sending multiple requests simultaneously can more adequately saturate the available cores. In other words, if speedup can't be achieved by adding more cores, then perhaps speedup can be achieved by adding more work.
+An alternative, multi-stream scheduler is provided with the software. In cases where parallelism is low, sending multiple requests simultaneously can more adequately saturate the available cores. In other words, if speedup can't be achieved by adding more cores, then perhaps speedup can be achieved by adding more work.

-If increasing core count doesn't decrease latency, that's a strong indicator that parallelism is low in your particular model/batch-size combination. It may be that total throughput can be increased by making more requests simultaneously. Using the [deepsparse.engine.Scheduler API,](https://docs.neuralmagic.com/deepsparse/api/deepsparse.html) the multi-stream scheduler can be selected, and requests made by multiple Python threads will be handled concurrently.
+If increasing core count does not decrease latency, that's a strong indicator that parallelism is low in your particular model/batch-size combination. It may be that total throughput can be increased by making more requests simultaneously. Using the [deepsparse.engine.Scheduler API,](https://docs.neuralmagic.com/deepsparse/api/deepsparse.html) the multi-stream scheduler can be selected, and requests made by multiple Python threads will be handled concurrently.

-*Multi-stream scheduling; requests execute in parallel and may utilize HW resources better:*
+*Multi-stream scheduling; requests execute in parallel and may better utilize hardware resources:*
<img src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/docs/source/multi-stream.png" alt="multi stream diagram" />

-Whereas the default scheduler will queue up requests made simultaneously and handle them serially, the multi-stream scheduler allows multiple requests to be run in parallel. The `num_streams` argument to the Engine/Context classes controls how the multi-streams scheduler partitions up the machine. Each stream maps to a contiguous set of hardware threads. By default, only one hyperthread per core is used. There is no sharing amongst the partitions and it is generally good practice make sure that the `num_streams` value evenly divides into your number of cores. By default `num_streams` is set to multiplex requests across L3 caches.
+Whereas the default scheduler will queue up requests made simultaneously and handle them serially, the multi-stream scheduler allows multiple requests to be run in parallel. The `num_streams` argument to the Engine/Context classes controls how the multi-streams scheduler partitions up the machine. Each stream maps to a contiguous set of hardware threads. By default, only one hyperthread per core is used. There is no sharing amongst the partitions and it is generally good practice to make sure the `num_streams` value evenly divides into your number of cores. By default `num_streams` is set to multiplex requests across L3 caches.

-Here's an example: Consider a machine with 2 sockets, each with 8 cores. In this case the multi-stream scheduler will create two streams, one per socket by default. The first stream will contain cores 0-7 and the second stream will contain cores 8-15.
+Here's an example. Consider a machine with 2 sockets, each with 8 cores. In this case, the multi-stream scheduler will create two streams, one per socket by default. The first stream will contain cores 0-7 and the second stream will contain cores 8-15.

-Manually increasing `num_streams` to 3 will result in the following stream breakdown: threads 0-5 in the first stream, 6-10 in the second, and 11-15 in the last. This is problematic for our two socket system. The second stream (threads 6-10) is straddling both sockets, meaning that each request being serviced by that stream is going to incur a performance penalty each time one of its threads makes a remote memory access. The impact of this penalty will depend on the workload, but it will likely be significant.
+Manually increasing `num_streams` to 3 will result in the following stream breakdown: threads 0-5 in the first stream, 6-10 in the second, and 11-15 in the last. This is problematic for our 2-socket system. The second stream (threads 6-10) is straddling both sockets, meaning that each request being serviced by that stream is going to incur a performance penalty each time one of its threads makes a remote memory access. The impact of this penalty will depend on the workload, but it will likely be significant.

-Manually increasing `num_streams` to 4 is interesting. Here's the stream breakdown: threads 0-3 in the first stream, 4-7 in the second, 8-11 in the third, and 12-15 in the fourth. Each stream is only making memory accesses that are local to its socket which is good. However, the first two and last two streams are sharing the same L3 cache which can result in worse performance due to cache thrashing. Depending on the workload, the performance gain from the increased parallelism may negate this penalty, though.
+Manually increasing `num_streams` to 4 is interesting. Here's the stream breakdown: threads 0-3 in the first stream, 4-7 in the second, 8-11 in the third, and 12-15 in the fourth. Each stream is only making memory accesses that are local to its socket, which is good. However, the first two and last two streams are sharing the same L3 cache, which can result in worse performance due to cache thrashing. Depending on the workload, though, the performance gain from the increased parallelism may negate this penalty.

The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them. Implementing a model server may fit such a scenario and be ideal for using multi-stream scheduling.

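Editorial aside: the `num_streams` walkthrough in the hunk above maps 16 hardware threads onto contiguous, as-even-as-possible streams. The sketch below reproduces that arithmetic under that stated assumption; `partition_streams` is a hypothetical helper for illustration, not DeepSparse's internal partitioning code.

```python
# Hedged sketch of the contiguous stream partitioning described above,
# for the 2-socket x 8-core (16-thread) example machine.

def partition_streams(total_threads: int, num_streams: int) -> list[range]:
    base, extra = divmod(total_threads, num_streams)
    streams, start = [], 0
    for i in range(num_streams):
        size = base + (1 if i < extra else 0)  # earlier streams absorb the remainder
        streams.append(range(start, start + size))
        start += size
    return streams

print(partition_streams(16, 2))  # [range(0, 8), range(8, 16)] -> one stream per socket
print(partition_streams(16, 3))  # threads 0-5, 6-10, 11-15 -> middle stream straddles sockets
print(partition_streams(16, 4))  # threads 0-3, 4-7, 8-11, 12-15 -> stream pairs share an L3 cache
```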
@@ -52,12 +52,12 @@ Depending on your engine execution strategy, enable one of these options by runn
engine = compile_model(model_path, scheduler="single_stream")
```

-or
+or:

```python
engine = compile_model(model_path, scheduler="multi_stream", num_streams=None) # None is the default
```

-or pass in the enum value directly, since` "multi_stream" == Scheduler.multi_stream`
+or pass in the enum value directly, since `"multi_stream" == Scheduler.multi_stream`.

By default, the scheduler will map to a single stream.
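
To round out the hunk above, here is a hedged end-to-end sketch of the pattern the docs describe: several Python threads sharing one multi-stream engine. `compile_model` and the `scheduler` argument appear in the diff itself; the model path, input shape, and the single-input `engine.run` call are assumptions for illustration.

```python
# Hedged sketch: concurrent requests against a multi-stream engine.
import threading
import numpy as np
from deepsparse import compile_model

model_path = "model.onnx"  # assumption: a local ONNX model (or SparseZoo stub)
engine = compile_model(model_path, batch_size=1, scheduler="multi_stream")

def worker() -> None:
    # Assumption: the model takes one float32 image-like input.
    inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
    outputs = engine.run(inputs)  # requests from threads are handled concurrently
    print(len(outputs), "output arrays")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```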
