Commit c5ca28d

Virat Agarwal authored and GitHub Enterprise committed
Updating all README.rst to add details.rst information as well
1 parent 6a71717 commit c5ca28d

87 files changed: +3823, -0 lines changed

common/utility/md2rst/md2rst.py

Lines changed: 1 addition & 0 deletions
@@ -128,6 +128,7 @@ def commandargs(target,data):
     else:
         target.write('./' + data["host"]["host_exe"])
     target.write("\n\n")
+    return
 
 # Get the argument from the description
 script, desc_file, name = argv

common/utility/readme_gen/readme_gen.py

Lines changed: 15 additions & 0 deletions
@@ -115,7 +115,21 @@ def commandargs(target,data):
     else:
         target.write('./' + data["host"]["host_exe"])
     target.write("\n\n")
+    return
 
+def details(target):
+    listfiles = os.listdir('./')
+    if 'details.rst' in listfiles:
+        target.write("DETAILS\n")
+        target.write("-" * len("DETAILS"))
+        target.write("\n")
+        with open('details.rst', 'r') as fin:
+            for i, x in enumerate(fin):
+                if 2 <= i :
+                    target.write(x)
+        target.write("\n")
+        target.write("For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.")
+    return
 
 # Get the argument from the description
 script, desc_file = argv
@@ -139,5 +153,6 @@ def commandargs(target,data):
 requirements(target,data)
 hierarchy(target)
 commandargs(target,data)
+details(target)
 
 target.close

cpp_kernels/array_partition/README.rst

Lines changed: 64 additions & 0 deletions
@@ -36,3 +36,67 @@ Once the environment has been configured, the application can be executed by

   ./array_partition <matmul XCLBIN>

DETAILS
-------

This example demonstrates how ``array partition`` in an HLS kernel can
improve performance. Matrix multiplication is used to showcase the
benefit of array partitioning. The design contains two kernels:
“matmul”, a simple matrix multiplication, and “matmul_partition”, a
matrix multiplication implemented with array partitioning.

``#pragma HLS array_partition`` partitions an array into multiple
smaller arrays or memories. Arrays can be partitioned in three ways:
``cyclic``, ``block`` and ``complete``. In this example, ``complete``
partitioning is applied to one dimension of the local matrix arrays, as
shown below:

.. code:: cpp

   int B[MAX_SIZE][MAX_SIZE];
   int C[MAX_SIZE][MAX_SIZE];
   #pragma HLS ARRAY_PARTITION variable = B dim = 2 complete
   #pragma HLS ARRAY_PARTITION variable = C dim = 2 complete

This partitioning lets the design access all elements along the second
dimension of matrices B and C concurrently, which reduces the overall
latency.
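
To make the access pattern concrete, here is a minimal sketch of how such a kernel body could be written. It is not the example's actual source: the function name, the small ``MAX_SIZE`` value and the loop structure are assumptions, and only the two ``ARRAY_PARTITION`` pragmas come from the text above. An ordinary C++ compiler ignores the HLS pragmas, so the sketch also runs as plain software.

```cpp
#include <cassert>

// Hypothetical bound chosen for illustration; the real example defines its own MAX_SIZE.
constexpr int MAX_SIZE = 16;

// Sketch: out = in1 * in2 for row-major size x size matrices, size <= MAX_SIZE.
void matmul_partition_sketch(const int *in1, const int *in2, int *out, int size) {
    assert(size > 0 && size <= MAX_SIZE);

    int A[MAX_SIZE][MAX_SIZE];
    int B[MAX_SIZE][MAX_SIZE];
    int C[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable = B dim = 2 complete
#pragma HLS ARRAY_PARTITION variable = C dim = 2 complete

    // Copy the inputs into local arrays and clear the accumulator.
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            A[i][j] = in1[i * size + j];
            B[i][j] = in2[i * size + j];
            C[i][j] = 0;
        }
    }

    for (int i = 0; i < size; i++) {
        for (int k = 0; k < size; k++) {
#pragma HLS PIPELINE II = 1
            // Each j touches B[k][j] and C[i][j]. With dimension 2 completely
            // partitioned, every j has its own memory, so the accesses in this
            // inner loop no longer compete for the ports of a single RAM.
            for (int j = 0; j < size; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    // Write the result back in row-major order.
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            out[i * size + j] = C[i][j];
        }
    }
}
```

The inner ``j`` loop is the one that walks the second dimension, which is why partitioning ``dim = 2`` (rather than ``dim = 1``) is the choice that matches this loop order.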

To see the benefit of array partitioning, look at the overall latency in
the system estimate report. Latency information for the plain matmul
kernel (without partitioning):

::

   Compute Unit  Kernel Name  Module Name  Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
   ------------  -----------  -----------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
   matmul_1      matmul       matmul       2856 ~ 2859     2855           2857          2858            9.516 us         9.522 us        9.526 us

Latency information for the matrix multiplication kernel with
partitioning:

::

   Compute Unit        Kernel Name       Module Name       Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
   ------------------  ----------------  ----------------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
   matmul_partition_1  matmul_partition  matmul_partition  1063 ~ 1066     1062           1064          1065            3.540 us         3.546 us        3.550 us

The example generates the following output when run on an Alveo U200
card:

::

   Found Platform
   Platform Name: Xilinx
   INFO: Reading ./build_dir.hw.xilinx_u200_qdma_201910_1/matmul.xclbin
   Loading: './build_dir.hw.xilinx_u200_qdma_201910_1/matmul.xclbin'
   |-------------------------+-------------------------|
   | Kernel                  |    Wall-Clock Time (ns) |
   |-------------------------+-------------------------|
   | matmul:                 |                  396685 |
   | matmul: partition       |                  256367 |
   |-------------------------+-------------------------|
   Note: Wall Clock Time is meaningful for real hardware execution only, not for emulation.
   Please refer to profile summary for kernel execution time for hardware emulation.
   TEST PASSED
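
As a quick worked check of the figures above, the sketch below turns them into speedup ratios. The helper name ``speedup`` is hypothetical; the numbers are copied from the latency reports and the wall-clock table. The estimated average latency improves by roughly 2.7x (2857 vs. 1064 cycles), while the measured wall-clock time improves by roughly 1.5x, presumably because wall-clock time covers more than the kernel's compute latency.

```cpp
#include <cassert>

// Ratio of a baseline measurement to an optimized one; values above 1 mean
// the optimized version is faster.
double speedup(double baseline, double optimized) {
    assert(optimized > 0.0);
    return baseline / optimized;
}

// Average latency in cycles, from the system estimate reports above.
const double estimated_speedup = speedup(2857.0, 1064.0);

// Wall-clock time in nanoseconds, from the hardware run above.
const double measured_speedup = speedup(396685.0, 256367.0);
```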

For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.

cpp_kernels/bind_op_storage/README.rst

Lines changed: 41 additions & 0 deletions
@@ -37,3 +37,44 @@ Once the environment has been configured, the application can be executed by

   ./bind_op_storage <vadd XCLBIN>

DETAILS
-------

This design shows an easy way to specify a hardware resource and its
properties using the new bind_op and bind_storage pragmas.

bind_op pragma
~~~~~~~~~~~~~~

::

   #pragma HLS bind_op variable=<string> op=<string> impl=<string> latency=<unsigned>

The bind_op pragma selects DSP or non-DSP resources for an operation and
lets the user specify its latency. In this example, the addition and
multiplication operations are mapped to DSP resources.

::

   #pragma HLS BIND_OP variable=v1_buffer op=mul impl=DSP latency=2
   #pragma HLS BIND_OP variable=v2_buffer op=mul impl=DSP latency=2
   #pragma HLS BIND_OP variable=vout_buffer op=add impl=DSP

bind_storage pragma
~~~~~~~~~~~~~~~~~~~

::

   #pragma HLS bind_storage variable=<string> type=<string> impl=<string> latency=<unsigned>

The bind_storage pragma gives the flexibility to choose the port type
(FIFO/RAM_1P/RAM_2P), the storage (BRAM/URAM/LUTRAM) and the latency. In
this example, RAM_1P with latency 2 is used for the input buffers.

::

   #pragma HLS BIND_STORAGE variable=v1_buffer type=RAM_1P impl=BRAM latency=2
   #pragma HLS BIND_STORAGE variable=v2_buffer type=RAM_1P impl=LUTRAM latency=2
   #pragma HLS BIND_STORAGE variable=vout_buffer type=RAM_1P impl=URAM
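
The sketch below shows where such pragmas could sit inside a kernel body. It is only an illustration under stated assumptions: the function name, ``BUFFER_SIZE``, the ``prod`` temporary and the multiply-then-add arithmetic are hypothetical, and attaching ``BIND_OP`` to the variable that receives the operation's result is an assumption about how the pragma is applied. The buffer names and the pragma options are the ones listed above. A regular C++ compiler ignores the pragmas, so the function can be run as a software model.

```cpp
#include <cassert>

// Hypothetical chunk size; the real example defines its own buffer size.
constexpr int BUFFER_SIZE = 4;

// Sketch: out[i] = in1[i] * in2[i] + in1[i], processed in local-buffer chunks.
void mul_add_sketch(const unsigned int *in1, const unsigned int *in2,
                    unsigned int *out, int size) {
    unsigned int v1_buffer[BUFFER_SIZE];
    unsigned int v2_buffer[BUFFER_SIZE];
    unsigned int vout_buffer[BUFFER_SIZE];
    // Storage binding: single-port RAMs, each mapped to a different memory type.
#pragma HLS BIND_STORAGE variable=v1_buffer type=RAM_1P impl=BRAM latency=2
#pragma HLS BIND_STORAGE variable=v2_buffer type=RAM_1P impl=LUTRAM latency=2
#pragma HLS BIND_STORAGE variable=vout_buffer type=RAM_1P impl=URAM
    // Operator binding: the addition that produces vout_buffer uses a DSP.
#pragma HLS BIND_OP variable=vout_buffer op=add impl=DSP

    for (int i = 0; i < size; i += BUFFER_SIZE) {
        int chunk = (size - i < BUFFER_SIZE) ? (size - i) : BUFFER_SIZE;

        // Load one chunk of each input into its local buffer.
        for (int j = 0; j < chunk; j++) {
            v1_buffer[j] = in1[i + j];
            v2_buffer[j] = in2[i + j];
        }

        for (int j = 0; j < chunk; j++) {
            unsigned int prod = v1_buffer[j] * v2_buffer[j];
            // Assumed placement: bind the multiply that produces prod to a DSP.
#pragma HLS BIND_OP variable=prod op=mul impl=DSP latency=2
            vout_buffer[j] = prod + v1_buffer[j];
        }

        // Store the chunk of results.
        for (int j = 0; j < chunk; j++) {
            out[i + j] = vout_buffer[j];
        }
    }
}
```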

For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.

cpp_kernels/burst_rw/README.rst

Lines changed: 21 additions & 0 deletions
@@ -35,3 +35,24 @@ Once the environment has been configured, the application can be executed by

   ./burst_rw <vadd XCLBIN>

DETAILS
-------

The usual reason for having a burst mode capability, or using burst
mode, is to increase data throughput. This example demonstrates how
multiple items can be read from global memory into the kernel’s local
memory in a single burst. This is done to achieve low memory access
latency and to make efficient use of the bandwidth provided by the
``m_axi`` interface. Similarly, computation results are stored in a
buffer and written to global memory in a burst. Auto-pipelining applies
a pipeline to these loops.

The for loops must meet the following requirements to implement burst
read/write:

-  Pipeline the loop: the loop pipeline must have II (initiation
   interval) = 1, specified by the pipeline pragma inside the loop.
-  Aligned memory: memory addresses for read/write should be
   contiguous.
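
The two requirements above can be sketched as follows. This is a software model under assumptions rather than the example's kernel: the function name, ``BURST_BUFFER_SIZE`` and the add-a-constant computation are hypothetical. What it illustrates is the shape of the loops: each one is pipelined with II = 1 and walks consecutive addresses.

```cpp
#include <cassert>

// Hypothetical local buffer size used for illustration.
constexpr int BURST_BUFFER_SIZE = 4;

// Sketch: add inc_value to every element of a, one burst-sized chunk at a time.
void add_inc_burst_sketch(int *a, int size, int inc_value) {
    int burstbuffer[BURST_BUFFER_SIZE];

    for (int i = 0; i < size; i += BURST_BUFFER_SIZE) {
        int chunk_size = (size - i < BURST_BUFFER_SIZE) ? (size - i) : BURST_BUFFER_SIZE;

        // Burst read: consecutive addresses a[i], a[i + 1], ... in a pipelined loop.
        for (int j = 0; j < chunk_size; j++) {
#pragma HLS PIPELINE II = 1
            burstbuffer[j] = a[i + j];
        }

        // Compute and burst write: results go back to consecutive addresses.
        for (int j = 0; j < chunk_size; j++) {
#pragma HLS PIPELINE II = 1
            a[i + j] = burstbuffer[j] + inc_value;
        }
    }
}
```

A loop that skipped addresses (for example ``a[2 * j]``) would break the contiguity requirement, so the tool could not be expected to merge the accesses into one burst.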

For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.

cpp_kernels/critical_path/README.rst

Lines changed: 48 additions & 0 deletions
@@ -34,3 +34,51 @@ Once the environment has been configured, the application can be executed by

   ./critical_path -x <apply_watermark_GOOD XCLBIN> -i $(ABS_COMMON_REPO)/common/data/xilinx_img.bmp -c ./data/golden.bmp

DETAILS
-------

This example demonstrates coding-style considerations for avoiding
critical paths in kernels.

The ``Apply_watermark`` kernel processes the image’s pixels concurrently
by using ``HLS UNROLL``. However, a global variable ``x`` is updated in
every iteration, which nullifies the speedup offered by unrolling the
loop and creates a critical path.

.. code:: cpp

   watermark:
       for (int i = 0; i < DATA_SIZE; i++, x++) {
   #pragma HLS UNROLL
           if (x > width) {
               x = x - width;
               y += 1;
           }

           uint w_idy = y % WATERMARK_HEIGHT;
           uint w_idx = x % WATERMARK_WIDTH;
           tmp.data[i] = saturatedAdd(tmp.data[i], watermark[w_idy][w_idx]);
       }

Using local variables that only read the value of ``x`` in each
iteration, and updating ``x`` once outside the loop, removes this
critical path and thus improves the performance and timing of the
kernel.

.. code:: cpp

   for (int i = 0; i < DATA_SIZE; i++) {
   #pragma HLS UNROLL
       uint tmp_x = x + i;
       uint tmp_y = y;
       if (tmp_x > width) {
           tmp_x = tmp_x - width;
           tmp_y += 1;
       }

       uint w_idy = tmp_y % WATERMARK_HEIGHT;
       uint w_idx = tmp_x % WATERMARK_WIDTH;
       tmp.data[i] = saturatedAdd(tmp.data[i], watermark[w_idy][w_idx]);
   }
   x += DATA_SIZE;
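
One way to convince yourself that the rewrite keeps the same watermark indices is to model both loops in plain C++ and compare them for a single chunk. The sketch below does that under assumptions: the constants are small hypothetical values, the helper names are invented, and the comparison only covers one pass of the loop in which ``x`` wraps at most once (starting ``x`` no larger than ``width`` and ``DATA_SIZE`` no larger than ``width``).

```cpp
#include <cassert>
#include <vector>

// Hypothetical sizes chosen for illustration only.
constexpr int DATA_SIZE = 4;
constexpr unsigned WATERMARK_WIDTH = 3;
constexpr unsigned WATERMARK_HEIGHT = 2;

struct Index {
    unsigned w_idy;
    unsigned w_idx;
};

// First form: x and y are carried from one iteration to the next.
std::vector<Index> indices_carried(unsigned x, unsigned y, unsigned width) {
    std::vector<Index> result;
    for (int i = 0; i < DATA_SIZE; i++, x++) {
        if (x > width) {
            x = x - width;
            y += 1;
        }
        result.push_back({y % WATERMARK_HEIGHT, x % WATERMARK_WIDTH});
    }
    return result;
}

// Second form: every iteration derives its own values from the loop-invariant
// x and y, so no iteration waits on the previous one.
std::vector<Index> indices_independent(unsigned x, unsigned y, unsigned width) {
    std::vector<Index> result;
    for (int i = 0; i < DATA_SIZE; i++) {
        unsigned tmp_x = x + i;
        unsigned tmp_y = y;
        if (tmp_x > width) {
            tmp_x = tmp_x - width;
            tmp_y += 1;
        }
        result.push_back({tmp_y % WATERMARK_HEIGHT, tmp_x % WATERMARK_WIDTH});
    }
    return result;
}

// True when both forms produce identical index pairs for one chunk.
bool same_indices(unsigned x, unsigned y, unsigned width) {
    std::vector<Index> a = indices_carried(x, y, width);
    std::vector<Index> b = indices_independent(x, y, width);
    for (int i = 0; i < DATA_SIZE; i++) {
        if (a[i].w_idy != b[i].w_idy || a[i].w_idx != b[i].w_idx) {
            return false;
        }
    }
    return true;
}
```

The second form is the one HLS can unroll into independent lanes, because ``tmp_x`` and ``tmp_y`` depend only on ``i`` and on values fixed before the loop starts.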

For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.

cpp_kernels/custom_datatype/README.rst

Lines changed: 41 additions & 0 deletions
@@ -36,3 +36,44 @@ Once the environment has been configured, the application can be executed by

   ./custom_datatype -x <rgb_to_hsv XCLBIN> -i $(ABS_COMMON_REPO)/common/data/xilinx_logo.bmp

DETAILS
-------

Kernel ports can have custom datatypes. It is recommended that the size
of a custom datatype be a power of 2 and at least 32 bits, to allow
``burst transfer`` and thus use the AXI master bandwidth efficiently.
Extra ``padding`` can be added if the size is not a multiple of 32 bits,
as shown below.

.. code:: cpp

   typedef struct RGBcolor_struct
   {
     unsigned char r;
     unsigned char g;
     unsigned char b;
     unsigned char pad;
   } RGBcolor;

   typedef struct HSVcolor_struct
   {
     unsigned char h;
     unsigned char s;
     unsigned char v;
     unsigned char pad;
   } HSVcolor;

The kernel in this example uses the above structures as the datatypes of
its input and output ports.

::

   void rgb_to_hsv(RGBcolor *in,  // Access global memory as RGBcolor struct-wise
                   HSVcolor *out, // Access global memory as HSVcolor struct-wise
                   int size)
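
The effect of the ``pad`` member can be checked at compile time. The sketch below is illustrative only: ``RGBcolor_unpadded`` and ``is_power_of_two`` are hypothetical names introduced here, and the size checks assume 8-bit bytes and a compiler that adds no padding to structs made solely of ``unsigned char`` members, which holds on common platforms.

```cpp
#include <cassert>
#include <cstddef>

// Padded struct, as in the example above: four bytes, i.e. 32 bits.
typedef struct RGBcolor_struct {
    unsigned char r;
    unsigned char g;
    unsigned char b;
    unsigned char pad;
} RGBcolor;

// Hypothetical unpadded variant, shown only for comparison: three bytes, i.e. 24 bits.
typedef struct RGBcolor_unpadded_struct {
    unsigned char r;
    unsigned char g;
    unsigned char b;
} RGBcolor_unpadded;

// True when n is a non-zero power of two.
constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// The padded struct meets the recommendation: a power-of-2 width of 32 bits.
static_assert(sizeof(RGBcolor) == 4, "RGBcolor is expected to occupy 32 bits");
static_assert(is_power_of_two(sizeof(RGBcolor) * 8), "32 is a power of two");

// Without the pad member the width is 24 bits, which is not a power of two.
static_assert(sizeof(RGBcolor_unpadded) == 3, "unpadded struct occupies 24 bits");
static_assert(!is_power_of_two(sizeof(RGBcolor_unpadded) * 8), "24 is not a power of two");
```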

Custom datatypes can be used to reduce the number of
``kernel arguments``, thus reducing the number of interfaces between
kernels and memory. They can also reduce the time spent setting kernel
arguments when the number of kernel arguments is large.

For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.

cpp_kernels/dataflow_stream/README.rst

Lines changed: 28 additions & 0 deletions
@@ -35,3 +35,31 @@ Once the environment has been configured, the application can be executed by

   ./dataflow_stream <adder XCLBIN>

DETAILS
-------

This example explains how ``#pragma HLS dataflow`` can be used to
implement task-level parallelism using the ``HLS Stream`` datatype.

When data stored in an array is consumed or produced sequentially, a
more efficient communication mechanism is to use streaming data as
specified by the ``STREAM`` pragma, where FIFOs are used instead of
RAMs. The depth of the ``FIFO`` can be specified with the ``depth``
option of the pragma.

.. code:: cpp

   #pragma HLS STREAM variable = inStream depth = 32
   #pragma HLS STREAM variable = outStream depth = 32

Vector addition in the kernel is divided into 3 sub-tasks (read,
compute_add and write), which are then performed concurrently using
``Dataflow``.

.. code:: cpp

   #pragma HLS dataflow
       read_input(in, inStream, size);
       compute_add(inStream, outStream, inc, size);
       write_result(out, outStream, size);
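
The three sub-tasks can be modelled in ordinary C++ to see how data moves between them. In the sketch below, ``std::queue<int>`` stands in for ``hls::stream<int>`` (an assumption made so the code builds without the HLS headers), the bodies of the three functions are guesses based on their names, and ``adder_sketch`` is a hypothetical wrapper. In software the stages simply run one after another; the overlap described above only happens once HLS applies the ``dataflow`` pragma.

```cpp
#include <cassert>
#include <queue>

// Stand-in for hls::stream<int>; a real kernel would include hls_stream.h.
typedef std::queue<int> Stream;

// Stage 1: move the input vector from memory into a stream.
static void read_input(const int *in, Stream &inStream, int size) {
    for (int i = 0; i < size; i++) {
        inStream.push(in[i]);
    }
}

// Stage 2: add inc to each element as it passes from one stream to the next.
static void compute_add(Stream &inStream, Stream &outStream, int inc, int size) {
    for (int i = 0; i < size; i++) {
        outStream.push(inStream.front() + inc);
        inStream.pop();
    }
}

// Stage 3: drain the result stream back into memory.
static void write_result(int *out, Stream &outStream, int size) {
    for (int i = 0; i < size; i++) {
        out[i] = outStream.front();
        outStream.pop();
    }
}

// Hypothetical top-level function wiring the three stages together.
void adder_sketch(const int *in, int *out, int inc, int size) {
    Stream inStream;
    Stream outStream;
#pragma HLS STREAM variable = inStream depth = 32
#pragma HLS STREAM variable = outStream depth = 32

#pragma HLS dataflow
    read_input(in, inStream, size);
    compute_add(inStream, outStream, inc, size);
    write_result(out, outStream, size);
}
```

Note that the queue in this model grows without bound, whereas a hardware FIFO of depth 32 holds at most 32 elements and relies on the stages running concurrently.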

For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.

cpp_kernels/dataflow_stream_array/README.rst

Lines changed: 42 additions & 0 deletions
@@ -35,3 +35,45 @@ Once the environment has been configured, the application can be executed by

   ./dataflow_stream_array <N_stage_Adders XCLBIN>

DETAILS
-------

This example demonstrates the use of an array of ``HLS streams`` in
kernels.

The kernel performs a number of vector additions. The initial vector is
read from global memory and written into a stream. The operator ``<<``
is overloaded to perform a ``blocking write`` to a stream from a
variable.

.. code:: cpp

   mem_rd:
       for (int i = 0; i < size; i++) {
   #pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
           inStream << input[i];
       }

Multiple additions are performed using the ``adder`` function, which
takes its input from one stream and provides its output to another
stream.

.. code:: cpp

   compute_loop:
       for (int i = 0; i < STAGES; i++) {
   #pragma HLS UNROLL
           adder(streamArray[i], streamArray[i + 1], incr, size);
       }

Finally, the result is written back from the stream to a global memory
buffer.

.. code:: cpp

   static void write_result(int *output, hls::stream<int> &outStream, int size) {
   mem_wr:
       for (int i = 0; i < size; i++) {
   #pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
           output[i] = outStream.read();
       }
   }
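
Putting the three fragments together, the chain of adders can be modelled as below. This is a software sketch under assumptions: ``std::queue<int>`` replaces ``hls::stream<int>``, ``STAGES`` is set to a small hypothetical value, the body of ``adder`` is inferred from its description, and ``n_stage_adders_sketch`` is an invented wrapper name. One detail the loop above implies is that the array needs ``STAGES + 1`` streams, since stage ``i`` reads stream ``i`` and writes stream ``i + 1``.

```cpp
#include <cassert>
#include <queue>

// Hypothetical number of adder stages.
constexpr int STAGES = 4;

// Stand-in for hls::stream<int>; a real kernel would include hls_stream.h.
typedef std::queue<int> Stream;

// One stage: read size values, add incr to each, pass them on.
static void adder(Stream &in, Stream &out, int incr, int size) {
    for (int i = 0; i < size; i++) {
        out.push(in.front() + incr);
        in.pop();
    }
}

// Hypothetical top-level function: output[i] = input[i] + STAGES * incr.
void n_stage_adders_sketch(const int *input, int *output, int incr, int size) {
    // STAGES adders need STAGES + 1 streams: one before each stage, one after the last.
    Stream streamArray[STAGES + 1];

    // Feed the first stream from memory.
    for (int i = 0; i < size; i++) {
        streamArray[0].push(input[i]);
    }

    // Chain the adders; with HLS UNROLL each iteration becomes its own stage.
    for (int i = 0; i < STAGES; i++) {
#pragma HLS UNROLL
        adder(streamArray[i], streamArray[i + 1], incr, size);
    }

    // Drain the last stream back to memory.
    for (int i = 0; i < size; i++) {
        output[i] = streamArray[STAGES].front();
        streamArray[STAGES].pop();
    }
}
```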

For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
