
Commit 092495b

Saumya Garg authored and GitHub Enterprise committed
Updating code with load-compute-store model
1 parent 6d93282 commit 092495b

File tree: 7 files changed (+104 −84 lines)


host/multiple_cus_asymmetrical/README.rst

Lines changed: 6 additions & 6 deletions

@@ -3,7 +3,7 @@ Multiple Compute Units (Asymmetrical) (C)
 
 This is simple example of vector addition to demonstrate how to connect each compute unit to different banks and how to use these compute units in host applications
 
-**KEY CONCEPTS:** `Multiple Compute Units <https://www.xilinx.com/html_docs/xilinx2021_1/vitis_doc/opencl_programming.html#dqz1555367565037>`__
+**KEY CONCEPTS:** `Multiple Compute Units <https://www.xilinx.com/html_docs/xilinx2021_1/vitis_doc/opencl_programming.html#dqz1555367565037>`__, `Task Level Parallelism <https://www.xilinx.com/html_docs/xilinx2021_1/vitis_doc/optimizingperformance.html#cvc1523913889499>`__
 
 EXCLUDED PLATFORMS
 ------------------
@@ -56,16 +56,16 @@ Kernel can be connected to different banks using vadd.cfg file as below:
 nk=vadd:4:vadd_1.vadd_2.vadd_3.vadd_4
 sp=vadd_1.in1:DDR[0]
 sp=vadd_1.in2:DDR[0]
-sp=vadd_1.out_r:DDR[0]
+sp=vadd_1.out:DDR[0]
 sp=vadd_2.in1:DDR[1]
 sp=vadd_2.in2:DDR[1]
-sp=vadd_2.out_r:DDR[1]
+sp=vadd_2.out:DDR[1]
 sp=vadd_3.in1:PLRAM[0]
 sp=vadd_3.in2:PLRAM[0]
-sp=vadd_3.out_r:PLRAM[0]
+sp=vadd_3.out:PLRAM[0]
 sp=vadd_4.in1:PLRAM[1]
 sp=vadd_4.in2:PLRAM[1]
-sp=vadd_4.out_r:PLRAM[1]
+sp=vadd_4.out:PLRAM[1]
 
 Some of the vadd compute units are connected to DDR banks and some are
 connected to PLRAMs. ``nk`` option can be used to specify how many
@@ -90,4 +90,4 @@ The kernel object which is created above is very specific to ``vadd_1``
 compute unit. Using this Kernel Object, host can directly access to this
 fix compute unit.
 
-For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
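The last hunk above refers to a kernel object bound to one specific compute unit. A minimal host-side sketch of that pattern (not part of this commit), assuming a cl_program already built from an xclbin containing vadd_1 through vadd_4; the helper name get_vadd_cu is illustrative:

#include <string>
#include <CL/cl.h>

// Bind a kernel handle to one named compute unit using the Vitis
// "kernel:{cu}" naming convention, e.g. get_vadd_cu(program, "vadd_1").
cl_kernel get_vadd_cu(cl_program program, const std::string& cu) {
    std::string name = "vadd:{" + cu + "}";
    return clCreateKernel(program, name.c_str(), nullptr);
}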

host/multiple_cus_asymmetrical/description.json

Lines changed: 2 additions & 1 deletion

@@ -5,7 +5,8 @@
     ],
     "flow": "vitis",
     "key_concepts": [
-        "Multiple Compute Units"
+        "Multiple Compute Units",
+        "Task Level Parallelism"
     ],
     "platform_blacklist": [
        "_u25_",

host/multiple_cus_asymmetrical/details.rst

Lines changed: 4 additions & 4 deletions

@@ -13,16 +13,16 @@ Kernel can be connected to different banks using vadd.cfg file as below:
 nk=vadd:4:vadd_1.vadd_2.vadd_3.vadd_4
 sp=vadd_1.in1:DDR[0]
 sp=vadd_1.in2:DDR[0]
-sp=vadd_1.out_r:DDR[0]
+sp=vadd_1.out:DDR[0]
 sp=vadd_2.in1:DDR[1]
 sp=vadd_2.in2:DDR[1]
-sp=vadd_2.out_r:DDR[1]
+sp=vadd_2.out:DDR[1]
 sp=vadd_3.in1:PLRAM[0]
 sp=vadd_3.in2:PLRAM[0]
-sp=vadd_3.out_r:PLRAM[0]
+sp=vadd_3.out:PLRAM[0]
 sp=vadd_4.in1:PLRAM[1]
 sp=vadd_4.in2:PLRAM[1]
-sp=vadd_4.out_r:PLRAM[1]
+sp=vadd_4.out:PLRAM[1]
 
 Some of the vadd compute units are connected to DDR banks and some are
 connected to PLRAMs. ``nk`` option can be used to specify how many

host/multiple_cus_asymmetrical/qor.json

Lines changed: 3 additions & 7 deletions

@@ -12,19 +12,15 @@
     "check_warning": "false",
     "loops": [
         {
-            "name": "read1",
+            "name": "mem_rd",
             "PipelineII": "1"
         },
         {
-            "name": "read2",
+            "name": "execute",
             "PipelineII": "1"
         },
         {
-            "name": "vadd",
-            "PipelineII": "1"
-        },
-        {
-            "name": "write",
+            "name": "mem_rw",
             "PipelineII": "1"
         }
     ]

host/multiple_cus_asymmetrical/src/vadd.cpp

Lines changed: 85 additions & 61 deletions

@@ -13,78 +13,102 @@
  * License for the specific language governing permissions and limitations
  * under the License.
  */
-// Work load of each CU
-#define BUFFER_SIZE 1024
-#define DATA_SIZE 4096
-
-// TRIPCOUNT indentifier
-const unsigned int c_len = DATA_SIZE / BUFFER_SIZE;
-const unsigned int c_size = BUFFER_SIZE;
 
-/*
-    Vector Addition Kernel Implementation
-    Arguments:
-        in1   (input)  --> Input Vector1
-        in2   (input)  --> Input Vector2
-        out_r (output) --> Output Vector
-        size  (input)  --> Size of Vector in Integer
-*/
+/*******************************************************************************
+Description:
+    This example uses the load/compute/store coding style which is generally
+    the most efficient for implementing kernels using HLS. The load and store
+    functions are responsible for moving data in and out of the kernel as
+    efficiently as possible. The core functionality is decomposed across one
+    of more compute functions. Whenever possible, the compute function should
+    pass data through HLS streams and should contain a single set of nested loops.
+    HLS stream objects are used to pass data between producer and consumer
+    functions. Stream read and write operations have a blocking behavior which
+    allows consumers and producers to synchronize with each other automatically.
+    The dataflow pragma instructs the compiler to enable task-level pipelining.
+    This is required for to load/compute/store functions to execute in a parallel
+    and pipelined manner. Here the kernel loads, computes and stores NUM_WORDS integer values per
+    clock cycle and is implemented as below:
+                                       _____________
+                                      |             |<----- Input Vector 1 from Global Memory
+                                      |  load_input |       __
+                                      |_____________|----->|  |
+                                       _____________       |  | in1_stream
+Input Vector 2 from Global Memory --->|             |      |__|
+                               __     |  load_input |        |
+                              |  |<---|_____________|        |
+                   in2_stream |  |     _____________         |
+                              |__|--->|             |<--------
+                                      | compute_add |      __
+                                      |_____________|---->|  |
+                                       ______________     |  | out_stream
+                                      |              |<---|__|
+                                      | store_result |
+                                      |______________|-----> Output result to Global Memory
 
-extern "C" {
-void vadd(const unsigned int* in1, // Read-Only Vector 1
-          const unsigned int* in2, // Read-Only Vector 2
-          unsigned int* out_r,     // Output Result
-          int size                 // Size in integer
-          ) {
-    unsigned int v1_buffer[BUFFER_SIZE];   // Local memory to store vector1
-    unsigned int v2_buffer[BUFFER_SIZE];   // Local memory to store vector2
-    unsigned int vout_buffer[BUFFER_SIZE]; // Local Memory to store result
+*******************************************************************************/
 
-    // Per iteration of this loop perform BUFFER_SIZE vector addition
-    for (int i = 0; i < size; i += BUFFER_SIZE) {
-#pragma HLS LOOP_TRIPCOUNT min = c_len max = c_len
-        int chunk_size = BUFFER_SIZE;
-        // boundary checks
-        if ((i + BUFFER_SIZE) > size) chunk_size = size - i;
+#include <stdint.h>
+#include <hls_stream.h>
 
-        // Transferring data in bursts hides the memory access latency as well as
-        // improves bandwidth utilization and efficiency of the memory controller.
-        // It is recommended to infer burst transfers from successive requests of data
-        // from consecutive address locations.
-        // A local memory vl_local is used for buffering the data from a single burst.
-        // The entire input vector is read in multiple bursts.
-        // The choice of LOCAL_MEM_SIZE depends on the specific applications and
-        // available on-chip memory on target FPGA.
-        // burst read of v1 and v2 vector from global memory
+#define DATA_SIZE 4096
 
-        // Auto-pipeline is going to apply pipeline to these loops
-    read1:
-        for (int j = 0; j < chunk_size; j++) {
-#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
-            v1_buffer[j] = in1[i + j];
-        }
+// TRIPCOUNT identifier
+const int c_size = DATA_SIZE;
 
-    read2:
-        for (int j = 0; j < chunk_size; j++) {
+static void read_input(unsigned int* in, hls::stream<unsigned int>& inStream, int size) {
+// Auto-pipeline is going to apply pipeline to this loop
+mem_rd:
+    for (int i = 0; i < size; i++) {
 #pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
-            v2_buffer[j] = in2[i + j];
-        }
+        inStream << in[i];
+    }
+}
 
-        // PIPELINE pragma reduces the initiation interval for loop by allowing the
-        // concurrent executions of operations
-    vadd:
-        for (int j = 0; j < chunk_size; j++) {
+static void compute_add(hls::stream<unsigned int>& inStream1,
+                        hls::stream<unsigned int>& inStream2,
+                        hls::stream<unsigned int>& outStream,
+                        int size) {
+// Auto-pipeline is going to apply pipeline to this loop
+execute:
+    for (int i = 0; i < size; i++) {
 #pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
-            // perform vector addition
-            vout_buffer[j] = v1_buffer[j] + v2_buffer[j];
-        }
+        outStream << (inStream1.read() + inStream2.read());
+    }
+}
 
-        // burst write the result
-    write:
-        for (int j = 0; j < chunk_size; j++) {
+static void write_result(unsigned int* out, hls::stream<unsigned int>& outStream, int size) {
+// Auto-pipeline is going to apply pipeline to this loop
+mem_wr:
+    for (int i = 0; i < size; i++) {
 #pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
-            out_r[i + j] = vout_buffer[j];
-        }
+        out[i] = outStream.read();
     }
 }
+
+extern "C" {
+/*
+    Vector Addition Kernel Implementation using dataflow
+    Arguments:
+        in1  (input)  --> Input Vector 1
+        in2  (input)  --> Input Vector 2
+        out  (output) --> Output Vector
+        size (input)  --> Size of Vector in Integer
+*/
+void vadd(unsigned int* in1, unsigned int* in2, unsigned int* out, int size) {
+    static hls::stream<unsigned int> inStream1("input_stream_1");
+    static hls::stream<unsigned int> inStream2("input_stream_2");
+    static hls::stream<unsigned int> outStream("output_stream");
+
+#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
+#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
+#pragma HLS INTERFACE m_axi port = out bundle = gmem0
+
+#pragma HLS dataflow
+    // dataflow pragma instruct compiler to run following three APIs in parallel
+    read_input(in1, inStream1, size);
+    read_input(in2, inStream2, size);
+    compute_add(inStream1, inStream2, outStream, size);
+    write_result(out, outStream, size);
+}
+}
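Because the rewritten kernel is plain C++, the load/compute/store pipeline can be exercised in C simulation with an ordinary testbench. A minimal sketch (not part of this commit), assuming the new vadd() above is compiled together with it and hls_stream.h is on the include path:

#include <cstdlib>
#include <iostream>
#include <vector>

#define DATA_SIZE 4096

// The dataflow kernel from src/vadd.cpp above.
extern "C" void vadd(unsigned int* in1, unsigned int* in2, unsigned int* out, int size);

int main() {
    std::vector<unsigned int> in1(DATA_SIZE), in2(DATA_SIZE), out(DATA_SIZE, 0);
    for (int i = 0; i < DATA_SIZE; i++) {
        in1[i] = i;
        in2[i] = 2 * i;
    }

    // One call drains all three streams: each producer/consumer loop
    // iterates exactly `size` times.
    vadd(in1.data(), in2.data(), out.data(), DATA_SIZE);

    for (int i = 0; i < DATA_SIZE; i++) {
        if (out[i] != in1[i] + in2[i]) {
            std::cout << "TEST FAILED at index " << i << std::endl;
            return EXIT_FAILURE;
        }
    }
    std::cout << "TEST PASSED" << std::endl;
    return EXIT_SUCCESS;
}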
host/multiple_cus_asymmetrical/vadd.cfg

Lines changed: 4 additions & 4 deletions

@@ -1,14 +1,14 @@
 [connectivity]
 sp=vadd_1.in1:DDR[0]
 sp=vadd_1.in2:DDR[0]
-sp=vadd_1.out_r:DDR[0]
+sp=vadd_1.out:DDR[0]
 sp=vadd_2.in1:DDR[1]
 sp=vadd_2.in2:DDR[1]
-sp=vadd_2.out_r:DDR[1]
+sp=vadd_2.out:DDR[1]
 sp=vadd_3.in1:PLRAM[0]
 sp=vadd_3.in2:PLRAM[0]
-sp=vadd_3.out_r:PLRAM[0]
+sp=vadd_3.out:PLRAM[0]
 sp=vadd_4.in1:PLRAM[1]
 sp=vadd_4.in2:PLRAM[1]
-sp=vadd_4.out_r:PLRAM[1]
+sp=vadd_4.out:PLRAM[1]
 nk=vadd:4
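This connectivity file is consumed at the v++ link step. A hedged sketch of the invocation (the platform name and file names are illustrative, not from this commit):

v++ -l -t hw --platform xilinx_u200_gen3x16_xdma_2_202110_1 \
    --config vadd.cfg -o vadd.xclbin vadd.xo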

rtl_kernels/rtl_vadd/src/krnl_vadd/vadd_CModel.cpp

Lines changed: 0 additions & 1 deletion

@@ -42,7 +42,6 @@ extern "C" {
 */
 
 void krnl_vadd_rtl(uint32_t* a, uint32_t* b, uint32_t* c, ap_uint<32> length_r) {
-
     for (int i = 0; i < length_r; i++) c[i] = a[i] + b[i];
 }
 }
