Xilinx
diff --git a/common/utility/md2rst/md2rst.py b/common/utility/md2rst/md2rst.py
Lines changed: 1 addition & 0 deletions
diff --git a/common/utility/readme_gen/readme_gen.py b/common/utility/readme_gen/readme_gen.py
Lines changed: 15 additions & 0 deletions
diff --git a/cpp_kernels/array_partition/README.rst b/cpp_kernels/array_partition/README.rst
Lines changed: 64 additions & 0 deletions
diff --git a/cpp_kernels/bind_op_storage/README.rst b/cpp_kernels/bind_op_storage/README.rst
Lines changed: 41 additions & 0 deletions
diff --git a/cpp_kernels/burst_rw/README.rst b/cpp_kernels/burst_rw/README.rst
Lines changed: 21 additions & 0 deletions
diff --git a/cpp_kernels/critical_path/README.rst b/cpp_kernels/critical_path/README.rst
Lines changed: 48 additions & 0 deletions
diff --git a/cpp_kernels/custom_datatype/README.rst b/cpp_kernels/custom_datatype/README.rst
Lines changed: 41 additions & 0 deletions
diff --git a/cpp_kernels/dataflow_stream/README.rst b/cpp_kernels/dataflow_stream/README.rst
Lines changed: 28 additions & 0 deletions
diff --git a/cpp_kernels/dataflow_stream_array/README.rst b/cpp_kernels/dataflow_stream_array/README.rst
Lines changed: 42 additions & 0 deletions
@@ -128,6 +128,7 @@ def commandargs(target,data):
     else:
         target.write('./' + data["host"]["host_exe"])
     target.write("\n\n")
+    return
 
 # Get the argument from the description
 script, desc_file, name = argv
 
@@ -115,7 +115,21 @@ def commandargs(target,data):
     else:
         target.write('./' + data["host"]["host_exe"])
     target.write("\n\n")
+    return
 
+def details(target):
+    listfiles = os.listdir('./')
+    if 'details.rst' in listfiles:
+        target.write("DETAILS\n")
+        target.write("-" * len("DETAILS"))
+        target.write("\n")
+        with open('details.rst', 'r') as fin:
+            for i, x in enumerate(fin):
+                if i >= 2:  # skip the first two lines of details.rst
+                    target.write(x)
+        target.write("\n")
+    target.write("For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.")
+    return
 
 # Get the argument from the description
 script, desc_file = argv
@@ -139,5 +153,6 @@ def commandargs(target,data):
     requirements(target,data)
     hierarchy(target)
     commandargs(target,data)
+    details(target)
 
 target.close
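The behaviour of ``details()`` in the hunk above can be sketched in a self-contained form. This is a minimal sketch rather than the script itself: the ``directory`` parameter, the ``render`` helper and the sample ``details.rst`` content are assumptions added so the example can run in a temporary directory, and it assumes the two skipped lines of ``details.rst`` are a title and its underline.

```python
import io
import os
import tempfile

FOOTER = ("For more comprehensive documentation, "
          "`click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.")


def details(target, directory="."):
    """Append details.rst (minus its first two lines) under a DETAILS heading.

    `directory` is a hypothetical parameter; the original reads from './'.
    """
    path = os.path.join(directory, "details.rst")
    if os.path.isfile(path):
        target.write("DETAILS\n")
        target.write("-" * len("DETAILS"))
        target.write("\n")
        with open(path, "r") as fin:
            for i, line in enumerate(fin):
                if i >= 2:  # skip the first two lines (assumed title + underline)
                    target.write(line)
        target.write("\n")
    # The footer link is written whether or not details.rst exists.
    target.write(FOOTER)


def render(directory):
    """Hypothetical helper: run details() against an in-memory target."""
    out = io.StringIO()
    details(out, directory)
    return out.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    without_details = render(tmp)
    with open(os.path.join(tmp, "details.rst"), "w") as f:
        f.write("Array Partition\n===============\n\nBody text.\n")
    with_details = render(tmp)
```

Without a ``details.rst`` file only the footer link is emitted; with one, the body follows a ``DETAILS`` heading and the footer comes last.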
@@ -36,3 +36,67 @@ Once the environment has been configured, the application can be executed by
 
    ./array_partition <matmul XCLBIN>
 
+DETAILS
+-------
+
+This example demonstrates how ``array partition`` in an HLS kernel can
+improve performance. Matrix multiplication is used to showcase the
+benefit of array partitioning. The design contains two kernels:
+“matmul”, a simple matrix multiplication, and “matmul_partition”, a
+matrix multiplication implementation that uses array
+partitioning.
+
+``#pragma HLS array_partition`` is used to partition an array into
+multiple smaller arrays or memories. Arrays can be partitioned in three
+ways: ``cyclic``, ``block`` and ``complete``. In this example, a
+``complete`` partition is applied to the second dimension of the
+local matrix arrays, as shown below:
+
+.. code:: cpp
+
+   int B[MAX_SIZE][MAX_SIZE];
+   int C[MAX_SIZE][MAX_SIZE];
+   #pragma HLS ARRAY_PARTITION variable = B dim = 2 complete
+   #pragma HLS ARRAY_PARTITION variable = C dim = 2 complete
+
+This array partition lets the design access the 2nd dimension of both
+matrices B and C concurrently, which reduces the overall latency.
+
+To see the benefit of array partition, look at the system estimate
+report and compare the overall latency. Latency information for the
+normal matmul kernel (without partition):
+
+::
+
+   Compute Unit  Kernel Name  Module Name  Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
+   ------------  -----------  -----------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
+   matmul_1      matmul       matmul       2856 ~ 2859     2855           2857          2858            9.516 us         9.522 us        9.526 us
+
+Latency information for the matrix multiplication kernel with partition:
+
+::
+
+   Compute Unit        Kernel Name       Module Name       Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
+   ------------------  ----------------  ----------------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
+   matmul_partition_1  matmul_partition  matmul_partition  1063 ~ 1066     1062           1064          1065            3.540 us         3.546 us        3.550 us
+
+The example generates the following output when run on an Alveo
+U200 card:
+
+::
+
+   Found Platform
+   Platform Name: Xilinx
+   INFO: Reading ./build_dir.hw.xilinx_u200_qdma_201910_1/matmul.xclbin
+   Loading: './build_dir.hw.xilinx_u200_qdma_201910_1/matmul.xclbin'
+   |-------------------------+-------------------------|
+   | Kernel                  |    Wall-Clock Time (ns) |
+   |-------------------------+-------------------------|
+   | matmul:                 |                  396685 |
+   | matmul: partition       |                  256367 |
+   |-------------------------+-------------------------|
+   Note: Wall Clock Time is meaningful for real hardware execution only, not for emulation.
+   Please refer to profile summary for kernel execution time for hardware emulation.
+   TEST PASSED
+
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
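The three partition types named in this README (``cyclic``, ``block`` and ``complete``) can be illustrated with a small software model. The sketch below only shows how array indices are commonly described as being distributed across banks; the function names and the ``factor`` argument are hypothetical, and the actual mapping is performed by the HLS tool.

```python
def cyclic_partition(n, factor):
    """Interleave elements: index i lands in bank i % factor."""
    banks = [[] for _ in range(factor)]
    for i in range(n):
        banks[i % factor].append(i)
    return banks


def block_partition(n, factor):
    """Keep consecutive elements together: index i lands in bank i // block_size."""
    block_size = -(-n // factor)  # ceiling division
    banks = [[] for _ in range(factor)]
    for i in range(n):
        banks[i // block_size].append(i)
    return banks


def complete_partition(n):
    """Give every element its own bank, so all can be accessed at once."""
    return [[i] for i in range(n)]
```

With 8 elements and a factor of 2, the cyclic model yields banks holding the even and odd indices, the block model yields the first and second halves, and the complete model yields 8 single-element banks.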
@@ -37,3 +37,44 @@ Once the environment has been configured, the application can be executed by
 
    ./bind_op_storage <vadd XCLBIN>
 
+DETAILS
+-------
+
+This design shows an easy way to specify a hardware resource and
+its properties using the new bind_op and bind_storage pragmas.
+
+bind_op pragma
+~~~~~~~~~~~~~~
+
+::
+
+   #pragma HLS bind_op variable=<string> op=<string> impl=<string> latency=<unsigned>
+
+The bind_op pragma selects DSP or non-DSP resources for an operation
+and lets the user specify its latency. In this example, the addition
+and multiply operations are implemented using DSP resources.
+
+::
+
+   #pragma HLS BIND_OP variable=v1_buffer op=mul  impl=DSP latency=2
+   #pragma HLS BIND_OP variable=v2_buffer op=mul  impl=DSP latency=2
+   #pragma HLS BIND_OP variable=vout_buffer op=add  impl=DSP 
+
+bind_storage pragma
+~~~~~~~~~~~~~~~~~~~
+
+::
+
+   #pragma HLS bind_storage variable=<string> type=<string> impl=<string> latency=<unsigned> 
+
+The bind_storage pragma gives the flexibility to choose the port
+type (FIFO/RAM_1P/RAM_2P), storage (BRAM/URAM/LUTRAM) and latency. In
+this example, RAM_1P with latency 2 is used for the input buffers.
+
+::
+
+   #pragma HLS BIND_STORAGE variable=v1_buffer type=RAM_1P impl=BRAM latency=2
+   #pragma HLS BIND_STORAGE variable=v2_buffer type=RAM_1P impl=LUTRAM latency=2
+   #pragma HLS BIND_STORAGE variable=vout_buffer type=RAM_1P impl=URAM
+
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
@@ -35,3 +35,24 @@ Once the environment has been configured, the application can be executed by
 
    ./burst_rw <vadd XCLBIN>
 
+DETAILS
+-------
+
+The usual reason for having a burst mode capability, or using burst
+mode, is to increase data throughput. This example demonstrates how
+multiple items can be read from global memory into the kernel’s local
+memory in a single burst. This achieves low memory access latency and
+makes efficient use of the bandwidth provided by the ``m_axi`` interface.
+Similarly, computation results are stored in a buffer and written to
+global memory in a burst. Auto-pipelining applies pipelining to
+these loops.
+
+The for loops must meet the following requirements to implement burst
+read/write:
+
+-  Pipeline the loop: the loop pipeline must have II (initiation
+   interval) = 1, specified by the pipeline pragma inside the loop.
+-  Aligned memory: memory addresses for read/write should be
+   contiguous.
+
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
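The read, compute and write pattern described in this README can be modelled in software as chunked processing over contiguous ranges. This is a rough sketch only: ``vadd_burst``, ``inc`` and ``burst_size`` are hypothetical names, and a Python list slice merely stands in for a burst transfer over contiguous addresses.

```python
def vadd_burst(global_in, inc, burst_size=4):
    """Add `inc` to every element, moving data in chunks of `burst_size`."""
    global_out = [0] * len(global_in)
    for base in range(0, len(global_in), burst_size):
        end = min(base + burst_size, len(global_in))
        local_in = global_in[base:end]            # "burst read" of a contiguous range
        local_out = [v + inc for v in local_in]   # compute on the local buffer
        global_out[base:end] = local_out          # "burst write" of a contiguous range
    return global_out
```

Each pass touches one contiguous range for the read and one for the write, which mirrors the contiguity requirement listed above.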
@@ -34,3 +34,51 @@ Once the environment has been configured, the application can be executed by
 
    ./critical_path -x <apply_watermark_GOOD XCLBIN> -i $(ABS_COMMON_REPO)/common/data/xilinx_img.bmp -c ./data/golden.bmp
 
+DETAILS
+-------
+
+This example demonstrates coding-style considerations for avoiding
+critical paths in kernels.
+
+The ``Apply_watermark`` kernel processes the image’s pixels concurrently
+by using ``HLS UNROLL``. However, a global variable ``x`` is updated
+in every iteration, which nullifies the speedup offered by
+unrolling the loop and leads to a critical path.
+
+.. code:: cpp
+
+   watermark:
+       for (int i = 0; i < DATA_SIZE; i++, x++) {
+           #pragma HLS UNROLL
+           if (x > width) {
+               x = x - width;
+               y += 1;
+           }
+
+           uint w_idy = y % WATERMARK_HEIGHT;
+           uint w_idx = x % WATERMARK_WIDTH;
+           tmp.data[i] = saturatedAdd(tmp.data[i], watermark[w_idy][w_idx]);
+       }
+
+Using local variables that only read the value of ``x`` in each
+iteration, and updating ``x`` once outside the loop, removes this critical
+path and thus improves the performance and timing of kernel execution.
+
+.. code:: cpp
+
+   for (int i = 0; i < DATA_SIZE; i++) {
+       #pragma HLS UNROLL
+       uint tmp_x = x + i;
+       uint tmp_y = y;
+       if (tmp_x > width) {
+           tmp_x = tmp_x - width;
+           tmp_y += 1;
+       }
+
+       uint w_idy = tmp_y % WATERMARK_HEIGHT;
+       uint w_idx = tmp_x % WATERMARK_WIDTH;
+       tmp.data[i] = saturatedAdd(tmp.data[i], watermark[w_idy][w_idx]);
+   }
+   x += DATA_SIZE;
+
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
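A quick way to convince oneself that the two loops in this README pick the same watermark pixels is to model the index computation in software. The sketch below is an illustration under stated assumptions: the constant values are hypothetical, ``x`` is assumed to start at or below ``width``, and ``DATA_SIZE`` is assumed not to exceed ``width`` so at most one wrap occurs per chunk. It compares indices within a single chunk only and does not model how ``x`` and ``y`` are carried between chunks.

```python
DATA_SIZE = 16          # hypothetical values for illustration only
WATERMARK_HEIGHT = 16
WATERMARK_WIDTH = 16


def indices_carried(x, y, width):
    """First version: x and y are updated inside the loop (loop-carried)."""
    out = []
    for _ in range(DATA_SIZE):
        if x > width:
            x -= width
            y += 1
        out.append((y % WATERMARK_HEIGHT, x % WATERMARK_WIDTH))
        x += 1
    return out


def indices_independent(x, y, width):
    """Second version: each iteration derives its indices from x + i alone."""
    out = []
    for i in range(DATA_SIZE):
        tmp_x = x + i
        tmp_y = y
        if tmp_x > width:
            tmp_x -= width
            tmp_y += 1
        out.append((tmp_y % WATERMARK_HEIGHT, tmp_x % WATERMARK_WIDTH))
    return out


def chunk_matches(width, rows=3):
    """Compare both versions for every starting x in [0, width]."""
    return all(
        indices_carried(x, y, width) == indices_independent(x, y, width)
        for x in range(width + 1)
        for y in range(rows)
    )
```

In the second version no iteration depends on a value produced by an earlier one, which is what allows the unrolled iterations to proceed independently.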
@@ -36,3 +36,44 @@ Once the environment has been configured, the application can be executed by
 
    ./custom_datatype -x <rgb_to_hsv XCLBIN> -i $(ABS_COMMON_REPO)/common/data/xilinx_logo.bmp
 
+DETAILS
+-------
+
+Kernel ports can have custom datatypes. It is recommended that the
+size of a custom datatype be a power of 2 and at least 32 bits to allow
+``burst transfer``, thus using the AXI master bandwidth efficiently. Extra
+``padding`` can be added if it is not a multiple of 32 bits, as shown below.
+
+.. code:: cpp
+
+   typedef struct RGBcolor_struct
+   {
+     unsigned char r;
+     unsigned char g;
+     unsigned char b;
+     unsigned char pad;
+   } RGBcolor;
+
+   typedef struct HSVcolor_struct
+   {
+     unsigned char h;
+     unsigned char s;
+     unsigned char v;
+     unsigned char pad;
+   } HSVcolor;
+
+The kernel in this example uses the above structures as datatypes for its
+input and output ports.
+
+::
+
+   void rgb_to_hsv(RGBcolor *in,  // Access global memory as RGBcolor struct-wise
+                   HSVcolor *out, // Access Global Memory as HSVcolor struct-wise
+                   int size) 
+
+Custom datatypes can be used to reduce the number of
+``kernel arguments``, thus reducing the number of interfaces between
+kernels and memory. They can also reduce the time spent setting
+kernel arguments when the number of kernel arguments is large.
+
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
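The effect of the ``pad`` member on the struct size can be checked with a small host-side model. The sketch below uses Python's ``ctypes`` to mirror the ``RGBcolor`` layout from this README; ``RGBcolorUnpadded``, ``size_in_bits`` and ``is_power_of_two`` are hypothetical names introduced for the comparison.

```python
import ctypes


class RGBcolorUnpadded(ctypes.Structure):
    """Hypothetical variant without padding: three 8-bit channels."""
    _fields_ = [("r", ctypes.c_ubyte),
                ("g", ctypes.c_ubyte),
                ("b", ctypes.c_ubyte)]


class RGBcolor(ctypes.Structure):
    """Mirrors the padded struct shown in the README."""
    _fields_ = [("r", ctypes.c_ubyte),
                ("g", ctypes.c_ubyte),
                ("b", ctypes.c_ubyte),
                ("pad", ctypes.c_ubyte)]


def size_in_bits(struct_type):
    """Return the size of a ctypes structure in bits."""
    return 8 * ctypes.sizeof(struct_type)


def is_power_of_two(n):
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0
```

The unpadded variant is 24 bits wide, which is neither a power of 2 nor at least 32 bits; the padded struct is exactly 32 bits.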
@@ -35,3 +35,31 @@ Once the environment has been configured, the application can be executed by
 
    ./dataflow_stream <adder XCLBIN>
 
+DETAILS
+-------
+
+This example explains how ``#pragma HLS dataflow`` can be used to
+implement task-level parallelism using the ``HLS Stream`` datatype.
+
+Data stored in an array is usually consumed or produced sequentially,
+so a more efficient communication mechanism is to use streaming
+data as specified by the ``STREAM`` pragma, where FIFOs are used instead
+of RAMs. The depth of the ``FIFO`` can be specified by the ``depth``
+option in the pragma.
+
+.. code:: cpp
+
+   #pragma HLS STREAM variable = inStream depth = 32
+   #pragma HLS STREAM variable = outStream depth = 32
+
+Vector addition in the kernel is divided into 3 sub-tasks (read,
+compute_add and write), which are then performed concurrently using ``Dataflow``.
+
+.. code:: cpp
+
+   #pragma HLS dataflow
+       read_input(in, inStream, size);
+       compute_add(inStream, outStream, inc, size);
+       write_result(out, outStream, size);
+
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
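The three-stage dataflow pipeline in this README can be approximated in software with bounded queues and threads. This is a loose analogy rather than a description of the generated hardware: ``queue.Queue(maxsize=depth)`` stands in for a FIFO of the given depth, one thread stands in for each sub-task, and ``vadd_dataflow`` is a hypothetical wrapper name.

```python
import queue
import threading


def read_input(data, in_stream):
    """Stage 1: push every input value into the first FIFO."""
    for value in data:
        in_stream.put(value)  # blocks while the FIFO is full


def compute_add(in_stream, out_stream, inc, size):
    """Stage 2: pop a value, add inc, push the result."""
    for _ in range(size):
        out_stream.put(in_stream.get() + inc)


def write_result(out, out_stream, size):
    """Stage 3: pop results into the output buffer."""
    for i in range(size):
        out[i] = out_stream.get()  # blocks while the FIFO is empty


def vadd_dataflow(data, inc, depth=32):
    """Run the three stages concurrently, connected by bounded FIFOs."""
    size = len(data)
    in_stream = queue.Queue(maxsize=depth)
    out_stream = queue.Queue(maxsize=depth)
    out = [0] * size
    stages = [
        threading.Thread(target=read_input, args=(data, in_stream)),
        threading.Thread(target=compute_add, args=(in_stream, out_stream, inc, size)),
        threading.Thread(target=write_result, args=(out, out_stream, size)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return out
```

When the input is longer than ``depth``, the producer stage blocks until the consumer drains the queue, which is the back-pressure behaviour a bounded FIFO provides.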
@@ -35,3 +35,45 @@ Once the environment has been configured, the application can be executed by
 
    ./dataflow_stream_array <N_stage_Adders XCLBIN>
 
+DETAILS
+-------
+
+This example demonstrates the use of an array of ``HLS streams`` in
+kernels.
+
+The kernel performs a number of vector additions. The initial vector is
+read from global memory and written into a stream. Operator ``<<`` is
+overloaded to perform a ``blocking write`` to a stream from a variable.
+
+.. code:: cpp
+
+    mem_rd:
+       for (int i = 0; i < size; i++) {
+          #pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
+           inStream << input[i];
+       }
+
+Multiple additions are performed using the ``adder`` function, which takes
+its input from one stream and provides its output to another stream.
+
+.. code:: cpp
+
+   compute_loop:
+       for (int i = 0; i < STAGES; i++) {
+          #pragma HLS UNROLL
+           adder(streamArray[i], streamArray[i + 1], incr, size);
+       }
+
+Finally, the result is written back from the stream to the global memory buffer.
+
+.. code:: cpp
+
+   static void write_result(int *output, hls::stream<int> &outStream, int size) {
+   mem_wr:
+       for (int i = 0; i < size; i++) {
+          #pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
+           output[i] = outStream.read();
+       }
+   }
+
+For more comprehensive documentation, `click here <http://xilinx.github.io/Vitis_Accel_Examples>`__.
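The chain of adders connected by an array of streams can be modelled sequentially in software. The sketch below is a simplified model, not the kernel: each stream is a ``collections.deque``, the stages run one after another instead of concurrently, the value of ``STAGES`` is hypothetical, and the array is assumed to hold ``STAGES + 1`` streams because stage ``i`` reads stream ``i`` and writes stream ``i + 1``.

```python
from collections import deque

STAGES = 4  # hypothetical number of adder stages


def adder(in_stream, out_stream, incr, size):
    """Read `size` values from one stream, add incr, write to the next."""
    for _ in range(size):
        out_stream.append(in_stream.popleft() + incr)


def n_stage_adders(values, incr, stages=STAGES):
    """Pass the input vector through `stages` adders linked by streams."""
    size = len(values)
    stream_array = [deque() for _ in range(stages + 1)]

    # mem_rd: copy the input vector into the first stream.
    for value in values:
        stream_array[0].append(value)

    # compute_loop: stage i consumes stream i and produces stream i + 1.
    for i in range(stages):
        adder(stream_array[i], stream_array[i + 1], incr, size)

    # mem_wr: drain the last stream into the output buffer.
    return [stream_array[stages].popleft() for _ in range(size)]
```

Each element ends up incremented ``stages`` times, so with 4 stages and an increment of 10 the input value 1 becomes 41.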