Build the Kernel Using the Vitis Tool Flow - 2023.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID
XD099
Release Date
2023-11-13
Version
2023.2 English

Now, build the kernel using the Vitis compiler. The Vitis compiler will call the Vitis HLS tool to synthesize the C++ kernel code into an RTL kernel. You will also review the reports to confirm if the kernel meets the latency/throughput requirements for your performance goals.

  1. Use the following command to build the kernel.

    cd $LAB_WORK_DIR/makefile; make build STEP=single_buffer PF=4 TARGET=hw_emu
    

    This command will call the v++ compiler which then calls the Vitis HLS tool to translate the C++ code into RTL code that can be used to run Hardware Emulation.

    NOTE: For purposes of this tutorial, the number of input words used is only 100 because it will take a longer time to run the Hardware Emulation.

  2. Then, use the following commands to visualize the HLS Synthesis Report in the Vitis analyzer.

    vitis_analyzer ../build/single_buffer/kernel_4/hw_emu/runOnfpga_hw_emu.xclbin.link_summary
    

NOTE: In the 2023.1 release this command opens the Analysis view of the new Vitis Unified IDE and loads the link summary as described in Working with the Analysis View. You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report.

Select the System Estimate report to open it.

missing image

- The `compute_hash_flags` latency reported is 875,011 cycles. This is based on total of 35,000,000 words, computed with 4 words in parallel. This loop has 875,000 iterations and including the `MurmurHash2` latency, the total latency of 875,011 cycles is optimal.
- The `compute_hash_flags_dataflow` function has `dataflow` enabled in the Pipeline column. This function is important to review and indicates that the task-level parallelism is enabled and expected to have overlap across the sub-functions in the `compute_hash_flags_dataflow` function.
- The latency reported for `read_bloom_filter` function is 16,385 for reading the Bloom filter coefficients from the DDR using the `bloom_filter maxi` port. This loop is iterated over 16,000 cycles reading 32-bits data of from the Bloom filter coefficients.

The reports confirm that the latency of the function meets your target. You still need to ensure the functionality is correct when communicating with the host. In the next section, you will walk through the initial host code and run the software and hardware emulation.