Macro Architecture Implementation - 2023.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2023-11-13
Version: 2023.2 English

Navigate to the function runOnfpga in 02-bloom/reference_files/compute_score_fpga_kernel.cpp.

The algorithm has been updated to receive 512-bit words from the DDR. The kernel takes the following arguments:

  • input_words: 512-bit input data.

  • output_flags: 512-bit output data.

  • Additional arguments:

    • bloom_filter: Pointer to the array of Bloom filter coefficients.

    • total_size: Total number of words to be processed.

    • load_filter: Enables or disables loading of the coefficients. The coefficients only need to be loaded once.

  1. The first step of the kernel development methodology is to structure the kernel code into the Load-Compute-Store pattern. This means creating a top-level function, runOnfpga, with:

    • Sub-functions for Load, Compute, and Store added in compute_hash_flags_dataflow.

    • Local arrays or hls::stream variables to pass data between these functions.

  2. The source code has the following INTERFACE pragmas for input_words, output_flags and bloom_filter.

    #pragma HLS INTERFACE m_axi port=output_flags bundle=maxiport0   offset=slave 
    #pragma HLS INTERFACE m_axi port=input_words  bundle=maxiport0   offset=slave 
    #pragma HLS INTERFACE m_axi port=bloom_filter bundle=maxiport1   offset=slave 
    

    where:

    • m_axi: Specifies that the port is an AXI4 master interface.

    • port: Specifies the name of the argument to be mapped to the AXI4 interface.

    • offset=slave: Indicates that the base address of the pointer is made available through the AXI4-Lite slave interface of the kernel.

    • bundle: Specifies the name of the m_axi interface. In this example, the input_words and output_flags arguments are mapped to maxiport0, and the bloom_filter argument is mapped to maxiport1.

    The function runOnfpga loads the Bloom filter coefficients and calls the compute_hash_flags_dataflow function which has the main functionality of the Load, Compute and Store functions.

    Refer to the function compute_hash_flags_dataflow in the 02-bloom/cpu_src/compute_score_fpga_kernel.cpp file. The following block diagram shows how the compute kernel connects to the device DDR memories and how it feeds the compute hash block processing unit.

    (Figure: block diagram of the compute kernel's connection to the device DDR memories — image not available.)

    The kernel's interface to the DDR memories is an AXI interface, kept at its maximum width of 512 bits at both the input and the output. The compute_hash_flags function, however, can have an input width other than 512 bits, controlled by the PARALLELIZATION factor. To bridge these widths at the processing-element boundaries, “Resize” blocks are inserted that adapt between the memory interface width and the processing unit interface width. The blocks named “Buffer” are memory adapters that convert between the AXI interface and streams, while the “Resize” blocks adapt the stream width to the PARALLELIZATION factor chosen for the given configuration.

  3. The input to the compute_hash_flags_dataflow function, input_words, is read from global memory as 512-bit bursts over an AXI interface, creating data_from_gmem, a stream of 512-bit values.

    hls_stream::buffer(data_from_gmem, input_words, total_size/(512/32));
    
  4. A stream of parallel words, word_stream (PARALLELIZATION words per element), is created from data_from_gmem, because compute_hash_flags requires 128-bit elements, that is, 4 words to process in parallel.

    hls_stream::resize(word_stream, data_from_gmem, total_size/(512/32));
    
  5. The function compute_hash_flags_dataflow calls the compute_hash_flags function for computing hash of parallel words.

  6. With PARALLELIZATION=4, the output of compute_hash_flags, flag_stream, consists of 4*8-bit = 32-bit parallel words, which are packed into the 512-bit values of the stream data_to_gmem.

    hls_stream::resize(data_to_gmem, flag_stream, total_size/(512/8));
    
  7. The stream of 512-bit values, data_to_gmem, is written as 512-bit values to the global memory over an AXI interface through output_flags.

    hls_stream::buffer(output_flags, data_to_gmem, total_size/(512/8));
    
  8. The #pragma HLS DATAFLOW is added to enable task-level pipelining. It instructs the Vitis High-Level Synthesis (HLS) compiler to run all of the functions concurrently, creating a pipeline of simultaneously running tasks.

    void compute_hash_flags_dataflow(
          ap_uint<512>*   output_flags,
          ap_uint<512>*   input_words,
          unsigned int    bloom_filter[PARALLELIZATION][bloom_filter_size],
          unsigned int    total_size)
    {
    #pragma HLS DATAFLOW
    
        hls::stream<ap_uint<512> >    data_from_gmem;
        hls::stream<parallel_words_t> word_stream;
        hls::stream<parallel_flags_t> flag_stream;
        hls::stream<ap_uint<512> >    data_to_gmem;
        . . . . 
    }