Micro Architecture Implementation - 2023.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID
XD099
Release Date
2023-11-13
Version
2023.2 English

Now that the top-level function, runOnfpga, is updated with the proper data widths and interface types, identify the loops to optimize to improve latency and throughput.

  1. The runOnfpga function reads the Bloom filter coefficients from the DDR using maxiport1 and saves them into the bloom_filter_local local array. The coefficients only need to be read once.

      if(load_filter==true)
      {
        read_bloom_filter: for(int index=0; index<bloom_filter_size; index++) {
        #pragma HLS PIPELINE II=1
          unsigned int tmp = bloom_filter[index];
          for (int j=0; j<PARALLELISATION; j++) {
            bloom_filter_local[j][index] = tmp;
          }
        }
      }
    
    • #pragma HLS PIPELINE II=1 is added to initiate the burst DDR accesses and read the Bloom filter coefficients every cycle.

    • The expected latency is about 16,000 cycles: bloom_filter_size is fixed at 16,000 and the pipelined loop starts one iteration per cycle. Confirm this after you run HLS Synthesis.

  2. Within the compute_hash_flags function, the for loop is rearchitected as a nested for loop to compute 4 words in parallel.

    void compute_hash_flags (
        hls::stream<parallel_flags_t>& flag_stream,
        hls::stream<parallel_words_t>& word_stream,
        unsigned int                   bloom_filter_local[PARALLELISATION][bloom_filter_size],
        unsigned int                   total_size)
    {
      compute_flags: for(int i=0; i<total_size/PARALLELISATION; i++)
      {
        #pragma HLS LOOP_TRIPCOUNT min=1 max=10000
        parallel_words_t parallel_entries = word_stream.read();
        parallel_flags_t inh_flags = 0;

        for (unsigned int j=0; j<PARALLELISATION; j++)
        {
          #pragma HLS UNROLL
          unsigned int curr_entry = parallel_entries(31+j*32, j*32);
          unsigned int frequency = curr_entry & 0x00ff;
          unsigned int word_id = curr_entry >> 8;
          unsigned hash_pu = MurmurHash2(word_id, 3, 1);
          unsigned hash_lu = MurmurHash2(word_id, 3, 5);
          bool doc_end = (word_id==docTag);
          unsigned hash1 = hash_pu&hash_bloom;
          bool inh1 = (!doc_end) && (bloom_filter_local[j][ hash1 >> 5 ] & ( 1 << (hash1 & 0x1f)));
          unsigned hash2 = (hash_pu+hash_lu)&hash_bloom;
          bool inh2 = (!doc_end) && (bloom_filter_local[j][ hash2 >> 5 ] & ( 1 << (hash2 & 0x1f)));

          inh_flags(7+j*8, j*8) = (inh1 && inh2) ? 1 : 0;
        }

        flag_stream.write(inh_flags);
      }
    }
    
    • Added #pragma HLS UNROLL

      • Unrolls the inner loop to make four copies of the hash logic.

    • Vitis HLS will try to pipeline the outer loop with II=1. With the inner loop unrolled, you can initiate the outer loop every clock cycle and compute 4 words in parallel.

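    • Added #pragma HLS LOOP_TRIPCOUNT min=1 max=10000

      • The loop bound depends on total_size, which is not known at compile time. The trip count range lets Vitis HLS report the latency of the function after HLS Synthesis. It does not change the generated hardware.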
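
The two steps above can be modeled in plain C++ to check the indexing before you run HLS Synthesis. The following is a minimal sketch, not the tutorial's kernel code. It assumes a tiny filter (kFilterWords), uses a hypothetical toy_hash in place of MurmurHash2, replaces the hls::stream and ap_uint types with std::array, and omits the docTag check. All names ending in the k prefix convention, as well as toy_hash, add_word, load_filter, and flags_for_group, are illustrative and do not appear in the tutorial sources.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed values for illustration only; the tutorial defines its own constants.
constexpr unsigned kLanes = 4;                          // stands in for PARALLELISATION
constexpr unsigned kFilterWords = 16;                   // stands in for bloom_filter_size
constexpr uint32_t kHashMask = kFilterWords * 32 - 1;   // stands in for hash_bloom

using FilterCopies = std::array<std::array<uint32_t, kFilterWords>, kLanes>;

// Hypothetical stand-in for MurmurHash2. This is NOT the real hash function.
inline uint32_t toy_hash(uint32_t key, uint32_t seed) {
    uint32_t h = key * 2654435761u + seed * 40503u;
    return h ^ (h >> 15);
}

// Helper for this example only: mark one word as present in the source filter.
inline void add_word(std::vector<uint32_t>& filter, uint32_t word_id) {
    const uint32_t h_pu = toy_hash(word_id, 1);
    const uint32_t h_lu = toy_hash(word_id, 5);
    const uint32_t h1 = h_pu & kHashMask;
    const uint32_t h2 = (h_pu + h_lu) & kHashMask;
    filter[h1 >> 5] |= 1u << (h1 & 0x1f);
    filter[h2 >> 5] |= 1u << (h2 & 0x1f);
}

// Step 1 model: read each coefficient once and copy it to every lane.
// Returns the outer-loop iteration count, which approximates the cycle
// count when the loop is pipelined with II=1.
inline unsigned load_filter(const std::vector<uint32_t>& src, FilterCopies& local) {
    assert(src.size() == kFilterWords);
    unsigned iterations = 0;
    for (unsigned index = 0; index < kFilterWords; ++index) {
        const uint32_t tmp = src[index];          // single read from "DDR"
        for (unsigned j = 0; j < kLanes; ++j) {
            local[j][index] = tmp;                // one private copy per lane
        }
        ++iterations;
    }
    return iterations;
}

// Bit test used in the kernel: hash >> 5 selects the 32-bit word,
// hash & 0x1f selects the bit inside that word.
inline bool bit_is_set(const std::array<uint32_t, kFilterWords>& filter, uint32_t hash) {
    return (filter[hash >> 5] & (1u << (hash & 0x1f))) != 0;
}

// Step 2 model: one iteration of the outer loop. Each lane j handles one
// 32-bit entry and looks only at its own filter copy, which is what lets
// the unrolled lanes run side by side in hardware.
inline std::array<uint8_t, kLanes> flags_for_group(
        const std::array<uint32_t, kLanes>& entries, const FilterCopies& local) {
    std::array<uint8_t, kLanes> flags{};
    for (unsigned j = 0; j < kLanes; ++j) {
        const uint32_t word_id = entries[j] >> 8; // low 8 bits hold the frequency
        const uint32_t h_pu = toy_hash(word_id, 1);
        const uint32_t h_lu = toy_hash(word_id, 5);
        const bool inh1 = bit_is_set(local[j], h_pu & kHashMask);
        const bool inh2 = bit_is_set(local[j], (h_pu + h_lu) & kHashMask);
        flags[j] = (inh1 && inh2) ? 1 : 0;
    }
    return flags;
}
```

To try it, fill a std::vector<uint32_t> of kFilterWords zeros, call add_word for a few word IDs, call load_filter, and pass groups of four entries of the form (word_id << 8) | frequency to flags_for_group. A word that was added always produces a flag of 1 regardless of its frequency byte; a word that was not added usually produces 0, although a Bloom filter can return false positives.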