3-in-1 Kernel - 2023.2 English

Vitis Libraries

Release Date
2023-12-20
Version
2023.2 English

The 3-in-1 GQE kernel combines several previously released GQE kernels and contains a large number of post-bitstream programmable primitives. It can execute hash-join, hash-bloomfilter, and hash-partition individually, and it can also run hash-based bloom filter build + partition or bloom filter probe + partition as a combined flow to minimize intermediate data transfer. To fit the 3-in-1 kernel into the resource-limited Alveo U50 with the best benefit-cost ratio, every processing unit (PU) and the output data paths were refactored, and the bypass design was retired. As a result, the simplified Q5 query runs with a 4x performance improvement over the previous separate GQE kernels, at no additional cost on the device.

3-in-1 GQE Kernel

The internal structure of the 3-in-1 GQE kernel is illustrated in the figure above. Beyond the challenging parts already solved in the previous separate GQE kernels, the key addition in the 3-in-1 GQE kernel is a hardware structure that reuses the 2 AXI-Master ports and the 3 large internal URAM buffers in each PU. Fitting gqeJoin, gqeBloomfilter, and gqePart together on a single U50 leaves no room for wasted resources on the device.

Reusable hardware structure

This reusable hardware structure is needed because the resources of the U50 are limited and because each flow uses URAM and HBM in a different order. The per-flow requirements, shown on the left side of the vertical red line in the figure above, are:

JOIN build/probe flow: needs to save the total count for each unique key in the URAM first, and then save the key/payload pairs in the HBM.

Bloom filter build/probe flow: needs HBM to access its corresponding hash-table.

PART: needs URAM to buffer the post-partitioned key/payload pairs to get a reasonable throughput when flushing the partitioned bucket out.
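The three requirements above can be written down as a small lookup, which makes it easier to see why the same URAM buffers and AXI-Master ports have to serve different purposes per flow. This is only an illustrative sketch: `Flow`, `MemNeeds`, and `memNeeds` are hypothetical names, not GQE identifiers, and the table records only the needs stated in the text.

```cpp
// Hypothetical names for illustration; not part of the GQE kernel or its API.
enum class Flow { JoinBuild, JoinProbe, BloomBuild, BloomProbe, Part };

struct MemNeeds {
    bool needsUram;        // the text names a URAM requirement for this flow
    bool needsHbm;         // the text names an HBM requirement for this flow
    const char* uramRole;  // what the URAM holds, "" if not stated
    const char* hbmRole;   // what the HBM holds, "" if not stated
};

// Records only the requirements listed in the text above.
constexpr MemNeeds memNeeds(Flow f) {
    switch (f) {
        case Flow::JoinBuild:
        case Flow::JoinProbe:
            return {true, true, "count per unique key", "key/payload pairs"};
        case Flow::BloomBuild:
        case Flow::BloomProbe:
            return {false, true, "", "bloom-filter hash table"};
        case Flow::Part:
            return {true, false, "partitioned key/payload buffer", ""};
    }
    return {false, false, "", ""};
}
```

For example, `memNeeds(Flow::Part).needsUram` is true while `memNeeds(Flow::BloomProbe).needsUram` is false, which is the asymmetry that the combined bloom filter probe + PART flow has to reconcile.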

While performing the bloom filter probe + PART flow, it is not known in advance when enough key/payload pairs have accumulated in a specific bucket for it to be flushed out. To avoid implementing a duplicate URAM buffer after the partition module, the bloom filter operation is placed before the partition, so that the existing URAM in the partition module can be fully used to collect the partitioned rows. This is why the reusable hardware structure illustrated on the right of the vertical red line in the figure above is needed: the data follows different paths (marked in red) under different configurations.
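The ordering described above can be modeled in software. The following is a minimal sketch, not the kernel's implementation: `BloomFilter`, `Partitioner`, `Burst`, and `probeThenPartition` are hypothetical names, the hash and the `key % numBuckets` bucket choice are placeholders, and a `std::vector` per bucket stands in for the URAM buffer. The point it illustrates is that filtering first lets a single per-bucket buffer decide when a burst is large enough to flush.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Software model only; every name below is hypothetical, not a GQE identifier.
using Row = std::pair<std::uint64_t, std::uint64_t>;  // (key, payload)

// Minimal two-probe bloom filter: false positives possible, false negatives not.
class BloomFilter {
public:
    explicit BloomFilter(std::size_t nbits) : bits_(nbits, false) { assert(nbits > 0); }
    void insert(std::uint64_t key) {
        bits_[slot(key, kSeedA)] = true;
        bits_[slot(key, kSeedB)] = true;
    }
    bool mayContain(std::uint64_t key) const {
        return bits_[slot(key, kSeedA)] && bits_[slot(key, kSeedB)];
    }
private:
    enum : std::uint64_t { kSeedA = 0x9E3779B97F4A7C15ULL, kSeedB = 0xC2B2AE3D27D4EB4FULL };
    std::size_t slot(std::uint64_t key, std::uint64_t seed) const {
        return std::hash<std::uint64_t>{}(key ^ seed) % bits_.size();
    }
    std::vector<bool> bits_;
};

struct Burst {
    std::size_t bucket;     // partition the rows belong to
    std::vector<Row> rows;  // rows flushed together
};

// One buffer per bucket (standing in for URAM); a bucket is flushed once it
// holds `threshold` rows, and any leftovers are flushed by finish().
class Partitioner {
public:
    Partitioner(std::size_t numBuckets, std::size_t threshold)
        : buffers_(numBuckets), threshold_(threshold) {
        assert(numBuckets > 0 && threshold > 0);
    }
    void push(const Row& r, std::vector<Burst>& out) {
        const std::size_t b = r.first % buffers_.size();  // placeholder bucket choice
        buffers_[b].push_back(r);
        if (buffers_[b].size() >= threshold_) flush(b, out);
    }
    void finish(std::vector<Burst>& out) {
        for (std::size_t b = 0; b < buffers_.size(); ++b)
            if (!buffers_[b].empty()) flush(b, out);
    }
private:
    void flush(std::size_t b, std::vector<Burst>& out) {
        out.push_back({b, std::move(buffers_[b])});
        buffers_[b].clear();  // leave the moved-from buffer in a known empty state
    }
    std::vector<std::vector<Row>> buffers_;
    std::size_t threshold_;
};

// Bloom filter probe first, partition second: only surviving rows are buffered,
// so no second buffer is needed after the partition stage.
std::vector<Burst> probeThenPartition(const std::vector<Row>& rows,
                                      const BloomFilter& bf, Partitioner& part) {
    std::vector<Burst> out;
    for (const Row& r : rows)
        if (bf.mayContain(r.first)) part.push(r, out);
    part.finish(out);
    return out;
}
```

With the opposite order (partition, then probe), the number of rows leaving each bucket after filtering would be unknown, so a second buffer would be required to rebuild full-size bursts; that is the duplication the text says the hardware avoids.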

The hardware resource utilization of the 3-in-1 kernel (post-placement) is shown in the table below. The Total row includes not only the listed sub-modules but also the interconnect streams that serve the dataflow design of the 3-in-1 GQE kernel.

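| Module | LUT | LUT as memory | Register | BRAM36 | URAM | DSP |
|---|---|---|---|---|---|---|
| AXI adapters | 34306 | 9256 | 75782 | 159 | 0 | 0 |
| Scan | 9581 | 900 | 9581 | 1 | 0 | 0 |
| Filter | 20480 | 3936 | 11850 | 0 | 0 | 0 |
| crossbar 4-to-8 | 9044 | 15110 | 4696 | 0 | 0 | 0 |
| PU x 8 | 285419 | 47728 | 257620 | 300 | 264 | 72 |
| adapt sizer x 8 | 9597 | 0 | 8476 | 0 | 0 | 0 |
| collect 8-to-1 | 7147 | 0 | 8543 | 0 | 0 | 0 |
| Write | 19205 | 1612 | 22488 | 1 | 0 | 3 |
| Total | 424564 | 81037 | 434954 | 569 | 264 | 75 |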
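Because the Total row also covers the interconnect streams, the difference between Total and the sum of the listed modules gives a rough estimate of what those streams cost. The sketch below does that arithmetic using the figures from the table above; the struct and function names are hypothetical, and attributing the whole remainder to the interconnect streams is an assumption based on the sentence preceding the table.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper types; the numbers are copied from the utilization table.
struct Util { long lut, lutMem, reg, bram36, uram, dsp; };

constexpr std::array<Util, 8> kModules = {{
    {34306, 9256, 75782, 159, 0, 0},      // AXI adapters
    {9581, 900, 9581, 1, 0, 0},           // Scan
    {20480, 3936, 11850, 0, 0, 0},        // Filter
    {9044, 15110, 4696, 0, 0, 0},         // crossbar 4-to-8
    {285419, 47728, 257620, 300, 264, 72},// PU x 8
    {9597, 0, 8476, 0, 0, 0},             // adapt sizer x 8
    {7147, 0, 8543, 0, 0, 0},             // collect 8-to-1
    {19205, 1612, 22488, 1, 0, 3},        // Write
}};

constexpr Util kTotal = {424564, 81037, 434954, 569, 264, 75};

// Sum of the eight listed sub-modules, column by column.
constexpr Util sumModules() {
    Util s{0, 0, 0, 0, 0, 0};
    for (std::size_t i = 0; i < kModules.size(); ++i) {
        s.lut += kModules[i].lut;
        s.lutMem += kModules[i].lutMem;
        s.reg += kModules[i].reg;
        s.bram36 += kModules[i].bram36;
        s.uram += kModules[i].uram;
        s.dsp += kModules[i].dsp;
    }
    return s;
}

// Total minus listed modules: resources not attributed to any listed row.
constexpr Util remainder() {
    return {kTotal.lut - sumModules().lut,       kTotal.lutMem - sumModules().lutMem,
            kTotal.reg - sumModules().reg,       kTotal.bram36 - sumModules().bram36,
            kTotal.uram - sumModules().uram,     kTotal.dsp - sumModules().dsp};
}
```

With the table's figures, the remainder is 29785 LUTs, 2495 LUTs as memory, 35918 registers, and 108 BRAM36 blocks, while URAM and DSP are fully accounted for by the listed modules (all 264 URAMs sit in the PUs).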

Caution

In the current release, all columns are expected to have the same number of rows and the same data type.
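A host application could guard against violating this restriction before handing data to the kernel. The check below is a minimal sketch under assumed types: `ElemType`, `ColumnDesc`, and `columnsAreUniform` are hypothetical and do not correspond to any GQE class; a real host program would read the row count and type from whatever table abstraction it already uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical column description; not a GQE type.
enum class ElemType { Int32, Int64 };

struct ColumnDesc {
    std::string name;
    ElemType type;
    std::size_t rows;
};

// Returns true when every column matches the first one in both element type
// and row count. An empty list is treated as trivially uniform.
bool columnsAreUniform(const std::vector<ColumnDesc>& cols) {
    for (const ColumnDesc& c : cols) {
        if (c.type != cols.front().type || c.rows != cols.front().rows) return false;
    }
    return true;
}
```

Calling `columnsAreUniform` on the input columns and rejecting the request when it returns false keeps a mismatched table from reaching the device.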