Data Frame format (on DDR) - 2023.2 English

Vitis Libraries

Release Date
2023-12-20
Version
2023.2 English

Data in the Apache Arrow format is illustrated on the left side of the figure below. The whole data set is separated into multiple record batches, and each batch consists of multiple columns of the same length.

data frame layout

Note that the length of each record batch is statistical information, which is not known while the record batch data is being read or written. In addition, different data types have different data widths. This is especially true for strings, since the length of each string is variable.

Thus, the Apache Arrow columnar data format cannot be implemented directly on hardware. A straightforward implementation of Arrow data would pre-define one fixed-size DDR buffer for each field ID. However, since the number of fields and the data type of each field are unknown, this wastes a large amount of DDR space. To fully utilize the DDR memory on the FPGA, the “data-frame” format is defined and employed, as shown on the right side of the figure above.

The DDR is split into multiple memory blocks. Each block is 4 MB in size with a 64-bit width. The address and linking information of each memory block is recorded in the meta section of the DDR header. In other words, for each column (field), the data is stored in a chain of linked 4 MB memory blocks (4M -> 4M -> 4M). Information such as length, size, and count is also saved in the DDR header.
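The linked-block idea can be sketched in host-side C++ as follows. This is only an illustration under stated assumptions, not the library's implementation: the names `BlockPool`, `ColumnChain`, and `appendWords` are hypothetical, "4 MB" is assumed to mean 4 x 1024 x 1024 bytes, and block IDs stand in for real DDR addresses.

```cpp
#include <cstdint>
#include <vector>

// Assumption: "4 MB" means 4 * 1024 * 1024 bytes; each word is 64 bits.
constexpr std::uint64_t kBlockBytes = 4ull * 1024 * 1024;
constexpr std::uint64_t kWordBytes = 8;
constexpr std::uint64_t kWordsPerBlock = kBlockBytes / kWordBytes;

// Hypothetical allocator that hands out block IDs in increasing order.
class BlockPool {
public:
    std::uint32_t allocate() { return next_++; }
private:
    std::uint32_t next_ = 0;
};

// Hypothetical per-column bookkeeping: the chain of block IDs in link order
// and the number of 64-bit words used in the tail (last) block.
struct ColumnChain {
    std::vector<std::uint32_t> blocks;
    std::uint64_t tailWords = 0;
};

// Append `count` 64-bit words to a column, linking a new block on demand.
void appendWords(ColumnChain& col, BlockPool& pool, std::uint64_t count) {
    while (count > 0) {
        if (col.blocks.empty() || col.tailWords == kWordsPerBlock) {
            col.blocks.push_back(pool.allocate());  // link a fresh block
            col.tailWords = 0;
        }
        const std::uint64_t room = kWordsPerBlock - col.tailWords;
        const std::uint64_t n = count < room ? count : room;
        col.tailWords += n;
        count -= n;
    }
}
```

Because blocks are claimed only when a column actually needs them, columns of unknown length and type can share the same DDR space without reserving a worst-case buffer per field.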

Three data types are stored differently from the Apache Arrow columnar format, namely Null, Boolean, and String. Because Null and Boolean values require only 1 bit each, bitmap[4096][16] and boolbuff[4096][16] (each element is 64 bits) are used to store them, respectively. The figure below illustrates the bitmap layout. Each 64-bit element covers 64 input values, so the maximum supported number of input values per field is 64 x 4096, and the maximum supported number of fields is 16. The same storage layout is used for boolbuff.

data layout1
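The bit packing described above can be sketched as follows. The dimensions 4096 and 16 come from the text; the name `FlagBuffer`, the flattened [word][field] indexing, and the LSB-first bit order inside each 64-bit word are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWordsPerField = 4096;  // from the text
constexpr std::size_t kMaxFields = 16;        // from the text
constexpr std::size_t kBitsPerWord = 64;
constexpr std::size_t kMaxRows = kBitsPerWord * kWordsPerField;  // 64 x 4096

// Hypothetical host-side model of bitmap[4096][16] (or boolbuff[4096][16]).
// The caller must keep field < kMaxFields and row < kMaxRows.
struct FlagBuffer {
    std::vector<std::uint64_t> words =
        std::vector<std::uint64_t>(kWordsPerField * kMaxFields, 0);

    void set(std::size_t field, std::size_t row, bool value) {
        std::uint64_t& w = words[(row / kBitsPerWord) * kMaxFields + field];
        // Assumption: row 0 maps to the least significant bit of word 0.
        const std::uint64_t mask = std::uint64_t{1} << (row % kBitsPerWord);
        if (value) {
            w |= mask;
        } else {
            w &= ~mask;
        }
    }

    bool get(std::size_t field, std::size_t row) const {
        const std::uint64_t w = words[(row / kBitsPerWord) * kMaxFields + field];
        return ((w >> (row % kBitsPerWord)) & 1u) != 0;
    }
};
```

For example, under these assumptions row 65 of field 3 lands in word 1, bit 1 of that field.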

For String data, an example with four lines of input is provided. The input data is shown on the left, and the compact Arrow format storage is shown in the middle. Note that no bubbles exist in the Arrow data buffer. The data-frame string layout is shown on the right. Each input string consists of one or more lines of 64-bit data, and each character takes 8 bits. If the length of a string is not a multiple of 64 bits, bubbles are inserted into its last 64-bit line. Bubbles are introduced into the data-frame storage to ensure that each string starts at a new DDR address, which makes string access faster and avoids timing issues. Similar to the Arrow format, the offset buffer always points to the starting address of each input string.

string layout
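A minimal sketch of this padding scheme is given below. It is not the library's code: `PackedStrings` and `packStrings` are hypothetical names, offsets are assumed to be counted in 64-bit words, bubbles are assumed to be zero bytes, an Arrow-style final offset is appended, and an empty string is assumed to occupy no words.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical result type: a 64-bit data buffer plus an offset buffer.
struct PackedStrings {
    std::vector<std::uint64_t> data;     // each string starts on a new word
    std::vector<std::uint32_t> offsets;  // start of each string, in words
};

PackedStrings packStrings(const std::vector<std::string>& input) {
    PackedStrings out;
    for (const std::string& s : input) {
        out.offsets.push_back(static_cast<std::uint32_t>(out.data.size()));
        const std::size_t nWords = (s.size() + 7) / 8;  // round up to 64 bits
        for (std::size_t w = 0; w < nWords; ++w) {
            std::uint64_t word = 0;  // unused bytes stay zero: the "bubbles"
            const std::size_t n = std::min<std::size_t>(8, s.size() - w * 8);
            // Bytes are copied in memory order; the numeric value of the
            // word therefore depends on host endianness.
            std::memcpy(&word, s.data() + w * 8, n);
            out.data.push_back(word);
        }
    }
    // Arrow-style final offset marking the end of the last string.
    out.offsets.push_back(static_cast<std::uint32_t>(out.data.size()));
    return out;
}
```

With the inputs "abc", "12345678", and "123456789", this sketch produces the offsets 0, 1, 2, 4: the first string carries five bubble bytes, the second none, and the third spills into a second word with seven bubble bytes.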

For the normal 4 MB memory blocks, f_buff saves the starting and ending node address of each memory block. The size of the tail memory block is also recorded. The detailed information of each node is provided in the LinkTable buffer.

Besides the data itself, information such as the input data length and size is also counted and added to the corresponding buffer when the input stream ends.

data layout2