QDMA Performance Optimization

Versal Adaptive SoC CPM DMA and Bridge Mode for PCI Express Product Guide (PG347)


AMD provides multiple example designs for you to experiment with. All example designs can be downloaded from GitHub. The performance example design can be selected from the CED (Customizable Example Design) catalog.

The following QDMA register settings are recommended by AMD for better performance. Actual performance numbers vary depending on the system and OS used.

Table 1. QDMA Performance Registers
0xB08 PFCH_CFG (register value 0x100_0100)
  • evt_pfch_fl_th[15:0] = 256
  • pfch_fl_th[15:0] = 256
0xA80 PFCH_CFG_1 (register value 0x78_007C)
  • evt_qcnt_th[15:0] = 120
  • pfch_qcnt[15:0] = 124
0xA84 PFCH_CFG_2 (register value 0x8040_03C8)
  • fence = 1
  • rsvd[1:0] = 0
  • var_desc_no_drop = 0
  • pfch_ll_sz_th[15:0] = 1024
  • var_desc_num_pfch[5:0] = 15
  • num_pfch[5:0] = 8
0x1400 CRDT_COAL_CFG_1 (register value 0x4010)
  • rsvd[12:0] = 0
  • dis_fence_fix = 0
  • pld_fifo_th[7:0] = 16
  • crdt_timer_th[9:0] = 16
0x1404 CRDT_COAL_CFG_2 (register value 0x78_0060)
  • rsv2[7:0] = 0
  • crdt_fifo_th[7:0] = 120
  • rsv1[4:0] = 0
  • crdt_cnt_th[10:0] = 96
0xE24 H2C_REQ_THROT_PCIE (register value 0x8E04_E000)
  • req_throt_en_req = 1
  • req_throt = 448
  • req_throt_en_data = 1
  • data_thresh = 57344
0xE2C H2C_REQ_THROT_AXIMM (register value 0x8E05_0000)
  • req_throt_en_req = 0
  • req_throt = 448
  • req_throt_en_data = 0
  • data_thresh = 65536
0x250 QDMA_GLBL_DSC_CFG (register value 0x00_0015)
  • c2h_uodsc_limit = 0
  • h2c_uodsc_limit = 0
  • reserved = 0
  • Max_dsc_fetch = 5
  • wb_acc_int = 1
0x4C CONFIG_BLOCK_MISC_CONTROL (register value 0x81_001F)
  • 10b_tag_en = 1
  • reserved = 0
  • axi_wbk = 0
  • axi_dsc = 0
  • num_tags = 512
  • reserved = 0
  • rq_metering_multiplier = 31
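
As an illustration, the register values from Table 1 could be applied from software in one pass. The sketch below assumes the QDMA configuration registers are visible at these offsets through an mmap'd PCIe BAR; the qdma_reg_wr() helper is hypothetical, not part of any AMD driver API:

    #include <stdint.h>

    /* Hypothetical helper: 'bar' is assumed to point at the mmap'd
       QDMA configuration BAR. Offsets and values are from Table 1. */
    static inline void qdma_reg_wr(volatile uint8_t *bar,
                                   uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(bar + off) = val;
    }

    static void qdma_apply_perf_tuning(volatile uint8_t *bar)
    {
        qdma_reg_wr(bar, 0x0B08, 0x01000100); /* PFCH_CFG                  */
        qdma_reg_wr(bar, 0x0A80, 0x0078007C); /* PFCH_CFG_1                */
        qdma_reg_wr(bar, 0x0A84, 0x804003C8); /* PFCH_CFG_2                */
        qdma_reg_wr(bar, 0x1400, 0x00004010); /* CRDT_COAL_CFG_1           */
        qdma_reg_wr(bar, 0x1404, 0x00780060); /* CRDT_COAL_CFG_2           */
        qdma_reg_wr(bar, 0x0E24, 0x8E04E000); /* H2C_REQ_THROT_PCIE        */
        qdma_reg_wr(bar, 0x0E2C, 0x8E050000); /* H2C_REQ_THROT_AXIMM       */
        qdma_reg_wr(bar, 0x0250, 0x00000015); /* QDMA_GLBL_DSC_CFG         */
        qdma_reg_wr(bar, 0x004C, 0x0081001F); /* CONFIG_BLOCK_MISC_CONTROL */
    }

In addition to these register values, the following driver and queue-context settings are recommended: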
  • QDMA_C2H_INT_TIMER_TICK (0xB0C) set to 50, corresponding to 100 ns (1 tick = 2 ns for the 500 MHz user clock)
  • C2H trigger mode set to User + Timer, with the counter set to 64 and the timer chosen to match the host round-trip latency. The global timer register should hold a value of 30 for 3 μs
  • TX/RX API burst size = 64, ring depth = 2048. The driver should update the TX/RX PIDX in batches of 64 (see the PIDX batching sketch after this list)
  • PCIe MPS = 256 bytes, MRRS = 4096 bytes, 10-bit tags enabled, relaxed ordering enabled
  • QDMA_C2H_WRB_COAL_CFG (0xB50) bits [31:26] set to 63. This is the maximum buffer size for writeback (WRB) coalescing
  • The driver should update the completion CIDX in batches of 64, before updating the C2H PIDX, to reduce the number of MMIO writes
  • The driver should update the H2C PIDX in batches of 64, and also on the last descriptor of the scatter-gather list
  • C2H context:
    • bypass = 0 (Internal mode)
    • frcd_en = 1
    • qen = 1
    • wbk_en = 1
    • irq_en = irq_arm = int_aggr = 0
  • C2H prefetch context:
    • pfch = 1
    • bypass = 0
    • valid = 1
  • C2H CMPT context:
    • en_stat_desc = 1
    • en_int = 0 (Poll_mode)
    • int_aggr = 0 (Poll mode)
    • trig_mode = 5
    • counter_idx = index of the global counter register programmed to 64
    • timer_idx = index of the global timer register programmed to 3 μs
    • valid = 1
  • H2C context:
    • bypass = 0 (Internal mode)
    • frcd_en = 0
    • fetch_max = 0
    • qen = 1
    • wbk_en = 1
    • wbi_chk = 1
    • wbi_intvl_en = 1
    • irq_en = 0 (Poll mode)
    • irq_arm = 0 (Poll mode)
    • int_aggr = 0 (Poll mode)
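
The PIDX batching recommendation above can be sketched as follows. This is a minimal illustration assuming a software ring of the recommended depth 2048 and a per-queue doorbell register; the queue structure and names are illustrative, not the QDMA driver's API:

    #include <stdint.h>
    #include <stdbool.h>

    #define RING_DEPTH 2048u   /* recommended ring depth   */
    #define BATCH        64u   /* recommended update batch */

    struct queue {
        volatile uint32_t *pidx_reg; /* mmap'd PIDX doorbell for this queue */
        uint32_t sw_pidx;            /* software producer index             */
        uint32_t pending;            /* descriptors posted since last MMIO  */
    };

    /* Post one descriptor; write the doorbell only once per BATCH
       descriptors, or when the caller flags the last descriptor of a
       scatter-gather list, to cut down on MMIO writes. */
    static void post_descriptor(struct queue *q, bool last_in_sgl)
    {
        q->sw_pidx = (q->sw_pidx + 1) % RING_DEPTH;
        if (++q->pending >= BATCH || last_in_sgl) {
            *q->pidx_reg = q->sw_pidx; /* single MMIO write for the batch */
            q->pending = 0;
        }
    }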

For optimal QDMA streaming performance, packet buffers of the descriptor ring should be aligned to at least 256 bytes.
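
As an illustration, one way to obtain this alignment from user space is posix_memalign(); this is a generic sketch with an arbitrary 4 KiB buffer size, not code from the QDMA driver:

    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        void *buf = NULL;
        /* 256-byte alignment for descriptor-ring packet buffers */
        int rc = posix_memalign(&buf, 256, 4096);
        if (rc != 0) {
            fprintf(stderr, "posix_memalign failed: %d\n", rc);
            return 1;
        }
        /* ... hand 'buf' to the descriptor ring ... */
        free(buf);
        return 0;
    }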

Performance in Descriptor Bypass Mode

QDMA supports both internal mode and descriptor bypass mode. Select the mode based on the number of active queues the design needs: with fewer than 64 active queues, internal mode works well; with more than 64 queues, descriptor bypass mode is the better choice.

In descriptor bypass mode, you are responsible for maintaining the descriptors for each queue and for controlling the priority with which descriptors are sent back to the IP.

When the design is configured in descriptor bypass mode, all of the above settings still apply. The following recommendations further improve performance in bypass mode.

  • When the dma<0/1>_h2c_byp_in_st_sdi port is set for a descriptor, the QDMA IP generates a status writeback for that packet. AMD recommends asserting this port only once every 32 or 64 packets, and, if no more descriptors are left, asserting it on the last descriptor. This requirement applies on a per-queue basis, to AXI4 (H2C and C2H) bypass transfers and to AXI4-Stream H2C transfers; see the sketch after this list.
  • For AXI4-Stream C2H simple bypass mode, set the dma<0/1>_dsc_crdt_in_fence port to 1 for performance reasons. This recommendation assumes the user design has already coalesced credits for each queue and sent them to the IP. In internal mode, set the fence bit in the QDMA_C2H_PFCH_CFG_2 (0xA84) register instead.
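
The sdi policy in the first bullet can be sketched as below. Descriptor feeding in bypass mode is normally RTL in the user design; this C-style pseudologic only illustrates when to raise the sdi flag for a given queue, and all names are hypothetical:

    #include <stdint.h>
    #include <stdbool.h>

    #define SDI_INTERVAL 32u  /* assert sdi once every 32 (or 64) packets */

    struct byp_queue {
        uint32_t pkts_since_sdi; /* packets since the last sdi assertion */
    };

    /* Decide whether dma<0/1>_h2c_byp_in_st_sdi should be asserted for
       the descriptor currently driven into the bypass-in interface. */
    static bool want_sdi(struct byp_queue *q, bool last_descriptor)
    {
        if (++q->pkts_since_sdi >= SDI_INTERVAL || last_descriptor) {
            q->pkts_since_sdi = 0;
            return true; /* drive sdi with this descriptor */
        }
        return false;
    }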

Performance Optimization Based on Available Cache/Buffer Size

Table 2. CPM5 QDMA
  • C2H Descriptor Cache Depth = 2048. Total number of outstanding C2H stream descriptor fetches, for both cache bypass and internal modes. This depth is not relevant in simple bypass mode, where the user design can maintain a deeper descriptor store.
  • Prefetch Cache Depth = 128. Number of C2H prefetch tags available. With more than 128 active queues and packets smaller than 512 B, performance may degrade depending on the data pattern. If you see degradation, consider simple bypass mode, where the user design maintains the entire descriptor flow.
  • C2H Payload FIFO Depth = 1024, in units of 64 B. Amount of C2H data the C2H engine can buffer; this can sustain a host read latency of up to 2 μs (1024 × 2 ns). Latency beyond 2 μs can degrade performance.
  • MM Reorder Buffer Depth = 512, in units of 64 B. Amount of MM read data that can be stored to absorb host read latency.
  • Desc Eng Reorder Buffer Depth = 512, in units of 64 B. Amount of descriptor fetch data that can be stored to absorb host read latency.
  • H2C-ST Reorder Buffer Depth = 1024, in units of 64 B. Amount of H2C-ST data that can be stored to absorb host read latency.
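
To see how these depths translate into tolerable host read latency, a small worked example follows. It assumes one 64 B entry drains per 2 ns user-clock cycle, matching the table's 2 μs figure for the payload FIFO; this is an illustration, not a guaranteed drain rate:

    #include <stdio.h>

    int main(void)
    {
        const double clk_period_ns = 2.0; /* 500 MHz user clock assumed */
        const unsigned depths[] = { 1024, 512, 512, 1024 };
        const char *names[] = {
            "C2H Payload FIFO", "MM Reorder Buffer",
            "Desc Eng Reorder Buffer", "H2C-ST Reorder Buffer"
        };

        /* Each entry holds one 64 B beat; at one beat per cycle, a
           buffer of depth N rides out N * clk_period of read latency. */
        for (unsigned i = 0; i < 4; i++)
            printf("%-24s depth %4u -> sustains ~%.2f us latency\n",
                   names[i], depths[i],
                   depths[i] * clk_period_ns / 1000.0);
        return 0;
    }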