Performance and Resource Utilization

Performance

QDMA performance data and a detailed analysis are available in AR 71453.

AMD provides two example designs for you to experiment with. The standard example design is for functional testing only. To generate an example design for performance analysis, use the following Tcl command:

set_property CONFIG.performance_exdes {true} [get_ips qdma_0]

The following QDMA register settings are recommended by AMD for better performance. Performance numbers can vary depending on the system and OS used.

Table 1. QDMA Performance Registers
Address	Name	Fields	Field Value	Register Value
0xB08	PFCH_CFG	evt_pfch_fl_th[15:0] pfch_fl_th[15:0]	256 256	0x100_0100
0xA80	PFCH_CFG_1	evt_qcnt_th[15:0] pfch_qcnt[15:0]	60 60	0x3c_003c
0xA84	PFCH_CFG_2	fence rsvd[1:0] var_desc_no_drop pfch_ll_sz_th[15:0] var_desc_num_pfch[5:0] num_pfch[5:0]	1 0 0 1024 15 8	0x8040_03C8
0x147C	PFCH_CFG_3	rsvd[4:0] var_desc_fl_free_cnt_th[8:0] var_desc_lg_pkt_cam_cn_th[6:0]	0 256 0	0x8000
0x1484	PFCH_CFG_4	glb_evt_timer_tick[14:0] disable_glb_evt_timer evt_timer_tick[14:0] disable_evt_timer	64 0 400 0	0x80_0320
0x1400	CRDT_COAL_CFG_1	rsvd[12:0] dis_fence_fix pld_fifo_th[7:0] crdt_timer_th[9:0]	NA 0 16 16	0x4010
0x1404	CRDT_COAL_CFG_2	rsv2[7:0] crdt_fifo_th[7:0] rsv1[4:0] crdt_cnt_th[10:0]	NA 56 NA 96	0x38_0060
0x15C	GLBL_RRQ_PCIE_THROT	req_throt_en req_throt dat_throt_en dat_throt	0 192 1 20480	0x604_5000
0x160	GLBL_RRQ_AXIMM_THROT	req_throt_en req_throt dat_throt_en dat_throt	0 0 0 0	0
0x158	GLBL_RRQ_BRG_THROT	req_throt_en req_throt dat_throt_en dat_throt	1 192 1 20480	0x8604_5000
0xE24	H2C_REQ_THROT_PCIE	req_throt_en_req req_throt req_throt_en_data data_thresh	1 192 1 24576	0x8604_6000
0xE2C	H2C_REQ_THROT_AXIMM	req_throt_en_req req_throt req_throt_en_data data_thresh	1 64 1 16384	0x8204_4000
0x12EC	H2C_MM_DATA_THROT	data_throt_en data_throt	1 20480	0x1_5000
0x250	QDMA_GLBL_DSC_CFG	c2h_uodsc_limit (Soft IP) h2c_uodsc_limit (Soft IP) uodsc_limit (KS-B) Max_dsc_fetch wb_acc_int	5 8 NA 2 5	0x50_2015
0x4C	CONFIG_BLOCK_MISC_CONTROL	10b_tag_en num_tags rq_metering_multiplier	0 256 9	0x1_0009
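
The register values in Table 1 can be reproduced from the listed field values. The following is a minimal sketch, assuming that each register's fields are packed contiguously in the order listed, most significant field first. The bit positions are inferred from the field widths shown in the table rather than taken from the register reference, and `pack_fields` is a hypothetical helper, not part of any AMD driver. Only registers whose field widths are fully listed in the table are shown.

```python
def pack_fields(fields):
    """Pack (width, value) pairs, listed most significant field first,
    into a single register value."""
    reg = 0
    for width, value in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        reg = (reg << width) | value
    return reg


# Widths and values as listed in Table 1; reserved and NA fields are
# packed as 0 (an assumption).
PFCH_CFG_2 = pack_fields([
    (1, 1),      # fence
    (2, 0),      # rsvd[1:0]
    (1, 0),      # var_desc_no_drop
    (16, 1024),  # pfch_ll_sz_th[15:0]
    (6, 15),     # var_desc_num_pfch[5:0]
    (6, 8),      # num_pfch[5:0]
])
PFCH_CFG_4 = pack_fields([
    (15, 64),    # glb_evt_timer_tick[14:0]
    (1, 0),      # disable_glb_evt_timer
    (15, 400),   # evt_timer_tick[14:0]
    (1, 0),      # disable_evt_timer
])
CRDT_COAL_CFG_1 = pack_fields([
    (13, 0),     # rsvd[12:0]
    (1, 0),      # dis_fence_fix
    (8, 16),     # pld_fifo_th[7:0]
    (10, 16),    # crdt_timer_th[9:0]
])
CRDT_COAL_CFG_2 = pack_fields([
    (8, 0),      # rsv2[7:0]
    (8, 56),     # crdt_fifo_th[7:0]
    (5, 0),      # rsv1[4:0]
    (11, 96),    # crdt_cnt_th[10:0]
])

for name, value in [
    ("PFCH_CFG_2", PFCH_CFG_2),
    ("PFCH_CFG_4", PFCH_CFG_4),
    ("CRDT_COAL_CFG_1", CRDT_COAL_CFG_1),
    ("CRDT_COAL_CFG_2", CRDT_COAL_CFG_2),
]:
    print(f"{name} = 0x{value:08X}")
```

Under these assumptions the packed values match the Register Value column for the four registers shown, which is a useful cross-check before changing an individual field.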

QDMA_C2H_INT_TIMER_TICK (0xB0C) is set to 25, which corresponds to 100 ns (1 tick = 4 ns for a 250 MHz user clock).
C2H trigger mode is set to user timer, with the counter set to 64 and the timer set to match the round-trip latency. The global timer register should hold a value of 30 for 3 μs (30 intervals of 100 ns).
TX/RX API burst size = 64, ring depth = 2048. The driver should update the TX/RX PIDX in batches of 64.
PCIe MPS = 256 bytes, MRRS >= 512 bytes, Extended Tag enabled, Relaxed Ordering enabled.
The driver should update the completion CIDX in batches of 64, before updating the C2H PIDX, to reduce the number of MMIO writes.
The driver should update the H2C PIDX in batches of 64, and also on the last descriptor of a scatter-gather list.
C2H context:
- bypass = 0 (Internal mode)
- frcd_en = 1
- qen = 1
- wbk_en = 1
- irq_en = irq_arm = int_aggr = 0
C2H prefetch context:
- pfch = 1
- bypass = 0
- valid = 1
C2H CMPT context:
- en_stat_desc = 1
- en_int = 0 (Poll mode)
- int_aggr = 0 (Poll mode)
- trig_mode = 5
- counter_idx = corresponding to 64
- timer_idx = corresponding to 3 μs
- valid = 1
H2C context:
- bypass = 0 (Internal mode)
- frcd_en = 0
- fetch_max = 0
- qen = 1
- wbk_en = 1
- wbi_chk = 1
- wbi_intvl_en = 1
- irq_en = 0 (Poll mode)
- irq_arm = 0 (Poll mode)
- int_aggr = 0 (Poll mode)
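
The timer figures in the settings above follow from simple tick arithmetic. The following is a minimal sketch, assuming the 250 MHz user clock stated above; the function names are hypothetical and do not correspond to any driver API.

```python
USER_CLK_HZ = 250_000_000  # user clock frequency quoted in the text


def tick_period_ns(timer_tick_reg, clk_hz=USER_CLK_HZ):
    """Length of one timer interval in ns for a given
    QDMA_C2H_INT_TIMER_TICK value (clock cycles per interval)."""
    clk_period_ns = 1e9 / clk_hz
    return timer_tick_reg * clk_period_ns


def timer_value_for(target_us, timer_tick_reg=25, clk_hz=USER_CLK_HZ):
    """Global timer register value that gives roughly target_us,
    rounded to the nearest whole interval."""
    return round(target_us * 1000 / tick_period_ns(timer_tick_reg, clk_hz))


print(tick_period_ns(25))   # interval length in ns
print(timer_value_for(3))   # timer value for a 3 us timeout
```

With the values from the text, a tick register of 25 gives a 100 ns interval, and a 3 μs timeout needs a timer value of 30.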

For optimal QDMA streaming performance, packet buffers of the descriptor ring should be aligned to at least 256 bytes.

Recommended: Limit the total outstanding descriptor fetch on the PCIe link to less than 8 KB. For example, with 16-byte descriptors, limit the outstanding credits across all queues to 512.
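
The credit limit in the recommendation above is the fetch budget divided by the descriptor size. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and the descriptor sizes other than 16 bytes are shown only for illustration.

```python
FETCH_BUDGET_BYTES = 8 * 1024  # 8 KB outstanding descriptor fetch budget


def max_outstanding_credits(desc_size_bytes, budget_bytes=FETCH_BUDGET_BYTES):
    """Upper bound on outstanding credits, summed across all queues,
    for a given descriptor size."""
    if desc_size_bytes <= 0:
        raise ValueError("descriptor size must be positive")
    return budget_bytes // desc_size_bytes


# 16 B is the case given in the text; the other sizes are illustrative.
for size in (8, 16, 32, 64):
    print(f"{size:2d} B descriptors: {max_outstanding_credits(size)} credits")
```

The result is a total across all queues, so a design with many active queues has to divide this budget among them.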

Performance in Descriptor Bypass Mode

When the design is configured in descriptor bypass mode, all of the above settings apply. The following recommendations further improve performance in bypass mode.

When the h2c_byp_in_st_sdi port of the bypass input interface is set, the QDMA IP generates a status writeback for every packet. AMD recommends asserting this port only once every 32 or 64 packets. If no more descriptors are left, assert h2c_byp_in_st_sdi on the last descriptor. This requirement applies on a per-queue basis, to AXI4 (H2C and C2H) bypass transfers and to AXI4-Stream H2C transfers.
For AXI4-Stream C2H simple bypass mode, the dsc_crdt_in_fence port should be set to 1 for performance reasons. This recommendation assumes that the user design has already coalesced the credits for each queue before sending them to the IP. In internal mode, set the fence bit in the QDMA_C2H_PFCH_CFG_2 (0xA84) register.
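
The h2c_byp_in_st_sdi recommendation above amounts to a per-queue policy: request a status writeback every N descriptors and on the last one. The following is a behavioral sketch only, not RTL. It assumes one descriptor per packet, and `sdi_flags` is a hypothetical name.

```python
def sdi_flags(num_descriptors, interval=64):
    """Return one boolean per descriptor of a single queue: True where
    h2c_byp_in_st_sdi would be asserted under the batching policy."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    flags = []
    for i in range(num_descriptors):
        is_batch_boundary = (i + 1) % interval == 0
        is_last = i == num_descriptors - 1
        flags.append(is_batch_boundary or is_last)
    return flags


flags = sdi_flags(150, interval=64)
asserted_at = [i for i, flag in enumerate(flags) if flag]
print(len(asserted_at), "writebacks requested at descriptors", asserted_at)
```

In this model, a burst of 150 descriptors requests three status writebacks instead of 150. The final descriptor is always flagged so that the tail of the burst is still reported.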

Performance Optimization Based on Available Cache/Buffer Size

Prefetch cache size: The QDMA prefetch cache size is programmable (16 or 64). When the QDMA is configured in internal mode, the IP has space to prefetch descriptors for up to 64 queues. If you use more than 64 queues (for example, 65), the engine evicts the least used queue to make room for the new one. This can reduce performance for small packet sizes when more than 64 queues are active. To support more queues, you can configure the QDMA IP in simple bypass mode and keep the descriptor cache outside the IP, which lets you manage the descriptor flow yourself.
CMPT data FIFO: The QDMA IP has a very shallow, two-deep completion FIFO. It is more efficient to implement a deeper FIFO (512) in your own design and feed its output to the QDMA completion input.
C2H streaming descriptor input FIFO: The QDMA IP has a descriptor input FIFO that is shared between the descriptor bypass input and internal mode. This FIFO is 1K deep. When it fills up, the IP applies back pressure to your bypass input.
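
The effect of exceeding the prefetch cache size can be illustrated with a small software model. The following is a sketch under stated assumptions: it treats "least used" as least recently used (LRU), which the text does not confirm, and it models traffic as a strict round robin across queues. The function names are hypothetical, and the model says nothing about the actual hardware replacement policy.

```python
from collections import OrderedDict


def prefetch_hit_rate(queue_sequence, cache_entries=64):
    """Fraction of accesses that find their queue already in a cache
    of cache_entries slots, using LRU eviction (an assumption)."""
    if not queue_sequence:
        return 0.0
    cache = OrderedDict()
    hits = 0
    for qid in queue_sequence:
        if qid in cache:
            hits += 1
            cache.move_to_end(qid)  # mark as most recently used
        else:
            if len(cache) >= cache_entries:
                cache.popitem(last=False)  # evict least recently used
            cache[qid] = True
    return hits / len(queue_sequence)


def round_robin(num_queues, rounds):
    """Queue IDs visited in strict round-robin order."""
    return [q for _ in range(rounds) for q in range(num_queues)]


print(prefetch_hit_rate(round_robin(64, 100)))  # working set fits
print(prefetch_hit_rate(round_robin(65, 100)))  # one queue too many
```

In this model, 64 queues miss only on the first pass, while 65 queues in round-robin order miss on every access, because each new queue evicts the one that is needed next. Real traffic is rarely this regular, but the model shows why an external descriptor cache in simple bypass mode can help when the queue count exceeds the cache size.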

Resource Utilization

For QDMA resource utilization, see the Resource Use web page.

Performance and Resource Utilization - 5.0 English

QDMA Subsystem for PCI Express Product Guide (PG302)
