Performance and Resource Utilization - 4.0 English

QDMA Subsystem for PCI Express Product Guide (PG302)

Document ID: PG302
Release Date: 2022-05-20
Version: 4.0 English

Performance

QDMA performance results and detailed analysis are available in AR 71453.

Xilinx provides two example designs for you to experiment with. The standard example design is for functional testing only. To generate an example design for performance analysis, use the following Tcl command:

set_property CONFIG.performance_exdes {true} [get_ips qdma_0]

Below are the QDMA register settings that Xilinx recommends for better performance. Performance numbers vary with the system and the operating system in use.

  • QDMA_C2H_INT_TIMER_TICK (0xB0C) set to 25, corresponding to 100 ns (1 tick = 4 ns for a 250 MHz user clock).
  • C2H trigger mode set to Counter + Timer, with the counter set to 64 and the timer set to match the round-trip latency. The global timer register should have a value of 30 for 3 μs (30 × 100 ns).
  • QDMA_GLBL_DSC_CFG (0x250), max_desc_fetch = 6, wb_int = 5
  • QDMA_H2C_REQ_THROT (0xE24), req_throt_en_data = 1, data_thresh = 0x4000
  • QDMA_C2H_PFCH_CFG (0xB08/0xA80/0xA84)
    • evt_qcnt_th = (QDMA_C2H_PFCH_CACHE_DEPTH/2) - 2
    • pfch_qcnt = QDMA_C2H_PFCH_CACHE_DEPTH/2
    • num_pfch = 8. A minimum of 8 is recommended. In environments with a low number of active queues, programming higher values can help boost performance.
    • pfch_fl_th = 256
  • QDMA_C2H_WRB_COAL_CFG (0xB50)
    • max_buf_sz = QDMA_C2H_CMPT_COAL_BUF_DEPTH (0xBE4)
    • tick_val = 25
    • tick_cnt = 5
  • TX/RX API burst size = 64, ring depth = 2048. The driver should update TX/RX PIDX in batches of 64.
  • PCIe MPS = 256 bytes, MRRS >= 512 bytes, Extended Tag Enabled, Relaxed Ordering Enabled
  • The driver should update the completion CIDX in batches of 64, before updating the C2H PIDX, to reduce the number of MMIO writes.
  • The driver should update the H2C PIDX in batches of 64, and also update it for the last descriptor of the scatter-gather list.
  • C2H context:
    • bypass = 0 (Internal mode)
    • frcd_en = 1
    • qen = 1
    • wbk_en = 1
    • irq_en = irq_arm = int_aggr = 0
  • C2H prefetch context:
    • pfch = 1
    • bypass = 0
    • valid = 1
  • C2H CMPT context:
    • en_stat_desc = 1
    • en_int = 0 (Poll mode)
    • int_aggr = 0 (Poll mode)
    • trig_mode = 5
    • counter_idx = corresponding to 64
    • timer_idx = corresponding to 3 μs
    • valid = 1
  • H2C context:
    • bypass = 0 (Internal mode)
    • frcd_en = 0
    • fetch_max = 0
    • qen = 1
    • wbk_en = 1
    • wbi_chk = 1
    • wbi_intvl_en = 1
    • irq_en = 0 (Poll mode)
    • irq_arm = 0 (Poll mode)
    • int_aggr = 0 (Poll mode)
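
The timer and prefetch values in the list above follow from simple arithmetic. The Python sketch below derives them under stated assumptions: a 250 MHz user clock, and a placeholder cache depth of 64 standing in for whatever value QDMA_C2H_PFCH_CACHE_DEPTH holds on a given device. The function names are hypothetical and are not part of any Xilinx driver API.

```python
def timer_tick_count(tick_ns: int, user_clk_mhz: int = 250) -> int:
    """User-clock cycles needed for one timer tick of tick_ns nanoseconds.

    At 250 MHz one cycle lasts 4 ns, so a 100 ns tick needs 25 cycles.
    """
    cycles, remainder = divmod(tick_ns * user_clk_mhz, 1000)
    if remainder:
        raise ValueError("tick period is not a whole number of clock cycles")
    return cycles


def timer_count(timeout_ns: int, tick_ns: int) -> int:
    """Timer register value (in ticks) that yields the requested timeout."""
    count, remainder = divmod(timeout_ns, tick_ns)
    if remainder:
        raise ValueError("timeout is not a whole number of ticks")
    return count


def prefetch_thresholds(cache_depth: int) -> dict:
    """evt_qcnt_th and pfch_qcnt as recommended for QDMA_C2H_PFCH_CFG."""
    return {
        "evt_qcnt_th": cache_depth // 2 - 2,
        "pfch_qcnt": cache_depth // 2,
    }


if __name__ == "__main__":
    print(timer_tick_count(100))       # QDMA_C2H_INT_TIMER_TICK
    print(timer_count(3000, 100))      # global timer value for 3 us
    print(prefetch_thresholds(64))     # 64 is an assumed cache depth
```

With a different user clock or a different round-trip latency, the same functions give the corresponding register values; the recommended figures above are only the 250 MHz, 3 μs case.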

For optimal QDMA streaming performance, packet buffers of the descriptor ring should be aligned to at least 256 bytes.
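
As an illustration of the alignment recommendation, the sketch below rounds a buffer address up to the next 256-byte boundary and checks alignment. It is a minimal Python model with hypothetical helper names; a real driver would apply the same arithmetic to DMA addresses returned by its allocator.

```python
ALIGNMENT = 256  # bytes, the minimum alignment recommended above


def align_up(addr: int, alignment: int = ALIGNMENT) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (addr + alignment - 1) & ~(alignment - 1)


def is_aligned(addr: int, alignment: int = ALIGNMENT) -> bool:
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0


if __name__ == "__main__":
    raw = 0x1001
    aligned = align_up(raw)
    print(hex(aligned), is_aligned(aligned))
```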

Performance in Descriptor Bypass Mode

When the design is configured in descriptor bypass mode, all of the above settings apply. The following recommendations further improve performance in bypass mode.

  1. When the h2c_byp_in_st_sdi port is set, the QDMA IP generates the status write back for every packet. Xilinx recommends asserting this port only once every 32 or 64 packets. If no more descriptors are left, assert h2c_byp_in_st_sdi at the last descriptor. This requirement applies on a per-queue basis, and covers AXI4-MM (H2C and C2H) bypass transfers and AXI4-Stream H2C transfers.
  2. For AXI4-Stream C2H Simple bypass mode, the dsc_crdt_in_fence port should be set to 1 for performance reasons. This recommendation assumes that the user design has already coalesced credits for each queue and sent them to the IP. In internal mode, set the fence bit in the QDMA_C2H_PFCH_CFG_2 (0xA84) register.
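
The cadence described in item 1 can be modeled in a few lines. The Python sketch below is a software model only, since the actual port is driven by user logic in RTL, and the function name is hypothetical. It marks the descriptors of one queue on which the sdi port would be asserted: every Nth descriptor (N = 32 or 64) and the last descriptor when none remain.

```python
def sdi_flags(num_descriptors: int, interval: int = 64) -> list:
    """Per-descriptor flags for one queue: True where sdi would be asserted.

    sdi is asserted on every interval-th descriptor, and on the final
    descriptor so that the last status write back is never skipped.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    last = num_descriptors - 1
    return [
        (index + 1) % interval == 0 or index == last
        for index in range(num_descriptors)
    ]


if __name__ == "__main__":
    flags = sdi_flags(130, interval=64)
    print([i for i, asserted in enumerate(flags) if asserted])
```

The same pattern matches the H2C PIDX recommendation in the register settings list: update in batches of 64, plus once for the last descriptor.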

Resource Utilization

For QDMA resource utilization, see the Resource Use web page.