Improving Performance in the PS - 2021.2 English

Versal ACAP System Integration and Validation Methodology Guide (UG1388)

Document ID: UG1388
Release Date: 2021-11-19
Version: 2021.2 English

In the Versal ACAP processing system (PS), you can control quality of service (QoS) through two interconnect traffic types: low-latency datapaths and high-throughput datapaths. For more information, see the Versal ACAP Technical Reference Manual (AM011).

When considering processing system and memory subsystem performance, the first aspects that come to mind are processor performance, clock frequencies, memory hierarchy, caches, and, more generally, internal and external memory capabilities. While these considerations are important, understanding the processing system architecture is equally important.

The external shared DDR memory subsystem and the on-chip memory (OCM) are the two main memory subsystems in Versal devices. The masters accessing these memory subsystems can reside in the PL or in the PS. PL masters that traverse the PS to access the OCM or external DDR memory can significantly impact the performance of PS masters.

The PL master traffic routes are:

  • PL master (soft IP) attached to the NoC through a NoC master unit (NMU)
  • PL master attached to PS AXI interfaces
  • PL master attached to PS ACP

The AI Engine and CPM might also need to route traffic through the PS interconnect, and close attention must be paid to the multiple ways this traffic can be routed.

The internal PS masters (APU, RPU, DMA, and so on) generate traffic and use the PS interconnect, which connects to the NoC. Traffic can enter the PS through the following interfaces:

  • Direct PL to PS interfaces (S_AXI_FPD, S_ACE_LITE_FPD, S_AXI_LPD)
    • S_AXI_FPD (in FPD) is virtualized and non-coherent
    • S_ACE_LITE_FPD (in FPD) is virtualized and coherent
    • S_AXI_LPD (in LPD) can be configured as physical, or as virtualized and coherent
  • NoC to PS interfaces (NoC_FPD_AXI_0, NoC_FPD_AXI_1, NoC_FPD_CCI_0, NoC_FPD_CCI_1, all in FPD)
    • NoC_FPD_AXI_0 and NoC_FPD_AXI_1 are virtualized and non-coherent
    • NoC_FPD_CCI_0 and NoC_FPD_CCI_1 are virtualized and coherent
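
As a quick reference, the following Python sketch captures the attributes of the interfaces listed above and shows one way a candidate route could be narrowed down by coherency requirement. The helper function and its output are illustrative only and are not part of any Xilinx tool flow.

```python
# Quick-reference model of the PS ingress interfaces listed above.
# Attribute values come from this section; the helper is only an
# illustration of how a route might be narrowed down by requirements.

PS_INGRESS_INTERFACES = {
    # Direct PL-to-PS interfaces
    "S_AXI_FPD":      {"domain": "FPD", "virtualized": True, "coherent": False},
    "S_ACE_LITE_FPD": {"domain": "FPD", "virtualized": True, "coherent": True},
    # S_AXI_LPD is configurable: physical, or virtualized and coherent
    "S_AXI_LPD":      {"domain": "LPD", "virtualized": None, "coherent": None},
    # NoC-to-PS interfaces
    "NoC_FPD_AXI_0":  {"domain": "FPD", "virtualized": True, "coherent": False},
    "NoC_FPD_AXI_1":  {"domain": "FPD", "virtualized": True, "coherent": False},
    "NoC_FPD_CCI_0":  {"domain": "FPD", "virtualized": True, "coherent": True},
    "NoC_FPD_CCI_1":  {"domain": "FPD", "virtualized": True, "coherent": True},
}

def candidate_interfaces(need_coherency: bool) -> list[str]:
    """Return interfaces whose attributes can satisfy the coherency need."""
    return [
        name
        for name, attrs in PS_INGRESS_INTERFACES.items()
        if attrs["coherent"] in (need_coherency, None)  # None = configurable
    ]

print(candidate_interfaces(need_coherency=False))
# ['S_AXI_FPD', 'S_AXI_LPD', 'NoC_FPD_AXI_0', 'NoC_FPD_AXI_1']
```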

Understanding and leveraging the multiple ways a master can reach the external DDR memory or the OCM is critical to optimizing the routing and avoiding PS interconnect congestion.

For example, choosing to route the traffic of multiple masters through the CCI can be detrimental to performance, because a maximum of four CCI500 AXI4 master ports connect to four NoC NMUs. In the following figure, these are M2, M3, M4, and M5.

Figure 1. Master Ports Connected to NoC NMUs
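
As a rough illustration of why funneling many masters through the CCI can become a bottleneck, the following Python sketch totals the bandwidth demand of a few masters and spreads it evenly over the four CCI-to-NoC ports. All master names, bandwidth requests, and the per-port capacity are placeholder assumptions, not device specifications.

```python
# Back-of-the-envelope check of aggregate demand on the four CCI-to-NoC
# ports (M2-M5 in the figure above). All bandwidth figures below are
# hypothetical placeholders, not device specifications.

CCI_TO_NOC_PORTS = ["M2", "M3", "M4", "M5"]
PORT_CAPACITY_MBPS = 16_000          # assumed per-port capacity, illustration only

# Hypothetical masters routed through the CCI, with requested bandwidth in MB/s.
coherent_masters = {
    "APU":        6_000,
    "PL_master0": 8_000,
    "PL_master1": 8_000,
    "PL_master2": 4_000,
}

total_demand = sum(coherent_masters.values())
demand_per_port = total_demand / len(CCI_TO_NOC_PORTS)

print(f"Aggregate demand: {total_demand} MB/s over {len(CCI_TO_NOC_PORTS)} ports")
print(f"Average demand per CCI-to-NoC port: {demand_per_port:.0f} MB/s "
      f"(assumed capacity {PORT_CAPACITY_MBPS} MB/s)")
if demand_per_port > PORT_CAPACITY_MBPS:
    print("Warning: CCI-to-NoC ports are oversubscribed; consider moving "
          "non-coherent traffic to other PS or NoC ingress paths.")
```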

The SMMU and CCI can also impact performance. Each has shared internal resources that can add latency and therefore reduce throughput. If you are optimizing for high performance and virtualization/isolation and coherency are not required, using the CCI500 and the SMMU is not recommended.

Traffic regulation mechanisms are available at the source (the NMU), in the NoC switches, and at the destination slaves (the memory controllers (MC)). You can use these mechanisms to apply the desired QoS scheme.
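
The sketch below models source-side regulation conceptually as a token-bucket rate limiter: a master can inject traffic only while it has bandwidth allowance, which refills at a fixed rate. This is only a software illustration of the regulation idea; it does not describe the actual NMU or memory controller hardware, and the rate and burst values are arbitrary.

```python
# Conceptual model of source-side traffic regulation (rate limiting at the
# source). Illustration of the idea of throttling injected bandwidth only;
# it does not model the actual NoC hardware.

class TokenBucket:
    """Simple token-bucket regulator: traffic may be injected only while
    tokens (bytes of allowance) are available; tokens refill at a fixed rate."""

    def __init__(self, rate_bytes_per_cycle: float, burst_bytes: int):
        self.rate = rate_bytes_per_cycle   # sustained bandwidth allowance
        self.burst = burst_bytes           # maximum burst allowance
        self.tokens = float(burst_bytes)

    def tick(self) -> None:
        """Advance one cycle: accumulate allowance up to the burst limit."""
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_send(self, beat_bytes: int) -> bool:
        """Inject a beat if enough allowance exists, else stall the request."""
        if self.tokens >= beat_bytes:
            self.tokens -= beat_bytes
            return True
        return False

# Example: a regulator allowing ~16 bytes/cycle sustained, 256-byte bursts.
regulator = TokenBucket(rate_bytes_per_cycle=16, burst_bytes=256)
sent = stalled = 0
for _ in range(1000):
    regulator.tick()
    if regulator.try_send(beat_bytes=32):   # 32-byte beats from one master
        sent += 1
    else:
        stalled += 1
print(f"beats sent: {sent}, beats stalled: {stalled}")
```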

The PS interconnect does not support virtual channels, so physical separation must be used.

In the ingress direction (external masters to PS slaves), each NSU supports a single traffic class, so different classes must enter the PS through different physical ports. Similarly, in the egress direction, the internal PS interconnect carries different traffic classes over different physical channels into the NMUs that connect to the horizontal NoC. For example, two of the four CCI-to-NoC channels can carry low-latency (LL) traffic (M2 and M3 in the above figure), while the other two can carry best-effort (BE) traffic (M4 and M5 above).
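
The following Python sketch illustrates this physical separation: masters with LL traffic are mapped onto two of the CCI-to-NoC channels (M2 and M3) and BE masters onto the other two (M4 and M5), following the example above. The master names and the round-robin assignment are hypothetical and purely illustrative.

```python
# Illustration of physical separation of traffic classes across the four
# CCI-to-NoC channels (M2-M5), following the LL/BE example above.
# Master names are hypothetical.

CHANNELS_BY_CLASS = {
    "LL": ["M2", "M3"],   # low-latency traffic class
    "BE": ["M4", "M5"],   # best-effort traffic class
}

def assign_channels(masters: dict[str, str]) -> dict[str, str]:
    """Round-robin each master onto a physical channel reserved for its class."""
    next_idx = {cls: 0 for cls in CHANNELS_BY_CLASS}
    assignment = {}
    for master, traffic_class in masters.items():
        channels = CHANNELS_BY_CLASS[traffic_class]
        assignment[master] = channels[next_idx[traffic_class] % len(channels)]
        next_idx[traffic_class] += 1
    return assignment

masters = {
    "APU_core0": "LL",
    "APU_core1": "LL",
    "PL_dma0":   "BE",
    "PL_dma1":   "BE",
}
print(assign_channels(masters))
# {'APU_core0': 'M2', 'APU_core1': 'M3', 'PL_dma0': 'M4', 'PL_dma1': 'M5'}
```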