MicroBlaze Based Designs
The system cache size should be configured to be larger than the connected L1 caches to achieve any improvements. Increasing the system cache size increases hit rates and has a positive effect on performance. The downside of increasing the system cache size is an increased number of FPGA resources being used. Higher set associativity usually increases the hit rate and the application performance.
For maximum performance, the MicroBlaze™ processor should be configured to match the target frequency of the System Cache core and the rest of the system. Depending on how high the target frequency is, the MicroBlaze processor configuration might need to be tweaked to achieve this goal.
There are two primary alternatives that should be considered first. With MicroBlaze v10.0 and later, there is a frequency-optimized 8-stage pipeline that has a slightly higher clock cycles per instruction (CPI) count but is free from parameter configuration frequency dependencies. It always matches the frequency of the System Cache core. Note that the longer pipeline also results in increased resource use.
If a MicroBlaze processor with a 5-stage pipeline is used, there are a number of factors that can be changed to increase the frequency. These techniques also apply to a 3-stage pipeline when the options are supported. The maximum frequency of the MicroBlaze processor is affected by its cache sizes. Smaller MicroBlaze processor cache sizes usually mean that the MicroBlaze processor can meet higher frequency targets, but at the cost of reduced L1 hit rates. The optimum point for the frequency versus cache size trade-off using the System Cache core occurs when the MicroBlaze processor caches are set to either 256 or 512 bytes (depending on other MicroBlaze configuration settings). For improved frequency, implement the MicroBlaze cache tags with distributed RAM.
Enabling the MicroBlaze branch target cache (BTC) can improve performance but might reduce the maximum obtainable frequency for the 5-stage pipeline. When using an 8-stage pipeline, the BTC should always be enabled with maximum configuration, if resources are available. Depending on the rest of the MicroBlaze processor configuration, smaller BTC sizes, for example 32 entries (C_BRANCH_TARGET_CACHE_SIZE = 3), could be considered.
MicroBlaze processor advanced cache features can be used to tweak performance, but they are only available in non-coherent configurations. Enabling MicroBlaze processor victim caches increases MicroBlaze processor cache hit rates, with improved performance as a result. Enabling victim caches can, however, reduce the MicroBlaze processor maximum frequency in some cases. The instruction stream cache should be disabled, because it reduces performance when connected to the System Cache core. MicroBlaze processor performance is often improved by using 8-word cache lines on the Instruction Cache and Data Cache.
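The recommendations in this section can be collected into a simple pre-implementation check. The following Python sketch is illustrative only: CacheConfig, its fields, and check_cache_config are hypothetical names chosen for this example, not part of any Xilinx tool or API, and the rules only restate the guidelines above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheConfig:
    """Hypothetical summary of a design's cache settings (sizes in bytes)."""
    system_cache_size: int
    l1_icache_size: int
    l1_dcache_size: int
    icache_line_words: int
    dcache_line_words: int
    instruction_stream_cache: bool
    victim_caches: bool
    coherent: bool


def check_cache_config(cfg: CacheConfig) -> List[str]:
    """Return one warning per setting that goes against the guidelines."""
    warnings = []
    # The system cache only helps when it is larger than the connected L1 caches.
    if cfg.system_cache_size <= max(cfg.l1_icache_size, cfg.l1_dcache_size):
        warnings.append("system cache should be larger than the L1 caches")
    # The instruction stream cache reduces performance with the System Cache core.
    if cfg.instruction_stream_cache:
        warnings.append("instruction stream cache should be disabled")
    # 8-word cache lines often improve performance.
    if cfg.icache_line_words != 8 or cfg.dcache_line_words != 8:
        warnings.append("consider 8-word cache lines on both L1 caches")
    # Advanced cache features are only available in non-coherent configurations.
    if cfg.coherent and (cfg.victim_caches or cfg.instruction_stream_cache):
        warnings.append("advanced cache features require a non-coherent configuration")
    return warnings
```

An empty list means that none of these guidelines is violated; it does not guarantee that the design meets timing or reaches its best performance.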
CCIX Based Designs
In the CCIX case, there is no upstream cache that can be configured in relation to System Cache, and accelerators or kernels do not usually offer many parameterization options. However, there are still a number of options that can be tweaked in System Cache, such as size and transaction limits. The CCIX CPM (Versal Premium) use case has the same configuration possibilities as CCIX XDMA, but with an asynchronous interface to the CPM, which enables System Cache to use a single clock domain for CCIX and ATS independent of the PCIe frequency.
When CCIX designs are configured with ATS, the ATS data width must be configured according to the CCIX/PCIe data width, which can be 256 or 512 bits, as well as 1024 bits in Versal Premium devices.
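The width rule above can be expressed as a small check. This is a minimal sketch, assuming that "configured according to" means the ATS width equals the CCIX/PCIe width; the function name and arguments are hypothetical and do not correspond to System Cache parameter names.

```python
def ats_width_is_valid(ats_width_bits, pcie_width_bits, versal_premium=False):
    """Check an ATS data width against the CCIX/PCIe data width.

    Assumption: the ATS width must equal the CCIX/PCIe width, which may be
    256 or 512 bits, or additionally 1024 bits on Versal Premium devices.
    """
    allowed = {256, 512, 1024} if versal_premium else {256, 512}
    return pcie_width_bits in allowed and ats_width_bits == pcie_width_bits
```

For example, ats_width_is_valid(512, 512) is accepted, while a 1024-bit setting is only accepted when versal_premium is set.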
The parameter configuration for maximum achievable performance is unique to each application and system configuration; the optimum settings for one case are not necessarily the same for another.
Another property of CCIX based designs is that the frequency is fixed to 250 MHz for System Cache. Because System Cache with CCIX is a large and fairly complex IP core, used in large designs, it is often necessary to use advanced Vivado implementation strategies in addition to adjusting System Cache parameters to achieve the frequency target. For the most difficult designs, it might be necessary to try different strategies to find the one that provides the best result for a particular design, but in many cases it is sufficient to use the strategy Performance_NetDelay_high. Additional improvements can be achieved by enabling Post-Route Phys Opt Design with the directive Explore together with advanced Vivado implementation strategies.
See also UltraFast Design Methodology Timing Closure Quick Reference Guide (UG1292) and Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906) for additional information and resources for achieving timing closure.
CHI Based Designs
CHI designs have the same configuration possibilities as CCIX, but with an asynchronous interface to the CPM and a fixed data width of 512 bits.
When CHI designs are configured with ATS, the ATS data width must be configured according to the PCIe data width, independent of the CHI data width, and must use the same derived PCIe clock and frequency. When CHI designs with ATS cannot use the same clock and frequency as PCIe, the ATS AXI4-Stream interfaces must use an asynchronous connection (alternatively, an external ATS Switch with dual clock support). In both configurations, the ATS interface must be configured with the PCIe data width.
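The two ATS cases for CHI designs can be summarized as a decision. The sketch below is illustrative only: chi_ats_plan and the (clock name, frequency in MHz) tuples are hypothetical modeling choices for this example, not System Cache parameters.

```python
CHI_DATA_WIDTH_BITS = 512  # fixed for CHI designs, as stated above


def chi_ats_plan(pcie_width_bits, ats_clock, pcie_clock):
    """Return the ATS width and connection type for a CHI design with ATS.

    Clocks are modeled as (name, frequency_mhz) tuples; this representation
    is an assumption made for illustration.
    """
    # ATS shares the PCIe clock only if both source and frequency match.
    same_clock = ats_clock == pcie_clock
    return {
        # The ATS width always follows PCIe, independent of the CHI width.
        "ats_width_bits": pcie_width_bits,
        "chi_width_bits": CHI_DATA_WIDTH_BITS,
        "connection": "synchronous" if same_clock else "asynchronous",
    }
```

A result of "asynchronous" corresponds to the case where the ATS AXI4-Stream interfaces need an asynchronous connection or an external ATS Switch with dual clock support.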
CHI designs are also large and can benefit from advanced Vivado implementation strategies.