Performance - 5.0 English

System Cache LogiCORE IP Product Guide (PG118)

Document ID
PG118
Release Date
2021-11-05
Version
5.0 English

Perceived performance depends on many factors, such as frequency, latency, and throughput. Which factor dominates is application-specific. The factors are also correlated; for example, achieving a high frequency can add latency, and widening the datapath to improve throughput can adversely affect frequency.

Read latency is defined as the number of clock cycles from the cycle in which the read address is accepted by the System Cache core to the cycle in which the first read data is available.

Write latency is defined as the number of clock cycles from the cycle in which the write address is accepted by the System Cache core to the cycle in which the BRESP is valid. These calculations assume that the start of the write data is aligned to the transaction address.

Snoop latency is defined as the time from the clock cycle in which a snoop request is accepted by the System Cache core to the cycle in which CRRESP or CDDATA is valid, whichever occurs last. Not all snoops result in a CDDATA transaction.

Maximum Frequencies

For details about performance, see Performance and Resource Utilization.

CCIX Cache Latency

CCIX latency calculations are in principle defined in the same way as in the introduction above, but the time in flight on the PCIe bus is not included. Also, no additional delays due to ATS/ATC virtual address translation are included.

For the different types of transactions, this means:

  • Read: From AXI ARVALID to start of TLP request + start of TLP response to AXI RRESP
  • Write: From AXI AWVALID to start of TLP request + start of TLP response to AXI BRESP
  • Snoop: From start of TLP request to start of TLP response
  • Scrub (automatic): From timer initiation of the scrub to completion
  • Scrub (manual): From AXI control write initiation of the scrub to completion

The latency depends on many factors such as traffic from other ports and conflict with earlier transactions. The numbers in the following table assume a completely idle System Cache core.

Table 1. System Cache CCIX Latency
Type CCIX Latency
Read Hit 16
Read Miss 45 + round-trip delay on PCIe for request
Read Miss Dirty max of: 45 + round-trip delay on PCIe for read request; 39 + round-trip delay on PCIe for write request

Write Hit 16
Write Miss 44 + round-trip delay on PCIe for request
Write Miss Dirty max of: 45 + round-trip delay on PCIe for read request; 39 + round-trip delay on PCIe for write request

Snoop (missing broadcast) 23
Snoop 26
Snoop with data 33
Scrub (automatic) 11
Scrub (manual) 17
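The Miss Dirty rows in Table 1 can be read as taking the slower of two overlapping operations: fetching the missing line over PCIe and writing back the evicted dirty line. A minimal sketch, where the PCIe round-trip values are hypothetical example inputs in System Cache clock cycles:

```python
# Miss Dirty latency per Table 1: the slower of the line fill (read request
# over PCIe) and the dirty-line write-back determines the overall latency.
def ccix_miss_dirty_latency(pcie_rt_read, pcie_rt_write):
    fill = 45 + pcie_rt_read     # 45 cycles + read-request round trip
    evict = 39 + pcie_rt_write   # 39 cycles + write-request round trip
    return max(fill, evict)      # the two operations proceed in parallel

print(ccix_miss_dirty_latency(pcie_rt_read=100, pcie_rt_write=100))  # 145
```

With equal round trips the read path dominates, since its fixed cost (45 cycles) exceeds the write path's (39 cycles).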

CHI Cache Latency

CHI latency calculations are defined in a similar way to CCIX, but the time in flight in the CHI domain is not included. Also, no additional delays due to ATS/ATC virtual address translation are included.

For the different types of transactions, this means:

  • Read: From AXI ARVALID to request FLIT + response FLIT to AXI RRESP
  • Write: From AXI AWVALID to request FLIT + response FLIT to AXI BRESP
  • Snoop: From start of request FLIT to start of response FLIT
  • Scrub (automatic): From timer initiation of the scrub to completion
  • Scrub (manual): From AXI control write initiation of the scrub to completion

The latency depends on many factors such as traffic from other ports and conflict with earlier transactions. The numbers in the following table assume a completely idle System Cache core.

Table 2. System Cache CHI Latency
Type CHI Latency
Read Hit 16
Read Miss 34 + round-trip delay on CHI for request
Read Miss Dirty max of: 32 + round-trip delay on CHI for read request; 31 + round-trip delay on CHI for write request

Write Hit 16
Write Miss 32 + round-trip delay on CHI for request
Write Miss Dirty max of: 32 + round-trip delay on CHI for read request; 31 + round-trip delay on CHI for write request

Snoop (missing broadcast) 23
Snoop 26
Snoop with data 33
Scrub (automatic) 11
Scrub (manual) 17

ATS/ATC Latency

The inclusion of the Address Translation Service (ATS) and the ATC TLB in the AXI port interfaces adds latency to the AXI4/ACE cache latency. In most cases, address locality keeps the hit latency minimal, but miss latency is expected in restarted systems with no accumulated translations and when the locality context changes.

Read latency is defined from the cycle in which the read address is accepted by the System Cache core to the cycle in which the first read data is available via the Address Translation Service.

Write latency is defined from the cycle in which the write address is accepted by the System Cache core to the start of the write data, which is assumed to be aligned to the transaction address, via the Address Translation Service.

For the best-case transaction latency, it is assumed that previous accesses have already used the address range, so both reads and writes result in ATC TLB hits.

When a miss occurs in the ATC TLB, it is assumed that the ATC table holds a copy of a valid translation; best-case and worst-case latency is then added for the ATC search.

Note: The worst-case latency is related to the ATC table size and is depth dependent; the default size of 256 entries is used here.

The expected average ATC search latency depends on the temporal locality of the addresses in use in the ATC table, where the n most recently mapped translations are cached and ATC search hits fall within these n translations.

In the case of an ATC table miss, the latency is extended by the PCIe Root Complex and host TA translation latencies, which are defined at system level and are outside the scope of the System Cache definition.

The host TA best-case translation time is the round-trip transaction latency from request to response plus the host TA lookup time. This time can be extended by one or more page request round trips plus host page management latencies, and finally another retry translation round trip with host TA lookup.

ATC table lookup latency is two clock cycles in the best case, 256 + 2 clock cycles in the worst case, and n/2 + 2 clock cycles on average (assuming locality over the n most recent accesses, so that hits fall within n entries).
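The lookup-latency figures above can be expressed as a small helper. A sketch, where the default table size of 256 entries and the cycle formulas come from the text, and the locality window n is an example input:

```python
# Best-, average-, and worst-case ATC table lookup latency in clock cycles:
# 2 best case, table_size + 2 worst case, and n/2 + 2 on average.
def atc_lookup_cycles(n, table_size=256):
    best = 2                 # translation found immediately
    average = n / 2 + 2      # hit expected within the n most recent translations
    worst = table_size + 2   # full search of the (default 256-entry) table
    return best, average, worst

print(atc_lookup_cycles(n=16))  # (2, 10.0, 258)
```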

Table 3. System Cache Core Latencies with Address Translation
Type AXI4 Port Latency with Address Translation (see CHI/CCIX master port read/write latency)
Read Hit/Miss, ATC TLB Hit 2 + Master port Read latency (Hit/Miss)
Read Hit/Miss, ATC TLB Miss, ATC Table Hit 3 + ATC table lookup + Master port Read latency (Hit/Miss)
Read Hit/Miss, ATC TLB Miss and ATC Table Miss 3 + ATC table lookup Worst + latency added by PCIe ATS lookup + Master port Read latency (Hit/Miss)
Write Hit/Miss, ATC TLB Hit 2 + Master port Write burst latency (Hit/Miss)
Write Hit/Miss, ATC TLB Miss, ATC Table Hit 3 + ATC table lookup + Master port Write burst latency (Hit/Miss)
Write Hit/Miss, ATC TLB Miss and ATC Table Miss 3 + ATC table lookup Worst + latency added by PCIe ATS lookup + Master port Write burst latency (Hit/Miss)
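As an illustration, the three read rows of Table 3 can be combined into one hedged sketch. The cycle constants are taken from the table; the master-port read latency and PCIe ATS delay are placeholder inputs:

```python
# Translation overhead added in front of the master-port read latency,
# depending on where the translation is found (rows of Table 3).
def translated_read_latency(master_read, atc_tlb_hit, atc_table_hit,
                            atc_lookup=2, worst_lookup=258, pcie_ats=0):
    if atc_tlb_hit:
        return 2 + master_read                            # ATC TLB hit
    if atc_table_hit:
        return 3 + atc_lookup + master_read               # TLB miss, ATC table hit
    return 3 + worst_lookup + pcie_ats + master_read      # TLB and ATC table miss

print(translated_read_latency(master_read=16, atc_tlb_hit=True, atc_table_hit=False))  # 18
print(translated_read_latency(master_read=16, atc_tlb_hit=False, atc_table_hit=True))  # 21
```

The write rows follow the same pattern with the master-port write burst latency substituted.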

AXI4/ACE Cache Latency

Here latency is used as described in the introduction.

The latency depends on many factors, such as traffic from other ports and conflicts with earlier transactions. The numbers in the following table assume a completely idle System Cache core and no write data delay for transactions on one of the optimized ports. For transactions using a generic AXI4 port, an additional two clock cycles of latency are added.

Table 4. System Cache Core Latencies for Optimized Port
Type Optimized Port Latency
Read Hit 6
Read Miss 7 + latency added by memory subsystem
Read Miss Dirty maximum of: 7 + latency added by memory subsystem; 7 + latency added for evicting dirty data (cache line length * 32 / M_AXI Data Width)

Write Hit 3 + burst length
Write Miss non-bufferable transaction: 7 + latency added by memory subsystem for writing data; bufferable transaction: same as Write Hit
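The eviction term in the Read Miss Dirty row can be made concrete. A sketch, assuming the cache line length is expressed in 32-bit words and M_AXI Data Width in bits (so the quotient is the number of write-back data beats); the memory subsystem latency is a placeholder input:

```python
# Read Miss Dirty latency per Table 4: the slower of the line fill and the
# dirty-line eviction determines the overall latency.
def read_miss_dirty_latency(mem_latency, line_len_words, m_axi_width_bits):
    fill = 7 + mem_latency                               # fetch the missing line
    evict = 7 + line_len_words * 32 // m_axi_width_bits  # write back the dirty line
    return max(fill, evict)                              # the two proceed in parallel

# Example: a 16-word line over a 128-bit master port needs 4 eviction beats.
print(read_miss_dirty_latency(mem_latency=20, line_len_words=16, m_axi_width_bits=128))  # 27
```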

Enabling optimized port cache coherency affects the latency and also introduces new types of transaction latencies. The numbers in the following table assume a completely idle System Cache core and no write data delay for transactions on one of the optimized ports. Transactions from a generic port still incur two clock cycles of extra latency.

Table 5. System Cache Core Latencies for Cache Coherent Optimized Port
Type Coherent Optimized Port Latency
DVM Message 9 + latency added by snooped masters
DVM Sync 12 + latency added by snooped masters
Read Hit 9 + latency added by snooped masters
Read Miss 10 + latency added by snooped masters + latency added by memory subsystem
Read Miss Dirty maximum of: 10 + latency added by snooped masters + latency added by memory subsystem; 10 + latency added by snooped masters + latency added for evicting dirty data (cache line length * 32 / M_AXI Data Width)

Write Hit maximum of: 3 + burst length; 6 + latency added by snooped masters

Write Miss non-bufferable transaction: 10 + latency added by snooped masters + latency added by memory subsystem for writing data; bufferable transaction: same as Write Hit

When master port cache coherency is enabled, the System Cache core provides CRRESP and any snoop data as quickly as possible. The response time varies according to the current state and to transactions in flight, both internal and external, insofar as they affect the System Cache state. See the following table for latency values.
Table 6. Core Latency Values for Master Port Cache Coherency
Type Master Port Snoop Latency
Snoop Miss 3 + latency of any preceding snoop blocking progress; 4 + latency of any preceding snoop blocking progress (if hazard with pipelined access); 5 + latency of any preceding snoop blocking progress + latency to complete active write with hazard

Snoop Hit 4 + latency to acquire data access + latency of any preceding snoop blocking progress; 5 + latency of any preceding snoop blocking progress (if hazard with pipelined access); 5 + latency of any preceding snoop blocking progress + latency to complete active write with hazard

The numbers for an actual MicroBlaze application vary depending on access patterns, hit/miss ratio, and other factors. Example values from a system (see Typical System with a Single Processor above) running the iperf network testing tool with the lwIP TCP/IP stack in raw mode are shown in the following four tables. The first table contains the hit rate for transactions from all ports; the remaining tables show the per-port hit rate and latencies for the three active ports.
Table 7. Application Total Hit Rate
Type Hit Rate
Read 99.82%
Write 92.93%
Table 8. System Cache Hit Rate and Latencies for MicroBlaze D-Side Port
Type Hit Rate Min Max Average Standard Deviation
Read 99.68% 6 290 8 3
Write 96.63% 4 31 4 1
Table 9. System Cache Hit Rate and Latencies for MicroBlaze I-Side Port
Type Hit Rate Min Max Average Standard Deviation
Read 99.96% 5 568 6 2
Write N/A N/A N/A N/A N/A
Table 10. System Cache Hit Rate and Latencies for Generic Port
Type Hit Rate Min Max Average Standard Deviation
Read 76.68% 7 388 18 13
Write 9.78% 6 112 24 5

Throughput

The System Cache core is fully pipelined and has a theoretical maximum transaction rate of one read or write hit data beat concurrent with one read miss and one write miss data beat per clock cycle, when there are no conflicts with earlier transactions.

This theoretical limit is subject to memory subsystem bandwidth, intra-transaction conflicts and cache hit detection overhead, which reduce the achieved throughput to less than three data beats per clock cycle.
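Under the three-beats-per-cycle ceiling described above, the theoretical peak bandwidth scales with clock frequency and data width. A sketch, where the frequency and data width are hypothetical example values:

```python
# Theoretical ceiling: up to 3 data beats per clock cycle (one hit beat plus
# one read-miss and one write-miss beat), each data_width_bits wide.
def peak_bandwidth_bytes_per_s(freq_hz, data_width_bits, beats_per_cycle=3):
    return freq_hz * beats_per_cycle * data_width_bits // 8

print(peak_bandwidth_bytes_per_s(freq_hz=250_000_000, data_width_bits=128))  # 12000000000
```

Real designs land well below this figure because of memory subsystem bandwidth, intra-transaction conflicts, and hit-detection overhead, as noted above.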