Performance Improvement - 1.1 English

RAMA 1.1 LogiCORE IP Product Guide (PG310)

Document ID: PG310
Release Date: 2021-01-21
Version: 1.1 English

The following table reports, for an example system, an improvement multiplier (bandwidth with RAMA divided by bandwidth without RAMA) rather than an efficiency figure. For example, for 64 B read-only transactions the measured bandwidth is 4225 MB/s without RAMA and 40730 MB/s with RAMA, an improvement of almost 10 times.

Table 1. Random Access Performance Improvement
Access Type | 32 B | 64 B | 128 B | 256 B | 512 B
Read Only | 10 | 10 | 5 | 3 | 2
Write Only | 2 | 2 | 1.5 | 1 | 1
Read/Write | 3 | 3 | 2 | 1 | 1
Note: Results shown are for a specific test scenario with four AXI masters, each randomly accessing four HBM pseudo-channels. The improvements quoted compare results with the RAMA IP on each master against results without the RAMA IP on each master.
Note: For random access, there can be a significant advantage in using a higher ratio of enabled memories to connected ports. This depends on:
  • The number of memories used.
  • How much of the HBM Subsystem switch is spanned (that is, how many memories are accessed by each port).
  • The transaction size.

In some cases, a ratio of 1 port to 2 memories gives a further two-fold performance increase.
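
The improvement multiplier used in Table 1 can be reproduced from the two bandwidth figures quoted for 64 B read-only transactions. The following is a minimal sketch only; the function name `improvement_multiplier` and the constant names are illustrative and are not part of the RAMA IP or any tool.

```python
# Bandwidth figures quoted in the text for 64 B read-only transactions.
BW_WITHOUT_RAMA_MBPS = 4225
BW_WITH_RAMA_MBPS = 40730


def improvement_multiplier(bw_with: float, bw_without: float) -> float:
    """Return the ratio of bandwidth with RAMA to bandwidth without RAMA."""
    if bw_without <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return bw_with / bw_without


multiplier = improvement_multiplier(BW_WITH_RAMA_MBPS, BW_WITHOUT_RAMA_MBPS)
print(f"{multiplier:.2f}x")  # 9.64x, which Table 1 rounds to 10
```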

Latency

Latency figures should be considered carefully. The RAMA IP adds latency to an individual transaction because of data buffering and reordering. However, because of the bandwidth improvement, the time between a transaction request and its completion is, in general, much shorter when the RAMA IP is used. The following table shows mean latency figures for 2000 Read Only and Write Only transactions.

Table 2. RAMA IP Latency (AXI Clock Cycles)
Transaction Size (B) | Read Only, Without RAMA | Read Only, With RAMA | Write Only, Without RAMA | Write Only, With RAMA
32 | 225 | 597 | 44 | 591
64 | 247 | 532 | 46 | 134
128 | 240 | 497 | 55 | 49
256 | 263 | 512 | 77 | 78
512 | 304 | 564 | 119 | 137
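
To see how much per-transaction latency differs between the two configurations, the Table 2 values can be subtracted row by row. This is a sketch for illustration; the dictionary layout and the helper `added_cycles` are assumptions made here, and the numbers are copied from Table 2.

```python
# Mean latency in AXI clock cycles, copied from Table 2.
# Each entry maps transaction size in bytes -> (without RAMA, with RAMA).
READ_LATENCY = {32: (225, 597), 64: (247, 532), 128: (240, 497),
                256: (263, 512), 512: (304, 564)}
WRITE_LATENCY = {32: (44, 591), 64: (46, 134), 128: (55, 49),
                 256: (77, 78), 512: (119, 137)}


def added_cycles(table):
    """Return size -> extra cycles with RAMA (negative means fewer cycles)."""
    return {size: with_rama - without
            for size, (without, with_rama) in table.items()}


read_delta = added_cycles(READ_LATENCY)
write_delta = added_cycles(WRITE_LATENCY)

for size in READ_LATENCY:
    print(f"{size:>3} B  read {read_delta[size]:+5d}  write {write_delta[size]:+5d}")
```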

To illustrate why latency figures can be misleading, consider the following: suppose a given number of 32-byte read transactions takes 100 μs to complete without RAMA. The last transaction then completes almost 100 μs after the master could have issued it. Because bandwidth for 32-byte reads is 10 times higher with RAMA, the same number of transactions completes within 10 μs, plus the latency added by buffering and reordering in the RAMA IP (in this case typically 1.3 μs).
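
The arithmetic in this illustration can be written out as follows. It is a simplified sketch that assumes the batch time scales inversely with bandwidth and that the RAMA latency is added once; the function `completion_time_us` is a hypothetical name used only for this example.

```python
def completion_time_us(baseline_us: float, bandwidth_gain: float,
                       added_latency_us: float) -> float:
    """Estimate batch completion time with RAMA.

    Assumes completion time scales inversely with bandwidth and that the
    buffering/reordering latency is added once on top.
    """
    if bandwidth_gain <= 0:
        raise ValueError("bandwidth gain must be positive")
    return baseline_us / bandwidth_gain + added_latency_us


# Figures from the text: 100 us baseline, 10x gain for 32-byte reads,
# and typically 1.3 us of latency added by the RAMA IP.
with_rama_us = completion_time_us(100.0, 10.0, 1.3)
print(f"{with_rama_us:.1f} us with RAMA versus 100.0 us without")
```

Under these assumptions the batch finishes in about 11.3 μs, so the last transaction completes far sooner even though each individual transaction sees more latency.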