Lateral AXI Switch Access Throughput Loss - 1.0 English

AXI High Bandwidth Memory Controller LogiCORE IP Product Guide (PG276)

Document ID: PG276
Release Date: 2022-11-02
Version: 1.0 English

Two lateral connections are provided between sets of 4 masters x 4 slaves within the switch, with one lateral connected to M0 and M1, and the other connected to M2 and M3. The shared connections limit the maximum lateral throughput to 50% of the full bandwidth, but enable global addressing from any AXI port to any portion of the HBM. For write cycles, there is a single dead cycle when switching between masters on the lateral channel. The throughput loss due to this inserted cycle depends on the block size of the writes and on the extent to which throughput is switch limited rather than HBM MC limited.

Figure 1. HBM AXI Switch Connections

For East/West transactions that stay within an AXI Switch instance, there is no lateral performance loss. East/West transactions that leave an AXI Switch instance incur a lateral throughput loss.

A transaction that is stalled by the AXI Master while crossing switches continues to reserve the lateral path, preventing other channels from using it. For example, suppose an AXI port reading from a memory location that requires crossing switches is unable to receive a full data burst and de-asserts RREADY mid-transaction. The lateral path then remains claimed by this transaction and cannot process transactions from other AXI channels that require the same path.

A similar behavior can occur on a write transaction. When a write command is issued and accepted by the switch, the corresponding data path is reserved and remains held until the data has been fully transmitted. The AXI protocol allows commands to be sent first and data at a later time. However, doing so can significantly degrade overall switch performance if the data path is a lateral connection needed by another command. It is recommended that write commands be sent only when the corresponding write data is available.
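The effect of sending a write command ahead of its data can be illustrated with a toy model. The sketch below is not a description of the switch implementation: the `Write` class, the `lateral_finish_times` function, and the cycle numbers are all hypothetical. It assumes a single lateral path that is granted in command order, is held from the grant until the last beat has moved, and moves one beat per cycle once data is available. The dead cycle between masters is ignored for simplicity.

```python
from dataclasses import dataclass

@dataclass
class Write:
    name: str
    cmd_cycle: int    # cycle the write command is presented to the switch
    data_cycle: int   # cycle the first write data beat becomes available
    beats: int        # number of data beats in the burst

def lateral_finish_times(writes):
    """Return {name: cycle after the last beat} for writes sharing one lateral path."""
    free_at = 0
    done = {}
    for w in sorted(writes, key=lambda w: w.cmd_cycle):
        grant = max(free_at, w.cmd_cycle)   # path is reserved from this cycle
        start = max(grant, w.data_cycle)    # path sits idle until the data shows up
        free_at = start + w.beats           # path is released after the last beat
        done[w.name] = free_at
    return done

# Master A sends its command 20 cycles before its data; master B is ready at cycle 2.
early = lateral_finish_times([Write("A", 0, 20, 8), Write("B", 2, 2, 8)])

# Same traffic, but A holds its command back until its data is available.
deferred = lateral_finish_times([Write("A", 20, 20, 8), Write("B", 2, 2, 8)])

print("command sent early:   ", early)
print("command sent with data:", deferred)
```

Under these assumptions, master A finishes at the same cycle in both cases, while master B finishes much earlier when A defers its command, because the lateral path is no longer reserved while A's data is still missing.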

Table 1. MC Performance Impact Per Switch Instance

MC Transaction                           Performance Impact
---------------------------------------  ------------------
M0/1 going east to S2/S3                 No
M0/1 going west to S4 or east to S7      Yes
M2/3 going west to S0/S1                 No
M2/3 going east to S6 or west to S5      Yes

Figure 2. Switch Lateral Connections
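For scripted checks of a port assignment, the rows of Table 1 can be captured in a small lookup. This is only a sketch: the names `LATERAL_LOSS`, `MASTER_GROUP`, and `has_lateral_loss` are hypothetical, and the lookup encodes nothing beyond the table rows, so any route not listed in Table 1 returns `None` rather than a guess.

```python
# Rows of Table 1, keyed by (master group, destination slave port).
# True means the route incurs the lateral throughput loss.
LATERAL_LOSS = {
    ("M0/1", "S2"): False, ("M0/1", "S3"): False,
    ("M0/1", "S4"): True,  ("M0/1", "S7"): True,
    ("M2/3", "S0"): False, ("M2/3", "S1"): False,
    ("M2/3", "S6"): True,  ("M2/3", "S5"): True,
}

# Map each master port to the group name used in Table 1.
MASTER_GROUP = {"M0": "M0/1", "M1": "M0/1", "M2": "M2/3", "M3": "M2/3"}

def has_lateral_loss(master, slave):
    """Return True/False per Table 1, or None if the route is not listed there."""
    group = MASTER_GROUP.get(master)
    return LATERAL_LOSS.get((group, slave))

for route in [("M0", "S2"), ("M1", "S7"), ("M3", "S0"), ("M2", "S6")]:
    print(route, has_lateral_loss(*route))
```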

A write sequence with near 100% MC throughput sees the largest drop in throughput from the extra switching cycle, because the switch rather than the MC is then the limiting factor. The dead cycle is a fixed one-clock cost each time the lateral switches masters, so smaller blocks have a larger percentage loss than larger blocks. The following figure demonstrates the lateral throughput loss with 32-byte and 256-byte transactions.

For the 32-byte transactions, two masters each issue three 32-byte bursts that must laterally traverse the AXI switch. One clock is inserted between each data burst for each master, which results in a total of 12 clock cycles to move 6 beats of data. For the 256-byte transactions, each master issues two 256-byte bursts that must laterally traverse the AXI switch. One idle clock is inserted between each 256-byte burst, which results in a total of 36 clock cycles to move 32 beats of data.

Because M2 and M3 only have access to one lateral slave port, the total bandwidth for these ports is split between the two masters. For the 32-byte transactions, the additional clock cycles for switching between masters, combined with both masters sharing a single slave port, result in a total efficiency of about 25% (6 beats in 12 clocks, halved by the shared port). For the 256-byte transactions, the same behaviors result in approximately 44.4% efficiency (32 beats in 36 clocks, halved by the shared port).

Figure 3. Switch Throughput Pattern

Table 2. Measured Lateral Throughput Efficiency Based on 100% Page Hit Rate Write Streams for 4H Stack

Block Size    Switch Limited BW Pct
----------    ---------------------
32 B          24.9%
64 B          33.3%
128 B         39.9%
256 B         44.4%
512 B         47%
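The measured values in Table 2 can be compared against a simple estimate built from the figures in the text: a 50% lateral limit and one dead cycle per burst. The sketch below is an approximation, not the guide's own formula. It assumes 32 bytes per data beat (inferred from the 6 beats quoted for six 32-byte bursts), and the names `lateral_write_efficiency`, `BEAT_BYTES`, `LATERAL_SHARE`, and `DEAD_CYCLES` are hypothetical.

```python
BEAT_BYTES = 32      # assumption: inferred from "6 beats" for six 32-byte bursts
LATERAL_SHARE = 0.5  # lateral throughput is limited to 50% of full bandwidth
DEAD_CYCLES = 1      # one dead cycle per burst when the lateral switches masters

# Measured values copied from Table 2 (percent).
MEASURED_PCT = {32: 24.9, 64: 33.3, 128: 39.9, 256: 44.4, 512: 47.0}

def lateral_write_efficiency(block_bytes):
    """Estimated fraction of full bandwidth for back-to-back lateral writes."""
    beats = block_bytes // BEAT_BYTES
    return LATERAL_SHARE * beats / (beats + DEAD_CYCLES)

for size, measured in MEASURED_PCT.items():
    estimate = 100 * lateral_write_efficiency(size)
    print(f"{size:>3} B  estimate {estimate:5.1f}%  measured {measured:5.1f}%")
```

Under these assumptions, the estimate stays within roughly 0.1 percentage point of every measured value in Table 2, which is consistent with the dead cycle being amortized over more beats as the block size grows.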