Memory Error Handling

Versal ACAP AIE-ML Architecture Manual (AM020)

Document ID: AM020
Release Date: 2022-09-28
Revision: 1.0 English

Memory Error Detection and Correction

Each AIE-ML tile has 64 KB of data memory and 16 KB of program memory. Because of the large amount of memory in the AIE-ML tiles, protection is provided against soft errors. Each 128-bit word in the program memory is protected by two 8-bit ECC fields, one for each 64-bit half. Each 8-bit ECC field can detect 2-bit errors and can detect and correct a 1-bit error within its 64-bit half. The two 64-bit data fields are interleaved with each other, as are the two 8-bit ECC fields (a distance of two within each pair), to create larger bit separation.
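The single-error-correct, double-error-detect behavior described above can be illustrated with a small software model. The manual does not state which code the hardware uses, so the sketch below assumes a textbook extended Hamming (72,64) code: seven Hamming check bits plus one overall parity bit, giving 8 check bits per 64-bit half. The function names and the bit layout are hypothetical and are not taken from the AIE-ML hardware.

```python
# Illustrative model only: an extended Hamming (72,64) SECDED code.
# The actual AIE-ML ECC polynomial and bit layout are not given in the manual.

PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]          # Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]  # 64 non-power-of-two slots


def _syndrome(code):
    """XOR of the positions (1..71) of all set bits; zero for a valid codeword."""
    s = 0
    for pos in range(1, 72):
        if (code >> pos) & 1:
            s ^= pos
    return s


def encode(data):
    """Encode a 64-bit value into a 72-bit codeword (bit 0 is overall parity)."""
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code |= 1 << pos
    s = _syndrome(code)
    for p in PARITY_POSITIONS:      # set check bits so the syndrome becomes zero
        if s & p:
            code |= 1 << p
    if bin(code).count("1") & 1:    # overall parity makes the number of ones even
        code |= 1
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    s = _syndrome(code)
    odd = bin(code).count("1") & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        code ^= 1 << s              # single-bit error at position s (0 = parity bit)
        status = "corrected"
    else:
        return None, "uncorrectable"  # even number of flips with nonzero syndrome
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (code >> pos) & 1:
            data |= 1 << i
    return data, status
```

In this model, flipping any one bit of `encode(word)` still decodes to `word` with status `corrected`, while flipping two bits is reported as `uncorrectable`. That matches the detect-two, correct-one behavior the text attributes to each 8-bit ECC field.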

There are eight memory banks in each data memory module. The first two memory banks have 7-bit ECC protection for each of the four 32-bit fields. Each 7-bit ECC field can detect 2-bit errors and can detect and correct a 1-bit error. The last six memory banks have an even-parity bit protecting each 32-bit field of a 128-bit word. The four 32-bit fields are interleaved with a distance of four.
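To show why interleaving helps, the following sketch models per-field even parity together with a distance-four interleave. The physical bit mapping (bit b of field f stored at position 4*b + f) is an assumption chosen for illustration, and the parity bits are kept in a separate list rather than interleaved. All names are hypothetical.

```python
# Illustrative model only: even parity per 32-bit field, fields interleaved
# with a distance of four. The real physical bit mapping is not documented here.

FIELDS = 4
FIELD_BITS = 32


def even_parity(field):
    """Parity bit that makes the number of ones in (field + parity) even."""
    return bin(field & 0xFFFFFFFF).count("1") & 1


def interleave(fields):
    """Pack four 32-bit fields into 128 bits; bit b of field f goes to b*4 + f."""
    word = 0
    for f, value in enumerate(fields):
        for b in range(FIELD_BITS):
            if (value >> b) & 1:
                word |= 1 << (b * FIELDS + f)
    return word


def deinterleave(word):
    """Recover the four 32-bit fields from the interleaved 128-bit word."""
    fields = [0] * FIELDS
    for pos in range(FIELDS * FIELD_BITS):
        if (word >> pos) & 1:
            fields[pos % FIELDS] |= 1 << (pos // FIELDS)
    return fields


def parity_errors(word, stored_parity):
    """Per-field flags: True where recomputed parity differs from stored parity."""
    return [even_parity(v) != p
            for v, p in zip(deinterleave(word), stored_parity)]
```

With this mapping, an upset that flips four physically adjacent bits lands as a single-bit error in each of the four fields, so every field's parity check fires. Without interleaving, the same four flips could fall inside one field and go undetected by a single parity bit.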

Error injection is supported for both program and data memory. Errors can be introduced into program memory over memory-mapped AXI4. Similarly, errors can be injected into data memory banks over AIE-ML DMA or memory-mapped AXI4.

When a memory-mapped AXI4 access reads or writes AIE-ML data memory, two requests are sent to the memory module. As a result, a single ECC/parity event might be counted twice in the AIE-ML performance counter. The duplicate memory access has no impact on functionality. Refer to AIE-ML Tile Architecture for more information on events and performance counters.
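When interpreting such a counter, the possible double counting turns the raw value into a range rather than an exact count. The helper below is a minimal sketch of that arithmetic, assuming every counted event came from a memory-mapped AXI4 access and each real event was counted either once or twice. The function name is hypothetical.

```python
def event_count_bounds(raw_count):
    """Return (min_events, max_events) for a raw ECC/parity counter value.

    Assumes each real event incremented the counter once or twice, so the
    true count n satisfies n <= raw_count <= 2 * n.
    """
    if raw_count < 0:
        raise ValueError("counter value cannot be negative")
    return (raw_count + 1) // 2, raw_count
```

For example, a raw value of 5 is consistent with anywhere from 3 to 5 real events under these assumptions.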

Internal memory errors (correctable and uncorrectable) create internal events that use the normal debug, trace, and profiling mechanisms to report error conditions. These events can also be used to raise an interrupt to the PMC/PS.