Memory Error Handling

Versal Adaptive SoC AI Engine Architecture Manual (AM009)

Document ID: AM009
Release Date: 2023-08-18
Revision: 1.3 English

Memory Error Detection and Correction

Each AI Engine has 32 KB of data memory and 16 KB of program memory. For devices with many AI Engine tiles, protection against soft errors is both required and provided. Each 128-bit word in the program memory is protected by two 8-bit ECC fields, one for each 64-bit half. Each 8-bit ECC can detect 2-bit errors and detect and correct a 1-bit error within its 64-bit half. The two 64-bit data fields are interleaved with each other, and the two 8-bit ECC fields are interleaved with each other (distance of two), to create larger bit separation.
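The detect-two, correct-one behavior described above can be illustrated with a small software model. The sketch below assumes an extended Hamming (SECDED) code, which is a common way to obtain this behavior with 8 check bits per 64 data bits. The manual does not state which code the hardware uses, so the bit layout and the names `secded_encode` and `secded_decode` are hypothetical.

```python
def _check_bits(k: int) -> int:
    """Number of Hamming check bits needed for k data bits (overall parity excluded)."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


def _is_data_position(pos: int) -> bool:
    """Positions that are not powers of two carry data bits."""
    return pos & (pos - 1) != 0


def secded_encode(data: int, k: int) -> list[int]:
    """Encode k data bits. Index 0 holds the overall parity bit,
    indices 1..n follow the classic Hamming layout (assumed layout)."""
    r = _check_bits(k)
    n = k + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if _is_data_position(pos):
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity          # makes each parity group even
    code[0] = sum(code[1:]) & 1   # makes the whole codeword even
    return code


def secded_decode(code: list[int], k: int):
    """Return (status, data). status is 'ok', 'corrected', or 'uncorrectable'."""
    n = k + _check_bits(k)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) & 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= n:
        fixed[syndrome] ^= 1      # syndrome 0 means the overall parity bit itself flipped
        status = "corrected"
    else:
        return "uncorrectable", None   # even number of flips with a nonzero syndrome
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if _is_data_position(pos):
            data |= fixed[pos] << bit
            bit += 1
    return status, data
```

Under this assumed code, `secded_encode(value, 64)` yields 72 bits (64 data plus 8 check), which matches the field sizes described for program memory. Flipping any one bit of the result is reported as `corrected`, and flipping any two bits is reported as `uncorrectable`.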

There are eight memory banks in each data memory module. The first two memory banks have 7-bit ECC protection for each of the four 32-bit fields in a 128-bit word. The 7-bit ECC can detect 2-bit errors and detect and correct a 1-bit error. The last six memory banks have one even parity bit for each 32-bit field in a 128-bit word. The four 32-bit fields are interleaved with a distance of four.
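To show why interleaving helps a per-field parity bit, the following sketch models even parity over four 32-bit fields and a distance-four interleave. The exact physical bit mapping is not given in the manual, so the mapping used here (bit j of field i stored at position 4*j + i) and all function names are assumptions for illustration only.

```python
def even_parity(field: int) -> int:
    """Parity bit that makes the total number of 1s (field plus parity bit) even."""
    return bin(field & 0xFFFFFFFF).count("1") & 1


def interleave(fields: list[int], width: int) -> int:
    """Place bit j of field i at position j * len(fields) + i (assumed mapping)."""
    distance = len(fields)
    word = 0
    for i, field in enumerate(fields):
        for j in range(width):
            word |= ((field >> j) & 1) << (j * distance + i)
    return word


def deinterleave(word: int, count: int, width: int) -> list[int]:
    """Inverse of interleave()."""
    fields = []
    for i in range(count):
        field = 0
        for j in range(width):
            field |= ((word >> (j * count + i)) & 1) << j
        fields.append(field)
    return fields


# Arbitrary example data: four 32-bit fields forming one 128-bit word.
fields = [0xDEADBEEF, 0x01234567, 0x89ABCDEF, 0x0F0F0F0F]
parities = [even_parity(f) for f in fields]
stored = interleave(fields, 32)

# A burst that flips four physically adjacent bits.
burst = 0b1111 << 40
damaged = deinterleave(stored ^ burst, 4, 32)

# With interleaving, each field sees exactly one flipped bit, so every parity check fails.
detected = [even_parity(f) != p for f, p in zip(damaged, parities)]

# Without interleaving, the same burst inside one field flips four bits and parity is unchanged.
missed_without_interleave = even_parity(fields[0] ^ 0b1111) == parities[0]
```

In this model, a four-bit burst touches one bit in each field, so all four parity checks flag it. The same burst confined to a single non-interleaved field would leave that field's parity unchanged and go unnoticed.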

Error injection is supported for both program and data memory. Errors can be introduced into program memory over memory-mapped AXI4. Similarly, errors can be injected into data memory banks over AI Engine DMA or memory-mapped AXI4.

When a memory-mapped AXI4 access reads or writes AI Engine data memory, two requests are sent to the memory module. As a result, an ECC/parity event might be counted twice in the AI Engine performance counter. The memory access is duplicated, but there is no impact on functionality. Refer to AI Engine Tile Architecture for more information on events and performance counters.
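When interpreting a counter value that may include such double counts, it can help to bracket the true number of events. The sketch below assumes each ECC/parity event was counted either once or twice and nothing else. The function name is hypothetical and this is not part of any AI Engine API.

```python
def true_event_bounds(counter_value: int) -> tuple[int, int]:
    """Return (minimum, maximum) possible true event counts.

    Assumption: every event incremented the counter by 1 or by 2.
    Minimum: all events were double counted. Maximum: none were.
    """
    if counter_value < 0:
        raise ValueError("counter value must be non-negative")
    minimum = (counter_value + 1) // 2   # ceiling of counter_value / 2
    return minimum, counter_value
```

For example, a counter reading of 5 is consistent with anywhere from 3 to 5 actual events under this assumption.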

Internal memory errors (correctable and uncorrectable) create internal events that use the normal debug, trace, and profiling mechanisms to report error conditions. These events can also be used to raise an interrupt to the PMC/PS.