Feature Summary - 3.1 English

UltraScale Architecture Soft Error Mitigation Controller LogiCORE IP Product Guide (PG187)

Document ID
PG187
Release Date
2023-11-08
Version
3.1 English

The SEM controller can be generated in six different modes dependent on the design requirements:

  • Mitigation and Testing
  • Mitigation only
  • Detect and Testing
  • Detect only
  • Emulation
  • Monitoring only

The mitigation modes, Mitigation and Testing and Mitigation only, enables error detection, error correction, and error classification (optional) functions. Error injection is not available in the Mitigation only mode.

The next two modes, Detect and Testing and Detect only, enables error detection without the ability to correct or classify. Error injection is not available in Detect only mode.

The last two modes, Emulation and Monitoring only, enable you to use the SEM controller to assess and monitor your system behavior when a single event upset (SEU) event occurs without enabling the error detection, error correction, and error classification functions. Error injection is not available in the Monitoring only mode.

For all the modes, the IP performs an initialization function that brings the integrated soft error detection capability of the FPGA into a known state after the FPGA enters the user mode. After this initialization, depending on the chosen mode, the SEM controller can either observe the integrated soft error detection status (mitigation and detect modes) or transition to Idle state where you can give commands to the IP through the Command or Monitor Interface (emulation and monitoring modes).

In the mitigation modes, when an ECC or CRC error is detected, the SEM controller evaluates the situation to identify the Configuration Memory location involved.

If the location can be identified, the SEM controller corrects the soft error. The correction method uses active partial reconfiguration to perform a localized correction of the Configuration Memory using a read-modify-write scheme. This method uses algorithms to identify the error in need of correction.

The SEM controller optionally classifies the soft error as essential or non-essential using a lookup table. Information is fetched as needed during execution of error classification. This data is also provided by the implementation tools and stored outside the SEM controller.

Tip: Although the out-of-the-box solution for error classification requires an external SPI flash to store the essential bits data, you can also implement the classification feature outside the IP by storing the essential bits data in the system memory and performing the look up based on the error location reported by the SEM controller Monitor/UART Interface.

In the detect modes, when an ECC or CRC error is detected, the SEM controller evaluates the situation to identify the Configuration Memory location involved and reports it if possible. After the error detect report is completed, the IP transitions to idle.

When the SEM controller is idle, it optionally accepts input from you to inject errors into the Configuration Memory (for Mitigation and Testing, Detect and Testing, and Emulation modes). In the Mitigation and Testing and Detect and Testing modes, this feature is useful for testing the integration of the SEM controller into a larger system design.

In the Emulation mode, this feature is useful to evaluate the effects of SEU events on the system design. Using the error injection feature, system verification and validation engineers can construct test cases to ensure the complete system responds to soft error events as expected.

In addition to error injection, there are other useful testing and debugging features provided in the Idle state including frame reads, configuration register reads, external memory reads, and frame address translations.

Finally, there are two other types of error detection: Detect only and Diagnostic scan. These commands are issued to the SEM controller from the Idle state (independent of mode):

Detect Only
This command causes the SEM controller to continuously monitor the configuration memory until it detects an ECC or CRC error. After an error is detected, the SEM controller reports the error and goes to the Idle state. Unlike the mitigation mode, this type of monitoring does not include error correction.
Diagnostic Scan
This command causes the SEM controller to perform a single scan of the configuration memory and report all ECC errors that are detected. After completing a single pass, the SEM controller returns to the Idle state. The error detection mechanism used in this feature does not leverage the built-in error detection capability of the device. No error correction is performed in this type of scan.

The SEM controller is most commonly used in the Mitigation and Testing mode in its default configuration as it enables SEU event detection and correction with the ability to inject errors, and has access to all the other convenience features in idle. Others might opt to migrate to the Mitigation only mode during production to disable any error injection capabilities.

Other modes and features are targeted for systems that require advanced or user-controlled SEU mitigation solutions.

One example of a mitigation strategy that requires a user-controlled mitigation solution is error logging with no correction. One method to implement this is by configuring the SEM controller in the Detect mode.

For example, you could use the IP configured in Detect mode which automatically enables the SEM controller to perform Detect only scan right after it completes initialization. After an error is detected, you can take any action on the errors found (reconfigure the device, reset the logic, etc.) if you choose to.

After detecting the first error or first uncorrectable error, you can issue a Diagnostic Scan of the device periodically to log all accumulative ECC errors resident on the device. The frame level ECC errors that are detected and logged can be used for off-line post processing.

Important: The Diagnostic Scan feature is not designed for mitigation purposes and should not be used for real-time mitigation. AMD does not guarantee the functionality of the FPGA if the device is left to accumulate errors in its configuration memory. The error detection latency for this feature is also significantly longer than the Detect or Mitigation modes, hence it should only be used for diagnostic purposes.