Systems - 3.1 English

UltraScale Architecture Soft Error Mitigation Controller LogiCORE IP Product Guide (PG187)

Document ID: PG187
Release Date: 2023-11-08
Version: 3.1 English

Although the Soft Error Mitigation solution can operate autonomously, many applications of this solution are used with a system-level supervisory function. The decision to implement a system-level supervisory function and the scope of the responsibilities of this function are system-specific.

Note: The references below also include system-level recommendations described in each interface.

The following points illustrate methods by which a system-level supervisory function can monitor the Soft Error Mitigation solution.

  • Monitor the Soft Error Mitigation solution to determine if additional system-level actions are necessary in response to a soft error event. This action can be as simple as logging each soft error event that is detected, or it might involve a more complex determination of the appropriate system-level response based on factors such as the classification value of the error or whether the error is correctable. Analysis of these and other factors could result in system-level actions including, but not limited to, resetting the design, reconfiguring the FPGA, or rebooting the system.

    To monitor the Soft Error Mitigation solution event reporting in Mitigation modes, use the Status Interface status_correction and status_uncorrectable signals, the Status Interface status_classification and status_essential signals, or the UART Interface uart_tx signal for error detection, correction, and classification reports.

    To monitor the Soft Error Mitigation solution event reporting in Detect modes, use the Status Interface status_uncorrectable signal or the UART Interface uart_tx signal for error detection reports.

  • Monitor the Soft Error Mitigation solution to confirm it is healthy. As discussed and quantified in Solution Reliability, there is a very small possibility of failure of the Soft Error Mitigation solution. Statistically, such failures might occur during any state of the controller:
    Boot and Initialization States
    Monitor the Soft Error Mitigation solution to confirm that it boots, initializes, and enters the correct state (Observation, Detect only, or Idle) for the selected mode.

    AMD specifies that the Soft Error Mitigation solution boots, initializes, and enters the designated state within the time specified in Table 1 and Figure 1, provided that the cap_gnt signal is asserted, the FPGA configuration logic is available to the Soft Error Mitigation solution through the ICAP primitive, and there is no throttling on the Monitor Interface.

    Reasons the Soft Error Mitigation solution could fail to initialize and/or fail to enter the correct state are usually design errors (versus soft error events) and include incorrect tie-offs of unused ports, incorrect control of the cap_gnt signal, incorrect implementation of ICAP sharing, and general unavailability of the FPGA configuration logic to the Soft Error Mitigation solution through the ICAP primitive. This last issue can occur for several reasons, ranging from use of bitstream options documented to be incompatible with the Soft Error Mitigation solution, to the failure of a system-level JTAG controller to properly complete and/or clear FPGA configuration instructions issued through JTAG to the FPGA.

    To confirm the solution initializes and enters the correct state, the system-level supervisory function can observe the Status Interface status_initialization and relevant status_* signals for assertion (see state diagrams Figure 1 through Figure 3), or the UART Interface uart_tx signal for the expected initialization report.

    The CRC indicator, INIT_B, can be ignored in this state.
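The initialization check described above can be modeled as a deadline test over sampled Status Interface values. The following is a minimal sketch, not a reference implementation: the trace format, the function name `initialization_ok`, and the signal name `status_observation` are assumptions made for illustration, and the deadline must be taken from the timing specified for the device in use.

```python
from typing import Callable, Iterable, Mapping, Tuple

# One sample: (time in seconds since configuration, sampled signal levels).
# The trace format is hypothetical; only status_initialization is named in the text.
Sample = Tuple[float, Mapping[str, bool]]


def initialization_ok(
    trace: Iterable[Sample],
    in_expected_state: Callable[[Mapping[str, bool]], bool],
    deadline_s: float,
) -> bool:
    """Return True if initialization has ended and the expected state
    is observed no later than deadline_s."""
    for t, signals in trace:
        if t > deadline_s:
            break  # samples are assumed to be in time order
        initializing = signals.get("status_initialization", False)
        if not initializing and in_expected_state(signals):
            return True
    return False


# Example predicate for a mode that should end in Observation.
# status_observation is an assumed signal name.
def expects_observation(signals: Mapping[str, bool]) -> bool:
    return signals.get("status_observation", False)
```

The expected state is passed as a predicate so that the same check can be reused for Observation, Detect only, or Idle without hard-coding how each state is encoded.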

Observation State (Mitigation Modes)
The controller spends virtually all of its time in this state. There are at least three methods for monitoring the controller in this state, each providing slightly different information about the health of the controller:
Controller Heartbeat, status_heartbeat

This signal is a direct output from the Soft Error Mitigation solution. This signal exhibits pulses, specified in the Status Interface, which indicate the readback process is active. If, during the Observation state, these pulses become out-of-specification, the system-level supervisory function should conclude that the readback process has experienced a fault. This condition is an uncorrectable, essential error.

In UltraScale and UltraScale+ SSI implementations, which have a status_heartbeat output per SLR, it is necessary to monitor the heartbeat from all SLRs.

status_heartbeat is undefined in other controller states and should only be observed during the Observation state.

See Heartbeat.
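As an illustration of per-SLR heartbeat monitoring, the sketch below tracks the time of the most recent pulse on each SLR and reports any SLR whose pulses have stopped. It is a simplified model under stated assumptions: the class name `HeartbeatWatchdog` and the `max_gap_s` parameter are hypothetical, the real limit must come from the heartbeat specification in the Status Interface, and the sketch only detects missing pulses, not other out-of-specification behavior.

```python
from typing import Dict, List


class HeartbeatWatchdog:
    """Tracks the last status_heartbeat pulse seen on each SLR."""

    def __init__(self, slr_count: int, max_gap_s: float) -> None:
        self.slr_count = slr_count
        self.max_gap_s = max_gap_s  # hypothetical limit, taken from the specification
        self.last_pulse: Dict[int, float] = {}

    def start(self, now: float) -> None:
        """Call on entry to the Observation state to arm every SLR timer."""
        self.last_pulse = {slr: now for slr in range(self.slr_count)}

    def pulse(self, slr: int, now: float) -> None:
        """Call when a heartbeat pulse is observed on the given SLR."""
        self.last_pulse[slr] = now

    def stale_slrs(self, now: float) -> List[int]:
        """Return the SLRs whose last pulse is older than max_gap_s."""
        return [
            slr
            for slr, seen in sorted(self.last_pulse.items())
            if now - seen > self.max_gap_s
        ]
```

A non-empty result from `stale_slrs` while the controller is in the Observation state would correspond to the fault condition described above. The watchdog should be re-armed with `start` each time the state is re-entered, because the heartbeat is undefined in other states.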

CRC Failure Indicator, INIT_B
This signal is a direct output from the readback process. If the readback process detects a CRC failure, it asserts INIT_B. If, during the Observation state, INIT_B indicates an error and the controller does not respond with a state transition to Correction within one second, the controller has experienced a fault. State transition can be determined using the Status Interface status_correction signal or the Monitor Interface state change report. This condition is an uncorrectable, essential error.

In UltraScale and UltraScale+ SSI implementations, which have an internal CRC failure indicator per SLR, the indicators are wire-ORed to form the single INIT_B device pin. For UltraScale implementations, the Status Interface has a status_correction signal for each SLR.

The CRC failure indicator, INIT_B, should only be observed during Observation and Detect only states and is undefined in other controller states.
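The one-second response rule for the CRC failure indicator can be expressed as a small timer. The sketch below is illustrative only: the class name `CrcResponseWatchdog` is hypothetical, `crc_error` is assumed to be the already-decoded meaning of the INIT_B pin (pin polarity is deliberately left out), and `responded` is assumed to be true once the expected state transition has been seen, for example through status_correction.

```python
from typing import Optional


class CrcResponseWatchdog:
    """Flags a fault when a CRC error is indicated for longer than
    the response limit without the expected state transition."""

    RESPONSE_LIMIT_S = 1.0  # one second, as stated in the text

    def __init__(self) -> None:
        self._error_since: Optional[float] = None

    def update(self, now: float, crc_error: bool, responded: bool) -> bool:
        """Feed one sample; return True if a fault should be declared."""
        if not crc_error or responded:
            # No pending error, or the controller reacted in time.
            self._error_since = None
            return False
        if self._error_since is None:
            self._error_since = now  # first sample showing the error
        return now - self._error_since > self.RESPONSE_LIMIT_S
```

The caller should feed samples only while the controller is in a state where INIT_B is defined, since the indicator is undefined elsewhere.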

Controller Status Command and Report
Using the UART Interface uart_rx and uart_tx signals, the system-level supervisory function can periodically transmit a status command and confirm receipt of the expected status report. Provided the controller has not changed state, the system-level supervisory function should conclude that the controller has experienced a fault if the expected status report is not received within one second. This condition is an uncorrectable, essential error.

When using this method, select the lowest status command transmission frequency that yields an acceptable detection time for a "controller unresponsive" condition.

Status command and report processing by the controller can be an undesirable source of additional latency. For example, a status command transmission period of 60 seconds might be a reasonable trade-off to guard against rare "controller unresponsive" conditions while not adding significant additional latency to general operation. As a counterexample, one second would be a poor choice. In this counterexample, the status reports could keep the UART helper block transmit buffer frequently non-empty, possibly resulting in throttling on the Monitor Interface, adding latency to error detection, correction, and classification activities.

The controller status command and report method only functions in the Observation and Idle states. Assuming the UART helper block receive buffer is not in an overflow condition, status commands sent during other states are buffered and processed upon return to the Observation or Idle state.
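The polling scheme described above (an infrequent status command with a one-second response deadline) could be organized as follows. This is a sketch under stated assumptions: the class `StatusPoller` and its method names are hypothetical, the 60-second default simply mirrors the example period in the text, and the actual transmission of the command over uart_rx and parsing of the report from uart_tx are left to the caller.

```python
from typing import Optional


class StatusPoller:
    """Schedules status commands and checks the response deadline."""

    def __init__(self, period_s: float = 60.0, timeout_s: float = 1.0) -> None:
        self.period_s = period_s
        self.timeout_s = timeout_s
        self._last_sent: Optional[float] = None
        self._awaiting = False

    def due(self, now: float) -> bool:
        """True when it is time to send the next status command."""
        if self._awaiting:
            return False
        return self._last_sent is None or now - self._last_sent >= self.period_s

    def sent(self, now: float) -> None:
        """Record that a status command was transmitted."""
        self._last_sent = now
        self._awaiting = True

    def report_received(self) -> None:
        """Record that the expected status report arrived."""
        self._awaiting = False

    def state_changed(self) -> None:
        """The controller left Observation or Idle; the command may be
        buffered, so the pending deadline no longer applies."""
        self._awaiting = False

    def timed_out(self, now: float) -> bool:
        """True if a report is still pending after the timeout."""
        return (
            self._awaiting
            and self._last_sent is not None
            and now - self._last_sent > self.timeout_s
        )
```

Cancelling the deadline on a state change reflects the condition in the text that the timeout only applies provided the controller has not changed state.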

Correction and Classification States
The Soft Error Mitigation solution transitions through the Correction and Classification states within the time specified in Table 1/Figure 1 and Table 1/Figure 1, provided there is no throttling on the Monitor Interface. Due to the infrequency of soft errors, the controller spends very little time in these states and normally transitions back to the Observation state, or less frequently, the Idle state.

If the controller dwells continuously in either the Correction or Classification states in excess of one second, as observed on the Status Interface status_correction and status_classification signals, or on the Monitor Interface as indicated by the state change reports, the system-level supervisory function should conclude that the controller has experienced a fault. This is an uncorrectable, essential error.

Independently, the system-level supervisory function might elect to monitor for conditions where the Soft Error Mitigation solution repeatedly corrects the same address. Many rare issues might generate this symptom, ranging from soft errors in the controller to hard errors in the device itself.
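One simple way to watch for repeated corrections of the same address is to count corrections per address and flag an address once it crosses a threshold. The sketch below assumes the address is available as an opaque string extracted from the correction reports; the class name `RepeatCorrectionTracker` and the threshold value are hypothetical choices, not values taken from this guide.

```python
from collections import Counter


class RepeatCorrectionTracker:
    """Counts corrections per reported address."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # hypothetical policy value
        self.counts: Counter = Counter()

    def record(self, address: str) -> bool:
        """Record one correction; return True if this address has now
        been corrected at least `threshold` times."""
        self.counts[address] += 1
        return self.counts[address] >= self.threshold
```

A practical supervisor would probably also age out old entries so that unrelated soft errors at the same address, separated by long intervals, are not treated as a persistent problem.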

Detect Only Mode or State
After the controller enters this mode, either following initialization or when commanded to do so, it spends virtually all of its time in this state. There are at least two methods for monitoring the controller in this state, each providing slightly different information about the health of the controller:
Controller Heartbeat, status_heartbeat
This signal is a direct output from the Soft Error Mitigation solution. This signal exhibits pulses, specified in the Status Interface, which indicate the readback process is active. If, during the Detect only state, these pulses become out-of-specification, the system-level supervisory function should conclude that the readback process has experienced a fault. This condition is an uncorrectable, essential error.

In UltraScale and UltraScale+ SSI implementations, which have a status_heartbeat output per SLR, it is necessary to monitor the heartbeat from all SLRs.

See Heartbeat.

CRC Failure Indicator, INIT_B
This signal is a direct output from the readback process. If the readback process detects a CRC failure, it asserts INIT_B. If, during the Detect only state, INIT_B indicates an error and the controller does not respond with a state transition to Idle within one second, the controller has experienced a fault. State transition can be determined using the Status Interface (to detect the Idle state) or the Monitor Interface state change report. This condition is an uncorrectable, essential error.

In UltraScale and UltraScale+ SSI implementations, which have an internal CRC failure indicator per SLR, the indicators are wire-ORed to form the single INIT_B device pin, but the Status Interface for each SLR must be monitored for an Idle state separately.

The CRC failure indicator, INIT_B, should only be observed during Observation and Detect only states and is undefined in other controller states.

Diagnostic Scan State
When commanded, the controller scans all the configuration memory in the device in this state and reports all ECC errors it encounters. The recommended method for monitoring the controller in this state is:
Controller Heartbeat, status_heartbeat
This signal is a direct output from the Soft Error Mitigation solution. This signal exhibits pulses, specified in the Status Interface, which indicate the readback process is active. If, during the Diagnostic Scan state, these pulses become out-of-specification, the system-level supervisory function should conclude that the readback process has experienced a fault. This condition is an uncorrectable, essential error.

In UltraScale and UltraScale+ SSI implementations, which have a status_heartbeat output per SLR, it is necessary to monitor the heartbeat from all SLRs.

See Heartbeat.

Idle and Injection States
The controller only enters the Idle state as a result of an uncorrectable error, or if specifically directed. In the event of an uncorrectable error, see the earlier point about monitoring event reporting. Directed entry to the Idle state is generally for the purpose of issuing other commands for error injection or ICAP sharing. It is inadvisable to use the status command and report method described for the Observation state during the Idle state, because it might conflict with commands issued by other processes at the application level. Instead, the application-level processes should test that any issued command completes and generates a response within one second. Otherwise, an uncorrectable, essential error has occurred and the application should report this to the system.

Fatal Error State
The controller only enters this state when it has detected an inconsistent internal state. This condition is observable on the Status Interface as the assertion of all seven state indicators, and might be observable on the Monitor Interface as a HLT message. In UltraScale SSI implementations, where more than one controller instance exists, the solution is considered halted if one or more of the controller instances halts or transitions to idle as a result of an uncorrectable error event. This is an uncorrectable, essential error.
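The halt condition described above can be checked with a few boolean tests over the sampled Status Interface of each controller instance. The sketch below rests on several assumptions that are not stated in this section: the seven indicator names in `STATE_INDICATORS` are assumed, and the Idle state is assumed to be encoded as all state indicators deasserted. Both should be confirmed against the Status Interface description before use.

```python
from typing import Iterable, Mapping

# Assumed names for the seven state indicators (not listed in this section).
STATE_INDICATORS = (
    "status_initialization",
    "status_observation",
    "status_correction",
    "status_classification",
    "status_injection",
    "status_diagnostic_scan",
    "status_detect_only",
)


def is_fatal(signals: Mapping[str, bool]) -> bool:
    """Fatal error: all seven state indicators asserted."""
    return all(signals.get(name, False) for name in STATE_INDICATORS)


def is_idle(signals: Mapping[str, bool]) -> bool:
    """Assumption: Idle is encoded as no state indicator asserted."""
    return not any(signals.get(name, False) for name in STATE_INDICATORS)


def solution_halted(instances: Iterable[Mapping[str, bool]]) -> bool:
    """True if any controller instance is in the fatal error state, or is
    idle with status_uncorrectable asserted."""
    for signals in instances:
        if is_fatal(signals):
            return True
        if is_idle(signals) and signals.get("status_uncorrectable", False):
            return True
    return False
```

For a monolithic device the `instances` argument contains a single entry; for SSI devices it would contain one entry per controller instance.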

Although implementing the system-level supervisory functions described above is optional, AMD recommends implementing at least the following to ensure that the IP is healthy and functional when it is used in mitigation modes:

  1. Confirm that the IP has completed the Boot and Initialization states and has successfully transitioned into the Observation, Idle, or Detect only state (based on the selected mode) after device configuration, as discussed in Boot and Initialization States. The INIT_B signal should not be observed in the Boot and Initialization states.
  2. Monitor the status_heartbeat signal during the Observation, Detect only, and Diagnostic Scan states to ensure that it is within the specification, as discussed in Heartbeat. An example of this monitoring logic is delivered in the example design. See Functions.
  3. Ensure that the IP has not halted or gone to Idle when it is deployed in any Mitigation or Detect mode. If either of these states occurs, the IP has stopped all mitigation activity and can no longer detect or correct any SEU that might occur. This can be done by monitoring the status_* signals. Example logic to flag that the IP has halted is delivered in the example design. See Functions.
  4. Monitor the INIT_B signal when the SEM controller is in the Observation and Detect only states. If INIT_B remains asserted for longer than one second and the controller has not transitioned to the Correction or Idle state, respectively, this indicates that an uncorrectable error has occurred or that the IP is no longer responsive and cannot mitigate errors, as discussed in CRC Failure Indicator, INIT_B.
  5. Buffer the monitor_txdata[7:0] output into a FIFO to ease debugging of the IP behavior if required at a future point. This is recommended especially if the Monitor Interface is not used by the system. See Monitor Interface.
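The last recommendation, buffering the Monitor Interface output for later debugging, can be modeled in software as a bounded FIFO that keeps only the most recent bytes. The sketch below is a host-side model rather than the hardware FIFO itself: the class name `MonitorLog` and the default depth are hypothetical, and decoding the captured bytes as ASCII is an assumption about the report format.

```python
from collections import deque


class MonitorLog:
    """Keeps the most recent bytes seen on monitor_txdata[7:0]."""

    def __init__(self, depth: int = 4096) -> None:
        # A deque with maxlen silently discards the oldest byte when full.
        self._fifo: deque = deque(maxlen=depth)

    def push(self, byte: int) -> None:
        """Store one byte captured from the Monitor Interface."""
        self._fifo.append(byte & 0xFF)

    def dump(self) -> str:
        """Return the buffered bytes as text (assumes ASCII reports)."""
        return bytes(self._fifo).decode("ascii", errors="replace")
```

Discarding the oldest data first keeps the reports closest to a failure, which are usually the most useful for debugging. A hardware implementation would need to make the same choice between overwriting old data and dropping new data when the FIFO fills.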