Systems

Soft Error Mitigation Controller Product Guide (PG036)

Document ID: PG036
Release Date: 2023-11-01
Version: 4.1 English

Although the soft error mitigation solution can operate autonomously, many applications use it in conjunction with a system-level supervisory function. The decision to implement a system-level supervisory function, and the scope of its responsibilities, are system-specific.

The following points illustrate methods by which a system-level supervisory function can monitor the soft error mitigation solution.

TIP: None of these are required.

Monitor the soft error mitigation solution to determine if additional system-level actions are necessary in response to a soft error event. This can be as generic as a soft error event logging function, or it can be a substantially complex and application-specific response based on the type of error and its classification (for example, reset logic, reconfigure device, or reboot system).

To monitor the soft error mitigation solution event reporting, use the Status Interface status_correction and status_uncorrectable signals, the Status Interface status_classification and status_essential signals, or the Monitor Interface monitor_tx signal for error detection, correction, and classification reports.
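
As an illustration of an application-specific response, the following minimal Python sketch maps a completed error event to a system-level action. The ErrorEvent type, the Action names, and the policy itself are hypothetical; the sketch assumes the supervisory function has already sampled the status_uncorrectable and status_essential signals (or parsed the equivalent Monitor Interface reports) at the appropriate times.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LOG_ONLY = "log only"
    RESET_LOGIC = "reset logic"
    RECONFIGURE = "reconfigure device"


@dataclass(frozen=True)
class ErrorEvent:
    """One completed soft error event, as sampled by the supervisor."""
    uncorrectable: bool  # sampled from status_uncorrectable
    essential: bool      # sampled from status_essential


def choose_action(event: ErrorEvent) -> Action:
    """Hypothetical policy; the real mapping is system-specific."""
    if event.uncorrectable:
        # The configuration memory could not be repaired in place.
        return Action.RECONFIGURE
    if event.essential:
        # Corrected, but the design might have operated on bad state.
        return Action.RESET_LOGIC
    return Action.LOG_ONLY
```

A system that only needs event logging would return Action.LOG_ONLY in every case; more elaborate systems might add a reboot action for repeated failures.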

Monitor the soft error mitigation solution to confirm it is healthy. As discussed and quantified in Solution Reliability, there is a very small possibility of failure of the soft error mitigation solution. Statistically, such failures might occur during any state of the controller:

° Boot and Initialization States: Monitor the soft error mitigation solution to confirm it boots, initializes, and enters the observation state. AMD specifies that the soft error mitigation solution boots, initializes, and enters the observation state within the time specified in Table: Maximum Start-Up Latency at ICAP FMax and This Equation, provided that the icap_grant signal is asserted, the FPGA configuration logic is available to the soft error mitigation solution through the ICAP primitive, and there is no throttling on the Monitor Interface.

Reasons the soft error mitigation solution could fail to initialize or fail to enter the observation state are usually design errors (rather than soft error events). They include incorrect control of the icap_grant signal, incorrect implementation of ICAP sharing, and general unavailability of the FPGA configuration logic to the soft error mitigation solution through the ICAP primitive. This last issue can occur for several reasons, ranging from use of bitstream options documented to be incompatible with the soft error mitigation solution, to the failure of a system-level JTAG controller to properly complete and/or clear FPGA configuration instructions issued through JTAG to the FPGA.

To confirm the solution initializes and enters the observation state, the system-level supervisory function can observe the Status Interface status_initialization and status_observation signals for assertion, or the Monitor Interface monitor_tx signal for the expected initialization report.
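
The start-up check can be sketched as a simple deadline poll. This is a minimal Python sketch, not a definitive implementation: read_status_observation is a hypothetical callable that returns the current level of status_observation, and deadline_s must be taken from the start-up latency table and equation for the target device (no value is assumed here). The clock and sleep parameters exist only so the function can be exercised without hardware.

```python
import time
from typing import Callable


def wait_for_observation(
    read_status_observation: Callable[[], bool],
    deadline_s: float,
    poll_interval_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if status_observation asserts before the deadline."""
    start = clock()
    while True:
        if read_status_observation():
            return True
        if clock() - start >= deadline_s:
            # Likely a design error (icap_grant, ICAP sharing, JTAG).
            return False
        sleep(poll_interval_s)
```

A False return suggests checking icap_grant control, ICAP sharing, and FPGA configuration logic availability before suspecting a soft error.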

° Observation State: The controller spends virtually all of its time in this state. There are at least three methods for monitoring the controller in this state, each of which provides slightly different information about the health of the controller:

- Controller heartbeat, status_heartbeat: This signal is a direct output from the soft error mitigation solution. This signal exhibits pulses, specified in the Status Interface section, which indicate the readback process is active. If, during the observation state, these pulses become out-of-specification, the system-level supervisory function should conclude that the readback process has experienced a fault. This condition is an uncorrectable, essential error. In SSI implementations, which have a status_heartbeat output per SLR, it is necessary to monitor the heartbeat from all SLRs.

Note: status_heartbeat is undefined in other controller states and should only be observed during the observation state.
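
A heartbeat check for one or more SLRs might be sketched as follows. This minimal Python sketch only detects missing pulses; it does not check any other aspect of the pulse specification. The HeartbeatMonitor class and the max_gap_s parameter are hypothetical, and the gap limit must be derived from the Status Interface section. The caller is assumed to create (or re-create) the monitor on entry to the observation state and to stop calling it in other states.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the time of the last status_heartbeat pulse per SLR."""

    def __init__(self, slr_count: int, max_gap_s: float, start_time_s: float) -> None:
        self.max_gap_s = max_gap_s
        # Start every SLR's timer at the moment observation began.
        self.last_pulse_s: Dict[int, float] = {
            slr: start_time_s for slr in range(slr_count)
        }

    def pulse(self, slr: int, now_s: float) -> None:
        """Record a heartbeat pulse seen on the given SLR."""
        self.last_pulse_s[slr] = now_s

    def stale_slrs(self, now_s: float) -> List[int]:
        """Return the SLRs whose last pulse is older than max_gap_s."""
        return [
            slr
            for slr, last in sorted(self.last_pulse_s.items())
            if now_s - last > self.max_gap_s
        ]
```

Any non-empty result from stale_slrs during the observation state would be treated as an uncorrectable, essential error.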

- CRC failure indicator, INIT_B: This signal is a direct output from the readback process. If the readback process detects a CRC failure, it asserts INIT_B. If, during the observation state, INIT_B indicates an error and the controller does not respond with a state transition to correction within one second, then the controller has experienced a fault. State transition can be determined using the Status Interface status_correction signal or the Monitor Interface state change report. This condition is an uncorrectable, essential error. In SSI implementations, which have an internal CRC failure indicator per SLR, the indicators are wire-ORed to form the single INIT_B device pin, but the Status Interface has a status_correction signal for each SLR.

Note: The CRC failure indicator, INIT_B, is undefined in other controller states and should only be observed during the observation state.
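
The one-second response rule for INIT_B can be sketched as a small sampled watchdog. In this minimal Python sketch, the CrcResponseWatchdog class is hypothetical. The caller translates the INIT_B pin level into the boolean crc_error, so the sketch makes no assumption about pin polarity, and in_correction is derived from status_correction or a Monitor Interface state change report.

```python
from typing import Optional

RESPONSE_LIMIT_S = 1.0  # one second, as stated in the text above


class CrcResponseWatchdog:
    """Flags a fault if a CRC error is not followed by correction in time."""

    def __init__(self, limit_s: float = RESPONSE_LIMIT_S) -> None:
        self.limit_s = limit_s
        self.error_seen_s: Optional[float] = None

    def sample(self, now_s: float, crc_error: bool, in_correction: bool) -> bool:
        """Return True if the controller failed to respond in time."""
        if in_correction or not crc_error:
            # Either the controller responded or there is nothing pending.
            self.error_seen_s = None
            return False
        if self.error_seen_s is None:
            self.error_seen_s = now_s  # first sample showing the error
        return now_s - self.error_seen_s > self.limit_s
```

The sampling interval adds to the detection time, so the supervisor should sample noticeably faster than the one-second limit.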

- Controller status command and report: Using the Monitor Interface monitor_rx and monitor_tx signals, the system-level supervisory function can periodically transmit a status command and confirm receipt of the expected status report. Provided the controller has not changed state, the system-level supervisory function should conclude that the controller has experienced a fault if the expected status report is not received within one second. This condition is an uncorrectable, essential error.

When using this method, select the lowest frequency of status command transmission that yields an acceptable detection time for a controller unresponsive condition.

Status command and report processing by the controller can be an undesirable source of additional latency. For example, a status command transmission period of 60 seconds might be a reasonable trade-off to guard against rare controller unresponsive conditions while not adding significant additional latency to general operation. As a counterexample, one second would be a poor choice. In this counterexample, the status reports could keep the Monitor Shim transmit buffer frequently non-empty, possibly resulting in throttling on the Monitor Interface, adding latency to error detection, correction, and classification activities.

Note: The controller status command and report method only functions in the observation and idle states. Assuming the Monitor Shim receive buffer is not in an overflow condition, status commands sent during other states are buffered and processed upon return to the observation or idle state.
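
The status command method might be scheduled as in the following minimal Python sketch, which uses the 60-second period and one-second timeout discussed above. The StatusPoller class and its callback names are hypothetical; send_status_command stands for whatever routine writes the status command to monitor_rx, and the caller is assumed to parse monitor_tx and invoke on_status_report or on_state_change accordingly.

```python
from typing import Callable, Optional


class StatusPoller:
    """Sends periodic status commands and flags a missing status report."""

    def __init__(
        self,
        send_status_command: Callable[[], None],
        period_s: float = 60.0,
        timeout_s: float = 1.0,
    ) -> None:
        self.send = send_status_command
        self.period_s = period_s
        self.timeout_s = timeout_s
        self.next_send_s = 0.0
        self.pending_since_s: Optional[float] = None

    def on_status_report(self) -> None:
        self.pending_since_s = None

    def on_state_change(self) -> None:
        # The command is buffered until the controller returns to the
        # observation or idle state, so a late report is not a fault.
        self.pending_since_s = None

    def tick(self, now_s: float) -> bool:
        """Call regularly; returns True if a report is overdue."""
        if self.pending_since_s is not None:
            return now_s - self.pending_since_s > self.timeout_s
        if now_s >= self.next_send_s:
            self.send()
            self.pending_since_s = now_s
            self.next_send_s = now_s + self.period_s
        return False
```

Because a state change clears the pending command, a controller that leaves the observation state and never returns is not detected by this sketch; that case needs a separate check on time spent in the other states.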

° Correction and Classification States: The soft error mitigation solution transitions through the correction and classification states within the times specified in Table: Non-SSI: Max Error Correction Latency (100 MHz) No Throttling on Monitor Interface / This Equation and Table: Non-SSI: Max Error Classification Latency (100 MHz) No Throttling on Monitor Interface / This Equation, provided there is no throttling on the Monitor Interface. Due to the infrequency of soft errors, the controller spends very little time in these states and normally transitions back to the observation state or, less frequently, the idle state. If the controller dwells continuously in either the correction or classification state in excess of one second, as observed on the Status Interface status_correction and status_classification signals, or on the Monitor Interface as indicated by the state change reports, then the system-level supervisory function should conclude that the controller has experienced a fault. This is an uncorrectable, essential error.
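
The dwell-time rule can be sketched as a per-state timer. In this minimal Python sketch, the DwellWatchdog class is hypothetical, and each state is timed separately on the assumption that "either" in the rule above refers to each state individually. The halt state, which asserts all state indicators at once, is assumed to be detected separately.

```python
from typing import Dict, Optional


class DwellWatchdog:
    """Flags continuous dwell in correction or classification over a limit."""

    def __init__(self, limit_s: float = 1.0) -> None:
        self.limit_s = limit_s
        self.entered_s: Dict[str, Optional[float]] = {
            "correction": None,
            "classification": None,
        }

    def sample(self, now_s: float, status_correction: bool,
               status_classification: bool) -> bool:
        """Return True if either state has been held longer than limit_s."""
        fault = False
        levels = (("correction", status_correction),
                  ("classification", status_classification))
        for name, asserted in levels:
            if not asserted:
                self.entered_s[name] = None  # state left; reset its timer
                continue
            entered = self.entered_s[name]
            if entered is None:
                entered = now_s
                self.entered_s[name] = entered
            if now_s - entered > self.limit_s:
                fault = True
        return fault
```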

Independently, the system-level supervisory function might elect to monitor for conditions where the soft error mitigation solution repeatedly corrects the same address. Several rare issues might generate this symptom, ranging from soft errors in the controller to hard errors in the device itself.
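
Repeated correction of the same address can be detected with a simple counter, as in this minimal Python sketch. The RepeatCorrectionTracker class and its threshold are hypothetical, and the address is treated as an opaque string taken from the correction report, so no report format is assumed.

```python
from collections import Counter


class RepeatCorrectionTracker:
    """Counts corrections per address and flags suspicious repeats."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # hypothetical; choose per system
        self.counts: Counter = Counter()

    def record(self, address: str) -> bool:
        """Record one correction; return True once the threshold is reached."""
        self.counts[address] += 1
        return self.counts[address] >= self.threshold
```

A production version would probably age out old entries so that unrelated soft errors at the same address, separated by long intervals, are not flagged.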

° Idle State and Injection State: The controller only enters the idle state as a result of an uncorrectable error, or if specifically directed. In the event of an uncorrectable error, see the point above about monitoring event reporting. Directed entry to the idle state is generally for the purpose of issuing other commands for error injection or ICAP sharing. It is inadvisable to use the status command and report method from the “Observation State” point above during the idle state, because it might conflict with commands issued by other processes at the application level. Instead, the application-level processes should test that any issued command completes and generates a response within one second. Otherwise, an uncorrectable, essential error has occurred and the application should report this to the system.

° Halt State: The controller only enters this state when it has detected an inconsistent internal state. This condition is observable on the Status Interface as the assertion of all five state indicators, and on the Monitor Interface as a HLT message.

In SSI implementations, where more than one controller instance exists, the solution is considered halted if one or more of the controller instances halts. This is an uncorrectable, essential error.
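
Halt detection on the Status Interface can be sketched as follows. This minimal Python sketch assumes the five state indicators are the initialization, observation, correction, classification, and injection status signals; the text above names the first four, and the fifth is an assumption based on the injection state. The StatusSample type and the solution_halted function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StatusSample:
    """One sample of the state indicators from a controller instance."""
    initialization: bool
    observation: bool
    correction: bool
    classification: bool
    injection: bool  # assumed name of the fifth state indicator

    def is_halted(self) -> bool:
        # Halt is signaled by all five state indicators asserted together.
        return all((self.initialization, self.observation, self.correction,
                    self.classification, self.injection))


def solution_halted(samples: Iterable[StatusSample]) -> bool:
    """True if any controller instance (for example, any SLR) is halted."""
    return any(sample.is_halted() for sample in samples)
```

For a non-SSI device, the same function applies with a single sample in the list.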