Although the Soft Error Mitigation solution can operate autonomously, many applications pair it with a system-level supervisory function. Whether to implement such a function, and the scope of its responsibilities, are system-specific decisions.
The following points illustrate methods by which a system-level supervisory function can monitor the Soft Error Mitigation solution.
- Monitor the Soft Error Mitigation
solution to determine if additional system-level actions are necessary in response to a soft
error event. This action can be as simple as logging each soft error event that is detected,
or it might involve a more complex determination of the appropriate system-level response
based on factors such as the classification value of the error or whether the error is
correctable. Analysis of these and other factors could result in system-level actions
including, but not limited to, resetting the design, reconfiguring the FPGA, or rebooting
the system.
To monitor the Soft Error Mitigation solution event reporting in Mitigation modes, use the Status Interface status_correction and status_uncorrectable signals, the Status Interface status_classification and status_essential signals, or the UART Interface uart_tx signal for error detection, correction, and classification reports.
To monitor the Soft Error Mitigation solution event reporting in Detect modes, use the Status Interface status_uncorrectable, or the UART Interface uart_tx signal for error detect reports.
- Monitor the Soft Error Mitigation solution to confirm it is healthy. As discussed and quantified in the Solution Reliability, there is a very small possibility of failure of the Soft Error Mitigation solution. Statistically, such failures might occur during any state of the controller:
- Boot and Initialization States
- Monitor the Soft Error Mitigation solution to confirm it boots, initializes, and enters the correct state (Observation, Detect only, or Idle) based on the selected modes.
AMD specifies that the Soft Error Mitigation solution boots, initializes, and enters the designated state within the time specified in Table 1 and Figure 1, provided that the cap_gnt signal is asserted, the FPGA configuration logic is available to the Soft Error Mitigation solution through the ICAP primitive, and there is no throttling on the Monitor Interface.
Reasons the Soft Error Mitigation solution could fail to initialize and/or fail to enter the correct state are usually design errors (versus soft error events) and include incorrect tie-offs of unused ports, incorrect control of the cap_gnt signal, incorrect implementation of ICAP sharing, and general unavailability of the FPGA configuration logic to the Soft Error Mitigation solution through the ICAP primitive. This last issue can occur for several reasons, ranging from use of bitstream options documented to be incompatible with the Soft Error Mitigation solution, to the failure of a system-level JTAG controller to properly complete and/or clear FPGA configuration instructions issued through JTAG to the FPGA.
To confirm the solution initializes and enters the correct state, the system-level supervisory function can observe the Status Interface status_initialization and relevant status_* signals for assertion (see state diagrams Figure 1 through Figure 3), or the UART Interface uart_tx signal for the expected initialization report.
The CRC Indicator, INIT_B, can be ignored in this state.
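The initialization check described above can be modeled in software. The following is a minimal Python sketch, assuming the supervisor samples the Status Interface at known times after device configuration. The `StatusSample` type, the `initialization_ok` function, the `target_state` flag, and the `deadline_s` parameter are hypothetical names introduced for illustration; the actual deadline must be taken from the timing specification referenced above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusSample:
    t: float              # seconds since device configuration completed
    initialization: bool  # sampled level of status_initialization
    target_state: bool    # True when the indicators match the expected post-init state


def initialization_ok(samples, deadline_s):
    """Return True if initialization was seen and the target state was
    reached no later than deadline_s (an assumed, caller-supplied limit)."""
    seen_init = False
    for s in sorted(samples, key=lambda s: s.t):
        if s.t > deadline_s:
            break  # anything after the deadline does not count
        if s.initialization:
            seen_init = True
        elif seen_init and s.target_state:
            return True
    return False
```

Requiring that initialization is observed first is a design choice of this sketch: it catches a controller that never leaves reset, but it also means the supervisor must start sampling before initialization completes.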
- Observation State (Mitigation Modes)
- The controller spends virtually all of its time in this state. There are at least three methods for monitoring the controller in this state; each provides slightly different information about the health of the controller:
- Controller Heartbeat, status_heartbeat
- This signal is a direct output from the Soft Error Mitigation solution. This signal exhibits pulses, specified in the Status Interface, which indicate the readback process is active. If, during the Observation state, these pulses become out-of-specification, the system-level supervisory function should conclude that the readback process has experienced a fault. This condition is an uncorrectable, essential error.
In both UltraScale and UltraScale+ SSI implementations, which have a status_heartbeat output per SLR, it is necessary to monitor the heartbeat from all SLRs.
status_heartbeat is undefined in other controller states and should only be observed during the Observation state.
- CRC Failure Indicator, INIT_B
- This signal is a direct output from the readback process. If the readback process detects a CRC failure, it asserts INIT_B. If, during the Observation state, INIT_B indicates an error and the controller does not respond with a state transition to Correction within one second, the controller has experienced a fault. State transition can be determined using the Status Interface status_correction signal or the Monitor Interface state change report. This condition is an uncorrectable, essential error.
In UltraScale and UltraScale+ SSI implementations, which have an internal CRC failure indicator per SLR, the indicators are wire-ORed to form the single INIT_B device pin. For UltraScale implementations, the Status Interface has a status_correction signal for each SLR.
The CRC failure indicator, INIT_B, should only be observed during Observation and Detect only states and is undefined in other controller states.
- Controller Status Command and Report
- Using the UART Interface uart_rx and uart_tx signals, the system-level supervisory function can periodically transmit a status command and confirm receipt of the expected status report. Provided the controller has not changed state, the system-level supervisory function should conclude that the controller has experienced a fault if the expected status report is not received within one second. This condition is an uncorrectable, essential error.
When using this method, take care to select the lowest status command transmission frequency that yields an acceptable detection time for a "controller unresponsive" condition.
Status command and report processing by the controller can be an undesirable source of additional latency. For example, a status command transmission period of 60 seconds might be a reasonable trade-off to guard against rare "controller unresponsive" conditions while not adding significant additional latency to general operation. As a counterexample, a period of one second would be a poor choice: the status reports could keep the UART helper block transmit buffer frequently non-empty, possibly resulting in throttling on the Monitor Interface and adding latency to error detection, correction, and classification activities.
The controller status command and report method only functions in the Observation and Idle states. Assuming the UART helper block receive buffer is not in an overflow condition, status commands sent during other states are buffered and processed upon return to the Observation or Idle state.
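The status command and report method can be sketched as a small polling state machine. This Python model is illustrative only: `StatusPoller`, `tick`, and the returned strings are hypothetical names, and the caller is assumed to perform the actual UART transfers. The 60 second period is the example value from the text and the one second timeout is the response limit stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusPoller:
    period_s: float = 60.0   # example command period from the text
    timeout_s: float = 1.0   # report must arrive within one second
    next_send: float = 0.0
    pending_since: Optional[float] = None

    def tick(self, now, report_received, state_changed=False):
        """Advance the poller. Returns 'send' when a status command should
        be transmitted, 'fault' when a report is overdue, otherwise 'idle'."""
        if self.pending_since is not None:
            if report_received or state_changed:
                # A state change excuses the missing report: the command
                # is buffered until the controller returns to a polled state.
                self.pending_since = None
            elif now - self.pending_since > self.timeout_s:
                self.pending_since = None
                return "fault"
        if self.pending_since is None and now >= self.next_send:
            self.pending_since = now
            self.next_send = now + self.period_s
            return "send"
        return "idle"
```

A supervisor would call `tick` at a rate much faster than `timeout_s`, passing the current time and whether a status report or a state change report was seen since the previous call.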
- Correction and Classification States
- The Soft Error Mitigation solution transitions through the Correction and Classification states within the time specified in Table 1/Figure 1 and Table 1/Figure 1, provided there is no throttling on the Monitor Interface. Due to the infrequency of soft errors, the controller spends very little time in these states and normally transitions back to the Observation state or, less frequently, the Idle state.
If the controller dwells continuously in either the Correction or Classification states in excess of one second, as observed on the Status Interface status_correction and status_classification signals, or on the Monitor Interface as indicated by the state change reports, the system-level supervisory function should conclude that the controller has experienced a fault. This is an uncorrectable, essential error.
Independently, the system-level supervisory function might elect to monitor for conditions where the Soft Error Mitigation solution repeatedly corrects the same address. Many rare issues might generate this symptom, ranging from soft errors in the controller to hard errors in the device itself.
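Both checks in this state, the one second dwell limit and the repeated-address symptom, lend themselves to a short model. The Python sketch below assumes time-ordered samples of status_correction and status_classification and a list of corrected addresses parsed from Monitor Interface reports. The function names and the `threshold` default are hypothetical; the source does not define how many repeats are significant.

```python
from collections import Counter


def dwell_fault(samples, limit_s=1.0):
    """samples: time-ordered (t, status_correction, status_classification).
    Return True if either signal stays asserted for more than limit_s."""
    start = {"correction": None, "classification": None}
    for t, corr, cls in samples:
        for name, level in (("correction", corr), ("classification", cls)):
            if not level:
                start[name] = None          # signal dropped, restart timing
            elif start[name] is None:
                start[name] = t             # first sample of this dwell
            elif t - start[name] > limit_s:
                return True
    return False


def repeated_addresses(corrected, threshold=3):
    """Return addresses corrected at least `threshold` times (assumed limit)."""
    counts = Counter(corrected)
    return sorted(addr for addr, n in counts.items() if n >= threshold)
```

Because the dwell is measured between samples, the sampling interval bounds the detection resolution; a hardware counter clocked from a free-running clock would be the more typical implementation.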
- Detect Only Mode or State
- The controller spends virtually all of its time in this state once it enters this mode, either after initialization or when it is commanded to do so. There are at least two methods for monitoring the controller in this state; each provides slightly different information about the health of the controller:
- Controller Heartbeat, status_heartbeat
- This signal is a direct output from the Soft Error Mitigation solution. This signal exhibits pulses, specified in the Status Interface, which indicate the readback process is active. If, during the Detect only state, these pulses become out-of-specification, the system-level supervisory function should conclude that the readback process has experienced a fault. This condition is an uncorrectable, essential error.
In UltraScale and UltraScale+ SSI implementations, which have a status_heartbeat output per SLR, it is necessary to monitor the heartbeat from all SLRs.
See Heartbeat.
- CRC Failure Indicator, INIT_B
- This signal is a direct output from the readback process. If the readback process detects a CRC failure, it asserts INIT_B. If, during the Detect only state, INIT_B indicates an error and the controller does not respond with a state transition to Idle within one second, the controller has experienced a fault. State transition can be determined using the Status Interface (to detect idle state) or the Monitor Interface state change report. This condition is an uncorrectable, essential error.
In UltraScale and UltraScale+ SSI implementations, which have an internal CRC failure indicator per SLR, the indicators are wire-ORed to form the single INIT_B device pin, but the Status Interface for each SLR must be monitored for an Idle state separately.
The CRC failure indicator, INIT_B, should only be observed during Observation and Detect only states and is undefined in other controller states.
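The INIT_B check is the same in the Observation and Detect only states; only the expected response differs (Correction in the former, Idle in the latter). A minimal Python sketch of that rule follows. `init_b_response_fault` is a hypothetical name, and the caller is assumed to translate the INIT_B pin level into a boolean `crc_error` flag and to derive `responded` from the Status Interface or the Monitor Interface state change report.

```python
def init_b_response_fault(samples, limit_s=1.0):
    """samples: time-ordered (t, crc_error, responded) tuples.

    crc_error -- True while INIT_B indicates a CRC failure.
    responded -- True once the expected state transition has been seen.

    Return True if the error indication persists for more than limit_s
    without the controller responding.
    """
    error_since = None
    for t, crc_error, responded in samples:
        if not crc_error or responded:
            error_since = None      # no outstanding, unanswered error
        elif error_since is None:
            error_since = t         # first sample of an unanswered error
        elif t - error_since > limit_s:
            return True
    return False
```

In SSI devices the single wire-ORed pin cannot identify the SLR, so `responded` would need to combine the per-SLR status signals before being passed in.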
- Diagnostic Scan State
- When commanded, the controller scans all the configuration memory in the device in this
state and reports all ECC errors it encounters. Here is the recommended method for
monitoring the controller in this state:
- Controller Heartbeat, status_heartbeat
- This signal is a direct output from the Soft Error Mitigation solution. This
signal exhibits pulses, specified in the Status Interface, which indicate the
readback process is active. If, during the Diagnostic Scan state, these pulses
become out-of-specification, the system-level supervisory function should conclude
that the readback process has experienced a fault. This condition is an
uncorrectable, essential error.
In UltraScale and UltraScale+ SSI implementations, which have a status_heartbeat output per SLR, it is necessary to monitor the heartbeat from all SLRs.
See Heartbeat.
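The heartbeat check used in the Observation, Detect only, and Diagnostic Scan states can be sketched as a gap test on pulse timestamps. This Python model is an illustration under stated assumptions: `heartbeat_in_spec` and `all_slrs_healthy` are hypothetical names, `max_gap` stands in for the pulse timing limit given in the Status Interface specification, and only excessive gaps are checked, not pulse width.

```python
def heartbeat_in_spec(edge_times, now, max_gap):
    """True if no gap between consecutive pulses, or between the last
    pulse and `now`, exceeds max_gap (same time unit throughout)."""
    if not edge_times:
        return False                       # no pulse ever seen
    times = sorted(edge_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    gaps.append(now - times[-1])           # time since the most recent pulse
    return all(g <= max_gap for g in gaps)


def all_slrs_healthy(edges_by_slr, now, max_gap):
    """SSI devices expose one heartbeat per SLR; every one must be in spec."""
    return bool(edges_by_slr) and all(
        heartbeat_in_spec(edges, now, max_gap)
        for edges in edges_by_slr.values()
    )
```

The check should only be evaluated while the controller is in a state where the heartbeat is defined, so a real supervisor would gate it with the state indicators.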
- Idle and Injection States
- The controller only enters the Idle state as a result of an uncorrectable error, or if specifically directed. In the event of an uncorrectable error, see the previous section about monitoring event reporting. Directed entry to the Idle state is generally for the purpose of issuing other commands for error injection or ICAP sharing. It is inadvisable to implement the “Observation State” point mentioned previously for status command and report monitoring during the Idle state as it might conflict with commands issued by other processes at the application level. Instead, the application-level processes should test that any issued command completes and generates a response within one second. Otherwise, an uncorrectable, essential error has occurred and the application should report this to the system.
- Fatal Error State
- The controller only enters this state when it has detected an inconsistent internal state. This condition is observable on the Status Interface as the assertion of all seven state indicators, and might be observable on the Monitor Interface as a HLT message. In UltraScale SSI implementations, where more than one controller instance exists, the solution is considered halted if one or more of the controller instances halts or transitions to idle as a result of an uncorrectable error event. This is an uncorrectable, essential error.
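The halt rule above can be expressed as a small decision function. In this Python sketch, the seven state indicators are passed as an unnamed sequence of booleans because the text does not list them. Treating "all indicators deasserted" as the Idle state is an assumption of this example, not something stated above, and `classify_state` and `solution_halted` are hypothetical names.

```python
def classify_state(indicators):
    """indicators: the seven sampled state indicator levels of one controller."""
    if len(indicators) != 7:
        raise ValueError("expected seven state indicators")
    if all(indicators):
        return "halted"      # all seven asserted: fatal error state
    if not any(indicators):
        return "idle"        # ASSUMPTION: Idle shows as all deasserted
    return "active"


def solution_halted(instances):
    """instances: iterable of (indicators, uncorrectable) per controller.

    The solution counts as halted if any instance is halted, or is idle
    as the result of an uncorrectable error.
    """
    for indicators, uncorrectable in instances:
        state = classify_state(indicators)
        if state == "halted" or (state == "idle" and uncorrectable):
            return True
    return False
```

Pairing the idle test with status_uncorrectable keeps a directed entry to Idle, for example for error injection, from being reported as a halt.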
Although implementing any of the system-level supervisory functions described above is optional, AMD recommends implementing, at a minimum, the following supervisory functions to ensure that the IP is healthy and functional when used in mitigation modes:
- Confirm that the IP has completed the Boot and Initialization states and successfully transitioned into the Observation, Idle, or Detect only state (based on the mode selected) after device configuration, as discussed in the Boot and Initialization. The INIT_B signal should not be observed in the Boot and Initialization states.
- Monitor the status_heartbeat signal during the Observation, Detect only, and Diagnostic Scan states to ensure that it is within the specification, as discussed in the Heartbeat. An example of this monitoring logic is delivered in the example design. See the Functions.
- Ensure that the IP has NOT halted or gone to Idle when it is deployed in any Mitigation and Detect modes. If either of these states occurs, the IP has stopped all mitigation activity and can no longer detect or correct any SEU that might occur. This can be done by monitoring the status_* signals. Example logic to flag a halted IP is delivered in the example design. See the Functions.
- Monitor the INIT_B signal when the SEM controller is in the Observation and Detect only states. If INIT_B remains asserted for longer than one second and the controller has not transitioned to the Correction or Idle state respectively, this is an indication that a non-correctable error has occurred or that the IP is no longer responsive to mitigate errors, as discussed in the CRC Failure Indicator, INIT_B.
- Buffer the monitor_txdata[7:0] output into a FIFO to ease debugging of the IP behavior if required at a future point. This is recommended especially if the Monitor Interface is not used by the system. See the Monitor Interface.
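The last recommendation, buffering monitor_txdata[7:0] for later debugging, can be modeled with a bounded FIFO. The Python sketch below is a software stand-in for what would normally be a hardware FIFO. `MonitorLog`, its `depth` default, the drop-oldest overflow policy, and the assumption that the buffered bytes decode as ASCII text are all choices made for this example.

```python
from collections import deque


class MonitorLog:
    """Keeps the most recent `depth` bytes seen on monitor_txdata[7:0]."""

    def __init__(self, depth=4096):
        # A deque with maxlen silently discards the oldest byte when full.
        self._fifo = deque(maxlen=depth)

    def push(self, byte):
        """Store one transmitted byte (only the low 8 bits are kept)."""
        self._fifo.append(byte & 0xFF)

    def dump(self):
        """Return the buffered bytes as text for inspection."""
        return bytes(self._fifo).decode("ascii", errors="replace")
```

Discarding the oldest data keeps the events closest to a failure, which is usually what a post-mortem needs; a design that must preserve the boot report instead would drop the newest bytes when full.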