Recovery of the MicroBlaze Subsystem - 1.0 English

MicroBlaze Triple Modular Redundancy (TMR) Subsystem (PG268)

When the TMR Manager detects a TMR Comparator mismatch with one faulty MicroBlaze uniquely identified, it enters Lockstep state (FS-mode). In this state the two healthy MicroBlaze sub-blocks ensure that the nominal operation of the entire TMR MicroBlaze subsystem continues without degradation. The Lockstep state is signaled to the MicroBlaze processors by asserting a break signal. The software application can handle the break and restore the faulty sub-block by performing the following steps:

1. The executing software is interrupted by the break signal.

2. The software break handler stores all internal MicroBlaze registers in RAM.

3. The software performs a reset of the entire MicroBlaze subsystem excluding the TMR Managers by executing a SUSPEND instruction.

4. The reset restores the TMR Manager to Voting (FT-mode) state.

5. The software starts executing from the reset vector, and reads the TMR Manager First Failing Register (FFR) to determine the actions to perform.

6. If the FFR indicates a cold reset (the Recovery bit is not set), a normal program cold start should be done. If the FFR indicates that one MicroBlaze sub-block is faulty (all of the Fatal bits are cleared, the Recovery bit and two of the three Lockstep mismatch bits are set), a recovery should be done. If the register holds any other value, the software should not attempt a recovery. The action in this case is application dependent, and could for example be entering an infinite loop to allow logic outside the subsystem to handle recovery, or doing a cold reset.

7. The software clears the TMR Manager FFR.

8. The software restores all registers from RAM and executes an RTBD instruction to return from the break handler, resuming nominal execution at the point where the break occurred.
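
The decision in step 6 can be sketched in C as shown below. This is a minimal illustration, not the reference implementation: the bit positions and the names FFR_LM12, FFR_LM13, FFR_LM23, FFR_FATAL_MASK, FFR_RECOVERY, boot_action_t, and classify_ffr are hypothetical placeholders, and the actual FFR layout must be taken from the TMR Manager register description.

```c
#include <stdint.h>

/* HYPOTHETICAL bit layout, for illustration only. Replace these masks
 * with the real First Failing Register (FFR) field definitions. */
#define FFR_LM12        (1u << 0)   /* assumed: Lockstep mismatch bit */
#define FFR_LM13        (1u << 1)   /* assumed: Lockstep mismatch bit */
#define FFR_LM23        (1u << 2)   /* assumed: Lockstep mismatch bit */
#define FFR_LM_MASK     (FFR_LM12 | FFR_LM13 | FFR_LM23)
#define FFR_FATAL_MASK  (0x3Fu << 8) /* assumed: all Fatal bits */
#define FFR_RECOVERY    (1u << 16)   /* assumed: Recovery bit */

typedef enum {
    BOOT_COLD_START,   /* Recovery bit not set: normal cold start */
    BOOT_RECOVER,      /* exactly one sub-block faulty: restore and resume */
    BOOT_NO_RECOVERY   /* any other value: application-dependent handling */
} boot_action_t;

/* Count the bits set in v. */
static int count_bits(uint32_t v)
{
    int n = 0;
    while (v != 0u) {
        n += (int)(v & 1u);
        v >>= 1;
    }
    return n;
}

/* Map an FFR value, read after reset, to the action described in step 6. */
boot_action_t classify_ffr(uint32_t ffr)
{
    if ((ffr & FFR_RECOVERY) == 0u) {
        return BOOT_COLD_START;
    }
    /* Recovery bit is set: recover only if no Fatal bit is set and
     * exactly two of the three Lockstep mismatch bits are set. */
    if ((ffr & FFR_FATAL_MASK) == 0u &&
        count_bits(ffr & FFR_LM_MASK) == 2) {
        return BOOT_RECOVER;
    }
    return BOOT_NO_RECOVERY;
}
```

In this sketch the startup code would read the FFR, call classify_ffr on the value, and then clear the FFR (step 7) before either performing a cold start or restoring the saved registers (step 8). For BOOT_NO_RECOVERY the caller chooses the application-dependent action, for example an infinite loop or a cold reset.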

Because restoring the faulty processor is controlled by software, the software can postpone recovery until any critical tasks have completed, if necessary. This scheme can be modified to handle permanent errors and run in degraded Lockstep mode using the two healthy processors, by masking the break.

If the system requirements allow a periodic reset of the MicroBlaze subsystem, the software need not perform an explicit restore by handling the break, because a potential Lockstep state would implicitly be restored to the Voting state by the periodic reset. Another advantage of a periodic reset is that any latent faults in the subsystem are removed, which reduces the failure intensity.

Finally, if the system requirements allow, software recovery can be omitted altogether. The subsystem would then run until a fatal error condition occurs, which could be resolved by a power-on reset.