AI Engine Error Reporting - 2023.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID
UG1076
Release Date
2023-12-04
Version
2023.2 English

XRT provides error reporting APIs and tools. Errors fall into two categories:

Synchronous error
Errors that are detected during an XRT runtime function call.
Asynchronous error
Errors that originate in the underlying driver, system, hardware, and so on.
A synchronous error handling example:
auto ghdl = xrt::graph(device, uuid, "gr");
try {
  ghdl.update("gr.fir24.in[1]", narrow_filter);
  ghdl.run(16);
  ghdl.read("gr.fir24.inout[0]", coeffs_readback); // Async read
} catch (std::exception const& e) {
  // e.what() carries the reason attached to the exception
  std::cout << "Graph Execution Error: " << e.what() << std::endl;
  return 1;
}

An asynchronous error might not be related to the current XRT function call or to the application that is running. Asynchronous errors are cached in driver subsystems, and the user application can access them through the asynchronous error reporting APIs. Cached errors persist until they are explicitly cleared, so they do not necessarily reflect the current system state. For example, a board might have been reset and be functioning correctly while previously cached errors are still available. To avoid confusion about the current state, each asynchronous error carries a timestamp indicating when the error occurred. This timestamp can be compared with, for example, the timestamp of the last xbutil reset.
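The timestamp comparison described above can be sketched as a small helper. This is an illustration only: `is_stale_error` is a hypothetical name, not an XRT function, and the sketch assumes both timestamps use the same clock and unit.

```cpp
#include <cstdint>

// Hypothetical helper (not part of XRT): reports whether a cached
// asynchronous error predates the most recent device reset.
// Assumption: both values use the same clock and unit, for example
// nanoseconds since the Unix epoch.
bool is_stale_error(std::uint64_t error_timestamp,
                    std::uint64_t last_reset_timestamp)
{
  // An error recorded before the reset describes an earlier device state.
  return error_timestamp < last_reset_timestamp;
}
```

An application could pass the error's timestamp as the first argument. How the reset time is obtained is outside the scope of this sketch.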

The errors cached by the driver contain a system error code and additional metadata as defined in https://github.com/Xilinx/XRT/blob/master/src/runtime_src/core/include/xrt_error_code.h, which is shared between the user space and the kernel space.

The XRT error handling APIs are declared in experimental/xrt_error.h. An asynchronous error handling example:

xrt::error error(device, XRT_ERROR_CLASS_AIE);
auto errCode = error.get_error_code();
auto timestamp = error.get_timestamp();
auto err_str = error.to_string();
/* code to deal with this specific error */
std::cout<<"Async error: "<< err_str << std::endl;

An example asynchronous error output:


Error Number (6): AIE_ACCESS
Error Driver (4): DRIVER_AIE
Error Severity (3): SEVERITY_CRITICAL
Error Module (3): MODULE_AIE_CORE
Error Class (2): CLASS_AIE
Timestamp: 1637342412366664740

XRT maintains the latest error for each class, along with a timestamp for when the error was generated. The fields of an error can be interpreted using the definitions in https://github.com/Xilinx/XRT/blob/master/src/runtime_src/core/include/xrt_error_code.h. For example, Error Module (3): MODULE_AIE_CORE corresponds to XRT_ERROR_MODULE_AIE_CORE in the enumeration xrtErrorModule.
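To look up a field in the header programmatically, the numeric value first has to be separated from its name. The following is a minimal sketch that parses one line shaped like those in the sample output above (`<label> (<number>): <name>`). The `ErrorField` type and the `parse_error_field` function are hypothetical and not part of XRT. The line format is assumed from the sample only.

```cpp
#include <string>

// One parsed report line, e.g. "Error Module (3): MODULE_AIE_CORE".
struct ErrorField {
  std::string label;        // "Error Module"
  unsigned long value = 0;  // 3
  std::string name;         // "MODULE_AIE_CORE"
};

// Hypothetical helper (not part of XRT). Returns false when the line does
// not have the "<label> (<number>): <name>" shape, for example the
// "Timestamp: ..." line.
bool parse_error_field(const std::string& line, ErrorField& out)
{
  const std::string::size_type open = line.find(" (");
  if (open == std::string::npos) return false;
  const std::string::size_type close = line.find("): ", open);
  if (close == std::string::npos) return false;

  // The text between the parentheses must consist of digits only.
  const std::string digits = line.substr(open + 2, close - (open + 2));
  if (digits.empty() ||
      digits.find_first_not_of("0123456789") != std::string::npos)
    return false;

  out.label = line.substr(0, open);
  out.value = std::stoul(digits);
  out.name = line.substr(close + 3);
  return true;
}
```

For the line `Error Module (3): MODULE_AIE_CORE`, the helper yields the label `Error Module`, the value 3, and the name `MODULE_AIE_CORE`. The value can then be matched against the corresponding enumeration in xrt_error_code.h.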

xbutil can be used to report errors. The error report accumulates the errors from all classes and sorts them by timestamp. The report also queries the drivers for the time of the last reset request.

$ xbutil examine -r error -d 0               

Asynchronous Errors
Time                          Class      Module           Driver      Severity           Error Code
Fri Nov 19 17:19:42 2021 GMT  CLASS_AIE  MODULE_AIE_CORE  DRIVER_AIE  SEVERITY_CRITICAL  AIE_ACCESS


$ xbutil examine -r error -f json -o <OUTPUT_FILE> -d 0
{
  "schema_version": {
    "schema": "JSON",
    "creation_date": "Fri Nov 19 17:58:09 2021 GMT"
  },
  "devices": [
    {
      "interface_type": "pcie",
      "device_id": "0000:00:00.0",
      "asynchronous_errors": [
        {
          "time": {
            "epoch": "1637342382770339700",
            "timestamp": "Fri Nov 19 17:19:42 2021 GMT"
          },
          "class": "CLASS_AIE",
          "module": "MODULE_AIE_CORE",
          "severity": "SEVERITY_CRITICAL",
          "driver": "DRIVER_AIE",
          "error_code": {
            "error_id": "6",
            "error_msg": "AIE_ACCESS"
          }
        }
      ]
    }
  ]
}
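In the JSON report above, the `epoch` value and the `timestamp` string describe the same instant. The pairing is consistent with `epoch` being nanoseconds since the Unix epoch. Under that assumption, a raw timestamp can be rendered in the same human-readable form with the sketch below. `format_error_time` is a hypothetical helper, not an XRT or xbutil function.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper (not part of XRT or xbutil).
// Assumption: epoch_ns is nanoseconds since the Unix epoch.
std::string format_error_time(std::uint64_t epoch_ns)
{
  // Drop the sub-second part; the report prints whole seconds.
  const std::time_t seconds =
      static_cast<std::time_t>(epoch_ns / 1000000000ULL);

  // std::gmtime returns a pointer to shared storage, so this sketch is
  // not thread-safe.
  const std::tm* utc = std::gmtime(&seconds);
  if (utc == nullptr) return std::string();

  char buf[64];
  const std::size_t n =
      std::strftime(buf, sizeof buf, "%a %b %d %H:%M:%S %Y GMT", utc);
  return std::string(buf, n);
}
```

With the default "C" locale, passing the `epoch` value from the report, 1637342382770339700, returns `Fri Nov 19 17:19:42 2021 GMT`, which matches the `timestamp` field.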

xbutil can also be used to report AI Engine running status and read registers for debug purposes. For example, the following command reads the status of kernels after the graph has executed.

$ xbutil examine -r aie -d 0

--------------------------
1/1 [0000:00:00.0] : edge
--------------------------
Aie
  Aie_Metadata
  GRAPH[ 0] Name : gr
          Status : unknown
    SNo. Core [C:R] Iteration_Memory [C:R] Iteration_Memory_Addresses 
    [ 0] 23:1 23:1 16388 
    [ 1] 23:2 23:0 6980 
    [ 2] 23:3 23:1 4 
    [ 3] 24:1 24:0 4 
    [ 4] 24:2 24:2 4 
    [ 5] 24:3 24:1 4 
    [ 6] 25:1 25:1 4 


Core [ 0]
  Column : 23
  Row : 1
  Core:
    Status : disabled, core_done
    Program Counter : 0x00000308
    Link Register : 0x00000290
    Stack Pointer : 0x000340a0
  DMA:
    MM2S:
      Channel:
        Id : 0
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

    S2MM:
      Channel:
        Id : 0
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

  Locks:
    0 : released_for_write
    1 : released_for_write
    2 : released_for_write
    3 : released_for_write
    4 : released_for_write
    5 : released_for_write
    6 : released_for_write
    7 : released_for_write
    8 : released_for_write
    9 : released_for_write
    10 : released_for_write
    11 : released_for_write
    12 : released_for_write
    13 : released_for_write
    14 : released_for_write
    15 : released_for_write


  Events:
    core : 1, 2, 5, 22, 23, 24, 28, 29, 31, 32, 35, 36, 38, 39, 40, 44, 45, 47, 68
    memory : 1, 43, 44, 45, 106, 113

......


Core [ 6]
  Column : 25
  Row : 1
  Core:
    Status : enabled, east_lock_stall
    Program Counter : 0x000001e6
    Link Register : 0x000000b0
    Stack Pointer : 0x00030020
  DMA:
    MM2S:
      Channel:
        Id : 0
        Channel Status : stalled_on_requesting_lock
        Queue Size : 0
        Queue Status : okay
        Current BD : 2

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

    S2MM:
      Channel:
        Id : 0
        Channel Status : running
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0


  Locks:
    0 : acquired_for_write
    1 : released_for_write
    2 : released_for_write
    3 : released_for_write
    4 : released_for_write
    5 : released_for_write
    6 : released_for_write
    7 : released_for_write
    8 : released_for_write
    9 : released_for_write
    10 : released_for_write
    11 : released_for_write
    12 : released_for_write
    13 : released_for_write
    14 : released_for_write
    15 : released_for_write

  Events:
    core : 1, 2, 5, 22, 26, 28, 29, 31, 32, 35, 38, 39, 44
    memory : 1, 20, 21, 23, 35, 43, 44, 106, 113
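When a report covers many cores, it can help to flag the stalled ones automatically. The sketch below splits a `Status` value such as `enabled, east_lock_stall` into tokens and reports whether any token ends in `_stall`. Both function names are hypothetical. The `_stall` suffix convention is inferred from the sample output above, not taken from a documented list of status strings.

```cpp
#include <string>
#include <vector>

// Splits a comma-separated status value into trimmed tokens.
std::vector<std::string> split_status(const std::string& status)
{
  std::vector<std::string> tokens;
  std::string::size_type start = 0;
  while (start < status.size()) {
    std::string::size_type end = status.find(',', start);
    if (end == std::string::npos) end = status.size();
    // Trim the spaces that surround each token in the report.
    const std::string::size_type first = status.find_first_not_of(' ', start);
    if (first != std::string::npos && first < end) {
      const std::string::size_type last = status.find_last_not_of(' ', end - 1);
      tokens.push_back(status.substr(first, last - first + 1));
    }
    start = end + 1;
  }
  return tokens;
}

// Assumption: stall conditions are reported as tokens ending in "_stall".
bool is_stalled(const std::string& status)
{
  const std::string suffix = "_stall";
  for (const std::string& token : split_status(status)) {
    if (token.size() >= suffix.size() &&
        token.compare(token.size() - suffix.size(), suffix.size(), suffix) == 0)
      return true;
  }
  return false;
}
```

Applied to the sample, `is_stalled` returns true for the status of Core [ 6] (`enabled, east_lock_stall`) and false for the status of Core [ 0] (`disabled, core_done`).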

The following command can be used to read specific registers for debug purposes.

$ xbutil advanced --read-aie-reg -d 0 0 25 Core_Status 
Register Core_Status Value of Row:0 Column:25 is 0x00000201
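A register value is easier to interpret once it is broken into individual bits. The following sketch lists the positions of the set bits in a 32-bit value. `set_bits` is a hypothetical helper, and the meaning of each bit position is not encoded here; it has to be looked up in the register reference.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the positions of the bits that are set in
// a 32-bit register value, lowest bit first.
std::vector<unsigned> set_bits(std::uint32_t value)
{
  std::vector<unsigned> bits;
  for (unsigned bit = 0; bit < 32; ++bit) {
    if ((value >> bit) & 1u) bits.push_back(bit);
  }
  return bits;
}
```

For the value 0x00000201 shown above, the helper returns bit positions 0 and 9.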

For AI Engine register definitions, see the Versal Adaptive SoC AI Engine Register Reference (AM015). For details on xbutil command use, see Xilinx Runtime (XRT) Architecture. For error analysis in the Vitis IDE, see Analyzing AI Engine Status.