AI Engine – Adaptive Data Flow Programming

AI Engine Programming: A Kahn Process Network Evolution (WP552)

Document ID
WP552
Release Date
2023-07-20
Revision
1.0 English

This section describes how data flow programming works for the AI Engine. Nodes (or actors) represent some type of operation. Nodes, or kernels, are implemented in AI Engines, which perform the operations, though not strictly as a single operator as shown in Figure 1: an AI Engine can contain multiple kernels and perform several operations.

KPN edges indicate the path that data takes to or from actors or ports. In the AI Engine tile architecture, edges are implemented as I/O streams, cascade I/O streams, direct memory access (DMA) FIFOs, and local tile memory buffers.

The connections between the KPN nodes (AI Engine kernels) in an AI Engine design are made through the C++ adaptive data flow (ADF) graph program. This code establishes the data flow graph wiring between KPN nodes (AI Engine kernels), and identifies any large memory buffers required for those nodes as well as any I/Os to the graph.
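As a concrete illustration, a minimal ADF graph might look like the sketch below. This is a sketch under the Vitis AI Engine tool flow; the kernel name, window sizes, file paths, and runtime ratios are illustrative assumptions, not values taken from this document.

```cpp
// Hypothetical ADF graph sketch: two kernels wired with window edges.
// All names, sizes, and paths are illustrative.
#include <adf.h>
using namespace adf;

// Kernel prototype; the body would live in kernels/simple.cc.
void simple(input_window_int32 *in, output_window_int32 *out);

class simpleGraph : public graph {
private:
    kernel first, second;
public:
    port<input>  in;
    port<output> out;

    simpleGraph() {
        first  = kernel::create(simple);
        second = kernel::create(simple);
        // Edges: graph input -> first -> second -> graph output.
        connect< window<128> >(in, first.in[0]);
        connect< window<128> >(first.out[0], second.in[0]);
        connect< window<128> >(second.out[0], out);
        source(first)  = "kernels/simple.cc";
        source(second) = "kernels/simple.cc";
        runtime<ratio>(first)  = 0.5;  // assumed fraction of a core's cycles
        runtime<ratio>(second) = 0.5;
    }
};
```

The graph class only describes the wiring; the AI Engine compiler maps it onto the array and allocates the buffers and locks behind each edge.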

The execution schedule is determined by the graph and the availability of input data and output resources:

  • There is no instruction pointer that triggers the AI Engines to fire. Each tile fires and executes its kernel function once all of its input data is available, as in a KPN.
  • There are many execution units available, from tens to hundreds of AI Engines depending on the device. Some, none, or all of these engines might execute in parallel, depending on the nature of the data flow graph that interconnects them.
  • All AI Engines are either computing or waiting for their input data, as in a KPN.

The AI Engine compiler takes the data flow graph and kernels as inputs and produces executable applications for running on an AI Engine device. The compiler allocates the necessary resources, such as locks, memory buffers, and DMA channels and descriptors, and generates routing information for mapping the graph onto the AI Engine array. It synthesizes a main program for each core that schedules all the kernels on that core and implements the necessary locking mechanism and data copies among buffers.

In the following figure, function 1 generates two A’s for every B, and on average function 2 consumes twice as many A’s as B’s. The tokens do not necessarily alternate, however: the stream might carry A’s for some time and then B’s. To handle this scenario, the data/tokens need to be accumulated and processed later. If this accumulation lasts for many cycles, it can stall the system and degrade performance. The difficulty varies with the design requirements. A few ways to overcome these challenges are adding FIFOs to accumulate the data, programming the kernels to improve performance by using multiple AI Engines, and other optimization techniques. It is important to understand the deadlock problem and use the proper techniques to solve it.

Figure 1. Data Need to Be Accumulated

The following table compares the KPN and AI Engine terminology.

Table 1. KPN and AI Engine Terminology

  • Node/actor. KPN: represents the processes (functions). AI Engine: the processes (nodes/actors) are implemented as kernels in the AI Engine.
  • Tokens/inputs. KPN: input data to the node/actor. AI Engine: input data to the AI Engine kernel.
  • Edge. KPN: edges indicate the path that data takes to or from actors or ports; the output of a node is implemented as a FIFO buffer. AI Engine: edges are implemented as I/O streams, cascade I/O streams, DMA FIFOs, and local tile memory buffers in the AI Engine tile architecture.
  • Firing. KPN: a node (actor) fires only when a token is present on every input to the node. AI Engine: the AI Engine compiler manages firing based on the availability of input tokens (the input window size) and the availability of buffers.
  • Blocking. KPN: reading is blocked if a node/actor (process) tries to read from an empty input. AI Engine: with memory communication, a kernel stalls while waiting for its buffer to be filled; with stream or cascade communication, the sink kernel stalls if the source is not producing samples. This is taken care of by the AI Engine compiler.

Locks:

As described earlier, the AI Engine compiler allocates the necessary locks and implements the locking mechanism and the data copies among buffers.

The C program for each core is compiled using the Synopsys Single Core Compiler to produce loadable ELF files.

The buffer structure is responsible for managing buffer locks and tracking the buffer type (ping/pong).

The locks associated with the input and output buffers ensure that those buffers are ready before the AI Engine kernel reads or writes them.

In some scenarios, data flow programming can be challenging for certain algorithms because the scheduling can stall the process.