Overview - 2022.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID
UG1076
Release Date
2022-10-19
Version
2022.2 English

Versal® adaptive compute acceleration platforms (ACAPs) combine Scalar Engines, Adaptable Engines, and Intelligent Engines with leading-edge memory and interfacing technologies to deliver powerful heterogeneous acceleration for any application. Most importantly, Versal ACAP hardware and software are targeted for programming and optimization by data scientists and software and hardware developers. Versal ACAPs are enabled by a host of tools, software, libraries, IP, middleware, and frameworks to enable all industry-standard design flows.

Built on the TSMC 7 nm FinFET process technology, the Versal portfolio is the first platform to combine software programmability and domain-specific hardware acceleration with the adaptability necessary to meet today's rapid pace of innovation. The portfolio includes six series of devices uniquely architected to deliver scalability and AI inference capabilities for a host of applications across different markets—from cloud—to networking—to wireless communications—to edge computing and endpoints.

The Versal architecture combines different engine types with a wealth of connectivity and communication capability and a network on chip (NoC) to enable seamless memory-mapped access to the full height and width of the device. Intelligent Engines are SIMD VLIW AI Engines for adaptive inference and advanced signal processing compute, and DSP Engines for fixed point, floating point, and complex MAC operations. Adaptable Engines are a combination of programmable logic blocks and memory, architected for high-compute density. Scalar Engines, including Arm® Cortex®-A72 and Cortex-R5F processors, allow for intensive compute tasks.

AI Engines

The Versal AI Core series delivers breakthrough AI inference acceleration with AI Engines that deliver over 100x greater compute performance than current server-class of CPUs. This series is designed for a breadth of applications, including cloud for dynamic workloads and network for massive bandwidth, all while delivering advanced safety and security features. AI and data scientists, as well as software and hardware developers, can all take advantage of the high compute density to accelerate the performance of any application. Given the AI Engine's advanced signal processing compute capability, it is well-suited for highly optimized wireless applications such as radio, 5G, backhaul, and other high-performance DSP applications.

AI Engines are an array of very-long instruction word (VLIW) processors with single instruction multiple data (SIMD) vector units that are highly optimized for compute-intensive applications, specifically digital signal processing (DSP), 5G wireless applications, and artificial intelligence (AI) technology such as machine learning (ML).

AI Engines are hardened blocks that provide multiple levels of parallelism including instruction-level and data-level parallelism. Instruction-level parallelism includes a scalar operation, up to two moves, two vector reads (loads), one vector write (store), and one vector instruction that can be executed—in total, a 7-way VLIW instruction per clock cycle. Data-level parallelism is achieved via vector-level operations where multiple sets of data can be operated on a per-clock-cycle basis. Each AI Engine contains both a vector and scalar processor, dedicated program memory, local 32 KB data memory, access to local memory in any of three neighboring directions. It also has access to DMA engines and AXI4 interconnect switches to communicate via streams to other AI Engines or to the programmable logic (PL) or the DMA. Refer to the Versal ACAP AI Engine Architecture Manual (AM009) for specific details on the AI Engine array and interfaces.

AI Engine Architecture Overview provides a high-level overview of the AI Engine architecture, tools, and documents that can be referenced for kernel programming.

AI Engine Kernels

An AI Engine kernel is a C/C++ program which is written using specialized APIs that target the VLIW vector processor. The AI Engine kernel code is compiled using the AI Engine compiler (aiecompiler) that is included in the Vitis™ core development kit. The AI Engine compiler compiles the kernels to produce an ELF file that is run on the AI Engine processors.

AI Engine Graphs

An AI Engine program consists of a data flow graph specification which is written in C++. This specification can be compiled and executed using the AI Engine compiler. An adaptive data flow (ADF) graph application consists of nodes and edges where nodes represent compute kernel functions, and edges represent data connections. Kernels in the application can be compiled to run on the AI Engines or in the PL region of the device. Refer to AI Engine Kernel and Graph Programming Guide (UG1079) for more information about AI Engine how to develop, debug and optimize AI Engine kernels and graphs. It also include information on specialized graph constructs and ways to control the AI Engine graph.

Controlling the AI Engine Graph

Programming the PS Host Application describes the process of creating a host application to control the graph and PL kernels of the system. When your design is deployed in hardware, you can install drivers that facilitate initializing and controlling the graph execution via a host application running on the PS, or load and run the AI Engine graph at device boot time.

Application-specific AI Engine control code is generated by the AI Engine compiler as part of compiling the AI Engine design graph and kernel code. The AI Engine control code can:

  • Control the initial loading of the AI Engine kernels.
  • Run the graph for several iterations, update the run-time parameters (RTP) associated with the graph, exit and reset the AI Engines.
    Note: A graph can have multiple kernels, input and output ports. The graph connectivity, which is equivalent to the nets in a data flow graph is either between the kernels, between kernel and input ports, or between kernel and output ports, and can be configured as a connection. A graph runs for an iteration when it consumes data samples equal to the window or stream of data expected by the kernels in the graph, and produces data samples equal to the window or stream of data expected at the output of all the kernels in the graph.

The Vitis core development kit provides the xilinx_vck190_base_202220_1 platform and the xilinx_vck190_base_dfx_202220_1 platform for building, simulating, debugging, and deploying your AI Engine designs, targeting the VCK190 board. It enables development of a design including AI Engine and PL kernels with a host application that targets the Linux OS running on the Arm processor in the PS. Designs developed on this platform can be verified using the hardware emulation flow. These designs can also run on the VCK190 board.

Compiling and Simulating the Program

Compiling an AI Engine Graph Application describes in detail the different types of compilation available with the AI Engine compiler, the options and input files that can be passed in, and the expected output. You can compile the graph and kernels independently, or as part of a larger system, and set up the design to capture and profile event trace data at run time.

Simulating an AI Engine Graph Application describes the AI Engine simulator in detail, as well as the x86 simulator for functional simulation. The AI Engine simulator simulates the graph application as a standalone entity, or as part of the hardware emulation of a larger system design.

Performance Analysis of AI Engine Graph Application during Simulation describes how to extract performance data by performing event tracing when running the hardware emulation build or the hardware build. This data can be used to further optimize the AI Engine kernels and graphs.

Mapper/Router Methodology describes the mapper and router methodology to be used when handling a failure in the AI Engine compiler during the mapper and/or router phase.

Integrating and Deploying the AI Engine Graph as Part of a Versal ACAP System Design

The AI Engine kernels and graph developed in the previous steps can used as part of a larger Versal ACAP system design that can consist of AI Engine kernels, HLS PL kernels, RTL kernels, and the host application. The Vitis compiler builds this larger system.

As described in Integrating the Application Using the Vitis Tools Flow, you can use a command-line approach for building the system, or use the a GUI-based approach as described in Using the Vitis IDE. Either approach lets you perform simulation or emulation to verify the design, debug the design in an interactive debug environment, and build the design to deploy on hardware.

Integrating the Application Using the Vitis Tools Flow also introduces Xilinx Dynamic Function eXchange (DFX) platform deployment flow in Targeting the DFX Platform. The DFX flow allows you to load or reload xclbin into DFX region at run time, and thus to reset the AI Engine design and PL kernels.

Profiling and Debugging Designs with AI Engine in Hardware

Performance Analysis of AI Engine Graph Application on Hardware describes how to profile and extract performance data by performing event tracing when running the design in hardware.

Debugging the AI Engine Application shows you how to run and use the debug environment from the command line, or from the Vitis IDE. The evaluation of the system performance and debugging the application are the key steps to achieve the application objectives.

AI Engine Hardware Profile and Debug Methodology describes the five stage profile and debug methodology to use when you run a Versal design with AI Engine graphs in hardware.