Overview

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID
UG1076
Release Date
2020-11-24
Version
2020.2 English

Versal™ adaptive compute acceleration platforms (ACAPs) combine Scalar Engines, Adaptable Engines, and Intelligent Engines with leading-edge memory and interfacing technologies to deliver powerful heterogeneous acceleration for any application. Most importantly, Versal ACAP hardware and software are targeted for programming and optimization by data scientists and software and hardware developers. Versal ACAPs are enabled by a host of tools, software, libraries, IP, middleware, and frameworks to enable all industry-standard design flows.

Built on the TSMC 7 nm FinFET process technology, the Versal portfolio is the first platform to combine software programmability and domain-specific hardware acceleration with the adaptability necessary to meet today's rapid pace of innovation. The portfolio includes six series of devices uniquely architected to deliver scalability and AI inference capabilities for a host of applications across different markets—from cloud—to networking—to wireless communications—to edge computing and endpoints.

The Versal architecture combines different engine types with a wealth of connectivity and communication capability and a network on chip (NoC) to enable seamless memory-mapped access to the full height and width of the device. Intelligent Engines are SIMD VLIW AI Engines for adaptive inference and advanced signal processing compute, and DSP Engines for fixed point, floating point, and complex MAC operations. Adaptable Engines are a combination of programmable logic blocks and memory, architected for high-compute density. Scalar Engines, including Arm® Cortex™-A72 and Cortex-R5F processors, allow for intensive compute tasks.

AI Engines

The Versal AI Core series delivers breakthrough AI inference acceleration with AI Engines that deliver over 100x greater compute performance than current server-class of CPUs. This series is designed for a breadth of applications, including cloud for dynamic workloads and network for massive bandwidth, all while delivering advanced safety and security features. AI and data scientists, as well as software and hardware developers, can all take advantage of the high compute density to accelerate the performance of any application. Given the AI Engine's advanced signal processing compute capability, it is well-suited for highly optimized wireless applications such as radio, 5G, backhaul, and other high-performance DSP applications.

AI Engines are an array of very-long instruction word (VLIW) processors with single instruction multiple data (SIMD) vector units that are highly optimized for compute-intensive applications, specifically digital signal processing (DSP), 5G wireless applications, and artificial intelligence (AI) technology such as machine learning (ML).

AI Engines provide multiple levels of parallelism including instruction-level and data-level parallelism. Instruction-level parallelism includes a scalar operation, up to two moves, two vector reads (loads), one vector write (store), and one vector instruction that can be executed—in total, a 7-way VLIW instruction per clock cycle. Data-level parallelism is achieved via vector-level operations where multiple sets of data can be operated on a per-clock-cycle basis. Each AI Engine contains both a vector and scalar processor, dedicated program memory, local 32 KB data memory, access to local memory in any of three neighboring directions. It also has access to DMA engines and AXI4 interconnect switches to communicate via streams to other AI Engines or to the programmable logic (PL) or the DMA. Refer to the Versal ACAP AI Engine Architecture Manual (AM009) for specific details on the AI Engine array and interfaces.

AI Engine Kernels

An AI Engine kernel is a C/C++ program which is written using specialized intrinsic calls that target the VLIW vector processor. The AI Engine kernel code is compiled using the AI Engine compiler (aiecompiler) that is included in the Vitis™ core development kit. The AI Engine compiler compiles the kernels to produce an ELF file that is run on the AI Engine processors. AI Engine Architecture Overview presents a high-level overview of kernel programming, tools, and documents that can be referenced for AI Engine kernel programming.

AI Engine Graphs

An AI Engine program consists of a data-flow graph specification which is written in C++. This specification can be compiled and executed using the AI Engine compiler. An adaptive data flow (ADF) graph application consists of nodes and edges where nodes represent compute kernel functions, and edges represent data connections. Kernels in the application can be compiled to run on the AI Engines or in the PL region of the device. AI Engine Programming presents a brief overview of the AI Engine programming model, introduction to ADF graphs, and compiling and simulating an AI Engine graph.

Controlling the AI Engine Graph

Run-Time Graph Control API describes the various control APIs available to control and update the AI Engine graphs at run time. The graph control APIs can be used to initialize, run, update, and control the graph execution from an external controller and runs in the context of a platform. This platform can be a simulation-only platform, an extensible target platform which can be connected to the PL kernels, or a fixed platform for bare-metal applications.

The external controller can be the host code running on one of the processors in the embedded processing system (PS). Programming the PS Host Application describes the process of creating a host application to control the graph and PL kernels of the system. When your design is deployed in hardware, you can install drivers that facilitate initializing and controlling the graph execution via a host application running on the PS, or load and run the AI Engine graph at device boot time.

Application-specific AI Engine control code is generated by the AI Engine compiler as part of compiling the AI Engine design graph and kernel code. The AI Engine control code can:

  • Control the initial loading of the AI Engine kernels.
  • Run the graph for several iterations, update the run-time parameters associated with the graph, exit and reset the AI Engines.

The Vitis core development kit provides the xilinx_vck190_base_202020_1 platform for building, simulating, debugging, and deploying your AI Engine designs. The xilinx_vck190_base_202020_1 is a platform targeting the VCK190 board. It enables development of a design including AI Engine and PL kernels with a host application that targets the Linux OS running on the Arm processor in the PS. Designs developed on this platform can be verified using the hardware emulation flow. These designs can also run on the VCK190 board.

Writing an Example AI Engine Design

AI Engine Programming walks you through the steps involved in creating, compiling and simulating an AI Engine example using the Vitis tools.

The next few chapters describe the APIs that are available for data communication between kernels, controlling and updating graphs at run time, graph constructs to constrain the graph based on your design requirements, and graph constructs to interact with the rest of the Versal architecture areas, the scalar engine and adaptable engine.

Compiling and Simulating the Program

Compiling an AI Engine Graph Application describes in detail the different types of compilation available with the AI Engine compiler, the options and input files that can be passed in, and the expected output. You can compile the graph and kernels independently, or as part of a larger system, and set up the design to capture and profile event trace data at run time.

Simulating an AI Engine Graph Application describes the AI Engine simulator in detail, as well as the x86simulator for functional simulation. The AI Engine simulator simulates the graph application as a standalone entity, or as part of the hardware emulation of a larger system design.

Using the AI Engine Graph as Part of a Versal ACAP System Design

The AI Engine kernels and graph developed in the previous steps can used as part of a larger Versal ACAP system design that can consist of AI Engine kernels, HLS PL kernels, RTL kernels, and the host application. The Vitis compiler builds this larger system.

As described in Integrating the Application Using the Vitis Tools Flow, you can use a command-line approach for building the system, or use the a GUI based approach as described in Using the Vitis IDE. Either approach lets you perform simulation or emulation to verify the design, debug the design in an interactive debug environment, and build the design to deploy on hardware.

Performance Analysis of AI Engine Graph Application describes how to extract performance data by performing event tracing when running the hardware emulation build or the hardware build. Debugging the AI Engine Application shows how to run and use the debug environment from the command line, or from the Vitis IDE. The evaluation of the system performance and debugging the application are the key steps to achieve the application objectives.