Deep-Learning Processor Unit for RNN - 1.4.1 English

Vitis AI RNN User Guide (UG1563)

Document ID
UG1563
Release Date
2021-12-03
Version
1.4.1 English

Recurrent Neural Networks (RNNs) can process sequential data of variable length and are widely used in natural language processing, speech synthesis and recognition, and financial time-series forecasting. However, RNNs are compute-intensive and must process their input frame by frame because of the strict sequential dependency between time steps. Traditional hardware cannot achieve the desired latency, especially for financial data processing, where latency is one of the most important factors for customers.
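The sequential dependency can be seen in a minimal sketch of a vanilla RNN forward pass (illustrative only; the weight names and shapes here are assumptions, not the DPU's internal layout): each hidden state depends on the previous one, so the time loop cannot be parallelized across frames.

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    """Vanilla RNN forward pass. Each step reads the previous hidden
    state h, so frames must be processed strictly in order."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:  # frame-by-frame: no parallelism over the time axis
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

# Toy example: 5 frames of 3-dim input, 4-dim hidden state
rng = np.random.default_rng(0)
h = rnn_forward(rng.normal(size=(5, 3)),   # input sequence
                rng.normal(size=(4, 3)),   # input-to-hidden weights
                rng.normal(size=(4, 4)),   # hidden-to-hidden weights
                np.zeros(4))               # bias
print(h.shape)  # (4,)
```

Per step, the dominant costs are the two matrix-vector products, the element-wise additions, and the tanh activation, which is why those are exactly the operation types the accelerators implement in hardware.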

The deep-learning processor unit (DPU) for RNN is a customized accelerator built on FPGA or ACAP devices to accelerate RNN inference. It supports different types of recurrent neural networks, including the vanilla RNN, the gated recurrent unit (GRU), long short-term memory (LSTM), bidirectional LSTM, and their variants. The DPU for RNN has been deployed on the Alveo U25 and U50LV data center accelerator cards and the Versal® VCK5000 development card. The following table summarizes the features of these three RNN accelerators:

Table 1. DPU Features for RNN on Alveo U25, U50LV Cards, and Versal VCK5000 Development Card

Precision
  DPURADR16L (U25): int16
  DPURAHR16L (U50LV): int16
  DPURVDRML (VCK5000): mixed; int8 for GEMM on the AI Engine, int16 for others

Operation Type
  DPURADR16L (U25) and DPURAHR16L (U50LV): matrix-vector multiplication; element-wise multiplication and addition; sigmoid and tanh
  DPURVDRML (VCK5000): GEMM; element-wise multiplication and addition; sigmoid and tanh; ReLU; max; embedding (in RNN-T)

Multiplication Unit
  DPURADR16L (U25): one 32x32 systolic array
  DPURAHR16L (U50LV): seven 16x32 systolic arrays
  DPURVDRML (VCK5000): 40 AI Engine cores

Frequency
  DPURADR16L (U25): Freq_DSP = Freq_PL = 310 MHz
  DPURAHR16L (U50LV): Freq_DSP = 540 MHz, Freq_PL = 270 MHz
  DPURVDRML (VCK5000): Freq_AIE = 1.25 GHz, Freq_PL = 300 MHz

Resource Utilization
  DPURADR16L (U25): LUTs 187,509 (35.9%); Regs 303,670 (29.0%); Block RAM 659 (67.0%); URAM 56 (43.8%); DSPs 1,092 (55.5%)
  DPURAHR16L (U50LV): LUTs 488,679 (56.1%); Regs 1,045,016 (60.0%); Block RAM 796 (59.2%); URAM 512 (80.0%); DSPs 4,148 (69.7%)
  DPURVDRML (VCK5000): LUTs 169,163 (18.8%); Regs 241,657 (13.4%); Block RAM 197 (20.4%); URAM 332 (71.7%); DSPs 82 (4.2%); AI Engine 40 (10.0%)

Example Models
  DPURADR16L (U25) and DPURAHR16L (U50LV): IMDB Sentiment Detection; Customer Satisfaction; Open Information Extraction
  DPURVDRML (VCK5000): RNN-T

Quantization
  DPURADR16L (U25) and DPURAHR16L (U50LV): RNN Quantizer v1.4.1
  DPURVDRML (VCK5000): manual

Compilation
  DPURADR16L (U25) and DPURAHR16L (U50LV): RNN Compiler v1.4.1
  DPURVDRML (VCK5000): manual
  1. In the DPURVDRML, UltraRAM resources are mainly used as on-chip memory for weights. If multiple 40-core AI Engine kernels are instantiated, this memory is shared among them.
  2. The embedding module is customized to support the RNN-T network; it does not support any update to the embedding table (size or contents).
  3. Quantization and compilation for the RNN-T model are performed manually; tool support is not yet available.
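All three accelerators compute in fixed point (int16, plus int8 for GEMM on the VCK5000's AI Engine). As a rough illustration of what such a precision entails, the following is a minimal sketch of symmetric fixed-point quantization; it is not the actual Vitis AI RNN Quantizer, whose scale selection and calibration are tool-specific, and the choice of fractional bits here is an assumption for the example.

```python
import numpy as np

def quantize_int16(x, frac_bits):
    """Symmetric fixed-point quantization: round to the nearest multiple
    of 2**-frac_bits and clip (saturate) to the signed 16-bit range."""
    scaled = np.round(x * (1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def dequantize(q, frac_bits):
    """Map int16 codes back to real values."""
    return q.astype(np.float64) / (1 << frac_bits)

# Example: quantize a few weights with 12 fractional bits
w = np.array([0.5, -1.25, 0.031])
q = quantize_int16(w, frac_bits=12)
print(dequantize(q, 12))  # close to w; error bounded by half a step, 2**-13
```

The trade-off this sketch makes visible is the one the hardware table reflects: more fractional bits give finer resolution but a smaller representable range before saturation, which is why a lower-precision int8 format is reserved for the GEMM path, where accumulation in wider registers can absorb the rounding error.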