Deep-Learning Processor Unit for RNN - 1.4.1 English

Vitis AI RNN User Guide (UG1563)

Document ID
UG1563
Release Date
2021-12-03
Version
1.4.1 English

Recurrent Neural Networks (RNNs) can process sequential data of variable length and are widely used in natural language processing, speech synthesis and recognition, and financial time-series forecasting. However, RNNs are compute-intensive and must process their input frame by frame because of the strict sequential dependency between time steps. Traditional hardware cannot achieve the desired latency, especially for financial data processing, where latency is one of the most important factors for customers.
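The sequential dependency can be seen in a minimal sketch of a vanilla RNN forward pass (illustrative only; the weight names and shapes here are assumptions, not the DPU's internal layout): each hidden state depends on the previous one, so the time loop cannot be parallelized across frames.

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    """Vanilla RNN forward pass. Each step reads the previous hidden
    state h, so frames must be processed strictly in order."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:  # frame-by-frame: no parallelism over the time axis
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

# Toy example: 5 frames of 3-dim input, 4-dim hidden state
rng = np.random.default_rng(0)
h = rnn_forward(rng.normal(size=(5, 3)),   # input sequence
                rng.normal(size=(4, 3)),   # input-to-hidden weights
                rng.normal(size=(4, 4)),   # hidden-to-hidden weights
                np.zeros(4))               # bias
print(h.shape)  # (4,)
```

Per step, the dominant costs are the two matrix-vector products, the element-wise additions, and the tanh activation, which is why those are exactly the operation types the accelerators implement in hardware.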

The deep-learning processor unit (DPU) for RNN is a customized accelerator built on FPGA or ACAP devices to accelerate RNN inference. It supports different types of recurrent neural networks, including the vanilla RNN, the gated recurrent unit (GRU), long short-term memory (LSTM), bidirectional LSTM, and their variants. The DPU for RNN has been deployed on the Alveo U25 and U50LV data center accelerator cards and the Versal® VCK5000 development card. The following table summarizes the features of these three RNN accelerators:

Table 1. DPU Features for RNN on Alveo U25, U50LV Cards, and Versal VCK5000 Development Card

Precision
  DPURADR16L (U25): int16
  DPURAHR16L (U50LV): int16
  DPURVDRML (VCK5000): mixed; int8 for GEMM on the AI Engine, int16 for others

Operation Type
  DPURADR16L (U25) and DPURAHR16L (U50LV): matrix-vector multiplication; element-wise multiplication and addition; sigmoid and tanh
  DPURVDRML (VCK5000): GEMM; element-wise multiplication and addition; sigmoid and tanh; ReLU; max; embedding (in RNN-T)

Multiplication Unit
  DPURADR16L (U25): one 32x32 systolic array
  DPURAHR16L (U50LV): seven 16x32 systolic arrays
  DPURVDRML (VCK5000): 40 AI Engine cores

Frequency
  DPURADR16L (U25): Freq_DSP = Freq_PL = 310 MHz
  DPURAHR16L (U50LV): Freq_DSP = 540 MHz, Freq_PL = 270 MHz
  DPURVDRML (VCK5000): Freq_AIE = 1.25 GHz, Freq_PL = 300 MHz

Resource Utilization
  DPURADR16L (U25): LUTs 187,509 (35.9%); Regs 303,670 (29.0%); Block RAM 659 (67.0%); URAM 56 (43.8%); DSPs 1,092 (55.5%)
  DPURAHR16L (U50LV): LUTs 488,679 (56.1%); Regs 1,045,016 (60.0%); Block RAM 796 (59.2%); URAM 512 (80.0%); DSPs 4,148 (69.7%)
  DPURVDRML (VCK5000): LUTs 169,163 (18.8%); Regs 241,657 (13.4%); Block RAM 197 (20.4%); URAM 332 (71.7%); DSPs 82 (4.2%); AI Engine 40 (10.0%)

Example Models
  DPURADR16L (U25) and DPURAHR16L (U50LV): IMDB Sentiment Detection; Customer Satisfaction; Open Information Extraction
  DPURVDRML (VCK5000): RNN-T

Quantization
  DPURADR16L (U25) and DPURAHR16L (U50LV): RNN Quantizer v1.4.1
  DPURVDRML (VCK5000): manual

Compilation
  DPURADR16L (U25) and DPURAHR16L (U50LV): RNN Compiler v1.4.1
  DPURVDRML (VCK5000): manual
  1. In the DPURVDRML, UltraRAM resources are mainly used as on-chip memory for weights. If multiple 40-core AI Engine kernels are instantiated, this memory is shared among them.
  2. The embedding module is customized to support the RNN-T network; it does not support any update to the embedding table (size or contents).
  3. Quantization and compilation for the RNN-T model are performed manually; tool support is not yet available.
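All three accelerators compute in fixed point (int16, plus int8 for GEMM on the VCK5000's AI Engine). As a rough illustration of what such a precision entails, the following is a minimal sketch of symmetric fixed-point quantization; it is not the actual Vitis AI RNN Quantizer, whose scale selection and calibration are tool-specific, and the choice of fractional bits here is an assumption for the example.

```python
import numpy as np

def quantize_int16(x, frac_bits):
    """Symmetric fixed-point quantization: round to the nearest multiple
    of 2**-frac_bits and clip (saturate) to the signed 16-bit range."""
    scaled = np.round(x * (1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def dequantize(q, frac_bits):
    """Map int16 codes back to real values."""
    return q.astype(np.float64) / (1 << frac_bits)

# Example: quantize a few weights with 12 fractional bits
w = np.array([0.5, -1.25, 0.031])
q = quantize_int16(w, frac_bits=12)
print(dequantize(q, 12))  # close to w; error bounded by half a step, 2**-13
```

The trade-off this sketch makes visible is the one the hardware table reflects: more fractional bits give finer resolution but a smaller representable range before saturation, which is why a lower-precision int8 format is reserved for the GEMM path, where accumulation in wider registers can absorb the rounding error.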