Creating Runners in DPURAHR16L - 2.0 English

Vitis AI RNN User Guide (UG1563)

Document ID: UG1563
Release Date: 2022-01-20
Version: 2.0 English

The DPURAHR16L IP on the Alveo™ U50 card has two CUs. One CU processes batch-3 input, while the other processes batch-4 input. The two CUs can work in parallel.

The XRNN compiler generates a different XMODEL for each CU. Runners created with the batch-3 XMODEL are assigned to the batch-3 CU only, and runners created with the batch-4 XMODEL are assigned to the batch-4 CU only. Therefore, to utilize both CUs, you need to create one runner with each XMODEL.

While passing the input, the batch size must match the batch size supported by the corresponding runner. A runner's batch size can be read from the shape of the tensors returned by runner->get_input_tensors().
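For example, once the runners have been created (see step 2 below), a hypothetical check could read each runner's batch size from the first element of its input tensor shape:

    # dims is (batch_size, num_frames, seq_len); dims[0] is the batch
    # size that this runner's CU expects (3 or 4 on the U50).
    for runner in runners:
        print(tuple(runner.get_input_tensors()[0].dims)[0])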

This section describes the important parts of the customer satisfaction application in Python running on DPURAHR16L. The complete code can be accessed from Vitis-AI/demo/rnn_u25_u50lv/apps/customer_satisfaction/run_dpu_e2e.py.

  1. Import the required modules using the following commands:
    import numpy as np
    import vart
    import xir
    
  2. Load the models on the CUs.

    There are two available CUs: the first processes batch-3 input and the second processes batch-4 input. To utilize both, create two runners, one from each XMODEL.

    runners = []
    models = ["compiled_batch_3.xmodel", "compiled_batch_4.xmodel"]
    for i in range(len(models)):
        graph = xir.Graph.deserialize(models[i])
        runners.append(vart.Runner.create_runner(
            graph.get_root_subgraph(), "run"))
    
  3. Quantize the input data using the following command:
    in_pos = graph.get_root_subgraph().get_attr('input_float2fix')
    quantized_lstm_input = quanti_convert_float_to_int16(
        lstm_input.reshape(num_records * 25*32),
        in_pos).reshape((num_records, 25*32))
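    quanti_convert_float_to_int16() is a helper defined in the demo sources. As a rough sketch of what such a float-to-fixed-point conversion typically looks like (an assumption about the helper, not its exact implementation):

    def quanti_convert_float_to_int16(data, fix_pos):
        # Assumed behavior: scale by 2**fix_pos, round, and saturate
        # to the int16 range.
        amp = 2 ** fix_pos
        return np.clip(np.round(data * amp), -32768, 32767).astype(np.int16)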
    
  4. Start the execution. The input data is fed to the two runners in an alternating manner. The input and output dimensions, such as the batch size and the aligned input and output dimensions, can be queried from the runner. Allocate the output array for execute_async() beforehand.
    lstm_output = np.zeros((num_records, 25*100), dtype=np.int16)
    count = 0
    i = 0
    num_cores = 2
    while count < len(quantized_lstm_input):
        inputTensors = runners[i].get_input_tensors()
        outputTensors = runners[i].get_output_tensors()
        batch_size, num_frames, runner_in_seq_len = tuple(inputTensors[0].dims)
        _, _, runner_out_seq_len = tuple(outputTensors[0].dims)

        input_data = quantized_lstm_input[count:count+batch_size]
        batch_size = input_data.shape[0]  # the last chunk may be smaller
        input_data = input_data.reshape(batch_size, num_frames,
                                        runner_in_seq_len)
        output_data = np.empty((batch_size, num_frames, runner_out_seq_len),
                               dtype=np.int16)
        job_id = runners[i].execute_async([input_data], [output_data], True)
        runners[i].wait(job_id)
        # Keep only the valid part of each output frame; runner_out_seq_len
        # may be padded beyond output_seq_dim (100 here) for alignment.
        lstm_output[count:count+batch_size, ...] = output_data[
            ..., :output_seq_dim].reshape(batch_size, num_frames*output_seq_dim)
        count += batch_size
        i = (i + 1) % num_cores
    

    To run both CUs in parallel, invoke the execute_async() call from two different threads, one per runner. Refer to Vitis-AI/demo/rnn_u25_u50lv/apps/customer_satisfaction/run_dpu_e2e_mt.py for the complete example, and see the sketch below.
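    The sketch below shows the two-thread pattern under simple assumptions: the quantized records are split in half between the runners, and each thread runs the same per-runner loop as above. The split, the worker() helper, and the final concatenation are illustrative, not the code of the shipped script.

    import threading

    def worker(runner, records, out):
        # Each thread drives one runner (one CU) over its share of the
        # records, in chunks of that runner's batch size.
        batch_size, num_frames, in_seq_len = tuple(
            runner.get_input_tensors()[0].dims)
        _, _, out_seq_len = tuple(runner.get_output_tensors()[0].dims)
        count = 0
        while count < len(records):
            chunk = records[count:count+batch_size]
            n = chunk.shape[0]
            input_data = chunk.reshape(n, num_frames, in_seq_len)
            output_data = np.empty((n, num_frames, out_seq_len),
                                   dtype=np.int16)
            job_id = runner.execute_async([input_data], [output_data], True)
            runner.wait(job_id)
            out[count:count+n, ...] = output_data[
                ..., :output_seq_dim].reshape(n, num_frames*output_seq_dim)
            count += n

    half = len(quantized_lstm_input) // 2
    parts = [quantized_lstm_input[:half], quantized_lstm_input[half:]]
    outs = [np.zeros((p.shape[0], 25*100), dtype=np.int16) for p in parts]
    threads = [threading.Thread(target=worker, args=(r, p, o))
               for r, p, o in zip(runners, parts, outs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    lstm_output = np.concatenate(outs)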

  5. Dequantize the output using the following command:
    out_pos = graph.get_root_subgraph().get_attr('output_fix2float')
    lstm_output = quanti_convert_int16_to_float(lstm_output, out_pos)
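    quanti_convert_int16_to_float() is the inverse helper from the demo sources; a minimal sketch of the conversion it performs (again, an assumption about its internals):

    def quanti_convert_int16_to_float(data, fix_pos):
        # Assumed behavior: undo the fixed-point scaling applied
        # during quantization.
        return data.astype(np.float32) / (2 ** fix_pos)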