Creating Runners in DPURAHR16L - 2.0 English

Vitis AI RNN User Guide (UG1563)

Document ID: UG1563
Release Date: 2022-01-20
Version: 2.0 English

The DPURAHR16L IP on the Alveo™ U50 card has two CUs. One CU processes batch-3 input, while the other processes batch-4 input. The two CUs can work in parallel.

The XRNN compiler generates a different XMODEL for each CU. Runners created with the batch-3 XMODEL are assigned to the batch-3 CU only, and runners created with the batch-4 XMODEL are assigned to the batch-4 CU only. Therefore, to utilize both CUs, you need to create one runner with each XMODEL.

While passing the input, the batch size must match the batch size supported by the corresponding runner. A runner's batch size can be read from the shape of the tensors returned by runner->get_input_tensors().
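For example, once the runners have been created (see step 2 below), a hypothetical check could read each runner's batch size from the first element of its input tensor shape:

    # dims is (batch_size, num_frames, seq_len); dims[0] is the batch
    # size that this runner's CU expects (3 or 4 on the U50).
    for runner in runners:
        print(tuple(runner.get_input_tensors()[0].dims)[0])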

This section describes the important parts of the customer satisfaction application in Python running on DPURAHR16L. The complete code can be accessed from Vitis-AI/demo/rnn_u25_u50lv/apps/customer_satisfaction/run_dpu_e2e.py.

  1. Import the required modules using the following commands:
    import numpy as np
    import vart
    import xir
    
  2. Load the models on the CUs.

    There are two available CUs: the first processes batch-3 input and the second processes batch-4 input. To utilize both, create two runners, one from each XMODEL.

    runners = []
    models = ["compiled_batch_3.xmodel", "compiled_batch_4.xmodel"]
    for i in range(len(models)):
        graph = xir.Graph.deserialize(models[i])
        runners.append(vart.Runner.create_runner(
            graph.get_root_subgraph(), "run"))
    
  3. Quantize the input data using the following command:
    in_pos = graph.get_root_subgraph().get_attr('input_float2fix')
    quantized_lstm_input = quanti_convert_float_to_int16(
        lstm_input.reshape(num_records * 25*32),
        in_pos).reshape((num_records, 25*32))
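    quanti_convert_float_to_int16() is a helper defined in the demo sources. As a rough sketch of what such a float-to-fixed-point conversion typically looks like (an assumption about the helper, not its exact implementation):

    def quanti_convert_float_to_int16(data, fix_pos):
        # Assumed behavior: scale by 2**fix_pos, round, and saturate
        # to the int16 range.
        amp = 2 ** fix_pos
        return np.clip(np.round(data * amp), -32768, 32767).astype(np.int16)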
    
  4. Start the execution. The input data is fed to the two runners in an alternating manner. The input and output dimensions, such as the batch size and the aligned input and output dimensions, can be queried from the runner. Allocate the output array for execute_async() beforehand.
    lstm_output = np.zeros((num_records, 25*100), dtype=np.int16)
    count = 0
    i = 0
    num_cores = 2
    while count < len(quantized_lstm_input):
        inputTensors = runners[i].get_input_tensors()
        outputTensors = runners[i].get_output_tensors()
        batch_size, num_frames, runner_in_seq_len = tuple(inputTensors[0].dims)
        _, _, runner_out_seq_len = tuple(outputTensors[0].dims)

        input_data = quantized_lstm_input[count:count+batch_size]
        batch_size = input_data.shape[0]  # the last chunk may be smaller
        input_data = input_data.reshape(batch_size, num_frames,
                                        runner_in_seq_len)
        output_data = np.empty((batch_size, num_frames, runner_out_seq_len),
                               dtype=np.int16)
        job_id = runners[i].execute_async([input_data], [output_data], True)
        runners[i].wait(job_id)
        # Keep only the valid part of each output frame; runner_out_seq_len
        # may be padded beyond output_seq_dim (100 here) for alignment.
        lstm_output[count:count+batch_size, ...] = output_data[
            ..., :output_seq_dim].reshape(batch_size, num_frames*output_seq_dim)
        count += batch_size
        i = (i + 1) % num_cores
    

    To run both CUs in parallel, invoke the execute_async() call from two different threads, one per runner. Refer to Vitis-AI/demo/rnn_u25_u50lv/apps/customer_satisfaction/run_dpu_e2e_mt.py for the complete example, and see the sketch below.
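    The sketch below shows the two-thread pattern under simple assumptions: the quantized records are split in half between the runners, and each thread runs the same per-runner loop as above. The split, the worker() helper, and the final concatenation are illustrative, not the code of the shipped script.

    import threading

    def worker(runner, records, out):
        # Each thread drives one runner (one CU) over its share of the
        # records, in chunks of that runner's batch size.
        batch_size, num_frames, in_seq_len = tuple(
            runner.get_input_tensors()[0].dims)
        _, _, out_seq_len = tuple(runner.get_output_tensors()[0].dims)
        count = 0
        while count < len(records):
            chunk = records[count:count+batch_size]
            n = chunk.shape[0]
            input_data = chunk.reshape(n, num_frames, in_seq_len)
            output_data = np.empty((n, num_frames, out_seq_len),
                                   dtype=np.int16)
            job_id = runner.execute_async([input_data], [output_data], True)
            runner.wait(job_id)
            out[count:count+n, ...] = output_data[
                ..., :output_seq_dim].reshape(n, num_frames*output_seq_dim)
            count += n

    half = len(quantized_lstm_input) // 2
    parts = [quantized_lstm_input[:half], quantized_lstm_input[half:]]
    outs = [np.zeros((p.shape[0], 25*100), dtype=np.int16) for p in parts]
    threads = [threading.Thread(target=worker, args=(r, p, o))
               for r, p, o in zip(runners, parts, outs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    lstm_output = np.concatenate(outs)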

  5. Dequantize the output using the following command:
    out_pos = graph.get_root_subgraph().get_attr('output_fix2float')
    lstm_output = quanti_convert_int16_to_float(lstm_output, out_pos)
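    quanti_convert_int16_to_float() is the inverse helper from the demo sources; a minimal sketch of the conversion it performs (again, an assumption about its internals):

    def quanti_convert_int16_to_float(data, fix_pos):
        # Assumed behavior: undo the fixed-point scaling applied
        # during quantization.
        return data.astype(np.float32) / (2 ** fix_pos)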