The split I/O memory model is introduced to overcome a limitation of the unique
memory model: it allows data residing in other physical memory buffers to be consumed
by the DPU directly. When dpuCreateTask() is called to create a DPU task from a DPU
kernel compiled with the -split-io-mem option, N2Cube allocates the DPU memory buffer
only for the intermediate feature maps. It is up to you to allocate the physically
contiguous memory buffers for the boundary input tensors and output tensors
individually. The sizes of the input and output memory buffers can be found in the
compiler build log under the fields Input Mem Size and Output Mem Size. You also need
to take care of cache coherence if these memory buffers are cacheable.
The DNNDK sample split_io provides a programming reference for the split I/O memory
model, using the TensorFlow SSD model as the example. The SSD model has one input
tensor, image:0, and two output tensors, ssd300_concat:0 and ssd300_concat_1:0. From
the compiler build log, the size of the DPU input memory buffer (for tensor image:0)
is 270000, and the size of the DPU output memory buffer (covering both output tensors
ssd300_concat:0 and ssd300_concat_1:0) is 218304. dpuAllocMem() is used to allocate
the memory buffers for them. dpuBindInputTensorBaseAddress() and
dpuBindOutputTensorBaseAddress() are then used to bind the input and output memory
buffer addresses to the DPU task before launching its execution. After the input data
is fed into the DPU input memory buffer, dpuSyncMemToDev() is called to flush the
cache lines. When the DPU task completes, dpuSyncDevToMem() is called to invalidate
the cache lines.
dpuAllocMem(), dpuFreeMem(), dpuSyncMemToDev(), and dpuSyncDevToMem() are provided
only to demonstrate the split I/O memory model. They are not expected to be used
directly in a production environment; it is up to you to implement such functionality
yourself to better meet your own requirements.