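Graph throughput can be defined as the average number of bytes produced (or consumed) per second. The event::io_stream_start_to_bytes_transferred_cycles enumeration can be used to record the number of cycles taken to transfer a given amount of data.

After event::start_profiling() is called, two performance counters, performance counter 0 and performance counter 1, work together. Performance counter 0 starts incrementing when the first data is received. Performance counter 1 increments as data is received. When the value of performance counter 1 equals the amount of data specified in event::start_profiling(), it generates an event that signals performance counter 0 to stop. The value returned by event::read_profiling() is the value of performance counter 0. Once performance counter 0 has stopped, its value is the number of cycles taken to transfer the data.

If the amount of data specified in event::start_profiling() is never fully transferred, performance counter 0 does not stop. Another limitation is that if the specified amount of data has been transferred and more data then passes through, performance counter 0 continues to run freely and never stops.

Warning: With any graph throughput profiling method, graph latency and API call overhead might not be negligible if the number of iterations is too small. Run a large number of iterations to minimize the effect of graph latency and API call overhead.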
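The cooperation between the two counters can be pictured with a small software model. The sketch below is purely illustrative: StreamProfileModel, tick, and run_trace are hypothetical names that are not part of the ADF API, and the model assumes that the cycle in which the first data arrives is itself counted. It covers the ideal case and the case where the target is never reached. It does not model extra data arriving after the target has been reached.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical software model of the two cooperating counters (not ADF API).
struct StreamProfileModel {
    std::uint64_t target_bytes;   // amount of data passed to start_profiling
    std::uint64_t cycles = 0;     // models performance counter 0 (cycles)
    std::uint64_t bytes = 0;      // models performance counter 1 (bytes seen)
    bool started = false;         // true once the first data has been received
    bool stopped = false;         // true once counter 1 has stopped counter 0

    explicit StreamProfileModel(std::uint64_t target) : target_bytes(target) {
        assert(target > 0);
    }

    // Advance one clock cycle in which bytes_this_cycle bytes cross the port.
    void tick(std::uint64_t bytes_this_cycle) {
        if (stopped) return;                        // assumption: no data after the target
        if (bytes_this_cycle > 0) started = true;   // first data starts counter 0
        if (!started) return;                       // idle cycles before first data are not counted
        ++cycles;
        bytes += bytes_this_cycle;
        if (bytes >= target_bytes) stopped = true;  // counter 1 signals counter 0 to stop
    }
};

// Feed a per-cycle byte trace through the model and return its final state.
inline StreamProfileModel run_trace(std::uint64_t target,
                                    const std::vector<std::uint64_t>& trace) {
    StreamProfileModel model(target);
    for (std::uint64_t b : trace) model.tick(b);
    return model;
}
```

With the trace {0, 0, 4, 4, 0, 4, 4} and a target of 16 bytes, the model stops after counting five cycles: the two idle cycles before the first data are skipped, and the stall in the middle is counted. With a target of 32 bytes the same trace leaves the counter running, which corresponds to the first limitation described above.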
Profiling Graph Throughput Using the Graph Output
The following example shows how to profile graph throughput using the graph output:
// Start the PL kernel that receives the graph output.
auto s2mm_run = s2mm(out_bo, nullptr, OUTPUT_SIZE);
const int WINDOW_SIZE_in_bytes = 8192;
int iterations = 999;
// Third parameter is the amount of data to be transferred (in bytes).
event::handle handle = event::start_profiling(gr_pl.dataout, event::io_stream_start_to_bytes_transferred_cycles, WINDOW_SIZE_in_bytes * iterations);
if (handle == event::invalid_handle) {
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
gr_pl.run(iterations);
s2mm_run.wait(); // Performance counter 0 stops, assuming s2mm is able to receive all data
long long cycle_count = event::read_profiling(handle);
// cycle_count * 1e-9 converts cycles to seconds assuming 1 ns per cycle (a 1 GHz clock).
double throughput = (double)WINDOW_SIZE_in_bytes * iterations / (cycle_count * 1e-9); // bytes per second
event::stop_profiling(handle); // Performance counter is released and cleared
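Note that in the above code, the run waits for s2mm to complete, which ensures that all data has been transferred through the PLIO.
When using the API in the AI Engine simulation flow, graph.wait() can be used instead. Note that after graph.wait() returns, additional cycles are still needed to transfer the data from the window buffer to the PLIO. One solution is to run enough iterations that this overhead becomes negligible. Another solution is to use graph.wait(<NUM_CYCLES>) and run for enough cycles to ensure that all data has been transferred through the PLIO.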
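To get a feel for how many iterations are "enough", the effect of a fixed overhead can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: overhead_fraction is a hypothetical helper, and both cycles_per_iteration and overhead_cycles are assumed values that would have to come from your own design. If the measured cycle count is the steady-state cycles plus a fixed overhead, the throughput is underestimated by the fraction overhead / (steady + overhead).

```cpp
#include <cassert>

// Fraction by which throughput is underestimated when a fixed number of
// overhead cycles (graph latency, API calls, buffer-to-PLIO drain) is
// included in the measured cycle count. All inputs are assumed values.
inline double overhead_fraction(long long iterations,
                                double cycles_per_iteration,
                                double overhead_cycles) {
    assert(iterations > 0);
    assert(cycles_per_iteration > 0.0);
    assert(overhead_cycles >= 0.0);
    const double steady = static_cast<double>(iterations) * cycles_per_iteration;
    return overhead_cycles / (steady + overhead_cycles);
}
```

For example, with an assumed 100 cycles per iteration and 1000 overhead cycles, 10 iterations give an underestimate of 50%, while 990 iterations reduce it to about 1%.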
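Profiling Graph Throughput Using the Graph Input
The stream from the PLIO to the kernel and the DMA of the input buffer can receive data as soon as they have been configured. Once the PL kernel mm2s asserts valid, the input net can start receiving data, even before graph::run(). One way to profile a PLIO input is to have the PL assert valid only after event::start_profiling(). The following example shows how to profile graph throughput using the graph input:
const int WINDOW_SIZE_in_bytes = 8192;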
int iterations = 999;
// Third parameter is the amount of data to be transferred (in bytes).
event::handle handle = event::start_profiling(gr_pl.in, event::io_stream_start_to_bytes_transferred_cycles, WINDOW_SIZE_in_bytes * iterations);
if (handle == event::invalid_handle) {
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
gr_pl.run(iterations);
auto mm2s_run = mm2s(nullptr, OUTPUT_SIZE_MM2S); // After profiling has started, send data from mm2s
gr_pl.wait(); // Performance counter 0 stops, assuming s2mm is able to receive all data
long long cycle_count = event::read_profiling(handle);
// cycle_count * 1e-9 converts cycles to seconds assuming 1 ns per cycle (a 1 GHz clock).
double throughput = (double)WINDOW_SIZE_in_bytes * iterations / (cycle_count * 1e-9); // bytes per second
event::stop_profiling(handle); // Performance counter is released and cleared
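Estimating Graph Throughput
If for any reason performance counter 0 cannot be stopped cleanly at the specified amount of data, this method can still be used to estimate the graph throughput. For example, if the PL kernels are free-running and no performance counters are left in the AI Engine-to-PL interface column used by the graph output, graph throughput can still be profiled through the graph input:
const int WINDOW_SIZE_in_bytes = 8192;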
int iterations = 999;
// Third parameter is the amount of data to be transferred (in bytes).
event::handle handle = event::start_profiling(gr_pl.in, event::io_stream_start_to_bytes_transferred_cycles, WINDOW_SIZE_in_bytes * iterations);
if (handle == event::invalid_handle) {
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
gr_pl.run(iterations);
gr_pl.wait(); // Performance counter 0 does not stop
// Read the performance counter value immediately.
// Assumes the overhead is negligible when the number of iterations is large enough.
long long cycle_count = event::read_profiling(handle);
// cycle_count * 1e-9 converts cycles to seconds assuming 1 ns per cycle (a 1 GHz clock).
double throughput = (double)WINDOW_SIZE_in_bytes * iterations / (cycle_count * 1e-9); // bytes per second
event::stop_profiling(handle); // Performance counter is released and cleared
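The listings in this section convert cycles to seconds with the factor 1e-9, which corresponds to one nanosecond per cycle. If the AI Engine clock on the target device runs at a different frequency, the conversion needs the actual clock rate. The helper below is a minimal sketch of that conversion: throughput_bytes_per_sec and its clock_hz parameter are hypothetical names introduced here for illustration, and the clock frequency must be supplied by the caller.

```cpp
#include <cassert>

// Convert a byte count and a measured cycle count into bytes per second.
// clock_hz is the frequency of the clock that drives the performance
// counter; it is an input the caller must supply, not something this
// helper can discover.
inline double throughput_bytes_per_sec(long long bytes,
                                       long long cycle_count,
                                       double clock_hz) {
    assert(bytes >= 0);
    assert(cycle_count > 0);   // also guards against division by zero
    assert(clock_hz > 0.0);
    const double seconds = static_cast<double>(cycle_count) / clock_hz;
    return static_cast<double>(bytes) / seconds;
}
```

For example, 4000 bytes transferred in 1000 cycles corresponds to roughly 4e9 bytes per second at 1 GHz, and roughly 5e9 bytes per second if the same cycle count was measured at an assumed 1.25 GHz.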