Description
This rule flags compute units that are invoked too often, so that the per-call startup overhead becomes significant.
Explanation
Work-groups provide a powerful mechanism to split operations across multiple compute units (CUs). However, a kernel with too small a work-group size can incur overhead, because each CU call has a startup time and a smaller work-group size means more calls for the same global work size. AMD recommends finding the right balance of global and local work sizes to optimize overall performance.
Consider a global work size of (1024, 1024, 1). AMD does not recommend the following kernel, whose (4, 4, 1) work-group size splits the work into 65,536 work-groups:
__kernel __attribute__ ((reqd_work_group_size(4, 4, 1)))
void madd(__global int* a, __global int* b, __global int* c) {
    int i = get_global_id(1)*get_global_size(0) + get_global_id(0);
    c[i] = a[i] + b[i];
}
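In contrast, AMD recommends the following kernel, whose (256, 256, 1) work-group size covers the same global work size in only 16 work-groups: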
__kernel __attribute__ ((reqd_work_group_size(256, 256, 1)))
void madd(__global int* a, __global int* b, __global int* c) {
    int i = get_global_id(1)*get_global_size(0) + get_global_id(0);
    c[i] = a[i] + b[i];
}