10.4. Ten tips for high-performance kernels

From studying expert code and running my own experiments, I’ve gleaned ten simple methods that boost the performance of OpenCL kernels. In no particular order, here they are:

  • Unroll loops. If you know in advance how many iterations a for or while loop will perform, consider coding the iterations separately. This eliminates the comparison and branch operations executed on every iteration. Of course, you need to make sure that the kernel doesn’t grow too large.
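
    As a sketch of the idea, here is a four-element dot product written both ways, in plain C standing in for OpenCL C (the function names are illustrative, not from the book):

    ```c
    /* Rolled form: the compiler emits a compare and a branch per iteration. */
    float dot4_rolled(const float *a, const float *b) {
        float sum = 0.0f;
        for (int i = 0; i < 4; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* Unrolled form: no counter, comparisons, or branches. */
    float dot4_unrolled(const float *a, const float *b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
    }
    ```

    Many OpenCL compilers will unroll small constant-bound loops on their own, but unrolling by hand makes the transformation explicit.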

  • Disable processing of denormalized numbers. As discussed in chapter 3, denormalized numbers are floating-point numbers whose values fall below the smallest normal value. By letting tiny results underflow gradually instead of flushing to zero, they reduce the likelihood of division-by-zero errors, but their processing can take time. If division-by-zero operations aren’t a concern for the kernel, the host application can set the -cl-denorms-are-zero option in clBuildProgram. This is shown in the following function call:

    clBuildProgram(program, 0, NULL, "-cl-denorms-are-zero", NULL, NULL);

    To disable processing of infinite values and NaNs, set the -cl-finite-math-only option. Table 2.7 lists all of the options available for compiling kernels.

  • Transfer constant primitive values to the kernel with compiler defines instead of private memory parameters. If the host application needs to transfer a constant value to the kernel, it’s better to send the value using compiler options like -DNAME=VALUE than to create a separate argument for the kernel function. For example, if you need to tell the kernel that each work-item must process 128 values, you can define the SIZE macro as follows:

    clBuildProgram(program, 0, NULL, "-DSIZE=128", NULL, NULL);

    Now, when the compiler builds the kernel, it will replace every occurrence of SIZE with 128. No private or local memory is needed to store the constant.
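
    The kernel side of this arrangement might look like the following sketch, again in plain C standing in for OpenCL C. The #ifndef fallback exists only so the snippet compiles on its own; in a real build, SIZE would come from the -DSIZE=128 option:

    ```c
    /* SIZE normally arrives via the build option -DSIZE=128; this
       fallback only exists so the sketch compiles standalone. */
    #ifndef SIZE
    #define SIZE 128
    #endif

    /* The constant is baked in at compile time, so no kernel argument
       or private variable is needed to hold it. */
    int values_per_work_item(void) {
        return SIZE;
    }
    ```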

  • If sharing isn’t an issue, store small variable values in private memory instead of local memory. Work-items can access private memory faster than they can access local memory. Therefore, if the kernel needs to store primitive variable data, work-items can access the data faster in private memory. But if the data needs to be shared with other work-items in the work-group, it should be stored in local memory.

  • Avoid local memory bank conflicts by accessing local memory sequentially. Local memory is arranged into banks that are individually accessible. These banks interleave their data so that successive 32-bit elements are stored in successive banks. Therefore, if work-items access data sequentially, the read/write operations can be processed in parallel. Otherwise, if multiple work-items access the same memory bank, the memory operations will be processed serially.
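    A small C model makes the interleaving concrete. It assumes 32 banks of 32-bit words, a common GPU layout that OpenCL itself does not mandate; real bank counts vary by vendor:

    ```c
    #include <stdbool.h>

    #define NUM_BANKS 32  /* assumed bank count; real hardware varies */

    /* Bank that holds the 32-bit element at a given word index. */
    static int bank_of(int word_index) {
        return word_index % NUM_BANKS;
    }

    /* True if 32 work-items reading local_mem[id * stride] all hit
       distinct banks, so the accesses can proceed in parallel. */
    bool conflict_free(int stride) {
        bool used[NUM_BANKS] = { false };
        for (int id = 0; id < NUM_BANKS; id++) {
            int b = bank_of(id * stride);
            if (used[b])
                return false;
            used[b] = true;
        }
        return true;
    }
    ```

    Sequential access (stride 1) touches every bank exactly once, while an even stride folds multiple work-items onto the same bank and serializes them.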

  • Avoid using the modulo (%) operator. The % operator requires a significant amount of processing time on GPUs and other OpenCL devices. If possible, try to find another method to distinguish work-items from one another.
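    One common replacement, when the divisor happens to be a power of two: for non-negative x, x % n equals x & (n - 1), and the bitwise AND is far cheaper. A minimal sketch (the function name is illustrative):

    ```c
    /* Valid only when n is a power of two: x % n == x & (n - 1). */
    unsigned fast_mod_pow2(unsigned x, unsigned n) {
        return x & (n - 1u);
    }
    ```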

  • Reuse private variables throughout the kernel—create macros to avoid confusion. If a kernel uses one private variable in one section of code and another private variable in another section, the two variables can often be replaced by a single variable. The drawback is that a variable serving multiple purposes makes it harder to understand what the code is doing.

    To fix this problem, set macros whose names correspond to the same private variable. For example, suppose the variable tmp1 should hold an exponent in one section of the kernel, a sine value in a second section, and a loop counter in another. You could code three macros as follows:

    #define EXP tmp1
    #define SINE tmp1
    #define COUNT tmp1

    With these definitions in place, you can code with EXP, SINE, and COUNT as though they were distinct variables, but private memory will only need to store a single value.
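
    To see the macros in action, here is a hypothetical sketch in plain C (the computations are invented for illustration, not from the book):

    ```c
    #include <math.h>

    #define EXP   tmp1   /* exponent section */
    #define SINE  tmp1   /* sine section */
    #define COUNT tmp1   /* loop-counter section */

    float reuse_demo(void) {
        float tmp1;          /* the one variable behind all three names */
        float result;

        EXP = expf(2.0f);    /* first use: an exponent */
        result = EXP;

        SINE = sinf(0.0f);   /* second use: a sine value */
        result += SINE;

        for (COUNT = 0.0f; COUNT < 3.0f; COUNT += 1.0f)
            result += 1.0f;  /* third use: a loop counter */

        return result;       /* expf(2.0f) + 0 + 3 */
    }
    ```

    Each section reads naturally under its own name, yet only one private value is ever stored.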

  • For multiply-and-add operations, use the fma function if it’s available. If the FP_FAST_FMAF macro is defined, you can compute a*b+c with greater accuracy by calling the function fma(a, b, c). The processing time for this function will be less than or equal to that of computing a*b+c.
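
    Standard C exposes the same facility through math.h, so the pattern can be sketched outside a kernel. Here fmaf is the single-precision form, and FP_FAST_FMAF plays the same role the bullet describes (the wrapper name is illustrative):

    ```c
    #include <math.h>

    /* a*b + c with a single rounding step when hardware support exists. */
    float mad_sketch(float a, float b, float c) {
    #ifdef FP_FAST_FMAF
        return fmaf(a, b, c);   /* fused multiply-add: one rounding */
    #else
        return a * b + c;       /* fallback: two rounded operations */
    #endif
    }
    ```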

  • Inline non-kernel functions in a program file. The inline modifier preceding a function tells the compiler that each call to the function should be replaced with the complete function code. This is not memory efficient—if an inline function is called N times, the function body will be expanded N times—but it saves processing time by removing the context switches and stack operations associated with regular function calls.
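
    A minimal sketch of the trade-off in plain C; the inline keyword works the same way in OpenCL C kernel source:

    ```c
    /* The inline hint asks the compiler to paste the body at each call
       site instead of emitting a call, trading code size for speed. */
    static inline float square(float x) {
        return x * x;
    }

    float sum_of_squares(const float *v, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += square(v[i]);  /* becomes v[i] * v[i] after inlining */
        return sum;
    }
    ```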

  • Avoid branch miss penalties by coding conditional statements to be true more often than false. Many processors predict that branch statements will return true, and they plan for this in advance by loading the address corresponding to a true result. But if the condition produces a false result, the processor must clear the processing pipeline and load instructions from a new address. This is called a branch miss penalty, and it can be avoided by coding if statements and similar conditional statements to produce a true result as often as possible.
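
    As a sketch, here are two equivalent ways to guard a division; the second phrases the usually-true case as the tested condition, which matches the predict-true behavior the bullet describes (the functions are illustrative):

    ```c
    /* Rare case tested first: a predict-true processor misses often. */
    float recip_rare_first(float x) {
        if (x == 0.0f)        /* rarely true for typical data */
            return 0.0f;
        else
            return 1.0f / x;
    }

    /* Common case tested first: the predicted path is usually taken. */
    float recip_common_first(float x) {
        if (x != 0.0f)        /* usually true for typical data */
            return 1.0f / x;
        else
            return 0.0f;
    }
    ```

    Both functions compute the same result; only the branch that the processor speculates on differs.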

