Advanced Atomics

A.1 Dot Product Revisited

In Chapter 5, we looked at the implementation of a vector dot product using CUDA C. This algorithm was one of a large family of algorithms known as reductions. If you recall, the algorithm computed the dot product of two input vectors by doing the following:

1. Each thread in each block multiplies two corresponding elements of the input vectors and stores the product in shared memory.

2. While a block still holds more than one partial product, half of the active threads each add a pair of values and store the sum back to shared memory. Each step ends with half as many values as it started with (this halving is where the term reduction comes from).

3. When every block has reduced its values to a single sum, it writes that value to global memory and exits.

4. If the kernel ran with N parallel blocks, the CPU sums these remaining N values to produce the final dot product.
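The steps above can be sketched as a CUDA C kernel. This is a minimal sketch in the spirit of the Chapter 5 version, not the book's exact listing; the names `dot`, `partial`, `N`, and `THREADS_PER_BLOCK` are assumptions for illustration.

```cuda
#define N                 (1024 * 32)
#define THREADS_PER_BLOCK 256

__global__ void dot(const float *a, const float *b, float *partial) {
    __shared__ float cache[THREADS_PER_BLOCK];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Step 1: each thread multiplies corresponding elements of the
    // input vectors, striding across the arrays if there are more
    // elements than total threads, and parks its sum in shared memory.
    float temp = 0.0f;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid  += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();

    // Step 2: pairwise reduction in shared memory. Each pass halves
    // the number of live values, so blockDim.x must be a power of two.
    for (int i = blockDim.x / 2; i > 0; i /= 2) {
        if (threadIdx.x < i)
            cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();
    }

    // Step 3: thread 0 writes this block's final sum to global memory.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}
```

Step 4 then happens on the host: after copying `partial[]` back, the CPU loops over the N per-block entries and adds them to produce the final dot product.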