961 and Surprisingly Unused CUDA Threads

2024 May 4

Today's interesting number is 961.
Say you're writing a CUDA program and you need to accomplish some task for every element of a long array. Well, the classical way to do this is to divide up the job amongst several different threads and let each thread do a part of the array. (We'll ignore blocks for simplicity, maybe each block has its own array to work on or something.) The method here is as follows:
for (int i = threadIdx.x; i < array_len; i += 32) {  // stride by the thread count (a full warp of 32 here)
    arr[i] = ...;  // whatever per-element work we need
}
So the threads make the following pattern (if there are t = 8 threads):
⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫
This is for an array of length l = 26. We can see that the work is split as evenly as possible between the threads, except that threads 0 and 1 (black and brown) have to process the last two elements of the array while the rest of the threads have finished their work and remain idle. This is unavoidable because we can't guarantee that the length of the array is a multiple of the number of threads. But this only happens at the tail end of the array, and for a large number of elements, the wasted effort becomes a very small fraction of the total. In any case, each thread will loop ⌈l/t⌉ = ⌈26/8⌉ = 4 times, though it may be idle during the last loop while it waits for the other threads to catch up.
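For concreteness, here is a minimal, self-contained sketch of a kernel built around this interleaved loop. The names and the per-element operation (doubling each value) are made up for illustration, and the stride is written as blockDim.x rather than a hard-coded 32 so it matches however many threads we launch (8 below, to match the diagram):
#include <cstdio>
#include <cuda_runtime.h>

// Interleaved (strided) processing: thread k handles elements k, k+t, k+2t, ...
__global__ void process_interleaved(float *arr, int array_len) {
    for (int i = threadIdx.x; i < array_len; i += blockDim.x) {
        arr[i] *= 2.0f;  // stand-in for the real per-element work
    }
}

int main() {
    const int array_len = 26;  // l = 26, as in the diagram
    float host[array_len];
    for (int i = 0; i < array_len; i++) host[i] = (float)i;

    float *dev;
    cudaMalloc((void **)&dev, array_len * sizeof(float));
    cudaMemcpy(dev, host, array_len * sizeof(float), cudaMemcpyHostToDevice);

    process_interleaved<<<1, 8>>>(dev, array_len);  // one block of t = 8 threads

    cudaMemcpy(host, dev, array_len * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < array_len; i++) printf("%g ", host[i]);
    printf("\n");
    return 0;
}
Compiled with nvcc, this should simply print the doubled values 0 2 4 ... 50.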
One may be able to spend many happy hours programming the GPU this way before running into a question: What if we want each thread to operate on a contiguous area of memory? (In most cases, we don't want this.) In the previous method (which is the canonical one), the parts of the array that each thread worked on were interleaved with each other. Now we run into a scenario where, for some reason, the threads must operate on contiguous chunks. "No problem," you say, "we simply need to break the array into chunks and give a chunk to each thread."
const int chunksz = (array_len + blockDim.x - 1)/blockDim.x;  // chunk size = ⌈array_len / blockDim.x⌉
for (int i = threadIdx.x*chunksz; i < (threadIdx.x + 1)*chunksz; i++) {
    if (i < array_len) {
        arr[i] = ...;
    }
}
If we size the chunks at 3 items, that won't be enough (8 threads times 3 items covers only 24 of the 26 elements), so again we need ⌈l/t⌉ = 4 items per chunk. Here is the result:
⬛⬛⬛⬛🟫🟫🟫🟫🟥🟥🟥🟥🟧🟧🟧🟧🟨🟨🟨🟨🟩🟩🟩🟩🟦🟦
Beautiful. Except you may have noticed something missing. There are no purple squares. Thread 6 is a little lazy, doing only 2 items instead of 4, but thread 7 is doing absolutely nothing! Its chunk would start at index 7 × 4 = 28, which is already past the end of the 26-element array, so it has somehow managed to fall off the end entirely.
Unavoidably, some threads must be idle for a total of ⌈l/t⌉t - l = 6 loops. This is the conserved total amount of idleness. With the first method, the idleness is spread out across threads. Mathematically, the total idleness can be no greater than t-1 regardless of array length and thread count, and so each thread will be idle for at most 1 loop. But in the contiguous method, the idleness is concentrated in the last threads. There is nothing mathematically impossible about having ⌈l/t⌉t - l as big as ⌈l/t⌉ or bigger, and so it's possible for an entire thread to remain unused. Multiple threads, even. E.g. take l = 9:
⬛⬛🟫🟫🟥🟥🟧🟧🟨
3 full threads are unused there! Practically, this shouldn't actually be a problem, though. The number of serial loops is still the same, and the total number of idle loops is still the same. It's just distributed differently. The reasons to prefer the interleaved method to the contiguous method would be related to memory coalescing or bank conflicts. The issue of unused threads would be unimportant.
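To see that redistribution in numbers rather than colored squares, here is a small host-only sketch (plain C++, no GPU needed; the table layout is arbitrary) that counts how many elements each thread handles under each scheme, for the same l = 26 and t = 8 as the diagrams:
#include <cstdio>

int main() {
    const int l = 26, t = 8;
    const int chunksz = (l + t - 1) / t;  // ceiling of l/t = 4 loops per thread in both schemes

    printf("thread  interleaved  contiguous\n");
    for (int k = 0; k < t; k++) {
        // Interleaved: thread k touches k, k+t, k+2t, ... while they stay below l.
        int interleaved = 0;
        for (int i = k; i < l; i += t) interleaved++;

        // Contiguous: thread k touches indices [k*chunksz, (k+1)*chunksz) that are below l.
        int contiguous = 0;
        for (int i = k * chunksz; i < (k + 1) * chunksz; i++)
            if (i < l) contiguous++;

        printf("%6d  %11d  %10d\n", k, interleaved, contiguous);
    }
    printf("total idle loops: %d in both cases\n", chunksz * t - l);
    return 0;
}
It should report 4, 4, 3, 3, 3, 3, 3, 3 elements per thread for the interleaved scheme and 4, 4, 4, 4, 4, 4, 2, 0 for the contiguous one: the same six idle loops in total, just piled onto the last two threads.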
We don't always run into this effect. If l is a multiple of t, all threads are fully utilized. Also, we can guarantee that there are no unused threads once l grows past a certain maximal value. Namely, take l = (t-1)²; then ⌈(t-1)²/t⌉ = t-1, so the idleness is t(t-1) - (t-1)² = t-1, which is exactly the chunk size ⌈l/t⌉ = t-1, meaning the last thread is left with nothing to do. But if l is larger than this, one can show that the idleness ⌈l/t⌉t - l falls below the chunk size, and so all threads must be used at least a little bit.
So, if we're using t = 32 CUDA threads, then when the array size is 961 = 31², the contiguous processing method will leave thread 31 idle. And 961 is the largest array size for which that is true.
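As a quick numerical check on that claim, here is a brute-force host-side sketch (the scan bound of 2000 is arbitrary) that looks for the largest array length leaving thread 31 unused under the contiguous scheme:
#include <cstdio>

int main() {
    const int t = 32;
    int largest = -1;
    for (int l = 1; l <= 2000; l++) {   // 2000 is an arbitrary upper bound for the scan
        int chunksz = (l + t - 1) / t;  // ceiling of l/t
        // Thread t-1 is unused iff its chunk starts at or past the end of the array.
        if ((t - 1) * chunksz >= l) largest = l;
    }
    printf("largest l leaving thread %d unused: %d\n", t - 1, largest);
    return 0;
}
The answer it prints should come out to 961.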