Monday, November 18, 2019

How to nest for loops in CUDA?

I would like to ask for a complete example of CUDA code, one that includes everything someone might want to include, so that it can serve as a reference for people trying to write such code, such as myself.

My main concern is whether or not it is possible to process multiple for loops at the same time on different threads in the same block. To take a clear example, this is the difference between running a total of 2016 threads divided into blocks of 32 on case 3 of the example code, and running 1024 threads on each for loop; theoretically, with the code we have, we could run even fewer, taking off another 2 blocks, by running the for loops of the other cases under the same block. Otherwise, separate cases would primarily be used for processing separate tasks, such as a single for loop. Currently it appears that the CUDA code simply knows when to run in parallel.

// Note: rarely referenced, but it seems you can process if statements in parallel by block. I'd say that is the primary purpose of using more blocks instead of increasing the thread count per block at launch time, other than needing multiple SMs (Streaming Multiprocessors), which are capped at 2048 resident threads each (a single block is capped at 1024 threads). //
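As a rough illustration of what I mean (this sketch is mine, not taken from any sample; the kernel and array names are made up), a kernel could branch on blockIdx.x so that different blocks work on different loops at the same time, while all threads within one block take the same branch:

__global__ void twoTasks(int *outA, int *outB, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (blockIdx.x < gridDim.x / 2) {
        // the first half of the blocks handles the first loop
        int i = tid;
        if (i < n) outA[i] = i * 2;
    } else {
        // the second half of the blocks handles the second loop
        int i = tid - (gridDim.x / 2) * blockDim.x;
        if (i < n) outB[i] = i * 3;
    }
}

// launched e.g. as twoTasks<<<64, 32>>>(outA, outB, 1000);

Since the branch depends only on blockIdx.x, no warp ever has threads on both sides of the if, so the two loops really do proceed concurrently.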

If we have the following code, including for loops and a switch statement, then what would the code that optimizes parallelization be?

public static void main(String[] args) {

    doMath(3); // we want to process each statement in parallel. For this we use different blocks.
}

static void doMath(int question) {
    int[] x = new int[]{0,1,2,3,4,5,6,7,8,9};
    int[] y = new int[]{0,1,2,3,4,5,6,7,8,10};
    int[] z = new int[]{0,1,2,3,4,5,6,7,8,11};
    int[] w = new int[]{0,1,2,3,4,5,6,7,8,12};
    int[] q = new int[1000];
    int[] r = new int[1000];
    int[] v = new int[1000];
    int[] t = new int[1000];


    switch(question) {
        case 1: 
            for (int a = 0; a < x.length; a++) {
                for (int b = 0; b < y.length; b++) {
                    for (int c = 0; c < z.length; c++) {
                        q[(a*100)+(b*10)+(c)] = x[a] + y[b] + z[c];
                    }
                }
            }
        break;
        case 2: 
            for (int a = 0; a < x.length; a++) {
                for (int b = 0; b < y.length; b++) {
                    for (int c = 0; c < w.length; c++) {
                        r[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                    }
                }
            }
        break;
        case 3:
            for (int a = 0; a < x.length; a++) {
                for (int b = 0; b < z.length; b++) {
                    for (int c = 0; c < w.length; c++) {
                        v[(a*100)+(b*10)+(c)] = x[a] + z[b] + w[c];
                    }
                }
            }
            for (int a = 0; a < x.length; a++) {
                for (int b = 0; b < y.length; b++) {
                    for (int c = 0; c < w.length; c++) {
                        t[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                    }
                }
            }
        break;
    }
}

From the samples I have seen, the CUDA code would be as follows:

// 3 groups of blocks for the 3 switch cases: the third case requires 2000 threads to run in perfect parallel while the first two only require 1000 each. Blocks come in multiples of 32 threads (the warp size). The trick is to take the greatest common denominator of all cases, or if/else statements as the... case... may be, and apportion the required number of blocks to each case. In this example we would need 127 blocks of 32 threads: 1024 * 2 + (2048 - 32) = 4064 threads in total. //

// Side note: each Streaming Multiprocessor (SM) can only support 2048 resident threads, and 2048 / (# of threads per block) gives the number of blocks that can reside on one SM at a time. //
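For the block arithmetic itself, this is the ceiling division I am assuming on the host side; it is only a sketch, not taken from the samples:

int threadsPerBlock = 32;
int blocksFor1000 = (1000 + threadsPerBlock - 1) / threadsPerBlock; // 32 blocks = 1024 threads
int blocksFor2000 = (2000 + threadsPerBlock - 1) / threadsPerBlock; // 63 blocks = 2016 threads
// e.g. doMath<<<blocksFor1000 * 2 + blocksFor2000, threadsPerBlock>>>(...);

That is where the 2016 and 127 (= 32 + 32 + 63) figures above come from.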

#include <cuda_runtime.h>

__global__ void doMath(int question, int *x, int *y, int *z, int *w,
                       int *q, int *r, int *v, int *t);

int main() {

    int *x, *y, *z, *w, *q, *r, *v, *t;

    // unified (managed) memory is visible to both the host and the device
    cudaMallocManaged(&x, 10 * sizeof(int));
    cudaMallocManaged(&y, 10 * sizeof(int));
    cudaMallocManaged(&z, 10 * sizeof(int));
    cudaMallocManaged(&w, 10 * sizeof(int));
    cudaMallocManaged(&q, 1000 * sizeof(int));
    cudaMallocManaged(&r, 1000 * sizeof(int));
    cudaMallocManaged(&v, 1000 * sizeof(int));
    cudaMallocManaged(&t, 1000 * sizeof(int));

    // fill the inputs on the host: {0..8} plus a distinct last element
    for (int i = 0; i < 9; i++) { x[i] = i; y[i] = i; z[i] = i; w[i] = i; }
    x[9] = 9; y[9] = 10; z[9] = 11; w[9] = 12;

    doMath<<<127, 32>>>(3, x, y, z, w, q, r, v, t); // 127 blocks of 32 threads, question = 3

    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    cudaFree(z);
    cudaFree(w);
    cudaFree(q);
    cudaFree(r);
    cudaFree(v);
    cudaFree(t);
    return 0;
}

__global__
void doMath(int question, int *x, int *y, int *z, int *w,
            int *q, int *r, int *v, int *t) {

    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    switch(question) {
        case 1:
            // flatten the three nested loops (10 * 10 * 10 = 1000 combinations)
            // into one index; each thread handles i, i + stride, i + 2*stride, ...
            for (int i = index; i < 1000; i += stride) {
                int a = i / 100, b = (i / 10) % 10, c = i % 10;
                q[i] = x[a] + y[b] + z[c];
            }
            break;
        case 2:
            for (int i = index; i < 1000; i += stride) {
                int a = i / 100, b = (i / 10) % 10, c = i % 10;
                r[i] = x[a] + y[b] + w[c];
            }
            break;
        case 3:
            // the two independent nests of case 3 run one after the other here;
            // within each nest all 1000 elements are still computed in parallel
            for (int i = index; i < 1000; i += stride) {
                int a = i / 100, b = (i / 10) % 10, c = i % 10;
                v[i] = x[a] + z[b] + w[c];
            }
            for (int i = index; i < 1000; i += stride) {
                int a = i / 100, b = (i / 10) % 10, c = i % 10;
                t[i] = x[a] + y[b] + w[c];
            }
            break;
    }
}
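If the goal really is to run both nests of case 3 at the same time (the 2000-threads-in-perfect-parallel idea from the comment above), my understanding is that a single launch can split the flattened index range between them. This is only a sketch of that idea (the kernel name doCase3 is made up), not code I have tested:

__global__ void doCase3(int *x, int *y, int *z, int *w, int *v, int *t) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // global indices 0..999 fill v, indices 1000..1999 fill t,
    // so both nests proceed concurrently in one launch
    for (int i = index; i < 2000; i += stride) {
        int j = (i < 1000) ? i : i - 1000;   // position within one nest
        int a = j / 100, b = (j / 10) % 10, c = j % 10;
        if (i < 1000)
            v[j] = x[a] + z[b] + w[c];       // first nest of case 3
        else
            t[j] = x[a] + y[b] + w[c];       // second nest of case 3
    }
}

// launched e.g. as doCase3<<<63, 32>>>(x, y, z, w, v, t); // 63 * 32 = 2016 threads

Whether this actually beats running the two 1000-element loops back to back, as in case 3 above, is exactly the part I am unsure about.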
