Sunday, 10 May 2020

programming defect with __syncthreads()? [duplicate]

In a previous post I described the strange behaviour of a code that works on some architecture/CUDA combinations but hangs on others. @Robert Crovella suggested that the root cause was likely a programming defect elsewhere in the code, which leads to the hang.

I spent some time digging into our code and found a kernel with the following structure. I removed the body of the kernel for clarity; the dummy kernel below reproduces exactly the same behaviour as our code.

 #include <stdio.h>
 #include <stdlib.h>
 #include <thrust/device_vector.h>

 __global__ void kernel_test(int *ppc){

   int tx = threadIdx.x;
   int bx = blockIdx.x;

   if (ppc[bx] > 1) {

     printf("Inside first if condition. block %i thread %i\n", bx, tx);

     if (tx < ppc[bx]) {

       printf("Inside second if condition. block %i thread %i\n", bx, tx);

       __syncthreads();

     } // if tx < ppc

     __syncthreads();

   } // if ppc > 1

 }

int main(){

    // Threads per block and number of blocks
    int tpb, nbblocks;

    // Declare thrust device vector
    thrust::device_vector<int> ppc;

    // Declare pointer to first address of device vector ppc
    int *raw_ppc;
    nbblocks = 3;
    tpb = 32;

    // Set size of ppc vector
    ppc.resize(nbblocks);

    // Get the address of the first element of ppc on the device
    raw_ppc = thrust::raw_pointer_cast(&ppc[0]);

    // Fill ppc vector with values. Here 25.
    thrust::fill(ppc.begin(), ppc.end(), 25);

    kernel_test<<<nbblocks,tpb>>>(raw_ppc);

    cudaDeviceSynchronize();

    return 0;

}

It turns out that this code runs fine on a K80 with CUDA 10.0 but hangs on a V100 with CUDA 10.1. The hang is clearly related to __syncthreads(): if the first __syncthreads() (the one inside the inner if) is commented out, the code runs on the V100 with CUDA 10.1.

Could someone explain to me why this is a programming defect, and why it works on some architecture/CUDA combinations but not on others?
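
For context, the CUDA C++ Programming Guide allows __syncthreads() in conditional code only if the condition evaluates identically across the entire thread block. In the kernel above, ppc[bx] > 1 is the same for every thread of a block, but tx < ppc[bx] is not: with ppc = 25 and 32 threads per block, threads 25 to 31 never reach the inner barrier. A likely reason for the difference between GPUs is that, starting with Volta, __syncthreads() is enforced per thread, whereas earlier architectures could only enforce it per warp, so a barrier could succeed as long as some thread of each warp reached it. Below is a minimal sketch of one possible restructuring, assuming the per-thread work can be separated from the barrier. The names kernel_test_fixed, active_count and count_active_threads are hypothetical and not part of the original code, and the atomicAdd is only a stand-in for the real kernel body.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sketch only: same branching as kernel_test, but every barrier sits where
// all threads of the block are guaranteed to reach it.
__global__ void kernel_test_fixed(const int *ppc, int *active_count) {
    int tx = threadIdx.x;
    int bx = blockIdx.x;

    // ppc[bx] is identical for every thread of block bx, so this branch is
    // block-uniform and barriers directly inside it are safe.
    if (ppc[bx] > 1) {
        // Thread-dependent work only; no barrier inside this divergent branch.
        if (tx < ppc[bx]) {
            atomicAdd(active_count, 1);  // stand-in for the real kernel body
        }
        // Barrier moved out of the divergent branch: all threads arrive here.
        __syncthreads();

        // (a second phase of work would go here)
        __syncthreads();
    }
}

// Hypothetical host helper: launches the kernel with every ppc entry set to
// `value` and returns how many threads entered the inner branch (-1 on error).
int count_active_threads(int nbblocks, int tpb, int value) {
    std::vector<int> h_ppc(nbblocks, value);
    int *d_ppc = nullptr;
    int *d_count = nullptr;
    int h_count = -1;

    if (cudaMalloc(&d_ppc, nbblocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_count, sizeof(int)) != cudaSuccess) {
        cudaFree(d_ppc);
        return -1;
    }

    cudaMemcpy(d_ppc, h_ppc.data(), nbblocks * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    kernel_test_fixed<<<nbblocks, tpb>>>(d_ppc, d_count);

    // cudaDeviceSynchronize also surfaces errors raised by the launch.
    if (cudaDeviceSynchronize() == cudaSuccess) {
        cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_ppc);
    cudaFree(d_count);
    return h_count;
}
```

With the same launch configuration as above, count_active_threads(3, 32, 25) is expected to return 75 (25 threads per block enter the inner branch) instead of hanging, since no thread waits at a barrier that other threads of its block skip.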
