In a previous post, where I described a code that works on some architecture/CUDA combinations but hangs on others, @Robert Crovella suggested that the root cause was likely a programming defect elsewhere in the code, which leads to the hang.
I spent some time digging into our code and found a kernel with the following structure. I removed the body of the kernel for the sake of clarity; the dummy kernel below reproduces exactly the same behaviour as our code.
#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
__global__ void kernel_test(int *ppc){
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    if (ppc[bx] > 1) {
        printf("Inside first if condition. block %i thread %i\n", bx, tx);
        if (tx < ppc[bx]) {
            printf("Inside second if condition. block %i thread %i\n", bx, tx);
            __syncthreads();
        } // if tx < ppc
        __syncthreads();
    } // if ppc > 1
}
int main(){
    int tpb, nbblocks;
    // Declare thrust device vector
    thrust::device_vector<int> ppc;
    // Declare pointer to first address of device vector ppc
    int *raw_ppc;
    nbblocks = 3;
    tpb = 32;
    // Set size of ppc vector
    ppc.resize(nbblocks);
    // Get the address of the first element of ppc on the device
    raw_ppc = thrust::raw_pointer_cast(&ppc[0]);
    // Fill ppc vector with values. Here 25.
    thrust::fill(ppc.begin(), ppc.end(), 25);
    kernel_test<<<nbblocks,tpb>>>(raw_ppc);
    cudaDeviceSynchronize();
    return 0;
}
It turns out that this code runs fine on a K80 with CUDA 10.0 but hangs on a V100 with CUDA 10.1. Clearly, it has to do with the __syncthreads() calls: if the first __syncthreads() is commented out, the code works on the V100 with CUDA 10.1.
Could someone explain why this is a programming defect, and why it works on some architecture/CUDA combinations but not on others?
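For context on the question above: the CUDA Programming Guide allows __syncthreads() in conditional code only if the condition evaluates identically across the entire thread block. In kernel_test, ppc[bx] > 1 is block-uniform, but tx < ppc[bx] is not (with ppc[bx] = 25 and 32 threads per block, threads 25 to 31 skip the inner barrier). Pre-Volta GPUs such as the K80 could only enforce the barrier at warp granularity, which can let such code appear to work, while Volta enforces it per thread. The following is a minimal sketch, not the original kernel, of one way to restructure it so that every thread of a block reaches the same barrier. The names kernel_test_uniform, inner_count, and count_inner_threads are hypothetical, and the atomic counter only stands in for the removed kernel body.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>

// Hypothetical variant of kernel_test: the divergent work stays inside a
// conditional, but the barrier sits outside it, at block-uniform scope.
__global__ void kernel_test_uniform(const int *ppc, int *inner_count){
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    // ppc[bx] is the same for all threads of a block, so this flag is block-uniform.
    bool block_active = (ppc[bx] > 1);
    // This flag can differ between threads of the same block.
    bool thread_active = block_active && (tx < ppc[bx]);
    if (thread_active) {
        // Placeholder for the per-thread work (assumption: the real body goes here).
        // No barrier is placed inside this divergent branch.
        atomicAdd(inner_count, 1);
    }
    // Reached by every thread of the block, whatever the flags are.
    __syncthreads();
}

// Hypothetical host helper: launches the kernel and returns how many threads
// took the inner branch, or -1 if the kernel reported a CUDA error.
int count_inner_threads(int nbblocks, int tpb, int value){
    thrust::device_vector<int> ppc(nbblocks, value);
    thrust::device_vector<int> inner_count(1, 0);
    kernel_test_uniform<<<nbblocks, tpb>>>(
        thrust::raw_pointer_cast(ppc.data()),
        thrust::raw_pointer_cast(inner_count.data()));
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    return inner_count[0];  // copies the counter back to the host
}
```

Calling count_inner_threads(3, 32, 25) from main mirrors the launch in the question (3 blocks of 32 threads, 25 active threads per block). If the real kernel needs a barrier between two phases of the per-thread work, the same idea should apply: close the divergent branch, place the barrier at block-uniform scope, then reopen the branch.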