Mergesort CUDA implementation


In this post I have implemented Mergesort on the GPU with CUDA. If you don’t know, CUDA is a programming model designed for using your GPU for parallel computation. GPUs are optimized for running the same instructions on many threads at the same time.

Algorithm

Merge Sort is a Divide and Conquer algorithm. It divides the input array into two halves, calls itself for each half, and then merges the two sorted halves. The merge() function is used for merging the two halves: merge(arr, l, m, r) is the key step, which assumes that arr[l..m] and arr[m+1..r] are sorted and merges the two sorted sub-arrays into one. See the C sketch of merge() after the pseudocode for details.

MergeSort(arr[], l,  r)
If r > l
     1. Find the middle point to divide the array into two halves:  
             middle m = (l+r)/2
     2. Call mergeSort for first half:   
             Call mergeSort(arr, l, m)
     3. Call mergeSort for second half:
             Call mergeSort(arr, m+1, r)
     4. Merge the two halves sorted in step 2 and 3:
             Call merge(arr, l, m, r)
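Since merge() does the real work, here is a plain C sketch of it (host-side, using a temporary buffer; this is only an illustration of the step above — the GPU version later in the post works differently, ping-ponging between two shared-memory buffers):

#include <stdlib.h>

/* Merge the sorted sub-arrays arr[l..m] and arr[m+1..r] into arr[l..r] */
void merge(float arr[], int l, int m, int r)
{
    int n = r - l + 1;
    float *tmp = (float *)malloc(n * sizeof(float));
    int i = l, j = m + 1, k = 0;

    while (i <= m && j <= r)        /* pick the smaller head element */
        tmp[k++] = (arr[i] <= arr[j]) ? arr[i++] : arr[j++];
    while (i <= m)                  /* copy whatever is left of the left half */
        tmp[k++] = arr[i++];
    while (j <= r)                  /* copy whatever is left of the right half */
        tmp[k++] = arr[j++];

    for (k = 0; k < n; k++)         /* write the merged result back */
        arr[l + k] = tmp[k];
    free(tmp);
}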

 

Code Implementation

The three main functions used in the sort are:

The first is Device_Merge, the kernel (a __global__ function). It takes a pointer to a list of floats, the length of the list, and the number of list elements given to each thread, and puts the list into sorted order in place.

__global__ void Device_Merge(float *d_list, int length, int elementsPerThread){

    int my_start, my_end; //indices of each thread's start/end

    //Declare counters required for the iterative mergesort
    int left_start, right_start; //Start index of the two runs being merged
    int old_left_start;
    int left_end, right_end; //End index of the two runs being merged
    int headLoc; //current location of the write head on the new list
    short current_list = 0; //Which of the two buffers holds the most up-to-date data

    //Allocate enough shared memory for this block's list
    //(SHARED is the size in bytes of one buffer, defined in the complete code)
    __shared__ float tempList[2][SHARED/sizeof(float)];

    //Load this thread's elements from global memory into shared memory
    int index = (blockIdx.x*blockDim.x + threadIdx.x)*elementsPerThread;
    for (int i = 0; i < elementsPerThread; i++){
        if (index + i < length){
            tempList[current_list][elementsPerThread*threadIdx.x + i] = d_list[index + i];
        }
    }

    //Wait until all memory has been loaded
    __syncthreads();

    //Merge runs of doubling length within this thread's slice
    for (int walkLen = 1; walkLen < elementsPerThread; walkLen *= 2) {
        //Set up start and end indices of this thread's slice
        my_start = elementsPerThread*threadIdx.x;
        my_end = my_start + elementsPerThread;
        left_start = my_start;

        while (left_start < my_end) {
            old_left_start = left_start; //left_start will be incremented soon

            left_end = left_start + walkLen;
            if (left_end > my_end) {
                left_end = my_end;
            }

            right_start = left_end;
            right_end = right_start + walkLen;
            if (right_end > my_end) {
                right_end = my_end;
            }

            //Merge this pair of runs into the other buffer,
            //writing back to the same positions
            headLoc = left_start;
            solve(tempList, current_list, left_start, left_end, right_start, right_end, headLoc);

            left_start = old_left_start + 2*walkLen;
        }
        //Every pair of this pass was written to the other buffer
        current_list = !current_list;
    }

    //Wait until all threads complete their merges; otherwise a race condition
    //could write unsorted values back to d_list
    __syncthreads();

    //Copy the sorted data back to global memory
    for (int i = 0; i < elementsPerThread; i++){
        if (index + i < length){
            d_list[index + i] = tempList[current_list][elementsPerThread*threadIdx.x + i];
        }
    }
}

The second is MergeSort, the host function called from main(). It takes a pointer to a list of floats, the length of the list, the number of threads per block, and the number of blocks on which to execute. It allocates space for the device copy, copies the input to the device, launches the Device_Merge kernel on the GPU, and copies the sorted result back to the host.

void MergeSort(float *h_list, int len, int threadsPerBlock, int blocks) {

 //device copy
 float *d_list;
 //Allocate space for device copy
 cudaMalloc((void **) &d_list, len*sizeof(float));
 //copy input to device
 cudaMemcpy(d_list, h_list, len*sizeof(float), cudaMemcpyHostToDevice);

 int elementsPerThread = ceil(len/float(threadsPerBlock*blocks));

 // Launch a Device_Merge kernel on GPU
 Device_Merge<<<blocks, threadsPerBlock>>>(d_list, len, elementsPerThread);

 cudaMemcpy(h_list, d_list, len*sizeof(float), cudaMemcpyDeviceToHost);
 cudaFree(d_list);

}
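For context, a minimal driver for this host function might look like the following; the list size, thread count, block count and random fill below are illustrative assumptions, not part of the original program:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int len = 1 << 20;              //1M floats, just as an example
    const int threadsPerBlock = 512;
    const int blocks = 64;

    float *h_list = (float *)malloc(len * sizeof(float));
    for (int i = 0; i < len; i++)
        h_list[i] = (float)(rand() % 1000);   //fill with arbitrary values

    MergeSort(h_list, len, threadsPerBlock, blocks);

    printf("first: %f  last: %f\n", h_list[0], h_list[len - 1]);
    free(h_list);
    return 0;
}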

The third is solve(), a __device__ function: it can only be executed on the device and can only be called from a kernel or another device function. It merges one pair of sorted runs from the current buffer into the other buffer.

__device__ void solve(float tempList[2][SHARED/sizeof(float)], short current_list,
                      int left_start, int left_end, int right_start, int right_end,
                      int headLoc)
{
    //Merge the two sorted runs [left_start, left_end) and [right_start, right_end)
    //from the current buffer into the other buffer, starting at headLoc
    while (left_start < left_end && right_start < right_end) {
        if (tempList[current_list][left_start] < tempList[current_list][right_start]) {
            tempList[!current_list][headLoc++] = tempList[current_list][left_start++];
        }
        else {
            tempList[!current_list][headLoc++] = tempList[current_list][right_start++];
        }
    }

    //One of the runs is now empty; copy whatever remains of the other one
    while (left_start < left_end) {
        tempList[!current_list][headLoc++] = tempList[current_list][left_start++];
    }
    while (right_start < right_end) {
        tempList[!current_list][headLoc++] = tempList[current_list][right_start++];
    }
}

(For complete code press here.)

Performance:

The following programs were tested in Nsight Eclipse, the Eclipse IDE for C/C++ bundled with the CUDA libraries. The results were measured on a system with a GPU of compute capability 1.2 (refer to the chart for more details about compute capability versions).

Sorting 1 million integers 

C++ STL library function execution time
sort() -> 0.738 seconds
My CUDA mergesort, with 512 threads
./mergesort -> 1.871 seconds

While I didn’t really expect to beat the standard library sort, I was curious which parts of my program took the longest, so I timed them:

parse argv 16 microseconds
cudaMemcpy list back to the host 1.28 seconds
cudaMalloc device lists 122 milliseconds
cudaMemcpy list to device 4.47 milliseconds
cudaMalloc metadata 122 microseconds
cudaMemcpy metadata to device 1.15 milliseconds
call mergesort kernel 39 microseconds
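The post does not show how these numbers were collected; one common way to time the device-side steps is with CUDA events. A sketch (not the actual instrumentation used above):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
Device_Merge<<<blocks, threadsPerBlock>>>(d_list, len, elementsPerThread);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              //wait until the kernel has actually finished
cudaEventElapsedTime(&ms, start, stop);  //elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);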

You could have guessed it: most of the time was spent in I/O, and after that in just moving memory around. The actual sorting took comparatively little time, which is not too bad at all.

FFT Convolver

FFT Convolver is a C++ library for highly efficient convolution of audio data. FFT convolution is a DSP technique built on two important ideas: the overlap-add method and the Fast Fourier Transform. The overlap-add method breaks long signals into smaller segments for easier processing; FFT convolution uses it together with the Fast Fourier Transform, allowing signals to be convolved by multiplying their frequency spectra.

CUDA program for Matrix Multiplication

In this post on “CUDA program for multiplication of two matrices”, I have considered two cases:
1. Two dimensional blocks and one thread per block.
2. One block and two dimensional threads in that block.

The following programs were tested in Nsight Eclipse, the Eclipse IDE for C/C++ bundled with the CUDA libraries. The results were measured on a system with a GPU of compute capability 1.2 (refer to the chart for more details about compute capability versions).

1. Two dimensional blocks and one thread per block.

#include <stdio.h>

//Matrix dimensions: the prompts below assume a 2*3 and a 3*2 matrix
#define row1 2
#define col1 3
#define row2 3
#define col2 2

__global__ void matproduct(int *l,int *m, int *n)
{
    int x=blockIdx.x;
    int y=blockIdx.y;
    int k;
    n[col2*y+x]=0;
    for(k=0;k<col1;k++)
    {
        n[col2*y+x]=n[col2*y+x]+l[col1*y+k]*m[col2*k+x];
    }
}
int main()
{
    int a[row1][col1];
    int b[row2][col2];
    int c[row1][col2];
    int *d,*e,*f;
    int i,j;

    printf("\n Enter elements of first matrix of size 2*3\n");
    for(i=0;i<row1;i++)
    {
        for(j=0;j<col1;j++)
            {
                scanf("%d",&a[i][j]);
            }
    }
    printf("\n Enter elements of second matrix of size 3*2\n");
        for(i=0;i<row2;i++)
        {
            for(j=0;j<col2;j++)
                {
                    scanf("%d",&b[i][j]);
                }
        }

    cudaMalloc((void **)&d,row1*col1*sizeof(int));
    cudaMalloc((void **)&e,row2*col2*sizeof(int));
    cudaMalloc((void **)&f,row1*col2*sizeof(int));

    cudaMemcpy(d,a,row1*col1*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(e,b,row2*col2*sizeof(int),cudaMemcpyHostToDevice);

    dim3 grid(col2,row1);
    /* Here we are defining a two dimensional grid (collection of blocks).
       Syntax is dim3 grid(no. of columns, no. of rows) */

    matproduct<<<grid,1>>>(d,e,f);

    cudaMemcpy(c,f,row1*col2*sizeof(int),cudaMemcpyDeviceToHost);
    printf("\nProduct of two matrices:\n ");
    for(i=0;i<row1;i++)
    {
        for(j=0;j<col2;j++)
        {
              printf("%d\t",c[i][j]);
        }
        printf("\n");
    }

    cudaFree(d);
    cudaFree(e);
    cudaFree(f);

    return 0;
}

output:

 Enter elements of first matrix of size 2*3
1 2 3 4 5 6

 Enter elements of second matrix of size 3*2
7 8 9 10 11 12

  Product of two matrices:
 58    64    
139    154

 

2. One block and two dimensional threads in that block.

#include <stdio.h>

//Matrix dimensions: the prompts below assume a 2*3 and a 3*2 matrix
#define row1 2
#define col1 3
#define row2 3
#define col2 2

__global__ void matproduct(int *l,int *m, int *n)
{
    int x=threadIdx.x;
    int y=threadIdx.y;

    int k;

n[col2*y+x]=0; 
  for(k=0;k<col1;k++)
   {
    n[col2*y+x]=n[col2*y+x]+l[col1*y+k]*m[col2*k+x];
   }
}
int main()
{
    int a[row1][col1];
    int b[row2][col2];
    int c[row1][col2];
    int *d,*e,*f;
    int i,j;

    printf("\n Enter elements of first matrix of size 2*3\n");
    for(i=0;i<row1;i++)
    {
        for(j=0;j<col1;j++)
            {
                scanf("%d",&a[i][j]);
            }
    }
    printf("\n Enter elements of second matrix of size 3*2\n");
        for(i=0;i<row2;i++)
        {
            for(j=0;j<col2;j++)
                {
                    scanf("%d",&b[i][j]);
                }
        }

    cudaMalloc((void **)&d,row1*col1*sizeof(int));
    cudaMalloc((void **)&e,row2*col2*sizeof(int));
    cudaMalloc((void **)&f,row1*col2*sizeof(int));

    cudaMemcpy(d,a,row1*col1*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(e,b,row2*col2*sizeof(int),cudaMemcpyHostToDevice);

    dim3 threadBlock(col2,row1);
    /* Here we are defining a two dimensional block of threads.
       Syntax is dim3 threadBlock(no. of columns, no. of rows) */

    matproduct<<<1,threadBlock>>>(d,e,f);

    cudaMemcpy(c,f,row1*col2*sizeof(int),cudaMemcpyDeviceToHost);

    printf("\nProduct of two matrices:\n ");
    for(i=0;i<row1;i++)
    {
        for(j=0;j<col2;j++)
        {
              printf("%d\t",c[i][j]);
        }
        printf("\n");
    }

    cudaFree(d);
    cudaFree(e);
    cudaFree(f);

    return 0;
}

output:

 Enter elements of first matrix of size 2*3
1 2 3 4 5 6

 Enter elements of second matrix of size 3*2
7 8 9 10 11 12

  Product of two matrices:
 58    64    
139    154


Getting started with CUDA Programming


What is CUDA?

CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and programming model developed by Nvidia for general-purpose computing on its GPUs, and it can be used from C or C++. CUDA is specific to Nvidia, which means you cannot use CUDA with graphics cards (GPUs) from other companies such as AMD or Intel.

Host and Device

  •        Host: The CPU and its memory (host memory) 
  •        Device: The GPU and its memory (device memory)

Kernel

A function to be executed on the GPU is called a kernel. When defining a kernel, the function is prefixed with the keyword __global__.
e.g.

 __global__ void add(int *a, int *b, int *c)
{  
    *c = *a + *b; 
}

__global__ is a CUDA keyword meaning

  • add() will execute on the device 
  • add() will be called from the host

add() runs on the device, so a, b and c must point to device memory. We need to allocate memory on the GPU.

Note: Only functions that are launched from host code (e.g. from the main() function) are prefixed with __global__ in their definition, while functions that are called from a kernel (or from another device function) are prefixed with __device__ in their definition.
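For example, a small sketch of the distinction (the function names here are only illustrative):

__device__ int square(int x)              //callable from device code only
{
    return x * x;
}

__global__ void squareAll(int *a, int n)  //launched from the host with <<<...>>>
{
    int i = threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);              //a kernel calling a __device__ function
}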

Streaming Multiprocessor (SM) and Streaming Processor (SP)

           A GPU consists of smaller components called Streaming Multiprocessors (SMs). Each SM consists of many Streaming Processors (SPs), on which the actual computation is done. Each SP is also called a CUDA core.
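If you are curious how many SMs your own GPU has, you can query it with cudaGetDeviceProperties(); a small sketch (the number of SPs per SM depends on the architecture and is not reported directly):

#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    //properties of device 0

    printf("GPU: %s\n", prop.name);
    printf("Number of SMs: %d\n", prop.multiProcessorCount);
    return 0;
}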

Terminology: Thread, Block, Grid

             A thread is a single instance of execution. On one SP, one or more threads can be executed. A group of threads is called a Block. On one SM, one or more blocks can be executed. A group of blocks is called a Grid.

Simple Processing Flow

(Figure: CUDA simple processing flow)

  • Copy input data from CPU memory to GPU memory.
  • Load GPU program and execute, caching data on a chip for performance.
  • Copy results from GPU memory to CPU memory.

Addition on the Device:

__global__ void add(int *a, int *b, int *c) 
{ 
   *c = *a + *b; 
}

int main(void) {

int a, b, c; // host copies of a, b, c 
int *d_a, *d_b, *d_c; // device copies of a, b, c 
int size = sizeof(int); 

// Allocate space for device copies of a, b, c 
cudaMalloc((void **)&d_a, size); 
cudaMalloc((void **)&d_b, size); 
cudaMalloc((void **)&d_c, size); 

// Setup input values 
a = 2; 
b = 7;

// Copy inputs to device 
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice); 
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice); 

// Launch add() kernel on GPU 
add<<<1,1>>>(d_a, d_b, d_c); 

// Copy result back to host 
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost); 

// Cleanup 
cudaFree(d_a); 
cudaFree(d_b); 
cudaFree(d_c); 

return 0; 
}

RUNNING IN PARALLEL

GPU computing is about massive parallelism.

  • Instead of executing add() once, add<<< 1, 1 >>>(); 
  • execute N times in parallel, add<<< 1, N>>>(); 

Vector Addition on the Device

With add() running in parallel we can do vector addition

#define N 512
__global__ void add(int *a, int *b, int *c) 
{ 
      c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; 
}
int main(void) 
{ 
int *a, *b, *c; // host copies of a, b, c 
int *d_a, *d_b, *d_c; // device copies of a, b, c 
int size = N * sizeof(int); 

// Alloc space for device copies of a, b, c 
cudaMalloc((void **)&d_a, size); 
cudaMalloc((void **)&d_b, size); 
cudaMalloc((void **)&d_c, size); 

// Alloc space for host copies of a, b, c and setup input values 
a = (int *)malloc(size); random_ints(a, N); 
b = (int *)malloc(size); random_ints(b, N); 
c = (int *)malloc(size);

// Copy inputs to device 
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); 
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); 

// Launch add() kernel on GPU with one block of N threads 
add<<<1,N>>>(d_a, d_b, d_c); 

// Copy result back to host 
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); 

// Cleanup 
free(a); free(b); free(c); 
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); 

return 0; 
}
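random_ints() is not a CUDA or standard library function; it is assumed here to be a small helper that fills an array with random values, for example:

#include <stdlib.h>

void random_ints(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = rand() % 100;   //any range of values will do for this example
}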

Quicksort using OPENMP

In this post, we will see how to parallelize the Quicksort algorithm using OpenMP. 
              In Quicksort, the input is any sequence of numbers and the output is a sorted array. QuickSort is a Divide and Conquer algorithm: it picks an element as the pivot and partitions the given array around that pivot. We start with one number, usually the first, and find its position in the sorted array. Then we divide the array into two parts around the pivot element’s position. In each part, separately, we pick new pivots. This process continues until all numbers are processed and we get the sorted array. 

/* low  --> Starting index,  high  --> Ending index */
quickSort(arr[], low, high)
{
    if (low < high)
    {
        /* pi is partitioning index, arr[p] is now
           at right place */
        pi = partition(arr, low, high);

        quickSort(arr, low, pi - 1);  // Before pi
        quickSort(arr, pi + 1, high); // After pi
    }
}
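The partition() used above is not shown in the post; a plain C sketch that matches the description (first element as pivot, returning the pivot’s final index) could look like this:

int partition(int arr[], int low, int high)
{
    int pivot = arr[low];                 /* first element as pivot */
    int i = low;

    for (int j = low + 1; j <= high; j++) {
        if (arr[j] < pivot) {             /* move smaller elements to the left side */
            i++;
            int t = arr[i]; arr[i] = arr[j]; arr[j] = t;
        }
    }
    /* put the pivot into its final sorted position and return that index */
    int t = arr[low]; arr[low] = arr[i]; arr[i] = t;
    return i;
}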

             In parallel Quicksort, we divide the array into two parts by partitioning it around the first pivot, and each part is processed by an independent thread, i.e. different threads find the pivot elements in each part recursively and sort the parts independently.

Array     |--------------------------- unsorted ---------------------------|

Array     |------------ Thread 1 ------------|PE|------------ Thread 2 ------------|

Here PE stands for Pivot Element.

int j = partition(arr,0,n-1);
//here j is my first Pivot Element
#pragma omp parallel sections
{
      #pragma omp section
      {
           quicksort(arr,0, j - 1);//Thread 1
      }
      #pragma omp section
      {
           quicksort(arr, j + 1, n-1);//Thread2
      }
}

If we use #pragma omp parallel sections inside the quicksort function, the total number of threads will be equal to the total number of calls to the quicksort function. The code is correct, but two threads are created and destroyed in each call, which creates a lot of overhead and makes it slower than the serial code, as in the sketch below.
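A sketch of the structure being described (not recommended, shown only to make the overhead argument concrete):

void quicksort(int arr[], int low, int high)
{
    if (low < high) {
        int pi = partition(arr, low, high);

        /* A new team of threads is set up at EVERY level of recursion:
           correct, but the per-call thread management overhead makes it slow. */
        #pragma omp parallel sections
        {
            #pragma omp section
            quicksort(arr, low, pi - 1);
            #pragma omp section
            quicksort(arr, pi + 1, high);
        }
    }
}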

(For complete code press here.)

So we create only two threads instead of too many, and we get a performance gain over the serial version.

(For complete code press here.)

The results obtained clearly show that the parallel operations are faster.


 

Array size     Serial (in sec)     Parallel (in sec)
10000          0.0024              0.0013
100000         0.2300              0.1885
1000000        2.26379             1.4734
10000000       24.2066             18.8934

Getting started with OPENMP

In this post, we will see a simple example of creating threads, how to create a required number of threads, how to parallelize a “for” loop, and how to allocate different work to different threads.

 

1.A simple example of OPENMP:

#pragma omp parallel
{
    printf("Hello World\n");
}

This will print “Hello World” multiple times, depending on how many threads have been created. By default, the number of threads created is equal to the number of processor cores. 

output:

Hello World
Hello World
Hello World
Hello World

(For your convenience, I’ve placed the code here.)
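If you want to verify how many threads were actually created, omp_get_num_threads() reports the size of the current team; a small sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single   /* let only one thread do the printing */
        printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}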

2.Create a required number of threads:

#pragma omp parallel num_threads(3)
{
    printf("Hello World\n");
}

If we want to create a specific number of threads, use num_threads() with the number of threads to be created. In the above example, three threads will be created. Each one will print “Hello World”.

output:

Hello World
Hello World
Hello World

(For your convenience, I’ve placed the code here.)

3. Parallelize “for” loop using OPENMP:

int a[100000];
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
    a[i] = 2 * i;
}

#pragma omp parallel for tells the compiler to auto-parallelize the for loop. In the above code, since we are not specifying the number of threads, the number of threads will be equal to the number of cores by default.

(For your convenience, I’ve placed the code here.)

4.Allocate different work to a different thread:

  #pragma omp parallel sections
   {
     #pragma omp section
      {
        int tid;
        tid=omp_get_thread_num();
        printf("X printed by thread with id=%d\n",tid);
      }
     #pragma omp section
      {
        int tid;
        tid=omp_get_thread_num();
        printf("Y printed by thread with id=%d\n",tid);
      }
     #pragma omp section
      {
        int tid;
        tid=omp_get_thread_num();
        printf("Z printed by thread with id=%d\n",tid);
      }
   }

I have used sections to allocate different work to different threads. We have to use #pragma omp parallel sections. I have provided a simple example, which creates three threads, each doing a distinct job: the first thread prints ‘X’, the second ‘Y’, and the third ‘Z’. I am also printing the thread id, so that you can easily see that the threads are different and doing independent jobs. omp_get_thread_num() is used to get the thread id.

output:

Y printed by thread with id=0
X printed by thread with id=3
Z printed by thread with id=1

(For your convenience, I’ve placed the code here.)

How To Run: 
To Compile:
gcc -fopenmp code.c   
To Run:
./a.out

References:

Wikipedia | Getting started with openmp(intel)