The First Step to CUDA

Introduction

As is well known, GPUs are especially suitable for massive data-parallel computation. Today the GPU plays a vital role in machine learning (e.g. deep learning), HPC, and graphics. With CUDA, a widely used platform developed by Nvidia that gives direct access to the GPU hardware, we can solve many complex computational problems more efficiently than on a CPU. But in order to write good CUDA programs, there are some important concepts we need to understand, such as blocks, on-chip caches, shared memory on the GPU, and so on. Below, I will list these key concepts and give explanations and code examples.

A Tour of Different NVIDIA GPUs

//TODO some comparison between different Nvidia GPU hardware.

A Hello World Example

This CUDA program prints "Hello world" from different threads in different blocks. The launch configuration uses a two-dimensional grid of blocks and three-dimensional blocks of threads.

#include <stdio.h>
#include <stdlib.h>  // EXIT_SUCCESS
#include <assert.h>
// CUDA runtime
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <helper_cuda.h>
#ifndef MAX
#define MAX(a,b) (a > b ? a : b)
#endif
__global__ void testKernel()
{
    printf("[%d, %d]: Hello world\n",
           blockIdx.y * gridDim.x + blockIdx.x,
           threadIdx.z * blockDim.x * blockDim.y
               + threadIdx.y * blockDim.x + threadIdx.x);
}

int main(int argc, char **argv)
{
    int devID;
    cudaDeviceProp props;

    // This will pick the best possible CUDA capable device
    devID = findCudaDevice(argc, (const char **)argv);

    // Get GPU information
    checkCudaErrors(cudaGetDevice(&devID));
    checkCudaErrors(cudaGetDeviceProperties(&props, devID));
    printf("Device %d: \"%s\" with Compute %d.%d capability\n",
           devID, props.name, props.major, props.minor);

    printf("printf() is called. Output:\n\n");

    // Kernel configuration, where a two-dimensional grid and
    // three-dimensional blocks are configured.
    dim3 dimGrid(2, 2);
    dim3 dimBlock(2, 2, 2);
    testKernel<<<dimGrid, dimBlock>>>();
    cudaDeviceSynchronize();

    // cudaDeviceReset causes the driver to clean up all state. While
    // not mandatory in normal operation, it is good practice. It is also
    // needed to ensure correct operation when the application is being
    // profiled. Calling cudaDeviceReset causes all profile data to be
    // flushed before the application exits
    cudaDeviceReset();
    return EXIT_SUCCESS;
}

What Is a Kernel?

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

A kernel must be declared with the __global__ qualifier, which marks the function as one that executes on the device in CUDA threads. To launch a kernel from host code, you use the <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector C:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
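The ellipses above hide the host-side setup. A fuller sketch of one possible version (my own elaboration, with error checking omitted for brevity and N chosen arbitrarily) allocates device memory with cudaMalloc, copies the inputs over, launches the kernel, and copies the result back:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;                // arbitrary example size
    size_t size = N * sizeof(float);

    // Host buffers
    float h_A[N], h_B[N], h_C[N];
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = 2.0f * i; }

    // Device buffers
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy the inputs to the device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // One block of N threads, as in the snippet above
    VecAdd<<<1, N>>>(d_A, d_B, d_C);

    // Copy the result back and spot-check one element:
    // h_C[10] should be h_A[10] + h_B[10] = 10 + 20 = 30
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[10] = %f\n", h_C[10]);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

Note that device pointers returned by cudaMalloc must not be dereferenced on the host; data moves between the two address spaces only through cudaMemcpy.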

Get More Example Source Code

You can run cuda-install-samples-8.0.sh [installing path] on the command line to copy a large set of useful sample source files, which are a good way to teach yourself how to improve your own CUDA programs.