The First Step to CUDA

Introduction

As is well known, GPUs are especially suitable for massive data-parallel computation. Today the GPU plays a vital role in machine learning (e.g. deep learning), HPC, and graphics. With CUDA, a widely used platform developed by Nvidia that gives direct access to the GPU hardware, we can solve many complex computational problems more efficiently than on a CPU. But in order to write good CUDA programs, there are some important concepts we need to understand, such as blocks, on-chip caches, shared memory on the GPU, and so on. Below, I will list these key concepts and give explanations and code examples.

A Tour of Different NVIDIA GPUs

//TODO some comparison between different Nvidia GPU hardware.

A Hello World Example

This CUDA program prints "Hello world" from different threads in different blocks. The launch configuration uses a two-dimensional grid of blocks and three-dimensional blocks of threads.

#include <stdio.h>
#include <stdlib.h>  // EXIT_SUCCESS
#include <assert.h>
// CUDA runtime
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <helper_cuda.h>
#ifndef MAX
#define MAX(a,b) (a > b ? a : b)
#endif
__global__ void testKernel()
{
    printf("[%d, %d]: Hello world\n",
           blockIdx.y * gridDim.x + blockIdx.x,
           threadIdx.z * blockDim.x * blockDim.y
               + threadIdx.y * blockDim.x + threadIdx.x);
}

int main(int argc, char **argv)
{
    int devID;
    cudaDeviceProp props;

    // This will pick the best possible CUDA capable device
    devID = findCudaDevice(argc, (const char **)argv);

    // Get GPU information
    checkCudaErrors(cudaGetDevice(&devID));
    checkCudaErrors(cudaGetDeviceProperties(&props, devID));
    printf("Device %d: \"%s\" with Compute %d.%d capability\n",
           devID, props.name, props.major, props.minor);

    printf("printf() is called. Output:\n\n");

    // Kernel configuration, where a two-dimensional grid and
    // three-dimensional blocks are configured.
    dim3 dimGrid(2, 2);
    dim3 dimBlock(2, 2, 2);
    testKernel<<<dimGrid, dimBlock>>>();
    cudaDeviceSynchronize();

    // cudaDeviceReset causes the driver to clean up all state. While
    // not mandatory in normal operation, it is good practice. It is also
    // needed to ensure correct operation when the application is being
    // profiled. Calling cudaDeviceReset causes all profile data to be
    // flushed before the application exits
    cudaDeviceReset();
    return EXIT_SUCCESS;
}

What Is a Kernel?

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

A kernel must be declared with the __global__ qualifier, which marks the function as one that executes on the device in CUDA threads. To launch a kernel from host code, you use the <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector C:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
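The ellipses above hide the host-side setup. A fuller sketch of one possible version (my own elaboration, with error checking omitted for brevity and N chosen arbitrarily) allocates device memory with cudaMalloc, copies the inputs over, launches the kernel, and copies the result back:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;                // arbitrary example size
    size_t size = N * sizeof(float);

    // Host buffers
    float h_A[N], h_B[N], h_C[N];
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = 2.0f * i; }

    // Device buffers
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy the inputs to the device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // One block of N threads, as in the snippet above
    VecAdd<<<1, N>>>(d_A, d_B, d_C);

    // Copy the result back and spot-check one element:
    // h_C[10] should be h_A[10] + h_B[10] = 10 + 20 = 30
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[10] = %f\n", h_C[10]);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

Note that device pointers returned by cudaMalloc must not be dereferenced on the host; data moves between the two address spaces only through cudaMemcpy.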

Get More Example Source Code

You can run cuda-install-samples-8.0.sh [installing path] on the command line to copy a large set of useful sample source files, which are a good way to teach yourself how to improve your own CUDA programs.