Introduction
As we all know, GPUs are especially well suited to massive data-parallel computation, and they now play a vital role in machine learning (e.g. deep learning), HPC, and graphics. With CUDA, a widely used platform developed by NVIDIA that lets programs use the GPU hardware directly, we can solve many complex computational problems far more efficiently than on a CPU. To write good CUDA programs, however, there are some important concepts we need to understand, such as blocks, on-chip caches, and shared memory on the GPU. Below, I will list these key concepts and give some explanations and code examples.
A Tour of Different NVIDIA GPUs
//TODO: a comparison between different NVIDIA GPU hardware.
A Hello World Example
This is a CUDA program that prints "Hello World" from different threads in different blocks. The grid is sliced in three dimensions at the block level and in two dimensions at the thread level.
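Below is a minimal sketch of such a program. The kernel name helloFromGPU and the launch configuration (a 2 x 2 x 2 grid of blocks, each block a 4 x 4 slice of threads) are illustrative assumptions; any small 3-D grid and 2-D block would do.

    #include <cstdio>

    // Every thread prints a greeting together with its block and thread coordinates.
    __global__ void helloFromGPU()
    {
        printf("Hello World from block (%d, %d, %d), thread (%d, %d)\n",
               blockIdx.x, blockIdx.y, blockIdx.z,
               threadIdx.x, threadIdx.y);
    }

    int main()
    {
        dim3 grid(2, 2, 2);   // 3-D slicing at the block level (assumed sizes)
        dim3 block(4, 4);     // 2-D slicing at the thread level (assumed sizes)

        helloFromGPU<<<grid, block>>>();

        // Block until the kernel finishes so the device-side printf output is flushed.
        cudaDeviceSynchronize();
        return 0;
    }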
What’s a Kernel?
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
A kernel must be declared with the __global__ qualifier, which marks the function as one that executes in CUDA threads on the device. To call a kernel from host code, you launch it with the <<<...>>> execution-configuration syntax, which specifies how many blocks and threads will run it. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector C:
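A minimal sketch of such a program is shown below, following the structure of the classic VecAdd example from the CUDA C Programming Guide. The vector length N = 256, the test data, and the single-block launch are assumptions chosen so that threadIdx.x alone can serve as the element index.

    #include <cstdio>

    // Kernel definition: thread i adds element i of A and B and stores it into C.
    __global__ void VecAdd(const float* A, const float* B, float* C)
    {
        int i = threadIdx.x;   // unique thread ID within the (only) block
        C[i] = A[i] + B[i];
    }

    int main()
    {
        const int N = 256;                  // assumed vector length, fits in one block
        const size_t size = N * sizeof(float);

        // Prepare some test data on the host.
        float h_A[N], h_B[N], h_C[N];
        for (int i = 0; i < N; ++i) { h_A[i] = float(i); h_B[i] = 2.0f * i; }

        // Allocate device memory and copy the inputs over.
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Kernel invocation with one block of N threads.
        VecAdd<<<1, N>>>(d_A, d_B, d_C);

        // Copy the result back and spot-check one element.
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        printf("C[10] = %f (expected %f)\n", h_C[10], h_A[10] + h_B[10]);

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        return 0;
    }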
Get More Example Source Code
You can run cuda-install-samples-8.0.sh [installing path] on the command line to copy a large set of useful sample source codes, which are a good way to teach yourself how to write better CUDA programs.
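For example, assuming you want the samples copied into ~/cuda-samples (the target path is arbitrary, and the subdirectory name below is the one CUDA 8.0 creates):

    cuda-install-samples-8.0.sh ~/cuda-samples
    cd ~/cuda-samples/NVIDIA_CUDA-8.0_Samples
    make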