Saturday, March 8, 2014

A Brief Introduction to GPU Computing and CUDA


Graphics processing units (GPUs) are special-purpose hardware devices designed to efficiently perform the calculations involved in rendering computer graphics. They have a highly parallel architecture and contain hundreds of simple processing cores. This massive computational power can be used together with CPUs to improve the performance of compute-intensive applications. The practice of using GPUs alongside CPUs to boost application performance is known as GPU computing.
GPUs accelerate applications by exploiting the parallelism in them. A modern GPU consists of more than 1000 cores (the NVIDIA GeForce GTX 780 Ti has 2880) that run in parallel, so it can be programmed to execute a task using more than 10,000 threads at once. The main objective is to maximize the throughput of the complete task rather than to minimize the latency of each operation. Furthermore, to keep the cost of this massive parallelism down, GPU threads are designed to be lightweight and to have very low management overhead. With the massive fine-grained parallelism that GPUs provide, application throughput can grow immensely.
Achieving such a high degree of parallelism with CPUs is difficult because CPUs have a limited number of cores. These cores are designed to minimize the latency of each individual operation by executing it as fast as possible, in contrast to the parallel approach of GPUs.
GPU computing also helps to increase overall host system performance, since it offloads computation from the CPU and leaves the host with more time slices available for other processes. For suitably parallel workloads, GPUs can also deliver more performance per watt than CPUs, so using them can make computer systems more energy efficient and environmentally friendly.
One of the major advantages of using a large number of relatively slow cores, compared with a small number of very fast cores, is scalability. Systems with a small number of cores do not scale linearly, whereas many-core systems can scale almost linearly with the problem size when the problem has enough parallelism.
Another important prospect of GPU computing is that GPU performance is expected to keep increasing in line with Moore's law. As more transistors fit on a GPU, the number of cores grows and performance increases with it. This is not true for single-stream processors, since clock-speed increases have stalled because of limitations such as power consumption and heat generation. Modern CPUs are therefore also designed with more cores rather than higher clock speeds.
GPU computing is used to enhance performance in diverse application areas. These areas include:
  • Higher education and supercomputing
  • Oil and Gas industry
  • Defense intelligence
  • Computational finance
  • Computer-aided design (CAD) and computer-aided engineering (CAE)
  • Media and entertainment, among others
“SeqNFind” is a well-known sequence analysis tool in bioinformatics, and its expected speedup with the aid of GPUs is 400 times. The tool is promoted as more energy efficient and higher performing than its competitors on the basis of its GPU implementation. Many other real-world applications implemented on GPUs are already in practical use.

Compute Unified Device Architecture (CUDA)

CUDA is a parallel computing platform introduced by NVIDIA Corporation. It provides a set of extensions to the standard C/C++ languages that can be used to develop programs that run on GPUs manufactured by NVIDIA.
In CUDA terminology, the system in which the GPU is deployed is referred to as the “host”, whereas the GPU itself is referred to as the “device”. A typical CUDA program consists of parts that run on the host (the CPU) and parts that run in parallel on the device (the GPU). The CUDA extensions provide special keywords for declaring device data elements and device functions.
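As a minimal sketch of the host/device split and the declaration keywords, the program below declares one device variable with __device__ and one function with __host__ __device__, then copies a value from the host to the device and back. The names d_scale, scaled and roundTripScale are hypothetical and exist only for this illustration; error handling is reduced to a sentinel return value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device data element: __device__ places this variable in GPU global memory.
__device__ float d_scale;

// __host__ __device__ asks the compiler to build this function for both the
// CPU and the GPU. (scaled is a hypothetical helper used for illustration.)
__host__ __device__ float scaled(float x, float s) { return x * s; }

// Copies a value from the host into the device variable and reads it back.
// Returns the value read back, or -1.0f if a CUDA call fails.
float roundTripScale(float h_scale) {
    float h_back = 0.0f;
    if (cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float)) != cudaSuccess)
        return -1.0f;
    if (cudaMemcpyFromSymbol(&h_back, d_scale, sizeof(float)) != cudaSuccess)
        return -1.0f;
    return h_back;
}

int main() {
    float s = roundTripScale(2.5f);
    printf("value read back from the device: %f\n", s);
    printf("scaled(2.0, s) evaluated on the host: %f\n", scaled(2.0f, s));
    return 0;
}
```

The point of the sketch is that host and device have separate memories: the host cannot simply assign to d_scale, so the value has to be copied across explicitly.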
A function is declared to run on the device (GPU) with the __global__ qualifier, and such functions are called “kernels”. A kernel usually performs tasks that are rich in data parallelism. A program written in CUDA is compiled with nvcc, the compiler driver provided with the CUDA toolkit, which compiles the device code itself and passes the host code to a host C/C++ compiler such as gcc.
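A data-parallel kernel might look like the following sketch, which adds two vectors with one thread per element. The names vecAdd and runVecAdd, the input values, and the choice of 256 threads per block are assumptions made for the example, and error checking is kept to a minimum.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: runs on the device, one thread per output element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard the last, partial block
}

// Host-side driver (hypothetical name). Returns the number of wrong results,
// or -1 if a CUDA call fails.
int runVecAdd(int n) {
    std::vector<float> a(n), b(n), c(n);
    for (int i = 0; i < n; ++i) { a[i] = float(i); b[i] = 2.0f * i; }

    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess ||
        cudaMalloc(&d_b, bytes) != cudaSuccess ||
        cudaMalloc(&d_c, bytes) != cudaSuccess) return -1;
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block (assumed)
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaError_t err = cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) if (c[i] != a[i] + b[i]) ++wrong;
    return wrong;
}

int main() {
    printf("mismatches: %d\n", runVecAdd(1 << 20));
    return 0;
}
```

Assuming the file is saved as vec_add.cu, it can be built with something like: nvcc vec_add.cu -o vec_add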
At execution time, the CUDA runtime executes kernels across hundreds of cores in the form of thousands of threads. The application developer must specify how the thread hierarchy is organized, that is, how many threads are grouped into each block and how many blocks make up the grid. The creation, termination and other management of threads are handled by the CUDA runtime, and these operations are transparent to the application developer.
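To illustrate how a developer specifies the thread hierarchy, the sketch below launches a two-dimensional grid of two-dimensional blocks and has each thread compute its own position from the built-in variables blockIdx, blockDim and threadIdx. The kernel name fillIndices, the helper checkIndices, the 16 x 16 block size and the 100 x 37 problem size are all assumptions chosen for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread derives its (x, y) position from its block and thread indices
// and stores its row-major offset at that position.
__global__ void fillIndices(int* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) out[y * width + x] = y * width + x;
}

// Host-side driver (hypothetical name). Returns the number of wrong entries,
// or -1 if a CUDA call fails.
int checkIndices(int width, int height) {
    size_t count = size_t(width) * height;
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, count * sizeof(int)) != cudaSuccess) return -1;

    // Thread hierarchy chosen by the developer: 16 x 16 threads per block,
    // and enough blocks in each dimension to cover the whole array.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillIndices<<<grid, block>>>(d_out, width, height);

    std::vector<int> out(count);
    cudaError_t err = cudaMemcpy(out.data(), d_out, count * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (size_t i = 0; i < count; ++i) if (out[i] != int(i)) ++wrong;
    return wrong;
}

int main() {
    printf("mismatches: %d\n", checkIndices(100, 37));
    return 0;
}
```

Note that the code only states the shape of the grid and blocks; it never creates or destroys a thread itself. Because the grid is rounded up to whole blocks, some threads fall outside the array, which is why the kernel checks its bounds before writing.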