Graphic
Processing units (GPUs) are special purpose hardware devices designed
to efficiently perform calculations related to rendering of computer
graphics. Graphic processing units have a highly parallel
architecture and contain hundreds of simple processing cores built in
it. This massive computational power in GPUs can be used together
with CPUs to improve the performance of compute intensive
applications. This practice of using GPUs along with CPUs to boost
application performance is known as GPU computing.
The
strategy that GPUs use to accelerate application performance is
exploiting the parallelism of the application. A modern GPU consists
of 1000+ cores (NVIDIA GeForce GTX 780 Ti has 2880 cores) which can run in parallel, so that it can be programmed
to execute a task using 10000+ threads executing in parallel and
thereby boost the performance. The main objective is to maximizing
the throughput of complete task rather than minimizing the latency of
each operation. Furthermore to reduce the overhead of this massive
parallelism, GPU threads are designed to be lightweight and to have
very low management overhead . With this massive fine grained
parallelism introduced by GPUs the throughput of the applications
grows immensely.
Achieving
such a higher degree of a parallelism using CPUs is an unprecedented
task because CPUs are consisting of limited number of cores. And
these cores are designed minimize the latency of each single
operation by executing it as fast as possible in contrast to parallel
approach of GPUs.
GPU
computing will also help to increase the overall host system
performance since it offloads computation from CPU. Therefore the
host system has more time slices available to execute other
processes. GPUs also consume less energy compared to CPUs therefore
the usage of GPUs makes computer systems more energy efficient and
environment friendly.
One
of the major advantages of using large number relatively slow clock
rates of cores with compared very fast small number of cores is
scalability. Small number of cores do not scale linearly where as
multicore systems scale almost linearly with the problem size.
Another
important prospect of GPU computing is, it is expected GPU
performance to increase at the scale of the “moore’s law”.
When more transistors are fitting into a GPU, it will result in more
number of cores and thereby increasing the performance. But this is
not true for single stream processors since increase of clock speeds
is stalled due to various limitations such as power consumption, heat
generation, etc. Therefore modern CPUs are also designed to have
higher number of cores instead of increased clock speeds.
GPU
computing is used to enhance performance of diverse areas of
applications. These areas include,
- Higher education and supercomputing
- Oil and Gas industry
- Defense intelligence
- Computational finance
- Computer aided design (CAD) and Computer aided engineering(CAE)
- Media and entertainment and etc.
“SeqNFind”
is a well known sequence analysis tool in bioinformatics and it is
expected acceleration speed up is 400 times with the aid of GPUs.
This tool is promoted as energy
efficient and high performing product over its competitors based on
its GPU implementation. And there are many other real word
application that are already in practical use that are implemented on
GPUs.
Compute Unified Device Architecture(CUDA)
CUDA is a parallel computing platform
introduced by NVIDA cooperation, this platform provides set of
extension to standard C/C++ language which can be used to develop
programs that can be run on GPUs manufactured by NVIDIA cooperation.
In
CUDA terminology the system in which the GPU is deployed is referred
to as “host” whereas the GPU itself is referred to as “device”.
A typical CUDA program will consist of parts that are to be run on
host or on the CPU and parts that are to be run in parallel on device
or the GPU. Special keywords are provided in CUDA extension to
declare device data elements and device functions.
A
function is declared to be run on the device (GPU) using __global__
directive and these functions are called “kernels”. A kernel
usually performs tasks which are rich in data parallelism. A
program written in CUDA should be compiled with nvcc, an extension to
gcc compiler, provided with CUDA software development kit.
At
the execution time, kernels are executed across hundreds of cores in
the form of thousands of threads by the CUDA runtime. Application
developer must specify the organization of thread hierarchy to be
used. The creation, termination and other management tasks of threads
are done by the CUDA runtime and these operations are transparent to
the application developer.