CUDA C programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more devices (frequently graphics adapter cards) with CUDA-enabled NVIDIA GPUs.
While NVIDIA devices are frequently associated with rendering graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can be expressed as data-parallel work.
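As a concrete illustration, the following is a minimal sketch of this host/device split: host code allocates and initializes data on the CPU, copies it to the device, and launches a kernel in which each of many lightweight device threads handles one element. The kernel name `vecAdd` and the problem size are illustrative, not taken from this guide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: each lightweight thread computes one element of the sum.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    const int n = 1 << 20;                    // illustrative problem size
    const size_t bytes = n * sizeof(float);

    // Host (CPU) allocations and initialization.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU) allocations.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Copy inputs from host memory to device memory.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel: roughly one thread per element, in blocks of 256 threads.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);

    // Copy the result back to the host and spot-check one element.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```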
However, the device is based on a distinctly different design from the host system, and to use CUDA effectively it is important to understand those differences and how they determine the performance of CUDA applications.