CUDA devices use several memory spaces, which have different characteristics that reflect their distinct usages in CUDA applications. These memory spaces include global, local, shared, constant, texture, and register memory, as shown in Figure 1.
Of these different memory spaces, global and texture memory are the most plentiful; see Section F.1 of the CUDA C Programming Guide for the amounts of memory available in each memory space at each compute capability level. Global, local, and texture memory have the greatest access latency, followed by constant memory, shared memory, and the register file.
The various principal traits of the memory types are shown in Table 1.
Memory | Location on/off chip | Cached | Access | Scope | Lifetime |
---|---|---|---|---|---|
Register | On | n/a | R/W | 1 thread | Thread |
Local | Off | † | R/W | 1 thread | Thread |
Shared | On | n/a | R/W | All threads in block | Block |
Global | Off | † | R/W | All threads + host | Host allocation |
Constant | Off | Yes | R | All threads + host | Host allocation |
Texture | Off | Yes | R | All threads + host | Host allocation |

† Cached in L2 by default; whether accesses are also cached in L1 depends on the device's compute capability and on compilation flags.
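The declaration qualifiers in device code map directly onto the spaces in Table 1. The following sketch illustrates that mapping; the kernel and variable names are illustrative, not part of the guide.

```cuda
#include <cstdio>

__constant__ float coeff[16];     // constant memory: read-only in device code,
                                  // written by the host (cudaMemcpyToSymbol)
__device__   float gBuffer[256];  // global memory: visible to all threads + host

__global__ void memorySpacesDemo(float *out)
{
    int tid = threadIdx.x;        // scalar automatic variable: held in a
                                  // register (per-thread, thread lifetime)

    __shared__ float tile[64];    // shared memory: on-chip, visible to all
                                  // threads in the block, block lifetime
    tile[tid % 64] = gBuffer[tid];
    __syncthreads();

    float scratch[64];            // large per-thread array: the compiler may
                                  // place it in off-chip local memory
    scratch[tid % 64] = tile[tid % 64] * coeff[tid % 16];

    out[tid] = scratch[tid % 64]; // result written back to global memory
}
```

Registers and shared memory are on-chip and fast; local, global, constant, and texture data reside in off-chip device memory, with caching as noted in the table.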
In the case of texture access, if a texture reference is bound to a linear (and, as of version 2.2 of the CUDA Toolkit, pitch-linear) array in global memory, then the device code can write to the underlying array. Reading from a texture while writing to its underlying global memory array in the same kernel launch should be avoided because the texture caches are read-only and are not invalidated when the associated global memory is modified.
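The hazard described above can be sketched with the legacy texture reference API (deprecated in recent CUDA toolkits); all names here are illustrative.

```cuda
// A texture reference bound to a linear array in global memory.
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void hazardousKernel(float *devArray, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = tex1Dfetch(texRef, i); // read through the texture cache
        devArray[i] = v + 1.0f;          // write to the SAME underlying array:
                                         // the read-only texture cache is not
                                         // invalidated, so later fetches in
                                         // this launch may return stale data.
    }
}

// Host side (error checking omitted):
//   float *devArray;
//   cudaMalloc(&devArray, n * sizeof(float));
//   cudaBindTexture(NULL, texRef, devArray, n * sizeof(float));
//   hazardousKernel<<<blocks, threads>>>(devArray, n);
```

Writes made by `hazardousKernel` become safely visible to texture fetches only in a subsequent kernel launch, which is why mixing the two in one launch should be avoided.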