GPU 101: Memory Hierarchy

Everything you need to know about GPU memory

Arc Compute
September 13, 2023

Memory hierarchies in GPUs are crucial for optimizing the performance of parallel computing tasks. A GPU's memory hierarchy consists of several types of memory with different sizes, latencies, and access rules, catering to the diverse requirements of GPU workloads. Here are the primary levels of the memory hierarchy within a GPU: 

Basic memory hierarchy of an NVIDIA A100 40GB GPU

Global Memory: 

Size: Global memory is the largest memory pool in a GPU, often ranging from several gigabytes to tens of gigabytes. 

Access: It is accessible by all threads in a GPU, but access to global memory is relatively slow compared to other memory types. 

Purpose: It serves as the primary storage for data that needs to be shared across multiple blocks or threads, such as input data, output data, and global constants. 
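To make this concrete, here is a minimal CUDA sketch (the kernel, names, and launch configuration are illustrative). All three buffers are allocated in global memory with cudaMalloc, and consecutive threads touch consecutive addresses, so the hardware can coalesce the loads and stores into wide transactions:

```cuda
#include <cuda_runtime.h>

// Each thread adds one pair of elements held in global memory.
// Thread i touches address i, so a warp's 32 accesses coalesce.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // reads and writes go through global memory
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));   // cudaMalloc allocations live in global memory
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```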

Shared Memory: 

Size: Shared memory is a smaller, faster memory pool, typically measured in kilobytes per streaming multiprocessor (SM). 

Access: It is shared by threads within the same thread block (or workgroup). Threads can communicate and synchronize through shared memory. 

Purpose: Shared memory is used for inter-thread communication and storing data reused by multiple threads within a block, helping to reduce memory latency. 
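A classic use of shared memory is a block-level reduction, sketched below (block size and names are illustrative): each block stages its slice of the input in on-chip shared memory, then the threads cooperate on it, synchronizing with __syncthreads() between steps instead of round-tripping through global memory:

```cuda
// Each block sums 256 input elements using a shared-memory tile.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                  // on-chip, one copy per block
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];   // stage data in shared memory
    __syncthreads();                             // whole block sees the tile

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) tile[t] += tile[t + s];       // threads communicate via the tile
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = tile[0];       // one global write per block
}
```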

Local Memory: 

Size: Local memory is specific to each thread and is usually a small cache or scratchpad memory. 

Access: It is private to individual threads and is often implemented using a portion of global memory. 

Purpose: Local memory stores temporary variables or spill data when there is insufficient register space for a thread's variables. Accessing local memory is slower than accessing registers. 
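A common way to end up in local memory, sketched here with an illustrative kernel, is a per-thread array indexed with a runtime value: registers are not indexable, so the compiler typically places such an array in local memory. Compiling with nvcc -Xptxas -v reports any local-memory (spill) usage per kernel:

```cuda
// Per-thread bin counts. Because bins[] is indexed with a runtime value,
// the compiler will usually place it in (thread-private) local memory
// rather than registers.
__global__ void countBins(const int *data, int *out, int n) {
    int bins[32];
    for (int i = 0; i < 32; i++) bins[i] = 0;

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = t; i < n; i += gridDim.x * blockDim.x)
        bins[data[i] & 31]++;                    // dynamic index -> local memory

    for (int i = 0; i < 32; i++)
        atomicAdd(&out[i], bins[i]);             // merge per-thread counts
}
```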

Texture Memory and Constant Memory: 

Size: These specialized memory types vary depending on the GPU architecture. 

Access: Texture memory and constant memory are optimized for specific read patterns. Texture memory is optimized for 2D or 3D texture fetches, while constant memory is read-only and optimized for broadcasting constants to multiple threads. 

Purpose: They are used for specific memory access patterns, such as texture sampling in graphics or read-only data shared across multiple threads in GPGPU applications. 
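Constant memory usage can be sketched as follows (array size and names are illustrative): data is declared with the __constant__ qualifier, filled from the host with cudaMemcpyToSymbol, and then read by every thread. When all threads of a warp read the same address, the value is broadcast, which is the access pattern constant memory is optimized for:

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];   // read-only from kernels, broadcast-optimized

__global__ void scaleByCoeff(const float *in, float *out, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread reads coeffs[k]: one address, broadcast to the whole warp.
    if (i < n) out[i] = in[i] * coeffs[k];
}

// Host side: copy the table into constant memory before launching.
void setCoeffs(const float *hostCoeffs) {
    cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
}
```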

 

Register File: 

Size: Each thread in a GPU has access to a set number of registers. 

Access: Registers are the fastest and most private level of the hierarchy: each register is visible only to the thread that owns it, and registers hold that thread's local variables and intermediate results. 

Purpose: Registers hold each thread's working data. Because the SM's register file is shared among all resident threads, keeping per-thread register usage low enough to sustain high occupancy, while reusing registers effectively, is essential for achieving high GPU performance.  
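CUDA exposes this trade-off through the __launch_bounds__ qualifier, sketched below on an illustrative kernel: telling the compiler the maximum block size (and, optionally, a minimum number of resident blocks per SM) lets it cap register usage so more threads fit in the SM's register file at once:

```cuda
// Promise the compiler: at most 256 threads per block, and aim for at
// least 4 resident blocks per SM. The compiler may limit registers per
// thread (spilling to local memory if needed) to meet this occupancy.
__global__ void __launch_bounds__(256, 4)
saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```

The same cap can be applied globally at compile time with nvcc's --maxrregcount flag.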

L1 and L2 Cache: 

Size: L1 and L2 cache sizes vary by GPU architecture but are typically small compared to global memory; L1 is usually private to each SM, while a single L2 is shared by all SMs. 

Access: They are hardware-managed caches that store frequently accessed data to reduce memory latency. 

Purpose: Caches help accelerate memory access by storing data that threads access frequently, improving overall performance. 
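Although these caches are hardware-managed, CUDA code can still give hints. One common one, sketched below on an illustrative kernel, is marking pointers __restrict__ and loading through __ldg(), which tells the compiler the data is read-only for the kernel's lifetime so it can route the loads through the read-only data cache on architectures that have one:

```cuda
// __restrict__ promises the pointers don't alias; __ldg() loads through
// the read-only data cache path where the architecture supports it.
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * __ldg(&in[i]);
}
```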


It's important to note that GPU memory hierarchies can differ between GPU architectures and manufacturers. Additionally, GPU programming models, such as CUDA or OpenCL, allow developers to manage data movement and optimize memory access to exploit these hierarchies effectively for different workloads. GPU memory hierarchies are critical for achieving high performance in parallel computing tasks, whether in graphics or general-purpose computing. 

 
