GPU 101: Memory Hierarchy

Everything you need to know about GPU memory

Date Published
Last Updated

Memory hierarchies in GPUs are crucial for optimizing the performance of parallel computing tasks. These memory hierarchies consist of various types of memory with different characteristics to cater to the diverse requirements of GPU workloads. Here are the primary memory hierarchies within GPUs: 

Basic memory hierarchy of an NVIDIA A100 40GB GPU
Basic memory hierarchy of an NVIDIA A100 40GB GPU

Global Memory: 

Size: Global memory is the largest memory pool in a GPU, often ranging from several gigabytes to tens of gigabytes. 

Access: It is accessible by all threads in a GPU, but access to global memory is relatively slow compared to other memory types. 

Purpose: It serves as the primary storage for data that needs to be shared across multiple blocks or threads, such as input data, output data, and global constants. 

Shared Memory: 

Size: Shared memory is a smaller, faster memory pool, typically measured in kilobytes per streaming multiprocessor (SM). 

Access: It is shared by threads within the same thread block (or workgroup). Threads can communicate and synchronize through shared memory. 

Purpose: Shared memory is used for inter-thread communication and storing data reused by multiple threads within a block, helping to reduce memory latency. 

Local Memory: 

Size: Local memory is specific to each thread and is usually a small cache or scratchpad memory. 

Access: It is private to individual threads and is often implemented using a portion of global memory. 

Purpose: Local memory stores temporary variables or spill data when there is insufficient register space for a thread's variables. Accessing local memory is slower than accessing registers. 

Texture Memory and Constant Memory: 

Size: These specialized memory types vary depending on the GPU architecture. 

Access: Texture memory and constant memory are optimized for specific read patterns. Texture memory is optimized for 2D or 3D texture fetches, while constant memory is read-only and optimized for broadcasting constants to multiple threads. 

Purpose: They are used for specific memory access patterns, such as texture sampling in graphics or read-only data shared across multiple threads in GPGPU applications. 


Register File: 

Size: Each thread in a GPU has access to a set number of registers. 

Access: Registers are the hierarchy's fastest and most private memory, storing local variables and intermediate results. 

Purpose: Minimizing register usage and maximizing register reuse is essential for achieving high GPU performance.  

L1 and L2 Cache: 

Size: GPU architecture's L1 and L2 cache sizes vary but are typically small compared to global memory. 

Access: They are hardware-managed caches that store frequently accessed data to reduce memory latency. 

Purpose: Caches help accelerate memory access by storing data that threads access frequently, improving overall performance. 

It's important to note that GPU memory hierarchies can differ between GPU architectures and manufacturers. Additionally, GPU programming models, such as CUDA or OpenCL, allow developers to manage data movement and optimize memory access to exploit these hierarchies effectively for different workloads. GPU memory hierarchies are critical for achieving high performance in parallel computing tasks, whether in graphics or general-purpose computing. 


NVIDIA H200 GPU Servers
8x H100 SXM5 Server

Buy the Latest NVIDIA GPU Servers

Leverage the power of the latest NVIDIA GPUs in your data center. Whether you need one server or thousands, we've got you covered with industry-best lead times on NVIDIA H200, H100, and L40S deployments. We also offer custom configurations based on your organization's unique requirements.

Accelerate LLMs with ArcHPC

Operate at Peak GPU Efficiency

Integrate ArcHPC into your infrastructure to achieve peak GPU performance. With accelerated training and inference you'll be able to bring your products to market faster than ever before.

H100 HGX

8x H100 SXM5 Cloud Instances

Enable large-scale model training with top-of-the-line NVIDIA H100 SXM5 GPUs. Arc Compute's cloud clusters are available for a minimum 2-year commitment and start at just $2.20/hr per GPU.