GPU Utilization & the TCO of Infrastructure

How CTOs and HPC managers are increasing GPU utilization and lowering the TCO of their on-premise infrastructure

Anton Allen
March 2, 2023

Under-utilized Resources

IT decision-makers must rapidly adapt to new frameworks and technologies to best serve their clients when building, managing, and optimizing on-premise infrastructure. One of the primary issues faced across industries is the under-utilization of computing resources, especially GPUs. 

Working with AI/ML models requires considerable investment in GPU servers to provide the environments needed for testing and training complex algorithms. These environments range from hundreds to tens of thousands of GPUs, and teams aim to squeeze as many petaFLOPS as possible out of the underlying chipsets. With average GPU utilization of only 10%, key decision-makers have been wary of adopting new HPC hardware while their current resources sit under-used.
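To see where a cluster stands against a figure like 10%, utilization first has to be measured. The sketch below is one minimal way to do that, assuming NVIDIA GPUs with `nvidia-smi` on the PATH. The function names and the sample readings are hypothetical and for illustration only. Note that `utilization.gpu` reports the share of time in which at least one kernel was running, so it is a coarse signal that can overstate how much of the chip is actually busy.

```python
import subprocess

# Query flags supported by nvidia-smi; the field list is one reasonable choice.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_samples(csv_text):
    """Parse nvidia-smi CSV rows into a list of per-GPU dicts.

    Assumes every field is numeric; rows with unsupported ("[N/A]")
    fields are not handled in this sketch.
    """
    samples = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        samples.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_pct": 100.0 * float(mem_used) / float(mem_total),
        })
    return samples


def mean_utilization(samples):
    """Average compute utilization (percent) across all sampled GPUs."""
    return sum(s["util_pct"] for s in samples) / len(samples)


def sample_live():
    """Take one live sample; needs an NVIDIA driver and nvidia-smi installed."""
    out = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_samples(out.stdout)


# Made-up readings for a 4-GPU node (illustrative, not measured data).
EXAMPLE_OUTPUT = """\
0, 35, 20000, 81559
1, 0, 3, 81559
2, 5, 1024, 81559
3, 0, 3, 81559
"""

if __name__ == "__main__":
    samples = parse_gpu_samples(EXAMPLE_OUTPUT)
    print(f"mean GPU utilization: {mean_utilization(samples):.1f}%")
```

A single sample says little on its own. In practice one would call `sample_live()` on a timer and average the results over days or weeks before drawing conclusions.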

Are Job Schedulers the Solution?

Job schedulers such as Slurm have been among the few tools available for addressing utilization issues. They are well suited to queueing and organizing jobs but fall short at maximizing utilization: greedy code, human error, and statically allocated resources plague them. Without intensive professional intervention, GPUs consistently remain under-utilized. These utilization issues can only be thoroughly addressed by Arc Compute’s GPU/CPU hypervisor, ArcHPC.

ArcHPC enables "Real Utilization" by repurposing idle or under-utilized compute resources, such as execution capabilities and VRAM, during runtime, allowing up to 100% utilization as long as there are workloads available for processing. This translates into shorter training times and far less opportunity cost from idle resources. ArcHPC can be fully integrated under most job schedulers within an organization's tech stack.

Achieve up to 100% GPU utilization with ArcHPC

New & Improved GPUs

The explosive performance growth of NVIDIA’s H100 over the A100, together with Intel’s Data Center GPU Max Series, brings new excitement (and new problems) to the world of exascalers and supercomputers as they try to double, triple, quadruple, and quintuple their petaFLOPS. Breakthrough technology looks great, but many ask, “How do we ensure we get the most out of it, given the total cost of ownership and technical investment requirements?” Spending hundreds of thousands of dollars on new GPUs can be hard to justify when overall utilization is so low.
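A rough calculation shows why low utilization makes the purchase hard to justify. The sketch below takes the $314,900 starting price of the 8-GPU server listed at the end of this post and spreads it over the GPU-hours that do useful work. The 3-year straight-line amortization is an assumption chosen to match the listed hardware coverage period. Power, cooling, staffing, and software costs are left out, so the real TCO would be higher. The function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_utilized_gpu_hour(price_usd, num_gpus, utilization, years=3):
    """Hardware cost per GPU-hour of useful work.

    Assumes straight-line amortization over `years` and ignores power,
    cooling, staffing, and software costs.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_gpu_hours = num_gpus * years * HOURS_PER_YEAR
    return price_usd / (total_gpu_hours * utilization)


if __name__ == "__main__":
    price, gpus = 314_900, 8  # 8-GPU server starting price from this post
    for u in (0.10, 0.50, 1.00):
        cost = cost_per_utilized_gpu_hour(price, gpus, u)
        print(f"utilization {u:>4.0%}: ${cost:.2f} per utilized GPU-hour")
```

Under these assumptions, hardware alone costs about $15 per utilized GPU-hour at 10% utilization and about $1.50 at 100%. The ratio is what matters here: cost per unit of useful work scales inversely with utilization.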

 

The Solution: ArcHPC + Job Scheduler

Maximizing utilization requires a job orchestration and scheduling tool to ensure a consistent funnel of work for HPC infrastructure, but without ArcHPC you are only addressing part of the underlying issue. Pairing a job scheduler with ArcHPC provides a complete solution for lowering the total cost of ownership of next-generation infrastructure and makes considerable investments far more justifiable to key decision-makers. When both technologies are present in the tech stack, users can address both ends of the utilization problem, minimizing the complexity of job schedulers and maximizing the ROI of new hardware.

Because ArcHPC automatically provisions and re-provisions compute resources and removes the barriers that create idle compute silos, it has never been easier to maximize utilization and lower the total cost of ownership of on-premise GPU-accelerated infrastructure.

Upgrade Your Infrastructure with NVIDIA H100 Servers

Borealis H100 Server - 5U (4 GPUs)
4 x NVIDIA H100 (80GB) SXM5 GPUs
2 x Intel® Xeon® Platinum 8458P Processors
3-year Hardware Coverage
NVLink + NVSwitch GPU connectivity
2,048 GB of system memory
2 x 4 TB SSD
Starting at $225,000

Borealis H100 Server - 8U (8 GPUs)
8 x NVIDIA H100 (80GB) SXM5 GPUs
2 x Intel® Xeon® Platinum 8458P Processors
3-year Hardware Coverage
NVLink + NVSwitch GPU connectivity
2,048 GB of system memory
2 x 4 TB SSD
Starting at $314,900