GPU Utilization & the TCO of Infrastructure

How CTOs and HPC managers are increasing GPU utilization, lowering the TCO of their on-premise infrastructure

Date Published
Last Updated

Under-utilized Resources

IT decision-makers must rapidly adapt to new frameworks and technologies to best serve their clients when building, managing, and optimizing on-premise infrastructure. One of the primary issues faced across industries is the under-utilization of computing resources, especially GPUs. 

When working with AI/ML models, considerable investments in GPU servers are required to provide the necessary environments for testing and training complex algorithms. These environments encompass hundreds to tens of thousands of GPUs, where teams aim to squeeze the most PetaFLOPs out of the underlying chipsets as possible. With an average GPU utilization of only 10%, key decision-makers have been wary of adapting to new HPC hardware as their current resources remain stagnant.

Are Job Schedulers the Solution?

Job schedulers, like SLURM, have been one of the only tools available for addressing utilization issues. They can be great tools for queueing and organizing jobs but fail at maximizing utilization. Greedy code, human error, and static resources plague job schedulers. Without intensive professional intervention, GPUs consistently remain under-utilized. These utilization issues can only be thoroughly addressed by Arc Compute’s GPU/CPU hypervisor, ArcHPC.

ArcHPC enables "Real Utilization" by addressing and repurposing idle/under-utilized compute resources, such as execution capabilities and VRAM during runtime, allowing up to 100% utilization as long as there are workloads available for processing. This translates into faster job training times with far less opportunity cost of idle resources. ARC HPC can be fully integrated under most job schedulers within an organization's tech stack. 

Achieve 100% GPU utilization with ARC HPC

New & Improved GPUs

The explosive performance growth in NVIDIA’s H100s versus A100s and Intel’s Datacenter GPU Max Series breathes new excitement (and problems) into the world of Exascalers and supercomputers as they try to double, triple, quadruple, and quintuple PetaFLOPs. Breakthrough technology looks great, but many ask, “how do we ensure we get the most out of it given the total cost of ownership and technical investment requirements.” Spending hundreds of thousands of dollars on new GPUs can be hard to justify when overall utilization is so low.

 

The Solution: ArcHPC + Job Scheduler

For maximizing utilization, a job orchestration and scheduling tool is necessary to ensure a consistent funnel of work for HPC infrastructures but, without ArcHPC, you’re only addressing part of the underlying issue. Pairing a job scheduler with ArcHPC encompasses a complete solution for lowering the total cost of ownership of next-generation infrastructure and makes considerable investments far more justifiable to key decision-makers.  When both technologies are present in the tech stack, users can address both ends of the utilization problem, minimizing the complexity of job schedulers and maximizing the ROI of new hardware. 

Thanks to ArcHPC ensuring compute resources are automatically provisioned/re-provisioned, removing barriers to idle compute silos, it has never been easier to maximize utilization and lower the total cost of ownership of on-premise GPU-accelerated infrastructure. 

8x H100 SXM5 Server
Build

Buy the Latest NVIDIA H100 GPU Servers

Leverage the power of the latest NVIDIA GPUs in your data center. Whether you need one server or thousands, we've got you covered with industry-best lead times on NVIDIA H100 deployments.

Learn More
H100 HGX
Deploy

8x H100 SXM5 Cloud Instances

Enable large-scale model training with top-of-the-line NVIDIA H100 SXM5 GPUs. Arc Compute's cloud clusters are available for a minimum 2-year commitment and start at just $2.20/hr per GPU.

Learn More
ArcHPC
Optimize

Maximize GPU Utilization and Performance

Integrate ArcHPC into your infrastructure to achieve peak GPU performance. With accelerated training and inference you'll be able to bring your products to market faster than ever before.

Learn More