The Top 4 Utilization Issues with GPU Job Schedulers

How GVM Server Resolves Them

Anton Allen
Vice President of Sales
February 10, 2023

Introduction

A GPU Job Scheduler is a tool that manages and schedules the allocation of GPUs in a cluster environment. Schedulers enable the efficient utilization of GPU resources by allocating them to the jobs that need them. They also provide a unified interface for submitting, monitoring, and controlling the execution of GPU jobs in clusters. Although schedulers can be very useful to Systems Administrators, they have drawbacks when it comes to maximizing utilization and performance.

The Top 4 Utilization Issues with GPU Job Schedulers

1. How Schedulers measure and report utilization:

Schedulers measure and report utilization in terms of VRAM assignment. As a result, when unoptimized code consumes more VRAM than it should, cores and execution capabilities end up over-provisioned to jobs that do not need them, while the reported utilization still looks high.

2. Schedulers only suggest fixes to improve utilization:

When it comes to optimizing utilization, schedulers only suggest which jobs should be reviewed, resized, or re-coded. It is then up to the user to implement these fixes on an ongoing basis.

3. Limited virtualized GPU partitioning capabilities:

Virtualizing GPUs and partitioning them into smaller vGPUs can drastically increase utilization, thanks to multi-tenancy. Existing virtualization tools used by job schedulers, such as MIG, max out at seven slices per GPU and can only attach a single slice to a virtual machine. MIG is also available only on select NVIDIA data center GPUs, such as the A100 and H100.

4. No live resizing/redistribution of VRAM:

Schedulers cannot resize VRAM at runtime for jobs that would benefit from an increase or would be unaffected by a decrease. Underutilized VRAM therefore stays locked to the jobs that hold it, and opportunities to complete the job queue faster are missed.
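To make the first issue above concrete, here is a minimal sketch of how a VRAM-assignment metric and a compute metric can disagree for the same GPU. The job names, the 80 GB GPU size, and the busy fractions are hypothetical numbers chosen for illustration; they are not measurements, and the functions are not part of any scheduler's API.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    vram_assigned_gb: float  # VRAM reserved for the job (hypothetical value)
    sm_busy_fraction: float  # share of GPU compute the job keeps busy, 0.0 to 1.0


def vram_assignment_utilization(jobs, gpu_vram_gb):
    """Utilization as a VRAM-assignment metric would report it."""
    return sum(job.vram_assigned_gb for job in jobs) / gpu_vram_gb


def compute_utilization(jobs):
    """Share of the GPU's compute that is actually busy, capped at 100%."""
    return min(1.0, sum(job.sm_busy_fraction for job in jobs))


# Hypothetical case: two jobs fill the memory of an 80 GB GPU
# but keep only a fraction of its cores busy.
jobs = [
    Job("train-a", vram_assigned_gb=50.0, sm_busy_fraction=0.25),
    Job("train-b", vram_assigned_gb=30.0, sm_busy_fraction=0.10),
]

print(f"VRAM-assignment utilization: {vram_assignment_utilization(jobs, 80.0):.0%}")
print(f"Compute utilization:         {compute_utilization(jobs):.0%}")
```

In this made-up case the GPU looks fully utilized by VRAM assignment even though roughly two thirds of its compute sits idle, which is the gap a VRAM-based report cannot show.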

 

How Arc Compute Addresses these Issues

1. How Schedulers measure and report utilization:

Arc Compute's hypervisor, GVM Server, can move cores and execution capabilities to where they are needed during runtime, eliminating underutilized or unused cores and execution capabilities. This feature, called Simultaneous Multi-Virtual GPU (SMVGPU), enables 90% to 100% utilization of cores and execution capabilities, drastically expediting job completion times.

2. Schedulers only suggest fixes to improve utilization:

GVM Server doesn’t report suggestions to resize, recode, or review jobs based on their utilization numbers. Instead, our technology optimizes automatically during runtime, fixing the issues jobs face while sharing the same underlying hardware. Because the process is automated, there is no need for reports on how to improve utilization and no need for human intervention.

3. Limited virtualized GPU partitioning capabilities:

Unlike MIG, which is limited to a maximum of seven slices per GPU, GVM Server + SMVGPU has no such limit and can slice a GPU into an arbitrary number of vGPUs for multi-tenancy. It can size, resize, and split GPUs without limitations. GVM Server can also attach multiple virtualized slices from multiple GPUs to a single VM and is not limited to MIG-enabled GPU models.

4. No live resizing/redistribution of VRAM:

GVM Server will soon feature VRAM reallocation at runtime for workloads sharing a GPU, without needing to reboot. This feature will increase utilization and performance even further.
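As a rough illustration of what runtime VRAM reallocation could mean, the sketch below shrinks over-assigned jobs to their peak usage plus some headroom and hands the reclaimed memory to jobs already running at their limit. This is an assumed toy policy, not GVM Server's algorithm; the function name `rebalance_vram`, the 1 GB headroom, and the job figures are all hypothetical.

```python
def rebalance_vram(assigned_gb, peak_used_gb, headroom_gb=1.0):
    """Toy policy: reclaim VRAM from over-assigned jobs and split it evenly
    among jobs that are already using (almost) all of their assignment.

    Both arguments map job name -> gigabytes. Returns the new assignment
    and the amount of VRAM left unassigned (non-zero only when no job is
    constrained).
    """
    new_assignment = {}
    constrained = []  # jobs with no spare VRAM to give up
    reclaimed = 0.0

    for job, assigned in assigned_gb.items():
        target = peak_used_gb[job] + headroom_gb
        if target < assigned:
            # The job holds more than it needs: shrink it and bank the rest.
            new_assignment[job] = target
            reclaimed += assigned - target
        else:
            new_assignment[job] = assigned
            constrained.append(job)

    if not constrained:
        return new_assignment, reclaimed

    share = reclaimed / len(constrained)
    for job in constrained:
        new_assignment[job] += share
    return new_assignment, 0.0


# Hypothetical jobs sharing one 80 GB GPU.
assigned = {"a": 40.0, "b": 24.0, "c": 16.0}
peak_used = {"a": 12.0, "b": 24.0, "c": 16.0}

new_assignment, free_gb = rebalance_vram(assigned, peak_used)
print(new_assignment, free_gb)
```

Here job "a" gives up the 27 GB it never touched, and jobs "b" and "c" each gain 13.5 GB while the total stays at 80 GB. A real implementation would also have to decide when usage measurements are trustworthy enough to shrink a running job.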

View comparison showing how MIG and SMVGPU differ when training multiple jobs on a single GPU
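The effect of a fixed slice limit on multi-tenancy can be estimated with simple arithmetic. The sketch below counts how many equally sized jobs fit on one GPU by memory alone, with and without the seven-slice cap mentioned above. It is a simplification under stated assumptions: it ignores compute contention and the fixed memory sizes of real MIG profiles, and the 80 GB GPU, 4 GB jobs, and 40-job queue are hypothetical.

```python
import math

MIG_MAX_SLICES_PER_GPU = 7  # slice limit cited in this article


def tenants_per_gpu(gpu_vram_gb, job_vram_gb, max_slices=None):
    """Number of equally sized jobs that fit on one GPU by memory alone,
    optionally capped by a fixed slice limit."""
    by_memory = int(gpu_vram_gb // job_vram_gb)
    if max_slices is None:
        return by_memory
    return min(by_memory, max_slices)


def gpus_needed(num_jobs, jobs_per_gpu):
    """GPUs required to run all jobs at once."""
    return math.ceil(num_jobs / jobs_per_gpu)


# Hypothetical workload: 40 jobs of 4 GB each on 80 GB GPUs.
capped = tenants_per_gpu(80, 4, max_slices=MIG_MAX_SLICES_PER_GPU)
uncapped = tenants_per_gpu(80, 4)

print(f"Capped at 7 slices: {capped} jobs per GPU, {gpus_needed(40, capped)} GPUs")
print(f"No slice limit:     {uncapped} jobs per GPU, {gpus_needed(40, uncapped)} GPUs")
```

With these numbers the cap, not the memory, decides how many tenants share a GPU, so small jobs leave most of each GPU's VRAM unassigned.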

Conclusion

GPU Schedulers are complementary to GVM Server: they optimize GPU utilization through VRAM assignment and ensure that organizations consistently have jobs provisioned to their compute clusters. However, to address the over-provisioning of cores and execution capabilities caused by poor code, the limitations of MIG, and the lack of VRAM resizing for GPUs across nodes and clusters, GVM Server is the only solution available.

Data center tech stack showing that GVM Server sits below GPU Job Schedulers

While the premier version of GVM Server doesn't replace all of the functionalities of a GPU Scheduler, it sits below Schedulers in the data center tech stack, meaning that a Scheduler can be seamlessly integrated with GVM Server.

