ResearchIT | Spring Break Updates: Downtime for maintenance 2019-03-20, New Servers & Pronto Scheduler

March 19, 2019

Downtime 2019-03-20

We will be taking down the ResearchIT servers on Wednesday, March 20,
at 9 AM for updates and maintenance.  We expect them to be back up by
late afternoon.

During this time, we will be physically relocating some servers and
adding capacity to the /work Lustre filesystem.

Several of the servers have crashed lately with a kernel error during
memory allocation.  We believe this to be a bug in the OS, and we are
working with Red Hat to fix it.  During this maintenance window we will
be making some changes to capture additional information if the problem
recurs.

New Servers & Scheduler

We have added several new servers to our lineup over the past few
months, and those servers are managed by a new Slurm job scheduler,
pronto.

The pronto system currently manages resources for the following servers:
- speedy3-4
- biocrunch4-7
- gpu03 (2x V100 GPUs)
- singularity (4x V100 GPUs with NVLink)

The scheduler gives us finer control over resources, which reduces
interference between user jobs, makes downtime easier to schedule,
and improves utilization tracking.

During tomorrow's maintenance window, we will also be moving gpu01 and
the legion servers under pronto management; after that, using these
servers will require you to submit a job to the pronto Slurm queue.
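
For anyone new to Slurm, here is a rough sketch of what a batch job
looks like, written as a small Python helper that produces the script
text.  It uses only standard Slurm #SBATCH directives; the job name,
command, and resource values are placeholders we made up for
illustration, and pronto-specific settings such as partition names are
deliberately left out because they are not covered here.

```python
def build_batch_script(job_name, command, cpus=1, gpus=0, time_limit="01:00:00"):
    """Return the text of a minimal Slurm batch script (illustrative only).

    All argument values are placeholders; site-specific options such as
    partitions or accounts are intentionally not included.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",
    ]
    # Request GPUs only when asked for; --gres is the generic-resource flag.
    if gpus > 0:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    # Blank line, then the command the job should run.
    lines += ["", command]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example: a 4-core, 1-GPU job running a made-up script.
    print(build_batch_script("example", "python my_analysis.py", cpus=4, gpus=1))
```

Saving the printed text to a file and passing that file to sbatch is the
usual way to submit a batch job in Slurm; the options you actually need
on pronto may differ.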

We plan to move the remaining ResearchIT servers behind pronto at the
end of the semester.

The link below provides additional information about how to use pronto:

This page will be updated with additional information as we get closer
to the end of the semester and the transition of the additional