Being a Good Citizen on Shared Computing Equipment


Congratulations! You've been granted access to high-end computation equipment that will help to drive your research forward. However, like any shared resource, there are things you can do to respect other users of the system.

Limit Your Resource Consumption

Many programs allow you to specify resource parameters, such as how many CPU cores you will use and how much RAM to allocate per process. If you are running concurrently with other users (i.e., you have not scheduled exclusive access to the machine) please ensure that these parameters are set to reasonable numbers.

Biocrunch has 80 CPU cores. Setting your program to run with all 80 cores will leave no resources for anything (or anyone) else.

Memory consumption parameters should be reasonable. The Biocrunch hardware uses non-uniform memory access (NUMA), which means that each processor has fast access to its own local memory and slower access to memory attached to the other processors. Each of the four processors has 196 GB of local RAM. Here is an example of reasonable choices for the tiemyshoe program:

$ tiemyshoe --cores=10 --maxmemory=15GB > output.txt

This program will use ten cores of one processor and 15GB/core for a total memory footprint of 150GB, well under the 196GB limit. This is reasonable. The command-line options for the fictional tiemyshoe program specify that the --maxmemory option is a per-core parameter, not an aggregate total. You will need to read the documentation for the programs you are running closely to understand whether resource-related parameters are per-core or total, as this will vary.

Here is an example of a new graduate student thinking "if some is good, more is better!"

$ tiemyshoe --cores=150 --maxmemory=100GB > output.txt

Oops! What will happen? First, more cores have been specified than the 80 cores that are available; thus, the machine will do its best to schedule multiple tasks on each CPU core. Since cores are now oversubscribed, they will now spend a lot of time switching back and forth between multiple contexts rather than doing any actual work. The result is that the program will run much slower than it should.

Worse yet is the --maxmemory parameter value. 150 cores x 100GB/core = 15,000GB of RAM needed! This is far more than the machine has, so all memory will be exhausted, starving other people's processes of resources and potentially bringing the whole server to a halt. Don't do this!

Instead, do your math ahead of time to figure out reasonable limits. If possible, run a small dataset first so you can learn the program's behavior, length of run, and so on.
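
The arithmetic can be checked before a job is launched. The following is a minimal shell sketch, not a site-provided tool. The function name check_request is hypothetical, the default limits (80 cores, 196GB) are taken from the Biocrunch figures in this document, and it assumes the memory setting is per-core, as with the fictional tiemyshoe program. It only catches requests that cannot fit; it does not know what other users are running.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check a per-core memory request before a run.
# Usage: check_request CORES MEM_PER_CORE_GB [MAX_CORES] [MAX_MEM_GB]
check_request() {
    cores=$1
    mem_per_core=$2
    max_cores=${3:-80}    # assumed default: Biocrunch core count from this guide
    max_mem=${4:-196}     # assumed default: RAM per processor from this guide, in GB
    total=$((cores * mem_per_core))
    echo "Requested: ${cores} cores, ${total}GB total"
    if [ "$cores" -gt "$max_cores" ]; then
        echo "Too many cores (machine has ${max_cores})"
        return 1
    fi
    if [ "$total" -gt "$max_mem" ]; then
        echo "Too much memory (limit is ${max_mem}GB)"
        return 1
    fi
    echo "OK"
}

# The two requests discussed above: the second passes, the first does not.
check_request 150 100 || echo "Reduce the request before running."
check_request 10 15
```

Passing the limits as optional arguments keeps the same check usable on other machines, provided you look up their core count and memory first.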

Where Do I Keep My Data?

The standalone computational boxes (Biocrunch, Bigram, Speedy, Speedy2) have several TB of space across all home directories. Please be considerate of others and only move the data you need onto the computational box. When you are done with a run, clean up after yourself. Don't let unneeded data accumulate.
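
To see what is taking up space before cleaning up, a small helper built on the standard du, sort, and head utilities can list the largest entries in a directory. This is a sketch under stated assumptions: the function name largest_entries is hypothetical, sizes are reported in kilobytes, and hidden files (names starting with a dot) are skipped because of the shell glob.

```shell
#!/bin/sh
# Hypothetical helper: list the N largest entries (sizes in KB) directly under DIR.
# Usage: largest_entries [DIR] [N]
largest_entries() {
    dir=${1:-$HOME}     # defaults to your home directory
    count=${2:-10}      # defaults to the ten largest entries
    # du -sk prints one "size<TAB>path" line per entry; sort puts the largest first.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, largest_entries "$HOME" 5 shows the five largest items in your home directory, and df -h "$HOME" shows how much space remains on the filesystem that holds it.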

The long-term home for your data should be on the Isilon share used by your laboratory. You should create a symlink to this share for easy access while you are using a computational box like Biocrunch. Here is an example of someone in the Dr. Joe Example laboratory creating a symlink to the Large Scale Storage (LSS) share, which is named by concatenating the netid of the principal investigator with "-lab".

$ cd ~
$ ln -s /lss/research/example-lab example-lab

After the symlink is created, you can now access the LSS share at ~/example-lab.

Data on computational boxes is considered transient and is not backed up. The model we use, with Biocrunch as the example, is:

1. Copy data from LSS onto Biocrunch.
2. Run your analysis.
3. Copy results onto LSS.
4. Clean up on Biocrunch.
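
The four steps above can be sketched as a single shell function. This is an illustration only: the function name stage_run_cleanup, its arguments, and the input/ and output/ directory layout are assumptions, not a site convention, and the analysis command is whatever program you actually run.

```shell
#!/bin/sh
# Hypothetical sketch of the copy / run / copy back / clean up model.
# Usage: stage_run_cleanup LSS_INPUT_DIR LSS_RESULTS_DIR 'ANALYSIS COMMAND' [WORK_PARENT]
stage_run_cleanup() {
    lss_input=$1              # input directory on the LSS share
    lss_results=$2            # where results should end up on the LSS share
    analysis=$3               # command run from inside the work directory
    work_parent=${4:-$HOME}   # local directory that holds the transient work area

    work=$(mktemp -d "$work_parent/run.XXXXXX") || return 1

    # 1. Copy data from LSS onto the computational box.
    cp -R "$lss_input" "$work/input" || return 1
    mkdir "$work/output" || return 1

    # 2. Run the analysis; it reads from input/ and writes to output/.
    (cd "$work" && sh -c "$analysis") || return 1

    # 3. Copy results onto LSS.
    mkdir -p "$lss_results" || return 1
    cp -R "$work/output/." "$lss_results/" || return 1

    # 4. Clean up on the computational box (only reached if every step succeeded).
    rm -rf "$work"
}

# Example call with placeholder paths (commented out because they are hypothetical):
# stage_run_cleanup ~/example-lab/project/input ~/example-lab/project/results \
#     'tiemyshoe --cores=10 --maxmemory=15GB > output/output.txt'
```

If any step fails, the function returns early and leaves the work directory in place so you can inspect it; remember to remove it by hand afterwards.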