Bridges-2 FAQ
Why do I get an error when I try to start an interactive session on an EM node?
Because there are only 4 EM nodes, interactive access is not permitted. Please submit a job through SLURM. For more information, see the Running Jobs section of the Bridges-2 User Guide.
What's the maximum time a job can run?
It depends on which partition you are submitting to. Each partition has a maximum time limit set. However, these limits can change at any time.
To see what the current limits are, type
sacctmgr show qos format=name%15,maxwall | grep partition
The output will show the maximum time limits for partitions, where the time format is Days-Hours:Minutes:Seconds.
The limit shown for ‘rmpartition’ applies to both the RM and RM-shared partitions. Similarly, the limit for ‘gpupartition’ applies to both the GPU and GPU-shared partitions.
rmpartition      2-00:00:00
gpupartition     2-00:00:00
empartition      5-00:00:00
rm512partition   2-00:00:00
Here you can see that the maximum time allowed in the RM and RM-shared partitions is two days, or 48 hours. The maximum time allowed in the EM partition is five days.
All scheduling policies, including the time limits, are always under review to ensure the best turnaround for users, and are subject to change at any time.
Why was I charged for 128 RM cores when I used less than that?
Your job probably ran in the RM partition. Jobs in the RM partition use one or more full RM nodes, which have 128 cores each.
If you need 64 cores or fewer, you can use the RM-shared partition. Jobs in RM-shared use half of one RM node or less.
See the Partitions section of the Bridges-2 User Guide for more information.
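As a sketch, a minimal RM-shared batch script could look like the following. The script name and program are placeholders; adjust the core count and walltime to your needs.

```
#!/bin/bash
#SBATCH -p RM-shared            # shared partition; charged only for cores requested
#SBATCH -N 1                    # RM-shared jobs run on a single node
#SBATCH --ntasks-per-node=64    # at most half of a 128-core RM node
#SBATCH -t 01:00:00             # one hour walltime

./my_program                    # placeholder for your actual commands
```

Submit it with sbatch myscript.sh.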
What is the difference between the RM and RM-shared or GPU and GPU-shared partitions?
Jobs in the RM partition use one or more full RM nodes, and are allocated all 128 cores on those nodes. Jobs in the RM-shared partition use only half of the cores (or less) on one RM node, and share the node with other jobs.
Similarly, jobs in the GPU partition use one or more entire GPU nodes, and are allocated all 8 GPUs on each node. Jobs in GPU-shared use at most 4 GPUs and share the node with other jobs.
Jobs in RM and GPU partitions are charged for the entire node (128 cores or 8 GPUs, respectively). Jobs in RM-shared and GPU-shared are only charged for the cores or GPUs that they are allocated.
See the Partitions section of the Bridges-2 User Guide for more information.
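For example, a GPU-shared request for two GPUs might be sketched as below. This uses the standard SLURM --gpus option; Bridges-2 may expect a GPU type in that option, so check the User Guide for the exact syntax. The program name is a placeholder.

```
#!/bin/bash
#SBATCH -p GPU-shared   # shared partition; charged only for GPUs requested
#SBATCH -N 1            # GPU-shared jobs run on a single node
#SBATCH --gpus=2        # 2 of the node's 8 GPUs (GPU-shared maximum is 4)
#SBATCH -t 02:00:00

./my_gpu_program        # placeholder for your actual commands
```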
Can I reserve nodes on Bridges-2?
Yes, if you have a significant reason that requires setting aside nodes for your exclusive use. Your account will be charged for the entire length of the reservation.
See the Reservation section of the Bridges-2 User Guide for more information.
SLURM error messages: What does this salloc or sbatch error mean?
Here are SLURM error messages for some common issues. If you have questions about these or other SLURM errors you see, please contact help@psc.edu.
salloc: error: Job submit/allocate failed: Invalid qos specification
This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use. To check, run the projects command to verify that you have access to that resource.
It is also possible that you have multiple projects, and those projects have access to different sets of resources. If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.
In a batch job, use #SBATCH -A ChargeID.
In an interact command, use interact -A ChargeID.
sbatch: error: Allocation requesting N gpus, GPU-shared maximum is 4
sbatch: error: Batch job submission failed: Access/permission denied
You are asking for more than 4 GPUs in the GPU-shared partition. Jobs in GPU-shared can only request up to half of one GPU node, a total of 4 GPUs.
sbatch: error: Allocation requesting N nodes, use GPU partition for multiple nodes
sbatch: error: Batch job submission failed: Access/permission denied
You are asking for multiple GPU nodes in the GPU-shared partition. To request multiple GPU nodes, you must use the GPU partition.
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use. To check, run the projects command to verify that you have access to that resource.
It is also possible that you have multiple projects, and those projects have access to different sets of resources. If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.
In a batch job, use #SBATCH -A ChargeID.
In an interact command, use interact -A ChargeID.
sbatch: error: QOSMaxCpuPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)
This error generally indicates that you are asking for more cores than allowed in the partition. For example, jobs in the RM-shared partition are limited to half of one node, which is 64 cores maximum.
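For instance, if the error came from a request like --ntasks-per-node=128 in RM-shared, reducing the request to the partition's maximum should clear it. A sketch (directive names are standard SLURM; values are illustrative):

```
#SBATCH -p RM-shared
#SBATCH --ntasks-per-node=64   # 64 cores = half an RM node, the RM-shared maximum
```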
sbatch: error: QOSMaxWallDurationPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)
Most often this error indicates that you are requesting more time than is allowed in a partition.
Make sure to check the maximum time allowed for a partition in the Running Jobs section of the Bridges-2 User Guide.
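For example, given the two-day limit shown above for the RM partition, a walltime request at or below that limit could be sketched as:

```
#SBATCH -p RM
#SBATCH -t 2-00:00:00   # 2 days, the current RM partition maximum
```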