Beginner’s Guide to HPCs
What’s the Use Case
So I’ve always been interested in building new tools, and a unique opportunity came along that let me start building a computing platform for our internal team. Basically, the use case was this: we had a lot of GPU hardware and wanted a platform where data scientists could deploy jobs without having to deal with things like Linux commands, job scheduling, or deployment-related work in general.
The workflow was therefore simple: users develop their models locally, then use simple tools to deploy their jobs without worrying about the deployment process.
What Were You Looking For
To be honest, even now I’m still learning the best practices and technology for deploying models. I knew cloud was out of the question because of costs, and for simple training jobs it would just be easier to work with on-premise hardware.
So the job was to find a technology which would do the following:
- Schedule jobs across multiple machines
- Manage logging, services, security
- Be easy to set up as more hardware comes online
So naturally, I started looking into HPCs.
What is an HPC
HPC (High Performance Computing) platforms let you aggregate multiple compute nodes that your compute-intensive job can utilize. If you were trying to train the next ChatGPT or a 3D visualization model or whatever, you would need something more powerful than your single-GPU desktop. HPCs allow you to run compute-intensive tasks on a large cluster of compute hardware.
What did you go with?
There’s a lot of stuff out there, from cloud services like Azure Arc to workload managers like Torque, but the one I found to have the lowest barrier to entry (and that seemed very popular in the academic world) was SLURM.
SLURM
SLURM (Simple Linux Utility for Resource Management) is an open-source, scalable job scheduler designed for high-performance computing (HPC) clusters. It manages the allocation of compute resources, queuing of jobs, and scheduling of tasks on a cluster.
SLURM is widely used in research institutions, universities, and large organizations that run workloads requiring powerful computing resources.
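Day to day, users interact with a SLURM cluster through a small set of command-line tools. These are the ones you’ll see throughout this guide:

sinfo            # show partitions and node states
squeue           # show queued and running jobs
sbatch job.sh    # submit a batch job script
srun hostname    # run a command on the cluster interactively
scancel <JOB-ID> # cancel a job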
Architecture
SLURM’s architecture is pretty simple: you have a single control node (or multiple, depending on your needs) and multiple compute nodes. The nodes share a filesystem that they use to run your project.
- Control Node
The control node is the heart of the SLURM system, running the SLURM Controller (slurmctld). This daemon manages the entire cluster, including job scheduling, resource allocation, and communication with other nodes. The control node also communicates with the database (if configured with slurmdbd) to track job history and resource usage.
- Compute Node
These are the nodes that perform the actual computation. SLURM runs the SLURM Daemon (slurmd) on each compute node, which accepts jobs from the control node, allocates resources, and executes the assigned tasks.
- Database
If SLURM accounting is enabled, the slurmdbd service records job history, resource usage, and other statistics in a database. This is used for tracking, reporting, and job accounting. slurmdbd interacts with databases like MariaDB or MySQL.
- Login Nodes
These nodes allow users to interact with the SLURM system by submitting jobs, monitoring job status, and performing administrative tasks. You can also just provide tools such as UIs or CLIs, depending on how you want users to interact with the cluster.
How It Works
SLURM allocates resources by defining partitions, which are logical groupings of compute nodes. Users submit jobs to these partitions, and SLURM schedules and dispatches the jobs according to priority, availability, and configured policies.
- Job Submission: A job submission script specifies the resources (e.g., CPUs, memory, GPUs) required, the executable command, and the partition (see the example script after this list).
- Job Scheduling: SLURM’s scheduler evaluates job requests based on priority and resource availability. Jobs are either dispatched immediately or placed in a queue.
- Job Execution: The job is dispatched to the appropriate compute node(s), where it is executed, and results are stored or returned as needed.
- Job Completion: Once the job is complete, SLURM releases the resources and records the job's statistics.
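To make the submission step concrete, here is a minimal sketch of a batch script. The partition name and resource numbers below are placeholders; adjust them to whatever your cluster actually defines.

#!/bin/bash
#SBATCH --job-name=hello        # name shown in squeue
#SBATCH --partition=batch       # placeholder partition name
#SBATCH --nodes=1               # number of nodes to allocate
#SBATCH --cpus-per-task=4       # CPUs for the task
#SBATCH --mem=8G                # memory per node
#SBATCH --output=hello_%j.log   # %j expands to the job ID

srun hostname

Submitting it is then just sbatch hello.sh; SLURM prints a job ID and the job enters the queue.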
Building SLURM
References
https://slurm.schedmd.com/quickstart_admin.html#build_install
https://www.schedmd.com/download-slurm/
Prerequisites
SLURM runs on Linux nodes, so familiarity with Linux and its commands is a must. Most of the commands and steps here involve installing deb or rpm packages, setting up systemd daemons, managing filesystems, and working with command-line tools.
It’s also going to be important to understand filesystems and how to use an NFS server.
Later on, we will be going over utilizing the SLURM API, so being familiar with REST APIs and token management will be useful.
Install Packages
SLURM packages are versioned by the year.month of release. For example, I am using the stable build 23.11.10, from the 23.11 release line (November 2023). Feel free to choose a newer build, but for the sake of this setup your version should be at least that recent.
Hardware Requirements
To install SLURM, you need at least the following available:
- One Linux machine to act as the control node
- One or more Linux machines to act as the compute node(s)
- An open network connection between the control node and the compute nodes
- A database server to store accounting information about the cluster (optionally, if your control node has a large amount of disk space, you can run the database on the control node)
- NFS or another shared filesystem to share data between the nodes (see the sketch after this list)
- SSH access to all nodes
- Root access (the ability to run sudo commands) on all nodes
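If you don’t yet have a shared filesystem, a minimal NFS setup looks roughly like this. This is a sketch assuming the control node doubles as the NFS server and /shared is a hypothetical export path; restrict the export to your subnet in production.

# On the NFS server (here, the control node)
sudo apt install nfs-kernel-server
sudo mkdir -p /shared
echo "/shared *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On each compute node
sudo apt install nfs-common
sudo mkdir -p /shared
sudo mount CONTROL_NODE_IP:/shared /shared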
In terms of OS, it doesn’t matter as long as it can build either RPM or Debian packages. There is also an option to build manually from source, but I find that to be outside the scope of this blog. The documentation for building from source can be found here: https://slurm.schedmd.com/quickstart_admin.html#manual_build
This time though, we’re going to be setting up our cluster using Ubuntu 22.04.
Note: 22.04 is best for installing packages straight from the deb builds (which we will be showing here). However, if you would rather just use the Ubuntu package (slurm-wlm), I would suggest using 24.04 so that it pulls in a more recent stable SLURM build.
Setup the Control Node
Prerequisites
Before we install the binaries, we need to install some prerequisites. These include development tools, MUNGE, and MySQL/MariaDB (for slurmdbd).
There is a full list of optional + recommended prerequisites here: https://slurm.schedmd.com/quickstart_admin.html#prereqs
Some of the basic ones we’re going to install include:
- MariaDB
- libjwt (for JWT token auth)
- MUNGE
Building
- Sync the UID/GID
# Create the slurm user/group for the control node
sudo su
groupadd -g 1005 slurm && useradd -m -u 1005 -g slurm slurm
This will create the user slurm:slurm with the UID/GID of 1005 (can be any UID/GID). It is important for all nodes to have the same UID/GID for the slurm user.
This user will have the necessary permissions to run any slurm daemon such as slurmctld or slurmd.
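You can quickly verify the IDs match by running this on every node:

id slurm   # should print uid=1005(slurm) gid=1005(slurm) on each node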
- Hostname configuration via /etc/hosts
Next, let’s make sure all nodes are able to communicate with each other. Linux systems have a /etc/hosts file which tells a node about other hosts it can communicate with.
Here’s what the control node would look like.
# Localhost
127.0.0.1 localhost
# SLURM COMPUTE NODE
NODE_IP NODE_NAME
This tells the node “Hey, here is a computer called NODE_NAME under the IP NODE_IP.” So when SLURM calls the node via NODE_NAME we don’t get a Hostname not found error.
This will be very important when we are entering in the slurm configuration file.
For example, a control node with 2 compute nodes would have a /etc/hosts file like this:
# Localhost
127.0.0.1 localhost
# SLURM COMPUTE NODES
NODE_IP1 NODE_NAME1
NODE_IP2 NODE_NAME2
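To confirm the entries work, you can resolve and reach each node by name from the control node:

getent hosts NODE_NAME1   # should print the IP you entered
ping -c 1 NODE_NAME1      # should get a reply from that node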
- Update package manager and install tooling
sudo apt update && sudo apt upgrade
- SSH should already be setup for all the nodes, but if not, install the openssh server and client
sudo apt install openssh-server openssh-client
- Setup Munge
MUNGE is an authentication service for validating credentials within a group of local or remote processes. Nodes share a cryptographic MUNGE key, which allows them to authenticate a UID and GID across the cluster.
sudo apt install munge libmunge2 libmunge-dev

# Test installation - should show STATUS: SUCCESS
munge -n | unmunge | grep STATUS

# If you can't cat the munge key at /etc/munge/munge.key, create one using this command
sudo /usr/sbin/mungekey

# Set up the correct permissions
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/munge.key

# Restart the munge service and configure it to run at startup
# (running without sudo may prompt you to authenticate)
systemctl enable munge
systemctl restart munge

# You can investigate munge service errors with
# systemctl status munge OR
# sudo nano /var/log/munge/munged.log
Later on, we will be copying the MUNGE key over to the compute nodes.
- Installation
Download the files and actually build them. You can download the tarball here: https://www.schedmd.com/download-slurm/
# Install basic Debian package build requirements:
apt-get install build-essential fakeroot devscripts equivs

# Unpack the distributed tarball:
tar -xaf slurm*tar.bz2

# cd to the directory containing the Slurm source, then
# install the Slurm package dependencies:
mk-build-deps -i debian/control

# Build the Slurm packages:
debuild -b -uc -us

# Once done, go to the folder containing the deb files and install.
# Example (note the ./ so apt treats these as local files):
apt install ./slurm-smd_23.11.4-1_amd64.deb
apt install ./slurm-smd-client_23.11.4-1_amd64.deb
# You can find which node needs what here: https://slurm.schedmd.com/quickstart_admin.html#debuild
- Configuration
SLURM uses a slurm.conf file to specify all the configuration required to run your cluster. This file is shared with all the nodes, and must be kept up to date whenever a node or feature is added.
SLURM has a web configuration file generator located here: https://slurm.schedmd.com/configurator.html
You don't have to fill out all of the fields in the configuration tool, since a lot of them can be left at their defaults. The following fields are the ones we had to configure manually:
- ClusterName: <YOUR-CLUSTER-NAME>
- SlurmctldHost: <CONTROLLER-NODE-NAME>
- NodeName: <WORKER-NODE-NAME>[1-4] (this would mean that you have four worker nodes called <WORKER-NODE-NAME>1, <WORKER-NODE-NAME>2, <WORKER-NODE-NAME>3, <WORKER-NODE-NAME>4)
- CPUs, Sockets, CoresPerSocket, and ThreadsPerCore: enter values according to lscpu (run on a worker node)
- ProctrackType: LinuxProc
Once you press the submit button at the bottom of the configuration tool, your config file text will appear in your browser. Copy this text into a new /etc/slurm/slurm.conf file and save it.
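For orientation, the generated file will contain lines along these lines. This is a trimmed sketch with placeholder values, not a complete config:

# /etc/slurm/slurm.conf (excerpt, placeholder values)
ClusterName=<YOUR-CLUSTER-NAME>
SlurmctldHost=<CONTROLLER-NODE-NAME>
ProctrackType=proctrack/linuxproc
NodeName=<WORKER-NODE-NAME>[1-4] CPUs=16 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP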
Note
- At this point you should copy the conf file to your workers as well. Use scp if you’re working with remote machines:
scp username@CONTROLLER-HOSTNAME:/etc/slurm/slurm.conf username@WORKER-HOSTNAME:/etc/slurm/slurm.conf
- If your cluster requires the use of GPUs (as most HPC clusters will), you will need another configuration file to tell SLURM where the device files are. In this example, we will assume that we’re using NVIDIA GPUs and that we have nvidia-smi installed.
- At the end of the slurm.conf file, you need to change the configuration in 2 places (see the snippet after the gres.conf example below).
- Under the NodeName=... line, add Gres=gpu:NUMBER_OF_GPUS_IN_NODE
- Uncomment GresTypes and set it to gpu
- Add a gres.conf to your compute node so it can find the NVIDIA device files
# Find the NVIDIA device files
ls -l /dev/nvidia?

# Then list those devices in a gres.conf on the compute node.
# The file must be in the same directory as slurm.conf:
AutoDetect=off
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
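Putting the two slurm.conf changes together, the GPU-related lines end up looking something like this (node name and GPU count are placeholders for your own values):

# slurm.conf (excerpt, placeholder values)
GresTypes=gpu
NodeName=<WORKER-NODE-NAME>[1-4] Gres=gpu:2 CPUs=16 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN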
You can also name your GPUs, such as a100 or 5090, but that’s beyond the scope of this project. For now, let’s assume every user will be requesting the gres name gpu.
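With this in place, users can request GPUs by that name when submitting jobs, for example:

srun --gres=gpu:1 nvidia-smi   # allocate one GPU and print its status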
Now that should be enough for the control node. Next, let’s set up the compute nodes.
Setup the Compute Nodes
The compute nodes mirror the control node setup: the same slurm user with the same UID/GID, the MUNGE key copied over from the control node, the deb packages built earlier, and the same slurm.conf (and gres.conf, if you’re using GPUs) in /etc/slurm. The daemon to enable here is slurmd rather than slurmctld.
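Here is a condensed sketch of those steps, run as root on each compute node. The package filenames come from your own debuild output, and the hostnames are placeholders:

# 1. Create the same slurm user as on the control node
groupadd -g 1005 slurm && useradd -m -u 1005 -g slurm slurm

# 2. Install MUNGE and copy the key over from the control node
apt install munge libmunge2
scp root@CONTROLLER-HOSTNAME:/etc/munge/munge.key /etc/munge/munge.key
chown munge: /etc/munge/munge.key && chmod 0700 /etc/munge/munge.key
systemctl enable munge && systemctl restart munge

# 3. Install the SLURM packages this node needs (see the debuild link above)
apt install ./slurm-smd_23.11.4-1_amd64.deb ./slurm-smd-slurmd_23.11.4-1_amd64.deb

# 4. Copy slurm.conf (and gres.conf if using GPUs) into /etc/slurm,
#    then enable the compute daemon
systemctl enable slurmd
systemctl restart slurmd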
Startup
Now start the SLURM controller daemon on the control node and configure it to run at boot:
systemctl enable slurmctld
systemctl restart slurmctld
To check your SLURM installation:
systemctl status slurmctld # returns the status of the systemd daemon
sinfo # returns cluster information
You can also check that the cluster is correctly set up by running
srun -N <NUMBER-OF-NODES> hostname
where <NUMBER-OF-NODES> is the number of worker nodes currently set up. If everything goes well, this should return the name of each of your nodes.
Conclusion
Congratulations! You've successfully built a basic SLURM HPC cluster. You now have:
- A control node running slurmctld to manage your cluster
- Compute nodes ready to execute jobs
- Shared filesystem for data access across nodes
- GPU support configured via gres.conf
- Authentication via MUNGE
This foundational setup is just the beginning. In production environments, you'll want to add:
- Job accounting - Track resource usage and job history
- REST API access - Enable programmatic job submission
- Advanced scheduling - Configure fair-share, QOS, and resource limits
- Monitoring - Set up dashboards for cluster health and utilization
- Backups - Protect configuration files and databases
What's Next?
This is Part 1 of a series on building and managing HPC clusters. Continue with:
- Part 2: Setting Up SLURM Database (slurmdbd) - Add job accounting and usage tracking
- Part 3: Setting Up SLURM REST API (slurmrestd) - Enable programmatic access to your cluster
If you're considering building an HPC cluster for your team, I hope this guide gives you a solid starting point. Good luck!