Some of our clients are interested in containerizing their own HPC workflows and scaling their HPC infrastructure using Docker containers. For this reason, we wanted to take a deep dive into how one can run a small HPC cluster using Slurm running as a set of containers.
Slurm is an open source job scheduling system for Linux clusters. With its numerous features for executing, managing, and monitoring parallel jobs, it’s widely used in High Performance Computing (HPC). Moreover, Slurm is supported by popular frameworks for machine learning, such as Caffe, TensorFlow, and others.
Traditionally, Linux clusters run Slurm directly in the operating system of each node in order to give multiple users shared access to the HPC software installed on the nodes. This approach can deliver maximum computing performance, but it makes management cumbersome and leaves the cluster sensitive to changes during software updates.
Another approach is to use container technology. A container is a lightweight virtualization solution that allows isolated processes to run on a shared OS kernel. Containers are a good fit for Slurm-based clusters because they offer the following advantages:
- All in one. We can bundle the required HPC software and all of its dependencies in a single file — an image.
- Isolation. Containers running on the same system cannot affect each other in any unexpected way.
- Scalability. Using the image file, we can initiate as many container instances as we need without a complicated installation.
- Portability. Containers make it possible to run software developed for one system on another system.
Slurm is generally agnostic to containers, so it should work with any existing container management system.
In this article, we will demonstrate how to run a small HPC cluster using Slurm running as a set of containers. We selected Docker as the container management system because its flexible infrastructure assists with creating an experimental environment. Moreover:
- Docker is the most popular solution for running containers.
- Docker has a big community of users.
- Docker runs on the most popular operating systems: Linux, Windows, and macOS.
- Docker containers may be deployed in popular cloud providers such as Azure, AWS, OCI, GCP, etc.
It is worth adding that there are caveats related to running Slurm in Docker:
- Docker is not officially supported by the Slurm developers, because several of its design decisions make it a poor fit for traditional HPC systems with numerous non-privileged users.
- Running HPC jobs in containers inherently incurs some performance overhead during computations.
The goals of this article are as follows:
- Show how to run Slurm in Docker in detail.
- Provide a basic knowledge of the Docker commands required for image building and running containers.
- Create a repository with examples of config files that you can use as a reference when building your own clusters in Docker.
All scripts and config files for Docker and Slurm described in this article are shared in the XTREME-D GitHub repository.
We are going to build a simple HPC cluster with the following configuration:
As you can see, in our cluster we are going to deploy one head node ("axc-headnode"), four compute nodes ("axc-compute-[01-04]"), and a MariaDB RDBMS server ("mariadb") for Slurm's accounting data.
For simplicity, we will run several processes in a single container. This is not optimal from the point of view of Docker's guidelines, but it is the best way to mirror the classic "head node – compute nodes" hardware layout of HPC clusters.
Note that we are going to create a single Docker image that will be used as the base for both the head node and the compute nodes. Also, we will store the Slurm spool data and the MariaDB files in the file system of the host machine. This lets us preserve job information when the cluster is brought down and restore it when the cluster comes back up.
We will use the following software in the cluster:
- Slurm v20.02.7
- Munge for inter-node authentication
- Supervisor as a lightweight process manager for containers
- MariaDB v10.4 for accounting data as an independent container
First, let's decide the order in which we'll start the containers of our cluster. It should be like this:
- Start the MariaDB container
- Start the head node container. The head node depends on MariaDB, because it runs the slurmdbd process.
- Start the compute node containers. The compute nodes depend on the head node, so they must wait for it.
Also, there are important steps that must be completed before Slurm starts:
- Generate the Munge key
- Ensure that the spool directories for the Slurm daemons exist, and create them if they do not. The Slurm daemons cannot create these directories themselves, so this step is essential.
- Initiate Slurm cluster entry in the slurmdbd. This must be done after Slurm is started.
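The first two steps above are handled by the container entry point. A minimal sketch of such an entry-point script is shown below. This is a hypothetical illustration, not the repository's actual script: `DEMO_ROOT` stands in for `/` so the snippet is self-contained, whereas the real script would write to `/etc/munge` and `/var/spool` directly.

```shell
#!/bin/bash
# Hypothetical docker-entrypoint.sh sketch (NOT the repository's actual script).
# DEMO_ROOT stands in for "/" so the snippet can run anywhere.
set -e
DEMO_ROOT=$(mktemp -d)

# 1. Generate the Munge key once, on the first start.
mkdir -p "$DEMO_ROOT/etc/munge"
if [ ! -f "$DEMO_ROOT/etc/munge/munge.key" ]; then
    dd if=/dev/urandom of="$DEMO_ROOT/etc/munge/munge.key" bs=1024 count=1 2>/dev/null
    chmod 400 "$DEMO_ROOT/etc/munge/munge.key"
fi

# 2. Ensure the spool directories exist -- the Slurm daemons
#    will not create them on their own.
for d in var/spool/slurmd var/spool/slurmctld; do
    mkdir -p "$DEMO_ROOT/$d"
done

# 3. In the real container, the script would now hand control to the
#    process manager, e.g.: exec /usr/bin/supervisord -c /etc/supervisord.conf
echo "prepared $DEMO_ROOT"
```

The cluster-entry initialization (the third step) is deferred to a one-shot supervisor process, as described later, because it has to run after the Slurm daemons are up.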
Finally, container processes must be run in this defined order:
- run munged process
- run slurmd process
- run slurmctld process (head node only)
- run slurmdbd (head node only)
Having listed the requirements above, let's now decide how to meet them.
- In order to start the containers in the defined order, we use the depends_on property of docker-compose services.
- In order to run container processes in the defined order, we use the priority property of the supervisor.
- In order to generate the initial data, we create the entry point script for containers.
- In order to generate the initial data for slurmdbd that requires communication with the slurmd process, we create a one-shot supervisor process and place it after slurmd.
- We will not include the config files in the Docker image of the nodes; instead, we will mount them into the containers.
- We will include an extra configuration for the MariaDB container to ensure stable operation.
- We will direct all logs of container processes to /dev/stdout so that they appear in the output of the "docker logs" command.
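Putting the container-level decisions together, a trimmed docker-compose.yml might look like the sketch below. The service and host names follow the article; the image name, volume paths, and env file are assumptions for illustration, not the exact repository contents.

```yaml
version: "3"

services:
  mariadb:
    image: mariadb:10.4
    env_file: .env
    volumes:
      - ./mariadb/data:/var/lib/mysql      # persist accounting data on the host
      - ./mariadb/conf.d:/etc/mysql/conf.d # extra MariaDB settings

  axc-headnode:
    image: axc-node                        # the shared node image (assumed name)
    hostname: axc-headnode
    depends_on:
      - mariadb                            # slurmdbd needs the database
    volumes:
      - ./etc/slurm:/etc/slurm             # configs are mounted, not baked in
      - ./spool:/var/spool/slurm           # Slurm spool data on the host

  axc-compute-01:
    image: axc-node
    hostname: axc-compute-01
    depends_on:
      - axc-headnode                       # compute nodes wait for the head node
    volumes:
      - ./etc/slurm:/etc/slurm
```

The remaining compute nodes (axc-compute-02 through axc-compute-04) would be defined the same way.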
2.2 Config Files
Next, let’s define which config files we need to add to our cluster.
2.2.1 slurm.conf File
The "slurm.conf" file is the common config file for the Slurm processes. As a base, we take the default slurm.conf file and override some values. Let's look at the most important overrides.
First, we need to define the name of our cluster:
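The snippet itself is missing from the text. Assuming the cluster takes its name from the "axc" prefix used by the nodes (a guess, not confirmed by the source), the entry would look like:

```
ClusterName=axc
```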
Point out the host where the slurmctld is running:
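The snippet is missing here; given the head node's name in this cluster, it would be:

```
SlurmctldHost=axc-headnode
```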
Select the auth type for nodes:
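The snippet is missing here; since Munge handles inter-node authentication in this cluster, the setting would be:

```
AuthType=auth/munge
```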
Define the path for slurm spool data. It must be the directory-mounted from the host machine:
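The exact paths are not shown in the text. Using Slurm's conventional locations (an assumption; the repository may use different paths), this could look like:

```
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
```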
Define configs related to accounting:
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageType=accounting_storage/slurmdbd
Finally, we need to define slurm nodes and partitions of our cluster. Let’s define just one partition that includes all compute nodes:
NodeName=axc-headnode CPUs=2 State=UNKNOWN
NodeName=axc-compute-[01-04] CPUs=2 State=UNKNOWN
PartitionName=compute Nodes=axc-compute-[01-04] Default=Yes MaxTime=24:00:00 State=UP OverSubscribe=Yes
2.2.2 slurmdbd.conf File
This config file contains specific settings for the slurmdbd process. The most important overrides here are the credentials for the database we are going to use for accounting data:
StorageType=accounting_storage/mysql
StorageHost=mariadb
StoragePort=3306
StoragePass=
2.2.3 Supervisor Configs
As mentioned previously, we are going to run the processes in the cluster containers as supervisor programs. So, we need to create a supervisor app config file for each process. Moreover, we have to maintain the defined startup order:
axc-headnode:
- (1) munged
- (2) slurmd
- (3) slurmctld
- (4) slurmdbd
- (5) slurmdbd_init__oneshot

axc-compute-[01-04]:
- (1) munged
- (2) slurmd
The “slurmdbd_init__oneshot” command creates a cluster entry in the accounting database if it does not yet exist.
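A supervisor config for the head node might look like the sketch below, where the priority values enforce the order above. The exact program names, command flags, and the cluster name in the one-shot command are assumptions, not the repository's actual config:

```
[program:munged]
command=/usr/sbin/munged -F
priority=1
stdout_logfile=/dev/stdout

[program:slurmd]
command=/usr/sbin/slurmd -D
priority=2
stdout_logfile=/dev/stdout

[program:slurmctld]
command=/usr/sbin/slurmctld -D
priority=3
stdout_logfile=/dev/stdout

[program:slurmdbd]
command=/usr/sbin/slurmdbd -D
priority=4
stdout_logfile=/dev/stdout

[program:slurmdbd_init__oneshot]
; one-shot: register the cluster in the accounting DB, then exit
; ("axc" is an assumed cluster name)
command=/bin/bash -c "sacctmgr -i add cluster axc || true"
priority=5
startsecs=0
autorestart=false
```

The `startsecs=0` and `autorestart=false` settings tell supervisor that this program is expected to exit and should not be restarted.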
To provide the stability for the MariaDB container, we also need to add the config file with the following settings:
innodb_buffer_pool_size=1024M
innodb_log_file_size=48M
innodb_lock_wait_timeout=900
2.3 Build and Run
Having completed all preparations, let's build and run our Docker cluster.
Clone the repository:
git clone git@github.com:xtreme-d/docker-slurm-cluster.git docker-slurm-cluster
cd docker-slurm-cluster
Copy the env file and edit its content:
cp .env-sample .env
vi .env
Type the password for MariaDB in place of the <SECRET> placeholders and build the node image:
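The build command itself is missing from the text; with the compose setup used here it would typically be:

```
docker-compose build
```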
Start the cluster:
docker-compose up -d
Let’s check the state of containers:
$ docker-compose ps
     Name                  Command              State              Ports
--------------------------------------------------------------------------------------
axc-compute-01   /docker-entrypoint.sh /bin ...   Up    6817/tcp, 6818/tcp, 6819/tcp
axc-compute-02   /docker-entrypoint.sh /bin ...   Up    6817/tcp, 6818/tcp, 6819/tcp
axc-compute-03   /docker-entrypoint.sh /bin ...   Up    6817/tcp, 6818/tcp, 6819/tcp
axc-compute-04   /docker-entrypoint.sh /bin ...   Up    6817/tcp, 6818/tcp, 6819/tcp
axc-headnode     /docker-entrypoint.sh /bin ...   Up    6817/tcp, 6818/tcp, 6819/tcp
axc-mariadb      docker-entrypoint.sh mysqld      Up    3306/tcp
$
Everything is up, so we can proceed to the test.
After the cluster starts, let's test the Slurm commands. To do that, open a shell on the head node:
docker exec -it axc-headnode bash
Let’s ensure that Slurm processes are up:
[root@axc-headnode /]# supervisorctl status
munged                   RUNNING   pid 114, uptime 0:45:43
slurmctld                RUNNING   pid 209, uptime 0:45:29
slurmd                   RUNNING   pid 116, uptime 0:45:43
slurmdbd                 RUNNING   pid 196, uptime 0:45:32
slurmdbd_init__oneshot   EXITED    Jun 29 09:25 AM
sshd                     RUNNING   pid 113, uptime 0:45:43
As you can see, all Slurm-related processes are up and running. To ensure interconnection, let’s test munge:
[root@axc-headnode /]# munge -n; munge -n | unmunge; remunge
MUNGE:AwQDAAArpg0TEMSR/tGLHgcMv147rFy/HUkAqav3bPZf2wkQ+aQMYOTanDD57KdjKyMWQMPslybNHwkzqdiveudyMWz79e+MTPos1yhRbSqc8HNhhhOAPAE=:
STATUS:          Success (0)
ENCODE_HOST:     axc-headnode (172.20.0.3)
ENCODE_TIME:     2021-06-29 11:23:44 +0000 (1624965824)
DECODE_TIME:     2021-06-29 11:23:44 +0000 (1624965824)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha1 (3)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0
2021-06-29 11:23:44 Spawning 1 thread for encoding
2021-06-29 11:23:44 Processing credentials for 1 second
2021-06-29 11:23:45 Processed 2248 credentials in 1.000s (2247 creds/sec)
[root@axc-headnode /]#
Now, let’s test Slurm commands:
[root@axc-headnode /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[root@axc-headnode /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 1-00:00:00      4   idle axc-compute-[01-04]
[root@axc-headnode /]# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
Everything works fine. Let’s create and execute the test batch job:
[root@axc-headnode /]# cat > test.batch <<EOF
#!/bin/bash
#
#SBATCH --job-name=xs-test-job
#SBATCH --ntasks=4

srun hostname
srun sleep 10
EOF
[root@axc-headnode /]# sbatch test.batch
Submitted batch job 6
[root@axc-headnode /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 6   compute xs-test-     root  R       0:03      2 axc-compute-[01-02]
[root@axc-headnode /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 1-00:00:00      2  alloc axc-compute-[01-02]
compute*     up 1-00:00:00      2   idle axc-compute-[03-04]
[root@axc-headnode /]#
Finally, let’s test accounting:
[root@axc-headnode /]# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2              headnode    compute       root          1     FAILED      2:0
3              hostname    compute       root          1  COMPLETED      0:0
4              hostname    compute       root          4  COMPLETED      0:0
5              hostname    compute       root          5 CANCELLED+      0:0
6            xs-test-j+    compute       root          4  COMPLETED      0:0
6.batch           batch                  root          2  COMPLETED      0:0
6.0            hostname                  root          4  COMPLETED      0:0
6.1               sleep                  root          4  COMPLETED      0:0
[root@axc-headnode /]#
As you can see, Slurm works well in the Docker infrastructure, and there are many opportunities for its automated deployment. That said, the current Docker cluster cannot be considered a production-ready solution. This is not only because of the performance overhead, but also because there are limited things you can do with Slurm alone.
It would be much better if support were added for machine learning frameworks. We also need to add the following improvements:
- Add an authentication server for user management
- Provide the ability to deploy a cluster using container orchestration tools like Docker Swarm or Kubernetes
- Provide the ability to deploy a cluster to cloud services
And that is exactly what we will cover in our future articles.
About the Author
Yury Krapivko is a software engineer with more than 10 years of extensive working experience. He has been engaged in different kinds of projects in full-stack web development, cloud engineering, and high-performance computing, and has been working in the HPC field for more than five years.