Slurm in Docker: Proof of Concept

Some of our clients are interested in containerizing their own HPC workflows and scaling their HPC infrastructure using Docker containers. For this reason, we wanted to take a deep dive into how one can run a small HPC cluster using Slurm running as a set of containers.

Slurm is an open source job scheduling system for Linux clusters. With its numerous features for executing, managing, and monitoring parallel jobs, it’s widely used in High Performance Computing (HPC). Moreover, Slurm is supported by popular machine learning frameworks such as Caffe and TensorFlow.

Traditionally, Linux clusters run Slurm directly in the operating system of each node in order to give multiple users shared access to the HPC software installed on the nodes. This approach can provide maximum computing performance, but it can be cumbersome to manage and fragile during software updates.

Another approach is to use container technology. A container is a lightweight virtualization solution that allows isolated processes to run on a shared OS kernel. Containers are a good fit for Slurm-based clusters because they offer the following advantages:

  • All in one. We can bundle the required HPC software and all of its dependencies in a single file — an image.
  • Isolation. Containers running on the same system cannot affect each other in unexpected ways.
  • Scalability. Using the image file, we can initiate as many container instances as we need without a complicated installation.
  • Portability. Containers make it possible to run software developed for one system on another system.

Slurm is generally agnostic to containers, so it should work with any existing container management system.

In this article, we will demonstrate how to run a small HPC cluster with Slurm running as a set of containers. We selected Docker as the container management system because its flexible tooling makes it easy to set up an experimental environment. Moreover:

  • Docker is the most popular solution for running containers.
  • Docker has a large community of users.
  • Docker is supported on the most popular operating systems: Linux, Windows, macOS, FreeBSD, and others.
  • Docker containers may be deployed on popular cloud providers such as Azure, AWS, OCI, and GCP.

It is worth adding that there are caveats related to running Slurm in Docker:

  • Docker is not officially supported by the Slurm developers, because several of its design decisions make it a poor fit for traditional HPC systems with many non-privileged users.
  • Running HPC jobs in containers inherently carries some performance overhead during computation.

The goals of this article are as follows:

  • Show how to run Slurm in Docker in detail.
  • Provide basic knowledge of the Docker commands required for building images and running containers.
  • Create a repository with examples of config files that you can use as a reference when building your own clusters in Docker.

All scripts and config files for Docker and Slurm described in this article are shared in the XTREME-D GitHub repository.

1. Architecture

We are going to build a simple HPC cluster with the following configuration: one head node ("axc-headnode"), four compute nodes ("axc-compute-[01-04]"), and a MariaDB RDBMS server ("mariadb") for Slurm's accounting data.

For simplicity, we will run several processes in a single container. This goes against Docker's guidelines, but it is the best way to compare the Docker-based Slurm deployment with the classic "head node plus compute nodes" hardware layout of HPC clusters.

Note that we are going to create a single Docker image that will serve as the base for both the head node and the compute nodes. We will also store the Slurm spool data and the MariaDB files in the file system of the host machine, which lets us preserve job information when the cluster is taken down and restore it when the cluster is brought back up.
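
To make this concrete, here is a rough sketch of what the docker-compose service definitions might look like. The image build context, volume paths, and option values below are illustrative assumptions, not the actual compose file from the repository:

services:
  mariadb:
    image: mariadb:10.4
    hostname: mariadb
    volumes:
      # accounting data is kept on the host so it survives container removal
      - ./volumes/mariadb:/var/lib/mysql
  axc-headnode:
    # the head node and the compute nodes are built from the same image
    build: .
    hostname: axc-headnode
    volumes:
      # the Slurm spool data is kept on the host as well
      - ./volumes/slurm/ctld:/var/spool/slurm/ctld
  axc-compute-01:
    build: .
    hostname: axc-compute-01
  # axc-compute-02 to axc-compute-04 follow the same pattern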

We will use the following software in the cluster:

  • Slurm v20.02.7
  • Munge for inter-node authentication
  • Supervisor as a lightweight process manager for containers
  • MariaDB v10.4 for accounting data as an independent container

On the host machine, we need at least Docker CE and docker-compose.

2. Implementation

First, let's decide the order in which the containers of our cluster should start. It should be as follows:

  1. Start the MariaDB container
  2. Start the head node container. The head node depends on MariaDB because it runs the slurmdbd process.
  3. Start the compute node containers. The compute nodes depend on the head node, so they must wait for it.

Also, there are important initialization steps. Steps 1 and 2 must be completed before the Slurm daemons start, and step 3 once slurmdbd is up (a sketch of the entry point script that handles the first two follows the list):

  1. Generate the Munge key
  2. Ensure that the spool directories for the Slurm daemons exist and create them if they do not. The daemons cannot create these directories themselves, so this step is essential.
  3. Create the Slurm cluster entry in slurmdbd. This must be done after slurmdbd is started.
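
A rough sketch of an entry point script that covers the first two steps is shown below. The exact paths, user names, and key distribution mechanism are assumptions for illustration, not a copy of the script from the repository:

#!/bin/bash
# docker-entrypoint.sh (sketch)

# 1. Generate the Munge key if it does not exist yet. All nodes must share the
#    same key, so /etc/munge is assumed to be a directory mounted from the host.
if [ ! -f /etc/munge/munge.key ]; then
    dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
fi

# 2. Create the Slurm spool directories, since the daemons cannot create them themselves.
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d
chown -R slurm:slurm /var/spool/slurm

# Hand control over to the container command (supervisord), which starts the daemons.
exec "$@"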

Finally, the processes inside the containers must be started in the following order:

  • run munged process
  • run slurmd process
  • run slurmctld process (head node only)
  • run slurmdbd (head node only)

2.1 Conventions

Having listed the requirements above, let's decide how to satisfy them.

  • In order to start the containers in the required order, we use the depends_on property of docker-compose services (sketched after this list).
  • In order to run container processes in the defined order, we use the priority property of the supervisor.
  • In order to generate the initial data, we create the entry point script for containers.
  • In order to generate the initial data for slurmdbd, which requires the slurmdbd process to be running, we create a one-shot supervisor program and place it after slurmdbd.
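
To illustrate the first convention, the dependency chain in docker-compose.yml looks roughly like this (only the ordering-related keys are shown; this is a sketch, not the actual file from the repository):

services:
  axc-headnode:
    depends_on:
      - mariadb
  axc-compute-01:
    depends_on:
      - axc-headnode
  # the remaining compute nodes also depend on axc-headnode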

Also:

  • We will not include the config files in the Docker image of the nodes; we will mount them into the containers instead.
  • We will include an extra configuration for the MariaDB container to ensure stable operation.
  • We will direct all logs of container processes to /dev/stdout so that they appear in the output of the “docker logs” command.

2.2 Config Files

Next, let’s define which config files we need to add to our cluster.

2.2.1 slurm.conf File

The slurm.conf file is the common config file for the Slurm processes. As a base, we take the default slurm.conf file and override some values. Let's look at the most important overrides.

First, we need to define the name of our cluster:

ClusterName=docker-slurm-cluster

Specify the host where slurmctld runs:

SlurmctldHost=axc-headnode

Select the auth type for nodes:

AuthType=auth/munge

Define the path for the Slurm spool data. It must be the directory mounted from the host machine:

StateSaveLocation=/var/spool/slurm/ctld

Define configs related to accounting:

AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageType=accounting_storage/slurmdbd

Finally, we need to define the Slurm nodes and partitions of our cluster. Let's define just one partition that includes all the compute nodes:

NodeName=axc-headnode CPUs=2 State=UNKNOWN
NodeName=axc-compute-[01-04] CPUs=2 State=UNKNOWN
PartitionName=compute Nodes=axc-compute-[01-04] Default=Yes MaxTime=24:00:00 State=UP OverSubscribe=Yes

2.2.2 slurmdbd.conf File

This config file contains specific settings for the slurmdbd process. The most important overrides here are the credentials for the database we are going to use for accounting data:

StorageType=accounting_storage/mysql
StorageHost=mariadb
StoragePort=3306
StoragePass=
StorageUser=slurm
StorageLoc=slurm_acct_db

2.2.3 Supervisor Configs

As mentioned previously, we are going to run the processes in the cluster containers as Supervisor programs, so we need a Supervisor program config file for each process. Moreover, we have to maintain the defined startup order:

axc-headnode
- (1) munged
- (2) slurmd
- (3) slurmctld
- (4) slurmdbd
- (5) slurmdbd_init__oneshot

axc-compute-[01-04]
- (1) munged
- (2) slurmd

The “slurmdbd_init__oneshot” command creates a cluster entry in the accounting database if it does not yet exist.
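
For illustration, the program definitions for the head node might look roughly like the following. The file layout, command paths and flags, and the exact registration command are assumptions; in particular, the real one-shot script may wait or retry until slurmdbd is ready:

[program:munged]
priority=1
command=/usr/sbin/munged -F
; logs go to /dev/stdout so they show up in "docker logs"
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0

[program:slurmd]
priority=2
command=/usr/sbin/slurmd -D

[program:slurmctld]
priority=3
command=/usr/sbin/slurmctld -D

[program:slurmdbd]
priority=4
command=/usr/sbin/slurmdbd -D

[program:slurmdbd_init__oneshot]
; one-shot step: register the cluster in the accounting database unless it already exists
priority=5
command=/bin/bash -c "sacctmgr -n show cluster docker-slurm-cluster | grep -q . || sacctmgr -i add cluster docker-slurm-cluster"
startsecs=0
autorestart=false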

2.2.4 mariadb-slurmdbd.cnf

To ensure the stability of the MariaDB container, we also need to add a config file with the following settings:

innodb_buffer_pool_size=1024M
innodb_log_file_size=48M
innodb_lock_wait_timeout=900

2.3 Build and Run

Having completed all the preparations, let's build and run our Docker cluster.

Clone the repository:

git clone git@github.com:xtreme-d/docker-slurm-cluster.git docker-slurm-cluster
cd docker-slurm-cluster

Copy the env file and edit its content:

cp .env-sample .env
vi .env

Type the password for MariaDB in place of the <SECRET> placeholders and build the node image:

docker-compose build

Start the cluster:

docker-compose up -d

Let’s check the state of containers:

$ docker-compose ps                                                                                                        
     Name                   Command               State              Ports            
--------------------------------------------------------------------------------------
axc-compute-01   /docker-entrypoint.sh /bin ...   Up      6817/tcp, 6818/tcp, 6819/tcp
axc-compute-02   /docker-entrypoint.sh /bin ...   Up      6817/tcp, 6818/tcp, 6819/tcp
axc-compute-03   /docker-entrypoint.sh /bin ...   Up      6817/tcp, 6818/tcp, 6819/tcp
axc-compute-04   /docker-entrypoint.sh /bin ...   Up      6817/tcp, 6818/tcp, 6819/tcp
axc-headnode     /docker-entrypoint.sh /bin ...   Up      6817/tcp, 6818/tcp, 6819/tcp
axc-mariadb      docker-entrypoint.sh mysqld      Up      3306/tcp                    
$

Everything is up, so we can proceed to the test.

3. Test

After the cluster starts, let's test the Slurm commands. To do that, access the head node's shell:

docker exec -it axc-headnode bash

Let’s ensure that Slurm processes are up:

[root@axc-headnode /]# supervisorctl status                                                                                                                                          
munged                           RUNNING   pid 114, uptime 0:45:43
slurmctld                        RUNNING   pid 209, uptime 0:45:29
slurmd                           RUNNING   pid 116, uptime 0:45:43
slurmdbd                         RUNNING   pid 196, uptime 0:45:32
slurmdbd_init__oneshot           EXITED    Jun 29 09:25 AM
sshd                             RUNNING   pid 113, uptime 0:45:43

As you can see, all Slurm-related processes are up and running. To ensure interconnection, let’s test munge:

[root@axc-headnode /]# munge -n; munge -n | unmunge; remunge
MUNGE:AwQDAAArpg0TEMSR/tGLHgcMv147rFy/HUkAqav3bPZf2wkQ+aQMYOTanDD57KdjKyMWQMPslybNHwkzqdiveudyMWz79e+MTPos1yhRbSqc8HNhhhOAPAE=:
STATUS:           Success (0)
ENCODE_HOST:      axc-headnode (172.20.0.3)
ENCODE_TIME:      2021-06-29 11:23:44 +0000 (1624965824)
DECODE_TIME:      2021-06-29 11:23:44 +0000 (1624965824)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

2021-06-29 11:23:44 Spawning 1 thread for encoding
2021-06-29 11:23:44 Processing credentials for 1 second
2021-06-29 11:23:45 Processed 2248 credentials in 1.000s (2247 creds/sec)
[root@axc-headnode /]#

Now, let’s test Slurm commands:

[root@axc-headnode /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[root@axc-headnode /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 1-00:00:00      4   idle axc-compute-[01-04]
[root@axc-headnode /]# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- --------

Everything works fine. Let's create and execute a test batch job:

[root@axc-headnode /]# cat > test.batch <<EOF
#!/bin/bash
#
#SBATCH --job-name=xs-test-job
#SBATCH --ntasks=4

srun hostname
srun sleep 10
EOF
[root@axc-headnode /]# sbatch test.batch
Submitted batch job 6
[root@axc-headnode /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 6   compute xs-test-     root  R       0:03      2 axc-compute-[01-02]
[root@axc-headnode /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 1-00:00:00      2  alloc axc-compute-[01-02]
compute*     up 1-00:00:00      2   idle axc-compute-[03-04]
[root@axc-headnode /]#

Finally, let’s test accounting:

[root@axc-headnode /]# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
2              headnode    compute       root          1     FAILED      2:0 
3              hostname    compute       root          1  COMPLETED      0:0 
4              hostname    compute       root          4  COMPLETED      0:0 
5              hostname    compute       root          5 CANCELLED+      0:0 
6            xs-test-j+    compute       root          4  COMPLETED      0:0 
6.batch           batch                  root          2  COMPLETED      0:0 
6.0            hostname                  root          4  COMPLETED      0:0 
6.1               sleep                  root          4  COMPLETED      0:0 
[root@axc-headnode /]#

4. Conclusion

As you can see, Slurm works well in the Docker infrastructure, and there are many opportunities for its automated deployment. That said, the current Docker cluster cannot be considered a production-ready solution. This is not only because of the performance overhead, but also because Slurm alone offers limited functionality.

It would be much better with support for machine learning frameworks. We would also need the following improvements:

  • Add an authentication server for user management
  • Provide the ability to deploy a cluster using container orchestration tools like Docker Swarm or Kubernetes
  • Provide the ability to deploy a cluster to cloud services

And that is exactly what we will cover in our future articles.

<About the author>
Yury Krapivko is a software engineer with more than 10 years of extensive working experience. He has been engaged in a variety of projects in full-stack web development, cloud engineering, and high-performance computing, and has worked in the HPC field for more than five years.