Pwning Supercomputers - A 20yo vulnerability in Munge

A local buffer overflow vulnerability exists in the Munge authentication daemon used in High Performance Computing (HPC).


Introduction

One of our customers asked for a security review of its High Performance Computing (HPC) infrastructure. During the assessment, the auditor discovered a buffer overflow vulnerability in Munge, the authentication daemon of the cluster. This bug has been present in the codebase for approximately 20 years, and every version up to 0.5.17 is impacted. The vulnerability can be exploited locally to leak the Munge secret key, allowing an attacker to forge arbitrary Munge tokens that are valid across the cluster. In a way, this is a Local Privilege Escalation in the context of High Performance Computers.

This article provides some background on how High Performance Computers work in general, and gives a quick overview of Slurm and Munge. If you are not interested in HPC, you can jump directly to the exploitation part.

Context

HPCs

[Image: Blue Gene/P Intrepid Supercomputer]

High Performance Computers (HPC-ers), also known as supercomputers, are used by many organisations for all sorts of heavy computations, from weather forecasting to biology research. The TOP500 website gathers information and statistics about HPC-ers, and is a great source for learning about the current state of this unusual field.

What is HPC? How does it work?

Nowadays, an HPC-er is just a fancy name for a big cluster of Linux machines. That being said, let's look in a bit more detail at how they work. An HPC-er should not be seen as one machine with a superfast CPU. It is a cluster of many machines, called computing nodes, designed to run parallelizable jobs. A group of machines does not automatically become an HPC-er. An HPC-er is defined by the following properties:

- A certain homogeneity amongst the computing nodes: they must all run the same software stack and offer approximately the same hardware. The nodes do not all need to be identical, but they can be grouped by hardware capabilities: GPUs, FPGAs... The general idea is that nodes within a group of similar nodes can replace each other. A computing job is not written to work on one specific node; it must be able to run on the entire cluster.
- A high-bandwidth, low-latency interconnect between computing nodes: computing nodes are linked together using dedicated technologies. Since an HPC-er is composed of many computing nodes, it is very important that these nodes communicate efficiently to synchronize job progress or share the data being processed.
- An orchestration/scheduling mechanism to distribute jobs on the cluster: to be distributed effectively across the computing nodes, jobs need to be orchestrated. A scheduler is responsible for dispatching new jobs to available nodes and ensuring fairness between users.

Many different schedulers exist; one of the most commonly used is Slurm. According to SchedMD, the company behind Slurm, approximately 65% of all HPC-ers in the world use it. Its main role is to allocate computing resources, such as CPUs, memory, and GPUs, to users' jobs in an efficient and fair way. Users submit jobs to Slurm, specifying the resources they need, and Slurm decides when and where these jobs will run based on availability and scheduling policies. By managing job queues and resource usage across a cluster, Slurm helps maximize system utilization while ensuring that multiple users can share the HPC infrastructure smoothly.

When launching a computing job on the HPC-er, the user does not choose which set of nodes it is going to run on. Slurm is therefore a critical component: it plays the role of a kind of "meta" operating system for a "meta" computer made up of multiple machines. The security bug discovered here impacts Munge, the authentication daemon used by default in Slurm installations.

[Image: TOP500: 100% of the most powerful HPC-ers use Linux]

Slurm and Munge

To understand the impact and exploitability of the security bug, a quick overview of Slurm and Munge is necessary.

Slurm job creation

First of all, note that HPC-ers are often multi-user computers. For example, a research institute manages an HPC-er and allows every researcher to use it. The standard workflow of a user on a Slurm-controlled HPC-er is the following:

- Researcher A connects over SSH to a first machine dedicated to submitting jobs to the cluster.
- From there, they use commands such as srun or salloc to request nodes and start or enqueue new jobs.
- These commands interact through a custom protocol with Slurmctld, the scheduler daemon, which often runs on another machine.
- Slurmctld returns the list of available nodes to the user.
- Then, for every available node, srun asks the local Slurm agent, Slurmd, to start a Linux process as researcher A.

For Slurm, a job is a Linux process running on multiple nodes. In the previous example, researcher A runs the following command:

    $ srun id
    uid=1032(researcher A) gid=1032(researcher A) groups=1032(researcher A)

Each allocated node runs its own /usr/bin/id, which is why the homogeneity of the computing nodes is so important.

Shared File System: Most of the time, to allow nodes to access th...
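
The job creation workflow listed above can be made more concrete with a small toy model. This is only a sketch under loud assumptions: it does not speak the Slurm protocol, it performs no authentication at all, and every name in it (ToyController, ToyNodeAgent, toy_srun, the node1..node3 host names) is hypothetical and chosen for illustration. It simply mirrors the steps: the submit command asks the controller for nodes, the controller returns idle ones, and the command then asks each node's agent to start a process as the requesting user.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyNodeAgent:
    """Hypothetical stand-in for the per-node agent (Slurmd in the article)."""
    hostname: str
    busy: bool = False

    def start_process(self, user: str, argv: list[str]) -> str:
        # A real agent would spawn a Linux process under the user's uid.
        # This toy version only records what would have been launched.
        self.busy = True
        return f"{self.hostname}: running {' '.join(argv)} as {user}"


@dataclass
class ToyController:
    """Hypothetical stand-in for the scheduler daemon (Slurmctld in the article)."""
    agents: list[ToyNodeAgent] = field(default_factory=list)

    def allocate(self, node_count: int) -> list[ToyNodeAgent]:
        # The user never picks nodes: the controller hands back idle ones.
        idle = [agent for agent in self.agents if not agent.busy]
        if len(idle) < node_count:
            # Assumption: this sketch fails instead of queueing the job.
            raise RuntimeError("not enough idle nodes")
        return idle[:node_count]


def toy_srun(controller: ToyController, user: str, argv: list[str],
             node_count: int = 1) -> list[str]:
    """Mimics the submit command: request nodes, then contact each agent."""
    nodes = controller.allocate(node_count)
    return [agent.start_process(user, argv) for agent in nodes]


if __name__ == "__main__":
    cluster = ToyController([ToyNodeAgent(f"node{i}") for i in range(1, 4)])
    for line in toy_srun(cluster, "researcher A", ["id"], node_count=2):
        print(line)
```

Note that the toy agent blindly trusts the user name it is handed. In a real multi-user cluster that claim has to be authenticated, which is the role the article attributes to Munge.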
