Parallelize programs

Parallel programming

For a program to use several CPU cores or even several computers of a cluster (referred to as "nodes") simultaneously, this capability must be implemented in the program's source code. The program is then said to have been "parallelized", i.e. enabled to use several resources in parallel.

In most programming languages, parallelization with shared memory is treated separately from parallelization with distributed memory: a program therefore uses either several CPU cores of one node simultaneously or several nodes simultaneously. It is, however, also possible to combine both approaches in the same program.

To process several tasks in a program simultaneously, i.e. in parallel, the problem must be broken down into subtasks. If, for example, all iterations of a for-loop can be executed in any order, it makes sense to treat the iterations as individual tasks that can be processed independently of each other in parallel. To actually speed up the program, you should of course look for parallelization opportunities in those parts of the program that account for most of the runtime.
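As an illustration, the following minimal C sketch (a made-up example, not code from the cluster) contrasts a loop whose iterations are independent of each other, and therefore suitable as parallel tasks, with a loop whose iterations depend on each other:

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N];

        for (int i = 0; i < N; i++)
            b[i] = i;

        /* independent iterations: each one touches only a[i] and b[i], so they
         * could run in any order and are natural candidates for parallel tasks */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        /* carried dependency: iteration i needs the result of iteration i-1,
         * so this loop cannot be parallelized in this simple way */
        for (int i = 1; i < N; i++)
            a[i] = a[i] + a[i - 1];

        printf("a[%d] = %g\n", N - 1, a[N - 1]);
        return 0;
    }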

Our workload manager "Slurm" refers to the individual processes into which a program is divided as "tasks". The option --ntasks therefore specifies how many such tasks (for example MPI processes) are started; since Slurm distributes them across the nodes, it also indirectly determines how many nodes are used. How many tasks are placed on each node and how many CPU cores each task may use for its sub-problems can be specified with --ntasks-per-node and --cpus-per-task.

If you have problems with parallelized source code or performance problems, please contact us.

Programs that use multiple nodes

In distributed computing, the individual networked nodes communicate with each other by exchanging messages. The classic programming system for this is MPI (Message Passing Interface), which is available in various implementations on the Linux cluster and enables the exchange of messages between several started processes.

Because an MPI process cannot directly access the data of another MPI process on another node, but must have the data sent to it, MPI is referred to as a programming system with "distributed memory".
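As an illustration, the following minimal C sketch (a generic MPI example, not code from the cluster) shows how one MPI process sends a value to another by message, because neither process can read the other's memory directly:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int value = 0;
        if (rank == 0 && size > 1) {
            value = 42;
            /* rank 0 cannot write into rank 1's memory; it must send a message */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Such a program is typically compiled with an MPI compiler wrapper (e.g. mpicc) and started as several Slurm tasks, which may be placed on different nodes.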

Programs that use multiple CPU cores

To parallelize calculations within a node, programming systems such as OpenMP (Open Multi-Processing), pthreads (POSIX threads) or Java Threads (now Java Concurrency), which enable parallelization using threads, should be used instead of MPI. Threads are execution strands within a process that the operating system can run simultaneously. The threads process tasks, which in turn represent sub-problems of the overall task the program has to perform.

Threads can directly access data that other threads have placed in memory, which is much faster than sending that data back and forth as messages in a programming system such as MPI. This is referred to as "shared memory".
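A minimal OpenMP sketch in C (an illustrative example, not cluster-specific code) shows the shared-memory approach: all threads of one process work on the same arrays, and the loop iterations are divided among them:

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void) {
        /* all threads of this process see the same arrays a and b (shared
         * memory); the iterations of the loop are divided among the threads */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        printf("ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }

Such a program is compiled with OpenMP support (e.g. gcc -fopenmp); the number of threads is usually controlled via the environment variable OMP_NUM_THREADS, which on the cluster would typically be set to the value requested with --cpus-per-task.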

Programs that use "hybrid" multiple nodes and multiple CPUs on them

Although MPI only supports the distribution of calculations across processes, several processes can be started on one node. It is therefore possible in principle to use several nodes with MPI and, at the same time, several CPU cores within each of these nodes. The advantage is that no additional programming construct needs to be implemented, which of course makes programming easier.

For a program to work particularly efficiently, however, one of the above-mentioned shared-memory programming systems should be combined with MPI, with MPI then only being used for communication between nodes.
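The following C sketch (illustrative code, assuming one MPI process per node with OpenMP threads for that node's cores) shows the basic structure of such a hybrid program:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        /* request thread support from MPI so that OpenMP threads can be used
         * inside each MPI process */
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* shared memory inside the process, MPI only between processes */
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }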


NUMA

All nodes in the Linux cluster are NUMA systems. NUMA stands for "Non-Uniform Memory Access". This means that each node contains several CPUs, each with its own local main memory. In most nodes of the Linux cluster, two physical CPUs (= sockets) are installed on the mainboard, each of which in turn contains a certain number of cores.

If a thread runs on one CPU and needs data that resides in the main memory of the other CPU, the access works without problems, but it takes longer than if the data were located in the main memory of the CPU the thread is running on.

If you want to optimize a program so that it also takes NUMA into account, the following must be observed: the so-called "first-touch policy" usually applies, i.e. data is only placed in a particular CPU's memory when it is first accessed. As a consequence, data such as arrays should preferably be created and initialized by the thread that will process them later, and not by a single master thread.
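The following C/OpenMP sketch (illustrative, with an arbitrarily chosen array size) initializes an array in parallel so that, under the first-touch policy, each part of the array ends up in the memory of the socket whose threads will later work on it:

    #include <stdlib.h>

    #define N 10000000   /* example size, chosen arbitrarily for this sketch */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        if (a == NULL)
            return 1;

        /* first touch: each thread initializes the part of the array it will
         * work on later, so those pages end up in the memory of "its" socket */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* later loops should use the same (static) schedule so that the same
         * threads work on the same elements again */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] += 1.0;

        free(a);
        return 0;
    }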

Normally, the operating system is allowed to move threads back and forth between CPU cores. If a thread is even moved to a different socket, it is separated from "its" data. By setting the so-called "thread affinity", the programmer can bind threads to specific CPU cores and thus ensure that threads working on the same data also run on the same CPU and therefore access the same local memory (with fast access).
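As a sketch of how thread affinity can be set in code, the following Linux-specific C example pins the calling thread to one core using pthread_setaffinity_np; in OpenMP programs the same effect is usually achieved without code changes via the environment variables OMP_PROC_BIND and OMP_PLACES:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* pin the calling thread to one CPU core (Linux-specific) */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void) {
        if (pin_to_core(0) != 0) {
            fprintf(stderr, "could not set thread affinity\n");
            return 1;
        }
        printf("thread pinned to core 0\n");
        return 0;
    }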