Altair Feko: Using the MoM and MLFMM solvers in parallel processing - parallel scaling efficiency and implications


What is parallel scaling efficiency?

"Parallel scaling efficiency" is a concept that frequently arises when solving problems on compute clusters. The solving of large models on cluster computing systems elicits a commonly asked question: "how many cores is enough?" Understanding the concepts and knowing typical solver performance on compute clusters helps to address this question.

The term, "parallel scaling efficiency",  can be summarized as how efficiently computational resources can be added to the existing, measured against a  performance reference. In simpler terms, and from a user's perspective, the expectation is that if the number of cores, or rather, the number of parallel processes are doubled, the solution time would be halved. Furthermore, in terms of memory consumption, the expectation is that there would be a minimal increase in memory consumption. In an ideal scenario, the memory consumption should stay constant with increasing number of parallel processes. In this article you will see that Feko has excellent parallel scaling and that memory saving algorithms help to keep the scaling performance near ideal.

Parallel scaling performance is highly dependent on two factors:

  1. The selected solver
  2. The size of the model

MoM solver:

Consider first a model small enough to be solved on a modern laptop computer: a series-fed patch array on a finite substrate. The longest dimension is approximately ten free space wavelengths and the model consists of 18 000 unknowns.

The model requires just over 7 GByte of memory and is solved with a varying number of parallel processes, from 2 to 16 in increments of 2. In the graphs below, the parallel efficiency of runtime and memory is plotted in blue, while the actual run time and memory consumption are plotted in red.

The graphs show that runtime decreases as more parallel processes are added. However, the flattening slope of the actual runtime and the falling efficiency percentages indicate that around 8 to 10 processes is a good number for this model. Any processes left idle on the machine could be used for other work. Memory shows only a marginal increase with increasing parallel processes, so the parallel scaling efficiency in terms of memory is nearly 100%.
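Repeating such a sweep on your own model is straightforward to script. The sketch below assumes a hypothetical model file named patch_array.cfx and uses the runfeko launcher's -np option to set the number of parallel processes; for precise timings and peak memory, Feko's .out file is a better source than the wall time measured here.

```python
import subprocess
import time

MODEL = "patch_array.cfx"  # hypothetical model; substitute your own

for np in range(2, 17, 2):
    start = time.perf_counter()
    # runfeko launches the Feko solver; -np sets the parallel process count.
    subprocess.run(["runfeko", MODEL, "-np", str(np)], check=True)
    elapsed = time.perf_counter() - start
    print(f"{np:2d} processes: {elapsed:8.1f} s wall time")
```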

From an optimisation search perspective, where different iterations of the model can be farmed out, setting the number of parallel processes to four and farming out four iterations of the model concurrently would complete much faster than solving the iterations one at a time with all 16 processes.
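A minimal sketch of this farming approach is shown below, again assuming hypothetical model file names. Each Feko run gets four parallel processes, and four runs execute concurrently, so all 16 cores stay busy while each individual solve remains in its high-efficiency range:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-iteration models produced by an optimisation search.
models = ["iter_1.cfx", "iter_2.cfx", "iter_3.cfx", "iter_4.cfx"]

def solve(model):
    # Four parallel processes per run; the worker threads below only
    # wait on the subprocesses, so four solver runs execute side by side.
    subprocess.run(["runfeko", model, "-np", "4"], check=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(solve, models))
```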

Next consider a much larger model: a car with a rooftop antenna system at 2.6 GHz. The longest dimension is approximately 36 free space wavelengths and the model consists of 246 000 unknowns. The model requires about 450 GByte of memory.

The total wall time scales very well with a much larger number of parallel processes, making the model well suited to being solved on a compute cluster.

Comparing the patch array with the car, and taking a scaling efficiency of 60% as reasonable, the patch array reaches this threshold at around 14 parallel processes, while the car only reaches it at between 64 and 128 processes.
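Once efficiencies have been measured, picking the largest process count that still meets such a threshold is easy to automate. A small helper (the name is mine, operating on whatever measurements you have collected):

```python
def max_processes_at_threshold(efficiencies, threshold=0.60):
    """Largest process count whose runtime efficiency still meets the
    threshold. `efficiencies` maps process count -> efficiency (0..1)."""
    meeting = [n for n, eff in efficiencies.items() if eff >= threshold]
    return max(meeting) if meeting else None
```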

MLFMM solver:

Again, consider first a model small enough to be solved on a modern laptop computer: a helicopter at 400 MHz. The distance between the cockpit and tail is approximately 24 free space wavelengths and the model consists of 95 000 unknowns. It requires between 1 and 3 GByte of memory.

The parallel efficiency for this model is very good in terms of runtime. The MLFMM solver, however, shows memory consumption that grows with the number of parallel processes. Nevertheless, for a computer with 16 cores it is clear that all the cores can be used efficiently in the solution.

Next consider a much larger model: an Airbus aircraft at 1.1 GHz with a stacked patch antenna on the fuselage above the cockpit. The wingspan is about 80 metres, or 293 free space wavelengths.

The model mesh consists of roughly 23 million unknowns and memory requirements range from 400 to 700 GByte.

The runtime parallel scaling efficiency is good up to 16 processes but drops off at higher process counts. While a detailed explanation is beyond the scope of this article, for the MLFMM it can be stated that scaling is generally better for models where the iterative solution stage is short compared to the other stages of the solution, and worse where it dominates.
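This is essentially the familiar Amdahl's law argument: the fraction of the solution that scales poorly bounds the achievable speedup. As a rough illustration (a textbook simplification, not a model of Feko's internals):

```python
def amdahl_speedup(n, parallel_fraction):
    """Amdahl's law: speedup on n processes when only a fraction p of
    the work scales, S = 1 / ((1 - p) + p / n)."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n)

# If 90% of the solution scales, 64 processes yield only about 8.8x
# speedup, i.e. a runtime efficiency of roughly 14%.
print(amdahl_speedup(64, 0.90))
```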

While the memory scaling efficiency decreases with an increasing number of processes, an interesting inflection point is observed at 32 processes. The reason is that during the solution Feko automatically converts parallel processes (MPI) to threads (OpenMP) if it estimates that memory would otherwise be insufficient. In terms of runtime, only a few percent increase is observed when Feko applies this automatic switch.
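The motivation for such a switch is that threads share one copy of the memory-heavy data structures, whereas each MPI process holds its own. The sketch below only illustrates that trade-off; it is not Feko's actual algorithm, and the names, threshold check, and halving strategy are invented for illustration:

```python
def choose_hybrid_layout(requested_procs, est_gb_per_proc, avail_gb):
    """Illustrative heuristic only (not Feko's implementation): trade
    MPI processes for OpenMP threads while the estimated total memory
    of a pure-MPI run exceeds what is available."""
    procs, threads = requested_procs, 1
    while procs > 1 and procs * est_gb_per_proc > avail_gb:
        procs //= 2   # halve the MPI processes ...
        threads *= 2  # ... and compensate with threads per process
    return procs, threads
```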

Conclusion:

The parallel scaling of the MoM and MLFMM solvers was demonstrated with a few examples. Performing such tests before embarking on a project that requires multiple solutions or variations of the same model can be useful to inform the selection of a suitable number of parallel processes.