Feko provides industry-leading solver and High Performance Computing (HPC) technologies to efficiently analyze complex and electrically large problems, with parallel scaling efficiency being a key metric in ensuring engineers can solve such challenges efficiently.

"Parallel scaling efficiency" is a concept that frequently arises when solving problems on compute clusters. Using large servers or cluster computing systems elicits a commonly asked question: "How many cores is sufficient?" Understanding the concepts and knowing typical solver performance on compute clusters helps to address this question.

The term "parallel scaling efficiency" describes how efficiently additional computational resources are utilized, measured against a performance reference. In simpler terms, from a user's perspective, the expectation is that if the number of cores (or rather, the number of parallel processes) is doubled, the solution time is halved. In terms of memory consumption, the expectation is that it increases only minimally; in an ideal scenario, memory consumption stays constant as the number of parallel processes grows. In this article, you will see that Feko has excellent parallel scaling and that memory-saving algorithms help to keep the scaling performance near ideal.
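These two definitions can be expressed concretely. The sketch below is illustrative only (the function names are not part of Feko): runtime efficiency compares the measured speedup against the ideal speedup relative to a reference run, and memory efficiency compares the reference memory against the measured memory.

```python
# Minimal sketch: parallel scaling efficiency relative to a reference run.
# All names here are illustrative, not part of any Feko tooling.

def runtime_efficiency(ref_procs, ref_time, procs, time):
    """Measured speedup divided by the ideal speedup.

    Ideal scaling: doubling the processes halves the wall time,
    giving an efficiency of 1.0 (100%).
    """
    speedup = ref_time / time
    ideal_speedup = procs / ref_procs
    return speedup / ideal_speedup

def memory_efficiency(ref_mem, mem):
    """Ideal memory scaling keeps total memory constant as processes grow."""
    return ref_mem / mem

# Example: a reference run on 2 processes takes 100 s; 8 processes take 30 s.
eff = runtime_efficiency(2, 100.0, 8, 30.0)
print(f"{eff:.0%}")  # 3.33x measured speedup vs 4x ideal -> 83%
```

With this convention the reference run itself always scores 100%, which is how the graphs in this article are normalized.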

Parallel scaling performance is highly dependent on the following two things:

- The selected solver
- The electrical size of the model

Defining a model as large does not necessarily depend on its physical size. For example, a ship could be considered a physically large problem, but if the frequency of interest is only a few MHz, then this is actually an electrically "small" problem in electromagnetic terms. The degree to which a problem is large depends mostly on its electrical size (in wavelengths). Further considerations are multi-physics analyses such as radome thermal deformation effects on RCS.

Feko is well suited for solving large problems in antenna placement and RCS. Moreover, Feko can be deployed on large clusters and efficiently utilize the available computing resources.

The Method of Moments (MoM) is a full-wave solution of Maxwell's integral equations in the frequency domain that operates mainly on a surface-mesh representation of the model.

Consider first a model small enough to be solved on a modern laptop computer. It is a series-fed patch array on a finite substrate. The longest dimension is approximately ten free space wavelengths, and the model consists of 18,000 unknowns.

The model requires just over 7 GByte of memory. The model is solved with a varying number of parallel processes, from 2 to 16 with an increment of 2. In the graphs below, two important metrics are displayed.

Firstly, the total wall time: the elapsed time (as measured by a physical clock) between launching the Feko solver and its completion.

Secondly, the total memory: the peak memory consumed by Feko at any point during the solution. The runtime and memory are displayed in terms of parallel efficiency (blue) alongside the actual time and memory usage (red). The reference in the graphs is the lowest number of parallel processes used - two processes in this case. Therefore, for two processes, the efficiency is displayed as 100%.

The graphs show that runtime is decreased by adding more parallel processes. However, the reduced slope of the actual value and lower percentages for the efficiency indicate that around 8 to 10 processes is a good number to use for this model (where the efficiency is around 60%). If there are remaining idle processes on the machine, perhaps those could be used for solving another model. The memory shows a marginal increase with increasing parallel processes. Therefore, the parallel scaling efficiency in terms of memory is nearly 100%.

From an optimization search perspective, where different iterations of the model can be farmed out, setting the number of parallel processes to four and running four iterations of the model concurrently would complete sooner than devoting all 16 processes to one iteration at a time.
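The arithmetic behind this claim can be illustrated with a simple cost model. The efficiencies and timings below are made-up example values (not Feko measurements): runtime at P processes is modeled as the 2-process reference time scaled by the ideal speedup and the measured efficiency.

```python
# Illustrative comparison: farm 4 optimization iterations onto 4 processes
# each, versus running them one after the other on all 16 processes.
# Efficiencies and times are hypothetical example values.

def wall_time(t_ref, p_ref, p, efficiency):
    """Runtime at p processes, given efficiency relative to a p_ref-process run."""
    return t_ref * (p_ref / p) / efficiency

t2 = 100.0  # reference: one iteration on 2 processes, in seconds

t4 = wall_time(t2, 2, 4, 0.90)    # one iteration on 4 processes (90% efficient)
t16 = wall_time(t2, 2, 16, 0.50)  # one iteration on 16 processes (50% efficient)

sequential = 4 * t16  # 4 iterations back-to-back on 16 processes
farmed = t4           # 4 concurrent 4-process runs all finish together

print(f"sequential on 16: {sequential:.1f} s, farmed 4x4: {farmed:.1f} s")
```

With these example numbers, four farmed 4-process runs finish in about 56 s versus 100 s for the sequential approach, because the 4-process runs operate at a much higher efficiency.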

Next, consider a much larger model, a car with a rooftop antenna system at 2.6 GHz. The longest dimension is approximately 36 free space wavelengths and the model consists of 246,000 unknowns - requiring about 450 GByte of memory.

The total wall time scales very well with a much larger number of parallel processes. The model is suitable to be solved on a compute cluster.

Comparing Case 1 with Case 2, and considering that a scaling efficiency of 60% is very good, the patch array reaches this threshold at around 14 parallel processes while the car reaches the same at between 64 and 128 processes.

The Multilevel Fast Multipole Method (MLFMM) is an acceleration technique for the MoM that is applicable to electrically large problems.

For the MLFMM, we will again first consider a model that is electrically small enough to be solved on a modern laptop computer. It is a helicopter at 400 MHz. The distance between the cockpit and tail is approximately 24 free space wavelengths, and the model consists of 95,000 unknowns, requiring between 1 and 3 GByte of memory.

The parallel efficiency for this model is very good in terms of runtime with efficiency above 80% throughout. The MLFMM solver, however, shows an increase in memory relative to the increase in parallel processes. Nevertheless, for a computer with 16 cores with sufficient memory, it is clear that all the cores can efficiently be used in the solution.

Consider a much larger model of a large airliner at 1.1 GHz with a stacked patch antenna mounted on the fuselage above the cockpit. The wingspan is about 80 meters or 293 free space wavelengths.

The model mesh consists of roughly 23 million unknowns and memory requirements range from 400 to 700 GByte.

The runtime or total wall time parallel scaling efficiency is good up to 16 processes, but drops off with higher numbers of processes. Nevertheless, considering the actual runtime (red curve), adding more processes up to 128 is certainly not futile. For the MLFMM, scaling performance is generally better for models where the iterative solution stage is short compared to the other stages of the solution, and worse where the iterative stage dominates.

While the memory scaling efficiency decreases with an increasing number of processes, an interesting inflection point is observed at 32 processes. The reason is that during the solution Feko automatically converts parallel processes (MPI) to threads (OpenMP) if it estimates that the available memory will be insufficient for the solution. This automatic switch allows the problem to be solved within the available resources at the cost of a few percentage points of runtime performance.

Comparing Case 3 with Case 4, the memory efficiency of the airliner seems better. It must be remembered, however, that if more memory had been available for the airliner, the conversion to threads would not have occurred and the memory consumption would have been higher (and the memory scaling lower).

For RCS - and more specifically, monostatic RCS - it must be stated that the duration of the solution will mostly be determined by the number of incident angles for the plane wave. The solver sees each additional incident angle as a new source, which in essence triggers a new solution. For the MoM, additional angles do not impact the runtime as much as for the MLFMM. For the MoM, the solution is essentially complete, except for a slight modification^{1} for each new incident angle that takes only a few seconds. For the MLFMM, being an iterative solver, new sets of iterations are triggered for additional incident angles.
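This difference in per-angle cost can be captured in a rough cost model. The timings below are hypothetical placeholders; the point is the structure: the MoM pays its large cost once, while the MLFMM pays an iterative cost per angle.

```python
# Rough cost model for monostatic RCS, illustrating why additional incident
# angles cost little in the MoM but trigger new iterations in the MLFMM.
# All timings are hypothetical placeholders, in seconds.

def mom_total(t_matrix, t_lu, t_backsub, n_angles):
    # Matrix fill and LU factorization happen once; each angle then only
    # needs a backward substitution (footnote 1), typically seconds.
    return t_matrix + t_lu + n_angles * t_backsub

def mlfmm_total(t_setup, t_iterate, n_angles):
    # Each angle restarts the iterative solution (shortened somewhat by
    # using the previous angle's solution as the initial guess).
    return t_setup + n_angles * t_iterate

# 361 incident angles with illustrative stage times:
print(mom_total(3600, 7200, 5, 361))   # one-off stages dominate
print(mlfmm_total(600, 300, 361))      # per-angle iterations dominate
```

Under this model, doubling the number of angles barely changes the MoM total but nearly doubles the MLFMM total, which is the behavior described above.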

Consider a model of an F5 aircraft. Its length is 14 meters (or 47 free-space wavelengths). The aircraft is solved at 1 GHz with the MoM solver. It consists of 245,000 unknowns and requires just under 500 GByte of memory. A total of 361 incidence angles are solved.

The runtime efficiency shows an interesting inflection point going from 64 to 128 processes. With 64 cores per node, the number of nodes in the compute cluster was increased from 1 to 2 to be able to run with 128 processes. The distributed matrix solution algorithm is automatically adjusted to be more efficient when using 2 compute nodes. Note that the impact of this adjustment may vary depending on the properties of the model being solved.

Next, consider a large model of a car with windows modelled with a layered dielectric. The car is solved at 14 GHz for monostatic RCS with MLFMM.

The size of the car is only a few meters, but with the number of unknowns at 23 million, and with the longest dimension of roughly 200 wavelengths this model is also considered large. The memory requirement is roughly 260 GByte.

The MLFMM solver uses the solution of the previous incident angle as the initial guess for the current angle. In the above graphs, the times are representative of solving a single incident angle. Employing the CFIE^{2} in the MLFMM solver would greatly speed up the iterative solution, but it requires the model to be a closed PEC object. Due to gaps in the structure and the windows of the car, the CFIE could not be used for this example.

The memory scaling is excellent due to the SPAI^{3} preconditioner being employed in this case.

The wall time scaling efficiency appears poor, but for RCS it must be remembered that only the iterative solution stage is repeated for subsequent incident angles. Given the use of the initial guess, considering only the iterative solution stage would be more representative of the parallel scaling one could expect for monostatic RCS with the MLFMM.

In contrast to the MLFMM, once the MoM matrix has been factorized, each incident angle is solved very quickly. For many incident angles, it would be prudent to focus on this part of the MoM solution rather than including the matrix calculation and LU decomposition stages in the comparison.

Comparing Case 5 and Case 6, the MoM scales better than the MLFMM, both in terms of runtime and memory.

Scattering and RCS problems require many independent solutions. Instead of using all available resources on a single solution - for example, on a single incident angle - a much more efficient approach is to use fractions of the available compute resources to solve subsets of the total number of incident angles simultaneously. To this end, a farming script can be downloaded that splits the model into subsets of the total number of incidence angles and uploads the jobs to a compute cluster managed by Altair PBS Professional. The farming script submits all jobs at the same time, and PBS then manages the concurrent jobs.
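The core splitting step such a farming script performs can be sketched as follows. This is a generic illustration, not the downloadable script itself, and the actual PBS submission (e.g. via qsub) is omitted.

```python
# Sketch of the angle-splitting step a farming script might perform:
# divide the full sweep of incident angles into near-equal chunks,
# one chunk per concurrent job. PBS submission itself is omitted.

def split_angles(angles, n_jobs):
    """Split a list of incident angles into n_jobs near-equal subsets."""
    chunk, extra = divmod(len(angles), n_jobs)
    subsets, start = [], 0
    for i in range(n_jobs):
        end = start + chunk + (1 if i < extra else 0)
        subsets.append(angles[start:end])
        start = end
    return subsets

# 361 monostatic angles (0..360 deg in 1 deg steps) over 8 concurrent jobs
angles = list(range(361))
jobs = split_angles(angles, 8)
print([len(j) for j in jobs])  # [46, 45, 45, 45, 45, 45, 45, 45]
```

Each subset would then become its own Feko run, submitted as a separate PBS job so that all subsets are solved concurrently.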

While the MoM often scales better than the MLFMM, the drawback of the MoM is its increasing memory consumption with increasing model size - memory is proportional to the square of the number of unknowns (related to the number of mesh triangles). However, since the MoM uses a direct matrix solution, it presents none of the potential convergence problems that iterative solvers such as the MLFMM sometimes do. The default solver in Feko is the MoM, and it is always a good starting point.
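The N² memory growth is easy to estimate by hand. The sketch below assumes the dense MoM matrix is stored with 8 bytes per complex entry - an assumption for illustration, not a statement about Feko internals - which lands close to the memory figures quoted in this article.

```python
# Back-of-the-envelope MoM memory estimate: the dense matrix has N^2
# complex entries. The 8 bytes per entry (single-precision complex) is
# an assumption for illustration, not a statement of Feko internals.

def mom_matrix_gbyte(n_unknowns, bytes_per_entry=8):
    """Estimated dense-matrix storage in GByte."""
    return n_unknowns**2 * bytes_per_entry / 1e9

# The 246,000-unknown car model from Case 2 (~450 GByte in the article):
print(f"{mom_matrix_gbyte(246_000):.0f} GByte")

# Doubling the number of unknowns quadruples the memory:
print(f"{mom_matrix_gbyte(2 * 246_000):.0f} GByte")
```

This quadratic growth is why electrically large MoM models quickly move from workstation territory into cluster territory, and why the MLFMM exists as an alternative.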

The decision on which solver to use can be guided by running the MoM and MLFMM versions of the model for a certain time and observing the solver progress. In addition, if the MoM solver resorts to hard drive space due to insufficient memory (a notice is given in this regard), it would be advisable to switch to the MLFMM.

The recent proliferation of computing resources has enabled huge problems to be solved with these solvers. Naturally, a large problem such as the airliner in this article would be solved on a large server or cluster - using a low number of processes such as two or four is nonsensical there. For the purposes of computing parallel scaling efficiency, however, it is useful to include these metrics.

For electrically huge problems - for example, in excess of 100-200 million unknowns for the MLFMM - a more reasonable scaling test would be to start with one compute node (with say 32 cores or 64 cores) on the compute cluster and use that as a reference for parallel scaling, and then to double the number of compute nodes to gain insight into the parallel scaling efficiency.

Before embarking on a project that requires multiple solutions or variations of the same model, it is good practice to perform parallel scaling tests - such as those demonstrated in this article for the MoM and MLFMM solvers with antenna placement and RCS examples - to inform the selection of a suitable number of parallel processes on the specific hardware available.

1 Backward substitution into the MoM matrix.

2 The Combined Field Integral Equation (CFIE) uses both the electric field integral equation (EFIE) and the magnetic field integral equation (MFIE) in the basis function setup. This produces a better-conditioned matrix, leading to better convergence or faster solve times.

3 The Sparse Approximate Inverse (SPAI) preconditioner is a memory-efficient preconditioner, but in some cases convergence could be poorer. A preconditioner is required for convergence, and a suitable default is automatically applied.