Optistruct - AWS cost optimization study
The purpose of this study is to find the most cost-effective way to run Optistruct on AWS infrastructure. The document examines various instance types and the run options that can be used to improve cost effectiveness.
Models used
Two models were used for this study:
- Model 1: Powertrain model set up for non-linear static analysis with sliding contacts. The model typically runs in under an hour and is representative of the smaller models used for analysis.
- Model 2: Large model set up for non-linear static analysis with large displacement and frictional contacts. This model requires a large amount of memory and many processor cores to run.
Model Statistics
Statistic | Model 1 | Model 2
Total No. of Grids | 1.5 Million | 18 Million
Total No. of Elements Excluding Contact | 841 Thousand | 12 Million
Total No. of Contact Elements | 84 Thousand | 138 Thousand
Total No. of Degrees of Freedom | 4.6 Million | 54 Million
AWS instance types used
Instance | Cores | Memory (GiB) | Storage (GB) | GPU | Network | Cost (08/2021) $/hr
R5d.24xlarge | 48 | 768 | 4 x 900 NVMe SSD | - | 25 Gigabit | 6.912
R5d.metal | 48 | 768 | 4 x 900 NVMe SSD | - | 25 Gigabit | 6.912
R5dn.metal | 48 | 768 | 4 x 900 NVMe SSD | - | 100 Gigabit | 8.016
R5ad.24xlarge | 48 | 768 | 4 x 900 NVMe SSD | - | 20 Gigabit | 6.288
R5d.16xlarge | 32 | 512 | 4 x 600 NVMe SSD | - | 20 Gigabit | 4.608
X1.32xlarge | 64 | 1952 | 2 x 1920 SSD | - | 25 Gigabit | 13.338
Z1d.metal | 24 | 384 | 2 x 900 NVMe SSD | - | 25 Gigabit | 4.464
R4.16xlarge | 32 | 488 | EBS | - | 25 Gigabit | 4.256
P2.8xlarge | 16 | 488 | EBS | 8 x Tesla K80 | 10 Gigabit | 7.200
P2.16xlarge | 32 | 732 | EBS | 16 x Tesla K80 | 25 Gigabit | 14.400
P3.8xlarge | 16 | 244 | EBS | 4 x Tesla V100 | 10 Gigabit | 12.240
P3.16xlarge | 32 | 488 | EBS | 8 x Tesla V100 | 25 Gigabit | 24.480
P3dn.24xlarge | 48 | 768 | 2 x 900 NVMe SSD | 8 x Tesla V100 | 100 Gigabit | 31.212
P4d.24xlarge | 48 | 1152 | 8 x 1000 NVMe SSD | 8 x Tesla A100 | 400 Gigabit | 32.773
The instance types were chosen to study various factors affecting performance:
- CPU speed: Z1d instances have the fastest-clocked (4.0 GHz) CPU cores offered by AWS on the 2nd generation Intel Cascade Lake architecture. While the R5 and M5 instance families use the same CPU architecture, they are clocked lower at 3.1 GHz under full load. The X1 instances use the older Intel Haswell architecture but offer 64 cores per node.
- Memory: The X1 instance offers a high memory/core ratio, which allows larger models to run in-core without swapping to disk or needing to scale across multiple nodes. The R5 and Z1d instances offer 16 GiB/core, while the M5 instances offer 8 GiB/core.
- Storage: AWS offers EBS-backed instances, which are flexible in terms of capacity at the cost of latency and throughput. This study mostly uses instances with attached NVMe SSD storage to eliminate any variability in the results due to EBS.
- Network: AWS offers high-speed interconnects between nodes with the Elastic Fabric Adapter (EFA). Its performance benefit over the standard Elastic Network Adapter (ENA) is studied for multi-node runs.
- GPU: Optistruct 2021.1 supports the MUMPS solver on GPU along with DDM, and this is studied with the various GPU compute instances provided by AWS. Only GPUs with good FP64 performance were chosen, as double-precision performance is important; the AWS P2, P3, and P4 instances were selected for this reason.
Bare-metal instances were used wherever possible, as they cost the same as their largest virtualized counterpart while carrying lower virtualization overhead. This further improves performance at no additional cost.
Optistruct parameters used
- DDM – Domain decomposition was used to scale the model across multiple nodes as well as to improve scaling within a single node.
- In-core/out-of-core – Depending on system memory, in-core or out-of-core execution was studied to find the sweet spot in terms of performance per cost.
- KEEP401=0 – Disables L0 threading to reduce memory usage at the cost of performance. Only used for the large model.
- IRELAXP=1 – Turns on relaxed pivoting, which can be useful for large models.
- PARAM, IMPLOUT, ONLY – Enables result generation on the fly and disables the regular result generation at the end of the run.
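For concreteness, the sketch below shows how a run with these settings might be assembled. It is a minimal Python illustration, not the exact launcher used in this study; the -np, -nt, -core and -gpu flag names are assumptions based on typical Optistruct command-line usage (they mirror the np/nt and -core columns in the tables that follow) and should be verified against the installed Altair run scripts.

```python
import shlex

def build_optistruct_command(deck, np, nt, core=None, gpu=False,
                             solver="optistruct"):
    """Assemble a solver invocation for the settings studied here.

    Flag names are assumptions mirroring the np/nt/-core columns in the
    result tables; check them against the installed Altair run scripts.
    """
    cmd = [solver, deck, "-np", str(np), "-nt", str(nt)]
    if core:          # "in" or "out", matching the -core flag column below
        cmd += ["-core", core]
    if gpu:           # GPU runs in this study used the MUMPS solver with DDM
        cmd.append("-gpu")
    return cmd

# Example: a single-node z1d.metal layout (np 12, nt 2) running in-core.
# Deck-level settings such as PARAM,IMPLOUT,ONLY (and KEEP401/IRELAXP for
# the large model) are assumed to live in the .fem input or solver
# configuration rather than on this command line.
print(shlex.join(build_optistruct_command("model1.fem", np=12, nt=2, core="in")))
```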
Results and discussion
Model 1
Number | Instance | np | nt | Cores | GPUs | Elapsed time (hh:mm:ss) | Memory Usage (MB) | Disk Usage (MB) | Cost
1 | z1d.metal | 12 | 2 | 24 | - | 00:30:55 | 177906 | 52660 | $2.30
2 | z1d.metal | 8 | 3 | 24 | - | 00:31:34 | 145686 | 41141 | $2.35
3 | r5d.metal | 24 | 2 | 48 | - | 00:23:24 | 234016 | 85806 | $2.70
4 | z1d.metal** | 8 | 3 | 24 | - | 00:37:40 | 85556 | 105237 | $2.80
5 | r5d.metal | 16 | 3 | 48 | - | 00:24:47 | 194495 | 63629 | $2.86
6 | r5d.metal | 12 | 4 | 48 | - | 00:26:38 | 176333 | 52660 | $3.07
7 | z1d.metal | 16 | 3 | 48 | - | 00:22:16 | 194604 | 63629 | $3.31
8 | r4.16xlarge | 8 | 4 | 32 | - | 00:47:36 | 172474 | 44115 | $3.38
9 | r5d.metal | 8 | 6 | 48 | - | 00:29:34 | 174599 | 44115 | $3.41
10 | r5d.metal EFS* | 8 | 6 | 48 | - | 00:29:37 | 145049 | 41141 | $3.41
11 | z1d.metal | 12 | 4 | 48 | - | 00:23:22 | 176550 | 52660 | $3.48
12 | r5d.24xlarge | 8 | 6 | 48 | - | 00:30:28 | 169698 | 44115 | $3.51
13 | r5dn.metal EFA | 8 | 6 | 48 | - | 00:29:22 | 145049 | 41141 | $3.92
14 | r5d.metal** | 8 | 6 | 48 | - | 00:38:18 | 85317 | 105214 | $4.41
15 | r5d.metal | 16 | 6 | 96 | - | 00:22:07 | 193978 | 63629 | $5.10
16 | r5ad.24xlarge | 8 | 6 | 48 | - | 00:49:11 | 168001 | 44115 | $5.15
17 | r5dn.metal EFA | 16 | 6 | 96 | - | 00:20:59 | 193869 | 63629 | $5.61
18 | p2.8xlarge | 8 | 2 | 16 | 8 | 00:51:43 | 185136 | 44115 | $6.21
19 | r5d.metal | 24 | 6 | 144 | - | 00:19:42 | 232323 | 85806 | $6.81
20 | p2.16xlarge | 8 | 4 | 32 | 4 | 00:40:01 | 188990 | 44115 | $9.60
21 | p2.16xlarge | 8 | 4 | 32 | 16 | 00:40:57 | 186761 | 44115 | $9.83
22 | p2.16xlarge | 8 | 4 | 32 | 8 | 00:41:14 | 182650 | 44115 | $9.90
23 | p3.8xlarge | 8 | 2 | 16 | 4 | 00:51:21 | 186946 | 44115 | $10.48
24 | p2.16xlarge | 8 | 4 | 32 | 1 | 00:44:37 | 192376 | 44115 | $10.71
25 | p4d.24xlarge | 8 | 6 | 48 | 8 | 00:26:49 | 187546 | 44115 | $14.65
26 | p3dn.24xlarge | 8 | 6 | 48 | 8 | 00:28:50 | 193960 | 44115 | $15.00
27 | p3.16xlarge | 8 | 4 | 32 | 8 | 00:37:28 | 195478 | 44115 | $15.29
28 | p3.16xlarge | 8 | 4 | 32 | 8 | 00:37:58 | 187565 | 44115 | $15.49
29 | p3.8xlarge | 1 | 16 | 16 | 4 | 01:25:38 | 102451 | 19701 | $17.47
30 | p3.16xlarge | 4 | 8 | 32 | 8 | 00:43:59 | 143087 | 30065 | $17.95
* r5d.metal EFS – uses an NFS shared location as the working directory where results are saved, but uses local scratch storage for the matrices.
** These jobs ran out-of-core; the rest ran in-core.
Running the job on the smaller model served two purposes. First, it gives a good sense of scalability and performance for models without memory constraints. Second, many combinations of instance types and configurations could be studied in a shorter time frame and at lower cost. Trends observed here can be used as a guide to avoid wasted resources on the large model.
The table above lists all the runs sorted from cheapest to costliest. Optistruct favors high clock speed per core, as shown by the performance of z1d compared to r5d. Even with half as many cores, the z1d instance is within 7% of the performance of r5d and is therefore much more cost-effective. Its performance remains higher than r5d when the number of MPI processes and threads is matched (runs 6 and 11).
The on-demand cost is also listed for each job. While run 19 with 144 cores yields the fastest time, it is also the most expensive CPU-only run. Cost effectiveness drops as nodes are added to these smaller jobs because the performance improvement does not offset the additional cost. Using EFA does improve scalability for multi-node jobs, but once again the added cost of that instance makes it unfavorable compared to the instances with ENA.
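The cost column follows directly from the elapsed time, the number of nodes implied by the core count, and the on-demand hourly rate from the instance table above. A minimal Python sketch of that arithmetic (function and variable names are illustrative):

```python
import math

def on_demand_cost(elapsed_hhmmss, cores, cores_per_node, rate_per_hour):
    """AWS cost of a run: nodes used x on-demand hourly rate x elapsed hours."""
    h, m, s = (int(part) for part in elapsed_hhmmss.split(":"))
    hours = h + m / 60 + s / 3600
    nodes = math.ceil(cores / cores_per_node)
    return nodes * rate_per_hour * hours

# Run 19: 144 cores on r5d.metal (48 cores/node at $6.912/hr) in 00:19:42
print(round(on_demand_cost("00:19:42", 144, 48, 6.912), 2))  # -> 6.81
```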
Bare-metal instances (r5d.metal) tend to offer a free performance improvement over their virtualized counterpart (r5d.24xlarge). In this case (runs 9 and 12), we see an uplift of 3%, which translates into a similar cost saving.
NFS-backed storage such as Elastic File System (EFS) can be used as the working directory without any appreciable loss in performance. Run 9 stored scratch and result files on local NVMe storage, whereas run 10 stored result files on EFS and scratch files on local NVMe storage. However, this may not hold if the result files are written frequently or are large, as in transient and dynamic analyses.
Optistruct currently supports up to 8 GPUs per job. However, it is also important to have enough memory and CPU cores for the tasks that cannot run on the GPUs, so only the larger GPU instances were chosen even though the data shows limited benefit from using up to 8 GPUs. The performance of p2.16xlarge can be compared with r4.16xlarge, as they use the same CPU and the major difference between them is the K80 GPUs. The GPUs yield a performance benefit of 16% at a cost increase of 184%. Similarly, r5d.metal can be compared with p4d.24xlarge: here the A100 GPUs improve performance by 9% while increasing the cost by 330%.
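The percentages quoted above can be reproduced directly from the table entries; a short sketch of the comparison (run numbers refer to the Model 1 table):

```python
def relative_change(baseline_time_s, baseline_cost, gpu_time_s, gpu_cost):
    """Speed-up and cost increase of a GPU run versus a CPU-only baseline."""
    speedup = 1 - gpu_time_s / baseline_time_s      # fraction faster
    cost_increase = gpu_cost / baseline_cost - 1    # fraction costlier
    return round(speedup, 2), round(cost_increase, 2)

# r4.16xlarge (run 8) vs p2.16xlarge (run 20): ~16% faster, ~184% costlier
print(relative_change(47 * 60 + 36, 3.38, 40 * 60 + 1, 9.60))    # (0.16, 1.84)
# r5d.metal (run 9) vs p4d.24xlarge (run 25): ~9% faster, ~330% costlier
print(relative_change(29 * 60 + 34, 3.41, 26 * 60 + 49, 14.65))  # (0.09, 3.3)
```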
Model 2
Number | Instance | np | nt | Cores | Elapsed time (hh:mm:ss) | Memory Usage (MB) | Disk Usage (MB) | AWS Cost ($) |
1 | r5d.metal* | 8 | 6 | 48 | 15:00:26 | 717373 | 2205824 | $103.73 |
2 | r5d.metal* | 8 | 6 | 48 | 15:02:52 | 717692 | 2205922 | $104.01 |
3 | z1d.metal* | 8 | 6 | 48 | 15:14:17 | 717716 | 2205861 | $136.05 |
4 | z1d.metal* | 12 | 6 | 72 | 11:47:05 | 961168 | 2444745 | $157.82 |
5 | r5d.16xlarge | 16 | 8 | 128 | 09:31:30 | 1897887 | 2677267 | $175.56 |
6 | r5d.metal | 18 | 8 | 144 | 08:50:52 | 2070137 | 2864499 | $183.47 |
7 | r5d.metal | 18 | 8 | 144 | 09:10:26 | 2069472 | 2864499 | $190.23 |
8 | x1.32xlarge | 8 | 8 | 64 | 15:08:05 | 1491149 | 1816795 | $202.05 |
* These jobs ran out-of-core; the rest ran in-core.
Based on the results of Model 1, the most cost-efficient nodes were chosen for study. Additionally, r5d.16xlarge and x1.32xlarge were chosen to allow the job to run in-core.
Jobs 1 and 6 did not use any debug parameters such as KEEP401, KEEPINCO, or IRELAXP. The difference between using these parameters and not using them is minimal.
The x1.32xlarge has enough memory to run this model in-core on a single node. However, its 64 older Haswell cores cannot match the performance of r5d, which can use more cores by adding nodes.
Optistruct can scale up to 144 cores with r5d.metal, which yields the fastest time. As with Model 1, this is not cost-effective. It is much cheaper to run out-of-core on r5d.metal if the increase in solve time is not a concern. The z1d.metal is neither cost-effective nor performant for this class of problem, as it is limited by network bottlenecks across nodes and by having to run out-of-core.
The in-core runs on r5d.metal and r5d.16xlarge show the best performance for this model. Using R5 instances without NVMe storage might yield a better cost-to-performance ratio, as storage speed is not an important factor for jobs running in-core.
GPU instances were not studied for this model as there was limited benefit shown in Model 1.
Recommendations
These recommendations are valid for Optistruct 2021.2. As the solver continuously improves, they may change, especially with regard to scaling across nodes, memory usage, and GPU performance.
Z1d.metal offers a good balance of performance, cost, and memory/core ratio, which makes it suitable for small to medium-sized models. The best performance is realized when running in-core on a single node or up to 3 nodes. An MPI process/thread split of np 6 and nt 4 is recommended. Higher np settings may yield better performance for smaller models but are more likely to run out of memory. There is a modest drop in performance if a job falls back to out-of-core, but it still completes successfully. The m5 series is another good option with a higher core count, but its lower memory/core ratio makes it less suitable for running medium-sized jobs in-core.
A cost-effective way to run large models is to run them out-of-core on r5d.metal. The node's resources are fully utilized and there is no need to contend with network performance. If performance is the primary concern, the job can be executed in-core on up to 3 nodes to get the fastest solve time.
If optimizing for Optistruct alone, it is currently recommended to invest in faster CPUs rather than GPUs. GPUs may be worthwhile as an upgrade to an older CPU platform. They can also make sense if Optistruct runs in the same environment used for other GPU applications such as NanofluidX or UltrafluidX.
Job Size | Instance | Nodes | -core flag | MPI setting
< 20 Million DOF | z1d.metal | Up to 3 | N/A | np 6, nt 4
> 20 Million DOF | r5d.metal | Up to 3 | in | np 6, nt 8
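For quick reference, the table above can be codified directly. The sketch below is purely illustrative and simply mirrors the recommendation table; the dictionary keys are not part of any Optistruct API.

```python
def recommend(dof_millions):
    """Mirror the recommendation table above (Optistruct 2021.2)."""
    if dof_millions < 20:
        return {"instance": "z1d.metal", "nodes": "up to 3",
                "core": None, "np": 6, "nt": 4}
    return {"instance": "r5d.metal", "nodes": "up to 3",
            "core": "in", "np": 6, "nt": 8}

print(recommend(4.6))  # Model 1 (~4.6M DOF) -> z1d.metal, np 6 / nt 4
print(recommend(54))   # Model 2 (~54M DOF)  -> r5d.metal in-core, np 6 / nt 8
```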