Optistruct - AWS cost optimization study
The purpose of this study is to find the most cost-effective way to run Optistruct on AWS infrastructure. The document examines various instance types and the run options that can be used to improve cost effectiveness.
Models used
Two models were used for this study:
- Model 1: Powertrain model set up for non-linear static analysis with sliding contacts. The model typically runs in under an hour and is representative of the smaller models used for analysis.
- Model 2: Large model set up for non-linear static analysis with large displacement and frictional contacts. This model requires a large amount of memory and many processor cores to run.
Model Statistics
Statistic | Model 1 | Model 2
Total No. of Grids | 1.5 Million | 18 Million
Total No. of Elements Excluding Contact | 841 Thousand | 12 Million
Total No. of Contact Elements | 84 Thousand | 138 Thousand
Total No. of Degrees of Freedom | 4.6 Million | 54 Million
AWS instance types used
Instance | Cores | Memory (GiB) | Storage (GB) | GPU | Network | Cost (08/2021) $/hr
R5d.24xlarge | 48 | 768 | 4 x 900 NVMe SSD | - | 25 Gigabit | 6.912
R5d.metal | 48 | 768 | 4 x 900 NVMe SSD | - | 25 Gigabit | 6.912
R5dn.metal | 48 | 768 | 4 x 900 NVMe SSD | - | 100 Gigabit | 8.016
R5ad.24xlarge | 48 | 768 | 4 x 900 NVMe SSD | - | 20 Gigabit | 6.288
R5d.16xlarge | 32 | 512 | 4 x 600 NVMe SSD | - | 20 Gigabit | 4.608
X1.32xlarge | 64 | 1952 | 2 x 1920 SSD | - | 25 Gigabit | 13.338
Z1d.metal | 24 | 384 | 2 x 900 NVMe SSD | - | 25 Gigabit | 4.464
R4.16xlarge | 32 | 488 | EBS | - | 25 Gigabit | 4.256
P2.8xlarge | 16 | 488 | EBS | 8 x Tesla K80 | 10 Gigabit | 7.200
P2.16xlarge | 32 | 732 | EBS | 16 x Tesla K80 | 25 Gigabit | 14.400
P3.8xlarge | 16 | 244 | EBS | 4 x Tesla V100 | 10 Gigabit | 12.240
P3.16xlarge | 32 | 488 | EBS | 8 x Tesla V100 | 25 Gigabit | 24.480
P3dn.24xlarge | 48 | 768 | 2 x 900 NVMe SSD | 8 x Tesla V100 | 100 Gigabit | 31.212
P4d.24xlarge | 48 | 1152 | 8 x 1000 NVMe SSD | 8 x Tesla A100 | 400 Gigabit | 32.773
The instance types were chosen to study various factors affecting performance:
- CPU speed: Z1d instances have the fastest-clocked (4.0 GHz) CPU cores offered by AWS on the 2nd generation Intel Cascade Lake architecture. While the R5 and M5 instance families use the same CPU architecture, they are clocked lower at 3.1 GHz under full load. The X1 instances use the older Intel Haswell architecture but offer 64 cores per node.
- Memory: The X1 instance offers a high memory/core ratio, which allows larger models to run in-core without swapping to disk or needing to scale across multiple nodes. The R5 and Z1d instances offer 16 GiB/core, while the M5 instances offer 8 GiB/core.
- Storage: AWS offers EBS-backed instances, which are flexible in terms of capacity at the cost of latency and throughput. This study mostly uses instances with attached NVMe SSD storage to eliminate any variability in the results due to EBS.
- Network: AWS offers high-speed interconnects between nodes with the Elastic Fabric Adapter (EFA). Its performance benefit over the standard Elastic Network Adapter (ENA) is studied for multi-node runs.
- GPU: Optistruct 2021.1 supports the MUMPS solver on GPU along with DDM, and this is studied with the various GPU compute instances provided by AWS. Only GPUs with good FP64 performance were chosen, as double-precision performance is important; the AWS P2, P3, and P4 instances were selected for this reason.
Bare-metal instances were used wherever possible, as they cost the same as their largest virtualized counterpart while carrying lower virtualization overhead. This further improves performance at no additional cost.
Optistruct parameters used
- DDM – Domain decomposition was used to scale the model across multiple nodes as well as to improve scaling within a single node.
- In-core/out-of-core – Depending on system memory, in-core or out-of-core execution was studied to find the sweet spot in terms of performance per cost.
- KEEP401=0 – Disables L0 threading to reduce memory usage at the cost of performance. Only used for the large model.
- IRELAXP=1 – Turns on relaxed pivoting, which can be useful for large models.
- PARAM, IMPLOUT, ONLY – Enables result generation on the fly and disables the regular result generation at the end of the run.
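For concreteness, the sketch below shows how a run with these settings might be assembled. It is a minimal Python illustration, not the exact launcher used in this study; the -np, -nt, -core and -gpu flag names are assumptions based on typical Optistruct command-line usage (they mirror the np/nt and -core columns in the tables that follow) and should be verified against the installed Altair run scripts.

```python
import shlex

def build_optistruct_command(deck, np, nt, core=None, gpu=False,
                             solver="optistruct"):
    """Assemble a solver invocation for the settings studied here.

    Flag names are assumptions mirroring the np/nt/-core columns in the
    result tables; check them against the installed Altair run scripts.
    """
    cmd = [solver, deck, "-np", str(np), "-nt", str(nt)]
    if core:          # "in" or "out", matching the -core flag column below
        cmd += ["-core", core]
    if gpu:           # GPU runs in this study used the MUMPS solver with DDM
        cmd.append("-gpu")
    return cmd

# Example: a single-node z1d.metal layout (np 12, nt 2) running in-core.
# Deck-level settings such as PARAM,IMPLOUT,ONLY (and KEEP401/IRELAXP for
# the large model) are assumed to live in the .fem input or solver
# configuration rather than on this command line.
print(shlex.join(build_optistruct_command("model1.fem", np=12, nt=2, core="in")))
```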
Results and discussion
Model 1
Number | Instance | np | nt | Cores | GPUs | Elapsed time (hh:mm:ss) | Memory Usage (MB) | Disk Usage (MB) | Cost
1 | z1d.metal | 12 | 2 | 24 | - | 00:30:55 | 177906 | 52660 | $2.30
2 | z1d.metal | 8 | 3 | 24 | - | 00:31:34 | 145686 | 41141 | $2.35
3 | r5d.metal | 24 | 2 | 48 | - | 00:23:24 | 234016 | 85806 | $2.70
4 | z1d.metal** | 8 | 3 | 24 | - | 00:37:40 | 85556 | 105237 | $2.80
5 | r5d.metal | 16 | 3 | 48 | - | 00:24:47 | 194495 | 63629 | $2.86
6 | r5d.metal | 12 | 4 | 48 | - | 00:26:38 | 176333 | 52660 | $3.07
7 | z1d.metal | 16 | 3 | 48 | - | 00:22:16 | 194604 | 63629 | $3.31
8 | r4.16xlarge | 8 | 4 | 32 | - | 00:47:36 | 172474 | 44115 | $3.38
9 | r5d.metal | 8 | 6 | 48 | - | 00:29:34 | 174599 | 44115 | $3.41
10 | r5d.metal EFS* | 8 | 6 | 48 | - | 00:29:37 | 145049 | 41141 | $3.41
11 | z1d.metal | 12 | 4 | 48 | - | 00:23:22 | 176550 | 52660 | $3.48
12 | r5d.24xlarge | 8 | 6 | 48 | - | 00:30:28 | 169698 | 44115 | $3.51
13 | r5dn.metal EFA | 8 | 6 | 48 | - | 00:29:22 | 145049 | 41141 | $3.92
14 | r5d.metal** | 8 | 6 | 48 | - | 00:38:18 | 85317 | 105214 | $4.41
15 | r5d.metal | 16 | 6 | 96 | - | 00:22:07 | 193978 | 63629 | $5.10
16 | r5ad.24xlarge | 8 | 6 | 48 | - | 00:49:11 | 168001 | 44115 | $5.15
17 | r5dn.metal EFA | 16 | 6 | 96 | - | 00:20:59 | 193869 | 63629 | $5.61
18 | p2.8xlarge | 8 | 2 | 16 | 8 | 00:51:43 | 185136 | 44115 | $6.21
19 | r5d.metal | 24 | 6 | 144 | - | 00:19:42 | 232323 | 85806 | $6.81
20 | p2.16xlarge | 8 | 4 | 32 | 4 | 00:40:01 | 188990 | 44115 | $9.60
21 | p2.16xlarge | 8 | 4 | 32 | 16 | 00:40:57 | 186761 | 44115 | $9.83
22 | p2.16xlarge | 8 | 4 | 32 | 8 | 00:41:14 | 182650 | 44115 | $9.90
23 | p3.8xlarge | 8 | 2 | 16 | 4 | 00:51:21 | 186946 | 44115 | $10.48
24 | p2.16xlarge | 8 | 4 | 32 | 1 | 00:44:37 | 192376 | 44115 | $10.71
25 | p4d.24xlarge | 8 | 6 | 48 | 8 | 00:26:49 | 187546 | 44115 | $14.65
26 | p3dn.24xlarge | 8 | 6 | 48 | 8 | 00:28:50 | 193960 | 44115 | $15.00
27 | p3.16xlarge | 8 | 4 | 32 | 8 | 00:37:28 | 195478 | 44115 | $15.29
28 | p3.16xlarge | 8 | 4 | 32 | 8 | 00:37:58 | 187565 | 44115 | $15.49
29 | p3.8xlarge | 1 | 16 | 16 | 4 | 01:25:38 | 102451 | 19701 | $17.47
30 | p3.16xlarge | 4 | 8 | 32 | 8 | 00:43:59 | 143087 | 30065 | $17.95
* r5d.metal EFS – uses an NFS shared location as the working directory where results are saved, but uses local scratch storage for the matrices.
** These jobs ran out-of-core; the rest ran in-core.
Running the job on the smaller model served two purposes. First, it gives a good sense of scalability and performance for models without memory constraints. Second, many combinations of instance types and configurations could be studied in a shorter time frame and at lower cost. Trends observed here can be used as a guide to avoid wasted resources on the large model.
The table above lists all the runs sorted from cheapest to costliest. Optistruct favors high clock speed per core, as shown by the performance of z1d compared to r5d. Even with half as many cores, the z1d instance is within 7% of the performance of r5d and is therefore much more cost-effective. Its performance remains higher than r5d when the number of MPI processes and threads is matched (runs 6 and 11).
The on-demand cost is also listed for each job. While run 19 with 144 cores yields the fastest time, it is also the most expensive CPU-only run. Cost effectiveness drops as nodes are added to these smaller jobs because the performance improvement does not offset the additional cost. Using EFA does improve scalability for multi-node jobs, but once again the added cost of that instance makes it unfavorable compared to the instances with ENA.
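The cost column follows directly from the elapsed time, the number of nodes implied by the core count, and the on-demand hourly rate from the instance table above. A minimal Python sketch of that arithmetic (function and variable names are illustrative):

```python
import math

def on_demand_cost(elapsed_hhmmss, cores, cores_per_node, rate_per_hour):
    """AWS cost of a run: nodes used x on-demand hourly rate x elapsed hours."""
    h, m, s = (int(part) for part in elapsed_hhmmss.split(":"))
    hours = h + m / 60 + s / 3600
    nodes = math.ceil(cores / cores_per_node)
    return nodes * rate_per_hour * hours

# Run 19: 144 cores on r5d.metal (48 cores/node at $6.912/hr) in 00:19:42
print(round(on_demand_cost("00:19:42", 144, 48, 6.912), 2))  # -> 6.81
```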
Bare-metal instances (r5d.metal) tend to offer a free performance improvement over their virtualized counterpart (r5d.24xlarge). In this case (runs 9 and 12), we see an uplift of 3%, which translates into a similar cost saving.
NFS-backed storage such as Elastic File System (EFS) can be used as the working directory without any appreciable loss in performance. Run 9 stored scratch and result files on local NVMe storage, whereas run 10 stored result files on EFS and scratch files on local NVMe storage. However, this may not hold if the result files are written frequently or are large, as in transient and dynamic analyses.
Optistruct currently supports up to 8 GPUs per job. However, it is also important to have enough memory and CPU cores for the tasks that cannot run on the GPUs, so only the larger GPU instances were chosen even though the data shows limited benefit from using up to 8 GPUs. The performance of p2.16xlarge can be compared with r4.16xlarge, as they use the same CPU and the major difference between them is the K80 GPUs. The GPUs yield a performance benefit of 16% at a cost increase of 184%. Similarly, r5d.metal can be compared with p4d.24xlarge: here the A100 GPUs improve performance by 9% while increasing the cost by 330%.
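The percentages quoted above can be reproduced directly from the table entries; a short sketch of the comparison (run numbers refer to the Model 1 table):

```python
def relative_change(baseline_time_s, baseline_cost, gpu_time_s, gpu_cost):
    """Speed-up and cost increase of a GPU run versus a CPU-only baseline."""
    speedup = 1 - gpu_time_s / baseline_time_s      # fraction faster
    cost_increase = gpu_cost / baseline_cost - 1    # fraction costlier
    return round(speedup, 2), round(cost_increase, 2)

# r4.16xlarge (run 8) vs p2.16xlarge (run 20): ~16% faster, ~184% costlier
print(relative_change(47 * 60 + 36, 3.38, 40 * 60 + 1, 9.60))    # (0.16, 1.84)
# r5d.metal (run 9) vs p4d.24xlarge (run 25): ~9% faster, ~330% costlier
print(relative_change(29 * 60 + 34, 3.41, 26 * 60 + 49, 14.65))  # (0.09, 3.3)
```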
Model 2
Number | Instance | np | nt | Cores | Elapsed time (hh:mm:ss) | Memory Usage (MB) | Disk Usage (MB) | AWS Cost ($) |
1 | r5d.metal* | 8 | 6 | 48 | 15:00:26 | 717373 | 2205824 | $103.73 |
2 | r5d.metal* | 8 | 6 | 48 | 15:02:52 | 717692 | 2205922 | $104.01 |
3 | z1d.metal* | 8 | 6 | 48 | 15:14:17 | 717716 | 2205861 | $136.05 |
4 | z1d.metal* | 12 | 6 | 72 | 11:47:05 | 961168 | 2444745 | $157.82 |
5 | r5d.16xlarge | 16 | 8 | 128 | 09:31:30 | 1897887 | 2677267 | $175.56 |
6 | r5d.metal | 18 | 8 | 144 | 08:50:52 | 2070137 | 2864499 | $183.47 |
7 | r5d.metal | 18 | 8 | 144 | 09:10:26 | 2069472 | 2864499 | $190.23 |
8 | x1.32xlarge | 8 | 8 | 64 | 15:08:05 | 1491149 | 1816795 | $202.05 |
* These jobs ran out-of-core; the rest ran in-core.
Based on the results of Model 1, the most cost-efficient nodes were chosen for study. Additionally, r5d.16xlarge and x1.32xlarge were chosen to allow the job to run in-core.
Jobs 1 and 6 did not use any debug parameters such as KEEP401, KEEPINCO, or IRELAXP. The difference between using these parameters and not using them is minimal.
The x1.32xlarge has enough memory to run this model in-core on a single node. However, its 64 older Haswell cores cannot match the performance of r5d, which can use more cores by adding nodes.
Optistruct can scale up to 144 cores with r5d.metal, which yields the fastest time. As with Model 1, this is not cost-effective. It is much cheaper to run out-of-core on r5d.metal if the increase in solve time is not a concern. The z1d.metal is neither cost-effective nor performant for this class of problem, as it is limited by network bottlenecks across nodes and by having to run out-of-core.
The in-core runs on r5d.metal and r5d.16xlarge show the best performance for this model. Using R5 instances without NVMe storage might yield a better cost-to-performance ratio, as storage speed is not an important factor for jobs running in-core.
GPU instances were not studied for this model as there was limited benefit shown in Model 1.
Recommendations
These recommendations are valid for Optistruct 2021.2. As the solver continuously improves, they may change, especially with regard to scaling across nodes, memory usage, and GPU performance.
Z1d.metal offers a good balance of performance, cost, and memory/core ratio, which makes it suitable for small to medium-sized models. The best performance is realized when running in-core on a single node or up to 3 nodes. An MPI process/thread split of np 6 and nt 4 is recommended. Higher np settings may yield better performance for smaller models but are more likely to run out of memory. There is a modest drop in performance if a job falls back to out-of-core, but it still completes successfully. The m5 series is another good option with a higher core count, but its lower memory/core ratio makes it less suitable for running medium-sized jobs in-core.
A cost-effective way to run large models is to run them out-of-core on r5d.metal. The node's resources are fully utilized and there is no need to contend with network performance. If performance is the primary concern, the job can be executed in-core on up to 3 nodes to get the fastest solve time.
If optimizing for Optistruct alone, it is currently recommended to invest in faster CPUs rather than GPUs. GPUs may be worthwhile as an upgrade to an older CPU platform. They can also make sense if Optistruct runs in the same environment used for other GPU applications such as NanofluidX or UltrafluidX.
Job Size | Instance | Nodes | -core flag | MPI setting
< 20 Million DOF | z1d.metal | Up to 3 | N/A | np 6, nt 4
> 20 Million DOF | r5d.metal | Up to 3 | in | np 6, nt 8
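For quick reference, the table above can be codified directly. The sketch below is purely illustrative and simply mirrors the recommendation table; the dictionary keys are not part of any Optistruct API.

```python
def recommend(dof_millions):
    """Mirror the recommendation table above (Optistruct 2021.2)."""
    if dof_millions < 20:
        return {"instance": "z1d.metal", "nodes": "up to 3",
                "core": None, "np": 6, "nt": 4}
    return {"instance": "r5d.metal", "nodes": "up to 3",
            "core": "in", "np": 6, "nt": 8}

print(recommend(4.6))  # Model 1 (~4.6M DOF) -> z1d.metal, np 6 / nt 4
print(recommend(54))   # Model 2 (~54M DOF)  -> r5d.metal in-core, np 6 / nt 8
```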