Optistruct - AWS cost optimization study

pravarth
Altair Employee
edited April 2022 in Altair HyperWorks


 

The purpose of this study is to find the most cost-effective way to run Optistruct on AWS infrastructure. The document examines various instance types and run options that can be used to reduce the cost per job.

Models used

 

Two models were used for this study.

  1. Model 1: Powertrain model set up for non-linear static analysis with sliding contacts. The model typically runs in under an hour and is a good representation of the smaller models used for analysis.
  2. Model 2: Large model set up for non-linear static analysis with large displacement and frictional contacts. This model requires a large amount of memory and many processors to run.

 

Model Statistics

 

| Statistic | Model 1 | Model 2 |
| --- | --- | --- |
| Total No. of Grids | 1.5 Million | 18 Million |
| Total No. of Elements Excluding Contact | 841 Thousand | 12 Million |
| Total No. of Contact Elements | 84 Thousand | 138 Thousand |
| Total No. of Degrees of Freedom | 4.6 Million | 54 Million |

 

 

AWS instance types used

 

| Instance | Cores | Memory (GiB) | Storage (GB) | GPU | Network | Cost (08/2021) $/hr |
| --- | --- | --- | --- | --- | --- | --- |
| R5d.24xlarge | 48 | 768 | 4 x 900 NVMe SSD | | 25 Gigabit | 6.912 |
| R5d.metal | 48 | 768 | 4 x 900 NVMe SSD | | 25 Gigabit | 6.912 |
| R5dn.metal | 48 | 768 | 4 x 900 NVMe SSD | | 100 Gigabit | 8.016 |
| R5ad.24xlarge | 48 | 768 | 4 x 900 NVMe SSD | | 20 Gigabit | 6.288 |
| R5d.16xlarge | 32 | 512 | 4 x 600 NVMe SSD | | 20 Gigabit | 4.608 |
| X1.32xlarge | 64 | 1952 | 2 x 1920 SSD | | 25 Gigabit | 13.338 |
| Z1d.metal | 24 | 384 | 2 x 900 NVMe SSD | | 25 Gigabit | 4.464 |
| R4.16xlarge | 32 | 488 | EBS | | 25 Gigabit | 4.256 |
| P2.8xlarge | 16 | 488 | EBS | 8 x Tesla K80 | 10 Gigabit | 7.200 |
| P2.16xlarge | 32 | 732 | EBS | 16 x Tesla K80 | 25 Gigabit | 14.400 |
| P3.8xlarge | 16 | 244 | EBS | 4 x Tesla V100 | 10 Gigabit | 12.240 |
| P3.16xlarge | 32 | 488 | EBS | 8 x Tesla V100 | 25 Gigabit | 24.480 |
| P3dn.24xlarge | 48 | 768 | 2 x 900 NVMe SSD | 8 x Tesla V100 | 100 Gigabit | 31.212 |
| P4d.24xlarge | 48 | 1152 | 8 x 1000 NVMe SSD | 8 x Tesla A100 | 400 Gigabit | 32.773 |

 

The instance types were chosen to study various factors affecting performance:

 

CPU speed: Z1d instances have the fastest-clocked CPU cores offered by AWS (4.0 GHz) on the 2nd-generation Intel Cascade Lake architecture. While the R5 and M5 instance families use the same CPU architecture, they are clocked lower, at 3.1 GHz under full load. The X1 instances use the older Intel Haswell architecture but offer 64 cores per node.

 

Memory: The X1 instance offers a high memory-per-core ratio. This allows larger models to run in-core without swapping to disk or needing to scale across multiple nodes. The R5 and Z1d instances offer 16 GiB/core, while the M5 instance offers 8 GiB/core.

 

Storage: AWS offers EBS-backed instances, which are flexible in terms of capacity at the cost of latency and throughput. This study mostly uses instances with locally attached NVMe SSD storage in order to eliminate any variability in the results due to EBS.

 

Network: AWS offers high-speed interconnects between nodes with the Elastic Fabric Adapter (EFA). Its performance benefit over the standard Elastic Network Adapter (ENA) interface is studied when using multiple nodes.

 

GPU: Optistruct 2021.1 supports the MUMPS solver on GPU along with DDM, and this is studied with various GPU compute instances provided by AWS. Only GPUs with good FP64 performance were chosen, as double-precision performance is important; the AWS P2, P3, and P4 instances were selected for this reason.

 

Bare-metal instances were used wherever possible, as they cost the same as their virtualized counterparts while incurring lower virtualization overhead. This further improves performance at no additional cost.

 

Optistruct parameters used

 

DDM – Domain decomposition was used to scale the model across multiple nodes as well as to improve scaling within a single node.

In-core/out-of-core – Based on available system memory, in-core and out-of-core execution were compared to find the sweet spot in terms of performance per cost.

KEEP401=0 – Disables L0 threading to reduce memory usage at the cost of performance. Only used for the large model.

IRELAXP=1 – Turns on relaxed pivoting, which can be useful for large models.

PARAM, IMPLOUT, ONLY – Enables result generation on the fly and disables the regular result generation at the end of the run.
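
As an illustration, below is a minimal sketch of how these options might be combined in practice. It is not taken from the study itself: the run-line options (-ddm, -np, -nt, -core) follow common OptiStruct launcher syntax, but their exact spelling, and the card used to set KEEP401 and IRELAXP, should be verified against the OptiStruct reference for your version.

```bash
# Sketch only: check option and card names against the OptiStruct documentation
# for the installed version.

# Deck-side setting from the list above (added to the .fem file):
#   PARAM, IMPLOUT, ONLY      $ write results on the fly, skip the final output pass
# KEEP401=0 and IRELAXP=1 were applied only to the large model; the card used to
# set them (solver/debug setting) is version-dependent.

# Example launch: 8 MPI domains x 6 threads per domain (48 cores), forced in-core.
optistruct model.fem -ddm -np 8 -nt 6 -core in
```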

Results and discussion

 

Model 1

| Number | Instance | np | nt | Cores | GPU | Elapsed time (hh:mm:ss) | Memory Usage (MB) | Disk Usage (MB) | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | z1d.metal | 12 | 2 | 24 | | 00:30:55 | 177906 | 52660 | $2.30 |
| 2 | z1d.metal | 8 | 3 | 24 | | 00:31:34 | 145686 | 41141 | $2.35 |
| 3 | r5d.metal | 24 | 2 | 48 | | 00:23:24 | 234016 | 85806 | $2.70 |
| 4 | z1d.metal** | 8 | 3 | 24 | | 00:37:40 | 85556 | 105237 | $2.80 |
| 5 | r5d.metal | 16 | 3 | 48 | | 00:24:47 | 194495 | 63629 | $2.86 |
| 6 | r5d.metal | 12 | 4 | 48 | | 00:26:38 | 176333 | 52660 | $3.07 |
| 7 | z1d.metal | 16 | 3 | 48 | | 00:22:16 | 194604 | 63629 | $3.31 |
| 8 | r4.16xlarge | 8 | 4 | 32 | | 00:47:36 | 172474 | 44115 | $3.38 |
| 9 | r5d.metal | 8 | 6 | 48 | | 00:29:34 | 174599 | 44115 | $3.41 |
| 10 | r5d.metal EFS* | 8 | 6 | 48 | | 00:29:37 | 145049 | 41141 | $3.41 |
| 11 | z1d.metal | 12 | 4 | 48 | | 00:23:22 | 176550 | 52660 | $3.48 |
| 12 | r5d.24xlarge | 8 | 6 | 48 | | 00:30:28 | 169698 | 44115 | $3.51 |
| 13 | r5dn.metal EFA | 8 | 6 | 48 | | 00:29:22 | 145049 | 41141 | $3.92 |
| 14 | r5d.metal** | 8 | 6 | 48 | | 00:38:18 | 85317 | 105214 | $4.41 |
| 15 | r5d.metal | 16 | 6 | 96 | | 00:22:07 | 193978 | 63629 | $5.10 |
| 16 | r5ad.24xlarge | 8 | 6 | 48 | | 00:49:11 | 168001 | 44115 | $5.15 |
| 17 | r5dn.metal EFA | 16 | 6 | 96 | | 00:20:59 | 193869 | 63629 | $5.61 |
| 18 | p2.8xlarge | 8 | 2 | 16 | 8 | 00:51:43 | 185136 | 44115 | $6.21 |
| 19 | r5d.metal | 24 | 6 | 144 | | 00:19:42 | 232323 | 85806 | $6.81 |
| 20 | p2.16xlarge | 8 | 4 | 32 | 4 | 00:40:01 | 188990 | 44115 | $9.60 |
| 21 | p2.16xlarge | 8 | 4 | 32 | 16 | 00:40:57 | 186761 | 44115 | $9.83 |
| 22 | p2.16xlarge | 8 | 4 | 32 | 8 | 00:41:14 | 182650 | 44115 | $9.90 |
| 23 | p3.8xlarge | 8 | 2 | 16 | 4 | 00:51:21 | 186946 | 44115 | $10.48 |
| 24 | p2.16xlarge | 8 | 4 | 32 | 1 | 00:44:37 | 192376 | 44115 | $10.71 |
| 25 | p4d.24xlarge | 8 | 6 | 48 | 8 | 00:26:49 | 187546 | 44115 | $14.65 |
| 26 | p3dn.24xlarge | 8 | 6 | 48 | 8 | 00:28:50 | 193960 | 44115 | $15.00 |
| 27 | p3.16xlarge | 8 | 4 | 32 | 8 | 00:37:28 | 195478 | 44115 | $15.29 |
| 28 | p3.16xlarge | 8 | 4 | 32 | 8 | 00:37:58 | 187565 | 44115 | $15.49 |
| 29 | p3.8xlarge | 1 | 16 | 16 | 4 | 01:25:38 | 102451 | 19701 | $17.47 |
| 30 | p3.16xlarge | 4 | 8 | 32 | 8 | 00:43:59 | 143087 | 30065 | $17.95 |

 

* r5d.metal EFS – uses an NFS shared location (EFS) as the working directory where the results are saved, but uses local scratch space for the matrices.

** These jobs ran out-of-core. The rest ran in-core.

 

Running the job on the smaller model served two purposes. First, it gives a good sense of scalability and performance for models without memory constraints. Second, many combinations of instance types and configurations could be studied in a shorter time frame and at lower cost. Trends observed here were then used as a guide to reduce wasted resources on the large model.

 

The table above lists all the runs sorted from cheapest to costliest. Optistruct favors high clock speed per core, as shown by the performance of z1d compared to r5d. Even with half as many cores, the z1d instance is within 7% of the performance of r5d and is therefore much more cost-effective. Its performance remains higher than r5d even when the number of MPI processes and threads is matched (runs 6 and 11).

 

The on-demand cost is also listed for each job. While run 19 with 144 cores yields the fastest time, it is also the most expensive CPU-only run. Cost effectiveness goes down as more nodes are added to these smaller jobs, because the performance improvement does not offset the additional cost incurred. Using EFA does improve the scalability of multi-node jobs, but once again the added cost of the r5dn instance makes it unfavorable compared to the instances with ENA.
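
The cost column appears to be simply elapsed wall time multiplied by the number of nodes and the on-demand hourly rate from the instance table. A small sketch of that arithmetic, using run 19 (3 x r5d.metal at $6.912/hr) as the example:

```bash
# Per-job on-demand cost = elapsed hours x nodes x hourly rate.
# Example: run 19 ran 00:19:42 on 3 x r5d.metal at $6.912/hr.
elapsed="00:19:42"; nodes=3; rate=6.912
echo "$elapsed" | awk -F: -v n="$nodes" -v r="$rate" \
  '{ printf "Cost: $%.2f\n", ($1 + $2/60 + $3/3600) * n * r }'   # prints ~$6.81
```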

 

Bare-metal instances (r5d.metal) offer a free performance improvement over their virtualized counterparts (r5d.24xlarge). In this case (runs 9 and 12), there is an uplift of about 3%, which translates into a similar cost saving.

 

NFS-backed storage such as Amazon Elastic File System (EFS) can be used as the working directory without any appreciable loss in performance. Run 9 stored both scratch and result files on local NVMe storage, whereas run 10 stored result files on EFS and scratch files on local NVMe storage. However, this may not hold true if the result files are written frequently or are large, as in transient and dynamic analyses.
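
A minimal sketch of the run-10 style setup is shown below, assuming the EFS file system already exists and the local NVMe volume is mounted at /scratch. The EFS DNS name is a placeholder, and the -tmpdir option (used here to redirect scratch files to local storage) should be confirmed against the OptiStruct run-option documentation for your version.

```bash
# Mount the shared EFS file system as the working directory (placeholder DNS name):
sudo mkdir -p /efs/workdir
sudo mount -t nfs4 -o nfsvers=4.1 fs-XXXXXXXX.efs.us-east-1.amazonaws.com:/ /efs/workdir

# Launch from the shared working directory so results land on EFS,
# while scratch/matrix files are written to the local NVMe volume:
cd /efs/workdir
optistruct model.fem -ddm -np 8 -nt 6 -tmpdir /scratch
```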

 

Optistruct currently supports up to 8 GPUs per job. However, it is also important to have enough memory and CPU cores for the tasks that cannot run on the GPUs; therefore, only the larger GPU instances were chosen, even though the data shows limited benefit from using up to 8 GPUs. The p2.16xlarge can be compared with the r4.16xlarge, as they use the same CPU and the major difference between them is the K80 GPUs: the GPUs yield a performance benefit of 16%, but at a cost increase of 184%. Similarly, r5d.metal can be compared with p4d.24xlarge: here the A100 GPUs improve performance by 9%, but the cost increases by 330%.


Model 2

| Number | Instance | np | nt | Cores | Elapsed time (hh:mm:ss) | Memory Usage (MB) | Disk Usage (MB) | AWS Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | r5d.metal* | 8 | 6 | 48 | 15:00:26 | 717373 | 2205824 | $103.73 |
| 2 | r5d.metal* | 8 | 6 | 48 | 15:02:52 | 717692 | 2205922 | $104.01 |
| 3 | z1d.metal* | 8 | 6 | 48 | 15:14:17 | 717716 | 2205861 | $136.05 |
| 4 | z1d.metal* | 12 | 6 | 72 | 11:47:05 | 961168 | 2444745 | $157.82 |
| 5 | r5d.16xlarge | 16 | 8 | 128 | 09:31:30 | 1897887 | 2677267 | $175.56 |
| 6 | r5d.metal | 18 | 8 | 144 | 08:50:52 | 2070137 | 2864499 | $183.47 |
| 7 | r5d.metal | 18 | 8 | 144 | 09:10:26 | 2069472 | 2864499 | $190.23 |
| 8 | x1.32xlarge | 8 | 8 | 64 | 15:08:05 | 1491149 | 1816795 | $202.05 |

 

* These jobs ran out-of-core. The rest ran in-core.

 

Based on the results of Model 1, the most cost-efficient instances were chosen for this study. Additionally, r5d.16xlarge and x1.32xlarge were chosen to allow the job to run in-core.

 

Jobs 1 and 6 did not use any debug parameters such as KEEP401, KEEPINCO, or IRELAXP. The difference between using these parameters and not using them is minimal.

 

The x1.32xlarge has enough memory to run this model in-core on a single node. However, its 64 older Haswell cores cannot match the performance of r5d, which can bring more cores to bear by adding nodes.

 

Optistruct can scale up to 144 cores with r5d.metal, which yields the fastest time. As with Model 1, this is not cost-effective. It is much cheaper to run out-of-core on a single r5d.metal if the increase in solve time is not a concern. The z1d.metal is neither cost-effective nor performant for this class of problem, as it is limited by network bottlenecks when going across nodes and by having to run out-of-core.

 

The in-core runs on r5d.metal and r5d.16xlarge show the best performance for this model. Using r5 instances without NVMe storage might yield a better cost/performance ratio, as storage speed is not an important factor for jobs running in-core.

 

GPU instances were not studied for this model as there was limited benefit shown in Model 1.

 

Recommendations

 

These recommendations are valid for Optistruct 2021.2. As the solver continuously improves, they may change, especially with respect to scaling across nodes, memory usage, and GPU performance.

 

z1d.metal has a good balance of performance, cost, and memory-per-core ratio, which makes it suitable for small to medium-sized models. The best performance is realized when running in-core on a single node, or on up to 3 nodes. An MPI process/thread split of np 6 and nt 4 is recommended. Higher np settings may yield better performance for smaller models but are more likely to run out of memory. There is a modest drop in performance if the job falls back to out-of-core execution, but the jobs still complete successfully. The m5 series is another good option with a higher core count, although its lower memory-per-core ratio makes it less suitable for running medium-sized jobs in-core.

 

A cost-effective way to run large models is to run them out-of-core on a single r5d.metal. The node's resources are fully utilized and there is no need to contend with network performance. If performance is the primary concern, the job can be executed in-core on up to 3 nodes to get the fastest solve time.

 

If optimizing for Optistruct alone, it is currently recommended to invest in faster CPUs rather than GPUs. GPUs may be worthwhile as an upgrade to an older CPU platform, or if Optistruct runs in the same environment used for other GPU applications such as NanofluidX or UltrafluidX.

 

| Job Size | Instance | Nodes | -core flag | MPI setting |
| --- | --- | --- | --- | --- |
| < 20 Million DOF | z1d.metal | Up to 3 | N/A | np 6 nt 4 |
| > 20 Million DOF | r5d.metal | Up to 3 | in | np 6 nt 8 |
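
As a closing illustration, the two rows above might translate into run lines like the following. This is a sketch under the same assumptions as the earlier examples (the launcher name and the -ddm, -np, -nt, -core, and -hostfile option spellings should be checked against the installed OptiStruct version), hosts.txt is a placeholder host list, and the np value in the table is read as per node (Model 2 run 6, for example, used 18 MPI domains x 8 threads across 3 r5d.metal nodes).

```bash
# < 20 million DOF: a single z1d.metal node, 6 MPI domains x 4 threads (24 cores).
optistruct small_model.fem -ddm -np 6 -nt 4

# > 20 million DOF: 3 x r5d.metal, forced in-core, 6 domains per node x 8 threads.
optistruct large_model.fem -ddm -np 18 -nt 8 -core in -hostfile hosts.txt
```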