AWS Improves Write Performance by 4X With Mistral I/O Profiling

LydiaW
Altair Employee
edited August 2022 in Altair HPCWorks

An AWS customer wanted to scale their machine learning (ML) workload to hundreds of thousands of machine instances. Their goal was to download large images of people and cars from S3 to EBS storage and process them to train a self-driving car. Optimizing and scaling storage usage was key, but writing the images to disk created a bottleneck. The AWS team profiled the application using Altair Mistral™ to see how the workflow could be improved, with results that were well worth the effort.

The Workflow

Cloud storage users commonly encounter I/O problems as they switch between types of storage and upload or download data across storage tiers. I/O profiling with tools such as Altair Breeze™ and Mistral is becoming even more vital as big data gets bigger and more companies turn to platforms such as AWS for compute power. Cloud workloads involve many factors, including whether I/O is bursty or non-bursty and which EBS volume type to choose: gp2 does well for 80% of cases, but sc1 and st1 are cheaper. In this case, the customer had opted for the cheaper volume types. The application under scrutiny was an ML workload that processes images to train a self-driving car. It is a highly parallel workload that the customer wanted to scale to hundreds of thousands of machine instances on AWS.
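Mistral's own interface is not described in this post, so the following is only a hypothetical Python sketch of the kind of data an I/O profiler collects. It wraps a file object, records the size and timestamp of every write() call, and summarizes how many writes were small and over what time span they occurred. The names WriteProfile and ProfiledWriter and the 64 KiB "small write" threshold are illustrative assumptions, not part of Mistral or Breeze.

```python
import io
import time
from dataclasses import dataclass, field


@dataclass
class WriteProfile:
    """Hypothetical helper: per-call record of write sizes and timestamps."""
    sizes: list = field(default_factory=list)
    times: list = field(default_factory=list)

    def summary(self, small_threshold: int = 64 * 1024) -> dict:
        """Aggregate recorded writes; 'small' means fewer than small_threshold bytes."""
        n = len(self.sizes)
        total = sum(self.sizes)
        small = sum(1 for s in self.sizes if s < small_threshold)
        return {
            "write_calls": n,
            "total_bytes": total,
            "mean_write_bytes": total / n if n else 0.0,
            "small_write_fraction": small / n if n else 0.0,
            # Time between first and last write; useful for spotting bursts.
            "span_seconds": self.times[-1] - self.times[0] if n > 1 else 0.0,
        }


class ProfiledWriter:
    """Hypothetical wrapper that records the size of every write() call."""

    def __init__(self, raw, profile: WriteProfile):
        self._raw = raw
        self._profile = profile

    def write(self, data: bytes) -> int:
        written = self._raw.write(data)
        self._profile.sizes.append(written)
        self._profile.times.append(time.perf_counter())
        return written

    def close(self) -> None:
        self._raw.close()


if __name__ == "__main__":
    profile = WriteProfile()
    sink = ProfiledWriter(io.BytesIO(), profile)
    # Simulate one image being written in many 4 KiB chunks.
    for _ in range(256):
        sink.write(b"\x00" * 4096)
    print(profile.summary())
```

AWS positions st1 and sc1 as HDD-backed volumes suited to large, sequential I/O, so a profile showing a high small_write_fraction on such a volume would be one signal worth investigating. This sketch does not reproduce what the AWS team actually found.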

Please see attached PDF to read more.