Custom hardware to accelerate machine learning inference births a dozen unicorns
Are the days of CPUs and GPUs numbered in the AI space as the availability of custom microchips for inference explodes? Who will win as the VHS vs. Betamax story replays in the machine learning hardware arena? Who will see a return from their investment in these billion-dollar unicorn startups?
As algorithms are gradually replaced with machine learning and smart technology becomes ubiquitous, there has been a seismic shift in the compute landscape. While machine learning workloads are not usually at the scale of the big scientific workloads of HPC or as cumbersome as the high-throughput workloads in the semiconductor industry, they require substantial silicon real estate. There is a huge demand for hardware to accelerate machine learning inference.
To capitalize on the demand, startups building new hardware solutions are blossoming from the usual places: some spinning out of the R&D centers of Silicon Valley and others emerging from the tech-rich environments in Cambridge and elsewhere in the UK. They are drawn by the recent boom in global investment in machine learning hardware, which has spurred on many unicorn startups and multi-billion-dollar acquisitions before anything has really hit the market at any kind of scale.
Before we dig into the who’s who of this growing space, let’s take a moment to remind ourselves what the hardware is for. All the new, enthusiastic entrants to the market focus on training or inference, the two phases critical to deep learning; some address both. In simple terms, training is the process of building the model and inference is the process of applying it to unseen data to infer an answer.
All about training
Training is performed on pre-annotated data. Usually, one set is used to train a model and another to validate it as training proceeds, with a third unseen set held back to measure the accuracy of the final model. The training phase differs from the way humans learn in that the network is applied to the data without any context and the answer is simply right or wrong. It is a powerful way to build complex decision-making tools, but it does require specialized knowledge to ensure that the data presented is going to result in the model required.
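To make that three-way split concrete, here is a minimal Python sketch using scikit-learn; the digits dataset and logistic-regression model are stand-ins chosen purely for illustration, not a recommendation for any real workload.

    # Minimal sketch of a train/validate/test workflow (illustrative only).
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)

    # First hold back an unseen test set, then split the remainder
    # into training and validation data.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("validation accuracy:", model.score(X_val, y_val))
    print("test accuracy:", model.score(X_test, y_test))  # rank the final model

The point of the held-back test set is that it plays no part in fitting or tuning, so its score is the closest thing we have to how the model will behave in the wild.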
For example, in the US an algorithm for granting credit cards was found to be biased against women, approving applications from husbands while denying their wives who had equal credit scores. Women are often lower earners and the algorithm had learned that being female was sufficient reason to deny a credit card application, irrespective of income. Of course, humans can and do make the same kinds of mistakes, but in this case the bank took the service offline for retraining.
As well as being aware that machines make mistakes, it’s essential to factor context into a machine’s response. For example, around the world we often say “great start” in response to a request for feedback. What this implies can differ materially. In the US a great start is just that: a great start. In the UK it is a polite and supportive way of letting someone know that they still have a lot of work ahead of them. Similarly, a “bold move” in the US is a form of praise, but in the UK it is how we let our friends know that they have had a lapse in common sense.
When training a model, it really matters where the data comes from. A dataset that lacks geographic tagging will never be able to train a model to correctly identify the meaning behind those phrases, any more than I knew what my German colleague meant when she said I had made a great start on a document for her. Did she mean a “great start” or a “great start”?
While we build machine learning into our everyday lives, from cancer research to public transport, it is important to continually evaluate, improve, and iterate on our models. All of this means more compute. This is certainly what some new entrants to the world of machine learning hardware are betting on.
Custom hardware for inference
Inference is the process of applying the model to the real world, or at least to some real data. AI-centric chips built for this purpose, designed for the data center or for use at the edge, are arriving in a variety of flavors. Until recently, GPUs have been the hardware to beat, but they are quickly being left behind by emerging, more specialized architectures.
Inference is usually performed with 8-bit integers, meaning values that fit in a single byte. This means that small, parallel, highly tuned architectures can be used at enormous scales. Single machines are emerging with close to a million cores. This is at odds with GPU workloads, which are usually floating point and so perform calculations over an almost arbitrarily large range of values.
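To illustrate what “8-bit” means in practice, here is a hedged NumPy sketch of symmetric quantization, one common scheme (not any particular vendor’s) for mapping 32-bit floats onto signed 8-bit integers.

    # Illustrative symmetric int8 quantization of a float32 weight tensor.
    import numpy as np

    weights = np.random.randn(4, 4).astype(np.float32)

    # One scale for the whole tensor: map the largest magnitude to 127.
    scale = np.abs(weights).max() / 127.0
    q_weights = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)

    # The accelerator works on the small integers; multiplying by the
    # scale recovers an approximation of the original values.
    dequantized = q_weights.astype(np.float32) * scale
    print("max quantization error:", np.abs(weights - dequantized).max())

Trading a little precision for byte-sized arithmetic is what lets so many simple integer cores be packed onto a single device.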
Between training and inference, most models are optimized to remove decision paths that are no longer used and to “fuse” together layers of the network that can be computed in a single step. This tuning phase is important for inference models that are going to run on the edge. It is common to use a rack of GPUs to train a model, but the model needs to be optimized if the inference step is going to take place on edge devices. For example, the speech recognition software that comes with your mobile phone has been trained in the data center, but the model generated needs to be lightweight if it is to be practical for use on a phone.
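As a hedged sketch of what “fusing” looks like in an everyday toolkit, here is layer fusion in PyTorch; the tiny model and its module names are invented for illustration, and the compilers that target custom accelerators perform the same kind of transformation in their own way.

    # Illustrative Conv + BatchNorm + ReLU fusion before deployment.
    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 8, 3)
            self.bn = nn.BatchNorm2d(8)
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(self.bn(self.conv(x)))

    model = TinyNet().eval()  # fusion is an inference-time optimization
    fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]])
    print(fused)  # the Conv2d has absorbed the BatchNorm, with ReLU folded in

Three memory-hungry passes over the data become one, which is exactly the kind of saving that makes a model viable on a phone.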
When designing custom hardware for this market it is important to identify where in the market the device will sit. Is it for training, inference, or both? Will it be sitting in the data center, in the cloud, on a mobile device, or in the emerging “near edge” space of fairly powerful compute that is deployed in places like retail or factory floors?
Who’s who in the disruptive space
Cerebras is a unicorn startup that addresses compute, memory, and interconnect on a single, whole wafer. Inevitably they are based in Sunnyvale, California, but are working with TSMC to build their frisbee-sized device. It targets both training and inference, and with a device that large they can have no ambitions to move out of the data center.
Graphcore is a UK startup spinning up in Bristol with more conventionally sized devices. The region, affectionately known as Silicon South West and reached along the M4 motorway from London, has given us various innovative compute architectures over the years, such as XMOS, Inmos, and Picochip, so we have great expectations of this latest offering. Their units are called IPUs, or Intelligence Processing Units, with the IPU-M2000 packing one petaflop of computing power into a unit the size of a laptop.
Groq is a Google spin-out competing with Graphcore via another inference engine, which they call the Tensor Streaming Processor. They are focusing on the real-time market and, with more than $300 million raised, they have the potential to disrupt the space.
Intel’s Nervana NNP attacked both training and inference following Intel’s $400 million acquisition of Nervana in 2016. The success was short-lived: Intel went on to spend $2 billion on Habana Labs, replacing the former machine learning program with the newcomer’s. The Intel Habana Gaudi chips are now available on AWS via the snappily named dl1.24xlarge instances, so go ahead and give them a go.
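For anyone who does want to give them a go, launching one from Python with boto3 looks roughly like the sketch below; the AMI ID and region are placeholders you would replace with a Habana-enabled deep learning AMI and your own region.

    # Hedged sketch of launching a Gaudi-based DL1 instance with boto3.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
    response = ec2.run_instances(
        ImageId="ami-XXXXXXXXXXXXXXXXX",  # placeholder: a Habana deep learning AMI
        InstanceType="dl1.24xlarge",      # the Gaudi-based instance type
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])

Do remember to terminate the instance afterwards; these are not cheap machines to leave running.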
SambaNova made the headlines last year, coming out of stealth mode to raise $676 million in financing, a Series D that values the company at $5.1 billion. Their DataScale technology underpins a subscription service that aims to deliver turnkey machine learning to enterprises short on data scientists.
Not to be left out of the party, Nvidia attempted to acquire Arm in Cambridge. While Arm has previously dominated in the embedded space, their neural processing units target 8-bit and 16-bit integer-quantized convolutional neural networks. Now that the acquisition hasn’t worked out, we can ask ourselves what is next for these giants in the AI world. Arm is certainly levelling up for an IPO, and Nvidia shows no sign of slowing production of their AI-centric GPUs, so both look to have a stake in this space irrespective of what happens next.
Salience, an Oxford startup, takes a different approach with photonics instead of silicon. It takes me back to my days at Intel when we were pumping data into optical switching networks with FPGAs. In those days the kit filled the room, but Salience can fit their photonics on top of the silicon device for ultra-high throughput. Their photonic tensor processing unit allows data to be modulated at up to 100 GHz, far beyond the capabilities of silicon, so while photonic devices are much larger, they make up for that in throughput.
These are just a few of the companies we’re watching at the moment and we have no doubt we will hear of more in 2022. The need for hardware is showing no signs of slowing yet.
About the author
Dr Rosemary Francis founded Ellexus, the I/O profiling company, in 2010 after working in the semiconductor industry; it was acquired by Altair in 2020. Rosemary obtained her PhD in Computer Architecture from the University of Cambridge. She is now Chief Scientist for HPC at Altair, responsible for the future roadmap of the workload managers Altair PBS Professional and Altair Grid Engine. She continues to be the product manager of the I/O profiling products and is shaping Altair’s analytics and reporting solutions across its HPC portfolio. Outside of Altair, Rosemary is a member of the Raspberry Pi Foundation, an educational charity that promotes access to technology education and digital making. Rosemary has two small children and is a keen gardener and windsurfer.