A Peek Into Graviton2: Amazon's Neoverse N1 Server Chip First Impressions

It’s been a year and a half since Amazon released their first-generation Graviton Arm-based processor core, publicly available in AWS EC2 as the so-called ‘A1’ instances. While the processor didn’t impress all too much in terms of its performance, it was a signal and first step of what’s to come over the next few years.

This year, Amazon is doubling down on its silicon efforts, having announced the new Graviton2 processor last December, and planning public availability on EC2 in the next few months. The latest generation implements Arm’s new Neoverse N1 CPU microarchitecture and mesh interconnect, a combined infrastructure-oriented platform that we had detailed a little over a year ago. The platform is a massive jump over previous Arm-based server attempts, and Amazon is aiming for nothing less than a leading competitive position.

Amazon’s endeavors in designing a custom SoC for its cloud services started back in 2015, when the company acquired Israel-based Annapurna Labs. Annapurna had previously worked on networking-focused Arm SoCs, mostly used in products such as NAS devices. Under Amazon, the team was tasked with creating a custom Arm server-grade chip, and the new Graviton2 is the first serious attempt at disrupting the space.

So, what is the Graviton2? It’s a 64-core monolithic server chip design, using Arm’s new Neoverse N1 cores (microarchitectural derivatives of the mobile Cortex-A76 cores) as well as Arm’s CMN-600 mesh interconnect. It’s a pretty straightforward design that is essentially almost identical to Arm’s 64-core reference N1 platform that the company had presented back a year ago. Amazon did diverge a little bit; for example, the Graviton2’s CPU cores are clocked in at a lower 2.5GHz, and it includes only 32MB instead of 64MB of L3 cache in the mesh interconnect. The system is backed by 8-channel DDR4-3200 memory controllers, and the SoC supports 64 PCIe4 lanes for I/O. It’s a relatively textbook implementation of the N1 platform, manufactured on TSMC’s 7nm process node.
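
If you want to verify what silicon an instance actually lands on, the CPU identification fields that Linux exposes are enough. The following is a minimal Python sketch of our own (not something Amazon provides) that reads /proc/cpuinfo – a Neoverse N1 core reports CPU implementer 0x41 (Arm) and CPU part 0xd0c.

# Quick sanity check of what silicon a Linux VM is actually running on.
# On Arm systems /proc/cpuinfo exposes the MIDR fields; the Neoverse N1
# reports implementer 0x41 (Arm) and part 0xd0c.
import os

def cpu_summary(path="/proc/cpuinfo"):
    fields = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                key, _, value = line.partition(":")
                fields.setdefault(key.strip(), value.strip())
    return {
        "logical_cpus": os.cpu_count(),
        "implementer": fields.get("CPU implementer"),  # 0x41 on Arm designs
        "part": fields.get("CPU part"),                # 0xd0c for Neoverse N1
        "model_name": fields.get("model name"),        # populated on x86 instead
    }

if __name__ == "__main__":
    print(cpu_summary())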

The Graviton2’s potential is of course enabled by the new N1 cores. We’ve already seen the Cortex-A76 perform fantastically in last year’s mobile SoCs, and the N1 microarchitecture is expected to bring even better performance and server-grade features, all whilst retaining the power efficiency that’s made Arm so successful in the mobile space. The N1 cores remain very lean and efficient, at a projected ~1.4mm² for a 1MB L2 cache implementation such as on the Graviton2, and sporting excellent power efficiency at around ~1W per core at the 2.5GHz frequency at which Amazon’s new chip arrives.

Total power consumption of the SoC is something that Amazon wasn’t too willing to disclose in the context of our article – the company is still holding some aspects of the design close to its chest even though we were able to test the new chipset in the cloud. Given the chip’s more conservative clock rate, Arm’s projected figure of around 105W for a 64-core 2.6GHz implementation, and Ampere’s recent disclosure of their 80-core 3GHz N1 server chip coming in at 210W, we estimate that the Graviton2 must come in anywhere between 80W as a low estimate and around 110W as a pessimistic projection.
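
As a sanity check on that range, one can scale those two public reference points by core count and frequency. The sketch below is purely illustrative: the linear and cubic frequency-scaling exponents are our own crude assumptions, not disclosed figures.

# Back-of-the-envelope SoC power estimate for a 64-core, 2.5GHz N1 design,
# scaled from two public reference points. The scaling assumptions
# (linear in core count, linear vs. roughly cubic in frequency) are ours.
def scale(power_w, cores, freq_ghz, target_cores=64, target_freq=2.5, freq_exp=1.0):
    return power_w * (target_cores / cores) * (target_freq / freq_ghz) ** freq_exp

arm_ref    = scale(105, cores=64, freq_ghz=2.6)              # Arm's 64-core 2.6GHz projection
ampere_ref = scale(210, cores=80, freq_ghz=3.0, freq_exp=3)  # Ampere's 80-core 3GHz figure

print(f"Scaled from Arm reference:   ~{arm_ref:.0f} W")
print(f"Scaled from Ampere's figure: ~{ampere_ref:.0f} W")

Both crude extrapolations land within the 80-110W window quoted above, which is why we feel comfortable with that estimate despite the lack of an official number.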

Testing In The Cloud With EC2

Given that Amazon’s Graviton2 is a vertically integrated product specifically designed for Amazon’s needs, it makes sense that we test the new chipset in its intended environment (besides the fact that it’s not available in any other way!). For the last couple of weeks, we’ve had preview access to Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2) new Graviton2-based “m6g” instances.

For readers unfamiliar with cloud computing, essentially this means we’ve been deploying virtual machines in Amazon’s datacentres, a service for which Amazon has become famous and which now represents a major share of the company’s revenues, powering some of the biggest internet services on the market.

An important metric determining the capabilities of such instances is their type (essentially dictating what CPU architecture and microarchitecture powers the underlying hardware) and possible subtype; in Amazon’s case this refers to variations of platforms designed for specialized use-cases, such as better compute capabilities or higher memory capacity.

For today’s testing we had access to the “m6g” instances which are designed for memory-intensive workloads and fittingly come with a lot of DRAM capacity. The “6” in the nomenclature designates Amazon’s 6th generation hardware in EC2, with the Graviton2 currently being the only platform holding this designation.
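
As an illustration of the naming scheme, an instance type string can be broken down into family, generation, attribute suffix and size. The small parser below reflects our own reading of the convention (e.g. “g” for Graviton, “a” for AMD, “n” for enhanced networking) and is not an official AWS API.

import re

# Illustrative breakdown of an EC2 instance type string such as "m6g.16xlarge":
# family letter(s), hardware generation, optional attribute suffix, and size.
def parse_instance_type(name: str):
    family_gen, _, size = name.partition(".")
    m = re.match(r"([a-z]+?)(\d+)([a-z]*)$", family_gen)
    family, generation, attributes = m.groups()
    return {"family": family, "generation": int(generation),
            "attributes": attributes, "size": size}

print(parse_instance_type("m6g.16xlarge"))  # {'family': 'm', 'generation': 6, 'attributes': 'g', 'size': '16xlarge'}
print(parse_instance_type("m5n.16xlarge"))  # {'family': 'm', 'generation': 5, 'attributes': 'n', 'size': '16xlarge'}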

Instance Throughput Is Defined in vCPUs

Beyond the instance type, the most important metric defining an instance’s capabilities is its vCPU count. “Virtual CPUs” are essentially the logical CPU cores available to the virtual machine. Amazon offers instances ranging from a single vCPU up to 96 and beyond, with the most common sizes across the popular platforms coming in at 2, 4, 8, 16, 32, 48, 64 and 96 vCPUs.

The Graviton2 being a single-socket 64-core platform without SMT means that the maximum available vCPU instance size is 64.

However, what this also means is that we’re in a bit of an apples-and-oranges conundrum when comparing against platforms which do come with SMT. When talking about 64 vCPU instances (“16xlarge” in EC2 lingo), this means that for a Graviton2 instance we’re getting 64 physical cores, while for an AMD or Intel system we’d only be getting 32 physical cores with SMT. I’m sure there will be readers who will consider such a comparison “unfair”, however it’s also the positioning that Amazon is out to make in terms of delivered throughput, and most importantly, the equivalent pricing between the different instance types.
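
To see concretely whether a given instance’s vCPUs map to physical cores or SMT siblings, the Linux topology files can be collapsed by sibling group; a minimal sketch of our own:

# Count logical CPUs vs. physical cores by collapsing SMT sibling groups,
# using the Linux sysfs topology files. On a Graviton2 instance the two
# numbers match; on an SMT-enabled x86 instance the core count is half the vCPUs.
import glob, os

def core_vs_thread_count():
    sibling_groups = set()
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
        with open(path) as f:
            sibling_groups.add(f.read().strip())
    return {"logical_cpus": os.cpu_count(), "physical_cores": len(sibling_groups)}

if __name__ == "__main__":
    print(core_vs_thread_count())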

Today’s Competition

Today’s article will focus on two main competitors to the Graviton2: AMD EPYC 7571 (Zen1) powered m5a instances, and Intel Xeon Platinum 8259CL (Cascade Lake) powered m5n instances. At the moment of writing, these are the most powerful instances available from the two x86 incumbents, and should provide the most interesting comparison data.

It’s to be noted that we would have loved to include AMD EPYC2 Rome based (c5a/c5ad) instances in this comparison; Amazon had announced they had been working on such deployments last November, but alas the company was not willing to share preview access with us (one reason given was that the Rome C-type instances weren’t a good comparison to the Graviton2’s M-type instances, although this really doesn’t make much technical sense). As these instances get closer to preview availability, we’ll be working on a separate article to add that important piece of the competitive landscape puzzle.

Tested 16xlarge EC2 Instances

                     m6g                       m5a                               m5n
CPU Platform         Graviton2                 EPYC 7571                         Xeon Platinum 8259CL
vCPUs                64                        64                                64
Cores Per Socket     64 (64 instantiated)      32 (32 instantiated)              24 (16 instantiated)
SMT                  -                         2-way                             2-way
CPU Sockets          1                         1                                 2
Frequencies          2.5GHz                    2.5-2.9GHz                        2.9-3.2GHz
Architecture         Arm v8.2                  x86-64 AVX2                       x86-64 AVX-512
Microarchitecture    Neoverse N1               Zen                               Cascade Lake
L1I Cache            64KB                      64KB                              32KB
L1D Cache            64KB                      32KB                              32KB
L2 Cache             1MB                       512KB                             1MB
L3 Cache             32MB shared               8MB shared per 4-core CCX         35.75MB shared per socket
Memory Channels      8x DDR4-3200              8x DDR4-2666 (2x per NUMA node)   6x DDR4-2933 per socket
NUMA Nodes           1                         4                                 2
DRAM                 256GB                     256GB                             256GB
TDP                  80-110W? (estimated)      180W? (estimated)                 210W? per socket (estimated)
Price                $2.464 / hour             $2.752 / hour                     $3.808 / hour

Comparing the Graviton2 m6g instances against the AMD m5a and Intel m5n instances, we’re seeing a few differences in the hardware capabilities that power the VMs. Again, the most notable difference is the fact that the Graviton2 comes with physical core counts matching the deployed vCPU number, whilst the competition also counts SMT logical cores as vCPUs.

Another aspect when talking about higher-vCPU-count instances is the fact that you can receive a VM that spans several sockets. AMD’s m5a.16xlarge here is still able to deploy the VM on a single socket thanks to the EPYC 7571’s 32 cores, however Intel’s Xeon system employs two sockets, as there’s currently no deployed Intel hardware in EC2 which can offer the required vCPU count on a single socket.

Both the EPYC 7571 and the Xeon Platinum 8259CL are parts which aren’t publicly available or even listed on either company’s SKU lists, so these are custom parts for the likes of Amazon for datacentre deployments.

The AMD part is a 32-core Zen1-based single-socket solution (at least for the 16xlarge instances in our testing), clocking in at 2.5GHz across all cores and up to 2.9GHz in lightly threaded scenarios. The peculiarity of this system is that it’s somewhat limited by AMD’s quad-chip MCM design, which has four NUMA nodes (one per chip and its 2-channel memory controller), a characteristic that’s been eliminated in the newer EPYC2 Rome (Zen2) based systems. We don’t have concrete confirmation on the data, but we suspect this is a 180W part based on the SKU number.

Intel’s Xeon Platinum 8259CL is based on the newer Cascade Lake generation of CPU cores. This particular part is also specific to Amazon, and consists of 24 enabled cores per socket. To reach the 16xlarge instance’s 64 vCPU count, EC2 provides us a dual-socket system with 16 of the 24 cores instantiated on each socket. Again, we have no confirmation on the matter, but these parts should be rated at around 210W per socket, or 420W total. We do have to remind ourselves that we’re only ever using about 66% of the system’s cores in our instance, although we do have access to the full memory bandwidth and caches of the system.

The cache configuration in particular is interesting here as things differ quite a bit between platforms. The private caches of the CPU cores themselves are relatively self-explanatory, and the Graviton2 here provides the highest capacity of cache out of the trio, but is otherwise equal to the Xeon platform. If we were to divide the available cache on a per-thread basis, the Graviton2 leads the set at 1.5MB, ahead of the EPYC’s ~1.25MB and the Xeon’s ~1.24MB. The Graviton2 and Xeon systems have the distinct advantage that their last level caches are shared across the whole socket, while AMD’s L3 is only shared amongst 4-core CCX modules.
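
The per-thread figures are simply the cache capacities from the table above divided across the available threads; the short sketch below shows that arithmetic (dividing over the systems’ full thread counts is our own simplification):

# Rough per-vCPU cache arithmetic using the capacities from the table above;
# dividing over full thread counts is a simplification of our own.
def cache_per_thread(l2_per_core, l3_total, cores, threads_per_core=1):
    threads = cores * threads_per_core
    return l2_per_core / threads_per_core + l3_total / threads

graviton2 = cache_per_thread(l2_per_core=1.0, l3_total=32.0,      cores=64)
epyc7571  = cache_per_thread(l2_per_core=0.5, l3_total=8 * 8.0,   cores=32, threads_per_core=2)
xeon      = cache_per_thread(l2_per_core=1.0, l3_total=2 * 35.75, cores=48, threads_per_core=2)

print(f"Graviton2:   ~{graviton2:.2f} MB per vCPU")  # ~1.50
print(f"EPYC 7571:   ~{epyc7571:.2f} MB per vCPU")   # ~1.25
print(f"Xeon 8259CL: ~{xeon:.2f} MB per vCPU")       # ~1.24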

The NUMA discrepancies between the systems aren’t that important in parallel processing workloads with multiple actual processes, but they will have an impact on multi-threaded as well as single-threaded performance, and the Graviton2’s unified memory architecture will have an important advantage in a few scenarios.
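
From inside an instance, the NUMA layout can be read straight from sysfs; a small sketch of what we’d expect to see (one node on m6g, four on m5a, two on m5n):

# List NUMA nodes and the CPUs attached to each, straight from Linux sysfs.
import glob, os

def numa_layout():
    layout = {}
    for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(node, "cpulist")) as f:
            layout[os.path.basename(node)] = f.read().strip()
    return layout

if __name__ == "__main__":
    for node, cpus in numa_layout().items():
        print(f"{node}: cpus {cpus}")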

Finally, there’s quite a difference in pricing between the instances. At $2.464 per hour, the Graviton2 system edges out the AMD system in price, and is massively cheaper than the $3.808 per hour cost of the Xeon based instance. Although when talking about pricing, we do have to remember that the actual value delivered will also depend wildly on the performance and throughput of the systems, which we’ll be covering in more detail later in the article.
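
Since the value delivered ultimately comes down to throughput per dollar, the comparison reduces to a simple ratio. The sketch below uses the hourly rates from the table above; the throughput scores are placeholders standing in for measured results, not data.

# Perf-per-dollar is just measured throughput divided by the hourly rate.
# Hourly prices are the on-demand 16xlarge rates from the table above;
# the throughput values below are placeholders, not measurements.
PRICES_PER_HOUR = {"m6g.16xlarge": 2.464, "m5a.16xlarge": 2.752, "m5n.16xlarge": 3.808}

def perf_per_dollar(throughput: float, instance: str) -> float:
    return throughput / PRICES_PER_HOUR[instance]

# Example usage with made-up throughput scores:
for instance, score in {"m6g.16xlarge": 100.0, "m5a.16xlarge": 90.0, "m5n.16xlarge": 120.0}.items():
    print(f"{instance}: {perf_per_dollar(score, instance):.1f} score per $/hour")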

We thank Amazon for providing us with preview access to the m6g Graviton2 instances. Aside from giving us access, neither Amazon nor any of the other mentioned companies had influence on our testing methodology, and we paid for our EC2 instance testing time ourselves.

