Exploring the Ultimate GPU Interconnect Technology: The Disappearing Memory Wall


The article “The Achilles Heel of AI Computing Power: The Memory Wall” pointed out that over the past 20 years, peak hardware computing power has increased by 90,000 times, while DRAM and hardware interconnect bandwidth have increased by only 30 times. If this trend continues, data transmission within and between chips will quickly become the bottleneck in training large-scale AI models.

Last month, “larger GPUs” were unveiled at the NVIDIA GTC 2024 conference: the new Blackwell-architecture B200 and GB200. The B200 is built on TSMC’s 4nm process and packs 208 billion transistors, while the GB200 combines one Grace CPU with two B200 GPUs.

Currently, leading AI chip manufacturers are pushing the limits of existing chip design and manufacturing technology, but the question is: what happens when these “tricks” run out?

Among the cutting-edge chip startups targeting AI workloads, we covered SambaNova, Tenstorrent, and Ascenium in previous articles. The core problems they want to solve are placement and routing.

Founded in 2021, Eliyan focuses on chiplet interconnect technology. The company has made architectural innovations in the physical layer (PHY) and launched its NuLink technology, which it claims can deliver very large system-level packages on standard packaging technology, improving the performance of AI workloads by as much as 10x by eliminating the memory wall that hems in compute. In March of this year, Eliyan raised $60 million in Series B financing.

While most people focus on the floating-point and integer processing architectures of the various compute engines, we are starting to pay more attention to memory hierarchies and interconnect hierarchies. The computation itself is the easy part; moving data and managing memory are becoming ever harder.

To put it simply, a few numbers tell the story: over the past two decades, CPU and GPU computing power has increased 90,000 times, but DRAM memory bandwidth and interconnect bandwidth have increased only 30 times. While progress has been made in some areas in recent years, compute and memory remain seriously out of balance. For many AI and HPC workloads, this means we are overinvesting in compute engines that are starved of memory.
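To make the scale of that imbalance concrete, here is a minimal back-of-the-envelope sketch in Python. It uses only the two growth factors quoted above; the 20-year window and the derived per-year growth rates are rough arithmetic, not figures from Eliyan.

```python
# Back-of-the-envelope view of the imbalance quoted above: peak compute grew
# ~90,000x over ~20 years, while DRAM/interconnect bandwidth grew ~30x.

compute_growth = 90_000   # x over the period
bandwidth_growth = 30     # x over the period
years = 20

# How far compute has outrun bandwidth overall.
print(f"Compute outgrew bandwidth by {compute_growth / bandwidth_growth:,.0f}x")  # 3,000x

# Equivalent compound annual growth rates.
cagr = lambda total, n: total ** (1 / n) - 1
print(f"Compute   ~{cagr(compute_growth, years):.0%}/year")   # ~77%
print(f"Bandwidth ~{cagr(bandwidth_growth, years):.0%}/year") # ~19%
```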

With this in mind, we turned our attention to Eliyan’s architectural innovations in the physical layer (PHY), which the company demonstrated in new and practical ways at MemCon 2024. Ramin Farjadrad, co-founder and CEO of Eliyan, took some time to show us how NuLink PHYs and their use cases have evolved, and how they can be used to build compute engines that are better, cheaper, and more powerful than those based on today’s silicon interposer packaging.

A PHY is the physical network transport device that connects switch chips, network interfaces, or any other kind of interface inside a compute engine to a physical medium (copper wire, fiber optics, radio), allowing them to talk to one another or to a network.

The silicon interposer is a special circuit bridge used to connect HBM stacked DRAM memory to computing engines such as GPUs and custom ASICs commonly used in HPC and AI fields. Sometimes, HBM is also used on ordinary CPUs that require high-bandwidth memory.

Eliyan was founded in 2021 and is headquartered in San Jose. The company has grown to 60 people and has just completed its second round of financing of US$60 million. Memory manufacturer Samsung and Tiger Global Capital led the Series B financing. In November 2022, Eliyan completed a $40 million Series A round of financing, led by Tracker Capital Management, with Celesta Capital, Intel, Marvell and memory manufacturer Micron Technology also participating in the investment.

Farjadrad worked as a design engineer at Sun Microsystems and LSI Logic during the dot-com boom, then became co-founder and lead engineer for switch ASICs at Velio Communications (now part of LSI Logic) and later co-founder and CTO of Aquantia, which makes Ethernet PHY chips for the automotive market. In September 2019, Marvell acquired Aquantia and put Farjadrad in charge of PHY chips for networking and automotive. Marvell has since become one of the largest PHY chip suppliers, competing with Broadcom, Alphawave Semi, Nvidia, Intel, Synopsys, Cadence, and now Eliyan to design these key system components.

Eliyan’s other co-founders are Syrus Ziai, head of engineering and operations, who previously served as vice president of engineering at Ikanos, Qualcomm, PsiQuantum, and Nuvia; and Patrick Soheili, head of business and corporate development, who previously led product management and AI strategy at eSilicon. eSilicon is known for creating the ASIC inside Apple’s iPod music player and for developing 2.5D ASIC packaging and HBM memory controllers. At the end of 2019 the company was acquired by Inphi for US$213 million, which expanded Inphi’s PHY capabilities; in April 2021, Marvell closed its US$10 billion acquisition of Inphi, a deal announced in October 2020.

PHYs, I/O SerDes, and retimers all represent business opportunities. A SerDes is a special type of PHY used in switch ASICs that converts the parallel data a device outputs into serial data for transmission over copper wire, fiber optics, or wireless links. In a sense, a retimer is also a special kind of PHY, and as bandwidths rise and usable copper reach shrinks, retimers will be used more and more often.
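As a purely conceptual illustration of the parallel-to-serial conversion described above (a toy sketch, not a model of any real SerDes or PHY), the following Python snippet serializes parallel bytes into a bit stream and recovers them on the other end:

```python
# Toy serializer/deserializer: parallel bytes -> serial bits -> parallel bytes.
# Real SerDes do this in hardware at tens of Gb/s per lane, with line coding,
# clock recovery, and equalization that are all omitted here.

def serialize(data: bytes) -> list[int]:
    """Flatten parallel bytes into a serial bit stream, MSB first."""
    return [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]

def deserialize(bits: list[int]) -> bytes:
    """Reassemble the serial bit stream into parallel bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)

payload = b"NuLink"
assert deserialize(serialize(payload)) == payload
```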

Now that we understand PHY, let’s talk about 2.5D packaging technology.

As the growth of transistor density slows under Moore’s Law, the cost per transistor no longer falls with each new process generation; it rises. We are all aware of the reticle limit imposed by the photomasks used in modern chip lithography: with today’s extreme ultraviolet (EUV) and immersion lithography tools, the largest die that can be exposed on a silicon wafer is 26mm x 33mm.

However, perhaps fewer people are aware of the size limits of the silicon interposer, which constrain how chiplets and their attached HBM memory can be linked to one another on top of the organic substrate (the small “motherboard” underneath each compute engine socket).

The size of the silicon interposer depends on the technology used to manufacture it. Although the interposer is fabricated with the same photolithography processes as the chip, it is not bound by the chip’s 858 mm2 reticle limit: some technologies today can expand the interposer to 2,500 mm2, others to closer to 1,900 mm2, and according to Farjadrad, there are plans to push that to 3,300 mm2. Organic substrates are not subject to such area restrictions, which matters a great deal for 2.5D packaging of chiplets.
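As a quick sanity check on those figures, the 26 mm x 33 mm reticle works out to 858 mm2, and the interposer sizes above land at roughly two to four reticle units (a sketch using only the numbers quoted in this section):

```python
# Reticle limit and interposer sizes quoted above, expressed in reticle units.
reticle_mm2 = 26 * 33   # 858 mm^2 maximum exposure field

interposer_options_mm2 = {
    "today, larger option": 2500,
    "today, smaller option": 1900,
    "planned, per Farjadrad": 3300,
}

for name, area in interposer_options_mm2.items():
    print(f"{name}: {area} mm^2 = {area / reticle_mm2:.1f} reticle units")
# Roughly 2.9, 2.2, and 3.8 reticle units respectively.
```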

Farjadrad walked through the parameters, speeds, and limitations of the 2.5D packaging methods that compete with Eliyan’s NuLink PHY.

Here’s how TSMC does 2.5D packaging with its Chip-on-Wafer-on-Substrate (CoWoS) process, which is used to build products like Nvidia’s and AMD’s GPUs with their HBM stacks:

On a technical level, the image above shows TSMC’s CoWoS-S interposer technology, which is used to connect GPUs, CPUs, and other accelerators to HBM memory. The silicon interposer in the earlier CoWoS-R variant was limited to about two reticle units, which happens to match the size of Nvidia’s just-launched “Blackwell” B100 and B200 GPUs; those GPUs, however, use the more advanced and less talked-about CoWoS-L technology, whose fabrication is more complex and which resembles the embedded bridges used in other approaches. CoWoS-L has a size limit of three reticle units.

There is also a bridging technology called wafer-level fan-out with embedded bridge, promoted by chip packaging company Amkor Technology; ASE Holdings offers a variant called FOCoS-B. The following are the parameters and speeds of this packaging method:

Using this 2.5D packaging technology, it is possible to create a package of roughly three reticle units. High-density wiring means higher inter-chip bandwidth at low power, but the reach of the connections is limited and so is the routing flexibility of the wiring. In practice, this technology has not yet been adopted at scale.

Intel embeds silicon bridges directly into the organic substrate that holds the chip—without using an interposer—similar to how Eliyan uses NuLink:

However, EMIB (Embedded Multi-die Interconnect Bridge) technology has a series of problems, including long production cycles, low yields, and limited reach and routing capability, and its supply chain is controlled by a single company, Intel, whose reputation in leading-edge semiconductor manufacturing has taken a beating in recent years. To be fair, while Intel is gradually getting back on track, it is not yet where it expected to be.

This brings us to Eliyan’s NuLink, which can be thought of as an improved 2D multi-chip module (MCM) approach:

A year ago, Farjadrad said the NuLink PHY’s data transfer rate was about 10 times that of the PHYs used in traditional MCM packages. In addition, the trace length between NuLink PHYs can reach 2cm to 3cm, 20 to 30 times longer than the roughly 1mm traces supported by CoWoS and other 2.5D packaging technologies. These longer traces, and NuLink’s ability to signal in both directions over them, matter a great deal for compute engine design. Faster PHYs have since appeared for competing devices, narrowing NuLink’s advantage to roughly 4x.

Farjadrad told The Next Platform that, in current architectures, data packets moving between memory and the ASIC travel in only one direction at a time: the link is either reading from memory or writing to it, never both at once. **But if a port can transmit and receive at the same time, you get twice the bandwidth out of the same I/O resources, and this is what NuLink achieves.** That way, half of the ASIC’s I/O resources are no longer wasted.
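A minimal sketch of that argument follows; the lane count and per-lane rate are made-up illustration values, not Eliyan specifications. With unidirectional signaling, half of the lanes must be dedicated to each direction, while simultaneous bidirectional signaling lets every lane carry traffic both ways.

```python
# Illustration only: lane count and per-lane rate are assumptions, not Eliyan specs.
lanes = 64
gbps_per_lane = 32   # per direction

# Unidirectional lanes: split the pool between the read and write directions.
uni_read = (lanes // 2) * gbps_per_lane
uni_write = (lanes // 2) * gbps_per_lane

# Simultaneous bidirectional lanes: every lane carries traffic both ways at once.
bidi_read = lanes * gbps_per_lane
bidi_write = lanes * gbps_per_lane

print(f"Unidirectional: {uni_read} Gb/s read + {uni_write} Gb/s write")
print(f"Bidirectional : {bidi_read} Gb/s read + {bidi_write} Gb/s write")
print(f"Gain from the same I/O bumps: {bidi_read / uni_read:.0f}x per direction")
```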

Farjadrad added that to maintain memory consistency, a special custom protocol is needed to ensure reads and writes do not collide. When designing a PHY, the protocol has to be tailored to the target application, and this is one of NuLink’s important advantages: having the best PHY technology is one thing, but pairing it with the right expertise for AI applications is another, and Eliyan knows how to do both.

In November 2022, when the technology that became NuLink was first introduced, it did not yet have that name, and Eliyan had not yet proposed using its PHY to create a Universal Memory Interface (UMI). NuLink started out as a way to implement the UCI-Express chip-to-chip interconnect protocol, and it also supports the original Bunch of Wires (BoW) chip-to-chip interconnect that Farjadrad and his team created years earlier and donated to the Open Compute Project as a proposed standard. In the table below, Eliyan compares NuLink to various memory and chip-to-chip interconnect protocols:

This table is very clear.

Intel’s “MDFIO” is short for “Multi-Die Fabric I/O” and is used to connect the four compute chiplets in the “Sapphire Rapids” Xeon SP processor. EMIB is used to connect those chiplets to HBM memory stacks, mainly in the HBM-equipped Max Series variants of Sapphire Rapids.

OpenHBI is based on the JEDEC HBM3 electrical interconnect standard and is also an OCP (Open Compute Project) standard. UCI-Express is essentially PCI-Express with a CXL coherency overlay, designed to serve as the chip-to-chip interconnect between chiplets.

Nvidia’s NVLink, which is used to “glue” the chiplets together in the Blackwell GPU complex, is not listed in the table above; likewise, Intel’s XeLink, which “glues” the GPU chiplets together in the “Ponte Vecchio” Max Series GPUs, is not included either. Unlike UCI-Express, the NuLink PHY is bidirectional, meaning it can use the same number of wires as UCI-Express, or more, while delivering twice the bandwidth of UCI-Express or better.

As shown, the high-cost packaging options use a bump pitch of 40 to 50 microns, with only about 2mm between chips. The bandwidth density of these PHYs can be very high, reaching terabits per second per millimeter of die edge (“beachfront”) on each chip. Power efficiency varies by method, and latency is below 4 nanoseconds in all cases.
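For intuition on how “terabits per second per millimeter of beachfront” arises, here is a rough sketch. The bump pitch comes from the text above, but the depth of the bump field and the per-lane data rate are assumptions chosen purely for illustration.

```python
# Rough beachfront-bandwidth model: signal bumps that fit along 1 mm of die edge,
# multiplied by an assumed per-lane rate. Row depth and lane rate are assumptions.
bump_pitch_um = 45    # midpoint of the 40-50 micron pitch quoted above
bump_rows = 8         # assumed depth of the PHY bump field
gbps_per_lane = 16    # assumed per-bump signaling rate

bumps_per_mm = (1000 / bump_pitch_um) * bump_rows
tbps_per_mm = bumps_per_mm * gbps_per_lane / 1000

print(f"{bumps_per_mm:.0f} signal bumps per mm of beachfront")
print(f"~{tbps_per_mm:.1f} Tb/s per mm of die edge under these assumptions")
```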

The PHYs on the right side of the diagram can be used with standard organic substrate packaging and 130 micron bumps, making them lower-cost options. They include Cadence’s Ultralink PHY, AMD’s Infinity Fabric PHY, Alphawave Semi’s OIF Extra Short Reach (XSR) PHY, and a version of NuLink that can achieve high signaling rates without a fine bump pitch.

Also shown on the right side of the diagram is die spacing: with a 2cm connection reach, we can do a lot that is not possible with the 2mm, or even 0.1mm, spacing between ASICs and HBM stacks or between adjacent chips. These longer links expand the possible geometries of complex compute and memory structures and eliminate thermal crosstalk between ASICs and HBM. Stacked memory is very sensitive to heat, and as the GPU gets hotter, the HBM has to be kept cool to work properly. If you can move the HBM farther from the ASIC, you can run the ASIC faster (Farjadrad estimates roughly 20% faster) and hotter, because the memory is far enough away that it is no longer directly affected by the ASIC’s extra heat.

Additionally, by removing the silicon interposer (or its equivalent) from GPU-like devices in favor of an organic substrate, and by using larger bumps and wider component spacing, the manufacturing cost of a dual-ASIC device with twelve HBM stacks could drop from roughly US$12,000, at a combined chip-plus-packaging yield of about 50%, to roughly US$6,800 at a yield of about 87%.
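One way to read those cost and yield figures is as build cost per packaging attempt, in which case the effective cost per good device is the cost divided by the yield. That reading is an assumption; the article does not spell it out.

```python
# Effective cost per *good* device = build cost per attempt / yield.
# Assumes the quoted dollar figures are per-attempt costs, not yield-adjusted ones.
interposer_cost, interposer_yield = 12_000, 0.50
organic_cost, organic_yield = 6_800, 0.87

per_good_interposer = interposer_cost / interposer_yield   # ~$24,000
per_good_organic = organic_cost / organic_yield            # ~$7,800

print(f"Interposer build: ${per_good_interposer:,.0f} per good device")
print(f"Organic build   : ${per_good_organic:,.0f} per good device")
print(f"Roughly {per_good_interposer / per_good_organic:.1f}x cheaper per good device")
```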

Let us compare UCI-Express, BoW, and UMI through two charts, and then we can do a little system architecture design.

As the figure shows, Eliyan has kept pushing the bidirectional capabilities of its PHY and now offers simultaneous bidirectional data transmission, which it calls UMI-SBD.

The following figure shows the bandwidth and ASIC chip edge area (beachfront) of these four options:

So, the NuLink PHY, now called UMI, is smaller and faster than UCI-Express, and can transmit and receive data at the same time. What can be done with it?

**First, larger computing engines can be built.** For example, a compute engine package with 24 or more HBM stacks and 10 to 12 chiplets. Thanks to the use of standard organic substrates, such a device takes only a quarter to a fifth as long to manufacture.

1989 was IBM’s peak; in the early 1990s it began to decline. People often said at the time that you could find better products than IBM’s for the same money.

Of course, Nvidia is not IBM or Intel, at least not yet. Regardless, making money too easily can have dire consequences for a company and its roadmap.

Here’s how Eliyan thinks HBM4 might develop in the future:

The JEDEC PHY for HBM4 memory is fairly large. Using UCI-Express, its area can be cut roughly in half; using the NuLink UMI PHY cuts it in half again, leaving more room for logic on whatever XPU you choose. Alternatively, if you are willing to ditch the interposer, build a larger device, and accept a 13mm2 UMI PHY, you can make a lower-cost device and still get 2TB/s of bandwidth per HBM4 stack.
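To keep the area bookkeeping straight, here is a small sketch. The halving ratios, the 13 mm2 figure, and the 2 TB/s per stack come from the text above; the absolute area assumed for the JEDEC HBM4 PHY is a placeholder, not a published number.

```python
# PHY area bookkeeping for one HBM4 stack interface.
# 'jedec_phy_mm2' is a placeholder; only the halving ratios, the 13 mm^2 figure,
# and the 2 TB/s per stack come from the article.
jedec_phy_mm2 = 20.0                # assumed area of the JEDEC HBM4 PHY
ucie_phy_mm2 = jedec_phy_mm2 / 2    # UCI-Express: roughly half the area
umi_phy_mm2 = ucie_phy_mm2 / 2      # NuLink UMI: half again

umi_on_organic_mm2 = 13.0           # larger UMI PHY, but no interposer required
tb_per_sec_per_stack = 2.0          # quoted for the organic-substrate option

print(f"JEDEC {jedec_phy_mm2} mm^2 -> UCIe {ucie_phy_mm2} mm^2 -> UMI {umi_phy_mm2} mm^2")
print(f"Organic option: {umi_on_organic_mm2} mm^2 at {tb_per_sec_per_stack} TB/s per HBM4 stack")
```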

Now comes the fun part.

When Eliyan presented its idea in November 2022, it compared a GPU that uses an interposer to connect to its HBM memory with a device that removes the interposer, doubles up the ASIC (much as Blackwell later did), and attaches 24 HBM stacks to those ASICs. As shown below:

The left side of the picture above is the architectural diagram of Nvidia’s A100 and H100 GPUs and their HBM memory. In the middle is a chart from Nvidia showing how performance improves as more HBM capacity and more HBM bandwidth are made available to AI applications. As we know, the H200, with 141GB of HBM3E memory and 4.8TB/sec of bandwidth, delivers 1.6x to 1.9x the performance of the H100, which uses the same GH100 GPU but has only 80GB of HBM3 memory and 3.35TB/sec of bandwidth.
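Those numbers can be restated as simple ratios (a quick sketch; the 1.6x to 1.9x application speedup is Nvidia’s claim as relayed above, not something derived here):

```python
# H100 vs H200: same GH100 GPU silicon, different HBM capacity and bandwidth.
h100 = {"hbm_gb": 80, "bw_tb_per_s": 3.35}    # HBM3
h200 = {"hbm_gb": 141, "bw_tb_per_s": 4.80}   # HBM3E

capacity_gain = h200["hbm_gb"] / h100["hbm_gb"]             # ~1.76x
bandwidth_gain = h200["bw_tb_per_s"] / h100["bw_tb_per_s"]  # ~1.43x

print(f"Capacity : {capacity_gain:.2f}x more HBM")
print(f"Bandwidth: {bandwidth_gain:.2f}x more HBM bandwidth")
# The claimed 1.6x-1.9x application speedup comes from memory alone,
# since the GPU compute silicon is unchanged.
```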

Now assume you have a device like the one shown above, with 576GB of HBM3E memory and 19TB/sec of bandwidth. **Remember: the main power draw is not the memory but the GPU, and the evidence we have seen so far suggests that the GPUs shipped by Nvidia, AMD, and Intel have long been limited by HBM capacity and bandwidth, in part because this kind of stacked memory is hard to make.** These companies sell GPUs, not memory, and they maximize revenue and profit by shipping as little HBM as possible while keeping compute capacity enormous. Each generation shows gains over the last, but GPU compute has always grown faster than memory capacity and bandwidth. The design Eliyan proposes could rebalance compute and memory and make these devices more affordable.

Perhaps that design was too radical for the GPU makers, so with the introduction of UMI, Eliyan took a step back and showed how interposers and organic substrates, combined with NuLink PHYs, could be used to create a larger, better-balanced Blackwell GPU complex.

The left side of the image below shows how a Blackwell-Blackwell superchip is created today, with a single NVLink port connecting two dual-die Blackwell GPUs at 1.8TB/sec.

With the NuLink UMI approach shown on the right side of the image, six ports are provided between the two Blackwell GPUs, for a transfer bandwidth of about 12 TB/sec, slightly higher than the 10 TB/sec that Nvidia’s NVLink ports provide between the two Blackwell chiplets inside the B100 and B200. It also means the package-to-package bandwidth in the Eliyan superchip design is more than six times that of the Nvidia B200 superchip design. If Nvidia keeps its CoWoS manufacturing process, Eliyan can place the same eight HBM3E memory stacks on the interposer that Nvidia does, then add another eight HBM3E stacks per Blackwell device, for a total of 32 HBM3E stacks with 768GB of capacity and 25 TB/sec of bandwidth.
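The arithmetic behind that comparison, restated as a sketch using only the figures quoted in this section (the per-port and per-stack values are derived from the totals, not separately sourced):

```python
# Eliyan UMI superchip vs. Nvidia's stock Blackwell links, per the figures above.
umi_ports, umi_tb_per_s_per_port = 6, 2.0   # 6 ports x 2 TB/s = 12 TB/s
nvlink_superchip_tb_per_s = 1.8             # NVLink between the two B200 packages
die_to_die_tb_per_s = 10.0                  # link between the two dies inside a B200

umi_total = umi_ports * umi_tb_per_s_per_port
print(f"UMI package-to-package: {umi_total} TB/s, "
      f"{umi_total / nvlink_superchip_tb_per_s:.1f}x the NVLink superchip link, "
      f"{umi_total / die_to_die_tb_per_s:.1f}x the in-package die-to-die link")

# Memory build-out: 8 stacks on each interposer plus 8 more per device, two devices.
stacks = (8 + 8) * 2        # 32 HBM3E stacks
capacity_gb = 768
bandwidth_tb_per_s = 25
print(f"{stacks} stacks -> {capacity_gb / stacks:.0f} GB and "
      f"{bandwidth_tb_per_s / stacks * 1000:.0f} GB/s per stack")
```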

Think about that for a moment.

But that’s not all. The UMI approach works with any XPU and any type of memory, and you can do something similar on a plain organic substrate with no interposer at all.

Any type of memory, any co-packaged optoelectronics, any PCI-Express or other controller, can be connected to any XPU using NuLink. At this point, the socket has literally turned into the motherboard.

For larger, complex systems, Eliyan can build NuLink switches.

