Another "Sun Strategy" of NVIDIA Corporation (NVDA.US)

13:39 19/10/2025
GMT Eight
In the AI era, computing power no longer resides in the chips, but in the connections.
Over the past twenty years, data-center performance gains have come mainly from the continuous evolution of computing chips: CPUs, GPUs, and FPGAs. With the advent of the generative-AI era, however, the entire computing system is being redefined by the network. In large-scale model training, communication latency and bandwidth bottlenecks between GPUs have become the key constraints on training efficiency. Once model parameters exceed a trillion, a single GPU can no longer handle the task, and training must be completed through the parallel coordination of thousands or tens of thousands of GPUs. In this process, the network has become ever more important.

A recent piece of major industry news is that Meta and Oracle, two tech giants, have both chosen NVIDIA Spectrum-X Ethernet switches and related technologies. The industry sees this move as an important step in Ethernet's evolution toward AI-specific interconnects. It also shows that NVIDIA Corporation (NVDA.US) is accelerating its penetration of the open Ethernet ecosystem, binding cloud giants and enterprise customers to its platform. Having already locked down the closed high-end network with InfiniBand, NVIDIA is now building a second line of defense inside the "open" Ethernet ecosystem.

Spectrum-X, Ethernet for AI

For the past few decades, Ethernet has been the most widely used network in data centers. In an AI-centric era, however, the core challenge is not the computing power of individual nodes but the collaborative efficiency of distributed architectures. Training a foundation model (such as GPT, BERT, or DALL-E) requires synchronizing massive gradient updates across nodes. The speed of the entire training process depends on the slowest node; this is the root of the "tail latency" problem. The design goal of an AI network is therefore not "average performance" but ensuring that no node lags behind, even in extreme cases.
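The tail-latency effect can be illustrated with a minimal sketch (the worker times below are invented for illustration, not measured data): in synchronous data-parallel training, a step completes only when the slowest worker finishes, so a single straggler drags down the whole cluster.

```python
import random

def sync_step_time(worker_times):
    """A synchronous training step finishes only when the slowest worker does."""
    return max(worker_times)

random.seed(0)
# 1024 hypothetical workers: each takes ~100 ms plus a random tail delay.
times = [100 + random.expovariate(1 / 5) for _ in range(1024)]

mean_time = sum(times) / len(times)
step_time = sync_step_time(times)

print(f"mean worker time: {mean_time:.1f} ms")
print(f"actual step time: {step_time:.1f} ms")  # the tail dominates
print(f"efficiency: {mean_time / step_time:.0%}")
```

Even though the average worker is fast, the step time is set by the slowest of the 1024, which is exactly why AI network design targets worst-case behavior rather than average performance.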
This places far greater demands on latency, packet-loss rate, traffic scheduling, congestion control, and even buffer architecture than traditional Ethernet can meet. NVIDIA Corporation therefore introduced Spectrum-X, the first Ethernet solution optimized for AI. So what specific improvements does Spectrum-X make? NVIDIA's latest white paper, "Networking for the Era of AI: The Network Defines the Data Center," gives a detailed account.

First, lossless Ethernet. In traditional Ethernet, packet loss and retransmission are treated as an acceptable cost. In AI training, any packet loss can lead to idle GPUs, failed synchronization, or wasted energy. Spectrum-X addresses this with RoCE (RDMA over Converged Ethernet) for communication that bypasses the CPU; PFC (Priority Flow Control) plus DDP (Direct Data Placement) to guarantee end-to-end lossless transmission; and, in conjunction with the Spectrum-X SuperNIC, hardware-level congestion detection and dynamic traffic scheduling. For the first time, this gives Ethernet transmission determinism close to InfiniBand's.

Second, adaptive routing and packet scheduling. The biggest difference between AI workloads and traditional cloud computing is that AI generates a small number of very large "elephant flows." These flows easily create hotspots in the network and cause severe congestion. Spectrum-X uses packet-level adaptive routing and packet spraying: it monitors link loads in real time to pick the best path for each packet, then reorders out-of-order packets at the SuperNIC. This mechanism breaks through the limits of Ethernet's static hash routing (ECMP), letting AI clusters scale almost linearly even under uneven traffic.

Third, congestion control. The biggest problem with traditional ECN congestion control is its high response latency.
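The contrast between static ECMP hashing and packet-level adaptive routing can be sketched with a toy model (invented flow IDs and queue depths, not NVIDIA's actual algorithm): ECMP pins an entire flow to one path chosen by a hash of its header, so two elephant flows can collide on the same link, whereas adaptive routing picks the least-loaded path for each packet.

```python
import zlib

def ecmp_path(flow_id: int, n_paths: int) -> int:
    """Static ECMP: a hash of the flow header pins every packet of a flow
    to one path, so two elephant flows can land on the same link."""
    return zlib.crc32(flow_id.to_bytes(4, "big")) % n_paths

def adaptive_path(loads: list[int]) -> int:
    """Packet-level adaptive routing: send the next packet down the
    currently least-loaded path."""
    return loads.index(min(loads))

# Two elephant flows of 10 packets each (1000 bytes per packet) over
# 4 equal-cost paths; loads are bytes queued per path.
loads = [0, 0, 0, 0]
for flow in (1, 2):
    for _ in range(10):
        loads[adaptive_path(loads)] += 1000

print(loads)  # adaptive routing spreads the 20 packets evenly: [5000, 5000, 5000, 5000]
```

With ECMP, both flows would keep whatever paths their hashes chose for the duration; with per-packet adaptive routing, the load ends up perfectly balanced, at the cost of the out-of-order delivery that the SuperNIC must then repair.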
By the time a switch detects congestion and issues an ECN mark, the buffer is often already full and the GPUs are idling. Spectrum-X instead uses hardware-based in-band telemetry to report network state in real time, letting the SuperNIC execute flow metering immediately in a sub-microsecond feedback loop. NVIDIA Corporation claims record efficiency for this congestion control: 95% effective data throughput, versus roughly 60% for typical large-scale Ethernet.

Fourth, performance isolation and security. AI clouds often need to run training jobs from different users or departments on the same infrastructure. Spectrum-X uses a universal shared buffer architecture to guarantee fair access across ports and to keep "noisy neighbor" jobs from degrading others. Paired with the BlueField-3 DPU, it provides MACsec/IPsec encryption for data in transit, AES-XTS 256/512 encryption for data at rest, and a hardware root of trust with secure boot. This gives AI clouds security isolation comparable to private clusters.

In short, Spectrum-X gives Ethernet an "AI gene," which is why it has won over Meta and Oracle, even though the two companies have chosen different deployment strategies optimized around their own business needs. Meta's route centers on an open, programmable network platform: combining the Spectrum series with FBOSS and deploying on open-source switch designs such as Minipack3N. This reflects Meta's continued investment in software-defined networking (SDN) and a programmable control plane. For Meta, the goal is to support its generative-AI services for billions of users on open standards that are both efficient and controllable.
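The throughput gap translates directly into transfer time. A back-of-the-envelope sketch (the payload and link speed below are illustrative assumptions, not figures from the white paper):

```python
def transfer_time_s(payload_gb: float, link_gbps: float, efficiency: float) -> float:
    """Time to move a payload over a link running at a given effective utilization."""
    return payload_gb * 8 / (link_gbps * efficiency)

# Hypothetical example: 100 GB of gradient traffic over a 400 Gb/s link.
t_typical = transfer_time_s(100, 400, 0.60)   # ~60% effective throughput
t_spectrum = transfer_time_s(100, 400, 0.95)  # claimed 95% effective throughput

print(f"at 60% efficiency: {t_typical:.2f} s")
print(f"at 95% efficiency: {t_spectrum:.2f} s")
print(f"speedup: {t_typical / t_spectrum:.2f}x")  # 0.95 / 0.60, about 1.58x
```

For communication-bound training phases, that ratio is roughly the ceiling on how much faster each synchronization round can complete on otherwise identical hardware.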
Oracle, for its part, pairs the Vera Rubin accelerator architecture with Spectrum-X as the interconnect backbone, aiming to aggregate dispersed data centers and thousands of nodes into a unified, programmable supercomputing platform that delivers end-to-end training and inference services to enterprise customers. Oracle's management calls such deployments "giga-scale AI factories" and sees them as a differentiating cornerstone in cloud competition. Whatever route they take, the two companies share one clear premise: as computing power keeps growing exponentially, the networking layer determines whether that theoretical computing power can be converted into actual usable throughput and business value.

The Impact of Spectrum-X?

From the perspective of the industry's competitive landscape, Spectrum-X amounts to a "dimensional strike" on the structure of the Ethernet networking industry. First, it is important to understand that Spectrum-X is not a standalone switch product but a system strategy. It binds three components into a hardware-software integrated ecosystem: the Spectrum-X switch ASIC (enabling lossless Ethernet and adaptive routing); the Spectrum-X SuperNIC (responsible for packet reordering, congestion control, and telemetry feedback); and the BlueField-3 DPU (providing security isolation and RoCE optimization). In other words, NVIDIA fuses the three-layer network ecosystem that traditionally belonged to independent vendors (switches, network cards, accelerators) into a single system, making the network an extension of the GPU and closing a vertical compute-network-storage loop. This strategy has shaken the entire Ethernet ecosystem.
Network companies that have built their businesses on Ethernet standards, whether they sell chips, switches, or optimization software, are forced into a new game: integrate into NVIDIA's AI networking system or be marginalized.

The most directly affected are data-center Ethernet chip makers such as Broadcom (Trident/Tomahawk series) and Marvell (Teralynx, Prestera). Spectrum-X's RDMA-over-Ethernet capability fundamentally challenges the value of every high-end Ethernet chip. These vendors have long dominated the dual "switch chip + NIC" ecosystem, selling on openness and cost-effectiveness. But when NVIDIA embeds AI optimization features such as DDP, telemetry, and lossless routing into its GPU/DPU coordination system, Spectrum-X effectively opens Ethernet's "black box" to the compute layer, eroding these vendors' position.

The second group at risk is traditional network-equipment suppliers such as Cisco, Arista Networks, and Juniper Networks, long the standard-bearers of Ethernet in hyperscale cloud data centers. Their high-end products offer 400/800 GbE support, rich programmability, and software-defined networking (SDN) management. Under the Spectrum-X architecture, however, NVIDIA forms a closed but performance-maximized chain of GPU + SuperNIC + switch + DPU. Customers no longer need to rely on traditional optimization from Cisco or Arista, and in environments such as AI factories that demand single tenancy and extreme performance, NVIDIA can gradually replace their role. A large share of Arista's market value already rests on AI-networking expectations; if Spectrum-X is fully adopted by big customers like Meta, Oracle, and AWS, Arista's growth model could be weakened.

The third group is startup chip companies focused on interconnects.
Companies such as Astera Labs, Cornelis Networks, Liqid, Rockport Networks, Lightmatter, and Celestial AI are developing custom interconnect solutions with low latency and highly scalable topologies. Consider first why these companies exist. In NVIDIA's world, interconnects are vertically integrated: GPU, NVLink, Spectrum-X/InfiniBand, BlueField. Other vendors (AMD, Intel, Google TPU) cannot control the full stack, so they urgently need these "neutral interconnect suppliers" for alternative solutions. For example: Astera Labs' Leo/Cosmos controller series is used in AMD MI300 and Intel Gaudi platforms to manage interconnects between GPUs and memory pools; Cornelis Networks is working with European supercomputing centers on the Omni-Path 200G network, aiming to replace InfiniBand; Liqid's composable-fabric solution has been integrated by Dell Technologies and HPE for "AI Infrastructure as a Service" (AI IaaS); and Lightmatter and Celestial AI are aiming further out, at the point when optical interconnects replace electrical ones and the architecture of AI computing clusters is rewritten. But once a large cloud vendor chooses the Spectrum-X architecture, its entire cluster depends on NVIDIA for drivers, telemetry, and QoS control, and it becomes hard for these open-fabric independents to stay compatible. In the short term, Spectrum-X's pace of integration and deep customer lock-in genuinely shrinks the market for these independent innovators.

InfiniBand Reigns Supreme in High-Performance Computing

If Spectrum-X is Ethernet remade for AI, then NVIDIA's Quantum InfiniBand is a super-network that is AI-native from the start. Ethernet has always pursued openness and universality, tolerating a certain amount of packet loss and latency in exchange for cost and compatibility.
In contrast, InfiniBand's design philosophy is the opposite: it pursues ultimate determinism and zero-loss transmission. It emerged as a data-interconnect standard for HPC (high-performance computing) as early as 1999 and has since become the de facto standard in supercomputing centers worldwide. Three characteristics have kept InfiniBand at the performance peak for over twenty years: lossless networking, ensuring no data is lost during training; ultra-low latency, with communication latencies in the microseconds, far below traditional Ethernet; and native RDMA plus in-network computing, executing aggregation inside the network layer and offloading the hosts. These capabilities make InfiniBand the communication backbone of the AI-training era; even in architectures with thousands of GPU nodes, it maintains linear scalability and stable synchronization performance.

After acquiring Mellanox for nearly $7 billion in 2019, NVIDIA gained control of the full InfiniBand stack. The latest Quantum-2 is NVIDIA's seventh-generation InfiniBand product and its most representative high-performance networking platform. It provides up to 400 Gb/s of bandwidth per port, twice the previous generation, and triples the port density of its switch chips, allowing over a million nodes to be connected within a three-hop Dragonfly+ topology. Most importantly, Quantum-2 introduces the third generation of NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), a mechanism that embeds computing capability inside the network, turning the network itself into a coprocessor.
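SHARP's core idea, computing reductions inside the network, can be sketched with a toy allreduce (illustrative only; the real protocol runs in switch ASICs, and the gradient values below are invented): instead of every GPU exchanging gradients with its peers, each GPU sends its gradient once to the switch, which sums them and multicasts the result back.

```python
def switch_allreduce(gradients: list[list[float]]) -> list[float]:
    """Toy SHARP-style reduction: the 'switch' sums one gradient vector
    per GPU and returns the aggregate that would be multicast back."""
    return [sum(vals) for vals in zip(*gradients)]

# 4 hypothetical GPUs, each holding a 3-element gradient vector.
grads = [[1.0, 2.0, 3.0],
         [1.0, 2.0, 3.0],
         [2.0, 0.0, 1.0],
         [0.0, 0.0, 1.0]]

print(switch_allreduce(grads))  # [4.0, 4.0, 8.0]
```

Because the summation happens in the fabric, each GPU performs a single send and a single receive per reduction, rather than the multiple exchange steps of host-based allreduce algorithms; that is the sense in which the network acts as a coprocessor.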
Under this architecture, AI model training is accelerated 32-fold compared with the previous generation, and multiple tenants and parallel applications can share the same infrastructure without sacrificing performance, genuinely realizing network-level virtualization of pooled resources. Behind InfiniBand's brilliance, however, lie structural challenges. It is led by NVIDIA and remains a strongly closed ecosystem; this vertical integration delivers performance advantages but also raises concerns among cloud service providers and OEMs: high costs, a limited ecosystem, limited compatibility, and limited bargaining power. This is why NVIDIA chose to introduce Spectrum-X, proactively folding its advanced algorithms, telemetry, and congestion-control mechanisms into the Ethernet standard system in order to keep its say over the networking layer inside the Ethernet ecosystem.

(Figure: leading members of the Ultra Ethernet Consortium)

In conclusion

From InfiniBand to Spectrum-X, NVIDIA is completing a seemingly open but actually deeper "monopoly restructuring." It is building a dual-track system spanning closed and open: one track aimed at HPC and supercomputing (InfiniBand), the other at cloud and enterprise AI (Spectrum-X). Let us end with a line from NVIDIA's white paper: "The network defines the data center." In the AI era, computing power no longer resides in the chips, but in the connections.

This article is from the WeChat public account "Semiconductor Industry," written by Du Qinqu. GMTEight editor: Chen Qiuda.