From $100M Seed to Unicorn in Months

Upscale AI Closes Oversubscribed $200M Series A to Build the First Pure-Play AI Networking Company

Santa Clara, Calif. – Jan. 21, 2026 – Upscale AI, Inc., a category-defining pure-play AI networking infrastructure company, today announced a $200 million Series A financing led by Tiger Global, Premji Invest, and Xora Innovation with participation from Maverick Silicon, StepStone Group, Mayfield, Prosperity7 Ventures, Intel Capital, and Qualcomm Ventures. This new investment brings the total funding received to over $300 million.

The rapid investor backing reflects a growing industry consensus: networking is a critical bottleneck to scaling AI, and traditional network architectures designed to connect general purpose compute and storage are fundamentally unsuited for the AI era. The distinction is essential: traditional networking connects endpoints, whereas AI networking unifies the cluster. As specialized AI compute continues to scale, it is increasingly constrained by retrofitted or proprietary networking architectures. Legacy data center network solutions were designed for a pre-AI world, not for the massive, tightly synchronized scale-up required at rack scale.

Upscale AI was built from day one to solve this problem.

Upscale AI unifies GPUs, AI accelerators, memory, storage, and networking into a single, synchronized AI engine. As a core element of Upscale’s AI strategy, the purpose-built SkyHammer™ scale-up solution enables the unified rack by collapsing the distance between accelerators, memory, and storage, transforming the entire stack into one cohesive, synchronized system. Upscale’s AI platform is built on and actively advances open standards and open source technologies, including ESUN, Ultra Accelerator Link (UAL), Ultra Ethernet (UEC), SONiC, and the Switch Abstraction Interface (SAI). The company is an active participant in the Ultra Accelerator Link Consortium, Ultra Ethernet Consortium, Open Compute Project, and SONiC Foundation.

With an additional $200M in new financing, Upscale AI will deliver the first full-stack, turnkey platform spanning silicon, systems, and software, built to interconnect the heterogeneous future of AGI. The company will use the funding to rapidly expand its engineering, sales, and operations teams as it moves into commercial deployment. Upscale AI is seeing strong early traction among hyperscalers and AI infrastructure operators pursuing scalable, open networking alternatives.

Upscale AI’s networking solutions are slated to ship this year. 

“This investment accelerates our mission to fundamentally re-architect networking for the AI era,” said Barun Kar, CEO of Upscale AI. “With a world-class team and strong customer pull, we have a once-in-a-generation opportunity to build the open AI networking platform the industry has been waiting for.”

“Upscale AI has built extraordinary momentum in an exceptionally short time,” said Rajiv Khemani, Executive Chairman of Upscale AI. “The market is demanding open, scalable AI networking solutions, and Upscale AI is uniquely positioned to help customers break through today’s networking constraints.”

Supporting Quotes

  • “AI is fundamentally reshaping infrastructure, and networking is at the center of that transformation,” said Rohit Iragavarapu, Partner at Tiger Global. “Upscale AI combines deep technical execution with strong industry relationships to build the next generation of AI networking.”
  • “As AI systems scale, interconnect efficiency has become a defining driver of performance and economics—not just raw compute. What stands out about Upscale AI is not only their purpose-built approach to scale-up networking silicon, but the depth of experience of a team that has successfully built and deployed this class of infrastructure before,” said Sandesh Patnam, Managing Partner at Premji Invest. “Their ability to deliver proprietary-grade performance while embracing open standards reflects both strong technical conviction and real-world operating insight. We are excited to partner with Rajiv, Barun, and team again as they build critical infrastructure for the next generation of AI systems.”
  • “AI workloads are outpacing the capabilities of traditional networks,” said Phil Inagaki, Managing Partner and Chief Investment Officer, Xora Innovation. “Upscale AI is rebuilding the entire stack from the ground up and leading the effort to democratize AI networking infrastructure at scale.”
  • “Since our seed investment, Upscale AI has executed with impressive speed and discipline,” said Andrew Homan, Managing Partner at Maverick Silicon. “Our continued support reflects strong conviction in both the market opportunity and the team’s ability to deliver.”
  • “From the outset, Upscale AI stood out as a startup well positioned to define its own path in the AI networking space,” said John Avirett, Partner at StepStone Group. “Upscale AI has already drawn significant customer interest, and we expect the team to translate that momentum into a strong roster of deployments with leading hyperscalers and AI infrastructure operators.”
  • “Every computing era has its chokepoint. In the AI era, it’s networking,” said Navin Chaddha, Managing Partner at Mayfield. “Barun, Rajiv, and their world-class team are reimagining the entire stack from first principles and building an open platform for networking that can support the next generation of AI infrastructure.”
  • “Upscale AI is perfectly aligned with Prosperity7’s vision of backing transformative companies that are pushing the edge of what’s possible,” said Abhishek Shukla, U.S. Managing Director at Prosperity7 Ventures. “We take a long-term approach to investing, and we see significant potential for Upscale AI to capture strong market demand in the AI networking space in the coming years.”
  • “Scalability is the top priority for large enterprises evaluating networking solutions, and Upscale AI has architected its entire portfolio for scalability in the AI era,” said Srini Ananth, Managing Director at Intel Capital. “By delivering an open, turnkey solution, Upscale AI is removing complexity from networking infrastructure and unlocking scaled AI adoption for enterprises.”
  • “As AI infrastructure scales, advanced networking is essential to meet the demand of AI workloads,” said Quinn Li, Senior Vice President, Qualcomm Technologies, Inc., and global head of Qualcomm Ventures. “As an early investor, we’re excited to support Upscale AI’s team as they push the boundaries of AI networking infrastructure.” 
  • “Collaboration is key for accelerating the future of AI infrastructure,” said Robert Hormuth, Corporate Vice President, Architecture and Strategy, Data Center Solutions Group, AMD. “By adopting UALink early, Upscale AI is helping to foster an ecosystem centered on openness, choice, and interoperability.”
  • “AI networking is on track to become a $100B annual market by the end of the decade,” said Alan Weckel, Co-Founder and Technology Analyst at 650 Group. “Upscale AI is entering at exactly the right moment with open, high-performance solutions built to scale.”
  • “The networking market has reached a clear inflection point with AI,” said Sameh Boujelbene, Vice President at Dell’Oro Group. “Upscale AI’s SkyHammer architecture makes them one of the most important companies to watch.”

About Upscale AI 

Upscale AI is a high-performance AI networking company accelerating AI democratization through open-standard, full-stack, turnkey solutions. Its portfolio of silicon, systems, and software is purpose-built for ultra-low-latency networking, enabling breakthrough performance and scalability across AI training, inference, generative AI, edge computing, and cloud-scale deployments.

For more information, visit https://upscaleai.com.

Media Contact: upscaleai@racepointglobal.com 


Upscale AI Unveils SkyHammer™ Architecture           

A Ground-Up Design for Scale-Up AI Networking at Massive Scale

Upscale AI today offers a first glimpse of SkyHammer™, a clean-slate architecture purpose-built to overcome the fundamental limits of scale-up AI networking. With SkyHammer, Upscale AI reinvents what “scale” truly means in scale-up networking. While traditional front-end switches, PCIe fabrics, and proprietary topologies strain under the demands of explosive AI growth, SkyHammer has been designed from day one to power the world’s largest AI clusters with deterministic latency, extreme bandwidth, and predictable performance at scale.


Why Must Scale-Up Start from the Ground-Up?

Scale-up interconnects fuse CPUs, GPUs, and memory into a unified compute domain where every device shares data through native load/store access at nanosecond latency.

The result: an entire AI cluster operates as one coherent unit.

Conventional data center networks, built for servers and storage, can’t sustain the collective, synchronized demands of trillion-parameter AI models. Scale-up fabrics deliver guaranteed bandwidth, deterministic latency, and true rack-scale performance, purpose-built for the next era of AI.

In summary, scale-up AI interconnects redefine connectivity itself, setting the boundaries of how far compute can scale.

Efficiency and scale can’t come from retrofits. Forcing a single retrofitted solution onto multiple problems only leads to unforeseen tradeoffs.

Open by design. Diverse by architecture. That’s what it takes to build AI at scale.

Traditional approaches that have tried to bridge the gap using front-end switches, PCIe fabrics, or proprietary topologies deliver sub-optimal results. An architecture optimized for scale-up from inception offers several compelling advantages:

  • AI-Native by Design: Purpose-built for the GPU era, engineered around collective AI workloads, not legacy server or storage constraints.
  • Open and Predictable: Built on open standards with transparent roadmaps and ecosystem collaboration, eliminating lock-in and enabling choice.
  • Truly Next-Generation: A clean-sheet architecture that moves beyond incremental upgrades, unlocking new levels of performance, efficiency, and scalability.
  • Operationally Resilient: Simplified by design for the AI era, delivering the stability, trust, and reliability that large-scale AI systems demand.

In short, AI demands more than “good enough” networking. It needs a ground-up fabric designed for the AI era.

How SkyHammer™ Redefines Ground-Up Design

SkyHammer is not about stripping things away. It is about building only what truly matters. Every design choice was intentional, focused on delivering fast, predictable, and power-efficient performance for AI workloads. Instead of inheriting features that add complexity and latency, SkyHammer includes only what is essential for large-scale AI clusters to perform as one cohesive system. It has been architected from the ASIC up, designed holistically across the chip, system, and rack to ensure every layer works in harmony. Capabilities such as deterministic flow control, real-time telemetry, adaptive load handling, and built-in resiliency are part of the core, not bolted on later. The result is a clean, open, and purpose-built architecture that performs with precision at any scale.

This architectural discipline is what defines SkyHammer. It represents a shift from incremental thinking to intentional design, where performance, predictability, and openness are built in from the very start.

Enter SkyHammer™:

SkyHammer™ is the result of relentless engineering innovation and execution since Q3 2024, in collaboration with the world’s leading hyperscalers and GPU vendors. Every aspect of SkyHammer™ was reinvented, from how data flows through silicon, to how fabrics adapt under load, to how superclusters stay predictable under synchronized stress. At its core, it’s not about building silicon for scale-up; it’s about engineering it to scale effortlessly across the rack, systems, and beyond. The result: a platform that doesn’t just patch bottlenecks, it erases them.

Built for the Standards that Define the Future

The scale-up landscape continues to evolve rapidly, and SkyHammer was designed with adaptability at its core. The foundation of AI democratization lies in an open architecture, from silicon through software, that fosters interoperability and choice. From the outset, we recognized that interconnect standards would evolve in form and function, and that foresight is what defines the brilliance of SkyHammer.

SkyHammer is engineered to support multiple open standards and interconnect protocols, ensuring technology remains an enabler, not a constraint. It supports emerging standards such as ESUN, UEC, and UALink, as well as future innovations yet to take shape. With its flexible architecture, SkyHammer adapts seamlessly to new definitions without redesign or compromise, ensuring interoperability across open and diverse environments while maintaining performance, empowering innovation without lock-in.

“With AI compute demands exploding, the industry urgently needs open, scalable, and cost-efficient solutions; SkyHammer is a step squarely in that direction,” said Robert Hormuth, Corporate Vice President, Architecture and Strategy, Data Center Solutions Group, AMD.

“AI infrastructure is entering a phase where traditional data center fabrics can no longer keep pace with the scale and synchronization AI demands,” said Alan Weckel, co-founder of the 650 Group​. “Upscale AI’s SkyHammer architecture represents a decisive step toward purpose-built, open, and predictable interconnects that are designed for AI from the ground up. This approach aligns closely with what we see as the next major shift in data center networking, one where openness, determinism, and scalability define the winners.”

“The AI era demands infrastructure that is open, efficient, truly scalable, and economically sustainable,” said Sameh Boujelbene, Vice President at Dell’Oro Group. “Upscale AI’s SkyHammer™ marks a pivotal move toward that vision, speeding up innovation and democratizing AI through the stack, bridging a key gap in today’s high-performance AI landscape.”

Key Design Philosophies

  • Open Standards First: Interoperable with GPUs, XPUs, and accelerators across hyperscalers and vendors.
  • Adaptability at its Core: Evolves seamlessly with new workloads, models, and deployment styles.
  • From Rack to Supercluster: Topology flexibility without hidden tradeoffs.
  • Telemetry Built In: Deep observability and operational simplicity from day one, not bolted on later.
  • Predictability at Scale: Latency and bandwidth engineered to remain consistent even at extreme scale.

Setting the New Benchmark

SkyHammer is a breakthrough ground-up AI-native architecture built to make compute clusters behave like a single coherent machine.

  • Scale Up, Ground Up: Designed natively for the largest AI systems.
  • Removing Bottlenecks: Eliminates the limits that retrofits cannot.
  • Future Ready: Designed for the next decade of AI superclusters.

Products based on the SkyHammer™ architecture are planned for release in 2026. For more information on how to turbocharge your AI clusters, contact Upscale AI at info@upscaleai.com or visit upscaleai.com.

 

Upscale AI to Exhibit and Speak at OCP Global Summit

Newly launched AI networking company to showcase its scale-up networking portfolio at OCP

 

WHO:

Srihari Vegesna, VP of Architecture & Technology at Upscale AI

 

WHAT:

Upscale AI’s Srihari Vegesna will present the session “Performance Evaluation of Interconnect Technologies for AI Scale-Up Computing: UAL vs UALoE/SUE vs RoCE” at the upcoming OCP Global Summit. He will be joined by Srinivas Gangam, Fellow at Upscale AI. Srihari and Srinivas will compare Ultra Accelerator Link (UAL), Ethernet-based UAL (UALoE), and Scale-Up Ethernet (SUE) with Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) for AI-optimized data transfer. The session will evaluate transaction-level load balancing, DMA thread efficiency, and bandwidth utilization for memory-semantic xPU workloads.


Session Details:
Date: Wednesday, Oct. 15
Time: 10:25 – 10:40 a.m. PDT
Location: San Jose Convention Center (SJCC) Concourse Level 210ABEF
Conference Schedule: https://2025ocpglobal.fnvirtual.app/a/schedule/ 

 

Upscale AI will also be exhibiting at OCP booth C27, where attendees can learn more about Upscale AI’s scale-up networking solutions and talk directly with leaders from the company.

In September 2025, Upscale AI officially launched with over $100 million in funding. The company is focused on developing a next-generation suite of AI networking solutions that delivers high-performance connectivity for specialized compute, accelerating AI democratization with open-standard and full-stack turnkey solutions.

 

WHERE:

SJCC booth No. C27

150 W San Carlos Street

San Jose, CA 95113 

 

WHEN:

Tuesday, Oct. 14 – Thursday, Oct. 16, 2025

 

For sales inquiries, please reach out to: info@upscaleai.com.

For press inquiries, please contact: upscaleai@racepointglobal.com

 

About Upscale AI 

Upscale AI is a high-performance AI networking company accelerating AI democratization with open-standard and full-stack turnkey solutions. Upscale AI’s suite of silicon, systems, and software is purpose-built for ultra-low latency networking, enabling breakthrough performance and scalability for AI training, inference, generative AI, edge computing, and cloud-scale deployments. To learn more about Upscale AI, please visit: https://upscaleai.com/.

 

Upscale AI Launches with Over $100 Million Seed Round to Democratize AI Network Infrastructure and Advance Open Standards

Led by industry veterans from Palo Alto Networks, Marvell, Cisco, and Google, Upscale AI is introducing sustainable, ultra-low latency network infrastructure solutions purpose-built for AI, interconnecting compute, data, and users

Palo Alto, Calif. – Sept. 17, 2025 – Upscale AI, Inc., a new high-performance AI networking company, today announced its launch with over $100 million in funding. The seed round was co-led by Mayfield and Maverick Silicon and included participation from StepStone Group, Celesta Capital, Xora, Qualcomm Ventures, Cota Capital, MVP Ventures, and Stanford University. Upscale AI was incubated by Auradine, a rapidly growing infrastructure player in the blockchain and AI compute space. Upscale AI is launching a next-generation suite of AI networking solutions that delivers high-performance connectivity for specialized compute, accelerating AI democratization with open-standard and full-stack turnkey solutions.

Upscale AI is pioneering open-standard networking technology to drive faster innovation along with greater choice and flexibility across AI infrastructure. Upscale AI’s portfolio is built on SONiC, Ultra Accelerator Link (UAL), Ultra Ethernet (UE), Switch Abstraction Interface (SAI), and other cutting-edge open source technologies and open standards. The Upscale AI team also actively participates in the Ultra Ethernet Consortium (UEC), the Ultra Accelerator Link (UALink) Consortium, the Open Compute Project (OCP), and the SONiC Foundation to advance AI networking innovation. Upscale AI enables true bring-your-own-compute flexibility while defining the future of scalable, interoperable AI networking.

AI’s exponential growth has outpaced general-purpose compute and traditional networks, opening up a $20+ billion AI networking market. Realizing AI’s potential requires a full-stack redesign: specialized xPU clusters, ultra-low latency interconnects, and secure and power-efficient infrastructure to connect massive heterogeneous compute deployments at unprecedented speed. Upscale AI aims to meet this demand by designing robust silicon, systems, and software for ultra-low latency networking, enabling breakthrough performance and scalability for AI training, inference, generative AI, edge computing, and cloud-scale deployments.

Upscale AI, founded by serial entrepreneurs Barun Kar (CEO) and Rajiv Khemani (Executive Chairman), boasts a world-class founding team and over 100 influential technologists. Kar and Khemani bring a strong history of success, having co-founded and led ventures like Palo Alto Networks, Innovium, and Cavium (acquired by Marvell). The team’s extensive experience spans decades in silicon, systems, and software, with members hailing from leading infrastructure companies such as Marvell, Broadcom, Intel, Cisco, AWS, Microsoft, Google, Palo Alto Networks, and Juniper Networks.

“From enterprise networking to building cloud infrastructure, Upscale AI’s team has been at the center of every major computing shift over the past 20 years,” said Barun Kar, CEO of Upscale AI. “Now, our world-class team is applying its depth and breadth of expertise to the next frontier: open AI-native infrastructure. Our deep expertise enables us to architect cohesive AI infrastructure solutions, partnering with leading xPU vendors and AI innovators to meet today’s demands and power what’s next.”

“We incubated Upscale AI with a bold vision to reimagine networking for the AI era anchored in openness, performance, and scale,” said Rajiv Khemani, Executive Chairman of Upscale AI and CEO of Auradine. “As AI transforms every industry, next-generation infrastructure must keep pace. Upscale AI’s world-class team is building the foundational layer that will enable AI to reach its full potential.”

Upscale AI is rapidly developing full stack AI networking infrastructure. The product suite will focus on xPU maximization and connectivity, and will include:

  • AI Network Fabrics: Ultra-low latency and high bandwidth networking fabric to drive the best xPU performance and an unprecedented reduction in total cost of ownership (TCO) at the data center level.
  • Unified NOS Based on Open Standards SAI/SONiC: Hardened, production-ready SAI/SONiC with enhanced operability, allowing for scale, in-service network upgrades, and improved functionality to maximize uptime.
  • Disruptive AI Networking Rack Platforms: Standards-based networking and rack scale solutions enabling true freedom of choice for end-to-end multivendor networking.

“We’re witnessing a fundamental transformation in networking as traditional approaches cannot keep pace with AI’s exponential demands,” said Navin Chaddha, Managing Partner of Mayfield. “Upscale AI is reimagining the entire AI networking stack, starting with scale-up solutions based on open standards that break free from proprietary systems. We’re excited to continue our partnership with Barun and Rajiv for the third time, along with their world-class team, as they set out to build a category-defining networking company for the AI era.”

“The future of AI will not be built by any one company or in a closed ecosystem; it will require open standards and committed partners. Upscale AI is a welcome addition to the UA Link ecosystem, and their open standards-based AI Network Fabric will support customers’ design acceleration and bring AI-powered solutions to market faster,” said Robert Hormuth, Corporate Vice President, Architecture and Strategy, Data Center Solutions Group, AMD.

“Upscale AI’s open standards approach is shaping the critical networking layer for the GenAI era,” said Andrew Homan, Managing Partner at Maverick Silicon. “By unifying performance, scalability, and interoperability, Upscale AI is uniquely positioned to solve today’s AI networking bottlenecks while laying the groundwork for the next generation of distributed, large-scale AI systems.”

“Open, interoperable systems will become the new standard for scalable AI infrastructure,” said Alan Weckel, co-founder and Technology Analyst at 650 Group. “Upscale AI is driving the exact type of innovation we expect to see more of as the market moves beyond monolithic structures. Many industry-leading tools remain trapped within closed circuit vendor-locked systems, but Upscale AI is building a level playing field where innovation, performance, and healthy competition can thrive.”

For more information about Upscale AI, visit https://upscaleai.com.

About Upscale AI

Upscale AI is a high-performance AI networking company accelerating AI democratization with open-standard and full-stack turnkey solutions. Upscale AI’s suite of silicon, systems, and software is purpose-built for ultra-low latency networking, enabling breakthrough performance and scalability for AI training, inference, generative AI, edge computing, and cloud-scale deployments. To learn more about Upscale AI, please visit: https://upscaleai.com/.

Media Contact: upscaleai@racepointglobal.com

Rack is the Unit of AI

By: Rohit Mittal, Head of AI Products and Technologies, and Santhosh K Thodupunoori, Director of Software Engineering

AI models have reached unprecedented sizes, far beyond the capacity of any single chip or even a single server. Building systems to train and serve these multi-trillion-parameter models is no longer just about making faster chips – it’s about architecting an entire rack as one cohesive “mega-accelerator.” In fact, modern AI infrastructure treats a rack of tightly interconnected accelerators as the fundamental unit of compute, essentially “one big GPU” at the rack level [1].
This post examines why chip-level and node-level designs are insufficient for AI models and highlights the importance of rack-level networking for the next generation of AI deployments. Furthermore, we dive into examples of ecosystems and technologies required to make rack-level networking feasible.

Why Chip-Level Design Falls Short for the Next Generation of AI Models

We are witnessing exponential growth in model parameter scale (driven by Mixture of Experts, MoE) and an escalating reasoning scale (long context, complex tasks like Tree-of-Thought), both of which stress memory capacity and bandwidth more than raw FLOPs.

For example, a Mixture-of-Experts (MoE) version of GPT-4 was estimated at 1.8 trillion parameters, requiring ~1.8 TB of memory when running the model at FP8 [2]. These models must be partitioned across many accelerators, making inter-chip communication a bottleneck. In addition, LLMs processing long context windows (1M+ tokens) dramatically increase the KV cache size (64 kB/token/layer). A 128k context can mean an 8 GB KV cache per sequence per layer, quickly exceeding single-GPU HBM. Activation memory during training is also a challenge, managed by checkpointing or offloading, and more efficiently accessed across fast rack interconnects. Advanced reasoning (ToT, speculative decoding) multiplies memory demands, stressing capacity and bandwidth.
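
As a rough cross-check on the figures above, here is a short sketch in plain Python, using the assumptions quoted in this paragraph (1 byte per parameter at FP8 and 64 kB of KV cache per token per layer); it reproduces the ~1.8 TB weight footprint and the roughly 8 GB-per-layer KV cache for a 128k-token context, though exact numbers depend on precision, attention layout, and batching.

```
# Back-of-envelope memory sizing with the assumptions quoted above:
# FP8 weights (1 byte/parameter) and a KV cache of 64 kB per token per layer.

def weight_memory_tb(num_params, bytes_per_param=1):
    """Memory needed just to hold the weights, in terabytes."""
    return num_params * bytes_per_param / 1e12

def kv_cache_gb_per_layer(context_tokens, kv_bytes_per_token_per_layer=64 * 1024):
    """KV-cache footprint for one sequence in one layer, in gigabytes."""
    return context_tokens * kv_bytes_per_token_per_layer / 1e9

if __name__ == "__main__":
    # ~1.8 trillion parameters at FP8 -> ~1.8 TB of weights.
    print(f"Weights: {weight_memory_tb(1.8e12):.1f} TB")
    # 128k-token context -> roughly 8 GB of KV cache per sequence per layer.
    print(f"KV cache: {kv_cache_gb_per_layer(128_000):.1f} GB per sequence per layer")
```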

At a trillion-parameter scale, MoE models, while very efficient compared to dense counterparts, are highly communication-intensive. Inputs dynamically activate different experts that reside on different chips, requiring frequent inter-device messaging. This becomes a major performance constraint if interconnects are slow or lack bandwidth.

The Node: A Temporary Solution

The natural next step was the node: stuffing 8 GPUs into a server, hoping tighter integration would offset memory and bandwidth constraints.

It helped, but only up to a point. One analysis showed a 23% drop in throughput when a model was scaled beyond an 8-GPU server, due to interconnect limitations [3]. In an H100 HGX system, NVLink lets all 8 GPUs talk to each other at 900 GB/s (bidirectional), while communication outside the node drops to roughly 100 GB/s (bidirectional), much slower. Ethernet and InfiniBand couldn’t deliver the sub-microsecond latencies and high-throughput collective ops these models demanded.
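
To make that bandwidth gap concrete, the small sketch below (plain Python; the 20 GB payload is an arbitrary illustrative figure, not taken from the cited analysis) compares how long the same synchronization payload takes over intra-node NVLink at 900 GB/s versus a 100 GB/s inter-node link.

```
# Illustrative only: time to move the same synchronization payload over
# intra-node NVLink (~900 GB/s) vs. a slower inter-node link (~100 GB/s).

def transfer_time_ms(payload_gb, link_gb_per_s):
    return payload_gb / link_gb_per_s * 1000.0

payload_gb = 20.0  # example payload; actual collective sizes vary by model and parallelism
for name, bw in [("intra-node NVLink", 900.0), ("inter-node link", 100.0)]:
    print(f"{name:>18}: {transfer_time_ms(payload_gb, bw):6.1f} ms for {payload_gb:.0f} GB")
```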

When even the Node isn’t enough

The solution wasn’t to scale out—it was to redefine the compute boundary again and expand the 8-GPU scale-up domain to include more accelerators. Enter the rack. Systems like NVIDIA’s NVL72 treat 72 GPUs not as separate devices but as one coherent, memory-semantic machine with 130 TB/s of internal NVLink bandwidth. Physical constraints and emerging high-speed interconnects naturally converge at the rack scale—the largest domain where accelerators can effectively behave as a single, coherent computational entity before the complexities of distributed systems begin to dominate.

Table 1: Rack vs. Node/Chassis: Detailed Analytical Comparison for AI Workloads (2025 Outlook)

Feature | Node / Chassis (4-8 Accelerators) | Rack (32-72+ Accelerators) | Rationale for Rack Superiority
Latency Budget (Intra-Unit Communication) | ~200 ns (on-board NVLink/PCIe) | ~300 ns (switched NVLink/UALink) [17] | Sub-µs latency across tens of GPUs in a rack is crucial for fine-grained parallelism in transformers (e.g., token dependencies, MoE routing), which cannot be sustained across multiple discrete nodes connected by slower networks.
Aggregate Bandwidth Density (Interconnect) | Limited by direct GPU-GPU links or PCIe backplane (e.g., 8×1.8 TB/s NVLink 5 theoretical max, but all-to-all constrained) | Extremely high via switched fabric (e.g., GB200 NVL72: 130 TB/s; UALink scalable) [18] | Rack-level switches provide vastly superior bisection bandwidth for the all-to-all communication patterns common in large model training (e.g., MoE expert communication, large gradient exchanges), which would saturate inter-node links.
Memory Coherence & Capacity (Scope & Scale) | CPU-GPU coherency (e.g., Grace Hopper 8); limited shared HBM pool. | Rack-scale coherent/near-coherent memory domain target (UALink, NVLink Fusion); 8-16+ TB aggregate HBM. [13, 18] | Rack-scale fabrics with memory semantics allow a much larger pool of HBM to be treated as a unified address space, essential for holding massive model states (weights, KV cache, activations) that far exceed single-node capacity.
Power Envelope & Cooling Feasibility (at Scale) | 6-12 kW; air-cooling often feasible but stressed. | 90-140+ kW; liquid cooling essential (direct-to-chip, CDUs). [13] | Racks integrate power and liquid cooling for extreme densities unachievable by aggregating air-cooled nodes, making them the only viable physical unit for concentrating such compute power.
Physical Deployability & Serviceability | Individual server integration; complex cabling and power for many units. | Pre-integrated, modular designs (e.g., MGX); standardized unit of deployment. [21] | Rack-level pre-integration and modularity drastically reduce deployment time and complexity compared to piecemeal node integration, improving TCO and operational efficiency.
Software/Scheduling Complexity (Large Models) | Frameworks must manage inter-node communication explicitly if scaling beyond a node. | Target for a “single system image” for frameworks; NUMA-like domain simplifies scheduling within the rack. | Rack HBDs provide a near-uniform, low-latency environment that simplifies distributed training software (e.g., FSDP, Megatron-LM) and improves its efficiency.
Failure Domain & Resiliency | Smaller individual failure domain. | Larger failure domain, but with rack-level redundancy in power/cooling. | While a rack is a larger fault domain, integrated management and redundancy at the rack level can be more robust and easier to orchestrate than managing failures across many independent nodes.
Economic Unit (Procurement, TCO) | Component-level procurement. | System-level procurement, financing, and depreciation. [20] | AI infrastructure is financed and deployed at the rack level. TCO benefits from density, power/cooling efficiency, and deployment velocity of integrated racks.

NVIDIA NVL72: Rack-Scale AI via NVLink networking

NVIDIA’s NVL72 embodies this shift, demonstrating a rack-level design that combines 72 Blackwell GPUs into a single high-performance rack interconnected via NVLink [4]. This architecture provides:

  • 130 TB/s aggregate NVLink bandwidth [4]
  • 900 GB/s per-GPU one-way bandwidth [4]
  • 30x performance on trillion-parameter models compared to InfiniBand-connected H100s, a comparison that also reflects the change from FP8 to FP4 [4]

Analysis by NVIDIA shows the improvement that a scale-up rack provides over raw chip-level performance gains. Note: the exact performance improvement may vary based on assumptions about workloads, baselines, etc.

Sources:
NVIDIA Blackwell Doubles LLM Training Performance in MLPerf Training v4.1 [24, 25]
NVIDIA GB200 NVL72 performance [26, 27]

Each NVL72 rack consumes ~120 kW and delivers up to 1.4 exaFLOPS of AI throughput (FP4) [5]. The system is liquid-cooled, weighs ~3,000 lbs, and integrates 5,184 passive copper lanes for connectivity [6]. Google’s TPU v5p “cube” (a rack equivalent with 64 chips in a 3D Torus) demonstrates a similar rack-scale building block.
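
For a rough sense of scale, dividing these published rack-level figures by the 72 GPUs gives the per-GPU view sketched below; this is only arithmetic on the numbers quoted above, not an additional specification.

```
# Rough per-GPU arithmetic on the NVL72 rack figures quoted above (72 GPUs per rack).
gpus = 72
rack_power_kw = 120.0        # ~120 kW per rack
rack_fp4_exaflops = 1.4      # up to 1.4 exaFLOPS of FP4 throughput
rack_nvlink_tb_s = 130.0     # 130 TB/s aggregate NVLink bandwidth

print(f"Power per GPU:   ~{rack_power_kw / gpus * 1000:.0f} W")
print(f"FP4 per GPU:     ~{rack_fp4_exaflops * 1000 / gpus:.1f} PFLOPS")
# ~1.8 TB/s per GPU, consistent with the 900 GB/s per-GPU one-way figure above.
print(f"NVLink per GPU:  ~{rack_nvlink_tb_s / gpus:.1f} TB/s bidirectional")
```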

This approach of using scale-up networking transforms the rack into a supercomputer, with ultra-low-latency switching/routing and massive internal bandwidth enabling collective operations, remote memory access, and dynamic parallelism.

Scale-Up Networking Requirements

AI models demand the following networking properties:

  • Ultra-low latency: Accelerator to accelerator <200 ns per hop. Ethernet and standard networks are too slow [7].
  • High bandwidth: >10 TB/s per node. NVLink provides ~900 GB/s; PCIe Gen5 tops out at ~32 GB/s [4][8].
  • RDMA and memory semantics: To treat remote memory as local with minimal overhead [9].
  • Efficient collectives: All-reduce, broadcast, and scatter-gather must be hardware-accelerated or topology-aware [10].
  • Scalability: Ability to connect 100s to 1000s of GPUs within one logical fabric [11].

To realize these requirements, there is an urgent need to invest in ecosystems and technologies. Below are examples of both an ecosystem and the technologies required for the next generation of AI networking.

Ecosystem: UALink – Scale-Up Fabric for the Rack

The UALink Consortium recently introduced UALink: an open standard for scale-up networking [12].
Key Specs:

  • 200 Gbps per lane, scalable up to 800 Gbps bidirectional per port with 4 lanes per port [13]
  • <1 μs round-trip latency for 64B messages [13]
  • Scales to 1,024 accelerators across 4 racks [13]
  • Ethernet-based PHY to reduce cost and leverage commodity components [13]

By standardizing a high-radix, low-latency memory-semantic fabric, UALink can democratize rack-scale AI systems.

One often-overlooked enabler of rack-scale coherence is memory semantics—the ability for accelerators to access peer memory with load/store/atomic operations. This is not just a hardware optimization; it’s an ecosystem shift.

As detailed in our previous blog [22], memory semantics reduce reliance on RDMA stacks and allow peer GPU memory to be accessed like local HBM. This flattens software complexity, increases efficiency for small transactions (<640B), and enables a new generation of AI frameworks to treat a rack as one logical memory domain.
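
The effect can be illustrated with a deliberately simplified model: effective throughput is payload size divided by serialization time plus a fixed per-message cost. The line rate and the two overhead values below are placeholder assumptions chosen for illustration, not measured figures for any particular fabric.

```
# Simplified model: effective throughput for small messages under a fixed
# per-message overhead.  Line rate and overhead values are illustrative only.

def effective_gbps(msg_bytes, line_rate_gbps, overhead_ns):
    serialize_ns = msg_bytes * 8 / line_rate_gbps  # 1 Gbps moves 1 bit per ns
    return msg_bytes * 8 / (serialize_ns + overhead_ns)

LINE_RATE_GBPS = 800  # e.g. one 800G port
for msg in (64, 256, 640, 4096):
    light = effective_gbps(msg, LINE_RATE_GBPS, overhead_ns=50)    # load/store-style cost
    heavy = effective_gbps(msg, LINE_RATE_GBPS, overhead_ns=1000)  # packet/RDMA-style cost
    print(f"{msg:5d} B: {light:6.1f} Gbps at 50 ns overhead, {heavy:5.1f} Gbps at 1 µs overhead")
```

Under this toy model, the heavier the per-message cost, the further small transfers fall below line rate, which is why load/store memory semantics pay off most for sub-640B messages.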

Importantly, UALink institutionalizes memory semantics as an open standard. Previously, only proprietary stacks like NVLink or Google’s ICI supported these semantics. UALink opens the door for multi-vendor ecosystems to converge on memory-coherent fabrics, without locking into a single vendor.

In that sense, memory semantics aren’t just a protocol detail—they are the core abstraction around which the future rack-scale AI ecosystem is being built.

Technology: Rack-Level Networking

Rack-scale systems need advanced networking technologies:

  • Cabled backplanes, like in NVL72, allow dense copper interconnects with better signal integrity [6]
  • Active midplanes, like NVIDIA’s Kyber project, simplify assembly and cooling [15]
  • Next generation packaging and connectors such as NPC (near packaged copper), CPC (co-packaged copper) and CPO (co-packaged optics)

Investments are needed to ensure these new technologies can be manufactured in high volume with adequate quality and low cost for hyperscale deployment.

The second architectural shift is in the topology itself. As we shared before [23], switch-based interconnects are now the baseline technology for scaling beyond 8–16 GPUs.

Compared to direct mesh, switches [23]:

  • Scale linearly with node count (vs O(N²) link explosion)
  • Enable in-network collective offloads (e.g., AllReduce in switch hardware)
  • Simplify vPod creation and dynamic partitioning for multi-tenant inference
  • Improve cable management and signal integrity through tiered leaf-spine topologies

These aren’t just theoretical advantages; they are what make architectures like NVL72 or future UALink-based racks feasible. A switched fabric is the minimum requirement for treating a rack as one coherent compute fabric. Without switched interconnects, rack-scale AI collapses under its own complexity.

Conclusion: The Rack Is the New Compute Atom

For large AI models, the rack is the only unit large and fast enough to behave like a coherent compute substrate. Rack-level design enables memory-semantic interconnects, extreme bandwidth, and model-scale parallelism.

As models grow beyond 1 trillion parameters, optimizing within the boundaries of a rack becomes crucial for minimizing latency and maximizing throughput. In this world, it’s not just about chip design; it’s about how the chips and interconnects are co-engineered at rack scale. To enable this, a vibrant ecosystem and investment in new technologies are crucial.

AI engineers should begin thinking in racks: designing model parallelism and dataflow to stay within high-speed rack fabrics, optimizing memory layouts for RDMA, and architecting clusters as groups of smart, modular racks.

The rack is the unit of AI.

Bibliography

  1. NVIDIA GTC 2024 Keynote – Jensen Huang. – https://www.nvidia.com/en-us/gtc/
  2. SemiAnalysis: The Trillion Parameter Push – GPT-4 MoE. – https://www.semianalysis.com/p/the-trillion-parameter-push-gpt-4
  3. MLPerf Training v3.1 Benchmarks. – https://mlcommons.org/en/inference-datacenter-31/
  4. NVIDIA DGX GB200 NVL72 Overview. – https://www.nvidia.com/en-us/data-center/dgx-nvl72/
  5. ServeTheHome: Hands-on with NVL72. – https://www.servethehome.com/nvidia-dgx-gb200-nvl72-hands-on/
  6. ServeTheHome: NVLink Backplane Design. – https://www.servethehome.com/nvidia-dgx-gb200-nvl72-hands-on/
  7. HPCwire: Latency Challenges in AI Networks. – https://www.hpcwire.com/2024/05/09/scale-up-ai-networks-nvlink-ualink-and-ethernet/
  8. PCIe Gen5 and Ethernet Performance Comparison. – https://www.anandtech.com/show/17410/pcie-gen5-bandwidth
  9. AMD Infinity Fabric Overview. – https://www.amd.com/en/technologies/infinity-fabric
  10. NCCL: NVIDIA Collective Communications Library. – https://developer.nvidia.com/nccl
  11. UALink Consortium Whitepaper. – https://www.ualink.org
  12. The Next Platform: UALink Challenges NVLink. – https://www.nextplatform.com/2024/04/29/amd-intel-and-their-friends-challenge-nvidia-with-ualink/
  13. UALink 1.0 Specification. – https://ualinkconsortium.org/wp-content/uploads/2025/04/UALink-1.0-Specification-PR_FINAL.pdf
  14. ServeTheHome: UALink vs Ethernet vs NVLink – Power and Cost. – https://www.servethehome.com/amd-intel-and-their-friends-ualink/
  15. NVIDIA Kyber Midplane System Preview. – https://www.servethehome.com/nvidia-rubin-kyber-platform-preview/
  16. Open Compute Project: Rubin Platform Contributions. – https://www.opencompute.org/projects/nvidia
  17. UALink has Nvidia’s NVLink in the crosshairs. – https://www.tomshardware.com/tech-industry/ualink-has-nvidias-nvlink-in-the-crosshairs-final-specs-support-up-to-1-024-gpus-with-200-gt-s-bandwidth
  18. NVIDIA GB200 NVL72 – AI server. – https://aiserver.eu/product/nvidia-gb200-nvl72/
  19. Enabling 1 MW IT racks and liquid cooling at OCP EMEA Summit | Google Cloud Blog, accessed June 2, 2025. – https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit
  20. White Paper: Redesigning the Data Center for AI Workloads – Raritan. – https://www.raritan.com/landing/redesigning-data-center-for-ai-workloads-white-paper/thanks
  21. Building the Modular Foundation for AI Factories with NVIDIA MGX. – https://developer.nvidia.com/blog/building-the-modular-foundation-for-ai-factories-with-nvidia-mgx/
  22. Why scale up needs memory semantics – https://auradine.com/why-scale-up-needs-memory-semantics/
  23. Communications within a high bandwidth domain pod – https://auradine.com/communications-within-a-high-bandwidth-domain-pod-of-accelerators-gpus-mesh-vs-switched/
  24. B200 training performance – MLPerf 4.1 – https://developer.nvidia.com/blog/nvidia-blackwell-doubles-llm-training-performance-in-mlperf-training-v4-1
  25. Comparing NVIDIA Tensor Core GPUs – NVIDIA B200, B100, H200, H100, A100 – https://www.exxactcorp.com/blog/hpc/comparing-nvidia-tensor-core-gpus
  26. Nvidia GB200 NVL72 – https://www.nvidia.com/en-us/data-center/gb200-nvl72
  27. NVIDIA GB200 NVL72 Delivers Trillion-Parameter LLM Training and Real-Time Inference – https://developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference

Why Scale-up Needs Memory Semantics?

Scale-up and Memory Semantics

The quest for building ever more powerful AI systems inevitably leads us to the challenge of scale-up networking. Efficiently networking GPUs and scaling them effectively is paramount to achieving high performance. In this blog, we’ll unravel the requirements of scale-up, demonstrate how memory semantics revolutionizes its impact, and examine various scale-up approaches. Scale-up is the way we connect GPUs within a few server racks or pods, and as established in [3], switched connections provide superior bandwidth compared to point-to-point mesh networks.

A quick look into GPU instruction sets reveals that they can be broadly classified into ALU operations, data movement operations, and conditionals. Memory semantics defines the basic data movement operations of Read and Write, and while it can encompass more complex operations, we’ll focus on the essentials. This keeps the communication layer lean and fast, allowing more sophisticated logic to be managed at the application level.

Key metrics for good scale-up

The ultimate vision for scale-up is to present a single, unified system of high-TFLOP compute and enormous memory. The reality is far more intricate, involving numerous discrete GPUs linked by a web of network switches, NICs, and both copper and fiber optic connections. Through intelligent data movement libraries, we strive to create a cohesive memory space across these distributed components. Despite the inherent parallelism of AI workloads, which reduces the immediate necessity for GPU coherency, having coherent memory offers a slight advantage in simplifying the software layer. Here are a few traits necessary for a good scale-up network architecture:

  • Low Latency

    GPUs shouldn’t see a prohibitive penalty for accessing peer HBM versus local HBM. In a GPU’s hierarchical memory subsystem, latency is typically in the range of 20-350 ns, and 140-350 ns on a first-level cache miss. Providing a scale-up solution within these latency bounds (the lower the better) is a reasonable target.

  • High Bandwidth

For context, two prominent GPUs have the following characteristics:

  • MI300X has 5.3 TB/s of memory bandwidth but only 1 TB/s of peak I/O bandwidth
  • GB200 has 8 TB/s of per-GPU memory bandwidth but only 1.8 TB/s of per-GPU NVLink bandwidth

While achieving parity between memory bandwidth and I/O bandwidth may be impractical, a distributed memory system benefits significantly from high-bandwidth access to peer memory. Consider this: with each model parameter at 32-bit precision requiring 4 bytes, a 1.8-trillion-parameter GPT-4 model needs 7.04 TB of GPU RAM for loading alone. Training such a model requires even more memory to store optimizer states, gradients, and activations. Since individual GPUs typically have only a few hundred gigabytes of dedicated memory, accessing peer GPU memory is essential. Point-to-point connections limit bandwidth, whereas a switched, fully connected backplane offers full bi-directional throughput between any GPUs, maximizing data access.
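
A short sketch of that arithmetic follows; the parameter count uses the commonly cited 1.76-trillion estimate behind the ~1.8T figure (which reproduces the 7.04 TB number above), the per-GPU HBM capacity is an illustrative assumption, and training overheads for optimizer state, gradients, and activations are excluded.

```
# Rough arithmetic for the example above: parameter memory at 32-bit precision
# and the number of GPUs needed just to hold the weights.
import math

params = 1.76e12              # commonly cited GPT-4 estimate (~1.8 trillion parameters)
bytes_per_param = 4           # 32-bit precision
hbm_per_gpu_gb = 192          # illustrative per-GPU HBM capacity

weights_tb = params * bytes_per_param / 1e12
gpus_needed = math.ceil(weights_tb * 1000 / hbm_per_gpu_gb)

print(f"Weights alone: ~{weights_tb:.2f} TB")
print(f"GPUs needed just for the weights (at {hbm_per_gpu_gb} GB HBM each): {gpus_needed}")
```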

This comparison of various network architectures is well explained in this article [3].

  • Low Total Cost of Ownership (TCO)

As the number of components grows and signal paths lengthen, the cost in terms of energy, silicon, and wiring escalates. Therefore, to present a practical unified memory abstraction, the number of additional switches, retimers, and cables must be kept to a minimum. This approach yields substantial cost savings in components and significantly reduces the rack’s power budget. This is a practical, albeit less technical, metric that is essential for judging the merit of a scale-up solution.

Memory Semantics helps in scale-up

Memory semantics drastically reduce latency by enabling direct memory transactions without the overhead of packing/unpacking or setting up RDMA. This simplification leads to leaner networking switches and a more efficient software stack, further contributing to lower latency. Since local HBM access times are now comparable to peer memory accesses, these transactions flow seamlessly, creating the illusion of a single accelerator with unified shared memory. This contrasts sharply with using Ethernet packets, which introduce significant latency through packing and unpacking, the RDMA stack, header parsing, and congestion management.

Efficient handling of small transactions is vital for optimizing AI model performance across distributed systems. Collectives within Tensor and Data Parallelism rely on the communication of small messages, typically between 32B and 640B. Likewise, Activation and Gradient exchanges in Pipeline Parallelism demand the rapid delivery of small, critical data. Consequently, irrespective of the parallelism method employed, the efficiency of small-message communication has a substantial impact on overall performance. Echoing this sentiment, paper [6] states, “Even though DL models such as LLMs operate using massive message sizes, optimizations at smaller message sizes should be treated as equally important.”

The table below shows the distribution of communication sizes during the AllGather phase of three different GPT-NeoX models run on an AMD MI250 system connecting two 8-GPU pods.

Data size (bytes) | 19M Model | 1.3B Model | 13B Model
<640 | 74.82% | 59.91% | 39.49%
1K-1M | 23.77% | 22.78% | 14.19%
1-10M | 1.41% | 14.43% | 18.94%
10-125M | 0.00% | 2.89% | 27.38%

Table 1: Data Size Distribution in the AllGather Phase [6]

As illustrated in Table 1, a significant portion, between 40-75%, of AllGather communications involves transactions smaller than 640B. Our analysis of Llama3b training reveals that approximately 15% of all AI workload communication comprises these small-size memory transactions. These transactions are particularly prevalent during the training’s embedding and collective phases, such as AllReduce. Similarly, in inference, the critical synchronization between layers relies on small transactions, impacting Job Completion Time (JCT). Leveraging a low-latency link for these small transactions can substantially improve JCT for both inference and training. While software techniques exist to mitigate latency, they often add complexity, necessitate large local buffers, or result in degraded performance.

With memory semantics, smaller transactions achieve line-rate transmission while remaining lossless and using full bandwidth. This advantage extends to full-mesh traffic, where all transactions benefit from enhanced bandwidth regardless of size. Pairing memory semantics with full bi-directional bandwidth switches minimizes the performance hit of accessing peer memory. Additionally, in-network collectives become simpler to offload with memory semantics, freeing up compute resources. Simplifying switch operations by avoiding deep header parsing and excessive buffering leads to a more efficient scale-up, resulting in lower power consumption and reduced die area, which ultimately contributes to cost savings and improved TCO through denser racks.

In essence, memory semantics provide the low latency, high bandwidth, and optimized TCO that are essential for an effective scale-up architecture.

Possible solutions for scale-up networking

Now that we have established how memory semantics aids effective scale-up solutions, let’s explore the scale-up options available in the market: NVLink, UALink, and proprietary custom protocols.

NVLink [2] is Nvidia’s proprietary scale-up solution that has helped Nvidia bring innovative solutions with a huge number of accelerators. NVL72 and NVL576 [4] would not have been possible without this key innovation, which started in 2016.

UALink [1] is a new open standard that enables other accelerator vendors to take advantage of memory semantics for optimized systems that can scale up to 1,024 nodes. It was opened to the public in October 2024, and its 1.0 specification is due for release in April 2025.

Without a standardized solution like UALink, accelerator vendors resorted to creating their own scaling technologies, including Google ICI, AWS NeuronLink, and AMD Infinity Fabric. These proprietary protocols, typically built upon PCIe or Ethernet, enabled interconnection but often catered to very specific infrastructure setups. While detailed information on these protocols is limited, it’s reasonable to assume their narrow scope limits broader applicability. For example, employing lossless Ethernet for AI, while beneficial for reliability, can result in inefficient bandwidth usage for small data packets and increased latency and power usage for larger ones. Moreover, PCIe’s physical layer technology trails Ethernet in terms of achievable bandwidth.

Conclusion

Scale-up is essential for creating a unified memory view across distributed accelerators, and memory semantics is the key to making this solution truly effective for AI workloads. While NVLink and UALink both present viable options, UALink stands out as the only open standard that supports memory semantics. Proprietary solutions, like Google’s ICI, may suffice for specific data centers but fall short for general AI workloads. The growing industry-wide support for UALink presents a critical opportunity for companies like Auradine to tackle the pressing scale-up challenges in AI infrastructure.

By: Amit Srivastava, VP AI Silicon, Upscale AI.

Communications within a High-Bandwidth Domain (Pod) of Accelerators (GPUs): Mesh vs Switched

Introduction

AI infrastructure is scaling at an incredibly fast pace in the cloud and the edge data centers for both training and inference. Large AI/ML models with hundreds of billions to several trillions of parameters need multiple Accelerators (GPUs) to train and run. Multiple Accelerators allow their training to be completed in weeks and months instead of years, and allow these models to respond to queries (inferencing) in less than a second instead of in tens of seconds or minutes.

However, when the models are trained and run on multiple Accelerators, an extremely high-performance scale-up network is required to interconnect them, as they often need to exchange hundreds of megabytes of data, for example, to synchronize their partially trained parameters or to pass partial computation results to the Accelerators in the next model layer. According to this technical blog, “A single query to Llama 3.1 70B (8K input tokens and 256 output tokens) requires that up to 20 GB of TP synchronization data be transferred from each GPU.”

Transmitting 20 GB on the wire at 200 Gbps link speed takes 0.8 s, and at 1.6 Tbps it takes 0.1 s. This transmission time is by far the largest component of the total transfer time, which also includes the propagation time, approximately 7 ns for 2 m of cable distance, and scale-up switch latency of approximately 100-200 ns. Hence, the amount of network bandwidth available to each GPU is an important factor in reducing training or inference time.
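
The timing figures above can be reproduced with a few lines of arithmetic; the sketch below uses the same example values (20 GB of TP synchronization data per GPU, ~7 ns of propagation for ~2 m of cable, and 100-200 ns of switch latency).

```
# Reproducing the rough numbers above: serialization time dominates the total transfer time.

def serialization_s(payload_gb, link_gbps):
    return payload_gb * 8 / link_gbps          # GB -> Gb, divided by Gb/s

payload_gb = 20.0                              # per-GPU TP synchronization data (example above)
propagation_ns = 7                             # ~7 ns for ~2 m of cable
switch_ns = 200                                # upper end of the quoted 100-200 ns switch latency

for link_gbps in (200, 1600):
    ser = serialization_s(payload_gb, link_gbps)
    print(f"{link_gbps:5d} Gbps link: serialization {ser:.2f} s "
          f"(+ only ~{propagation_ns + switch_ns} ns of propagation and switch latency)")
```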

Today, the Accelerators in a high-bandwidth scale-up domain are connected via proprietary point-to-point mesh or switched topologies. Examples of mesh or partial-mesh topologies, where the links are mostly based on the PCIe specification, include Infinity Fabric (AMD), Inter-Chip Interconnect (ICI, Google), NVLink (Nvidia), and NeuronLink (AWS). For a switched topology, Nvidia's NVLink switch is the only option today. While both connectivity options are in use, we believe the industry will shift to switch-based topologies once standards-based, multi-vendor scale-up switch solutions are available.

A switched topology connecting the Accelerators in a scale-up high-bandwidth domain (HBD) not only provides higher bandwidth but also offers several other benefits compared to a full or partial mesh (e.g., torus) topology with direct Accelerator-to-Accelerator connections. These benefits are described below. Full and partial mesh topologies do have some advantages of their own, which are discussed toward the end of this blog.

A. Advantages of switched topology

A.1. Bandwidth Advantage

Let's consider a scale-up domain with N Accelerators, where each Accelerator has (N-1) x K GBps of egress bandwidth and (N-1) x K GBps of ingress bandwidth. In a full mesh topology, a given Accelerator (e.g., Accelerator 1) is connected to each of the other N-1 Accelerators at K GBps (see Fig. 1). In a switched topology, all of this (N-1) x K GBps of bandwidth to and from Accelerator 1 is connected to one or more switches (see Fig. 2). Hence, if Accelerator 1 wants to read memory attached to Accelerator 2, it can do so at K GBps in a full mesh topology and at (N-1) x K GBps in a switched topology. In a scale-up domain of just eight Accelerators, that is 7 times faster in a switched topology than in a full mesh topology (see Fig. 2).
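A minimal sketch of the per-pair bandwidth argument, assuming a hypothetical K = 50 GBps per link and N = 8 Accelerators (both values are placeholders; only the (N-1)x ratio matters):

```python
# Sketch of the per-pair bandwidth argument above (illustrative parameters).
N = 8        # accelerators in the scale-up domain
K = 50       # GBps per point-to-point link (assumed value for illustration)

mesh_pair_bw     = K            # full mesh: one K GBps link per peer
switched_pair_bw = (N - 1) * K  # switched: all (N-1) x K GBps can target one peer

print(f"full mesh : {mesh_pair_bw} GBps between any two accelerators")
print(f"switched  : {switched_pair_bw} GBps between any two accelerators "
      f"({switched_pair_bw // mesh_pair_bw}x)")
# With N = 8, the switched fabric reads a peer's memory 7x faster,
# matching the Fig. 2 example in the text.
```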

In a 3-dimensional torus topology, each Accelerator is typically connected to six immediate neighbors, two each in the x, y, and z directions, with point-to-point links. In such a topology, Accelerator 1 can read memory attached to Accelerator 2 at only 1/6th of the bandwidth available in a switched topology.

A.2. Pod Size Advantage

The number of Accelerator-to-Accelerator connections in a full mesh topology grows as the square of the number of Accelerators. For example, growing a Pod from 8 Accelerators to 16 increases the number of inter-Accelerator connections by more than 4 times, from 56 (8 x 7) to 240 (16 x 15). At this rate of growth, it quickly becomes impractical to implement larger, fully meshed Pods due to challenges in cable management and signal integrity. For example, inferencing with a 1.8T-parameter MoE GPT model requires connectivity among 72 Accelerators to fit the model into their combined memory capacity, which would be impractical with a full mesh of 72 Accelerators requiring O(N^2) links.

With a switched topology, the connectivity growth is linear. For example, in a Pod with N Accelerators, there are N connections, one from each Accelerator to a switch.
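As a quick illustration (not from the original post), the sketch below tabulates both growth curves, following the text's convention of counting N x (N-1) unidirectional Accelerator-to-Accelerator links for a full mesh:

```python
# Link-count growth sketch matching the example above (counts unidirectional
# accelerator-to-accelerator links, i.e. N*(N-1), as the text does).
def mesh_links(n: int) -> int:
    return n * (n - 1)

def switched_links(n: int) -> int:
    return n            # one uplink per accelerator to the switch layer

for n in (8, 16, 72):
    print(f"N={n:3d}: full mesh {mesh_links(n):5d} links, switched {switched_links(n):3d} links")

# N=8 -> 56 vs 8; N=16 -> 240 vs 16; N=72 -> 5112 vs 72.
# The quadratic blow-up is what makes large fully meshed Pods impractical.
```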

In a switched topology, the switch also acts as a signal regenerator, or "retimer," between the Accelerators and hence provides a longer physical reach between them. This helps in building a larger Pod with cost- and power-efficient copper cables.

An even bigger Pod can be built using a second, or Spine, layer of switches that connects the first, or Leaf, switches, as shown in Fig. 5.

The Leaf switches and their directly connected Accelerators can be housed in one rack and connected via passive copper cables. The Spine switches connect the Leaf switches in different racks using active copper or fiber cables, depending on the Pod size.

A.3. Collective Operations Completion Time Advantage

AllReduce is the most commonly used collective operation in ML training jobs. It is typically executed over a logical ring topology in which each Accelerator passes partially reduced data to its immediate neighbor, clockwise or anticlockwise around the ring (see Fig. 3). In a direct full mesh topology, while Accelerator 1 transfers AllReduce data to Accelerator 2, only the link from Accelerator 1 to Accelerator 2 is used; all of Accelerator 1's links to the other Accelerators sit idle. In a switched topology, by contrast, Accelerator 1 can use its full bandwidth to transfer data to Accelerator 2 without interfering with Accelerator 2 transferring its own partial AllReduce data to Accelerator 3, and so on. With (N-1) times more bandwidth available from Accelerator 1 to Accelerator 2, the ring AllReduce operation completes (N-1) times faster than in a direct full mesh topology. For example, if each Accelerator has 15 high-speed links to 15 other Accelerators, the switched fabric can effectively offer 15 times the bandwidth for ring reductions compared to a full mesh network and 6 times more than a 3-D torus partial mesh. This substantial increase in bandwidth directly accelerates the collective communication step, enabling more efficient distributed training.
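A simple model of ring AllReduce completion time illustrates the (N-1)x claim; the data size and per-link bandwidth below are placeholder assumptions, and the 2(N-1)/N factor is the standard ring AllReduce traffic volume:

```python
# Simple ring-AllReduce time model for the claim above (illustrative values).
def ring_allreduce_time(data_gb: float, n: int, neighbor_bw_gbps: float) -> float:
    """Classic ring AllReduce: each rank sends ~2*(N-1)/N of the data
    to its ring neighbor; time is gated by the bandwidth to that neighbor."""
    bytes_sent = 2 * (n - 1) / n * data_gb   # GB pushed to the next rank
    return bytes_sent / neighbor_bw_gbps     # seconds (GB over GB/s)

N, K, data = 16, 50, 8.0   # 16 accelerators, 50 GB/s per link, 8 GB gradients (assumed)
t_mesh     = ring_allreduce_time(data, N, K)            # mesh: one K GB/s link to the neighbor
t_switched = ring_allreduce_time(data, N, (N - 1) * K)  # switched: all (N-1)*K GB/s usable
print(f"mesh: {t_mesh*1e3:.1f} ms, switched: {t_switched*1e3:.1f} ms "
      f"({t_mesh/t_switched:.0f}x faster)")
# The (N-1)x speedup follows directly from the extra bandwidth to the ring neighbor.
```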

A.4. In-Network Collective Advantage

Certain collectives, such as Broadcast and AllReduce, can be executed more efficiently if they are offloaded to the network. For example, when AllReduce is executed in the switch, each Accelerator transfers its data to be reduced to the switch once, and the switch transfers the reduced result back to the Accelerators. In ring AllReduce, by contrast, each Accelerator must send partial data multiple times to its ring neighbor. With a directly connected partial or full mesh topology, in-network collective operations are simply not possible.
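The saving from offloading the reduction can be seen by comparing the bytes each Accelerator must inject; this is a rough model with illustrative numbers, not a measurement:

```python
# Per-accelerator bytes on the wire: in-switch reduction vs ring AllReduce
# (a rough model; constants are illustrative assumptions).
def ring_bytes(data_gb: float, n: int) -> float:
    return 2 * (n - 1) / n * data_gb   # reduce-scatter + all-gather phases

def in_network_bytes(data_gb: float) -> float:
    return 1.0 * data_gb               # send once to the switch, get the result back

data, N = 8.0, 16
print(f"ring AllReduce : {ring_bytes(data, N):.1f} GB sent per accelerator")
print(f"in-switch      : {in_network_bytes(data):.1f} GB sent per accelerator")
# Offloading the reduction roughly halves the data each accelerator must inject
# and removes the N-1 dependent steps of the ring, something a directly
# connected mesh cannot do, since there is no switch to perform the reduction.
```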

A.5. Flexible vPod Topology Advantage

AI/ML infrastructure providers rent Accelerators to multiple tenants on demand. To accommodate multiple tenants or workloads with different compute requirements, infrastructure providers need to carve out a subset of Accelerators into a virtual Pod, or vPod, for each tenant. These Accelerators need high bandwidth among themselves because they work together on the same AI/ML workload, while they do not communicate at all with the rest of the Accelerators. A switched topology providing any-to-any, one-hop, low-latency connectivity is much more flexible for creating vPods than a partial mesh network, which involves multi-hop, higher-latency connectivity patterns. Furthermore, such partial mesh topologies require that all the Accelerators in a vPod be physically close to each other; otherwise, traffic between two Accelerators of one tenant has to traverse an Accelerator belonging to another tenant (see Fig. 4).

This not only creates an unacceptable security risk but also constrains vPod sizes, complicates the AllReduce topology, and requires links to be shared between vPods. If AllReduce operations in two vPods happen to overlap in time, both operations take even longer to complete.

For inference jobs, the number of Accelerators needs to be dynamically increased or decreased depending on the inference request rate. A switched topology provides much more granular and flexible autoscaling than a mesh topology.

In order to protect against the switch being a single point of failure, multiple switch planes are deployed. For example, Fig. 4 shows a switched topology with M switch planes. This allows quick reconfiguration of a Pod with a spare Accelerator to work around Accelerator or link failures.

B. Advantages of mesh topology

The biggest advantage of a mesh topology is cost, since no switch is required. For a small Pod (e.g., 4-8 Accelerators), latency is another advantage of a full mesh topology, since no switch hop is required. Power consumption is harder to compare because, in mesh topologies, data has to be forwarded via multiple Accelerators, and each of those hops consumes energy.
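One way to reason about the forwarding-energy question is to compare average hop counts, since every torus hop is traffic relayed through an intermediate Accelerator. The sketch below uses the common ~k/4 average-ring-distance approximation for a k x k x k torus and is purely illustrative, not a measurement:

```python
# Hop-count proxy for the forwarding-energy point above (assumes a k x k x k
# torus; average ring distance per dimension is ~k/4, a standard approximation).
def avg_torus_hops(k: int) -> float:
    return 3 * (k / 4.0)        # three dimensions, ~k/4 average hops in each

def switched_hops() -> int:
    return 1                    # accelerator -> switch -> accelerator

for k in (4, 8):
    n = k ** 3
    print(f"{n:4d}-accelerator {k}x{k}x{k} torus: ~{avg_torus_hops(k):.1f} hops on average "
          f"vs {switched_hops()} switch hop")
# Every extra torus hop is traffic forwarded through an intermediate accelerator,
# which is why per-bit energy is hard to compare head-to-head.
```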

Summary

It is useful to remember that local computer networks started without switches. The once-popular Token Ring and CSMA/CD networks had no switches; compute nodes were directly connected to their immediate peers or to a shared transmission medium via a hub. Similarly, in the early days, processors in Cray supercomputers were interconnected in a 3D torus topology (T3D). Then came the Ethernet switch in the 1990s. Once Ethernet switches became widely available, the industry quickly recognized their benefits and moved to switched topologies. Today, processors in Cray supercomputers are connected over a Dragonfly-based switched topology called Slingshot, which scales up to 250K endpoints with a maximum of 4 switch hops. A smaller network diameter reduces average and maximum latency, latency variation, power requirements, and cost.

An extremely high-performance scale-up network is essential to support rapidly increasing AI/ML model sizes, inference workloads, token generation rates, and frequent model tuning with real-time data. Once such switch-based solutions are available from multiple vendors to interconnect the Accelerators in a scale-up high-bandwidth domain, Auradine believes the industry will quickly adopt them over mesh topologies for their superior performance, technical merits, ease of operation and manageability, and security compliance. Most importantly, a standards-based scale-up networking solution from multiple vendors, such as Auradine, will ensure vendor optionality and interoperability, which will drive down cost.

By: Subrata Banerjee, VP Software Engineering, AI Network, Upscale AI.
Created: Feb 17, 2025; Last updated: Feb 20, 2025

The post Communications within a High-Bandwidth Domain (Pod) of Accelerators (GPUs): Mesh vs switched appeared first on Upscale AI.

High-Performance Open Standards-Based Networking Fabric to Drive Growth for Generative AI Datacenters https://upscaleai.com/high-performance-open-standards-based-networking-fabric-to-drive-growth-for-generative-ai-datacenters-2/ Fri, 29 Nov 2024 14:34:30 +0000 https://auradine.com/?p=2608

Generative AI training and inference workloads are becoming increasingly complex, involving enormous datasets and requiring significant computational resources to generate, fine-tune, and deploy AI models. As major semiconductor companies (e.g., Nvidia, AMD, Intel) and hyperscalers (e.g., Google, Amazon, Microsoft) develop GPU and accelerator chips for these models, there is a critical need to address high-performance network connectivity across them. The GPUs must communicate with each other within "memory-coherent pods" and across racks in AI data centers, with ultra-low latency and high bandwidth.

To meet this demand, two primary networking architectures play a pivotal role: Scale-Up and Scale-Out networking. Scale-Up refers to the short-distance networking fabric connecting GPUs, CPUs, and memory together. Scale-Out addresses the network connectivity across multiple GPU pods or racks so that they can handle increased load and computing power collectively. This blog outlines Scale-Up and Scale-Out networking, including the emergence of an open, standards-based network fabric to power the future growth of generative AI workloads.


Understanding Generative AI Workloads

Generative AI built on large language models (LLMs), transformers, and image synthesis tools places substantial demands on computation and data throughput. These models often require extensive data parallelism, meaning they must be trained across numerous GPUs, TPUs, or specialized AI accelerators. This parallelism results in heavy traffic within data center networks, putting considerable pressure on the infrastructure to deliver high bandwidth, low latency, and reliable interconnectivity.

Networking Challenges:

  1. High Bandwidth Requirements: Generative AI models require rapid data transfer between various nodes within the network. As the model sizes grow, bandwidth demands increase significantly, straining existing networking solutions.
  2. Low Latency Needs: Training generative models involves constant data transfer, meaning that even slight latency increases can drastically impact training times and computational efficiency.
  3. Scalability and Flexibility: AI workloads scale rapidly, requiring networking infrastructure to be scalable and flexible to accommodate new resources or nodes.
  4. Interoperability: The network should support a wide range of hardware components to avoid vendor lock-in, which can impede scalability and increase operational costs.

Scale-Up Networking for Generative AI

Scale-Up networking improves the capacity of a single, high-performance computing node by adding more resources, such as CPUs, GPUs, or memory, within a single cluster. For generative AI, this approach means outfitting powerful servers with additional GPUs or accelerators to handle the increased computational demand.

Needs of Scale-Up Networking:

  • Ultra-low Latency: Because computational power is concentrated within a single node, Scale-Up networking must deliver latencies (e.g., sub-500 nanoseconds) well below what traditional packet-based Ethernet networks offer, approaching continuous data streaming.
  • High-performance data transfer: While maintaining low latency, the fabric must also address key factors that degrade network quality, minimizing "jitter" (inconsistency in the time delay between data packets being sent and received) and "tail latency" (the small percentage of network requests that take significantly longer to complete than the average response time); a short sketch of these two metrics follows this list.
  • Reduce costs by maximizing GPUs and AI accelerators in a pod: Currently, the number of interconnected GPUs in a pod ranges from 4-8 on the lower end up to 72 on the higher end. The ability to connect hundreds of GPUs in a pod can reduce GPU deployment costs while supporting larger LLM models.
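As referenced in the list above, here is a minimal sketch (not from the post) of how the two quality metrics might be computed from a set of measured one-way delays; the sample data, the mean-absolute-difference definition of jitter, and the simple percentile picker are all illustrative assumptions.

```python
# Minimal sketch of the two metrics called out above: jitter as variation in
# packet delay, and tail latency as a high percentile of completion time.
import statistics

def jitter(delays_us: list[float]) -> float:
    """Mean absolute difference between consecutive one-way delays (microseconds)."""
    return statistics.mean(abs(b - a) for a, b in zip(delays_us, delays_us[1:]))

def tail_latency(delays_us: list[float], pct: float = 99.0) -> float:
    """The pct-th percentile delay, i.e. the slow tail that stalls tightly
    synchronized collectives even when the average looks healthy."""
    ordered = sorted(delays_us)
    idx = min(len(ordered) - 1, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[idx]

samples = [1.2, 1.3, 1.1, 1.4, 1.2, 5.0, 1.3, 1.2, 1.3, 1.1]  # microseconds, toy data
print(f"jitter ~{jitter(samples):.2f} us, p99 ~{tail_latency(samples):.1f} us")
```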

Scale-Out Networking for Generative AI

Scale-out networking distributes workloads across multiple interconnected pods or racks in a Gen AI data center, providing a robust solution for scaling large generative AI models.

Needs of Scale-Out Networking:

  • High Bandwidth: The ability to move large data loads across GPU pods and racks with high throughput. This keeps the AI data center highly scalable, since growth simply means adding racks to meet rising training and inference demand.
  • Load Balancing: Ability to distribute traffic evenly across multiple network nodes to prevent bottlenecks and ensure consistent performance.
  • Flexibility in Resource Allocation: Allow different GPU racks to be assigned specific tasks or model portions, enabling dynamic resource allocation based on real-time demand.

The Role of an Open, Standards-Based Network Fabric

An essential requirement for building a robust and scalable network infrastructure for generative AI workloads is the establishment of an open, standards-based network fabric. This enables collaboration across multiple players, benefiting the industry as a whole rather than the proprietary stacks of individual companies. To this end, two key industry consortia have recently been established: the Ultra Ethernet Consortium for scale-out networking and the Ultra Accelerator Link (UALink) Consortium for scale-up networking. These include major industry players advancing the AI technology infrastructure and ecosystem.

Benefits of an Open Standards-Based Network Fabric:

  1. Interoperability: Open standards allow organizations to integrate diverse hardware components, including GPUs, CPUs, TPUs, and various accelerators, across different vendors. This flexibility helps avoid vendor lock-in and facilitates seamless expansion or upgrades.
  2. Cost Reduction: Proprietary hardware solutions often carry premium pricing. Open, standards-based fabrics enable organizations to adopt more cost-effective hardware without sacrificing performance, leading to more competitive pricing for high-performance networking solutions.
  3. Enhanced Innovation: Open standards foster collaboration and innovation within the industry, allowing multiple companies and institutions to contribute to network technology advances. This ecosystem promotes the development of new, more efficient solutions for generative AI workloads, creating alternatives to proprietary solutions like NVIDIA’s networking products.
  4. Future-Proofing: Open standards tend to evolve more rapidly in response to technological advancements, as they benefit from a larger developer and research community. By adopting a standards-based network fabric, organizations ensure that their infrastructure can adapt to emerging AI and networking innovations.

Looking Forward: The Future of Scale-Up and Scale-Out Networking

As generative AI workloads evolve, so will the need for scalable, high-performance networking infrastructure. An open, standards-based network fabric will not only serve as a viable alternative to proprietary solutions but also foster a broader, more competitive ecosystem. By adopting Scale-Up and Scale-Out networking architectures, organizations can build an adaptable, resilient infrastructure capable of meeting the demands of modern AI applications.

The establishment of Scale-Up and Scale-Out networking industry consortiums offers an ideal solution for handling generative AI workloads, balancing the need for high performance with scalability. Building an open, standards-based network fabric will be essential for fostering a competitive environment, driving down costs, and fueling innovation. Through these advancements, the field of generative AI can become more accessible, encouraging a diverse array of companies and institutions to contribute to and benefit from the AI revolution. In a rapidly evolving AI landscape, an open and inclusive approach to network infrastructure is strategic and necessary for long-term growth and sustainability.

By: Sanjay Gupta, Chief Strategy Officer, Upscale AI.

The post High-Performance Open Standards-Based Networking Fabric to Drive Growth for Generative AI Datacenters appeared first on Upscale AI.
