TechRxiv

Big Ethernet Switches Help to Historical HPC Topology Challenge

preprint
posted on 2023-12-04, 02:30 authored by Eduard Vasilenko, Haibo Wang, Longfei Dai

HPC (high-performance computing) has grown in importance because training large AI models depends on it. The latest HPC requirement is to scale up to fifty thousand processors (or servers). HPC is based on direct memory exchange between processors, typically called the “interconnect”. Historically, defining a topology for the interconnect was a major scientific challenge because switching elements scaled poorly. Initially, on-board switching ASICs supported fewer than 10 directions (buses), which made it hard to scale even to hundreds of processors. The topologies invented at that time (such as Torus) route traffic over many hops, which greatly increases cost and latency and decreases reliability. Moreover, the many compromises forced applications to be engineered so that traffic is distributed unequally inside and between local groups. Later, external InfiniBand switches improved scalability to dozens of interfaces, permitting the creation of much more scalable topologies (such as Dragonfly) with fewer hops. InfiniBand considerably improved latency and cost. Yet the scale of InfiniBand still forces restrictions (for example, on performance) that are far from optimal. The latest generation of Ethernet switches has reached a scale (hundreds of interfaces) that is enough to create topologies with minimal compromise or even no compromise at all: the minimal number of hops, which results in minimal cost and latency; maximum reliability; wire-speed or close to it (regulated at design time); and no need for network load distribution at the application layer. The scale achievable without any compromise could satisfy the biggest AI training requirements. The scale achievable with minimal compromises could be 8 times higher than what is needed now.
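The link between switch port count and fabric size can be illustrated with a rough calculation. The sketch below assumes a non-blocking folded-Clos (fat-tree) fabric built from identical switches, which is a generic textbook model and not necessarily the topology proposed in the paper. The function names and the radix values 8, 48, and 512 are hypothetical, chosen only to stand in for "below 10", "dozens", and "hundreds" of interfaces.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on hosts in a non-blocking folded-Clos fabric.

    Assumption: every switch has `radix` ports; switches below the top
    tier use half of their ports downward and half upward, and top-tier
    switches use all ports downward.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


def switch_hops(tiers: int) -> int:
    """Worst-case number of switches a packet crosses (up, then down)."""
    return 2 * tiers - 1


def min_tiers(radix: int, target_hosts: int) -> int:
    """Smallest tier count whose host bound reaches `target_hosts`."""
    tiers = 1
    while max_endpoints(radix, tiers) < target_hosts:
        tiers += 1
    return tiers


if __name__ == "__main__":
    target = 50_000  # scale requirement quoted in the abstract
    for radix in (8, 48, 512):  # illustrative radix values, not from the paper
        t = min_tiers(radix, target)
        print(f"radix {radix}: {t} tiers, "
              f"{switch_hops(t)} switch hops worst case, "
              f"up to {max_endpoints(radix, t)} hosts")
```

Under these assumptions, a higher radix reaches the same host count with fewer tiers and therefore fewer worst-case hops, which is the effect the abstract attributes to large Ethernet switches. Real designs may trade port count for oversubscription or use other topologies, so the numbers are indicative only.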

Funding

Huawei Technologies

Email Address of Submitting Author

vasilenko.eduard@huawei.com

ORCID of Submitting Author

0009-0008-5560-8108

Submitting Author's Institution

Huawei Technology

Submitting Author's Country

  • Russian Federation
