A measurement study of congestion in an InfiniBand network

Fatma Alali, Fabrice Mizero, Malathi Veeraraghavan, John M. Dennis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

This paper presents a measurement study of congestion on a production, highly utilized, 72K-core InfiniBand cluster called Yellowstone. The measurement study consists of a 23-day data collection phase in which port counters of the Yellowstone switches were read multiple times every hour to check for stalls during which the port is unable to send data due to a lack of flow-control credits. A total of 30M data records were obtained and analyzed. Results showed that transmission stalls occurred in a significant number of the 100-ms intervals over which a port counter was observed. For example, out of 6M observations of Top-of-Rack (ToR) switch uplink ports, we found that the port was forced to wait for credits in 60% of these 100-ms intervals. Such transmission stalls could increase application execution time and also decrease cluster utilization. The latter will occur when Message Passing Interface (MPI) Barrier calls are issued for synchronization and communication delays cause one or more MPI ranks to be slower than others.
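
The per-interval statistic quoted in the abstract (the share of 100-ms observation intervals in which a port waited for credits) can be illustrated with a minimal sketch. This is not the authors' tooling: the function names, the 32-bit counter width, and the sample readings are assumptions made for illustration. Each observation is taken to be a pair of readings of a wait counter at the start and end of one interval, and the interval counts as stalled if the counter advanced.

```python
from typing import Iterable, Tuple

COUNTER_BITS = 32  # assumed counter width; the abstract does not state it


def counter_delta(before: int, after: int, bits: int = COUNTER_BITS) -> int:
    """Increase of a wrapping counter between two consecutive reads."""
    return (after - before) % (1 << bits)


def stalled_fraction(observations: Iterable[Tuple[int, int]]) -> float:
    """Fraction of observation intervals in which the wait counter advanced."""
    total = 0
    stalled = 0
    for before, after in observations:
        total += 1
        if counter_delta(before, after) > 0:
            stalled += 1
    return stalled / total if total else 0.0


# Made-up readings for five 100-ms intervals on one port (not data from the paper).
# The last pair wraps around the assumed 32-bit counter.
sample = [(100, 100), (100, 140), (140, 140), (140, 150), (4294967290, 5)]
print(stalled_fraction(sample))
```

Treating any positive counter increase as a stall gives a yes/no result per interval, which matches how the abstract reports its figures (a percentage of intervals rather than a total stall duration).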

Original language: English
Title of host publication: TMA 2017 - Proceedings of the 1st Network Traffic Measurement and Analysis Conference
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9783901882951
State: Published - Aug 4 2017
Event: 1st Network Traffic Measurement and Analysis Conference, TMA 2017 - Dublin, Ireland
Duration: Jun 21 2017 to Jun 23 2017

Publication series

Name: TMA 2017 - Proceedings of the 1st Network Traffic Measurement and Analysis Conference

Conference

Conference: 1st Network Traffic Measurement and Analysis Conference, TMA 2017
Country/Territory: Ireland
City: Dublin
Period: 06/21/17 to 06/23/17

Keywords

  • congestion
  • Fat-tree
  • InfiniBand
