Pangeo Benchmarking Analysis: Object Storage vs. POSIX File System

Haiying Xu, Kevin Paul, Anderson Banihirwe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Pangeo is a community of scientists and software developers collaborating to enable interactive Big Data geoscience analysis in the public cloud and on high-performance computing (HPC) systems. At the core of the Pangeo software stack are (1) Xarray, which adds labeled metadata such as dimensions, coordinates, and attributes to raw array-oriented data; (2) Dask, which provides parallel and out-of-core computation; and (3) JupyterLab, which provides the web-based interactive environment for the Pangeo platform. Geoscientists now have a strong candidate software stack for analyzing large datasets, and they want to know how the Zarr and NetCDF4 data formats compare in performance on both traditional file storage systems and object storage. We have written a benchmarking suite for the Pangeo stack that measures the scalability and performance of both input/output (I/O) throughput and computation. We describe how we performed these benchmarks and analyzed the results, and we discuss the pros and cons of the Pangeo software stack in terms of I/O scalability on both cloud and HPC storage systems.

Original language: English
Title of host publication: Proceedings of PDSW 2020
Subtitle of host publication: IEEE/ACM 5th International Parallel Data Systems Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 40-45
Number of pages: 6
ISBN (Electronic): 9781665415941
State: Published - Nov 2020
Event: 5th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2020 - Virtual, Atlanta, United States
Duration: Nov 12 2020 → …

Publication series

Name: Proceedings of PDSW 2020: IEEE/ACM 5th International Parallel Data Systems Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 5th IEEE/ACM International Parallel Data Systems Workshop, PDSW 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/12/20 → …

Keywords

  • HPC
  • I/O
  • Pangeo
  • benchmark
  • cloud
  • object store
  • throughput
