Building a research data science platform from industrial machines

Fang Cherry Liu, Fu Shen, Duen Horng Chau, Neil Bright, Mehmet Belgin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Data Science research has a long history in academia, spanning large-scale data management, data mining, and data analysis using technologies from database management systems (DBMSs). While traditional HPC offers tools for applying existing technologies to data processing needs, the large volume of data and the speed of data generation pose significant challenges. The Hadoop platform and the tools built on top of it drew immense interest from academia after succeeding in industry. Georgia Institute of Technology received a donation of 200 compute nodes from Yahoo. Turning these industrial machines into a research Data Science Platform (DSP) poses unique challenges, such as nontrivial hardware design decisions, configuration tool choices, node integration into existing HPC infrastructure, partitioning resources to meet different application needs, and software stack choices. We have 40 nodes up and running: 24 as a Hadoop and Spark cluster, 12 as an HBase and OpenTSDB cluster, and the remainder as service nodes. We successfully tested the platform with Spark machine learning algorithms on an 88 GB image dataset, Spark DataFrame and GraphFrame on a Wikipedia dataset, and Hadoop MapReduce wordcount on a 300 GB dataset. The OpenTSDB cluster handles real-time ingestion and storage of time series sensor data. We are working on bringing up more nodes. We share the first-hand experience gained on this journey, which we believe will benefit and inspire other academic institutions.
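
For readers unfamiliar with the wordcount benchmark named in the abstract, the map, shuffle, and reduce phases can be sketched in plain Python. This is an illustrative sketch only: it does not use the Hadoop API, it is not the authors' code, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for each whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real input file.
    sample = ["spark hadoop spark", "Hadoop hbase"]
    print(word_count(sample))
```

On a real cluster the same three phases run in parallel across nodes, with the framework handling the shuffle between mappers and reducers; the single-process version above only shows the data flow.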

Original language: English
Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
Editors: James Joshi, George Karypis, Ling Liu, Xiaohua Tony Hu, Ronay Ak, Yinglong Xia, Weijia Xu, Aki-Hiro Sato, Sudarsan Rachuri, Lyle Ungar, Philip S. Yu, Rama Govindaraju, Toyotaro Suzumura
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2270-2275
Number of pages: 6
ISBN (Electronic): 9781467390040
DOIs
State: Published - 2016
Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: Dec 5 2016 to Dec 8 2016

Publication series

Name: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016

Conference

Conference: 4th IEEE International Conference on Big Data, Big Data 2016
Country/Territory: United States
City: Washington
Period: 12/5/16 to 12/8/16

Keywords

  • Data Science Platform
  • Hadoop
  • Machine Learning
  • Software and Hardware Configuration
  • Spark
