TY - JOUR
T1 - Optimizing Weather Model Radiative Transfer Physics for Intel's Many Integrated Core (MIC) Architecture
AU - Michalakes, John
AU - Iacono, Michael J.
AU - Jessup, Elizabeth R.
N1 - Publisher Copyright:
© 2016 World Scientific Publishing Company.
PY - 2016/12/1
Y1 - 2016/12/1
N2 - Large numerical weather prediction (NWP) codes such as the Weather Research and Forecast (WRF) model and the NOAA Nonhydrostatic Multiscale Model (NMM-B) port easily to Intel's Many Integrated Core (MIC) architecture. But for NWP to significantly realize MIC's one- to two-TFLOP/s peak computational power, we must expose and exploit thread and fine-grained (vector) parallelism while overcoming memory system bottlenecks that starve floating-point performance. We report on our work to improve the Rapid Radiative Transfer Model (RRTMG), responsible for 10-20 percent of total NMM-B run time. We isolated a standalone RRTMG benchmark code and workload from NMM-B and then analyzed performance using hardware performance counters and scaling studies. We restructured the code to improve vectorization, thread parallelism, locality, and thread contention. The restructured code ran three times faster than the original on MIC and, also importantly, 1.3x faster than the original on the host Xeon Sandy Bridge.
AB - Large numerical weather prediction (NWP) codes such as the Weather Research and Forecast (WRF) model and the NOAA Nonhydrostatic Multiscale Model (NMM-B) port easily to Intel's Many Integrated Core (MIC) architecture. But for NWP to significantly realize MIC's one- to two-TFLOP/s peak computational power, we must expose and exploit thread and fine-grained (vector) parallelism while overcoming memory system bottlenecks that starve floating-point performance. We report on our work to improve the Rapid Radiative Transfer Model (RRTMG), responsible for 10-20 percent of total NMM-B run time. We isolated a standalone RRTMG benchmark code and workload from NMM-B and then analyzed performance using hardware performance counters and scaling studies. We restructured the code to improve vectorization, thread parallelism, locality, and thread contention. The restructured code ran three times faster than the original on MIC and, also importantly, 1.3x faster than the original on the host Xeon Sandy Bridge.
UR - https://www.scopus.com/pages/publications/85006910993
U2 - 10.1142/S0129626416500195
DO - 10.1142/S0129626416500195
M3 - Article
AN - SCOPUS:85006910993
SN - 0129-6264
VL - 26
JO - Parallel Processing Letters
JF - Parallel Processing Letters
IS - 4
M1 - 1650019
ER -