TY - GEN
T1 - System-level monitoring of floating-point performance to improve effective system utilization
AU - Del Vento, Davide
AU - Engel, Thomas
AU - Ghosh, Siddhartha S.
AU - Hart, David L.
AU - Kelly, Rory
AU - Liu, Si
AU - Valent, Richard
PY - 2011
Y1 - 2011
N2 - NCAR's Bluefire supercomputer is instrumented with a set of low-overhead processes that continually monitor the floating-point counters of its 3,840 batch-compute cores. We extract performance numbers for each batch job by correlating the data from corresponding nodes. From experience and heuristics for good performance, we use this data, in part, to identify poorly performing jobs and then work with the users to improve their job's efficiency. Often, the solution involves simple steps such as spawning an adequate number of processes or threads, binding the processes or threads to cores, using large memory pages, or using adequate compiler optimization. These efforts typically result in performance improvements and a wall-clock runtime reduction of 10% to 20%. With more involved changes to codes and scripts, some users have obtained performance improvements of 40% to 90%. We discuss our instrumentation, some successful cases, and its general applicability to other systems.
AB - NCAR's Bluefire supercomputer is instrumented with a set of low-overhead processes that continually monitor the floating-point counters of its 3,840 batch-compute cores. We extract performance numbers for each batch job by correlating the data from corresponding nodes. From experience and heuristics for good performance, we use this data, in part, to identify poorly performing jobs and then work with the users to improve their job's efficiency. Often, the solution involves simple steps such as spawning an adequate number of processes or threads, binding the processes or threads to cores, using large memory pages, or using adequate compiler optimization. These efforts typically result in performance improvements and a wall-clock runtime reduction of 10% to 20%. With more involved changes to codes and scripts, some users have obtained performance improvements of 40% to 90%. We discuss our instrumentation, some successful cases, and its general applicability to other systems.
KW - Operational or end-user support
KW - Performance
UR - https://www.scopus.com/pages/publications/83055184893
U2 - 10.1145/2063348.2063355
DO - 10.1145/2063348.2063355
M3 - Conference contribution
AN - SCOPUS:83055184893
SN - 9781450311397
T3 - State of the Practice Reports, SC'11
BT - State of the Practice Reports, SC'11
T2 - State of the Practice Reports, SC'11
Y2 - 12 November 2011 through 18 November 2011
ER -