Simakov, Nikolay A.; White, Joseph P.; DeLeon, Robert L.; Ghadersohi, Amin; Furlani, Thomas R.; Jones, Matthew D.; Gallo, Steven M.; Patra, Abani K.
Application kernels: HPC resources performance monitoring and variance analysis

Application kernels are computationally lightweight benchmarks or applications run repeatedly on high performance computing (HPC) clusters in order to track the Quality of Service (QoS) provided to users. They have been successful in detecting a variety of hardware and software issues, some severe, that have subsequently been corrected, resulting in improved system performance and throughput. In this work, the application kernel performance monitoring module of eXtreme Data Metrics on Demand (XDMoD) is described. Through the XDMoD framework, the application kernels have been run repeatedly on the Texas Advanced Computing Center's Stampede and Lonestar4 clusters for a total of over 14,000 jobs. This provides a body of data on the operation of the HPC clusters that can be used to statistically analyze how application performance, as measured by metrics such as execution time and communication bandwidth, is affected by the cluster's workload. We discuss metric distributions, carry out regression and correlation analyses, and use a principal component analysis (PCA) study to describe the variance and relate it to factors such as the spatial distribution of the application in the cluster. Ultimately, these types of analyses can be used to improve the application kernel mechanism, which in turn improves the QoS that the HPC infrastructure delivers to end users. Copyright (c) 2015 John Wiley & Sons, Ltd.

