Compframe 2005

Atlanta, GA, June 22-23, 2005


Performance Modeling of Component Assemblies with TAU.
N. Trebon, A. Morris, J. Ray , S. Shende, and A. Malony
Abstract
The Common Component Architecture (CCA) is a component-based methodology for developing scientific simulation codes. This architecture consists of a framework which enables components, (embodiments of numerical algorithms and physical models) to work together. Components publish their interfaces and use interfaces published by others. Components publishing the same interface and with the same functionality (but perhaps implemented via a different algorithm or data structure) may be transparently substituted for each other in a code or a component assembly. Components are compiled into shared libraries and are loaded in, instantiated and composed into a useful code at runtime. Details regarding CCA can be found in [1], [2]. An analysis of the process of decomposing a legacy simulation code and re-synthesizing it as components can be found in [3], [4]. Actual scientific results obtained from this toolkit can be found in [5], [6].
Components exist so that they can be reused. Thus, the component writer is rarely the sole user of the components. A component writer can be expected to ensure the correctness of the component. The performance, however, is of primary importance to the component user who may not be familiar with the implementation to the component. Thus, in the context of HPC (high performance computing) with components, one needs a reliable way of measuring the performance of each implementation of a component nonintrusively and correlate the performance with the problem size to create a performance model. This was addressed in [7], where a prototypical performance measurement and modeling infrastructure (a set of performance related components based on the TAU performance system [8], [9] for monitoring and recording performance metrics of a scientific simulation) was demonstrated. In this paper we present how these performance models, may be used to obtain an approximately optimal component assembly i.e., an optimal code.
Once a performance model has been created for each of the components and their implementations, it is possible to construct a global performance model for the application. By comparing global performance models utilizing different subsets (implementations) of components, it is possible to select an optimal set of component implementations for a given problem. However, simply selecting the optimal implementation of each component does not guarantee an optimal global solution because the individual performance models do not consider the interactions between components (e.g., cost of translating one data structure to another if the interacting components have different data layouts). If the number of components and their implementations are small, one may adopt the bruteforce approach of enumerating all the possible realizations of the component assembly, constructing and evaluating their global performance models, and choosing the optimal one. However, typical scientific codes consist of assemblies of 10 to 20 components. If each component has three implementations, the solution space consists of 3 10 to 3 20 possible realizations, which makes the brute-force approach of comparing each realization unfeasible.
We propose a simple rule-based approach to reduce the solution space by eliminating insignificant components. A component is insignificant if it does not significantly contribute to the overall performance of the application. These components are eliminated from the global performance model, leaving behind the dominant sub-assemblies, or cores. Once the cores have been selected, an optimal assembly can be realized through comparing the reduced number of global performance models from these cores. The optimal component assembly (consisting entirely of the cores) may then be extended into a functioning component assembly by using (adding) any implementation of the insignificant components. This ensures that the final assembly will be close to optimal.
Runtime automated optimization of codes, as done by Autopilot [10], [11] and Active Harmony [12] are closest to our approach as outlined here and in [7]. Both require the application to identify performance parameters (and the valid values that they can take) to a tuning infrastructure. This infrastructure also monitors their performance, which in case of Active Harmony is provided by a function (the objective function) implemented by the application itself. Each of the parameters are perturbed and the application run to obtain the effect of the perturbation. Active Harmony relies upon a simplex algorithm to identify the optimal values of the parameters while Autopilot uses fuzzy logic. Both (Active Harmony and Autopilot) require the identification of bad regions in parameter space, so that the optimization search may be concluded faster by avoiding these regions. Active Harmony also has an infrastructure to swap in/out multiple libraries in order to identify an optimal implementation.
Neither of these approaches are quite right for us. We can afford many evaluations of the objective function since it is an algebraic formula synthesized out of multiple component performance models. However, no approach will stand up to exponential complexity. Our strategy has been to identify the core set of parameters (component implementations) that significantly affect performance and to perform a brute-force evaluation/optimization only on the core.


Access full paper (pdf file)