Hardware-Interface Tools Analysis, modeling, and simulation tools.
Performance analysis and modeling are critical for almost every aspect of the exascale co-design process. They allow us to understand and anticipate the computation and communication needs of applications, identify current and future performance bottlenecks, drive the architectural design process, provide mechanisms to evaluate new features, and aid in optimizing applications and evaluating application transformations. No single analysis technique or methodology can, on its own, provide the insight into application code behavior that is required to drive the co-design process. Instead, we require a wide range of techniques combining analytical models, architectural simulation, system emulation, and empirical measurements to create a holistic picture of the behavior of a target application (see related figure).
The ExMatEx project targets all four areas, using existing technology where available and developing new approaches where necessary: we use empirical techniques to create baseline performance profiles for our proxy apps (as reported in year 1); we develop and apply the GREMLIN infrastructure to enable architectural emulation of power, performance, and resilience aspects (see the GREMLIN Toolkit section); we deploy architectural simulation using SST to improve our understanding of the impact of the memory system (see the SST section); and we provide high-level and scalable models for our proxy applications using Aspen (see the ASPEN DSL section). Combined, these techniques give us deep insight into the performance of the applications relevant to the co-design center and establish a comprehensive performance analysis infrastructure whose tools can be used in any DOE co-design effort. We also organized a deep-dive hackathon to expose the SST toolkit to the application developers and identified key aspects of the baseline programming environment that SST needs to support, e.g., OpenMP.
These are the tools required for tradeoff analysis in the co-design loop, the results from which are used to re-express the application for the emerging exascale system.
GREMLIN Toolkit Emulating exascale conditions on petascale machines.
The GREMLIN framework provides a general mechanism for architectural emulation. It allows us to go beyond measurements on existing machines while still executing full applications, within a controlled environment in which we can expose characteristics we expect of future extreme-scale platforms.
The GREMLIN framework provides this ability in a highly modular fashion. One or more modules, each responsible for emulating a particular aspect, are loaded into the execution environment of the target application. Using the PnMPI infrastructure, we accomplish this transparently to the application for an arbitrary combination of GREMLINs, providing the illusion of a system that is modified in one or more different aspects. For example, this allows us to emulate the combined effect of power limitations and reduced memory bandwidth. We then measure the impact of the GREMLIN modifications either by using performance metrics inherent to the application (e.g., iteration timings) or by adding performance monitoring tools within the PnMPI tool stack.
During Year 2 (2013), we matured the concepts of the GREMLIN framework and released the base infrastructure as well as several fundamental GREMLINs. Currently, this includes three different classes of GREMLINs: power GREMLINs that artificially reduce the operating power using DVFS or cap power on a per-node basis, memory GREMLINs that limit resources in the memory hierarchy, such as cache size or memory bandwidth, by running interference threads, and fault GREMLINs that inject faults into target applications and enable us to study the efficiency and correctness of recovery techniques. In the following we present some key results from all three areas. More detailed results are contained in our recent publications (listed below).
Power Gremlins: Emulating Power in a Constrained World
Power is generally predicted to be one of the limiting resources on the road to exascale. Driven by both the cost to operate machines and the engineering effort needed to supply machine rooms with power, future systems will be limited in the amount of power they can use. This will likely lead to overprovisioned systems, on which we can no longer run each node at full power and on which we are faced with strict power caps at the node, rack, and system level.
To emulate the effects that such power caps have on application performance, we implemented a power capping GREMLIN using Intel's RAPL (Running Average Power Limit) interface. RAPL enables us to set per-processor power caps through a set of Model-Specific Registers (MSRs), which are then enforced by the processor hardware through different voltage and frequency levels. The GREMLIN itself is therefore rather simple: we set the appropriate limit during initialization and then continue the execution under RAPL's hardware control.
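As a rough illustration of what "setting the appropriate limit" involves, the sketch below encodes a per-package cap in watts into a register value. It is not the GREMLIN code. The register addresses and bit fields follow the layout of `MSR_RAPL_POWER_UNIT` and `MSR_PKG_POWER_LIMIT` as described in Intel's Software Developer's Manual and should be checked against the manual for the target processor. The function names and the example register values are assumptions made for illustration, and the privileged write to the register is deliberately left out.

```python
# Register addresses and bit fields as described in the Intel SDM;
# verify them for the specific processor before relying on this.
MSR_RAPL_POWER_UNIT = 0x606
MSR_PKG_POWER_LIMIT = 0x610


def power_unit_watts(rapl_power_unit_reg: int) -> float:
    """Bits 3:0 hold PU; one power unit is 1 / 2**PU watts."""
    return 1.0 / (1 << (rapl_power_unit_reg & 0xF))


def encode_power_limit_1(current_reg: int, cap_watts: float, unit_watts: float) -> int:
    """Return a new MSR_PKG_POWER_LIMIT value with power limit #1 set to cap_watts.

    Only bits 14:0 (limit), 15 (enable) and 16 (clamp) are changed; the time
    window and the power limit #2 fields are preserved.
    """
    limit_field = int(round(cap_watts / unit_watts))
    if not 0 < limit_field <= 0x7FFF:
        raise ValueError("cap does not fit in the 15-bit limit field")
    cleared = current_reg & ~0x1FFFF  # clear bits 16:0
    return cleared | limit_field | (1 << 15) | (1 << 16)


if __name__ == "__main__":
    unit = power_unit_watts(0x3)  # hypothetical register value: PU = 3, so 0.125 W
    print(hex(encode_power_limit_1(0x0, 80.0, unit)))  # prints 0x18280
```

Once the value is written during initialization, the hardware enforces the cap and no further action is needed from the module.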
The figure to the right shows the results of executing the CoMD proxy app on 128 processors under three different per processor power caps using a Sandy Bridge Infiniband DDR-3 cluster with 2 sockets/16 cores per node. Ignoring individual outliers, the graph shows that, as expected, lower power bounds lead to lower power consumption, while prolonging the execution of the application. However, we also see a second trend: while execution time is fairly uniform at the highest power bound of 80W, lower bounds lead to increasingly larger variations in execution time.
To further investigate this effect, we execute the NAS MG parallel benchmark on 64 different processors, in each case using all eight cores available on each processor for one execution of the benchmark. Even though each execution is identical and executed on identical hardware, the results in the figure at the left show a significant difference in power draw when executed without power bounds (bottom left). These variations, which directly stem from natural variations incurred during the production process, then turn into performance differences when the application is executed under a power bound: while a processor not running under a power bound can compensate for differences in efficiency by applying different power levels keeping execution speed constant, a power cap eliminates this possibility and thereby directly exposes these differences in terms of achieved performance.
These experiments show that future power limitations will manifest themselves in performance variations directly visible to the application. In other words, power caps can lead to imbalanced applications even if their workload is fully balanced.
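The mechanism can be made concrete with a toy model. The sketch below assumes that each processor needs a slightly different amount of power to run at nominal speed and that, once capped, its speed scales with the cube root of the available power (a common textbook approximation for dynamic power under voltage and frequency scaling). Both the scaling law and the wattages are illustrative assumptions, not measurements from the experiments above.

```python
from typing import Optional


def relative_runtime(power_at_full_speed_w: float, cap_w: Optional[float]) -> float:
    """Runtime relative to an uncapped run (1.0 means no slowdown).

    Assumption: power ~ frequency**3, so a processor limited to a fraction r of
    the power it needs runs at r**(1/3) of its nominal speed.
    """
    if cap_w is None or power_at_full_speed_w <= cap_w:
        return 1.0
    speed = (cap_w / power_at_full_speed_w) ** (1.0 / 3.0)
    return 1.0 / speed


if __name__ == "__main__":
    # Hypothetical processors that differ only in how much power they need.
    needs_w = [60.0, 65.0, 70.0, 75.0]
    for cap in (None, 80.0, 65.0, 50.0):
        times = [relative_runtime(p, cap) for p in needs_w]
        # Ratio of slowest to fastest processor: 1.0 means perfectly balanced.
        print(cap, round(max(times) / min(times), 3))
```

In this model the processors are indistinguishable without a cap or with a generous one, and the spread between the slowest and fastest processor grows as the cap tightens.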
Memory Gremlins: Emulating a System with Constraints in the Memory Subsystem
A second resource that is expected to be severely limited in future architectures is the memory system. This refers to both memory bandwidth, in particular off-chip bandwidth, which is limited by physical constraints, and cache sizes. We can recreate these trends on today's platforms using carefully calibrated competing threads on each node, which consume predefined amounts of either bandwidth or cache storage and thereby “steal” the resource from the target application.
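The idea of a calibrated competing thread can be sketched as follows. This is only a conceptual illustration: the function names, the chunk size, and the throttling scheme are assumptions, and an interpreted Python thread cannot be calibrated the way a native interference thread pinned to a core can.

```python
import threading
import time


def bandwidth_thief(target_bytes_per_s, stop, result, chunk_bytes=1 << 20):
    """Copy a buffer repeatedly, throttled to roughly target_bytes_per_s."""
    src = bytearray(chunk_bytes)
    dst = bytearray(chunk_bytes)
    interval = chunk_bytes / target_bytes_per_s  # time budget per copied chunk
    deadline = time.perf_counter()
    moved = 0
    while not stop.is_set():
        dst[:] = src                 # memory traffic competing with the application
        moved += chunk_bytes
        deadline += interval
        delay = deadline - time.perf_counter()
        if delay > 0:
            stop.wait(delay)         # idle until the next chunk is due
    result.append(moved)


def run_thief_for(duration_s, target_bytes_per_s, chunk_bytes=1 << 20):
    """Run one interference thread for duration_s; return (bytes copied, elapsed s)."""
    stop = threading.Event()
    result = []
    start = time.perf_counter()
    worker = threading.Thread(
        target=bandwidth_thief,
        args=(target_bytes_per_s, stop, result, chunk_bytes),
    )
    worker.start()
    time.sleep(duration_s)           # the target application would run here
    stop.set()
    worker.join()
    return result[0], time.perf_counter() - start


if __name__ == "__main__":
    moved, elapsed = run_thief_for(0.5, 100e6)
    print(f"copied {moved / elapsed / 1e6:.1f} MB/s on average")
```

A cache-capacity variant would follow the same pattern but repeatedly touch a buffer sized to the fraction of the cache that should be occupied.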
The figures below show the results of using these techniques on the LULESH shock-hydro proxy application code. We execute LULESH under both cache and bandwidth GREMLINs and show the results in the left and right figures, respectively, for multiple working set sizes. Reducing the last-level (L3) cache size to 35% has only a small impact on application performance, while further reductions cause significant changes in application performance, especially for large working sets. Similarly, we observe reduced performance caused by limited memory bandwidth, again affecting larger working set sizes more than smaller ones.
Resilience Gremlins: Emulating a World with More Faults
A third critical area for exascale system design is the expected rise of fault rates caused by larger component counts, reduced slack in architectural designs, as well as smaller feature sizes. To study this effect, GREMLINs can be used to inject faults into an application's execution, enabling us to monitor effects on execution time and correctness as well as to study the impact of countermeasures implemented within applications.
As a simple first example, we instrumented LULESH with RETRY blocks, which create local mini-checkpoints at which the application stores its data and to which it can roll back if a fault is encountered within that block. The figure below shows the range of potential instrumentation locations, from wrapping all of main to wrapping the three functions called by main individually. The decision on where to place the RETRY annotations is critical; a poor choice can cause large overheads.
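The behavior of a RETRY block can be sketched in a few lines. This is a conceptual stand-in, not the annotation mechanism used in LULESH: the names `retry_block` and `FaultDetected`, the dictionary-based state, and the retry limit are all assumptions made for illustration.

```python
import copy


class FaultDetected(Exception):
    """Hypothetical signal that a non-correctable fault hit the current block."""


def retry_block(state: dict, body, max_retries: int = 10) -> int:
    """Run body(state) in place; on a fault, restore state and run it again.

    Returns the number of rollbacks that were needed.
    """
    checkpoint = copy.deepcopy(state)        # local mini-checkpoint
    rollbacks = 0
    while True:
        try:
            body(state)
            return rollbacks
        except FaultDetected:
            rollbacks += 1
            if rollbacks > max_retries:
                raise
            state.clear()                    # discard the partially updated data
            state.update(copy.deepcopy(checkpoint))


if __name__ == "__main__":
    faults_left = [1]

    def timestep(state):
        state["energy"] += 1.0               # partial update before the fault
        if faults_left[0] > 0:
            faults_left[0] -= 1
            raise FaultDetected()
        state["cycle"] += 1

    sim = {"cycle": 0, "energy": 0.0}
    print(retry_block(sim, timestep), sim)   # prints 1 {'cycle': 1, 'energy': 1.0}
```

The cost trade-off is visible even here: a larger block means fewer checkpoint copies but more work repeated on each rollback.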
We then implement a “fake” fault GREMLIN that periodically assumes non-correctable faults and informs the application through a fault interface (without actually causing a fault, but triggering the recovery mechanism in the application). We use this setup to measure the execution time of LULESH under the fault GREMLIN for varying fault rates. The results (see figure right) show a high but nearly constant overhead for protecting every function in the main loop, caused by frequent checkpoints but short rollbacks. Protecting all of main instead leads to high overhead that rises sharply as we increase the error rate, since a fault causes the entire program to be re-executed. Protecting individual iterations shows a similar trend to protecting individual functions, but at much lower overheads due to the reduced checkpointing. Protecting every Nth iteration provides a good balance between the extremes, with 25 iterations between mini-checkpoints being a sweet spot.
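The trade-off between checkpoint frequency and rollback length can be explored with a small simulation. The sketch below is a toy cost model under stated assumptions: every iteration costs one time unit, a mini-checkpoint has a fixed cost, faults are signaled independently with a fixed probability per iteration, and restoring a checkpoint is free. None of the numbers are taken from the LULESH measurements.

```python
import random


def simulate_runtime(n_iters, interval, ckpt_cost, fault_prob, seed=0):
    """Total time to finish n_iters iterations with a mini-checkpoint every
    `interval` iterations and fake faults signaled with probability fault_prob
    per iteration (fault_prob must be below 1)."""
    rng = random.Random(seed)
    elapsed = 0.0
    done = 0
    while done < n_iters:
        elapsed += ckpt_cost                 # checkpoint at the start of the block
        block = min(interval, n_iters - done)
        i = 0
        while i < block:
            elapsed += 1.0                   # one iteration of work
            if rng.random() < fault_prob:
                i = 0                        # roll back to the block's checkpoint
            else:
                i += 1
        done += block
    return elapsed


if __name__ == "__main__":
    # Hypothetical settings: 1000 iterations, checkpoint cost of 2 iterations.
    for interval in (1, 25, 1000):
        print(interval, simulate_runtime(1000, interval, 2.0, 0.002))
```

With no faults the total is simply the iteration count plus one checkpoint cost per block, so any difference at nonzero fault rates comes from repeated work after rollbacks.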
Barry Rountree, Todd Gamblin, Bronis R. de Supinski, Martin Schulz, David K. Lowenthal, Guy Cobb, Henry Tufo, “Parallelizing heavyweight debugging tools with mpiecho”, Parallel Computing, Volume 39, Issue 3, March 2013, Pages 156-166, ISSN 0167-8191, http://dx.doi.org/10.1016/j.parco.2012.11.002
M. Schulz, J. Belak, A. Bhatele, P.-T. Bremer, G. Bronevetsky, M. Casas, T. Gamblin, K. Isaacs, I. Laguna, J. Levine, V. Pascucci, D. Richards, B. Rountree, "Performance Analysis Techniques for the Exascale Co-Design Process", Proceedings of PARCO 2013, Munich, Germany, September 2013.
Patki, Tapasya, David K. Lowenthal, Barry Rountree, Martin Schulz and Bronis R. de Supinski, “Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing”, Twenty Seventh International Conference on Supercomputing (ICS 2013), Eugene, Oregon, June 10–14, 2013.
Ian Karlin, Abhinav Bhatele, Jeff, Bradford L. Chamberlain, Jonathan Cohen, Zachary DeVito, Riyaz Haque, Dan Laney, Edward Luke, Felix Wang, David Richards, Martin Schulz, Charles Still, “Exploring Traditional and Emerging Parallel Programming Models using a Proxy Application”, IPDPS 2013, Cambridge, MA, May 2013 (best paper, software track)
I. Laguna, E. Leon, M. Schulz and M. Stephenson, “A Study of Application-Level Recovery Methods for Transient Network Faults”, Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) 2013, Denver, CO, November 2013
M. Schulz, J. Belak, G. Bronevetsky, I. Laguna, D. Richards, B. Rountree, “Emulating Exascale Conditions on Today’s Platforms”, ASCR Workshop on Modeling and Simulation, Seattle, WA, September 2013.
I. Laguna, M. Schulz, J. Keasler, D. Richards, J. Belak, “Optimal Placement of Retry-Based Fault Recovery Annotations in HPC Applications”, SC2013, Denver, Colorado, November, 2013.
M. Schulz, J. Belak, G. Bronevetsky, M. Casas, I. Laguna, D. Richards, B. Rountree, “Analyzing Future Exascale Platforms on Today's Machines”, Workshop on Extreme-Scale Programming Tools, Denver, CO, November 2013.
M. Schulz and I. Laguna, “Evaluating the Impact of Faults and Recovery Mechanisms in Exascale Applications”, Mini-Symposium “Toward Resilient Applications for Extreme-Scale Systems” at SIAM-PP 2014, Portland, OR, February 2014
Invited Keynote Talks
Martin Schulz, “Performance Analysis Techniques for the Exascale Co-Design Process”, Invited Keynote Talk at ParCo 2013, Munich, Germany, September 2013
Rountree, Barry, Guy Cobb, Todd Gamblin, Martin Schulz, Bronis R. de Supinski, and Henry Tufo, “Parallelizing Heavyweight Debugging Tools with MPIecho” Parallel Computing, Vol. 39, No. 3, March 2013.
SST The Structural Simulation Toolkit Detailed performance simulation.
The above study of applying GREMLINs to the memory system shows the ability to run real applications in scenarios with reduced memory bandwidth, but also indicates some limitations. For instance, since caches are transparent for the application and any GREMLIN, any emulation has to be approximate. Further, more advanced approaches, which fundamentally change the memory system, cannot be emulated at all.
For these tasks in our performance evaluation spectrum, we apply the Structural Simulation Toolkit (SST). This approach, which is split into a micro (processor architecture) and a macro (cross-node) part, enables us to study substantial architectural changes in any part of the system, albeit with significantly increased execution times and limited to smaller applications and kernels. In particular, during this last year we have focused on the design of new simulation capabilities for advanced memory systems, such as hardware-based memory coherency models.
Obtaining information relating to hardware-based coherency events is extremely difficult and often requires vendor-specific tools or access to simulation environments. We have utilized the processor core, cache and memory modeling capabilities in SST/Micro to evaluate the coherency events for a simple 8-core processor cluster which could be used as a “tile” for a future many-core exascale-class processor. In this study we employ a simple coherency protocol which approximately maps to MESI (modified, exclusive, shared, invalid), a common coherency mechanism used in contemporary processors. A diagram of how a stream of instructions is captured by SST and fed into the model is shown in the figure below.
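To make the notion of an invalidation event concrete, the sketch below tracks MESI-like states for cache lines across several cores and counts invalidations. It is a deliberately small toy with unbounded caches and no timing, written for illustration; it is not the SST model, and the class and method names are assumptions.

```python
M, E, S, I = "M", "E", "S", "I"  # modified, exclusive, shared, invalid


class ToyCoherence:
    """Per-core line states with unbounded caches; counts invalidation events."""

    def __init__(self, n_cores: int):
        self.state = [dict() for _ in range(n_cores)]  # line address -> state
        self.invalidations = 0

    def _get(self, core: int, line: int) -> str:
        return self.state[core].get(line, I)

    def read(self, core: int, line: int) -> None:
        if self._get(core, line) != I:
            return                                     # hit in M, E or S
        sharers = [c for c in range(len(self.state))
                   if c != core and self._get(c, line) != I]
        for c in sharers:
            self.state[c][line] = S                    # owners downgrade to shared
        self.state[core][line] = S if sharers else E

    def write(self, core: int, line: int) -> None:
        for c in range(len(self.state)):
            if c != core and self._get(c, line) != I:
                self.state[c][line] = I                # invalidate the remote copy
                self.invalidations += 1
        self.state[core][line] = M


if __name__ == "__main__":
    caches = ToyCoherence(n_cores=2)
    caches.read(0, 0x40)       # core 0 holds the line exclusively
    caches.read(1, 0x40)       # both cores now share it
    caches.write(0, 0x40)      # core 1's copy is invalidated
    caches.read(1, 0x40)       # core 1 fetches it again, both share it
    caches.write(1, 0x40)      # core 0's copy is invalidated
    print(caches.invalidations)  # prints 2
```

Feeding such a model with per-thread address streams is the kind of experiment that shows how invalidation counts change with thread count.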
For our study we have used an OpenMP-based implementation of LULESH which has been optimized by the ExMatEx project for performance on vectorized architectures. By using the SST tools we are able to dynamically capture execution information and feed it into a virtual processor core and simulated caches. We note that, since we are running a fixed-size problem, the use of more cores actually improves the L1 efficiency, reducing pressure in the higher levels of the cache hierarchy.
Our analysis is able to show the cost of invalidations for varying OpenMP thread counts. These are essentially a first-order approximation for the cost of coherency within a processor, since they are commands to the cache to invalidate (mark a cache line as invalid), essentially flushing the data. Any energy expended migrating this data into the caches has been wasted if the data is invalidated. Whilst it is impossible to have a fully coherent cache system without invalidation events, a design goal for future processors will be to reduce these through a richer coherency protocol or the appropriate selection of core count or cluster size. The data provided by SST is also able to show that by fixing the problem size but executing over a larger thread count, a greater number of memory requests are resolved by other caches at the first level (up to 4.92%). This can be used by algorithm designers and hardware architects and can be viewed from two perspectives: (1) a resolution in an alternative L1 cache requires that data to be moved to the requesting L1 or processor core, forcing a data movement to take place; alternatively, (2) the overall capacity of cache on the chip is increased if all processor cores can make more effective use of the data stored with them, thus reducing the need to obtain data from memory, which is even more costly in energy and time.
We plan to utilize the approach outlined here, along with collaborations under the FastForward program, to assess the opportunities for more advanced cache protocols to be employed in the simulation infrastructure and whether the data obtained in this specific study can be used to design coherency domains where energy use can be optimized for applications such as LULESH and CoMD. In addition, we have added SST support for multi-level memory (e.g., NVRAM) to enable the assessment of the heterogeneous memory emerging for exascale computing.
ASPEN DSL A domain specific language for performance modeling.
While both GREMLIN and SST can provide valuable insights into the performance on expected architectures, they cannot describe the scalability properties of applications. To include this aspect in ExMatEx, we use the Aspen modeling language. During the past year, Aspen saw significant progress in terms of new language features, modeling capabilities, internal improvements, and user-facing infrastructure.
One of the first new capabilities added this year was a full web-driven collaborative user interface and analysis tool, AspenLab. AspenLab hosts application and machine models, grouped by user, and includes a syntax-highlighted on-line editor. Once models have been uploaded or created in AspenLab, users can run a model checker and suite of analysis tools, including tables and interactive charts showing resource usage by kernel, over parameterized variables, kernel data usage, and computational intensity.
Another major new capability is a general-purpose model optimizer. Using the open-source NLopt non-linear optimization library, modelers can select machine or application constraints and request the tool to minimize or maximize certain parameters. For example, the user might request the optimizer to find the largest problem size for a given application model which fits in a given machine's RAM and runs in less than a given time limit. This tool is flexible and supports a wide range of optimization tasks.
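The example query in this paragraph (the largest problem size that fits in RAM and finishes within a time limit) can be sketched without Aspen or NLopt. The code below swaps in a plain integer bisection from the standard library in place of a non-linear optimizer, and the memory and runtime formulas are hypothetical stand-ins for an application model, not output of any Aspen model.

```python
def largest_feasible(feasible, lo: int, hi: int) -> int:
    """Largest n in [lo, hi] with feasible(n) true, assuming that feasibility
    is monotone (once a size is infeasible, all larger sizes are too)."""
    if not feasible(lo):
        raise ValueError("no feasible size in range")
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if feasible(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Hypothetical application model for an n x n x n grid.
def memory_bytes(n: int) -> int:
    return 3 * 8 * n**3            # three double-precision fields


def runtime_s(n: int, flops_per_s: float) -> float:
    return 100 * n**3 / flops_per_s  # 100 floating-point operations per cell


if __name__ == "__main__":
    ram_bytes = 64 * 2**30         # hypothetical machine: 64 GiB of RAM
    rate = 1e12                    # hypothetical machine: 1 TFLOP/s sustained
    time_limit_s = 60.0

    def fits(n: int) -> bool:
        return memory_bytes(n) <= ram_bytes and runtime_s(n, rate) <= time_limit_s

    print(largest_feasible(fits, 1, 100_000))
```

A general-purpose optimizer is needed once the constraints are not monotone in a single parameter or several parameters are varied at once, which is where a library such as NLopt comes in.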
The language itself has received major new features. First, the current version of the Aspen modeling language now supports conditionals, where parameters in the model define the execution paths or resource requirements. As one example, this gives a model optimizer more power to choose between alternative algorithms or control flows through the model execution. The language now also supports probabilistic execution, which can be used to model branches, load imbalance, and certain other classes of nondeterministic behavior. Other syntactic changes simplify the modeling process and add clarity for the modeler. One of the most significant changes is the unification of control flows and execution kernels; the dichotomy between the two was occasionally counterintuitive, since all resource requirements had to be placed in standalone “leaf nodes” of a call stack, in contrast to the programming languages in which the applications being modeled were originally coded, which typically have no such distinction.
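How conditionals and probabilistic execution affect a resource estimate can be illustrated with a small evaluator. The sketch below is written in Python and does not use or imitate Aspen syntax; the node types `Kernel`, `Seq`, `If`, and `Prob` and the function `expected_flops` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


@dataclass
class Kernel:
    flops: Callable[[Params], float]      # resource requirement of a leaf


@dataclass
class Seq:
    parts: List[object]                   # nodes executed one after another


@dataclass
class If:
    cond: Callable[[Params], bool]        # decided by model parameters
    then: object
    otherwise: object


@dataclass
class Prob:
    branches: List[Tuple[float, object]]  # (probability, node) pairs


def expected_flops(node, params: Params) -> float:
    """Expected floating-point operation count of a model tree."""
    if isinstance(node, Kernel):
        return node.flops(params)
    if isinstance(node, Seq):
        return sum(expected_flops(p, params) for p in node.parts)
    if isinstance(node, If):
        chosen = node.then if node.cond(params) else node.otherwise
        return expected_flops(chosen, params)
    if isinstance(node, Prob):
        if abs(sum(p for p, _ in node.branches) - 1.0) > 1e-9:
            raise ValueError("branch probabilities must sum to 1")
        return sum(p * expected_flops(n, params) for p, n in node.branches)
    raise TypeError(f"unknown node type: {type(node).__name__}")


if __name__ == "__main__":
    model = Seq([
        If(lambda p: p["n"] > 1000,                 # pick an algorithm by size
           Kernel(lambda p: 50.0 * p["n"]),
           Kernel(lambda p: p["n"] ** 2)),
        Prob([(0.9, Kernel(lambda p: 10.0)),        # common fast path
              (0.1, Kernel(lambda p: 1000.0))]),    # rare slow path
    ])
    print(expected_flops(model, {"n": 100.0}))
```

Because the conditional is resolved from parameters, an optimizer varying `n` can effectively choose between the two algorithm branches.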
In combination with the other ExMatEx tools and applications, we have been adding new Aspen models and updating existing ones for the proxy apps to make them compliant with the new language features and updated app versions. We are also exploring the possibility of using an Aspen model to automatically generate skeleton code as input to the SST/macro simulator.
The possibilities for combining Aspen with other tools, from compilers to simulators, are numerous, and so we are seeing a strong need not just to create standalone tools with Aspen, but to make it simple for existing tools to link against an Aspen library. To enable this type of compatibility, we have rewritten Aspen in C++, leading to other improvements such as portability, performance, and scalability. The Scala version has strengths as well, such as ease of web-based deployment, and as Aspen is intended to be an open language standard, we can continue to use the Scala version for these purposes without major investment. However, we expect primary development and most new Aspen-based tools to be based on the C++ version.
Spafford, K., Vetter, J. S., Benson, T., & Parker, M., “Modeling synthetic aperture radar computation with Aspen”, International Journal of High Performance Computing Applications, 27(3), 255-262, 2013
C. D. Carothers, M. Mubarak, R. B. Ross, P. Carns, J. S. Vetter, J. S. Meredith, “Combining Aspen with Massively Parallel Simulation for Effective Exascale Co-Design”, DOE Workshop on Modeling and Simulation of Exascale Systems and Applications, 2013