
Runtimes and System Services

The software layer that orchestrates the multi-scale app onto the system

Many, if not most, of today's scientific simulation applications are developed using a fairly limited set of software technologies: a standard programming language such as C, C++, or Fortran (possibly along with node-level acceleration APIs like CUDA or OpenMP and solver libraries such as Trilinos), MPI for communication, and a static scheduler, such as Slurm or Moab, to execute the computation on the machine. Developers who need to load balance their computations must provide that functionality themselves. Similarly, fault tolerance requires programmers to periodically write data to disk for later recovery. Likewise, communication patterns are generally fixed; dynamic patterns must be designed and implemented by the programmer. A minimal example of this hand-rolled checkpointing appears below.
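As an illustration of the status quo, here is a minimal sketch of the kind of checkpoint/restart logic applications typically implement by hand. The file name, checkpoint interval, and state layout are hypothetical stand-ins, not taken from any particular code:

```python
import os
import numpy as np

CHECKPOINT = "state.npy"   # hypothetical checkpoint file
INTERVAL = 100             # checkpoint every 100 steps (illustrative)

def run(total_steps=1000):
    # Restart from the last checkpoint if one exists.
    if os.path.exists(CHECKPOINT):
        state = np.load(CHECKPOINT)
        start = int(state[0])
    else:
        state = np.zeros(8)  # stand-in for real simulation state
        start = 0

    for step in range(start, total_steps):
        state[1:] += 0.1 * np.random.rand(7)  # stand-in for one timestep
        state[0] = step + 1
        if (step + 1) % INTERVAL == 0:
            np.save(CHECKPOINT, state)  # hand-rolled fault tolerance

run()
```

Every application that needs recovery ends up re-implementing some variant of this pattern, which is precisely the duplication a runtime-provided service would eliminate.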

We contend that system services, or runtime systems, should provide much of this functionality. We view these services as supporting “programming in the large”: coupling multiple diverse components of a dynamic multi-scale computation and orchestrating its execution on the system. We group these services (non-exhaustively) into several categories of functionality:

  • Scheduling: launching computational components concurrently, asynchronously, and adaptively, on the fly, and retiring them when they complete.
  • Discovery: locating system resources based on application-supplied requirements (e.g., providing a list of all nodes with GPU accelerators).
  • Messaging: setting up and tearing down, on the fly, dynamic, adaptive communication links between components of the calculation.
  • Caching: temporarily storing data, perhaps in memory, and retrieving it from anywhere in the computation. Caching can prevent duplicate computation or hold data used for recovery from faults. It can also serve as a communication channel: instead of sending messages, processes can store their data in the cache for retrieval by other processes (see the sketch after this list).
  • Fault detection: working with the application, operating system, and hardware to detect faults in the system, and providing facilities for application notification or automated restart.
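As a concrete illustration of the caching service, here is a minimal sketch using the redis-py client. It assumes a Redis server at localhost:6379; the key scheme and payloads are hypothetical. It shows both uses named above: memoizing an expensive fine-scale result, and using the cache in place of direct messaging:

```python
import json
import redis  # redis-py client; assumes a Redis server at localhost:6379

r = redis.Redis(host="localhost", port=6379)

def expensive_fine_scale(x):
    return x * x  # stand-in for a real fine-scale computation

def memoized(x):
    key = f"result:{x}"           # hypothetical key scheme
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # avoid duplicate computation
    value = expensive_fine_scale(x)
    r.set(key, json.dumps(value))
    return value

# Cache as a communication channel: one process deposits data ...
r.set("mailbox:worker-7", json.dumps({"positions": [0.0, 1.5, 3.0]}))
# ... and another process retrieves it later, with no direct message.
payload = json.loads(r.get("mailbox:worker-7"))
```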

This part of our project has been exploring software that can provide these services. In general, these services (or subsets of them) are implemented in two ways. First, there are distributed monolithic systems that are closely tied to a programming language or model (e.g., Charm++, X10, Chapel, CnC, Erlang). These systems include a runtime component that implements features of the programming model such as scheduling, communication, and data distribution. Note that these system-level runtimes should not be confused with the low-level runtimes provided by component-level programming languages such as CUDA or OpenMP, which often provide the same conceptual features but at a much finer level of granularity. Second, there is a more loosely coupled approach that assembles various single-function software packages, usually open source, into an integrated, dynamic system. This approach is closer to what “industrial” developers use to build scalable systems (e.g., Netflix, Facebook, LinkedIn, Google). We refer to this approach as cloud- or web-based.

In Y1 of this project we began exploring this space with a couple of implementations of a simplified version of the HMM proxy discussed in section (CoHMM). As a monolithic approach, we chose the Erlang programming language because it provides all of the desired features in a single language and runtime system. Our cloud-based approach used a more diverse set of tools, including Apache ZooKeeper (for discovery, scheduling, and process tracking), node.js to manage the overall execution of the code, and multiple NoSQL databases (MongoDB, memcached, Riak, Couchbase) to cache data for fault tolerance; a sketch of the ZooKeeper discovery pattern appears below. It is important to note that, in both cases, the implementations simply managed the dynamic coupling between the coarse- and fine-scale components of CoHMM. All of the mathematical computation was done in the component-level APIs that those components used (e.g., MPI or OpenMP).
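For illustration, here is a minimal sketch of the discovery/process-tracking pattern ZooKeeper enables, written with the kazoo Python client. The ensemble address and znode paths are assumptions, and our actual Y1 implementation drove ZooKeeper from node.js rather than Python:

```python
from kazoo.client import KazooClient  # kazoo: a Python ZooKeeper client

# Assumes a ZooKeeper ensemble reachable at this address.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each worker registers itself with an ephemeral, sequential znode;
# if the worker process dies, ZooKeeper removes the node automatically,
# which is what makes process tracking and fault detection possible.
zk.create(
    "/cohmm/workers/worker-",
    value=b"gpu=1",            # hypothetical resource description
    ephemeral=True,
    sequence=True,
    makepath=True,
)

# A coordinator discovers the currently live workers on demand.
workers = zk.get_children("/cohmm/workers")
print("live workers:", workers)

zk.stop()
```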

Runtime and System Software Evaluations

In Y2 of the project we extended and expanded this work, primarily through the execution of the Los Alamos IS&T Co-Design Summer School. We brought together a team of talented students with backgrounds in both computer science and physics to refine our CoHMM and CoMD proxy application suite and to test the refined versions within a variety of runtime systems that industry and academia have made available.

Recall from Section (CoHMM) that CoHMM consists of existing coarse- and fine-scale components: a serial code for the coarse HMM portion and our CoMD proxy app for the fine-scale molecular dynamics. Our students evaluated software that acts as the “glue” between these two components. The simplicity of CoHMM allowed the students to investigate a wide swath of software technologies, both monolithic and cloud-based.

Our evaluations focused on some of the primitive features we describe above: scheduling, communication, caching, and fault tolerance. We tested each scheduler on how well it supported dynamic and adaptive task scheduling. The students implemented multiple forms of adaptivity in both 1-D and 2-D implementations: spatial sampling, which does not require a database, and kriging, which uses an in-memory database (redis) to cache previously computed results. In addition, the students implemented fault tolerance using two approaches. In the first, process-level case, the runtime system detects the crash of a fine-scale CoMD process and restarts it from its initial conditions (potentially on another node) without crashing the entire application. In the second case, an in-memory database (again, redis) is used to periodically cache particle positions from each CoMD process. If the runtime system detects a crash, the CoMD process is restarted from the conditions encapsulated in the most recently cached particle positions, not from the initial conditions. A sketch of this cached-restart approach follows.
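The following minimal sketch, in Python with the redis-py client, illustrates the second approach under stated assumptions: the key names, the shape of the particle array, and the restart hook are all hypothetical, and the real implementations cached state from CoMD itself rather than from a toy array:

```python
import numpy as np
import redis  # assumes a Redis server reachable at localhost:6379

r = redis.Redis(host="localhost", port=6379)

def checkpoint(task_id, positions):
    # Periodically cache this task's particle positions in Redis.
    r.set(f"comd:{task_id}:positions", positions.tobytes())
    r.set(f"comd:{task_id}:shape", ",".join(map(str, positions.shape)))

def restart_positions(task_id, initial_positions):
    # On a detected crash, restart from the most recently cached
    # positions rather than from the initial conditions.
    blob = r.get(f"comd:{task_id}:positions")
    if blob is None:
        return initial_positions  # no checkpoint yet: cold restart
    shape = tuple(int(s) for s in r.get(f"comd:{task_id}:shape").split(b","))
    return np.frombuffer(blob, dtype=initial_positions.dtype).reshape(shape)

# Example: a fine-scale task caches its state every few timesteps ...
positions = np.random.rand(64, 3)
checkpoint("task-42", positions)
# ... and, after a crash, the runtime rebuilds the task's state.
recovered = restart_positions("task-42", np.zeros((64, 3)))
```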

Building on the work of the summer school students, we also enhanced our original test applications. Fault tolerance was added to the Erlang version and our cloud version now uses the proven redis database.

The following table shows a condensed representation of the space we have explored in runtime systems and system services. This list is by no means definitive—there are many more potential packages to explore. It is clear that we must spend much of next year narrowing down the list of contenders to a small subset best suited to our particular problem domain and target application.

CoHMM Implementations

Table 1. CoHMM implementations in various asynchronous, task-parallel programming/execution models and runtime systems (listed in chronological order of implementation).

| System  | Dim.     | Adaptive?    | Database? | Fault Tolerant? | Status      |
|---------|----------|--------------|-----------|-----------------|-------------|
| Scala   | 1-D      | No           | No        | No              | Simple MD   |
| Erlang  | 1-D      | No           | No        | Process         | CoMD 1.1    |
| "Cloud" | 1-D      | No           | Multiple  | Process         | CoMD 1.1    |
| Swift   | 1-D      | No           | No        | Process         | CoMD 1.1    |
| HPX     | –        | –            | –         | –               | Abandoned¹  |
| Charm++ | –        | –            | –         | –               | Eval. only² |
| Mesos   | –        | –            | –         | –               | Eval. only³ |
| Swarm   | –        | –            | –         | –               | Eval. only⁴ |
| Pathos  | 1-D      | Yes          | No        | Process         | CoMD 1.1    |
| Scioto  | 1-D, 2-D | AMR, Kriging | redis     | No              | CoMD 1.1    |
| Spark   | 1-D, 2-D | AMR, Kriging | redis     | CoMD atoms      | CoMD 1.1    |
| CnC     | 2-D      | No           | No        | No              | CoMD 1.1    |

¹ Bugs and poor documentation in v0.9.5, especially for distributed systems.
² Load balancing evaluated with synthetic benchmarks; difficult to reconfigure.
³ Evaluated favorably; pursue with Spark running on top of Mesos.
⁴ Evaluated favorably, but early version limited to 128 child processes.

Runtimes: Lessons and Issues

While our detailed technical publications (3, listed below) are still in preparation, we can provide some early information on lessons learned and potential limitations or issues with this approach to programming in the large. First, we were only able to test scaling at the low hundreds of nodes and low thousands of cores. Given that nearly all of the computational time is spent in the CoMD calls (the coarse-scale computation, even in 2-D, is quite fast, while each of the thousands of CoMD runs takes over 20 seconds to converge), the overhead of scheduling and communication is minimal; the back-of-envelope sketch below makes this concrete. If the granularity of our fine-scale calculations shrinks, this overhead may become a concern. Similarly, at the scales at which we ran, database overhead was small (even using only a single node for the redis database). Scaling to larger numbers of nodes and cores, along with a reduction in fine-scale runtime, may drive this overhead higher. This is one of the prime motivations for building our target scale-bridging proxy (ASPA): to generate realistic “speeds and feeds” for our target problem. We can use these to more accurately capture potential scaling issues with this approach. We will spend much of next year, in co-design collaboration with our analysis, performance modeling, and simulation teammates, assessing potential software and algorithmic issues.
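To make the granularity argument concrete, here is a tiny back-of-envelope calculation. The 20-second task time comes from our CoMD runs; the per-task overhead values are purely illustrative assumptions:

```python
# Fraction of wall-clock time lost to scheduling/communication overhead,
# as a function of fine-scale task granularity.
def overhead_fraction(task_seconds, overhead_seconds):
    return overhead_seconds / (task_seconds + overhead_seconds)

# Today: CoMD tasks take over 20 s; assume 50 ms of per-task overhead.
print(overhead_fraction(20.0, 0.05))   # ~0.25% -- negligible

# If fine-scale tasks shrink to 0.1 s with the same assumed overhead,
# the same runtime machinery now costs about a third of the time.
print(overhead_fraction(0.1, 0.05))    # ~33% -- a real concern
```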

A final point to keep in mind: these new approaches to programming in the large may not fit well with the software ecosystem provided by vendors delivering forthcoming large-scale machines. It is unlikely that DoE's datacenters will quickly (if at all) migrate to a cloud-based infrastructure. We must carefully choose software components that can be installed and run without making a major impact on the operation and management of the datacenter. For example, our dynamic scheduling capability will likely need to operate within a partition of nodes that has been allocated by one of the standard static schedulers. Networks, too, may cause issues. All of the cloud-based software we investigated uses the TCP/IP network stack. If DoE's datacenters cannot support TCP/IP over installed HPC networks, our approach may be threatened.

Runtime Systems Publications

Papers

R. S. Pavel, D. Roehm, C. Mitchell, C. Junghans, K. Barros, J. Mohd-Yusof, T. C. Germann, and A. L. McPherson, “Cloud+X: A Service-Based HPC Software Stack”, submitted.

Reports

A. McPherson, C. Mitchell, J. Ahrens, “Cloud+X: Exploring Asynchronous Concurrent Applications”, Los Alamos Unclassified Report LA-UR-12-10472, 2012. [PDF]

Presentations

A. McPherson, C. Mitchell, K. Barros, “Non-Traditional Approaches to Developing Multi-Scale Simulation Codes”, 16th SIAM Conference on Parallel Processing for Scientific Computing, Portland, Oregon, February 18–21, 2014. [accepted]

C. Mitchell, A. McPherson, “Leveraging the Cloud for Materials Proxy Applications”, SIAM Conference on Computational Science and Engineering, Boston, MA, February 25–March 1, 2013. [external link]

A. McPherson, C. Mitchell, J. Ahrens, “Cloud+X: Exploring Asynchronous Concurrent Applications”, Los Alamos Unclassified Report LA-UR-12-20511 (slides for LA-UR-12-10472), presented at the DoE Exascale Research Conference, Portland, Oregon, April 16–18, 2012. [PDF]

Most of our code is released as open source; visit the ExMatEx GitHub site.