Runtimes and System Services: the software layer that orchestrates the multi-scale application onto the system
Many, if not most, of today's scientific simulation applications are developed using a fairly limited set of software technologies: a standard programming language such as C, C++, or FORTRAN (possibly along with node-level acceleration APIs like CUDA or OpenMP and solver libraries such as Trilinos), MPI for communication, and a static scheduler, such as Slurm or Moab, to execute the computation on the machine. Should a developer need to load balance their computation, they must provide this functionality themselves. Similarly, fault tolerance requires programmers to periodically write data to disk for later recovery. Likewise, communication patterns are generally fixed—dynamic communication patterns must be designed and implemented by the programmer.
We contend that system services, or runtime systems, should provide much of this functionality. We view these services as supporting “programming in the large”—coupling multiple diverse components of a dynamic multi-scale computation and orchestrating its execution on the system. We group these services (non-exhaustively) into several categories of functionality:
- Scheduling: executing computational components concurrently, asynchronously, and adaptively; components are launched on the fly and exit when complete.
- Discovery: locating system resources based on application-supplied requirements (e.g. provide a list of all nodes with GPU accelerators).
- Messaging: setting up and tearing down, on the fly, dynamic and adaptive communication links between components of the calculation.
- Caching: services for temporarily storing data, perhaps in-memory, and retrieving it from anywhere in the computation. Caching can help prevent duplicate computation or store data to be used for recovery from faults. Caching can also be used for communication. Instead of sending messages, processes can store their data in the cache for retrieval by other processes.
- Fault detection: working with the application, operating system, and hardware, detect faults in the system and provide facilities for application notification or automated restart.
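As a concrete illustration of the caching service, the sketch below shows its two roles described above—avoiding duplicate computation and serving as a communication channel. This is our own minimal sketch: a plain in-process Python dict stands in for a distributed store such as memcached or redis, and all function names are illustrative, not any product's API.

```python
# Illustrative sketch only: an in-process dict stands in for a
# distributed cache service (e.g. memcached or redis).
cache = {}

def memoized_fine_scale(key, compute):
    """Avoid duplicate computation: reuse a cached result if present,
    otherwise run the (expensive) fine-scale computation and cache it."""
    if key not in cache:
        cache[key] = compute(key)
    return cache[key]

def send(channel, value):
    """Cache-as-communication: a producer stores data under an agreed key..."""
    cache[channel] = value

def receive(channel):
    """...and a consumer retrieves (and removes) it, instead of the
    producer sending a message directly."""
    return cache.pop(channel, None)
```

The same store thus serves both memoization and loose, asynchronous coupling between processes that never exchange messages directly.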
This part of our project has been exploring software that can provide these services. In general, there are two ways in which these services (or subsets of them) are implemented. First, there are distributed monolithic systems that are closely tied to a programming language or model (e.g. Charm++, X10, Chapel, CnC, Erlang). These systems include a runtime component that implements features of the programming model such as scheduling, communication, and data distribution. Note that these system-level runtimes should not be confused with the low-level runtimes provided by component-level programming APIs such as CUDA or OpenMP; the latter often provide the same conceptual features but at a much finer level of granularity. Second, there is a more loosely coupled approach that uses various single-function software packages, usually open source, to build an integrated, dynamic system. This approach is closer to what “industrial” developers use to build scalable systems (e.g. Netflix, Facebook, LinkedIn, Google). We refer to this approach as cloud- or web-based.
In Y1 of this project we began exploring this space with a couple of implementations of a simplified version of the HMM proxy discussed in section (CoHMM). As a monolithic approach, we chose the Erlang programming language because it provides all of the desired features in a single language and runtime system. Our cloud-based approach used a more diverse set of tools, including Apache ZooKeeper (for discovery, scheduling, and process tracking), node.js to manage the overall execution of the code, and multiple NoSQL databases (MongoDB, memcached, Riak, Couchbase) to cache data for fault tolerance. It is important to note that, in both cases, the implementations simply managed the dynamic coupling between the coarse- and fine-scale components of CoHMM. All of the mathematical computation was done in the component-level APIs that those components used (e.g. MPI or OpenMP).
Runtime and System Software Evaluations
In Y2 of the project we extended and expanded this work, primarily through the Los Alamos IS&T Co-Design Summer School. We brought together a team of talented students with backgrounds in both computer science and physics to refine our CoHMM and CoMD proxy application suite and to test the refined versions within a variety of runtime systems made available by both industry and academia.
Recall from Section (CoHMM) that CoHMM consists of existing coarse- and fine-scale components—a serial code for the coarse HMM portion and our CoMD proxy app for the fine-scale molecular dynamics. Our students evaluated software that acts as the “glue” between these two components. The simplicity of CoHMM allowed the students to investigate a wide swath of software technologies—both monolithic and cloud-based.
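The “glue” role can be sketched as follows. This is our own illustration, not CoHMM's actual interface: Python's standard `concurrent.futures` stands in for the runtime's scheduler, and `fine_scale` is a trivial placeholder for a CoMD run.

```python
# Sketch of the glue layer: the coarse-scale loop fans out independent
# fine-scale evaluations and gathers their results. ThreadPoolExecutor
# stands in for the runtime scheduler; fine_scale is a placeholder.
from concurrent.futures import ThreadPoolExecutor

def fine_scale(strain):
    """Hypothetical stand-in for a fine-scale CoMD evaluation."""
    return 2.0 * strain  # pretend linear stress response

def coarse_step(strains):
    """One coarse-scale step: submit one fine-scale task per point,
    then gather the resulting stresses in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fine_scale, strains))
```

The essential point is that the glue only dispatches tasks and collects results; all numerical work stays inside the fine-scale component.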
Our evaluations focused on some of the primitive features we describe above: scheduling, communication, caching, and fault tolerance. Various schedulers were tested on how well they supported dynamic and adaptive task scheduling. The students implemented multiple forms of adaptivity in both 1-D and 2-D implementations: spatial sampling, which does not require a database, and kriging, which uses an in-memory database (redis) to cache previously computed results. In addition, the students implemented fault tolerance using two approaches. In the first, process-level case, the runtime system detects the crash of a fine-scale CoMD process and restarts it from its initial conditions (potentially on another node) without crashing the entire application. In the second case, an in-memory database (again, redis) is used to periodically cache particle positions from each CoMD process. If the runtime system detects a crash, the CoMD process is restarted from the conditions encapsulated in the most recently cached particle positions—not from the initial conditions.
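The second fault-tolerance scheme reduces to a simple checkpoint-and-restart pattern, sketched below. This is an illustration of the logic only: a dict stands in for the redis cache, and the key and function names are our own, not CoMD's real interface.

```python
# Sketch of cache-based restart: a dict stands in for the redis cache.
checkpoint_cache = {}

def checkpoint(proc_id, step, positions):
    """Periodically cache a CoMD process's particle positions,
    overwriting any older checkpoint for that process."""
    checkpoint_cache[proc_id] = (step, list(positions))

def restart(proc_id, initial_positions):
    """On a detected crash, resume from the most recently cached
    positions; fall back to the initial conditions (step 0) if no
    checkpoint was ever written."""
    return checkpoint_cache.get(proc_id, (0, list(initial_positions)))
```

With a real distributed cache, the restarted process can run on a different node than the one that crashed, since the checkpoint lives outside the failed process.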
Building on the work of the summer school students, we also enhanced our original
test applications. Fault tolerance was added to the Erlang version and our
cloud version now uses the proven
The following table shows a condensed representation of the space we have explored in runtime systems and system services. This list is by no means definitive—there are many more potential packages to explore. It is clear that we must spend much of next year narrowing down the list of contenders to a small subset best suited to our particular problem domain and target application.
| Runtime | Dimensions | Adaptivity | Cache | Fault tolerance | Fine scale | Notes | Status |
|---------|------------|------------|-------|-----------------|------------|-------|--------|
| HPX     |            |            |       |                 |            | Bugs and poor documentation in v0.9.5, esp. for distributed systems | Abandoned |
| Charm++ |            |            |       |                 |            | Load balancing evaluated with synthetic benchmarks; difficult to reconfigure | Eval. only |
| Mesos   |            |            |       |                 |            | Evaluated favorably; pursue with Spark running on top of Mesos | Eval. only |
| Swarm   |            |            |       |                 |            | Evaluated favorably, but early version limited to 128 child processes | Eval. only |
| Scioto  | 1-D, 2-D   | AMR, Kriging | redis | No            | CoMD 1.1   |       |        |
| Spark   | 1-D, 2-D   | AMR, Kriging | redis | CoMD atoms    | CoMD 1.1   |       |        |
Runtimes: Lessons and Issues
While our detailed technical publications (3, listed below) are still in preparation, we can provide some early information on lessons learned and potential limitations or issues with this approach to programming in the large. First, we were only able to test scaling at the low hundreds of nodes and low thousands of cores. Given that nearly all of the computational time is spent in the CoMD calls (the coarse-scale computation, even in 2-D, is quite fast, while each of the thousands of CoMD runs takes over 20 seconds to converge), the overhead of scheduling and communication is minimal. If the granularity of our fine-scale calculations is small, this overhead may become a concern. Similarly, at the scales at which we ran, database overhead was small (even using only a single node for the redis database). Scaling to larger numbers of nodes and cores, along with a reduction in fine-scale runtime, may drive this overhead higher. This is one of the prime motivations for building our target scale-bridging proxy (ASPA): to generate realistic “speeds and feeds” for our target problem. We can use these to more accurately capture potential scaling issues with this approach. We will spend much of next year, in co-design collaboration with our analysis, performance modeling, and simulation teammates, assessing potential software and
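The granularity concern above can be made concrete with a back-of-the-envelope model (our own, with assumed numbers): if each fine-scale task runs for T seconds and the runtime adds o seconds of scheduling and communication overhead per task, the useful fraction of wall time is T / (T + o).

```python
def useful_fraction(task_seconds, overhead_seconds):
    """Fraction of wall time spent on fine-scale work rather than
    runtime overhead, per task (simple serial-overhead model)."""
    return task_seconds / (task_seconds + overhead_seconds)

# Assumed numbers for illustration: with 20 s CoMD tasks, 0.1 s of
# per-task overhead is negligible; with 0.1 s tasks, the same overhead
# wastes half the machine.
long_tasks = useful_fraction(20.0, 0.1)   # roughly 0.995
short_tasks = useful_fraction(0.1, 0.1)   # 0.5
```

The 0.1 s overhead figure is assumed, not measured; realistic values are exactly what the ASPA proxy is intended to supply.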
A final point to keep in mind: these new approaches to programming in the large may not fit well with the software ecosystem provided by vendors delivering forthcoming large-scale machines. It is unlikely that DoE's datacenters will quickly (if at all) migrate to a cloud-based infrastructure. We must carefully choose software components that can be installed and run without making a major impact on the operation and management of the datacenter. For example, our dynamic scheduling capability will likely need to operate within a partition of nodes that has been allocated by one of the standard static schedulers. Networks, too, may cause issues. All of the cloud-based software we investigated uses the TCP/IP network stack. If DoE's datacenters cannot support TCP/IP over installed HPC networks, our approach may be threatened.
Runtime Systems Publications
- R. S. Pavel, D. Roehm, C. Mitchell, C. Junghans, K. Barros, J. Mohd-Yusof, T. C. Germann, and A. L. McPherson, “Cloud+X: A Service-Based HPC Software Stack”, submitted.
- A. McPherson, C. Mitchell, and J. Ahrens, “Cloud+X: Exploring Asynchronous Concurrent Applications”, Los Alamos Unclassified Report LA-UR-12-10472, 2012. [PDF]
- A. McPherson, C. Mitchell, and K. Barros, “Non-Traditional Approaches to Developing Multi-Scale Simulation Codes”, 16th SIAM Conference on Parallel Processing for Scientific Computing, Portland, Oregon, Feb. 18-21, 2014. [accepted]
- C. Mitchell and A. McPherson, “Leveraging the Cloud for Materials Proxy Applications”, SIAM Conference on Computational Science and Engineering, Boston, MA, February 25-March 1, 2013. [external link]
- A. McPherson, C. Mitchell, and J. Ahrens, “Cloud+X: Exploring Asynchronous Concurrent Applications”, Los Alamos Unclassified Report LA-UR-12-20511 (slides for LA-UR-12-10472), presented at the DoE Exascale Research Conference, Portland, Oregon, April 16-18, 2012. [PDF]