CoMD Proxy Application
A classical molecular dynamics proxy application implemented in multiple programming models.
The molecular dynamics (MD) computer simulation method is a well-established and important tool for the study of the dynamical properties of liquids, solids, and other systems of interest in Materials Science and Engineering, Chemistry and Biology. A material is represented in terms of atoms and molecules. The method of MD simulation involves the evaluation of the force acting on each atom due to all other atoms in the system and the numerical integration of the Newtonian equations of motion for each of those atoms. The evaluation of the interatomic potentials and the corresponding forces is the most computationally intensive portion of most MD codes. The energy of the system is expressed in terms of the potential and kinetic energy of the atoms, with the latter related to the macroscopic temperature of the material. Throughout the simulation, the total energy of the system should be conserved even though the instantaneous partitioning between the two components (and hence the temperature) may fluctuate.
Though MD was initially developed to compute the equilibrium thermodynamic behavior of materials (equation of state), most recent applications have used MD to study non-equilibrium processes.
MD and ExMatEx
MD simulations are the finest-scale simulations performed within the context of the ExMatEx project. As such, they form a ‘ground truth’ for the system being simulated, and a suitably large MD simulation (if possible) would be a reference case, similar to Direct Numerical Simulation (DNS) in fluid mechanics. The figure below shows two examples of such large-scale simulations, SPaSM and ddcMD, simulating solid and liquid problems.
Due to the time- and length-scales needed, it is often impractical to simulate realistic problems of interest using MD in this fashion, so MD is often coupled to coarse-scale simulations via scale-bridging algorithms. In this mode, smaller MD simulations are used to provide constitutive relations for the coarse-scale simulations in regions where the standard stress-strain relations may no longer hold, for example at the extremes of temperature, pressure or deformation.
CoMD is a proxy-app for a broad class of molecular dynamics simulations. In particular, we consider the simulation of materials where the interatomic potentials are short range (e.g. uncharged metallic materials). In that case the simulation requires the evaluation of all forces between atom pairs within the cutoff distance. The performance of the simulation is then dependent on the efficiency of 1) identifying all atom pairs within the interaction radius and 2) computing the force between the atom pairs. For the proxy app, we consider only the simple Lennard-Jones (LJ) and Embedded Atom Method (EAM) potentials.
The small figure to the right shows a sketch of a set of atoms, with a typical interatomic potential with a finite cutoff radius. The problem is then reduced to computing the interactions of the atom in question with all the other atoms within the shaded circle, rather than with all the other atoms in the system.
Inter-node Work Decomposition
CoMD utilizes a Cartesian spatial decomposition of atoms across nodes, with each node responsible for computing forces and evolving positions of all atoms within its domain boundaries. As atoms move across domain boundaries they are handed off from one node to the next.
Depending on the relative cost of the force computation, the interaction distance and other factors, other decomposition methods have been used in MD simulation. These include:
- Spatial Decomposition (SD) as described above
- Atom Decomposition (AD) where each node owns a fixed subset of the atoms, regardless of their position in space.
- Force Decomposition (FD) where each node is assigned to compute a fixed subset of the pairwise forces Fij
- Neutral Territory (NT) which is a combination of spatial and force decomposition. The atoms are spatially decomposed, but unlike pure SD, the force computation for a pair of atoms [i,j] is performed by a processor which is in the same row and column of the pair, even though it may own neither of the atoms.
Examples of possible work decomposition schemes for MD are shown in the figure above, colored by processor (node) ID. From left to right, spatial decomposition, atom decomposition, force decomposition and neutral territory method. Only the spatial decomposition is used in CoMD.
CoMD assumes a cutoff distance for the interatomic potentials which is much smaller than the characteristic size of the spatial domain of the node. To allow efficient identification of the atom pairs which are within the cutoff distance, the atoms on a node are assigned to cubic link cells which are slightly larger than the cutoff distance. Thus, an atom within a given link cell only needs to test all the atoms within its own cell and the 26 neighboring link cells in order to guarantee that it has found all possible interacting atoms. In contrast to some codes where the atoms are assigned to link cells only during the force computation, CoMD uses the link cells as the basic data structure for storing the atoms.
The figure to the right shows a sketch of the link cell decomposition of atoms with a finite cutoff. Searching all the neighboring link cells guarantees that all atoms within the shaded circle are checked.
An alternative way to constrain the search for interacting atoms is neighbor lists. Each atom carries a list of pointers to the atoms that are close by. These lists introduce some (sometimes significant) memory overhead and need to be rebuilt occasionally, depending on the movement of the atoms. They generally also introduce more random memory access patterns. Future implementations of CoMD will introduce neighbor-list formulations to evaluate the relative tradeoffs of the two schemes.
Since atoms interact with all atoms within the cutoff radius, atoms near the domain boundaries need information from adjacent nodes. The state data for these atoms is placed in halo (ghost cell) regions around the local spatial region. These regions are decomposed into halo link cells, just as occurs in the interior domain. For the LJ potential, only the atom positions need to be communicated prior to force computation. For EAM, a partial force term also needs to be communicated, interleaved with the force computation step.
The communication pattern takes advantage of the Cartesian domain decomposition and link cell structure. Each node sends to its 26 neighbors in 6 messages in 3 steps, first the x-faces, followed by the y-faces, then z-faces. This minimizes the number of messages and maximizes the size of the messages. This requires that the message traffic be serialized with the steps processed in order.
The reference implementation of CoMD is coded in C. The atom data is stored in a partial array of structures format, utilizing separate 3-vectors for position, velocity and force, and flat arrays for most other variables such as energy. Internode decomposition is via encapsulated MPI communication. Since the local work is purely serial, we make use of force symmetry (Fij = - Fji) to reduce the amount of computation to the minimum. A pure serial code (without MPI) can be built by switching a flag in the makefile. In this case the halo regions are populated by copying data from the opposite boundary (assuming periodic boundary conditions).
The OpenMP implementation parallelizes the force loop, which is the majority of the computational work. In order to ensure thread safety, this version does not use force symmetry, but rather each atom in the pair computes Fij separately and updates only their own force term. Other strategies for ensuring thread safety are covered in the CoMD documentation.
A specially tuned OpenMP implementation for Intel MIC also exists. This version runs on the Xeon CPU and natively on the Xeon Phi coprocessor. Compiler directives or pragmas were used to identify multi-threading (e.g. omp parallel for, omp parallel for reduction) of outer loops and vectorization (e.g. ivdep, simd) of inner loops. Optimization was focused on the computationally intensive force calculation functions (LJ, EAM). Other changes include alignment of local and allocated variables on 64-byte boundaries, conversion of the relevant data structures from Array of Structures (AoS) to Structure of Arrays (SoA) layouts, aliasing of structure array names, and use of loop blocking.
The force computation loops have been restructured, similarly to the CoMD OpenCL version, as follows:

For all local link cells
    For each atom in local link cell
        For all neighbor link cells
            For each atom in neighbor link cell
Runs were made using the default configurations. The current version shows similar speedup for the LJ force computation on the CPU vs. coprocessor on all available threads. The redistribution remains a bottleneck on the Phi that affects the total run time. More work is required for further optimization and to take better advantage of the coprocessor architecture.
The OpenCL implementation allows the code to run on both multi-core CPUs and accelerators, including GPUs (NVIDIA and ATI) and Intel MIC (aka Xeon Phi). The force loop parallelization is essentially the same as the OpenMP version, in this case expressed as a 2-D NDRange of (number of atom slots per link cell) x (number of link cells).
However, the data layout is substantially different to account for the features common to many GPUs. Rather than 3-vectors for position, etc. (array-of-structures, AoS) the data is stored in flat arrays of the components of the vector, denoted by structure-of-arrays or SoA layout. As with CoMD 1.0, an AoS implementation is also in development.
Parallelization of other bottlenecks, including the assignment of atoms to link cells, is also underway. Due to the OpenCL threading model, the alternate strategies available to OpenMP are likely not possible in OpenCL. Performance is roughly comparable with OpenMP on the platforms where both are available, but with a significant overhead incurred by copying and reformatting data between OpenCL buffers and native C arrays.
Interaction with AMD has been via the OpenCL version. Tuning of the code to more efficiently target their specific architectural features has yielded speedups of up to 20x (for LJ). The OpenCL version has also been implemented on their future (FastForward) architecture emulators, including Processor-In-Memory (PIM) and Exascale Heterogenous Processor (EHP). Collaboration is ongoing.
The CUDA version was developed by NVIDIA in collaboration with the CoMD team. Although the initial version was guided by the OpenCL version, the subsequent development and tuning has been extensive. The architectural similarities between NVIDIA and ATI GPU hardware have led to broadly convergent strategies for maximizing performance through optimizing thread utilization and reuse. The code has also been ported to the NVIDIA Gryphon (future platform) simulator.
CoMD has been ported to both Intel MIC (aka Xeon Phi) and Intel's FF candidate architecture emulator. Tuning of both the code and the architecture is ongoing.
J. Mohd-Yusof and N. Sakharnykh, “Optimizing CoMD: A Molecular Dynamics Proxy Application Study”, GPU Technology Conference 2014, San Jose, CA, March 2014 (accepted for presentation)
S. Mniszewski and J. Mohd-Yusof, “CoDesign Molecular Dynamics (CoMD) Proxy Application”, PGI OpenACC Workshop, Oct. 9-10, 2013 (LA-UR-13-27831)
J. Mohd-Yusof, “CoDesign Molecular Dynamics (CoMD) Overview”, ExMatEx All-Hands Meeting, March 12-14, 2013, Santa Ana Pueblo, New Mexico (LA-UR-13-21655)
J. Mohd-Yusof, “CoDesign Molecular Dynamics (CoMD) Proxy App Deep Dive”, Exascale Research Conference, Oct. 1-3, 2012, Arlington, Virginia (LA-UR-12-25066)
J. Mohd-Yusof, “CoDesign Molecular Dynamics (CoMD) Proxy App”, ExMatEx All-Hands Meeting, May 30 - June 1, 2012, Pacific Grove, California (LA-UR-12-21782)
M. A. Heroux, J. Mohd-Yusof, D. Richards, A. Herdman and J. Luitjens, “Mantevo Suite 1.0”, R&D 100 Awards Submission, 2013 (LA-UR-13-22019)
“CoMD: Classical Molecular Dynamics Proxy Application, Version 1.0”