5 Advanced Topics
When we want to use MD simulations in research, efficiency is often of great importance. Several ideas for improving the efficiency of your MD code are given in this chapter.

5.1 The bottleneck is FORCE

As already explained, a typical MD program is composed of several elements:

  Initialize
  Force
  Move
  Data accumulation
  Finalize

Figure 5-25: Typical structure of an MD simulation program.

Two components, FORCE and MOVE, are essential. Roughly speaking,33) FORCE in its original form requires computational time proportional to N^2 (N: number of particles), while MOVE requires time proportional to N. Thus, as the system size increases, the efficiency of the FORCE part determines the simulation speed. There are several free tools to measure the "lap time" of each component, as shown in Fig. 5-26 for example.

Figure 5-26: Example of performance analysis for an MD code; taken from R. Nabika, Master Thesis, Kyoto Univ. (2011).

33) In this estimate of the computational time, we assume a two-body interaction. If you use a many-body model, the time increases enormously, of course.

5.2 From physical viewpoints

5.2.1 Potential cut-off

A simple idea based on a physical property of the particle interaction is the potential cut-off (or truncation). The interaction potential φ between particles normally converges to zero rapidly enough as the distance r becomes large, so we can truncate the potential at some distance rc.34) For LJ systems, rc = 3σ to 5σ is normally adopted. A pseudo-code is shown in Fig. 5-27. Note, however, that the truncation itself does not save much time, because the full double loop still remains, in which the distance is calculated for every pair of particles. Two algorithms are often adopted for further improvement.

for (i=1;i<NATOM;i++) {
  for (j=0;j<i;j++) {
    dx=posx[i]-posx[j];
    dy=posy[i]-posy[j];
    dz=posz[i]-posz[j];
    r2= dx*dx +dy*dy +dz*dz;
    if (r2 < RCUT*RCUT) {
      /* calculation of force and potential energy */
    }
  }
}

   rc/σ     φ(rc)/ε
   2.5     -0.016316891136
   3.0     -0.005479441744
   3.5     -0.002174780392
   4.0     -0.000976324081
   5.0     -0.000255983616
   6.0     -0.000085732044
   7.0     -0.000033999150

Figure 5-27: Pseudo-code for potential truncation, and the truncation error [the value φ(rc)] of the LJ potential at r = rc. [The accompanying plot shows the potential u (in units of ε) and the force f (in units of ε/σ) as functions of r/σ.]

34) It is well known that a simple truncation does not work for the Coulombic interaction, because the integral of 1/r over infinite 3D space diverges:

  \int_{r_c}^{\infty} \frac{1}{r}\, 4\pi r^2\, dr \to +\infty          (5-41)

In that case, special treatments such as the Ewald summation method are required. The particle-cell algorithm and the reaction field method are also often used. See textbooks for details.
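A concrete filling-in of the "calculation of force and potential energy" step may help. The following is a minimal, self-contained sketch (reduced LJ units with ε = σ = 1; the function name lj_pair and the small demonstration in main are my own, not part of the lecture code). It evaluates the pair energy 4(r^-12 - r^-6) and the scalar factor used to accumulate force components, and prints φ(rc) for a few cut-off radii, which can be compared with the table in Fig. 5-27.

#include <stdio.h>

/* LJ pair energy and force factor from the squared distance r2.
   *eng receives 4(1/r^12 - 1/r^6); *frc receives (-du/dr)/r,
   so that the force components on particle i are frc*dx, frc*dy, frc*dz. */
void lj_pair(double r2, double *eng, double *frc)
{
    double ri06 = 1.0/(r2*r2*r2);
    double ri12 = ri06*ri06;
    *eng = 4*(ri12 - ri06);
    *frc = 4*(12*ri12 - 6*ri06)/r2;
}

int main(void)
{
    double rc[3] = {2.5, 3.5, 5.0};
    int k;
    for (k = 0; k < 3; k++) {
        double eng, frc;
        lj_pair(rc[k]*rc[k], &eng, &frc);
        printf("rc = %4.1f  phi(rc) = %15.12f\n", rc[k], eng);
    }
    return 0;
}

The same expressions for ri06, ri12, eng, and frc appear inside the cut-off test of the full sample code in Section 5.7.3.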
5.2.2 Book-keeping method

For a given particle i, the number of surrounding particles j which can possibly interact with i is fairly small in usual situations; for liquids and solids under normal conditions, the typical number is 50-200. In the book-keeping algorithm (帳簿法), we make a table of such j for each i and use the table in FORCE, as in the pseudo-code of Fig. 5-28. Of course, we have to update the table once in a while, because particles move and can leave or enter the interaction range. The update does take some additional time, but the calculation load in FORCE is drastically reduced in general; with the book-keeping algorithm, the calculation time is expected to be proportional to N.

for (i=1;i<NATOM;i++) {
  for (jt=0;jt<itmax[i];jt++) {
    j=itable[i][jt];
    dx=posx[i]-posx[j];
    dy=posy[i]-posy[j];
    dz=posz[i]-posz[j];
    r2= dx*dx +dy*dy +dz*dz;
    if (r2 < RCUT*RCUT) {
      /* calculation of force and potential energy */
    }
  }
}

Figure 5-28: Concept and a pseudo-code of the book-keeping method. The table radius rtbl is taken somewhat larger than the cut-off radius rcut.

5.2.3 Cell-division method

For a very large system, the cut-off distance rc can be much smaller than the cell size L. In this situation, the calculations in FORCE are very much "localized", i.e., particles in distant zones of the cell do not interact with each other at all. Thus we can divide the simulation cell into small "sub-cells" of size roughly rc, and sum up the interactions within each sub-cell and between it and its nearest-neighbor sub-cells. This is called the cell-division algorithm (セル分割法). We need the cell-division procedure every step, but the load in FORCE decreases, becoming almost proportional to N.

for (icell=0;icell<NUMCELL;icell++) {
  for (it=0;it<ntmax[icell];it++) {
    i=ntable[icell][it];
    // Within the same sub-cell
    for (jt=0;jt<it;jt++) {
      j=ntable[icell][jt];
      dx=posx[i]-posx[j];
      dy=posy[i]-posy[j];
      dz=posz[i]-posz[j];
      r2= dx*dx +dy*dy +dz*dz;
      if (r2 < RCUT*RCUT) {
        /* calculation of force and potential energy */
      }
    }
    // Between neighboring sub-cells (13 in the 3D case)
    for (jc=0;jc<13;jc++) {
      jcell=neighborcell[icell][jc];
      for (jt=0;jt<ntmax[jcell];jt++) {
        j=ntable[jcell][jt];
        dx=posx[i]-posx[j];
        dy=posy[i]-posy[j];
        dz=posz[i]-posz[j];
        r2= dx*dx +dy*dy +dz*dz;
        if (r2 < RCUT*RCUT) {
          /* calculation of force and potential energy */
        }
      }
    }
  }
}

Figure 5-29: Concept and a pseudo-code of the cell-division method; each sub-cell has a size of roughly rcut.

Figure 5-30: Example of the simulation time with the cell-division method [time in seconds per 100 steps versus number of atoms]; an LJ system with cut-off distance rc = 3.5 σ and number density ρ = 1.0. Note that the time is almost proportional to N.

An example is shown in Fig. 5-30, which shows that the required time per 100 steps on an Intel Core2Quad (2.66 GHz) is indeed proportional to the number of particles, as expected. Of course, the actual time strongly depends on the simulation conditions, especially the number density (i.e., the number of neighboring particles) and the cut-off distance.
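The only new ingredient of the cell-division method is the bookkeeping of which particles sit in which sub-cell. Below is a minimal sketch of that step (the array names ntable and ntmax follow the pseudo-code above; everything else, including the demo positions in main, is my own assumption and not code from the lecture). Each particle's sub-cell index is obtained by integer division of its coordinates; building the 13-neighbor list neighborcell is omitted here.

#include <stdio.h>

#define NUM_ATOM   1000
#define NCX        5              /* sub-cells per edge (box length / rc, roughly) */
#define NUMCELL    (NCX*NCX*NCX)
#define MAXPERCELL 200            /* assumed upper bound of particles per sub-cell */

double posx[NUM_ATOM], posy[NUM_ATOM], posz[NUM_ATOM];
double CELL_X = 10.0, CELL_Y = 10.0, CELL_Z = 10.0;   /* box lengths (example values) */

int ntable[NUMCELL][MAXPERCELL];  /* particle indices stored in each sub-cell */
int ntmax[NUMCELL];               /* number of particles in each sub-cell     */

/* Rebuild the sub-cell tables from the current positions. */
void make_cell_table(void)
{
    int i, icell;
    for (icell = 0; icell < NUMCELL; icell++) ntmax[icell] = 0;
    for (i = 0; i < NUM_ATOM; i++) {
        int ix = (int)(posx[i]/CELL_X*NCX);
        int iy = (int)(posy[i]/CELL_Y*NCX);
        int iz = (int)(posz[i]/CELL_Z*NCX);
        if (ix >= NCX) ix = NCX-1;     /* guard for particles exactly on the boundary */
        if (iy >= NCX) iy = NCX-1;
        if (iz >= NCX) iz = NCX-1;
        icell = (ix*NCX + iy)*NCX + iz;
        ntable[icell][ntmax[icell]++] = i;
    }
}

int main(void)
{
    int i;
    for (i = 0; i < NUM_ATOM; i++) {   /* arbitrary demo positions inside the box */
        posx[i] = (i % 100) * 0.1;
        posy[i] = ((i/10) % 100) * 0.1;
        posz[i] = ((i/100) % 100) * 0.1;
    }
    make_cell_table();
    printf("particles in sub-cell 0: %d\n", ntmax[0]);
    return 0;
}

In a real code this table rebuild is called once per step (or every few steps), and its cost grows only linearly with N, which is why the overall time in Fig. 5-30 stays almost proportional to N.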
5.3 From mathematical viewpoints

A larger ∆t is preferable for efficient calculations, of course. To carry out a stable simulation with sufficient accuracy, various techniques have been proposed.

5.3.1 Systematic time integration based on symplectic transformation

The equations of motion for a given Hamiltonian H(r, p) are

  \frac{d}{dt} q = \frac{\partial H}{\partial p}, \qquad
  \frac{d}{dt} p = -\frac{\partial H}{\partial q}          (5-42)

In a vector formulation,

  \frac{d}{dt}\begin{pmatrix} q \\ p \end{pmatrix}
  = \left[ \frac{\partial H}{\partial p}\frac{\partial}{\partial q}
         - \frac{\partial H}{\partial q}\frac{\partial}{\partial p} \right]
    \cdot \begin{pmatrix} q \\ p \end{pmatrix}
  \equiv \left[ L_k + L_u \right] \cdot \begin{pmatrix} q \\ p \end{pmatrix}          (5-43)

Here L_k and L_u are operators35) 36) in the phase space (位相空間), defined as

  L_k \equiv \frac{\partial H}{\partial p}\frac{\partial}{\partial q}          (5-44)
  L_u \equiv -\frac{\partial H}{\partial q}\frac{\partial}{\partial p}          (5-45)

Note that these are non-commutative (非可換), i.e.,

  L_k L_u \neq L_u L_k          (5-46)

A formal solution of Eq. (5-43) is then

  \begin{pmatrix} q(\Delta t) \\ p(\Delta t) \end{pmatrix}
  = \exp\left[ \Delta t \, (L_k + L_u) \right]
    \begin{pmatrix} q(0) \\ p(0) \end{pmatrix}          (5-47)

Expanding the exponential operator on the right-hand side of Eq. (5-47) in terms of ∆t and truncating it at some appropriate order, we obtain a series of symplectic forms37) of numerical integration. For that purpose, note that the exponential of the phase-space operators is defined by the series

  \exp\left[ \Delta t \, (L_k + L_u) \right]
  = 1 + \Delta t \, (L_k + L_u)
      + \frac{1}{2!} (\Delta t)^2 (L_k + L_u)^2
      + \frac{1}{3!} (\Delta t)^3 (L_k + L_u)^3 + \cdots          (5-48)

35) L is sometimes called a "Liouville operator" (リウヴィユ演算子); refer to textbooks on analytical mechanics.
36) Joseph Liouville (1809-1882), French physicist and mathematician.
37) The word "symplectic" was introduced by Hermann Weyl (1885-1955, German mathematician and theoretical physicist) and means "complex": "complex" comes from a Latin word, while "symplectic" comes from the corresponding ancient Greek. Today "symplectic" is used mainly in mathematics, as in symplectic manifold and symplectic geometry. (Wikipedia)

1. First-Order Symplectic Form

When the expansion is truncated at order (∆t)^1, we have

  \exp\left[ \Delta t \, (L_k + L_u) \right]
  = \exp\left[ \Delta t \, L_u \right] \exp\left[ \Delta t \, L_k \right] + O(\Delta t^2)          (5-49)

which is the first-order symplectic form. The right-hand side is a product of simple integrators, thus

  \begin{pmatrix} q(\Delta t) \\ p(\Delta t) \end{pmatrix}
  = \exp\left[ \Delta t \, L_u \right] \exp\left[ \Delta t \, L_k \right]
    \begin{pmatrix} q(0) \\ p(0) \end{pmatrix}
  \simeq \exp\left[ \Delta t \, L_u \right]
    \begin{pmatrix} q(0) + \Delta t \, p(0)/m \\ p(0) \end{pmatrix}
  = \begin{pmatrix} q(0) + \Delta t \, p(0)/m \\ p(0) + \Delta t \, F(\Delta t) \end{pmatrix}          (5-50)

This is essentially equivalent to the naive difference form (the Euler formula),

  q(\Delta t) = q(0) + \Delta t \, \frac{p(0)}{m}, \qquad
  p(\Delta t) = p(0) + \Delta t \, F(0)          (5-51)

2. Second-Order Symplectic Form

The next order has some arbitrariness, but an often-used form is

  \exp\left[ \Delta t \, (L_k + L_u) \right]
  = \exp\left[ \frac{\Delta t}{2} L_u \right] \exp\left[ \Delta t \, L_k \right]
    \exp\left[ \frac{\Delta t}{2} L_u \right] + O(\Delta t^3)          (5-52)

Thus we obtain

  \begin{pmatrix} q(\Delta t) \\ p(\Delta t) \end{pmatrix}
  = \exp\left[ \frac{\Delta t}{2} L_u \right] \exp\left[ \Delta t \, L_k \right]
    \exp\left[ \frac{\Delta t}{2} L_u \right]
    \begin{pmatrix} q(0) \\ p(0) \end{pmatrix}
  \simeq \exp\left[ \frac{\Delta t}{2} L_u \right] \exp\left[ \Delta t \, L_k \right]
    \begin{pmatrix} q(0) \\ p(0) + \frac{\Delta t}{2} F(0) \end{pmatrix}
  = \exp\left[ \frac{\Delta t}{2} L_u \right]
    \begin{pmatrix} q(0) + \Delta t \left[ p(0) + \frac{\Delta t}{2} F(0) \right]/m \\
                    p(0) + \frac{\Delta t}{2} F(0) \end{pmatrix}
  = \begin{pmatrix} q(0) + \Delta t \left[ p(0) + \frac{\Delta t}{2} F(0) \right]/m \\
                    p(0) + \frac{\Delta t}{2} F(0) + \frac{\Delta t}{2} F(\Delta t) \end{pmatrix}          (5-53)

This is mathematically equivalent to the Verlet algorithm and also to the leap-frog algorithm.

3. Higher-Order Symplectic Forms

See textbooks.

The symplectic forms (of second order or higher) are known to be very stable even for larger ∆t, as shown in Fig. 5-31. This kind of acceleration technique is widely used in recent multi-scale and multi-physics simulations; you should study it more closely if you plan to do such simulations.

Figure 5-31: Comparison among algorithms for numerical integration. [Source: 上田顕, 分子シミュレーション (裳華房, 2003); the reference [9.14] cited there is 岡部恒康, 物性研究 74-1 (2004).]
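Because the second-order form (5-53) updates p by half a step, then q by a full step, then p by the remaining half step, it can be coded directly as the familiar velocity-Verlet scheme. The following is a minimal one-dimensional sketch of one such time step (my own illustration with a harmonic force F = -q and m = 1, not part of the lecture notes).

#include <stdio.h>

#define DT    0.01
#define NSTEP 1000

/* force for a 1D harmonic oscillator, F = -k q with k = m = 1 */
double force(double q) { return -q; }

int main(void)
{
    double q = 1.0, p = 0.0;       /* initial condition */
    int step;
    for (step = 0; step < NSTEP; step++) {
        p += 0.5*DT*force(q);      /* half kick:  p <- p + (dt/2) F(q(0))  */
        q += DT*p;                 /* drift:      q <- q + dt p/m          */
        p += 0.5*DT*force(q);      /* half kick:  p <- p + (dt/2) F(q(dt)) */
    }
    /* the energy p^2/2 + q^2/2 should stay close to its initial value 0.5 */
    printf("q = %f  p = %f  E = %f\n", q, p, 0.5*(p*p + q*q));
    return 0;
}

The three lines inside the loop correspond one-to-one to the three operator factors in Eq. (5-52), which is why the energy drift stays bounded even for fairly large ∆t.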
5.3.2 Multi time-step MD

If your target molecules contain several kinds of degrees of freedom with a wide range of "characteristic times", chances are that you can separate them and apply a different time step ∆t to each. For example, a water molecule has six degrees of freedom as a rigid rotor, as shown in Fig. 5-32. Under normal liquid conditions, the characteristic times are

  τ ~ 20 fs : translational motions (1 fs = 10^-15 s)
  τ ~ 10 fs : rotational motions around the y- and z-axes
  τ ~  2 fs : rotational motions around the x-axis

Figure 5-32: A model of a water molecule. As a rigid rotor, it has six degrees of freedom: three translations and three rotations.

For standard MD simulations, a short time step of ∆t ≈ 0.5 fs is often used in order to trace the fastest rotational motion with some accuracy. The multi time-step method allows you to use a longer ∆t; for example, you may use 5 fs for the translational motions, which means ten updates of the rotational motions during a single update of the translational motions.

5.4 From the computer's viewpoint

Of course, using a faster computer is a quick solution to the speed problem. There are several points to consider, however:

1. High-speed hardware (e.g., vector processors, high-speed memory, networks with broad bandwidth) is in general very expensive.
2. The growth of single-CPU speed may slow down in the future; see Fig. 5-33.
3. The architecture of CPUs is becoming complicated, e.g., dual-, quad-, and multi-core.

Among these topics, an introduction to "parallel computing" (並列計算) is given in the following. This was quite a challenging field a decade ago. Owing to the recent remarkable progress of hardware and software, however, molecular simulation in parallel computing environments has become more and more popular.

Figure 5-33: Moore's law38) (Wikipedia). The well-known Moore's law states that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. The speed of a single CPU is expected to be roughly proportional to this number.

38) (From Wikipedia) For the thirty-fifth anniversary issue of Electronics magazine, published on April 19, 1965, Gordon E. Moore, then director of research and development (R&D) at Fairchild Semiconductor, was asked to predict what was going to happen in the semiconductor components industry over the next ten years. His response was a brief article entitled "Cramming more components onto integrated circuits." In it he speculated that by 1975 it would be possible to put as many as 65,000 components on a single quarter-inch semiconductor: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years."

5.5 Parallel computing 1: Hardware side

There are two types of parallel computers, the distributed-memory type (分散メモリ型) and the shared-memory type (共有メモリ型), shown schematically in Fig. 5-34. Parallel computer systems with distributed memory have a long history and usually show good cost performance, but you have to take care of the data flow yourself. On the other hand, shared-memory systems normally need special hardware for high-speed data transfer and have been relatively expensive. Recently, however, multi-core processors have become popular, and you can personally try parallel computing on a small-scale shared-memory system.

Figure 5-34: Schematic view of (left) a distributed-memory system, in which each CPU has its own local memory, and (right) a shared-memory system, in which several CPUs access a common main memory through a shared data bus.

5.5.1 Examples of distributed-memory systems

PC clusters are the best-known example of distributed-memory systems. Most recently developed supercomputers are also based on the distributed-memory architecture; all supercomputers listed in Table 5-1 have a large number of cores/nodes. The essential point (or the bottleneck) is the speed of the network. Ethernet is the most common choice and its speed has reached 10 Gbps (gigabit per second), but its performance is still insufficient. For PC clusters, Myrinet39) and InfiniBand40) are often adopted.
39) (From the Japanese Wikipedia) Myrinet is a network system made by Myricom. The latest Myri-10G Myrinet has a throughput of 20 Gbps in full-duplex communication, and because of its low communication latency and low error rate, Myrinet is widely used as a network device for cluster computing.
40) (From the Japanese Wikipedia) InfiniBand is a high-speed I/O bus architecture and interconnect for mission-critical and HPC servers/clusters with very high RAS (reliability, availability, serviceability). A single InfiniBand lane has a transfer rate of up to 2.5 Gbps in each direction; 4 lanes (4X) or 12 lanes (12X) can be bundled to achieve higher rates. With QDR (Quad Data Rate) transfers on 12X, the rate becomes 120 Gbps (raw), or 96 Gbps of data transfer.
41) (From Wikipedia) LINPACK is a software library for numerical linear algebra, developed as a FORTRAN library at Argonne National Laboratory in the U.S. It was designed for the supercomputers of the 1970s and early 1980s and was later superseded by the more refined library LAPACK. The LINPACK benchmark is a benchmark program based on LINPACK that evaluates the floating-point performance of a system. Devised by Jack Dongarra (an American computer scientist), it measures how fast a computer solves an n-dimensional linear system A x = b, a task common in science and engineering. The latest version of the benchmark is used by TOP500 as the performance figure for ranking the world's fastest computers. Representing every aspect of a computer system's performance with a single number is of course impossible; it is a simplification. What the LINPACK benchmark measures is the number of 64-bit floating-point operations (usually additions and multiplications) performed per second (FLOPS). The performance of a real application is likely to be much lower than the peak performance measured with a well-tuned LINPACK benchmark.

Table 5-1: From a recent Top500 list (November 2014); the first five systems are reproduced below. A European organization has provided a ranking of the world's supercomputers twice a year, based on the LINPACK benchmark.41) For details, see http://www.top500.org/

  Rank  Name                    Computer / Processor                                        Country        Total Cores  Rmax [GFLOPS]
  1     Tianhe-2 (MilkyWay-2)   TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 + Xeon Phi 31S1P   China          3,120,000    33,862,700
  2     Titan                   Cray XK7, Opteron 6274 16C 2.2 GHz + NVIDIA K20x            United States    560,640    17,590,000
  3     Sequoia                 BlueGene/Q, Power BQC 16C 1.60 GHz                          United States  1,572,864    17,173,224
  4     K computer              SPARC64 VIIIfx 2.0 GHz                                      Japan            705,024    10,510,000
  5     Mira                    BlueGene/Q, Power BQC 16C 1.60 GHz                          United States    786,432     8,586,612

(The original table lists the top 50 systems, with additional columns for site, year, power, MFLOPS/W, architecture, processor, and accelerator/co-processor; see the Top500 site for the full data.)
Note: FLOPS stands for FLoating-point Operations Per Second (浮動小数点演算回数) and is used as an index of computing speed. It is reported that a recent Core 2 Quad achieves about 50 GFLOPS, and the PS3 (Cell) about 2 TFLOPS as a whole system. Developing PFLOPS (= 10^15 FLOPS) computers has become a popular project worldwide.

5.5.2 Examples of shared-memory systems

The operation speed of a single CPU essentially depends on the clock frequency (or clock rate42)), which is determined by how fast the electronic state of each element can change. As the clock frequency increases, the energy dissipation (and heat emission) can no longer be neglected, and the speed of a single chip seems to have saturated recently. Thus the idea of the multi-core CPU has become popular, in which several cores share a single in-processor memory.

42) (Wikipedia) Historical milestones and current records of clock rate: The first general-purpose computer, the ENIAC, used a 100 kHz clock in its cycling unit; as each instruction took 20 cycles, it had an instruction rate of 5 kHz. The first commercial PC, the Altair 8800 (by MITS), used an Intel 8080 CPU with a clock rate of 2 MHz. The original IBM PC (c. 1981) had a clock rate of 4.77 MHz. In 1992, both Hewlett-Packard and Digital Equipment Corporation broke the difficult 100 MHz limit with RISC techniques in the PA-7100 and the AXP 21064 DEC Alpha, respectively. In 1995, Intel's Pentium chip ran at 100 MHz. On March 6, 2000, AMD reached the 1 GHz milestone a few months ahead of Intel. In 2002, an Intel Pentium 4 model was introduced as the first CPU with a clock rate of 3 GHz. Since then, the clock rate of production processors has increased much more slowly, with performance improvements coming from other design changes. As of 2011, the Guinness record for the fastest CPU was held by AMD with a Bulldozer-based FX chip "overclocked" to 8.308 GHz; it has since been superseded by the next generation of AMD's FX chips, "Piledriver", with a clock rate of 8.429 GHz. As of mid-2013, the highest clock rate on a production processor was the IBM zEC12, clocked at 5.5 GHz, released in August 2012.

Figure 5-35: Concept of the Next-Generation Supercomputer. Data are taken from http://www.nsc.riken.jp/project/spec.html

(From the Japanese Wikipedia) A multi-core processor packs several processor cores into a single processor package and is one form of multiprocessing. Although it looks like a single processor from the outside, it is recognized internally as multiple processors; the aim is to raise the processing power of the whole chip, mainly in environments that perform parallel processing. A package with two cores is called dual-core, with four cores quad-core, with six cores hexa-core, and with eight cores octal-core or octa-core. Some high-performance special-purpose processors have several tens of cores and are called many-core; a conventional processor with a single core is called single-core in contrast. [Figure: conceptual diagram of a single-die multi-core chip, in which two processor cores each have their own level-1 cache and share a level-2 cache.] Examples: Intel Xeon: 4 or 8 cores (twice as many threads); Intel Core i5/i7: 4 or 8; AMD Opteron: 2, 4, 6, 8, or 12 cores.

In general, the shared-memory structure requires specially designed hardware, and a large-scale shared-memory system can be very expensive. From the programming viewpoint, however, all nodes have the same data (or can easily access the same data), and parallel codes are developed more easily than for distributed systems.

5.5.3 Homogeneous vs. heterogeneous

Various multi-core architectures exist; "homogeneous" processors consist of several cores of the same architecture, while "heterogeneous" ones contain cores of different types.
Here are three examples:

1. The Cell processor, developed by IBM, Sony, and Toshiba.
2. Graphics processing units (GPU), e.g., GeForce by NVIDIA.
3. Xeon Phi, by Intel.

(From the Japanese Wikipedia) The Cell is classified as a multi-core processor, with nine cores on a single chip: a heterogeneous combination of one general-purpose processor core and eight simple processor cores. The general-purpose core, called the PowerPC Processor Element (PPE), handles control, while the eight cores, called Synergistic Processor Elements (SPE), handle computation. Unlike conventional PC microprocessors such as the PowerPC G5, Pentium 4, or Athlon 64, which have sophisticated out-of-order execution and branch prediction, neither the PPE nor the SPE carries complex scheduling machinery for reordering instructions; the cores are kept simple, which allows high clock rates. Consequently its integer performance on code with complicated conditional branches is inferior to that of PC processors, but it has powerful, highly parallel arithmetic capability, and in numerical analysis, simulation, video, and audio processing the Cell shows its real performance when its cores run in parallel.

Figure: Cell architecture, taken from http://www.ibm.com/developerworks/power/cell/documents.html

(Excerpt from Wikipedia) GPGPU. GPGPU (general-purpose computing on graphics processing units) is the technique of applying the computational resources of a GPU to purposes other than image processing.

Overview: A GPU is an arithmetic unit specialized for image processing, usually an auxiliary IC dedicated to generating video signals under the control of the CPU. Real-time generation of moving images demands heavy computation, but most of it is a repetition of simple, well-defined operations and is therefore well suited to hardware implementation; several semiconductor makers produce series of ICs with fast memory interfaces and high graphics performance. Since the mid-1990s, 3D rendering performance has improved dramatically, and GPUs have increasingly taken on the character of SIMD engines centered on matrix operations. The concept of GPGPU is to use this capability not only for graphics rendering but also for other numerical computations. The floating-point performance of GPUs exceeded 100 GFLOPS around 2005, while CPU floating-point performance leveled off at a few tens of GFLOPS as of 2007. Because of their simple structure, GPUs are efficient at floating-point operations, and with the wide bandwidth to their locally attached memory they are cheaper than CPUs for the same performance and improving faster. Since the appearance of DirectX 9.0, NVIDIA's integrated development environment CUDA for GPGPU and AMD's ATI Stream have broadened the use of GPGPU, and NVIDIA has introduced Tesla, an HPC-oriented GPGPU based on its G80 GPU, aiming to replace conventional vector machines.

Characteristics and issues. Conditional branching: a GPU has many arithmetic units called shader processors, grouped into clusters. Each cluster has only one instruction unit; the shader processors in a cluster are given different data and execute the same instruction on them at once. This SIMD (single instruction, multiple data) processing is effective for 3D and multimedia work, but when a conditional branch appears in the instruction stream the overhead grows and efficiency drops sharply. In principle, programs with a simple algorithmic structure, optimized for parallel data processing, draw out the strengths of GPGPU best. Floating-point arithmetic and local memory: starting with the NVIDIA Tesla 20 series released in 2010, double precision runs at half the speed of single precision, but GPUs are generally weak at double-precision arithmetic, because most image computations need only integer or single-precision operations. A GPU is basically specialized in generating two-dimensional video frames in real time by processing simple array-structured data with roughly single-precision arithmetic, so high performance cannot be expected for other workloads; with a repurposed graphics IC, scientific computations that need double precision or have low locality do not perform particularly well. GPGPU-specific ICs newly developed from GPUs support double precision and larger memory spaces, and these are expected to be used in a wide range of scientific computing.

Suitable fields: applications expected to benefit from GPGPU include high-performance clusters (HPC), physics simulation, database processing, Fourier transforms, the lattice Boltzmann method, audio processing, cryptanalysis, digital image processing, market analysis, video transcoding, astronomical simulation, ray tracing, machine learning, molecular dynamics, optimization problems, fluid and climate simulation, and constructive solid geometry. An example product is NVIDIA's GeForce GTX 590. The programming model is still somewhat complicated, and extracting performance requires knowledge of GPUs and parallel processing. With the appearance of OpenCL, an environment in which programmers can exploit multi-core and GPGPU computing power seamlessly is being put in place, and wide adoption of GPGPU-ready libraries is expected.
(From the Japanese Wikipedia) Xeon Phi is the brand name of the MIC (Many Integrated Core) architecture derived from Larrabee (an Intel internal code name): a board for parallel computing that carries x86-compatible coprocessors. The first products target the HPC field, but future ones are also intended for corporate data centers and workstations. The biggest selling point of the Xeon Phi coprocessor is that existing applications for the IA architecture can be used as they are. It can also run a Linux-based OS independent of the host OS, and it supports 512-bit SIMD instructions, improving performance by processing multiple data items with a single instruction. The product is supplied as a PCI Express expansion card and is a pure arithmetic device, with no rasterizer, video playback engine, or display output; its competitor is NVIDIA's GPGPU, the Tesla. The strength of the Xeon Phi is that, because its cores are x86, programs written for x86 CPUs can be used almost unchanged. Moreover, although it is connected via PCI Express, the board internally runs a stand-alone Linux into which you can log in from the host via SSH; this is impossible on a GPGPU, which cannot run an independent OS.

(From the web page of a computer vendor, http://www.hpc.co.jp/xeon phi product.html) Background of the product: high-performance computing has long relied on cluster machines built from servers carrying two or four CPUs of 6 to 8 cores each. Such systems are relatively cheap to procure as a whole, but they tend to require large floor space and power consumption. Highly integrated parallel computing devices have therefore attracted attention in recent years, led by NVIDIA's HPC-oriented Tesla graphics cards and their programming environment CUDA. In 2012 Intel announced the Phi series of HPC coprocessors; the Phi is built from processors based on the same x86 architecture as conventional Xeon CPUs, so development can proceed with the programming languages, parallel models, and development tools written for x86. Specifications of the Phi 5110P: 60 cores; core frequency 1053 MHz; single-precision floating-point performance 2.02 TFLOPS; double-precision floating-point performance 1.01 TFLOPS.

5.6 Parallel computing 2: Software side

Most recent operating systems (Linux, Windows, etc.) support parallel coding. For ordinary users, it is convenient to utilize "interface" software (sometimes called tools, middleware, or libraries), such as:

• pthread: the POSIX-standard thread library. Windows versions also exist.
• PVM: Parallel Virtual Machine. It works on various OSs.
• MPI: Message Passing Interface. It also works on various OSs. It is library-based and works independently of the programming language. A free implementation, MPICH, is very popular.
• OpenMP: mainly developed for shared-memory computers. Parallelization is realized by inserting directives (指示行), so the same code can be used in both parallel and non-parallel environments.
• CUDA, OpenCL: CUDA (Compute Unified Device Architecture) is a development environment for GPUs provided by NVIDIA. OpenCL (Open Computing Language) is a framework for parallel computing, especially for multi-core CPUs, GPUs, and Cell processors; OpenCL 1.2 was the most recent version as of Nov. 15, 2011. NVIDIA recently announced support for OpenCL within CUDA.

Processing flow of CUDA (from Wikipedia):
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to process the data.
3. The GPU fetches the necessary data and executes in parallel on its cores.
4. Copy the results from GPU memory back to main memory.

(From the Japanese Wikipedia) OpenMP is a standardized infrastructure for using parallel computing environments, mainly on shared-memory parallel machines. Whereas MPI requires message exchange to be written explicitly in the program, OpenMP parallelizes by inserting directives that are simply ignored in environments where OpenMP is unavailable, so nearly identical source code can be used in parallel and non-parallel environments. Compared with MPI, OpenMP lets different threads refer to the same data at the same address, while MPI must exchange messages explicitly; in an SMP environment this avoids moving large data, so high efficiency can be expected. However, the parallelization efficiency depends on the compiler, so tuning may not improve performance as much as with MPI. Also, OpenMP tends to have lower memory-access locality than MPI, so for programs with frequent memory access MPI is often faster. Standardization currently covers FORTRAN and C/C++.
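Of the libraries listed above, MPI follows a message-passing model in which each process runs its own copy of the program and shares no variables with the others. For comparison with the OpenMP example in the next subsection, here is a minimal MPI "hello" sketch (my own illustration; it assumes an MPI implementation such as MPICH is installed, with its usual compiler wrapper mpicc and launcher mpiexec).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start the MPI environment     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my process number (0, 1, ...) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes     */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                          /* shut down MPI                 */
    return 0;
}

Typically this would be compiled with something like "mpicc hello_mpi.c" and run with "mpiexec -n 4 ./a.out", printing one line per process; any data exchange between the processes must be done with explicit MPI calls such as MPI_Send and MPI_Recv.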
5.6.1 Let's try OpenMP: a brief introduction for beginners

This is for (Windows + multicore) users.

1. Construct an OpenMP environment for free by using Visual Studio Express Edition.

(a) Find the site for Visual Studio Express. A search with keywords such as "visual studio express 2008" will probably work. Download Setup.exe through the Web setup. I recommend installing Visual Studio 2008 or the latest Visual Studio 2013, instead of the 2010 or 2012 version; I do not know why, but the OpenMP environment seems not to be included in the 2010 Express edition. The following is an example of the installation process; the details may depend on your system.

(b) Install the necessary components. You probably do not need the huge amount of documentation and samples.

(c) Wait several minutes until the installation is finished. Now you will have a command shell to compile C programs.

#include <omp.h>
#include <stdio.h>
int main() {
  #pragma omp parallel
  printf("Hello from thread %d, nthreads %d\n",
         omp_get_thread_num(), omp_get_num_threads());
  return 0;
}

Figure 5-36: Sample program hello.cpp using OpenMP. This test code was taken from http://openmp.org/wp/openmp-compilers/

(d) Just compile and execute the test code (Fig. 5-36). You have to compile this program (hello.cpp) with the /openmp option, as

  >cd (the folder containing the program file)
  >cl /openmp hello.cpp
  ...
  >hello.exe
  Hello from thread 0, nthreads 4
  Hello from thread 1, nthreads 4
  Hello from thread 2, nthreads 4
  Hello from thread 3, nthreads 4

Congratulations! You certainly have an OpenMP environment. If you get nthreads 4, you must have a quad-core system.

2. A simple example of load-distribution parallelization: Consider an infinite series (無限級数), for example

  I = \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = 1 - \frac{1}{3} + \frac{1}{5} - \cdots = \frac{\pi}{4}          (5-54)

A simple code is given in Fig. 5-37. A naive parallelization for dual nodes is shown in Fig. 5-38, and results are shown in Fig. 5-39. The results (I = 3.141592...) are accurate enough, but unfortunately the computation time is more than doubled; this is probably due to the overhead of parallelization.

#include <stdio.h>
#include <time.h>
#define NMAX 200000000
int main( )
{
  int i, sg;
  double sum=0.0;
  clock_t t1,t2;
  sg=1;
  t1=clock( );
  for (i=0;i<NMAX;i++) {
    sum += sg/(double)(2*i+1);
    sg*=-1;
  }
  t2=clock( );
  printf("I = %15.12f\n",4*sum);
  printf("%10.5f sec on single node\n",(double)(t2-t1)/CLOCKS_PER_SEC);
  return 0;
}

Figure 5-37: Original code series1.cpp for the series calculation. The header time.h is included for time measurement.

#include <stdio.h>
#include <time.h>
#include <omp.h>
#define NMAX 200000000
int main( )
{
  int i,j;
  double sum1,sum2;
  clock_t t1,t2;
  t1=clock( );
  sum1=sum2=0.0;
  #pragma omp parallel sections
  {
    #pragma omp section
    for (i=0;i<NMAX;i+=2) sum1 += 1/(double)(2*i+1);
    #pragma omp section
    for (j=1;j<NMAX;j+=2) sum2 -= 1/(double)(2*j+1);
  }
  t2=clock( );
  printf("I = %15.12f\n",4*(sum1+sum2));
  printf("%10.5f sec on single node\n",(double)(t2-t1)/CLOCKS_PER_SEC);
  return 0;
}

Figure 5-38: A code with manual load division, series2.cpp.

Figure 5-39: Example of executing the parallel code.

3. A slightly more elaborate version: Instead of manual parallelization, we can utilize the #pragma omp parallel for directive,43) which automatically parallelizes the loop, depending on the environment. A sample code and its results are shown in Figs. 5-40 and 5-41. The computation time is significantly reduced!

43) A directive is an instruction line given to the compiler; in the C language it starts with #. The #include and #define lines we have used so far are in fact also directives.

#include <stdio.h>
#include <time.h>
#include <omp.h>
#define NMAX 200000000
int main( )
{
  int i;
  double sum=0.0;
  clock_t t1,t2;
  t1=clock( );
  #pragma omp parallel for reduction(+:sum)
  for (i=0;i<NMAX;i++) {
    int sg=1;
    if (i%2==1) sg=-1;
    sum += sg/(double)(2*i+1);
  }
  t2=clock( );
  printf("I = %15.12f\n",4*sum);
  printf("%10.5f sec on single node\n",(double)(t2-t1)/CLOCKS_PER_SEC);
  return 0;
}

Figure 5-40: series3.cpp using the parallel for directive.

Figure 5-41: Results of series3.cpp.

4. Of course, you can do much more with OpenMP. For more information, see references and various online manuals, for example 菅原清文「C/C++ プログラマーのための OpenMP」(カットシステム, 2009).
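A practical note on the timings above: on some platforms and runtimes, clock() returns the accumulated CPU time of all threads rather than the elapsed wall-clock time, which can make a parallel run look slower than it really is. OpenMP provides omp_get_wtime() for wall-clock measurement; the sketch below (my own illustration, not part of the lecture material) times the same reduction loop both ways.

#include <stdio.h>
#include <time.h>
#include <omp.h>
#define NMAX 200000000

int main( )
{
  int i;
  double sum=0.0;
  clock_t c1,c2;
  double w1,w2;
  c1=clock( );
  w1=omp_get_wtime( );            /* wall-clock time in seconds */
  #pragma omp parallel for reduction(+:sum)
  for (i=0;i<NMAX;i++) {
    int sg = (i%2==1) ? -1 : 1;
    sum += sg/(double)(2*i+1);
  }
  c2=clock( );
  w2=omp_get_wtime( );
  printf("I = %15.12f\n",4*sum);
  printf("clock():         %10.5f sec\n",(double)(c2-c1)/CLOCKS_PER_SEC);
  printf("omp_get_wtime(): %10.5f sec\n",w2-w1);
  return 0;
}

If the two numbers differ by roughly the number of threads, the apparent slowdown is an artifact of the timer rather than a real loss of speed.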
5.7 Parallel Code of MD Simulation

Now let's turn our attention to our MD code. As the system size increases (e.g., N ≥ 10,000), you may need to execute your MD simulation on a multi-core PC.

5.7.1 Amdahl's law

In most cases, a complete parallelization of a simulation code is neither possible nor necessary. As described in § 2, a typical MD code has the following structure:

  Initialize          : negligible time
  Force calculation   : repeated many times, heavily time-consuming, ∝ N^α
  Update of (r, p)    : repeated many times, lightly time-consuming, ∝ N
  Statistics          : repeated many times
  Finalize            : negligible time

Thus we should focus first on the force part, and then on the move part. You should keep the so-called Amdahl's law (アムダールの法則)44) in mind.

Amdahl's law (from the Japanese Wikipedia): Let F be the fraction of the execution time spent in the sequential part that cannot be parallelized; the parallelizable part is then (1 - F), and the overall speed-up with N processors is

  \frac{1}{F + \dfrac{1-F}{N}}

In the limit N → ∞, the speed-up approaches 1/F. Indeed, once (1-F)/N becomes sufficiently small compared with F, increasing N no longer improves the performance (the price/performance ratio worsens). For example, if F is merely 10%, the speed-up saturates at 10 times that of a single processor no matter how large N is. For this reason, parallel computing is effective only when the number of processors is small, or when F is extremely small for the problem at hand (so-called embarrassingly parallel problems). Many programming techniques of parallel computing aim at making F as small as possible. It is often said that desktop computing does not benefit from multiprocessing because of Amdahl's law, and the introduction of multi-core microprocessors was once questioned, but Intel has pushed multi-core aggressively, expecting desktop applications to become more and more parallel.

Figure: When trying to improve performance with multiple processors, the gain is strongly limited by the fraction of the program that cannot be parallelized; for example, if half (0.5) of the program cannot run in parallel, the theoretical limit of the speed-up is 2 [the limit of 1/(0.5 + (1-0.5)/N) for large N].

44) (From Wikipedia) Also known as Amdahl's argument, the law is named after the computer architect Gene Amdahl (1922-, U.S.A.) and is used to find the maximum expected improvement of an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speed-up with multiple processors.

5.7.2 Particle parallel vs. Region parallel

As seen in the sample codes of the series calculation (series1.cpp and series2.cpp), parallelization at the data level is not very difficult. For MD this corresponds to a particle-parallel scheme, because the loop index in the force part is, in general, that of the particles. Another scheme is possible when one considers that the particles are distributed in a (sometimes huge) physical space. As shown in Fig. 5-42, particles in one "region" have essentially no interaction with those in a distant region, so the accumulation of forces can be executed independently. This is called a region-parallel scheme; it is very natural, applicable both on distributed-memory and on shared-memory systems, and generally highly efficient for large systems. The region boundaries, however, must be treated with care.

Figure 5-42: Idea of the region-parallel scheme.
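To make the particle-parallel idea concrete before the full listing in Section 5.7.3, here is a small self-contained toy program (my own sketch, not the lecture code): the outer particle loop is distributed among OpenMP threads, and each thread accumulates forces only onto its own particles i. Every pair is therefore visited twice, which wastes some work but avoids write conflicts on the force arrays; periodic boundaries are omitted.

#include <stdio.h>
#include <omp.h>

#define NL   12                 /* lattice points per edge         */
#define N    (NL*NL*NL)         /* number of particles             */
#define A    1.2                /* lattice spacing (LJ units)      */
#define RCUT 3.0                /* cut-off distance                */

double posx[N], posy[N], posz[N];
double frcx[N], frcy[N], frcz[N];

int main(void)
{
    int i, j, ix, iy, iz;
    double eng_pot = 0.0;

    i = 0;                                    /* particles on a simple cubic lattice */
    for (ix = 0; ix < NL; ix++)
      for (iy = 0; iy < NL; iy++)
        for (iz = 0; iz < NL; iz++) {
            posx[i] = A*ix; posy[i] = A*iy; posz[i] = A*iz;
            i++;
        }

    /* Particle-parallel force loop: the outer index i is distributed among threads. */
    #pragma omp parallel for private(j) reduction(+:eng_pot)
    for (i = 0; i < N; i++) {
        double fx = 0.0, fy = 0.0, fz = 0.0;
        for (j = 0; j < N; j++) {
            double dx, dy, dz, r2, ri06, ri12, frc;
            if (j == i) continue;
            dx = posx[i]-posx[j];
            dy = posy[i]-posy[j];
            dz = posz[i]-posz[j];
            r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < RCUT*RCUT) {
                ri06 = 1.0/(r2*r2*r2);
                ri12 = ri06*ri06;
                eng_pot += 0.5*4*(ri12-ri06);     /* factor 1/2: each pair counted twice */
                frc = 4*(12*ri12-6*ri06)/r2;
                fx += frc*dx; fy += frc*dy; fz += frc*dz;
            }
        }
        frcx[i] = fx; frcy[i] = fy; frcz[i] = fz;
    }
    printf("potential energy per particle = %f\n", eng_pot/N);
    return 0;
}

Parallelizing the outer loop gives each thread a grain of N inner iterations, which is usually coarser, and therefore cheaper in overhead, than the inner-loop parallelization used in the sample code that follows.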
5.7.3 Sample code

The following is a sample code with the "particle parallel" scheme using OpenMP, where I have parallelized the inner loop of the double for loop with #pragma omp directives (指示行).45) 46) For details of the #pragma omp directives, see references. Unfortunately, this code is not very efficient, probably because of the small grain size (粒度) of the inner loop. You can find a somewhat more elaborate version on my web page.

45) private lists the variables that are private to (independent among) the threads.
46) reduction lists the variables to be summed in the loop. The partial sums can be taken independently by the threads, but after the loop the total sum must be formed.

//
// A Simple MD Code for Lennard-Jones System
//   mass=1; epsilon=1; sigma=1;
//   Version 05: Potential Truncation at RCUT
//   Parallel with OpenMP
//
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>

#define NUM_LATTICE 20
#define NUM_ATOM (NUM_LATTICE*NUM_LATTICE*NUM_LATTICE)
#define TOTAL_STEP 100
#define SAVE_STEP 10

double DEL_T = 0.001;
double DENSITY = 1.0;
double CELL_X, CELL_Y, CELL_Z;
double RCUT = 4.0;

// Function Prototypes
void initial( );
void force( );
void move( );
void statistics(int);
void initplot( );
void finalplot( );
void scaling( );

double posx[NUM_ATOM], posy[NUM_ATOM], posz[NUM_ATOM];
double momx[NUM_ATOM], momy[NUM_ATOM], momz[NUM_ATOM];
double frcx[NUM_ATOM], frcy[NUM_ATOM], frcz[NUM_ATOM];
double eng_kin, eng_pot;
double t_target, tsum;
int ntcount;
int num_threads;
FILE *fgnuplot, *fsave;

int main( )
{
  int step;
  clock_t t1,t2;

  printf("Input Target Temperature :");
  scanf("%lf",&t_target);
  #pragma omp parallel
  { num_threads=omp_get_num_threads(); }
  printf("This code is working on %2d nodes\n",num_threads);
  initial( );
  initplot( );
  t1=clock( );
  for (step=0; step<=TOTAL_STEP; step++) {
    force( );
    move( );
    if (step%SAVE_STEP==0) { statistics(step); }
  }
  t2=clock( );
  printf("%10.5f sec on %3d nodes\n",(double)(t2-t1)/CLOCKS_PER_SEC,num_threads);
  finalplot( );
}

//------------------------------------------------
// Initialize Gnuplot Command File
//
void initplot( )
{
  fgnuplot=fopen("lj.plt","w");
  fprintf(fgnuplot,"set data st point\n");
  fprintf(fgnuplot,"set size square\n");
  fprintf(fgnuplot,"set xran [0:%lf]\n",CELL_X);
  fprintf(fgnuplot,"set yran [0:%lf]\n",CELL_Y);
}

//------------------------------------------------
// Finalize Gnuplot Command File
//
void finalplot( )
{
  fprintf(fgnuplot,"pause -1\n");
  fclose(fgnuplot);
}

//------------------------------------------------
// Make Simple Cubic Lattice with Random Momenta
//
void initial( )
{
  int i=0;
  int ix,iy,iz;
  double dx,dy,dz;

  CELL_X=pow(DENSITY, -1.0/3.0)*NUM_LATTICE;
  CELL_Y=CELL_X;
  CELL_Z=CELL_X;
  dx=CELL_X/NUM_LATTICE;
  dy=CELL_Y/NUM_LATTICE;
  dz=CELL_Z/NUM_LATTICE;
  tsum=0.0;
  for (ix=0;ix<NUM_LATTICE;ix++) {
    for (iy=0;iy<NUM_LATTICE;iy++) {
      for (iz=0;iz<NUM_LATTICE;iz++) {
        posx[i] = dx*ix;
        posy[i] = dy*iy;
        posz[i] = dz*iz;
        momx[i] = (rand( )/(double)RAND_MAX -0.5);
        momy[i] = (rand( )/(double)RAND_MAX -0.5);
        momz[i] = (rand( )/(double)RAND_MAX -0.5);
        tsum += (momx[i]*momx[i]+momy[i]*momy[i]+momz[i]*momz[i]);
        i++;
      }
    }
  }
  tsum /= 2;
  scaling( );
}

//------------------------------------------------
// Calculate Forces: minimal image convention
//
void force( )
{
  int i, j;
  double dx,dy,dz,r2,ri06,ri12;
  double eng, frc;
  double cxh=CELL_X/2;
  double cyh=CELL_Y/2;
  double czh=CELL_Z/2;
  double fix,fiy,fiz;

  eng_pot=0.0;
  for (i=0; i<NUM_ATOM; i++) {
    frcx[i]=0.0; frcy[i]=0.0; frcz[i]=0.0;
  }
  for (i=0; i<NUM_ATOM; i++) {
    fix=fiy=fiz=0.0;
    #pragma omp parallel for private (dx,dy,dz,r2,ri06,ri12,eng,frc) reduction(+:eng_pot,fix,fiy,fiz)
    for (j=i+1; j<NUM_ATOM; j++) {
      dx=posx[i]-posx[j];
      if (dx<-cxh) dx+=CELL_X; else if (dx>cxh) dx-=CELL_X;
      dy=posy[i]-posy[j];
      if (dy<-cyh) dy+=CELL_Y; else if (dy>cyh) dy-=CELL_Y;
      dz=posz[i]-posz[j];
      if (dz<-czh) dz+=CELL_Z; else if (dz>czh) dz-=CELL_Z;
      r2=dx*dx+dy*dy+dz*dz;
      if (r2<RCUT*RCUT) {
        ri06=1.0/(r2*r2*r2);
        ri12=ri06*ri06;
        eng=4*(ri12-ri06);
        frc=4*(12*ri12-6*ri06)/r2;
        eng_pot += eng;
        fix     += frc*dx;
        fiy     += frc*dy;
        fiz     += frc*dz;
        frcx[j] -= frc*dx;
        frcy[j] -= frc*dy;
        frcz[j] -= frc*dz;
      }
    }
    frcx[i]+=fix;
    frcy[i]+=fiy;
    frcz[i]+=fiz;
  }
}

//------------------------------------------------
// Update Particle Position & Momentum: Periodic Boundaries
//
void move( )
{
  int i;

  eng_kin=0.0;
  #pragma omp parallel for reduction(+:eng_kin)
  for (i=0; i<NUM_ATOM; i++) {
    momx[i] += frcx[i]*DEL_T;
    momy[i] += frcy[i]*DEL_T;
    momz[i] += frcz[i]*DEL_T;
    eng_kin += (momx[i]*momx[i]+momy[i]*momy[i]+momz[i]*momz[i]);
    posx[i] += momx[i]*DEL_T;
    if (posx[i]<0.0) posx[i]+=CELL_X; else if (posx[i]>CELL_X) posx[i]-=CELL_X;
    posy[i] += momy[i]*DEL_T;
    if (posy[i]<0.0) posy[i]+=CELL_Y; else if (posy[i]>CELL_Y) posy[i]-=CELL_Y;
    posz[i] += momz[i]*DEL_T;
    if (posz[i]<0.0) posz[i]+=CELL_Z; else if (posz[i]>CELL_Z) posz[i]-=CELL_Z;
  }
  eng_kin /= 2;
  ntcount++;
  tsum += (2.0/3.0)*eng_kin;
}

//------------------------------------------------
// Output Statistical Data & Configuration
//
void statistics(int step)
{
  int i;
  char fname[100];

  tsum /= ntcount;
  scaling( );
  printf("%8d %10.5lf %10.5lf %10.5lf\n", step,
         eng_kin/NUM_ATOM, eng_pot/NUM_ATOM, (eng_kin+eng_pot)/NUM_ATOM);
  sprintf(fname,"lj%8.8d.dat",step);
  fsave=fopen(fname,"w");
  fprintf(fsave,"%f %f %f",CELL_X,CELL_Y,CELL_Z);
  for (i=0;i<NUM_ATOM;i++) {
    fprintf(fsave,"%8.3lf %8.3lf %8.3lf ",posx[i],posy[i],posz[i]);
    fprintf(fsave,"%8.3lf %8.3lf %8.3lf\n",momx[i],momy[i],momz[i]);
  }
  fclose(fsave);
  fprintf(fgnuplot,"plot \"%s\" using 1:2 t \"%5.5d\" w po\n",fname,step);
  if (step==0) fprintf(fgnuplot,"pause -1\n");
}

//------------------------------------------------
// Velocity Scaling
//
void scaling( )
{
  int i;
  double factor;

  factor=sqrt(t_target*NUM_ATOM/tsum);
  for (i=0;i<NUM_ATOM;i++) {
    momx[i] *= factor;
    momy[i] *= factor;
    momz[i] *= factor;
  }
  tsum=0.0;
  ntcount=0;
}

List: A sample code of parallel MD, lj-v05-para.cpp.

Today's Homework No. 4 (May 12)

This homework is optional, because you may not have a proper environment for using OpenMP on your PC. However, I strongly recommend experiencing parallel calculations if you are planning to do large-scale numerical simulations in the future. Send your report to [email protected] by May 31.

• Construct an OpenMP environment on your PC and confirm that your PC has a multi-core CPU by running the test program in Fig. 5-36. You can find a document containing the link to the download site of the Visual Studio 2013 Express edition on my homepage, http://www.mitsuhiromatsumoto.mech.kyoto-u.ac.jp/

• Download a sample code (lj-v05-para.cpp or lj-v05-para2.cpp) of the parallel MD simulation and execute it for various N with the same density. If you want, you can check the multi-core behavior using the performance monitor ["% of processor time" for the single-node and dual-node runs]. Of course, you have to stop other heavy jobs when you run your parallel code on your PC.

• Compare the calculation time between the single mode and the multi-core mode.

Prof. Nishikawa will give the class next week, on May 19.